class: center, middle, inverse, title-slide

.title[
# Getting Started with Data Visualisation
]
.author[
### Nicholas Sim
]
.date[
### 26 March 2024
]

---
class: center, middle, inverse

# Introduction

---

### Topics

* Basic syntax of `ggplot`
* Global and local declarations
* A first look at using `ggplot`

---

### Grammar of Graphics

Ref: Chapter 2 KH

We will use the `ggplot2` package to produce our data visualisations. `gg` stands for the **G**rammar of **G**raphics.

The three mandatory Grammar of Graphics layers to produce a visualisation are:

- **Data**, i.e. the data frame
- **Aesthetic**, i.e. a mapping of a variable from the data frame to a graph element
- **Geometry**, i.e. the type of visualisation

---

### Basic Syntax

The basic `ggplot2` syntax requires two functions used together: the `ggplot()` function and a *geometry* function.

For example, to construct a scatter plot, the most basic syntax is

`ggplot() + geom_point()`

Depending on the plot we want to produce, we will use various geometry functions. The ones we will explore are

- `geom_point` (scatter plot)
- `geom_bar` (bar chart)
- `geom_col` (column chart)
- `geom_line` (line plot)
- `geom_path` (path line plot)
- `geom_polygon` (spatial polygons, i.e. maps)

---

## Aesthetics

In the Grammar of Graphics, an aesthetic is a **mapping** of a variable to an element in a graph. The type of graph is called a **geometry** (or *geom* for convenience).

For example, a scatter plot has two necessary graph elements, an x-variable and a y-variable. To construct a scatter plot, we must map two variables from the data frame to the elements `x` and `y` of the geometry.

To do so, we pass the variables from the data frame into the aesthetic function, `aes()`.
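
As a minimal sketch of how the three layers (data, aesthetic, geometry) fit together, the code below builds a scatter plot from a small made-up data frame. The data frame `toy` and its columns `gdp` and `life` are hypothetical and only stand in for real data.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(gdp  = c(1000, 5000, 20000),
                  life = c(52, 66, 79))

# Data layer: toy
# Aesthetic layer: aes() maps gdp to x and life to y
# Geometry layer: geom_point() draws a scatter plot
p_toy <- ggplot(data = toy, mapping = aes(x = gdp, y = life)) +
  geom_point()

p_toy
```

Leaving out `geom_point()` would still produce a plot object, but only an empty panel with axes would be drawn, since no geometry has been declared.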
---

## Aesthetics

For example, to plot life expectancy on the y-axis against GDP per capita on the x-axis, we pass them into the `aes()` function as follows:

```r
aes(x = gdp.Percap, y = lifeExp)
```

This aesthetic function, in turn, is passed as an input into the `ggplot()` function (if declared globally) or a geometry function (if declared locally).

---

### Global vs. Local

Recall that the basic syntax to construct a scatter plot is `ggplot() + geom_point()`.

If `aes(x = gdp.Percap, y = lifeExp)` is passed into `ggplot()`, the aesthetic is declared **globally** and will be applied to every geometry unless we override it.

If `aes(x = gdp.Percap, y = lifeExp)` is passed into `geom_point()`, it is declared **locally** and will only be applied by `geom_point()`, and not outside of it.

---

### The Data Frame

The data frame can be declared globally or locally using the same principles.

Suppose `aes(x = gdp.Percap, y = lifeExp)` is declared globally, i.e.

`ggplot(mapping = aes(x = gdp.Percap, y = lifeExp)) + geom_point()`

The argument name `mapping` is required here because the first argument of `ggplot()` is `data`; an unnamed `aes()` in first position would be treated as the data frame and cause an error.

To declare our data frame `df` globally, we pass it into the `ggplot()` function:

`ggplot(data = df, aes(x = gdp.Percap, y = lifeExp)) + geom_point()`

To declare our data frame locally, we pass it into the `geom_point()` function:

`ggplot(mapping = aes(x = gdp.Percap, y = lifeExp)) + geom_point(data = df)`

---

### Why Global vs. Local?

We declare a data frame or an aesthetic globally if we want the rest of the code chunk to inherit the data frame or aesthetic.

If we prefer to use a different data frame or aesthetic for different parts of the plot (say, we overlay two scatter plots on the same graph), we may declare them locally.

---

### Argument Names

Consider declaring the names of the arguments for better readability. For instance, in

`ggplot(data = df, mapping = aes(x = gdp.Percap, y = lifeExp)) + geom_point()`

it can be seen that `df` is a data frame and `aes(x = gdp.Percap, y = lifeExp)` is a mapping.
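
The overlay case mentioned earlier (two scatter plots on the same graph) can be sketched as follows. The data frames `df_a` and `df_b` and their columns are hypothetical; the aesthetic is declared globally so that both geoms inherit it, while each geom receives its own data frame locally.

```r
library(ggplot2)

# Hypothetical toy data frames, for illustration only
df_a <- data.frame(gdp = c(1, 2, 3),   life = c(50, 60, 70))
df_b <- data.frame(gdp = c(1.5, 2.5),  life = c(55, 72))

# Global aesthetic: x and y are inherited by both geoms
# Local data: each geom_point() draws a different data frame
p_overlay <- ggplot(mapping = aes(x = gdp, y = life)) +
  geom_point(data = df_a, color = "grey40") +
  geom_point(data = df_b, color = "red")

p_overlay
```

Had the data frame been declared globally instead, both calls to `geom_point()` would have drawn the same points.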
---

### Further Remarks

Other aesthetics can be employed to increase the richness of our data visualisations. For instance, in the context of a scatter plot, the following aesthetics may be added to the plot:

1. `color`, the outline of the scatter points
2. `fill`, the fill of scatter points that have hollow centers
3. `shape`, the shape of the scatter points
4. `size`, the size of the scatter points

---
class: center, middle, inverse

# A First Look at `ggplot`

---

### Tasks

Install and load the `gapminder` package:

```r
#install.packages("gapminder")
library(gapminder)
```

1. Take a quick look at the data frame using the `head()` and `str()` functions. What do you observe?

2. To present a scatter plot of life expectancy against GDP per capita, which variables from `gapminder` should be mapped to `\(x\)` and `\(y\)`?

3. The *geom* function for a scatter plot is `geom_point()`. To declare the `gapminder` data frame globally, which function should we pass it into: `ggplot()` or `geom_point()`?

---

### Before We Plot

Install and load the `tidyverse`, `socviz` and `gapminder` packages. We will use the `gapminder` dataset and plot life expectancy against GDP per capita using `ggplot2`.

```r
#install.packages("tidyverse")
#install.packages("socviz")
library(tidyverse)
library(gapminder)
library(socviz)
```

---

### A Look at the Data

Let's first explore what `gapminder` looks like:
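
One way to take this first look, assuming the `gapminder` package is installed, is sketched below. The printed output is omitted here.

```r
library(gapminder)

# First six rows of the data frame
head(gapminder)

# Structure: variable names, types and a preview of the values
str(gapminder)

# Number of rows and columns
dim(gapminder)
```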
---

### Declaring the Data Frame and Aesthetics Globally

Let's declare the `gapminder` data frame and the `x` and `y` aesthetics **globally**. To do so, we pass the `gapminder` data frame and the `aes()` function into the `ggplot()` function:

.pull-left[
```r
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```
]

.pull-right[  ]

---

### Declaring the Data Frame Globally and Aesthetics Locally

We may also declare the data and aesthetic layers **locally**. For example, the `x` and `y` aesthetics are declared locally here.

.pull-left[
```r
ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))
```
]

.pull-right[  ]

---

### Saving the Global Data Frame and Global Aesthetic

The earlier portions of the `ggplot` code can be saved and reused later. For example, let's declare the data frame and the `x` and `y` aesthetics globally and save this part as an object called `p`.

```r
# Save the data frame and aesthetics globally as p
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
```

.pull-left[
After saving `p`, we add `geom_point()` to construct the scatter plot.

```r
p + geom_point() # shorter code that reuses p
```
]

.pull-right[  ]

---

### Overlaying a Nonlinear Trend Line (LOESS)

We may overlay a nonlinear trend line, based on LOESS, by **adding** `geom_smooth()`. Note that each layer we add is drawn on top of the layers added before it.

.pull-left[
```r
p + geom_point() +
  geom_smooth() # We have saved p in the previous slide
```
]

.pull-right[  ]

---

### Linear Trend Line

To change the nonlinear trend line into a linear trend, we use the option `method = "lm"` in `geom_smooth()`.

.pull-left[
```r
p + geom_point() +
  geom_smooth(method = "lm")
```
]

.pull-right[  ]

---

### Suppressing the Confidence Bands

To simplify the plot appearance, we may remove the confidence bands around the linear trend line by using the option `se = FALSE` in `geom_smooth()` ("se" stands for standard errors).
.pull-left[
```r
p + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
]

.pull-right[  ]

---

### Adjusting the Axis Scale

Since the values of the x-variable span a very wide range and are difficult to view on a linear scale, we transform the x-axis into a log scale by adding `scale_x_log10()` to the syntax.

.pull-left[
```r
p + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10()
```
]

.pull-right[  ]

---

### Changing the Units on an Axis

As the x-axis represents GDP per capita, we declare "dollar" as the unit of measurement for the axis by using the option `labels = scales::dollar` in `scale_x_log10()`.

.pull-left[
```r
p + geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)
```
]

.pull-right[  ]

---

### Visualising a Third Variable using a Color Aesthetic

By declaring the variable `continent` as a color aesthetic (i.e. `aes(color = continent)`) in `geom_point()`, we may use colors to differentiate the scatter points by continent. A color aesthetic maps the values of the variable `continent` to colors on the scatter points.

.pull-left[
```r
p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar)
```
]

.pull-right[  ]

---

### Adding Titles and Labels

We add the plot title, subtitle, axis titles, etc. by using the `labs()` function:

.pull-left[
```r
p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       color = "Continent",
       caption = "Source: Gapminder")
```
]

.pull-right[  ]

---

### Declaring the Theme

The theme is the overall plot appearance, which can be controlled using the `theme()` function. There are pre-determined theme settings we may use.
Here, we use `theme_minimal()` for a simple theme (other themes include `theme_bw()`, which uses a white background with a dark panel border).

.pull-left[
```r
p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       color = "Continent",
       caption = "Source: Gapminder") +
  theme_minimal()
```
]

.pull-right[  ]

---

### Adjusting the Title and Axis Font Size

Using the `theme()` function, we may control the appearance of our figure, such as the font size of the title, axis titles, etc.

.pull-left[
```r
p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       color = "Continent",
       caption = "Source: Gapminder") +
  theme_minimal() +
  theme(title = element_text(size = 16),
        axis.text.x = element_text(size = 12), # x-axis tick labels
        axis.text.y = element_text(size = 12), # y-axis tick labels
        axis.title = element_text(size = 14))  # axis title text
```
]

.pull-right[  ]

---

### Setting the Plot Features

Rather than using a color aesthetic (a mapping of a variable to the color element on the plot), we may set the color directly to a fixed value. For example, let's differentiate Asia's data from the rest of the world by setting the color of Asia's scatter points to red.
.pull-left[
```r
p + geom_point(color = "grey") +
  geom_point(data = subset(gapminder, continent == "Asia"),
             color = "red", size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy",
       title = "Economic Growth and Life Expectancy of Asia",
       subtitle = "Data points are country-years\nAsia is highlighted in red",
       caption = "Source: Gapminder") +
  theme_minimal() +
  theme(title = element_text(size = 16),
        axis.text.x = element_text(size = 12), # x-axis tick labels
        axis.text.y = element_text(size = 12), # y-axis tick labels
        axis.title = element_text(size = 14))  # axis title text
```
]

.pull-right[  ]

---

### Adding Animation

`ggplot2` has many interesting extensions, such as plot animation, which reveals changes in a variable over time.

Below, we animate the figure using the `gganimate` package. See https://gganimate.com/ and https://www.youtube.com/watch?v=SnCi0s0e4Io

To do so, install the following packages (if you have not already) and load them:

```r
library(pkgbuild)
library(gifski)
library(gganimate)
library(png)
```

---

### Adding Animation

**Note**: If you are using a "group" aesthetic (see Seminar 4), replace `transition_time(year)` below with `transition_reveal(year)`.
.panelset[
.panel[.panel-name[R Code]

```r
p.anim <- p + geom_point(mapping = aes(color = continent)) + # geom_smooth is removed
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy",
       subtitle = "Data points are country-years",
       color = "Continent",
       caption = "Source: Gapminder") +
  theme_minimal() +
  theme(title = element_text(size = 16),
        axis.text.x = element_text(size = 12),   # x-axis tick labels
        axis.text.y = element_text(size = 12),   # y-axis tick labels
        axis.title = element_text(size = 14)) +  # axis title text
  transition_time(year) +
  labs(title = "GDP Per Capita and Life Expectancy, Year: {frame_time}") # put the title here

# Rendering (done once; the result is reused below)
anim <- animate(p.anim, renderer = gifski_renderer())

# Saving
anim_save("example.gif", animation = anim)

anim
```
]

.panel[.panel-name[Plot]

<img src="GettingStartedwithDataVisualisation_files/figure-html/anim.1-out-1.gif" style="display: block; margin: auto;" />
]
]

---

### Exercise (Try it Yourself)

Using the `airquality` data (type `datasets::airquality` into your console), construct a scatter plot with `Ozone` on the y-axis and `Temp` on the x-axis.

Display the LOESS trend line on the plot. Use `theme_minimal()` for the plot appearance and include the title "Ozone and Temperature at JFK Airport, 1973".

You should replicate the figure below (see the RMarkdown file for the solution).

<img src="GettingStartedwithDataVisualisation_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

### Hands-On Activity

Activities 1 to 4 in `Seminar3_demo_part1.r`