class: center, middle, inverse, title-slide

.title[
# Plotting Distributions for Comparisons: Boxplots, Dotplots and Tree Maps
]
.author[
### Nicholas Sim
]
.date[
### 10 April 2024
]

---
class: center, middle, inverse

# Introduction

---
### Topics

* Box plot with `geom_boxplot()`
* Other distribution plots - dot and jitter plots with `geom_point()` and `geom_jitter()`
* Treemaps

---
### Required Libraries

```r
library(tidyverse)
library(socviz)
library(ggrepel)
library(treemapify)
theme_set(theme_bw()) # Set the default plot theme to black and white
```

---
### Introduction

In this discussion, we consider various approaches for comparing distributions of a numerical variable across observational units.

For instance, we may be interested not only in the distribution of inflation rates (over a certain time period) for one country, but also in how these distributions differ across a set of countries.

To compare distributions, we may construct boxplots, dotplots, jitter plots and ridge plots. Such comparison plots are straightforward to construct: specify the countries or units you want to compare across as the x-aesthetic, and the variable whose distribution is to be shown as the y-aesthetic.

Finally, we may plot the distribution of a variable that has a hierarchical structure, such as the ranking of the importance of the different attributes in a variable. To show such information, we may consider plotting a treemap.

---
class: center, middle, inverse

# Boxplot

---
### Boxplot

A boxplot displays 5 summary statistics: the 1st quartile (Q1), the median, the 3rd quartile (Q3), and two boundary points, the "minimum" and the "maximum".

The "minimum" is defined as Q1 - 1.5(Q3 - Q1) and the "maximum" as Q3 + 1.5(Q3 - Q1), where Q3 - Q1 is known as the interquartile range (IQR). In `geom_boxplot()`, the whiskers extend to the most extreme observations that lie within these boundaries.

Data points outside the minimum and maximum boundaries may be viewed as outliers. These are represented by dots.
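
The boundary formulas above can be checked by hand on a small vector. The following is a minimal base R sketch; the vector `x` holds made-up values chosen for illustration (it is not taken from `organdata`), and `quantile()` is used with its default method, so the quartiles may differ slightly from other quartile conventions.

```r
# Hypothetical values for illustration only (not from organdata)
x <- c(12.1, 12.4, 12.5, 10.3, 10.2, 10.6, 10.3, 25.0)

# Quartiles and interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Boundary points as defined on this slide
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Observations outside the boundaries would be drawn as dots
outliers <- x[x < lower | x > upper]

# Most extreme observations inside the boundaries (where whiskers would end)
whisker_ends <- range(x[x >= lower & x <= upper])

print(c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower, upper = upper))
print(outliers)
print(whisker_ends)
```

Here only the value 25.0 falls outside the boundaries, so it is the only point that would be flagged as an outlier.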
---
### Boxplot

For example, consider the dataset `organdata`, which contains longitudinal data on countries' organ donation rates across years. The main variable of interest is `donors`, which is the organ donation rate per million population.

.panelset[
.panel[.panel-name[R Code]

```r
dplyr::glimpse(organdata)
```

```
## Rows: 238
## Columns: 21
## $ country          <chr> "Australia", "Australia", "Australia", "Australia", "…
## $ year             <date> NA, 1991-01-01, 1992-01-01, 1993-01-01, 1994-01-01, …
## $ donors           <dbl> NA, 12.09, 12.35, 12.51, 10.25, 10.18, 10.59, 10.26, …
## $ pop              <int> 17065, 17284, 17495, 17667, 17855, 18072, 18311, 1851…
## $ pop_dens         <dbl> 0.2204433, 0.2232723, 0.2259980, 0.2282198, 0.2306484…
## $ gdp              <int> 16774, 17171, 17914, 18883, 19849, 21079, 21923, 2296…
## $ gdp_lag          <int> 16591, 16774, 17171, 17914, 18883, 19849, 21079, 2192…
## $ health           <dbl> 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948, 2077,…
## $ health_lag       <dbl> 1224, 1300, 1379, 1455, 1540, 1626, 1737, 1846, 1948,…
## $ pubhealth        <dbl> 4.8, 5.4, 5.4, 5.4, 5.4, 5.5, 5.6, 5.7, 5.9, 6.1, 6.2…
## $ roads            <dbl> 136.59537, 122.25179, 112.83224, 110.54508, 107.98096…
## $ cerebvas         <int> 682, 647, 630, 611, 631, 592, 576, 525, 516, 493, 474…
## $ assault          <int> 21, 19, 17, 18, 17, 16, 17, 17, 16, 15, 16, 15, 14, N…
## $ external         <int> 444, 425, 406, 376, 387, 371, 395, 385, 410, 409, 393…
## $ txp_pop          <dbl> 0.9375916, 0.9257116, 0.9145470, 0.9056433, 0.8961075…
## $ world            <chr> "Liberal", "Liberal", "Liberal", "Liberal", "Liberal"…
## $ opt              <chr> "In", "In", "In", "In", "In", "In", "In", "In", "In",…
## $ consent_law      <chr> "Informed", "Informed", "Informed", "Informed", "Info…
## $ consent_practice <chr> "Informed", "Informed", "Informed", "Informed", "Info…
## $ consistent       <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
## $ ccode            <chr> "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz", "Oz",…
```
]
.panel[.panel-name[Data]
]
]

---
### Preliminary Plot

Before we construct a boxplot for each country, let's examine how `donors` in `organdata` evolves over time for each country in the sample.

The figure below shows that the organ donation rates of several countries have been somewhat stable over time (e.g. Germany). In other countries, the spread of donation rates across years can be quite large (e.g. Spain and Italy).

.pull-left[

```r
p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~country)
```
]
.pull-right[
<img src="PlottingDistributionsBoxplots_files/figure-html/box.1-out-1.png" style="display: block; margin: auto;" />
]

---
### Constructing a Boxplot

Let's construct a boxplot for each country. To do so, we map `country` as the `x`-variable aesthetic and `donors` as the `y`-variable aesthetic.

`country` is a categorical variable where each country is a category, and `donors` contains the organ donation rates for each country across years.

.pull-left[

```r
p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot()
```
]
.pull-right[
<img src="PlottingDistributionsBoxplots_files/figure-html/box.2-out-1.png" style="display: block; margin: auto;" />
]

---
### Flipping the Coordinates

As it is visually easier to read from left to right (than from bottom to top), let's flip the coordinates below.

.pull-left[

```r
p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()
```
]
.pull-right[
<img src="PlottingDistributionsBoxplots_files/figure-html/box.3-out-1.png" style="display: block; margin: auto;" />
]

---
### Reordering a Variable

To clean up the boxplots, we may order the countries' boxplots based on the mean of the organ donation rates across years. The country with the smallest mean is ordered at the bottom of the plot and the country with the largest mean is ordered at the top.
This can be done by using the `reorder()` function from the `stats` package (see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/reorder.factor.html).

The `reorder()` function takes the categorical variable to be reordered (i.e. `country`) as its first argument, followed by the criterion by which the categories are to be reordered.

---
### Reordering a Variable

Let's reorder the countries (i.e. categories) based on the mean donation rate (i.e. criterion).

The command is `reorder(country, donors, mean, na.rm = TRUE)`, where the 1st argument takes in the categories to be reordered, the 2nd argument takes in the **variable** by which the categories are to be reordered, and the 3rd takes in the **function** to be applied to the second variable for the reordering. The default function is `mean()`. The extra argument `na.rm = TRUE` is passed on to that function so that missing donation rates are ignored.

---
### Reordering a Variable

.panelset[
.panel[.panel-name[R Code]

```r
## We can leave "mean" out of the reorder() because the mean function is the default.
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, mean, na.rm = TRUE),
                          y = donors))
p + geom_boxplot() +
  labs(x = NULL) +
  coord_flip()
```
]
.panel[.panel-name[Plot]
<img src="PlottingDistributionsBoxplots_files/figure-html/box.4-out-1.png" style="display: block; margin: auto;" />
]
]

---
### Differentiating by Government Types

Let's differentiate the countries by filling in the boxplots with different colors based on the types of government. This information is contained in the variable `world`.
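
---
### Reordering a Variable: A Toy Check

To see what `reorder()` does to the category order, it can help to try it on a tiny data frame. The sketch below uses base R only; the data frame `toy` and its values are hypothetical and are not drawn from `organdata`.

```r
# Hypothetical data: three categories with two observations each
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  donors  = c(10, NA, 30, 32, 20, 22)
)

# Reorder categories by the mean of donors, ignoring missing values
by_mean <- reorder(toy$country, toy$donors, mean, na.rm = TRUE)

# Leaving out the function gives the same result because mean is the default
by_default <- reorder(toy$country, toy$donors, na.rm = TRUE)

print(levels(by_mean))     # categories from smallest to largest mean
print(levels(by_default))
```

The group means are 10 for A, 31 for B and 21 for C, so the levels come out in the order A, C, B. A different summary, such as `median`, can be supplied as the third argument in the same way.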
---
### Differentiating by Government Types

.panelset[
.panel[.panel-name[R Code]

```r
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, fill = world))
p + geom_boxplot() +
  labs(x = NULL) +
  coord_flip() +
  theme(legend.position = "top")
```
]
.panel[.panel-name[Plot]
<img src="PlottingDistributionsBoxplots_files/figure-html/box.5-out-1.png" style="display: block; margin: auto;" />
]
]

---
class: center, middle, inverse

# Dotplots

---
### Dotplots

Besides the boxplot, another useful presentation of a distribution is the **dotplot**. The dotplot shows how the values of the observations are distributed for each category.

The dotplot can be constructed using `geom_point()`, where the `x`-variable aesthetic is a factor/categorical variable and the `y`-variable aesthetic is a numerical variable whose distribution is to be displayed.

---
### Dotplots

.panelset[
.panel[.panel-name[R Code]

```r
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world)) ## use color instead of fill
p + geom_point() +
  labs(x = NULL) +
  coord_flip() +
  theme(legend.position = "top")
```
]
.panel[.panel-name[Plot]
<img src="PlottingDistributionsBoxplots_files/figure-html/dot.1-out-1.png" style="display: block; margin: auto;" />
]
]

---
class: center, middle, inverse

# Jitter Plots

---
### Jitter Plots

The scatter points in a dotplot may heavily overlap or bunch up. This may make it difficult to see how dense the values of a variable are for each category in a categorical variable.

To separate the points, we add a small amount of random noise to these data points. This procedure is known as **jittering**.

To jitter a point plot, we use `geom_jitter()`. The size of the jitter can be controlled using the setting `position_jitter(width = 0.13)`.
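
The idea behind jittering can be sketched in base R by adding uniform noise to the category positions. This is only an illustration of the concept, not ggplot2's internal code; the positions `x_pos` are hypothetical.

```r
set.seed(1)  # make the random noise reproducible

# Hypothetical category positions: 3 categories, 5 points each
x_pos <- rep(1:3, each = 5)
width <- 0.13

# Shift each point sideways by a random amount between -width and +width
x_jit <- x_pos + runif(length(x_pos), min = -width, max = width)

print(round(x_jit, 3))
print(max(abs(x_jit - x_pos)))  # never exceeds width
```

Note that in ggplot2, `position_jitter()` also has a `height` argument. According to its documentation, a `height` left unspecified defaults to 40% of the resolution of the data, so writing `position_jitter(width = 0.13, height = 0)` is one way to keep the plotted values themselves unchanged.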
---
### Jitter Plots

.panelset[
.panel[.panel-name[R Code]

```r
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world)) ## use color instead of fill
p + geom_jitter(position = position_jitter(width=0.13)) +
  labs(x = NULL) +
  coord_flip() +
  theme(legend.position = "top")
```
]
.panel[.panel-name[Plot]
<img src="PlottingDistributionsBoxplots_files/figure-html/dot.2-out-1.png" style="display: block; margin: auto;" />
]
]

---
class: center, middle, inverse

# Ridge Plots

---
### Ridge Plots

A ridge plot is a plot that compares the density of a numerical variable across groups. It can be constructed using `geom_density_ridges()` from the `ggridges` package (install it first).

In the example below, the y-aesthetic represents the groups (i.e. countries) and the x-aesthetic represents the numerical variable whose density is to be shown. Because the groups are already on the vertical axis, `coord_flip()` is not required: the densities are drawn horizontally.

.panelset[
.panel[.panel-name[R Code]

```r
library(ggridges)
p <- ggplot(data = organdata,
            mapping = aes(y = reorder(country, donors, na.rm=TRUE),
                          x = donors, fill = world)) ## ridges are filled areas, so use fill
p + geom_density_ridges(alpha = 0.4) +
  labs(y = NULL) +
  theme(legend.position = "top")
```
]
.panel[.panel-name[Plot]
<img src="PlottingDistributionsBoxplots_files/figure-html/dense.1-out-1.png" style="display: block; margin: auto;" />
]
]

---
class: center, middle, inverse

# Treemaps

---
### Treemaps

Nesting is an approach of putting something smaller within something larger. Hierarchical data have nested structures, as items higher up in the hierarchy contain the items below them.

To visualize such data, we may construct a **treemap**, which displays hierarchical structures by using nested rectangles.

To illustrate a simple treemap, let's plot GDP per capita using data fetched from the World Development Indicators API.
For more details on fetching the data, refer to the handout on RMarkdown in Seminar 4.

---
### Treemaps

Let's view the data on GDP per capita for 2022.

```
## # A tibble: 10 × 3
##    country.value     date  GDPperCap
##    <chr>             <chr>     <dbl>
##  1 Brunei Darussalam 2022     37152.
##  2 Cambodia          2022      1760.
##  3 Indonesia         2022      4788.
##  4 Lao PDR           2022      2054.
##  5 Malaysia          2022     11993.
##  6 Myanmar           2022      1149.
##  7 Philippines       2022      3499.
##  8 Singapore         2022     82808.
##  9 Thailand          2022      6910.
## 10 Viet Nam          2022      4164.
```

---
### Treemaps

The following steps are implemented to construct a treemap to visualize GDP per capita.

1. Install and load the package `treemapify`.
2. The required aesthetic is `area`, which is mapped to `GDPperCap`. Declare it in `ggplot()` and add `geom_treemap()`.
3. To differentiate the colors of the rectangles, use `GDPperCap` as the `fill` aesthetic.
4. To label the rectangles, include `country.value` as the `label` aesthetic.
5. To display the labels, add `geom_treemap_text(family = "AppleGothic", colour = "white", place = "centre", grow = TRUE)`.
6. Clean up the plot by changing the `fill` scale to `scale_fill_viridis_c()` and changing the titles.

---
### Exercise: Constructing a Treemap

.pull-left[
<img src="PlottingDistributionsBoxplots_files/figure-html/gdppercap.3-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="PlottingDistributionsBoxplots_files/figure-html/gdppercap.4-1.png" style="display: block; margin: auto;" />
]
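
---
### Exercise: A Possible Sketch

The six steps listed earlier could be put together roughly as follows. This is one possible sketch, not necessarily the code that produced the figures: the data frame name `gdp` is hypothetical, the values are typed in from the table shown earlier (rounded as printed) rather than fetched from the API, the `family` argument is left out because the availability of "AppleGothic" depends on the system, and the title and legend text are assumptions.

```r
library(ggplot2)
library(treemapify)

# Values copied from the 2022 table shown earlier (rounded as printed)
gdp <- data.frame(
  country.value = c("Brunei Darussalam", "Cambodia", "Indonesia", "Lao PDR",
                    "Malaysia", "Myanmar", "Philippines", "Singapore",
                    "Thailand", "Viet Nam"),
  GDPperCap = c(37152, 1760, 4788, 2054, 11993, 1149, 3499, 82808, 6910, 4164)
)

# Steps 2 to 4: area, fill and label aesthetics
p <- ggplot(data = gdp,
            mapping = aes(area = GDPperCap,
                          fill = GDPperCap,
                          label = country.value)) +
  geom_treemap() +
  # Step 5: draw the labels inside the rectangles
  geom_treemap_text(colour = "white", place = "centre", grow = TRUE) +
  # Step 6: change the fill scale and the titles (title text is an assumption)
  scale_fill_viridis_c() +
  labs(title = "GDP per Capita, 2022", fill = "GDP per capita")

print(p)
```

Each rectangle's area is proportional to the country's GDP per capita, so Singapore and Brunei Darussalam should occupy most of the plot.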