class: center, middle, inverse, title-slide

.title[
# Improving Data Storytelling: Labels and Annotations
]
.author[
### Nicholas Sim
]
.date[
### 08 August 2025
]

---
class: center, middle, inverse

# Introduction

---

### Topics

* Label aesthetic and repelling labels with `geom_text_repel()`
* Annotation with areas and text with `annotate()`

---

### Required Libraries

``` r
library(tidyverse)
library(socviz)
library(ggrepel)
theme_set(theme_bw()) # Set the default plot theme to black and white
```

---

### Introduction

In this discussion, we explore the use of labels and annotations to improve the clarity and context of data visualizations.

Labels are aesthetics, i.e. text values mapped from a data frame variable, that help identify points in a scatter plot or values represented by bars in a bar chart.

Annotations, on the other hand, act as explanatory elements within a data visualization to provide descriptions of specific features in a plot. Unlike labels, annotations are not derived from the data frame and therefore are not aesthetics. Instead, they are specified as text to explain certain features in a plot or as shaded areas that highlight particular relationships or events of significance.

The focus of this discussion is on using these elements to improve data storytelling. We will consider how labels and annotations can be used and how we may clean up the plot to tell the story more effectively.

---
class: center, middle, inverse

# The Label Aesthetic

---

### Adding Labels

Using the `label` aesthetic, labels can be added to visualizations such as scatter plots and bar charts, for example with `geom_label()`. However, labels may overlap when the points are crowded, which makes them hard to read. This is often the case with the `geom_label()` function that comes standard with the `ggplot2` package.
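To see the overlap problem concretely, here is a minimal sketch using the built-in `mtcars` data. This dataset and the helper column `car_name` are assumptions made for illustration; they are not part of the example used in these slides.

``` r
library(ggplot2)

# Illustrative data only: mtcars ships with R. car_name is a hypothetical
# helper column that holds the row names so they can be mapped to `label`.
cars_df <- mtcars
cars_df$car_name <- rownames(mtcars)

# Map car_name to the label aesthetic; geom_label() draws one boxed
# label per point, with no attempt to avoid collisions.
p_overlap <- ggplot(cars_df, aes(x = wt, y = mpg, label = car_name)) +
  geom_point() +
  geom_label()
```

Printing `p_overlap` draws the plot; the labels in the dense middle region sit on top of one another.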
To avoid overcrowding, we can use the `ggrepel` package, which positions labels automatically. Its `geom_text_repel()` function labels the data points while repelling the labels away from each other and from the points.

---

### Example: How Were The US Presidential Elections Won?

As an example, let's use the `elections_historic` dataset from the `socviz` library. We will plot `ec_pct` against `popular_pct` and label the scatter points using `winner_label` as the label aesthetic. Let's explore the dataset here.

.panelset[
.panel[.panel-name[R Code]

``` r
glimpse(elections_historic)
```

```
## Rows: 49
## Columns: 19
## $ election       <int> 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,…
## $ year           <int> 1824, 1828, 1832, 1836, 1840, 1844, 1848, 1852, 1856, 1…
## $ winner         <chr> "John Quincy Adams", "Andrew Jackson", "Andrew Jackson"…
## $ win_party      <chr> "D.-R.", "Dem.", "Dem.", "Dem.", "Whig", "Dem.", "Whig"…
## $ ec_pct         <dbl> 0.3218, 0.6820, 0.7657, 0.5782, 0.7959, 0.6182, 0.5621,…
## $ popular_pct    <dbl> 0.3092, 0.5593, 0.5474, 0.5079, 0.5287, 0.4954, 0.4728,…
## $ popular_margin <dbl> -0.1044, 0.1225, 0.1781, 0.1420, 0.0605, 0.0145, 0.0479…
## $ votes          <int> 113142, 642806, 702735, 763291, 1275583, 1339570, 13602…
## $ margin         <int> -38221, 140839, 228628, 213384, 145938, 39413, 137882, …
## $ runner_up      <chr> "Andrew Jackson", "John Quincy Adams", "Henry Clay", "W…
## $ ru_part        <chr> "D.-R.", "N. R.", "N. R.", "Whig", "Dem.", "Whig", "Dem…
## $ turnout_pct    <dbl> 0.269, 0.573, 0.570, 0.565, 0.803, 0.792, 0.728, 0.695,…
## $ winner_lname   <chr> "Adams", "Jackson", "Jackson", "Buren", "Harrison", "Po…
## $ winner_label   <chr> "Adams 1824", "Jackson 1828", "Jackson 1832", "Buren 18…
## $ ru_lname       <chr> "Jackson", "Adams", "Clay", "Harrison", "Buren", "Clay"…
## $ ru_label       <chr> "Jackson 1824", "Adams 1828", "Clay 1832", "Harrison 18…
## $ two_term       <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ ec_votes       <dbl> 84, 178, 219, 170, 234, 170, 163, 254, 174, 180, 212, 2…
## $ ec_denom       <dbl> 261, 261, 286, 294, 294, 275, 290, 296, 296, 303, 233, …
```
]
.panel[.panel-name[Data]
]
]

---

### Basic Plot

Let's declare the titles and labels in advance.

``` r
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
```

---

### Basic Plot

A scatter plot of `ec_pct` against `popular_pct` is shown here.

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct))

p + geom_point() +
  labs(title = p_title, subtitle = p_subtitle, caption = p_caption,
       x = x_label, y = y_label)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.2-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Adding Labels

Without labels on the scatter points, it is impossible to tell who won the election that each point represents. To add them, we pass the name labels as a `label` aesthetic and draw them with `geom_text_repel()` (from the `ggrepel` package). In the `elections_historic` dataset, the name labels are contained in the variable `winner_label`.

---

### Adding Labels

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))

p + geom_point() +
  geom_text_repel() +
  labs(title = p_title, subtitle = p_subtitle, caption = p_caption,
       x = x_label, y = y_label)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.2a-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Cleaning Up the Plot: Displaying Reference Lines

Let's add 50% reference lines on both the x- and y-axes to see whether a point has crossed the 50% mark.

To draw a *horizontal* line, we use `geom_hline()` and specify the value of `yintercept` (i.e. where the horizontal line will intercept the y-axis).

To draw a *vertical* line, we use `geom_vline()` and specify the value of `xintercept` (i.e. where the vertical line will intercept the x-axis).

---

### Cleaning Up the Plot: Displaying Reference Lines

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))

# linewidth replaces size for line geoms in ggplot2 3.4.0 and later
p + geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  geom_vline(xintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  labs(title = p_title, subtitle = p_subtitle, caption = p_caption,
       x = x_label, y = y_label)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.3-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Cleaning Up the Plot: Adjusting the Axis Labels

The axes represent the percentage of votes won. As these percentages are currently shown as proportions between 0 and 1, we should display them as actual percentages (say, 65% rather than 0.65) by passing `labels = scales::percent` into `scale_x_continuous()` and `scale_y_continuous()`.

---

### Cleaning Up the Plot: Adjusting the Axis Labels

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))

p + geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  geom_vline(xintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
       caption = p_caption)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.4-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Cleaning Up the Plot: Customizing the Axis Labels

Rather than using `labels = scales::percent`, we may customize our labels using `labels = scales::unit_format(scale = 100, unit = "%", sep = "")`.
Here, we multiply (or scale up) the original variable by 100 and use the percentage sign (%) to convey the unit of measurement. We also use `sep = ""` to ensure that there is no space between the value label and the % symbol (otherwise, we will have something like 65 % instead of 65%).

**Note**: `unit_format()` has been retired. We may use other alternatives, like `label_number()` (which takes `scale` and `suffix` arguments) or `label_comma()`. Please see the `scales` documentation for details.

---

### Cleaning Up the Plot: Customizing the Axis Labels

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label))

# sep = "" is set on both axes so that neither shows "65 %"
p + geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  geom_vline(xintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  scale_x_continuous(labels = scales::unit_format(scale = 100, unit = "%", sep = "")) +
  scale_y_continuous(labels = scales::unit_format(scale = 100, unit = "%", sep = "")) +
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
       caption = p_caption)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.4a-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Cleaning Up the Plot: Adjusting the Color Scales

We may adjust the scales of other aesthetics such as color or fill. For example, to adjust the color scale, we use a function of the form `scale_color_<kind>()`. See Chapter 5 of KH for more details.

To illustrate, let's display the party colors using the `color` aesthetic. In `elections_historic`, the `win_party` variable shows the party affiliations of the election winners.
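As a rough illustration of the `scale_color_<kind>()` family, the sketch below uses a small made-up data frame (`toy_df` and its columns are hypothetical and are not taken from `elections_historic`). A discrete variable mapped to `color` can be restyled with, say, `scale_color_brewer()`, while a continuous variable would take a continuous scale such as `scale_color_viridis_c()`.

``` r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy_df <- data.frame(
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(2, 4, 3, 6, 5, 7),
  group = c("a", "a", "b", "b", "c", "c"),  # discrete variable
  score = c(10, 20, 30, 40, 50, 60)         # continuous variable
)

# Discrete color scale: one palette color per level of `group`
p_discrete <- ggplot(toy_df, aes(x = x, y = y, color = group)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1")

# Continuous color scale: the color varies smoothly with `score`
p_continuous <- ggplot(toy_df, aes(x = x, y = y, color = score)) +
  geom_point(size = 3) +
  scale_color_viridis_c()
```

Which `<kind>` applies depends on whether the mapped variable is discrete or continuous; `win_party` is discrete.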
---

### Cleaning Up the Plot: Adjusting the Color Scales

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label, color = win_party))

p + geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  geom_vline(xintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
       caption = p_caption)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.5-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Cleaning Up the Plot: Adjusting the Color Scales

Let's differentiate the scatter points using the party colors. From the legend, notice that the Democratic party is represented by the 2nd element and the Republican party is represented by the 3rd element. Therefore, we set the color codes for the Democratic and Republican parties as the 2nd and 3rd elements in the vector `party_colors`.

``` r
party_colors <- c("#000000", "#2E74C0", "#CB454A", "#000000")
```

Let's pass `party_colors` to `scale_color_manual()` to color the scatter points by party.
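Because an unnamed color vector is matched to the legend by position, one alternative is to name its elements so that `scale_color_manual()` matches colors to levels by name. The sketch below uses a hypothetical toy data frame; the party labels imitate those in `win_party`, and `"Rep."` in particular is an assumption that should be checked against `unique(elections_historic$win_party)`.

``` r
library(ggplot2)

# Hypothetical toy data; the party labels are assumed, not read from socviz
toy_parties <- data.frame(
  x     = c(0.45, 0.52, 0.58, 0.61),
  y     = c(0.40, 0.55, 0.70, 0.80),
  party = c("D.-R.", "Dem.", "Rep.", "Whig")
)

# Named vector: each color is matched to a level by name, so the order
# in which the elements are written does not matter.
party_colors_named <- c("Rep."  = "#CB454A",
                        "Dem."  = "#2E74C0",
                        "D.-R." = "#000000",
                        "Whig"  = "#000000")

p_named <- ggplot(toy_parties, aes(x = x, y = y, color = party)) +
  geom_point(size = 3) +
  scale_color_manual(values = party_colors_named)
```

With named values, adding or reordering levels does not silently shift the colors onto the wrong party.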
---

### Cleaning Up the Plot: Adjusting the Color Scales

.panelset[
.panel[.panel-name[R Code]

``` r
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
                                    label = winner_label, color = win_party))

p + geom_point() +
  geom_text_repel() +
  geom_hline(yintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  geom_vline(xintercept = 0.5, linewidth = 1.1, color = "gray80", alpha = 0.4) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = party_colors) + ## Adjust the color scales
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
       caption = p_caption)
```
]
.panel[.panel-name[Data]

<img src="LabelsAndAnnotations_files/figure-html/elections.7-out-1.png" style="display: block; margin: auto;" />
]
]

---
class: center, middle, inverse

# Annotation

---

### Adding Text As Annotation

The `annotate()` function can be used to highlight important information on the plot itself. For example, we may add a text label next to a data point by passing `geom = "text"` into `annotate()` and specifying the `x`, `y`, and `label` settings. The `x` and `y` settings indicate the position of the text label and the `label` setting provides the annotated text itself. We may also set `size` and `color` to style the text, and `hjust` and `vjust` to adjust its position. To introduce a line break, we use `\n`.

.pull-left[

``` r
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))

# \n inserts a line break (i.e. a new line) in the label.
# hjust = 0 left-aligns the text at x; larger values shift it to the left.
p + geom_point() +
  annotate(geom = "text", x = 157, y = 33, hjust = 0,
           label = "A surprisingly high \nrecovery rate.")
```
]

.pull-right[

<img src="LabelsAndAnnotations_files/figure-html/annotate.1-out-1.png" style="display: block; margin: auto;" />
]

---

### Adding a Shaded Area as Annotation

To shade an area in the plot, we pass `geom = "rect"` into `annotate()`.
As with `geom_rect()`, we must specify the values for `xmin`, `xmax`, `ymin`, and `ymax` to indicate the size and position of the (rectangular) shaded area to be displayed. To shade areas, it is better to use `annotate()` than `geom_rect()`: the latter typically draws the rectangle once for every row of the data, so the overlaid copies can look darker and more opaque than intended.

.pull-left[

``` r
p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))

p + geom_point() +
  annotate(geom = "text", x = 157, y = 33,
           label = "A surprisingly high \nrecovery rate.", hjust = 0) +
  annotate(geom = "rect", xmin = 125, xmax = 155, ymin = 30, ymax = 35,
           fill = "red", alpha = 0.2)
```
]

.pull-right[

<img src="LabelsAndAnnotations_files/figure-html/annotate.2-out-1.png" style="display: block; margin: auto;" />
]

---

### Exercise: Movie Ratings

For this exercise, use the dataset `MovieRatings` and save it into `df`. These ratings are taken from IMDB. Use RMarkdown to generate a short report.

1. Summarize what you observe about the dataset in 60 words. Rename your variables to something more manageable using `colnames()`.

2. Plot the audience ratings against the ratings from Rotten Tomatoes. Fit an OLS regression line without standard errors. Use budget as a size aesthetic and genre as a color aesthetic. Use `theme_minimal()`.

<img src="LabelsAndAnnotations_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

3. Using `geom_text_repel()`, identify the films with a budget of more than 200 million (i.e. label the data points with the name of the film). Use `nudge_y = 10, nudge_x = 1, segment.size = 1` as settings in `geom_text_repel()`. Hint: For the data argument, pass `filter(df, Budget.Million > 200)` into `geom_text_repel()`. Use `Film` as a label aesthetic. Discuss the results in about 60 words.

<img src="LabelsAndAnnotations_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
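The hint in item 3 relies on giving a single layer its own `data` argument. Here is a minimal sketch of that pattern on the built-in `mtcars` data rather than the exercise dataset; the column `car_name`, the weight cutoff and the nudge values are all assumptions chosen for illustration.

``` r
library(ggplot2)
library(ggrepel)

# Illustrative data only: car_name is a hypothetical helper column
cars_df <- mtcars
cars_df$car_name <- rownames(mtcars)

# Keep only the rows to be labelled; subset() is used here in place of
# dplyr::filter() so that the sketch needs no further packages.
heavy_cars <- subset(cars_df, wt > 5)

# All points are drawn from cars_df, but the label layer receives only
# heavy_cars, so only those points are labelled.
p_subset <- ggplot(cars_df, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_text_repel(data = heavy_cars, aes(label = car_name),
                  nudge_y = 2, segment.size = 0.5)
```

Layers without a `data` argument inherit the data frame given to `ggplot()`, which is why the points still show every car.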