class: center, middle, inverse, title-slide

.title[
# Column Plots and Data Preprocessing
]
.author[
### Nicholas Sim
]
.date[
### 15 February 2024
]

---
class: center, middle, inverse

# Introduction

---

### Topics

* Use `geom_bar()` to present table information
* Use `geom_col()` to present table information
* Change the plot orientation
* Use `group_by()` to preprocess data for column plots
* Streetfighting data visualizations
* Use Python in R

---

### Required Libraries

```r
library(tidyverse)
library(socviz)
library(ggrepel)
library(gapminder)
theme_set(theme_bw()) # Set the theme to black and white
```

---

### Table Data

Ref: Chapters 4, 5 KH

When `geom_bar()` is applied, the counts (i.e. frequencies) or proportions (i.e. relative frequencies) of each class in a categorical variable are first computed. Then, these summarized values are displayed as bars, with their heights along the y-axis, in a bar chart.

What if the counts or proportions of the categorical variable we want to plot have already been computed?

.pull-left[
For instance, in the Titanic dataset here, we do not have records on individuals, only a table showing the counts and percentages of the sample belonging to various categories. How, then, do we plot the table values?
]

.pull-right[
```r
titanic
```

```
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6
```
]

---

### Table Data

To plot table data, we cannot simply use `geom_bar()` as it will, by default, attempt to compute the y-values before they are displayed. To override this "pre-processing" step, we declare the setting `stat = "identity"`, which tells the function to take the y-values as given.

Alternatively, we may use `geom_col()`, which expects a y-aesthetic by default.

---

### Visualizing Tables

.pull-left[
For the Titanic dataset, let's construct a bar chart for the categorical variable `fate` using `percent` as the height of the bars.
This dataset is an example of table data. The fate of individuals, i.e. perished or survived, has already been summarized in tabular form.
]

.pull-right[
]

---

### The "identity" Setting

To plot `percent` on the y-axis against the categories in `fate` on the x-axis, we need to suppress the default "count" operation in `geom_bar()` and force it to take `percent` as the y-values for the bar chart. To do so, we use the setting `stat = "identity"`.

.pull-left[
```r
p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent))
p + geom_bar(stat = "identity")
```
]

.pull-right[
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/bars.1-out-1.png" style="display: block; margin: auto;" />
]

---

### Cleaning Up

We use the variable `sex` as a fill aesthetic to differentiate males and females in the bars. To plot the variable `fate` for males and females side-by-side, we use `"dodge"` as the position setting.

.pull-left[
```r
p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge", stat = "identity") +
  theme(legend.position = "top") # place the legend at the top
```
]

.pull-right[
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/bars.2-out-1.png" style="display: block; margin: auto;" />
]

---

### Plotting Tables with `geom_col()`

Instead of using `geom_bar()` with a `stat = "identity"` setting, we may plot table data by using `geom_col()`, which takes in a y-aesthetic for the height of the bars.

.pull-left[
```r
p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_col(position = "dodge") +
  theme(legend.position = "top")
```
]

.pull-right[
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/bars.3-out-1.png" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# Data Pre-Processing

---

### Data Pre-Processing

A `geom` function may call an underlying `stat` function for data processing before plotting. For example, `geom_bar()`, by default, will first count the frequency of each class in a categorical variable.
To plot proportions, we pass in `y = ..prop..` as an aesthetic (written `y = after_stat(prop)` from ggplot2 3.4.0 onward), and the underlying `stat` function computes the class proportions first.

Thus, each `geom` function has a corresponding `stat` function that can also be used to pre-process the data and produce the same plot.

---

### Data Pre-Processing

This makes data pre-processing an important step before data visualization. Typical data pre-processing steps include

* Feature Engineering
* Data Reshaping

Feature engineering is the creation of new variables. Data reshaping may involve the aggregation or pivoting of data.

For some visualizations we need both, because standard ggplot functions alone cannot feasibly produce the plot we want.

---

### Example: Defining Groups for Distribution Plots

For instance, suppose we wish to plot the distribution (proportion) of religion within each region. Recall that we cannot simply force each region to act as a group by using `bigregion` as a group aesthetic (see Seminar 4).

To plot the distribution of religion within each region, we need to compute the distribution ourselves first.

Let's use the `gss_sm` dataset, calculate the distribution (i.e. proportion) of people across different religions for each region, and then plot the distribution.
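
To see why the grouping matters, here is a minimal sketch on a made-up data frame (the `toy` data and its column names are hypothetical and are not taken from `gss_sm`). It contrasts proportions taken over the whole sample with proportions taken within each region; only the latter add up to 1 per region.

```r
library(dplyr)

# Hypothetical toy data: one row per respondent
toy <- tibble(
  region   = c("East", "East", "East", "West", "West"),
  religion = c("A", "A", "B", "A", "B")
)

# Count the observations in each region-religion cell
counts <- toy %>% count(region, religion, name = "N")

# Proportion of the whole sample (ungrouped): adds up to 1 overall
overall <- counts %>% mutate(freq = N / sum(N))

# Proportion within each region (grouped): adds up to 1 per region
within_region <- counts %>%
  group_by(region) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup()

within_region
```

In `overall`, `sum(N)` is the full sample size, whereas in `within_region` it is the size of each region, which is the quantity needed for a within-region distribution.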
---

### Constructing Grouped Data

To do so, we first group the observations by `bigregion`, then by `religion`. This creates a `bigregion`-`religion` group-level structure.

Then, we calculate the number of observations for each `religion` category (e.g. Protestant, Catholic, etc.) within each `bigregion` category (e.g. Northeast, Midwest, etc.).

This is achieved by the command `summarize(N = n())`, where `n()` returns the count, stored in `N`, for each `bigregion`-`religion` level.

---

### Constructing Grouped Data

To compute the total number of observations for each `bigregion` level, we sum the counts for each `bigregion`-`religion` level, which are captured by `N`, i.e. `sum(N)`.

Note that `sum(N)` does not return the sample size of the dataset; instead, it sums the counts over the second group level (i.e. religion) for each region.

To compute the relative frequency (or proportion) of people across the different religions for each region, we generate `N/sum(N)`. The code to preprocess the data into relative frequencies of religions per region is shown below.

```r
rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>% # Group by bigregion first, then religion
  summarize(N = n()) %>% # Count the observations in each bigregion-religion group; this is N
  mutate(freq = N / sum(N), # Proportion of each religion within its bigregion
         pct = round((freq * 100), 0)) # The same proportion as a rounded percentage
# Note that sum(N) totals the number of observations across religion for each bigregion category.
```

---

### Constructing Grouped Data

Let's take a look at the final structure of the grouped dataset.
```r
head(rel_by_region, 12)
```

```
## # A tibble: 12 × 5
## # Groups:   bigregion [2]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## 11 Midwest   Other         33 0.0475      5
## 12 Midwest   <NA>           5 0.00719     1
```

---

### Plotting Grouped Data

We plot the distribution of religion for each `bigregion` using `geom_col()`. Recall that `geom_col()` requires a `y` aesthetic (i.e. the y-values for the bar chart). For each region, we use `religion` as a fill aesthetic and the `position = "dodge2"` setting in `geom_col()`.

.pull-left[
```r
p <- ggplot(rel_by_region,
            aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = "Region", y = "Percent", fill = "Religion") +
  labs(title = "The Break Down (Proportion) of Religion by Region",
       subtitle = 'The proportion within each region adds up to 1') +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```
]

.pull-right[
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/group.1-out-1.png" style="display: block; margin: auto;" />
]

---

### Flipping the Coordinates

Because category labels and bar lengths are often easier to read horizontally, consider changing the orientation of the bar chart by switching the x and y axes, which can be done using `coord_flip()`.

Below, we plot the distribution of religion by region using `bigregion` to facet.
.pull-left[
```r
p <- ggplot(rel_by_region,
            aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = NULL, y = "Percent", fill = "Religion") +
  guides(fill = "none") + # Drop the fill legend
  coord_flip() +
  facet_grid(~ bigregion)
```
]

.pull-right[
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/group.2-out-1.png" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# Streetfighting Visualizations

---

### Streetfighting Visualizations

Some visualizations cannot be constructed using ready-made functions.

To construct the desired data visualizations, we may need to organize our data into a certain shape, or construct the visualizations a layer at a time.

Here, we consider two examples:

+ How to display bars side-by-side for two or more groups
+ How to construct an extended lollipop chart

---

### Example: Streetfighting Bar Placements

Survey data often contain repeated questions for different contexts. For instance, in the Pew Research Center Spring Survey, participants were asked: "Please tell me if you have a very favorable, somewhat favorable, somewhat unfavorable, or very unfavorable opinion of…".

In the data shown below, there is one column for the US and another for China. How do we display the bars for each response category for the US and China side-by-side?
.pull-left[
```r
# Need the following packages
library(readr)
library(httr)

# Read the csv file from my github page
mylink <- "https://raw.githubusercontent.com/nicholas-sim/ANL501-Data-Visualisation-and-Storytelling/Data/Pew2022.csv"
df.pew <- read.csv(mylink)

# Inspect the data
df.pew.select <- df.pew %>%
  select("age", "sex", "fav_us", "fav_china")
df.pew.select %>% glimpse()
```
]

.pull-right[
```
## Rows: 1,001
## Columns: 4
## $ age       <chr> "18", "82", "63", "72", "40", "56", "73", "72", "68", "82", …
## $ sex       <chr> "Male", "Female", "Male", "Female", "Female", "Male", "Male"…
## $ fav_us    <chr> "Somewhat unfavorable", "Somewhat unfavorable", "Somewhat un…
## $ fav_china <chr> "Very unfavorable", "Somewhat unfavorable", "Somewhat favora…
```
]

---

### Example: Streetfighting Bar Placements

One approach is to pivot the values of `fav_us` and `fav_china` into a single variable, called `response`, and save the column names as another variable, called `country`.

Then, we use `country` as a `fill` aesthetic and place the countries' responses side-by-side using `position = 'dodge'`. Let's pivot the data to construct `country`.

.panelset[
.panel[.panel-name[R Code]
```r
# Pivot the data
df.fav <- df.pew.select %>%
  pivot_longer(cols = contains("fav"),
               names_to = "country",
               values_to = "response") %>%
  mutate(age = as.numeric(age))
```
]
.panel[.panel-name[Data]
]
]

---

### Example: Streetfighting Bar Placements

Let's clean up the `response` variable by:

* declaring it as a factor and specifying the order of the responses
* removing non-responses

To save space later, we save the plot titles first.

```r
# Declare the variable, response, as a factor
df.fav$response <- factor(df.fav$response,
                          levels = c("Very favorable", "Somewhat favorable",
                                     "Somewhat unfavorable", "Very unfavorable"))

# Remove non-responses
df.fav.filtered <- df.fav %>%
  filter(!is.na(response),
         !response %in% c("Refused (DO NOT READ)", "Don’t know (DO NOT READ)"))

# Declare the titles
p.title = "Opinions of Singaporeans on China and the US"
p.subtitle = "Please tell me if you have a very favorable, somewhat favorable, \nsomewhat unfavorable, or very unfavorable opinion of…"
p.caption = "Source: Pew Research Center Global Attitudes Spring 2022 survey"
```

---

### Example: Streetfighting Bar Placements

Although the opinions on China and the US are contained in two different columns, we may plot them side-by-side by pivoting them into a single column and then using a `fill` aesthetic to differentiate China and the US. The plot is also faceted by sex.
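
The pivoting idea can be sketched on a tiny made-up data frame (the `toy` data and its `id` column are hypothetical; only the `fav_us` and `fav_china` column names mirror the survey data). Each respondent contributes one row per country after pivoting, which is what lets `country` serve as a `fill` aesthetic.

```r
library(dplyr)
library(tidyr)

# Hypothetical responses from three respondents
toy <- tibble(
  id        = 1:3,
  fav_us    = c("Very favorable", "Somewhat favorable", "Very unfavorable"),
  fav_china = c("Somewhat favorable", "Very unfavorable", "Very favorable")
)

# Stack the two opinion columns: names go to `country`, values to `response`
toy_long <- toy %>%
  pivot_longer(cols = starts_with("fav"),
               names_to = "country",
               values_to = "response")

toy_long
```

The three wide rows become six long rows, with `country` taking the values `fav_us` and `fav_china`.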
.panelset[
.panel[.panel-name[R Code]
```r
# Plot the Opinions of Singaporeans on China and the US
ggplot(df.fav.filtered, aes(x = response, fill = country)) +
  geom_bar(position = "dodge",
           mapping = aes(y = ..prop.., group = country)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("#de2910", "#3b3b6d"),
                    breaks = c("fav_china", "fav_us"),
                    labels = c("fav_china" = "China", "fav_us" = "US")) +
  labs(y = NULL, x = NULL, fill = NULL,
       title = p.title, subtitle = p.subtitle, caption = p.caption) +
  facet_wrap(~sex) +
  coord_flip() +
  theme_minimal()
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/pew.4-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Example: Streetfighting Bar Placements

Here is another look where we use `geom_bar(position = position_fill())` and `geom_text()` (details suppressed). The percentage displayed is out of the total sample size.
.panelset[
.panel[.panel-name[R Code]
```r
# Plot the Opinions of Singaporeans on China and the US
ggplot(df.fav.filtered, aes(x = response, fill = country)) +
  geom_bar(position = position_fill()) +
  geom_text(aes(label = paste0(round(..count../(sum(..count..)*0.5)*100, 1), "%")),
            stat = "count", colour = "white",
            position = position_fill(vjust = 0.5)) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  scale_fill_manual(values = c("#de2910", "#3b3b6d"),
                    breaks = c("fav_china", "fav_us"),
                    labels = c("fav_china" = "China", "fav_us" = "US")) +
  labs(y = NULL, x = NULL, fill = NULL,
       title = p.title, subtitle = p.subtitle, caption = p.caption) +
  facet_wrap(~sex) +
  coord_flip() +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/pew.5-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Example: A Basic Lollipop Chart

A lollipop chart presents the same information as a bar chart, but shows the level of each group as a point connected to the baseline by a line segment. To construct the lollipop chart, we combine `geom_point()` and `geom_segment()`.

Here is an example of a basic lollipop chart of the life expectancy of several countries in Southeast Asia using the gapminder dataset.
.panelset[
.panel[.panel-name[R Code]
```r
# Keep 1972 and 2002 for the selected Southeast Asian countries
df.sea <- gapminder %>%
  filter(year %in% c(1972, 2002),
         country %in% c("Brunei", "Indonesia", "Malaysia",
                        "Singapore", "Philippines", "Thailand"))

ggplot() +
  coord_flip() +
  geom_point(data = subset(df.sea, year == 1972),
             aes(x = country, y = lifeExp),
             color = "gray50", size = 3, alpha = 0.5) +
  geom_segment(data = subset(df.sea, year == 1972),
               aes(x = country, xend = country, y = 0, yend = lifeExp),
               color = "gray50") +
  labs(x = "", y = "", title = "Life Expectancy at Birth, 1972") +
  theme_minimal() +
  theme(axis.text = element_text(size = 12),
        title = element_text(size = 16))
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/lollipop.1-out-1.png" style="display: block; margin: auto;" />
]
.panel[.panel-name[Data]
]
]

---

### Example: Streetfighting an Extended Lollipop Chart

Let's extend the lollipop chart to 2002. Notice that we need to plot the grey and red segments separately. To do so, we need to construct a data frame, **df.sea.wide**, that contains life expectancy data in a wide format.

.panelset[
.panel[.panel-name[R Code]
```r
## Pivot-wide to represent life expectancy in 1972 and 2002 in separate columns
df.sea.wide <- df.sea %>%
  select(country, year, lifeExp) %>%
  pivot_wider(names_from = year,
              names_glue = "{.value}{year}",
              values_from = lifeExp)

# Plot - I plot the red line segment first. Why?
ggplot(df.sea.wide) +
  coord_flip() +
  geom_point(mapping = aes(x = country, y = lifeExp2002),
             color = "red", size = 3, alpha = 0.6) +
  geom_segment(aes(x = country, xend = country,
                   y = lifeExp1972, yend = lifeExp2002),
               color = "red") +
  geom_point(mapping = aes(x = country, y = lifeExp1972),
             color = "gray60", size = 3) +
  geom_segment(aes(x = country, xend = country,
                   y = 0, yend = lifeExp1972),
               color = "gray60") +
  annotate("text", label = c("1972"), x = 4.2, y = 64, color = "gray50", size = 5) +
  annotate("text", label = c("2002"), x = 4.2, y = 84, color = "red", size = 5) +
  labs(x = "", y = "", title = "Life Expectancy at Birth, from 1972 to 2002") +
  theme_minimal() +
  theme(axis.text = element_text(size = 12),
        title = element_text(size = 16))
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/lollipop.2-out-1.png" style="display: block; margin: auto;" />
]
.panel[.panel-name[Data]
]
]

---

### Example: Streetfighting an Extended Lollipop Chart with Bubbles

Let's make bubbles for the lollipop endpoints based on GDP per capita. To do so, we construct a wide data frame, **df.sea.wide2**, that contains GDP per capita for 1972 and 2002 in wide format and join it with the data frame **df.sea.wide**. Then, we use GDP per capita in each column (i.e. 1972 and 2002) to set the size of the bubbles for 1972 and 2002.

.panelset[
.panel[.panel-name[R Code]
```r
## Pivot-wide to represent GDP Per Capita in 1972 and 2002 in separate columns and join to df.sea.wide
df.sea.wide2 <- df.sea %>%
  select(country, year, gdpPercap) %>%
  pivot_wider(names_from = year,
              names_glue = "{.value}{year}",
              values_from = gdpPercap) %>%
  left_join(df.sea.wide, by = "country") # country is the only shared column

# Plot - I plot the red line segment first. Why?
ggplot(df.sea.wide2) +
  coord_flip() +
  geom_point(mapping = aes(x = country, y = lifeExp2002, size = gdpPercap2002),
             color = "red", alpha = 0.6) +
  geom_segment(aes(x = country, xend = country,
                   y = lifeExp1972, yend = lifeExp2002),
               color = "red") +
  geom_point(mapping = aes(x = country, y = lifeExp1972, size = gdpPercap1972),
             color = "gray60") +
  geom_segment(aes(x = country, xend = country,
                   y = 0, yend = lifeExp1972),
               color = "gray60") +
  annotate("text", label = c("1972"), x = 4.2, y = 64, color = "gray50", size = 5) +
  annotate("text", label = c("2002"), x = 4.2, y = 84, color = "red", size = 5) +
  labs(x = "", y = "",
       title = "Life Expectancy at Birth, from 1972 to 2002",
       size = "GDP Per Capita") +
  theme_minimal() +
  theme(axis.text = element_text(size = 12),
        title = element_text(size = 16))
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/lollipop.3-out-1.png" style="display: block; margin: auto;" />
]
.panel[.panel-name[Data]
]
]

---
class: center, middle, inverse

# Using Python in R

---

### Writing Python Code in R Markdown

We may write Python code blocks in R Markdown using the `reticulate` package.

Rather than using `{r}` as the header in the code block, we use `{python}` to declare a Python code block.

When running Python code, the prompt in the console changes from `>` to `>>>`. To quit Python and return to R, simply type `quit` in the console.

---

### Which Country Has the Largest Number of Islands?

Let's use Python to extract a table from a Wikipedia page containing the number of islands each country has. We save the data as a `pandas` data frame called `island`.

```r
library(reticulate) # Use Python in R
# reticulate::py_install("lxml") # install this
# reticulate::py_install("numpy")
# reticulate::py_install("pandas")

# Set up the system environment (May not be required)
Sys.setenv(RETICULATE_PYTHON = "C:/Users/nicho/anaconda3/python.exe")
```

```python
# Import numpy and pandas
import numpy as np
import pandas as pd

# Define the URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_by_number_of_islands"

# Read the Wikipedia Page
tables = pd.read_html(url)

# Extract the Data
island = pd.DataFrame(tables[1])
```

---

### Which Country Has the Largest Number of Islands?

From R, objects created in Python are accessed through the `py` object. To reference the `island` data frame created in Python, we use `py$island` and save it as an R data frame.

All the variables are stored as strings, so the numeric variables must be converted to numeric. We plot the top 10 countries by the number of islands.
.panelset[
.panel[.panel-name[Data Cleaning]
```r
# Referencing the `island` data frame from Python
island <- py$island
colnames(island) <- c("Country", "Islands", "Inhabited.Islands",
                      "Notes", "List.link", "Source")

# Converting string to numeric variables (Please check the variable names)
island$Islands <- as.numeric(island$Islands)
island$Inhabited.Islands <- as.numeric(island$Inhabited.Islands)
```
]
.panel[.panel-name[R Code]
```r
island %>%
  # Refer to the column directly inside mutate(), not via island$
  mutate(Scandinavia = ifelse(Country %in% c("Sweden", "Norway", "Finland"),
                              "y", "n")) %>%
  top_n(10, Islands) %>%
  ggplot(aes(x = reorder(Country, Islands), y = Islands,
             label = Islands, fill = Scandinavia)) +
  geom_col(color = "gray50") +
  coord_flip() +
  scale_fill_manual(values = c("gray90", "lightblue")) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = NULL) +
  geom_label() +
  theme_minimal() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        legend.position = "none")
```
]
.panel[.panel-name[Plot]
<img src="ColumnPlotAndDataPreprocessing_files/figure-html/python.2-out-1.png" style="display: block; margin: auto;" />
]
]
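
On the conversion step above: `as.numeric()` returns `NA` for strings that contain thousands separators, such as `"267,570"`. Whether the scraped table contains such values depends on the page at the time it is read, so the following is only a sketch on made-up strings (`raw_counts` is hypothetical), which strips the commas before converting.

```r
# Hypothetical island counts as they might appear after scraping
raw_counts <- c("267,570", "1,200", "50")

# Remove the thousands separators, then convert to numeric
clean_counts <- as.numeric(gsub(",", "", raw_counts, fixed = TRUE))

clean_counts
```

`readr::parse_number()` is a tidyverse alternative that also drops non-numeric characters around the number.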