class: center, middle, inverse, title-slide .title[ # Data Management with Tidyverse ] .author[ ### Nicholas Sim ] .date[ ### 25 January 2024 ] --- class: center, middle, inverse # Introduction --- ### Topics 1. `select` for variable selection 1. `filter` for data subsetting 1. `mutate` for generating new variables and adding them immediately to the data frame 1. `summarise` for data aggregation 1. `group_by` for data grouping 1. `%>%`, pipe operator 1. Scope verbs and the `across` function 1. `left_join` for joining data --- class: center, middle, inverse # Data Wrangling with *Tidyverse* --- ### What is Tidyverse? `tidyverse` is a collection of packages for data wrangling, visualisation, among others. We will be using the `dplyr` package from `tidyverse` for data wrangling here. --- ### Functions for Data Wrangling There are four main functions in `dplyr`: 1. `select`: for selecting variables 2. `filter`: for filtering/subsetting data 3. `mutate`: for generating new variables and adding them immediately to the data frame 4. `summarise`: an aggregator for summarizing information about the variables/data frame. --- ### Functions for Data Wrangling Along with these functions, we will learn how to use the `group_by` function to organize data into groups for group-level operations. We will also learn to use the pipe operator `%>%` (from the `magrittr` package), which breaks down codes involving applying functions on functions to make them more readable. Next, we will explore using the `across` helper function to define the scope of the operations (i.e. which variables to apply to) in `summarise` and `mutate`. Finally, we will explore using `left_join` function to merge datasets. There are cheat sheets for various packages in `tidyverse` such as the `dplyr` package (see https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf). --- ### Load `tidyverse` Let's load the `tidyverse` library first ```r #install.packages("tidyverse") library(tidyverse) ``` which automatically loads the `dplyr` package. --- ### Organizing a Data Frame with `tibble` A `tibble` is a data frame that is print friendly and easier to read. There are other convenient features and operations involving tibbles, such as adding rows and columns, naming of variables, etc. (see https://www.r-bloggers.com/2018/09/introduction-to-tibbles/). For example, let's first convert the `iris` dataset into a tibble using the `as_tibble` function ```r iris.tibble <- as_tibble(iris) str(iris.tibble) # notice that is now says tibble [150 x 5] ``` ``` ## tibble [150 × 5] (S3: tbl_df/tbl/data.frame) ## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... ``` --- class: center, middle, inverse # The Select Function --- ### Selecting Columns The `select` function is use for extracting variables (i.e. columns) from a data frame. To use `select`, we pass in the data frame into its first argument, and the conditions (i.e. what we want to select) into its second, third, fourth, etc. arguments. --- ### Using Negation for Exclusion Let's first compare `select` with the base R approach for variable selection using the `mtcars` dataset. Imagine we want to select all variables except displacement (i.e. Column 3). In base R, the command would be ```r head(mtcars[,-3], 3) ``` ``` ## mpg cyl hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 93 3.85 2.320 18.61 1 1 4 1 ``` (Note: we use the `head` function to report the first 6 rows. If not, hundreds of pages of output will be printed out on this pdf document.) --- ### Using Negation for Exclusion .pull-left[ However, we cannot apply negation on a variable's name to select everything but this variable:] .pull-right[ ```r head(mtcars[,-'disp']) ``` ``` ## Error in -"disp": invalid argument to unary operator ``` ] .pull-left[ By contrast, this can be achieved using the `select` function, which makes it less fussy. For instance,] .pull-right[ ```r head(select(mtcars, -3), 1) ``` ``` ## mpg cyl hp drat wt qsec vs am gear carb ## Mazda RX4 21 6 110 3.9 2.62 16.46 0 1 4 4 ``` ```r head(select(mtcars, -disp), 1) ``` ``` ## mpg cyl hp drat wt qsec vs am gear carb ## Mazda RX4 21 6 110 3.9 2.62 16.46 0 1 4 4 ``` ```r head(select(mtcars, -'disp'), 1) ``` ``` ## mpg cyl hp drat wt qsec vs am gear carb ## Mazda RX4 21 6 110 3.9 2.62 16.46 0 1 4 4 ``` ] --- ### Example: Selecting Multiple Columns .pull-left[ In base R, we may select multiple columns in sequence using the slicing notation `:`, or, not in sequence using the combine function `c()`. The `select` function is more flexible than the base R approach. For example, the following are equivalent:] .pull-right[ ```r head(select(mtcars, c(1, 2, 4)), 1) ``` ``` ## mpg cyl hp ## Mazda RX4 21 6 110 ``` ```r head(select(mtcars, c(1:2, 4)), 1) ``` ``` ## mpg cyl hp ## Mazda RX4 21 6 110 ``` ```r head(select(mtcars, 1:2, 4), 1) # c() function not needed ``` ``` ## mpg cyl hp ## Mazda RX4 21 6 110 ``` ] --- ### Example: Selecting Columns Containing Specific Expressions .pull-left[ Let's select the variables in `mtcars` with names containing the expressions 'cyl' or 'disp' by using the `contains` helper function (which comes with `dplyr`) to search for expressions containing 'cyl' (cylinder) or 'disp' in the variables' names.] .pull-right[ ```r head(select(mtcars,contains('cyl'), contains('disp')), 2) ``` ``` ## cyl disp ## Mazda RX4 6 160 ## Mazda RX4 Wag 6 160 ``` ```r # head(select(mtcars,contains('cy'), contains('di'))) # This works too ``` ] .pull-left[ We may also select variables with names starting with certain characters. For instance, let's select all variables with names starting with `'c'` ] .pull-right[ ```r head(select(mtcars, starts_with('c')),2) ``` ``` ## cyl carb ## Mazda RX4 6 4 ## Mazda RX4 Wag 6 4 ``` ] --- ### Exercise Using the `iris` dataset, do the following: 1. Construct summary statistics using `summary` function. 2. Use the `select` function to select `Petal.Length` and `Petal.Width`. 3. Use the `starts_with` helper function to select columns that starts with `p`. --- class: center, middle, inverse # The Filter Function --- ### Subsetting Data While the `select` function is used for selecting columns, the `filter` function is used for selecting rows. The `filter` function passes the data frame into its first argument. The subsetting conditions are then passed in as the 2nd, 3rd, ..., arguments. Thus, we may keep adding subsetting conditions to the `filter` function using an extra argument. --- ### Why Use `filter` over `subset` ? In base R, we use the `subset` function to extract rows from a dataset. Recall that the 2nd argument of `subset` is the filtering/subsetting condition (the 3rd argument of subset() specifies the columns to be selected). If there are multiple subsetting conditions, using the `subset` function becomes complicated as we must them together using the `&` conjunction. Furthermore, the `subset` function cannot apply subsetting conditions **sequentially**. --- ### Example: Filtering on Multiple Conditions As an example, let's import `Seminar4_data.csv` from our working directory. ```r WDI.data <- read_csv("WDI_data.csv") ``` Before we proceed, notice that the column names are very long: spaces in the original variable names have been filled in with a dot. Let's rename the variables into something simpler. To do so, let's check the column names of `WDI.data` first. ```r colnames(WDI.data) ``` ``` ## [1] "Country.Code" ## [2] "Country.Name" ## [3] "Year" ## [4] "GDP growth (annual %)" ## [5] "GDP per capita (constant 2010 US$)" ## [6] "GINI index (World Bank estimate)" ## [7] "Mortality rate, under-5 (per 1,000 live births)" ## [8] "Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)" ## [9] "Income Group" ## [10] "Region" ``` ```r # Let's overwrite the column names colnames(WDI.data)[1:9] <- c("Code", "Country", "Year", "GDP.Growth", "GDP.PerCap", "Gini", "Child.Mortality", "Poverty.Under.2", "Income.Group" ) ``` --- ### Example: Filtering on Multiple Conditions To demonstrate how the `filter` function is applied, suppose we want to extract the data for 'High income' countries corresponding to the year '2005'. Using the base R command, we would have written the following scripts, which look very long. ```r head(WDI.data[WDI.data$Income.Group=='High income' & WDI.data$Year == 2005,],2) ``` ``` ## # A tibble: 2 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 2005 1.21 26980. NA NA ## 2 AND Andorra 2005 7.40 48832. NA 5.3 ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` ```r #or head(subset(WDI.data, Income.Group =='High income' & WDI.data$Year == 2005),2) ``` ``` ## # A tibble: 2 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 2005 1.21 26980. NA NA ## 2 AND Andorra 2005 7.40 48832. NA 5.3 ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` Although there are only have two subsetting conditions here, the codes already look very long. --- ### Example: Filtering on Multiple Conditions Now, compare the command in base R with ```r head(filter(WDI.data, Income.Group == 'High income', Year == 2005),2) ``` ``` ## # A tibble: 2 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 2005 1.21 26980. NA NA ## 2 AND Andorra 2005 7.40 48832. NA 5.3 ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` The `filter` function applies the conditions `Income.Group == 'High income` and `Year == 2005` sequentially (i.e. the condition `Income.Group == 'High income` first, and `Year == 2005` next). (In this example, it doesn't matter which condition is applied first.) The `filter` function interprets the comma (`,`) between the second, third, etc. arguments as an "AND" conjunction. --- ### Example: Filtering on Multiple Conditions To use the "OR" conjunction, we need to separate the conditions with the vertical pipe `|` and pass in the combined statement as a single argument: ```r head(filter(WDI.data, Income.Group == 'High income' | Year==2005), 3) ``` ``` ## # A tibble: 3 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 1995 2.55 26705. NA NA ## 2 ABW Aruba 1996 1.19 26088. NA NA ## 3 ABW Aruba 1997 7.05 27191. NA NA ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` --- ### Filtering with the Membership Operator, `%in%` Like the `subset` function, we may also employ the membership, i.e. the `%in%` operator, in the `filter` function. .pull-left[ For example, to filter out the rows containing the values "Upper middle income" and "Lower middle income" under the `Income.Group` variable, we may use the `%in%` operator, so that R will extract all observations *containing* the attributes "Upper middle income" or "Lower middle income" **in** the variable `Income.Group`.] .pull-right[ ```r # Income.Group %in% c("Upper middle income","Lower middle income" ) ``` ] To verify this: ```r head(filter(WDI.data, Income.Group %in% c("Upper middle income","Lower middle income" ))) ``` ``` ## # A tibble: 6 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 AGO Angola 1995 15.0 1922. NA 223. ## 2 AGO Angola 1996 13.5 2114. NA 221. ## 3 AGO Angola 1997 7.27 2196. NA 219. ## 4 AGO Angola 1998 4.69 2226. NA 215. ## 5 AGO Angola 1999 2.18 2202. NA 211. ## 6 AGO Angola 2000 3.05 2196. 52 206. ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` --- ### Exercise Use the `iris` dataset. 1. Using `filter`, extract all observations with petal length greater than 5cm and sepal length greater than 6cm. 2. Using `filter` and the `%in%` operator, extract all observations corresponding the species "virginica" and "versicolor", and select `Petal.Length` and `Petal.Width`: --- class: center, middle, inverse # The Mutate Function --- ### Transforming Variables The `mutate` function is used for generating new variables. As with the `dplyr` functions already discussed, the first argument in `mutate` is the data frame and the second, third, fourth, etc. arguments are the variables you want to generate. --- ### Example: Creating Multiple New Variables .pull-left[ As an example, the variable `Child.Mortality` contains data on under 5 mortality per 1000 live births. To transform this variable into percentages, we divide it by 1000 and name this variable `Mortality.Rate`. Notice that a new variable, i.e. `Child.Mortality`, is added to the database of WDI.data2.] .pull-right[ ```r WDI.data2 <- mutate(WDI.data, Mortality.Rate = Child.Mortality/1000) str(WDI.data2) ``` ``` ## tibble [5,425 × 11] (S3: tbl_df/tbl/data.frame) ## $ Code : chr [1:5425] "ABW" "ABW" "ABW" "ABW" ... ## $ Country : chr [1:5425] "Aruba" "Aruba" "Aruba" "Aruba" ... ## $ Year : num [1:5425] 1995 1996 1997 1998 1999 ... ## $ GDP.Growth : num [1:5425] 2.55 1.19 7.05 1.99 1.24 ... ## $ GDP.PerCap : num [1:5425] 26705 26088 27191 27152 26954 ... ## $ Gini : num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Child.Mortality: num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Poverty.Under.2: num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Income.Group : chr [1:5425] "High income" "High income" "High income" "High income" ... ## $ Region : chr [1:5425] "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" ... ## $ Mortality.Rate : num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ``` ] **Note**: The new data frame with the mutated variable will not be saved automatically, unless we do so as shown in the above. --- ### Example: Creating Multiple New Variables We may construct additional variables by utilizing additional arguments in `mutate`. .panelset[ .panel[.panel-name[Mutate Function] For example, let's first construct `Mortality.Rate`, then a measure called `Welfare` defined as `\(Welfare = GDP.Growth^2 Mortality.Rate^2\)` `mutate` constructs variables sequentially. First, it constructs `Mortailty.Rate` (which didn't exist earlier). Then, using `Mortailty.Rate`, it constructs `Welfare`. ] .panel[.panel-name[Output]
] ] --- ### Example: Creating Multiple New Variables The `mutate` function enables us to call functions while we mutate the variables. For example, suppose we want to transform `GDP.PerCap` into logs, call it `Log.GDP.PerCap`. We may do so by using the `log` function (Note: we won't be saving the results) ```r str(mutate(WDI.data, Log.GDP.PerCap = log(GDP.PerCap))) ``` ``` ## tibble [5,425 × 11] (S3: tbl_df/tbl/data.frame) ## $ Code : chr [1:5425] "ABW" "ABW" "ABW" "ABW" ... ## $ Country : chr [1:5425] "Aruba" "Aruba" "Aruba" "Aruba" ... ## $ Year : num [1:5425] 1995 1996 1997 1998 1999 ... ## $ GDP.Growth : num [1:5425] 2.55 1.19 7.05 1.99 1.24 ... ## $ GDP.PerCap : num [1:5425] 26705 26088 27191 27152 26954 ... ## $ Gini : num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Child.Mortality: num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Poverty.Under.2: num [1:5425] NA NA NA NA NA NA NA NA NA NA ... ## $ Income.Group : chr [1:5425] "High income" "High income" "High income" "High income" ... ## $ Region : chr [1:5425] "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" ... ## $ Log.GDP.PerCap : num [1:5425] 10.2 10.2 10.2 10.2 10.2 ... ``` --- ### Example: Creating Multiple New Variables .pull-left[ Using `mutate`, we may also generate a new column with the mean and standard deviation of `GDP.Growth`, called `GDP.Growth.mean` and `GDP.Growth.sd` **Note**: So far, we have been working with a data frame containing `NA`s. Unless we are performing numerical computations, we usually do not have to worry about `NA`s as R recognizes them as missing values. Nonetheless, it is a good practice to tell R not to include the `NA`s. The argument `na.rm = TRUE` tells R to remove the `NA`s when calculating the mean and standard deviation of `GDP.Growth`. Otherwise, you will receive `NaN` (not-a-number) as an output. ] .pull-right[ ```r mutate(WDI.data, GDP.Growth.mean = mean(GDP.Growth, na.rm = TRUE), GDP.Growth.sd = sd(GDP.Growth, na.rm = TRUE) ) ``` ``` ## # A tibble: 5,425 × 12 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 1995 2.55 26705. NA NA ## 2 ABW Aruba 1996 1.19 26088. NA NA ## 3 ABW Aruba 1997 7.05 27191. NA NA ## 4 ABW Aruba 1998 1.99 27152. NA NA ## 5 ABW Aruba 1999 1.24 26954. NA NA ## 6 ABW Aruba 2000 7.62 28417. NA NA ## 7 ABW Aruba 2001 -2.97 26966. NA NA ## 8 ABW Aruba 2002 -3.27 25508. NA NA ## 9 ABW Aruba 2003 1.98 25469. NA NA ## 10 ABW Aruba 2004 7.91 27006. NA NA ## # ℹ 5,415 more rows ## # ℹ 5 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr>, ## # GDP.Growth.mean <dbl>, GDP.Growth.sd <dbl> ``` ] --- ### Example: Creating Multiple New Variables Another example on creating multiple variables using `mutate`: ```r iris %>% mutate(sum.Petal = Petal.Length + Petal.Width, sum.Sepal = Sepal.Length + Sepal.Width, mean.Sepal = mean(Sepal.Length), log.Petal = log(Petal.Length)) %>% head() ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum.Petal sum.Sepal ## 1 5.1 3.5 1.4 0.2 setosa 1.6 8.6 ## 2 4.9 3.0 1.4 0.2 setosa 1.6 7.9 ## 3 4.7 3.2 1.3 0.2 setosa 1.5 7.9 ## 4 4.6 3.1 1.5 0.2 setosa 1.7 7.7 ## 5 5.0 3.6 1.4 0.2 setosa 1.6 8.6 ## 6 5.4 3.9 1.7 0.4 setosa 2.1 9.3 ## mean.Sepal log.Petal ## 1 5.843333 0.3364722 ## 2 5.843333 0.3364722 ## 3 5.843333 0.2623643 ## 4 5.843333 0.4054651 ## 5 5.843333 0.3364722 ## 6 5.843333 0.5306283 ``` ```r head(iris, 3) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ``` --- class: center, middle, inverse # The Summarise Function --- ### Aggregating or Rolling Up Data The `summarise` function allows us to aggregate or roll up our data by summarizing them using operations such as counts, the mean, the standard deviation, etc. When a data frame is passed into `summarise`, it will be collapsed into the output that summarize the variables (say, into the count, mean, standard deviation, etc). As before, the first argument in `summarise` takes in the data frame. The second, third, fourth, etc. arguments specify what you wish to summarize and how. --- ### Example: Generating Summary Statistics .pull.left[ For example, let's compute the mean and standard deviation of `GDP.Growth`. Because this involves numerical computation, we use `na.rm=TRUE` to remove the `NA` values while computing the mean and standard deviation. ] .pull-right[ ```r summarise(WDI.data, GDP.Growth.mean = mean(GDP.Growth, na.rm=TRUE), GDP.Growth.sd = sd(GDP.Growth, na.rm=TRUE)) ``` ``` ## # A tibble: 1 × 2 ## GDP.Growth.mean GDP.Growth.sd ## <dbl> <dbl> ## 1 3.82 5.99 ``` ] --- ### Example: Generating Summary Statistics We may also count the number of observations. Let's call it Data.Count ```r summarise(WDI.data, Data.Count = n()) ``` ``` ## # A tibble: 1 × 1 ## Data.Count ## <int> ## 1 5425 ``` --- ### Exercise Use the `iris` dataset. 1. Restrict the data to `Species == "setosa"` using the `filter` function. Save it as `df.iris`. 2. Using the `summarise` function, compute the mean and standard deviation of `Petal.Length` of setosa irises based on `df.iris`. Save the mean and standard deviation as `m.setosa` and `sd.setosa`. 3. Let's compute the mean for `Petal.Length` and `Petal.Width` in `df.iris`. To do so, we will use the `across` helper function, i.e. `across(c("Petal.Length", "Petal.Width"), mean)` in `summarise`. --- ### Exercise 4. Let's repeat the above exercise. Instead of specifying `c("Petal.Length", "Petal.Width")` in the `across` function, use `starts_with("P")`. 5. Finally, let's compute the mean and standard deviation for `Petal.Length` and `Petal.Width`. To do so, we need to specify two functions: `mean` and `sd`. To combine these functions, we use the list function in `across(starts_with("P"), list(mean = mean,sd = mean))`. --- ### Imposing a Group Structure on the Dataset with `group_by` Sometimes, we may need to perform group-level operations. The `group_by` function allows us to first group our data, and then apply functions to the data at the group level. To do so, we pass the data frame into the 1st argument of `group_by` (as always). Then, in the second argument of `group_by`, we pass in the variable, i.e. `Country`, that we want R to group at. --- ### Example: Computing Country-by-Country Summary Statistics For example, let's summarize GDP growth by countries, i.e. summarize GDP growth country-by-country. Let's compare the results by using tibble sum, `tbl_sum`, a tidy version of the summary function ```r tbl_sum(WDI.data) # It just says "5,425 x 10" ``` ``` ## A tibble ## "5,425 × 10" ``` ```r # Group the data by country WDI.data.group <- group_by(WDI.data, Country) tbl_sum(WDI.data.group) # Notice that it says Groups "Country [217]" ``` ``` ## A tibble Groups ## "5,425 × 10" "Country [217]" ``` R has now grouped the data frame by countries. If you view the data frames `WDI.data` and `WDI.data.group`, they look identical. However, the functions will be applied differently when the data are grouped (i.e. the functions will be applied group-by-group) than when the data are un-grouped (in this case, the entire dataset is just a single group). --- ### Example: Computing Country-by-Country Summary Statistics Let's summarise GDP Growth from `WDI.data.group` (remember to use `na.rm = TRUE`). Since we have grouped the dataset by country, instead of having a single output, we now have an output for each country. ```r WDI.data.sum <- summarise(WDI.data.group, GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE)) head(WDI.data.sum) ``` ``` ## # A tibble: 6 × 2 ## Country GDP.Growth.mean ## <chr> <dbl> ## 1 Afghanistan 6.83 ## 2 Albania 4.84 ## 3 Algeria 3.43 ## 4 American Samoa -1.16 ## 5 Andorra 2.29 ## 6 Angola 6.37 ``` --- ### Exercise Use the `iris` dataset. 1. Use the `group_by` function to group the dataset by `Species`. Save the data frame as `df.iris.g`. 2. Compute the mean of all variables, i.e. across columns 1:4, in `df.iris.g` using the `summarise` function. Hint: use `across(1:4, mean, .names = "{.col}_mean")`. --- class: center, middle, inverse # The Pipe Operator --- ### The Pipe Operator The pipe operator `%>%` enables us to break up nested function operations (i.e.functions on functions) to improve readability. Consider the following: Instead of generating a new data frame `WDI.data.group`, we may directly enter `group_by(WDI.data, Country)` into the `summarise` command, and then into the `head` function all at once: ```r head(summarise(group_by(WDI.data, Country), GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE)),3) ``` ``` ## # A tibble: 3 × 2 ## Country GDP.Growth.mean ## <chr> <dbl> ## 1 Afghanistan 6.83 ## 2 Albania 4.84 ## 3 Algeria 3.43 ``` In the above, the function `head` was applied to the function `summarise`, which was then applied to the function `group_by` so on. Although we may reduce the number of lines of codes, such codes are very difficult to read. --- ### The Pipe Operator The pipe operator enables us to break these functions down into a chain of functions. The rational of how it works lies in the fact that the 1st argument of the function takes in the data frame. Therefore, the pipe operator "pipes" a data frame into a function, obtains the resulting data frame, then pipes it into the next function, obtains the resulting data frame, then pipes it into the next function, and so on. --- ### Example: Simplifying a Complicated Command Line Using the Pipe Operator As example, consider the following command again. ```r head(summarise(group_by(WDI.data, Country), GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE))) ``` To break the above into readable parts, this is what we may do. - Step 1: We pipe in `WDI.data` into `group_by` to get WDI.data.group. <br>`WDI.data.group <- group_by(WDI.data, Country)` - Step 2: We pipe in `WDI.data.group` into `summarise` to get WDI.data.sum. <br>`WDI.data.sum <- summarise(WDI.data.group, GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE))` - Step 3: We pipe in `WDI.data.sum` into `head`. <br>`head(WDI.data.sum)` --- ### Example: Simplifying a Complicated Command Line Using the Pipe Operator The equivalent commands using the pipe operator `%>%` are the following: ```r # Step 1: We pipe in WDI.data into group_by() to get WDI.data.group WDI.data %>% group_by(Country) -> WDI.data.group #save into WDI.data.group with the right assignment # Step 2: We pipe in WDI.data.group into summarise() to get WDI.data.sum WDI.data.group %>% summarise(GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE)) -> WDI.data.sum # Step 3: We pipe in WDI.data.sum into head() WDI.data.sum %>% head() ``` ``` ## # A tibble: 6 × 2 ## Country GDP.Growth.mean ## <chr> <dbl> ## 1 Afghanistan 6.83 ## 2 Albania 4.84 ## 3 Algeria 3.43 ## 4 American Samoa -1.16 ## 5 Andorra 2.29 ## 6 Angola 6.37 ``` --- ### Example: Simplifying A Complicated Command Line Using the Pipe Operator When using pipe operators, we do not need to pass the data frame into the 1st argument of the function. Thus, when pipe operators are used, the first argument of a function will take in the condition to be imposed. With the pipe operator, there is no need to generate the intermediate data frames `WDI.data.group` and `WDI.data.sum`. We may just keep piping: ```r WDI.data %>% group_by(Country)%>% summarise(GDP.Growth.mean=mean(GDP.Growth, na.rm = TRUE)) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Country GDP.Growth.mean ## <chr> <dbl> ## 1 Afghanistan 6.83 ## 2 Albania 4.84 ## 3 Algeria 3.43 ## 4 American Samoa -1.16 ## 5 Andorra 2.29 ## 6 Angola 6.37 ``` --- ### Example: A More Complicated Case Let's attempt a more complicated example. First, let's select the `Country` and `GDP.PerCap` variables, then filter the data for countries with GDP per capita greater than USD 30,000. After which, let's use the `head` function to view the first few rows of the resulting dataset. Without piping, our codes will look like the following: ```r df1 <- select(WDI.data, Year, Country, GDP.PerCap, Region) df2 <-filter(df1,Year > 1999) df3 <- group_by(df2, Region) df4 <- summarise(df3, GDP.PerCap.mean = mean(GDP.PerCap, na.rm = TRUE)) head(df4, 3) ``` ``` ## # A tibble: 3 × 2 ## Region GDP.PerCap.mean ## <chr> <dbl> ## 1 East Asia & Pacific 13143. ## 2 Europe & Central Asia 28660. ## 3 Latin America & Caribbean 11441. ``` --- ### Example: A More Complicated Case In the above, we want `df4` but not the intermediate data frames. To avoid generating these intermediate data frames, we may pass functions into functions, but this is barely readable: .pull-left[ Here, we first we pass `WDI.data` into `select( ,Year, Country, GDP.PerCap, Region)` to obtain `df1`. Then, we pass `d1` into `filter( , Year > 1999)` to obtain `df2`, a data frame with data from 2000 and later. Then, we pass `df2` into `group_by( , Region)` to group the data according to different regions, represented by `df3`. Then, we pass `df3` into `summarise( ,GDP.PerCap.mean = mean(GDP.Percap)` to find the average GDP per capita for each region, which is captured by `df4`. Finally, we pass `df4` into `head` - all very messy!] .pull-right[ ```r head(summarise(group_by(filter(select(WDI.data, Year, Country, GDP.PerCap, Region),Year > 1999), Region), GDP.PerCap.mean = mean(GDP.PerCap, na.rm= TRUE))) ``` ``` ## # A tibble: 6 × 2 ## Region GDP.PerCap.mean ## <chr> <dbl> ## 1 East Asia & Pacific 13143. ## 2 Europe & Central Asia 28660. ## 3 Latin America & Caribbean 11441. ## 4 Middle East & North Africa 16227. ## 5 North America 58287. ## 6 South Asia 2011. ``` ] --- ### Example: A More Complicated Case Let's use the pipe operator to break these codes up: ```r WDI.data %>% select(Year, Country, GDP.PerCap, Region) %>% filter(Year > 1999) %>% group_by(Region) %>% summarise(GDP.PerCap.mean = mean(GDP.PerCap, na.rm = TRUE)) %>% head() ``` ``` ## # A tibble: 6 × 2 ## Region GDP.PerCap.mean ## <chr> <dbl> ## 1 East Asia & Pacific 13143. ## 2 Europe & Central Asia 28660. ## 3 Latin America & Caribbean 11441. ## 4 Middle East & North Africa 16227. ## 5 North America 58287. ## 6 South Asia 2011. ``` --- ### Exercise Simplify the following command using the `%>%` operator. 1. `select(mtcars,mpg, cyl, disp)` 2. `filter(select(mtcars,mpg, cyl, disp), mpg > 20)` 3. `head(filter(select(mtcars,mpg, cyl, disp), mpg > 20), 10)` Save the results into `df` --- class: center, middle, inverse # Scoping --- ### Scoping The scope of operation of functions such as `summarise` and `mutate` can be defined. For example, we may want to summarise the mean of all variables, or mutate all variables. If we have a data frame with many variables, it will be inefficient to summarise or mutate procedure repeatedly for each variable. By defining the scope of the operation, we may specify the variables that we wish to apply the operations to. .pull-left[ To illustrate, let's construct the subset, `WDI.data.small`, based on the first 5 columns of 'WDI.data':] .pull-right[ ```r WDI.data.small <- select(WDI.data, 1:5) WDI.data.small %>% str() ``` ``` ## tibble [5,425 × 5] (S3: tbl_df/tbl/data.frame) ## $ Code : chr [1:5425] "ABW" "ABW" "ABW" "ABW" ... ## $ Country : chr [1:5425] "Aruba" "Aruba" "Aruba" "Aruba" ... ## $ Year : num [1:5425] 1995 1996 1997 1998 1999 ... ## $ GDP.Growth: num [1:5425] 2.55 1.19 7.05 1.99 1.24 ... ## $ GDP.PerCap: num [1:5425] 26705 26088 27191 27152 26954 ... ``` ] --- ### The `across` Function The `across()` function can be used to define the scope. The basic syntax, where `.cols` defines the columns to apply the operations to, `.fns` defines the functions to apply to these columns, is ```r across(.cols, .fns, ..., .names = NULL) ``` .pull-left[ For example, suppose we want to compute the mean for the 4th and 5th columns in `WDI.data.small`. ] .pull-right[ ```r WDI.data.small %>% summarise(across(4:5, mean, na.rm=T)) ``` ``` ## # A tibble: 1 × 2 ## GDP.Growth GDP.PerCap ## <dbl> <dbl> ## 1 3.82 14366. ``` ] --- ### Applying Multiple Functions in `across` Multiple functions can be applied to the columns specified inside the `across()` function. To do so, we combine the function names using the `list()` function. .pull-left[ For example, suppose we want to compute both the mean and standard deviation for the 4th and 5th columns in `WDI.data.small`. ] .pull-right[ ```r WDI.data.small %>% summarise(across(4:5, list(mean, sd), na.rm=T)) ``` ``` ## # A tibble: 1 × 4 ## GDP.Growth_1 GDP.Growth_2 GDP.PerCap_1 GDP.PerCap_2 ## <dbl> <dbl> <dbl> <dbl> ## 1 3.82 5.99 14366. 21679. ``` ] --- ### Specifying the New Variable Names We may specify the names of the new columns generated by using the `.names` option in the `across()` function. The column names can be extracted via `{.col}`. To extract the function names, we must first define them in the `list()` function. Then, we may extract the function names using `{.fn}`. .pull-left[ Again, suppose we want to name the mean and standard deviation for the 4th and 5th columns by tagging `mean` and `sd`. Let's name the `mean()` function as `Average` and the `sd()` function as `StDev`. To generate the names of the new variables, we use the `.names` option in `across()`, but do remember to enclose the specified name with `" "`. ] .pull-right[ ```r WDI.data.small %>% summarise(across(4:5, list(Average = mean, StDev = sd), na.rm=T, .names = "{.col}_{.fn}")) ``` ``` ## # A tibble: 1 × 4 ## GDP.Growth_Average GDP.Growth_StDev GDP.PerCap_Average GDP.PerCap_StDev ## <dbl> <dbl> <dbl> <dbl> ## 1 3.82 5.99 14366. 21679. ``` ] --- ### Using a Test Condition We may use a test condition to select the columns to apply `summarise` or `mutate` on. For instance, ```r WDI.data.small %>% summarise(across(where(is.numeric), mean, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 3 ## Year GDP.Growth GDP.PerCap ## <dbl> <dbl> <dbl> ## 1 2007 3.82 14366. ``` ```r WDI.data.small%>% summarise(across(where(is.numeric), list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))) ``` ``` ## # A tibble: 1 × 6 ## Year_1 Year_2 GDP.Growth_1 GDP.Growth_2 GDP.PerCap_1 GDP.PerCap_2 ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2007 7.21 3.82 5.99 14366. 21679. ``` --- ### Using the Functions Themselves Rather than providing the function name in `across()`, we may use to use the functions themselves. This might be necessary if the functions have different options that you wish to specify. To do so, aggregate up the functions using `list` and call each function using the lambda sign `~`, as shown here. ```r summarise_if(WDI.data.small, is.numeric, list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE))) ``` ``` ## # A tibble: 1 × 6 ## Year_mean GDP.Growth_mean GDP.PerCap_mean Year_sd GDP.Growth_sd GDP.PerCap_sd ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2007 3.82 14366. 7.21 5.99 21679. ``` The dot (.) in the `mean` and `sd` functions tells R that you want to pass in all the variables in the data frame. --- class: center, middle, inverse # Joining Data Frames --- ### Joining Data Frames Let's generate two artificial datasets, `df1` and `df2`. ```r Country <- c("BRU", "KHM", "IDR", "MNR", "MYS" , "PHI", "SGP", "THA", "VNM") Country<- rep(Country,2) %>% sort() # repeat twice, then sort by country Year <- c(2000, 2010) gdp.growth = rnorm(18, mean=2, sd=2) unemploy.rate = rnorm(18, mean=3, sd=1) df1 <- data.frame(Country, Year, gdp.growth) # construct data frame df1 df1 <- df1[order(Country),] # Sort df1 according to countries df2 <- data.frame(Country, Year, unemploy.rate) # construct data frame df2 df2 <- df2[order(Country),] # Sort df2 according to countries ``` --- ### Joining Data Frames The datasets are .pull-left[ ```r head(df1,2) ``` ``` ## Country Year gdp.growth ## 1 BRU 2000 1.663512 ## 2 BRU 2010 1.133855 ``` ] .pull-right[ ```r head(df2,2) ``` ``` ## Country Year unemploy.rate ## 1 BRU 2000 2.237101 ## 2 BRU 2010 1.999034 ``` ] --- ### Joining Data Frames Let's merge `df1` with `df2` using `left_join`. To do so, we need a common variable (a key) to match on from both datasets. The common variable is `"Country"`. The `left_join` function then matches each row in `df1` with a matching row in `df2`. In other words, `left_join` takes each row in the first (left) dataset and match it on the matching row or rows in the second dataset. In this context, this is *unsatisfactory* as each country is repeated twice in each dataset, one corresponding to 2000 and another to 2010. Therefore, upon matching on `"Country"`, there will be 4 rows for each country, rather than 2. ```r merged.df1 <- left_join(x = df1, y = df2, by = "Country") head(merged.df1) ``` ``` ## Country Year.x gdp.growth Year.y unemploy.rate ## 1 BRU 2000 1.663512 2000 2.237101 ## 2 BRU 2000 1.663512 2010 1.999034 ## 3 BRU 2010 1.133855 2000 2.237101 ## 4 BRU 2010 1.133855 2010 1.999034 ## 5 IDR 2000 4.742255 2000 4.892738 ## 6 IDR 2000 4.742255 2010 3.021712 ``` --- ### Joining Data Frames Because our data have a longitudinal structure, instead of joining the data frames by `"Country"` along, we should join them by `"Country"` and `"Year"`: ```r merged.df2 <- left_join(x = df1, y = df2, by = c("Country", "Year")) head(merged.df2) ``` ``` ## Country Year gdp.growth unemploy.rate ## 1 BRU 2000 1.6635122 2.237101 ## 2 BRU 2010 1.1338552 1.999034 ## 3 IDR 2000 4.7422547 4.892738 ## 4 IDR 2010 1.7623208 3.021712 ## 5 KHM 2000 1.0802088 4.560416 ## 6 KHM 2010 0.6045959 4.197642 ``` --- ### Joining Data Frames Consider a third data frame `df3` containing the size of the countries, measured in 1000km$^2$. .pull-left[ ```r Country <- c("BRU", "KHM", "IDR", "MNR", "MYS" , "PHI", "SGP", "THA", "VNM") country.size <- c(5.77, 181.04, 1905, 676.58, 329.85, 300, 0.72, 513.12, 331.21) df3 <- data.frame(Country, country.size) # construct data frame df1 ``` ] .pull-right[ ``` ## Country country.size ## 1 BRU 5.77 ## 2 KHM 181.04 ## 3 IDR 1905.00 ``` ] .pull-left[ Let's do a `left_join` of `merged.df2` with `df3`. Notice that the country size is repeated across years. This makes sense as country size is a time invariant variable. ] .pull-right[ ```r merged.df3 <- left_join(x = merged.df2, y = df3, by = "Country") glimpse(merged.df3) ``` ``` ## Rows: 18 ## Columns: 5 ## $ Country <chr> "BRU", "BRU", "IDR", "IDR", "KHM", "KHM", "MNR", "MNR", … ## $ Year <dbl> 2000, 2010, 2000, 2010, 2000, 2010, 2000, 2010, 2000, 20… ## $ gdp.growth <dbl> 1.6635122, 1.1338552, 4.7422547, 1.7623208, 1.0802088, 0… ## $ unemploy.rate <dbl> 2.237101, 1.999034, 4.892738, 3.021712, 4.560416, 4.1976… ## $ country.size <dbl> 5.77, 5.77, 1905.00, 1905.00, 181.04, 181.04, 676.58, 67… ``` ] --- ### Getting Data into the "Long Form" Before we plot, the data must be arranged in the "long form", meaning that the columns must represent variables and the rows represent observations. Certain raw data could be arranged in "wide" form, as shown below:
--- class: center, middle, inverse # Reshaping Data --- ### Getting Data into the "Long Form" We need to clean it up by getting it to look something that looks like ```r head(WDI.data) ``` ``` ## # A tibble: 6 × 10 ## Code Country Year GDP.Growth GDP.PerCap Gini Child.Mortality ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 ABW Aruba 1995 2.55 26705. NA NA ## 2 ABW Aruba 1996 1.19 26088. NA NA ## 3 ABW Aruba 1997 7.05 27191. NA NA ## 4 ABW Aruba 1998 1.99 27152. NA NA ## 5 ABW Aruba 1999 1.24 26954. NA NA ## 6 ABW Aruba 2000 7.62 28417. NA NA ## # ℹ 3 more variables: Poverty.Under.2 <dbl>, Income.Group <chr>, Region <chr> ``` This can be done by using the `pivot_longer` or `pivot_wider` command. This is covered in the slides for Seminar 4 Part 3 (More Data Wrangling). --- class: center, middle, inverse # The stringr Package --- ### Working with Strings The `stringr` package facilitates working with strings, such as replace certain parts of a string within a string variable (i.e. `str_replace`), wrapping a text (useful for wrapping long legend titles and axis text in a plot) (i.e. `str_wrap`), padding a string with a leading 0 (e.g. transform `1` to `01`, `2` to `02`) (i.e. `str_pad`), etc. See https://stringr.tidyverse.org/articles/from-base.html --- # Appendix --- ### The `ifelse` Function In Seminar 1, we employed the `ifelse` function to recode values in a dataset. Here, we show how the `recode` function from `dplyr` can be used instead. The `recode` function has a general syntax `recode(x, "old1" = "new1", "old2" = "new2", ...)` The first argument of the `recode` is the variable `x` whose values are to be re-coded. The second, third, fourth, etc. arguments are the specific values in the variable that you wish to recode to. --- ### The `ifelse` Function As an example, to abbreviate the names of the Iris species from "setosa" to "SE", "virginica" to "VA", and "versicolor" to "VE": ```r #save the iris data into iris.df iris.df <- iris # Save the recoded variables into Species.Short iris.df$Species.Short <- recode(iris.df$Species, "setosa" = "SE", "virginica" = "VA","versicolor" = "VE") head(iris.df) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species.Short ## 1 5.1 3.5 1.4 0.2 setosa SE ## 2 4.9 3.0 1.4 0.2 setosa SE ## 3 4.7 3.2 1.3 0.2 setosa SE ## 4 4.6 3.1 1.5 0.2 setosa SE ## 5 5.0 3.6 1.4 0.2 setosa SE ## 6 5.4 3.9 1.7 0.4 setosa SE ```