RMarkdown is an RStudio notebook format that seamlessly integrates report writing with statistical or machine learning analysis, figure generation, and more. It enables a report to be completely reproducible in various output formats, such as Word, HTML, and PDF. Documents generated by RMarkdown are also fully customizable through the RMarkdown source code. To learn about RMarkdown's capabilities, please refer to the RMarkdown guide.
We can use RMarkdown to generate various types of outputs:

- HTML: HTML output can include slides and documents containing dynamic features like interactive diagrams (e.g., a leaflet widget).
- MS Word: RMarkdown source code can be knit into an MS Word document.
- PowerPoint: RMarkdown source code can be knit into a PowerPoint presentation.
The benefit of using RMarkdown over a word processor or presentation software is that it integrates data extraction, data wrangling, visualization, statistical, and machine learning computations with report writing. This removes the need for us to use different software for data analysis and report writing.
As an example of how data analysis and report writing can be integrated, the analyses in Sections 3.2 and 3.4 below connect the RMarkdown source code of this document to the data.gov.sg and World Development Indicator data servers through their respective Application Programming Interfaces (APIs). This enables us to download new data from these servers and update our data analysis simply by knitting the RMarkdown source codes of this document again.
For this course, all the slides are knitted into HTML files. Unlike PowerPoint slides, HTML slides can present dynamic visualizations, such as interactive figures, maps, and animated plots. There are primarily three types of HTML slides: ioslides, slidy, and xaringan. They differ mainly in how customizable they are. Our course slides are made with xaringan, as it allows for a higher level of customization than slidy and ioslides permit. To knit this document, we need the following packages. Please install them first before proceeding.
# For data wrangling and plotting
library(tidyverse)
library(ggrepel)
# For generating equations through LaTeX syntax
library(tinytex)
# For extracting data through a GET API call
library(httr) # To use GET() to extract data from the server in JSON format
library(jsonlite) # To use fromJSON() to convert JSON content into a data frame
# Other packages
library(stats) # For implementing logistic regression via glm()
library(reticulate) # To use Python code in R
As a final remark, while RMarkdown is a powerful tool for integrating data analysis and report writing, it is not the only option available. Quarto is another format that serves a similar purpose but offers greater flexibility in the types of documents it can generate. Additionally, Quarto depends less on external packages compared to RMarkdown, which makes it an increasingly popular choice for report writing that incorporates data analysis.
RMarkdown integrates R scripting tools (such as those for data visualization and analysis) with the creation of reports and presentations. R code can be run either inline or in an R code chunk.
To generate inline R code, enclose your code in a pair of backticks (the backtick key sits below the Esc key on most keyboards), with the opening backtick followed immediately by the letter "r" and then your R code, as shown here:
`r codes here`
For instance, to calculate 2 + 2 in the backend, we use the inline code `r 2 + 2`. When knitted, the expression 2 + 2 is rendered as its computed output, 4.
A code chunk is a block of text that is recognized as an R script. We
can do our R scripting work inside a code chunk. To create a code chunk
in the RMarkdown code, start with the backtick sequence
```{r}
and end with three backticks ```
```{r}
summary(mtcars)
```
When knitted, the "wrapper" of the code chunk is hidden in the generated report, but the code summary(mtcars) will be evaluated (unless specified otherwise). For example, the following R code produces the scatter plot shown below.
# Specify Petal.Length and Sepal.Length as the x,y global aesthetic
p <- ggplot(data = iris, mapping = aes(x = Petal.Length, y = Sepal.Length))
# Construct a scatter plot of Petal Length against Sepal Length
p + geom_point(aes(color = Species)) +
labs(title = "Petal vs Sepal Length of Iris (by Species)",
x = "Petal Length", y = "Sepal Length", color = "") +
theme_minimal()
As another example, let’s consider an artificial dataset with 3 observations:
# Set your values here
GDP.Growth.SG <- 3
GDP.Growth.MY <- 5
GDP.Growth.ID <- 4
Based on the dataset, the average GDP growth across these three ASEAN economies is 4 percent. The number 4 is calculated in the backend using R, where the actual inline code, hidden from the report, looks like:
`r (GDP.Growth.SG + GDP.Growth.MY + GDP.Growth.ID)/3`
There are several settings for code chunks, depending on whether we wish to hide the code, execute the code, report messages, etc. Some common options are listed below, followed by a sketch of how they might be combined in a chunk header.

- echo = FALSE if you want to hide the code.
- include = FALSE if you want to hide everything (nothing, including the output, will be shown).
- warning = FALSE if you want to hide warnings.
- message = FALSE if you want to hide messages.
- results = "hide" if you want to hide the R output.
- eval = FALSE if you do not want R to execute the code chunk.
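For instance, a chunk that runs quietly and shows only its figure might be declared as in the sketch below; the chunk name and the code inside it are hypothetical, purely for illustration:

```{r quiet.plot, echo = FALSE, message = FALSE, warning = FALSE}
# This hypothetical chunk hides its code, messages, and warnings,
# but its plot still appears in the knitted report.
hist(mtcars$mpg, main = "Distribution of MPG", xlab = "Miles per gallon")
```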
These settings can be declared globally, applying to all code chunks in the document (unless a specific code chunk specifies a different setting). For example, to hide all messages and warnings, show all the code, and set a default size and alignment for all figures throughout the document, we may apply these settings through the opts_chunk$set function as shown below. In this RMarkdown file, these settings are applied in a setup chunk before the introduction section; the chunk itself is hidden from the generated report.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(fig.height = 6, fig.width = 8.5, fig.align = "center")
```
To override the global settings above for a specific code chunk, we need to specify the desired parameters for that chunk. For instance, in the code chunk below, we show the code but hide the results by using the settings echo = T and results = "hide":
```{r hide.1, echo = T, results = "hide"}
# Run a regression of mpg on disp and cyl
reg.out <- lm(mpg ~ disp + cyl , data = mtcars)
# Summarize the regression results
summary(reg.out)
```
A regression model has been estimated by the code above, but the results are suppressed. Nonetheless, we may still reference results from the regression output inline. For instance, the coefficient on the displacement variable, disp, is -0.021.
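That number can itself be produced with inline R code rather than typed by hand. A minimal sketch, assuming the chunk above has already created reg.out, is to write the following in the source:
`r round(coef(reg.out)["disp"], 3)`
which extracts the disp coefficient from the fitted model and rounds it to three decimal places.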
Finally, for debugging purposes, it is a good idea to name each code chunk. For example, the code chunk below is named glm.1. If there is an error, the error message will point to the name of the offending code chunk. Note that the name of each code chunk must be unique; otherwise, the knitting process will fail.
```{r glm.1, echo = T, results = "hide"}
# Run a logistic regression of Species on the petal and sepal length and width
logistic.out <- glm(Species ~ Petal.Length + Petal.Width +
Sepal.Length + Sepal.Width,
family= binomial(link = "logit"),
data = iris)
# Summarize the regression results
summary(logistic.out)
```
RMarkdown can integrate data extraction, data wrangling, data analysis, and report writing in a programmatic manner that ensures complete reproducibility of the report. We will consider some examples below.
Imagine it is 2019 and we have data on retrenchment up to Q4 2018; let's call this dataset retrenchment_2018.csv. A year later, in 2020, we have data on retrenchment up to Q4 2019, named retrenchment_2019.csv.

Since the name of the data file changes only in the year (i.e. from retrenchment_2018.csv to retrenchment_2019.csv), we may simply declare the year as a variable and paste it together with the first portion of the file name (i.e. retrenchment_) using the paste() function.
# To use the as.yearqtr() function to convert dates specified as year-quarter into year-month-day format.
library(zoo)
# Choose year here - e.g. 2018 or 2019
YEAR = 2018
# Construct the name of the csv file by using paste() with no separator
# The output of paste("retrenchment_", YEAR, ".csv", sep = "") would be retrenchment_2018.csv
df <- read_csv(paste("retrenchment_", YEAR, ".csv", sep = ""))
To update the above analysis using 2019 data, all we need to do is change the year in the above code chunk to YEAR = 2019. The rest of the code will paste the different parts of the string together, i.e. paste("retrenchment_", YEAR, ".csv", sep = ""), into retrenchment_2019.csv, which will be read into the data frame df and plotted below.
### Data Cleaning ###
# Convert the data into a data frame
df <- as_tibble(df)
# Clean up the variables by coercing them into the correct types
df$quarter <- as.Date(as.yearqtr(df$quarter, format = "%Y-Q%q"))
df$industry1 <- as.factor(df$industry1)
df$retrench <- as.numeric(df$retrench)
# Reorder the data, filter the data for manufacturing, construction and services, and calculate the mean retrenchment numbers for these sectors
df <- df[order(df$industry1, df$quarter),]
df <- filter(df, df$industry1 %in% c("manufacturing", "construction", "services") )
df <- df %>% group_by(industry1) %>% mutate(mean_retrench = mean(retrench, na.rm = TRUE))
### Plotting ###
# Specify the data and aesthetics globally
p <- ggplot(df, aes(x = quarter, y = retrench, color = industry1))
# Visualize retrenchments for the three sectors through a line plot
p + geom_line(size = 0.6) +
geom_line(aes(x = quarter, y = mean_retrench, color =industry1), size = 0.6)+
geom_ribbon(aes(xmin = as.Date("2008-10-01"), xmax = as.Date("2010-01-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("1997-07-01"), xmax = as.Date("1998-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2000-04-01"), xmax = as.Date("2001-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2003-01-01"), xmax = as.Date("2003-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2020-04-01"), xmax = as.Date("2021-07-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
labs( title = "Number of Retrenchments by Industry, Quarterly", subtitle = paste("Q1 1998 to Q4", YEAR, ", Recessions in Red"), y = "", x = "", color = "") +
theme(legend.position = "top") +
theme_minimal()
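The recession bands above are drawn with geom_ribbon, which requires repeating the y aesthetic in every layer. A common alternative, shown here as a sketch rather than the approach used in the original code, is annotate("rect", ...), which shades a date range without referencing the data at all:

```r
# Hypothetical alternative: shade the 2008-2009 recession with annotate()
p + geom_line(size = 0.6) +
  annotate("rect",
           xmin = as.Date("2008-10-01"), xmax = as.Date("2010-01-01"),
           ymin = -Inf, ymax = Inf, fill = "darkred", alpha = 0.1) +
  theme_minimal()
```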
To streamline the updating process, instead of manually specifying
the YEAR
parameter, we can directly access data from the
data provider using their Application Programming
Interface (API). An API serves as a link between different
applications, such as the client (you) and the data provider. By
leveraging an API, we can automate tasks like data extraction, cleaning,
visualization, and even analysis in the background, while concealing the
complexity of code execution when generating reports from RMarkdown.
In R, to fetch data using an API, we typically make an API
GET
call using the GET()
function from the
httr
package. For instance, websites like https://data.gov.sg offer
APIs for data retrieval. The GET
call to download the data
on retrenchment from https://data.gov.sg looks something like the
following:
GET("https://data.gov.sg/api/action/datastore_search?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500")
There are three main components in the design of the above web API (a sketch of assembling them in R follows the list):

- Base URL: https://data.gov.sg/
- Resource path: api/action/datastore_search
- Query: ?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500
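To make the construction explicit, the same call can be assembled from these three pieces. A minimal sketch, where the object names (base_url, resource_path, resource_id) are our own and not part of the original code:

```r
# Assemble the web API call from its three components (illustrative object names)
base_url      <- "https://data.gov.sg/"
resource_path <- "api/action/datastore_search"
resource_id   <- "3d180571-81d3-4834-a759-8374806b731e"

url <- paste0(base_url, resource_path, "?resource_id=", resource_id, "&limit=500")
raw.df <- httr::GET(url)
```

Alternatively, httr::GET() accepts a query argument (a named list of parameters) that builds the query string for us.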
Let's extract the retrenchment data from https://data.gov.sg using the API and update the plot on retrenchments if new data are available. First, we extract the data on retrenchments from data.gov.sg. The saved object, raw.df, contains the response in JSON format.
# Required libraries: httr, jsonlite
raw.df <- httr::GET("https://data.gov.sg/api/action/datastore_search?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500")
To extract the data frame from raw.df, we need to parse the JSON content after converting it to text (i.e. a character string). There is more than one way to do so. Here, we use the jsonlite library to parse raw.df into df.out.
# Parse the text content from raw.df
df.out <- fromJSON(rawToChar(raw.df$content), flatten = T)
Finally, extract the dataset from df.out. Note that different providers store the dataset in different locations. For data.gov.sg, the dataset can be found in $result$records:
df <- df.out$result$records
# Note: You may also try df <- df.out[[3]].
# Some trial and error is required
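One way to reduce the trial and error is to inspect the structure of the parsed object first; a quick sketch:

```r
# Inspect the top two levels of the parsed object to locate the records
str(df.out, max.level = 2)
```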
After obtaining the data, let's recycle the earlier code used for plotting trends in retrenchment. To save space, the code is hidden. Notice that the dates in the plot subtitle are updated automatically. This is done by using the paste() function to update the start and end periods in the subtitle, as shown below:
labs(title = "Number of Retrenchments by Industry, Quarterly", subtitle = paste(date.start[2], date.start[1], "to", date.end[2], date.end[1]), y = "", x ="", color = "")
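The objects date.start and date.end are created in the hidden code. A plausible sketch of how they might be built from the cleaned quarter variable, offered as our own reconstruction rather than the hidden chunk itself:

```r
# Illustrative reconstruction: date.start and date.end as (year, quarter) pairs,
# so that paste(date.start[2], date.start[1], ...) yields e.g. "Q1 1998"
first.q <- min(df$quarter, na.rm = TRUE)
last.q  <- max(df$quarter, na.rm = TRUE)

date.start <- c(format(first.q, "%Y"), quarters(first.q))
date.end   <- c(format(last.q, "%Y"), quarters(last.q))
```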
Different databases organize API calls differently. In data.gov.sg,
the resource itself is queried through the resource_id
. To
make the correct API call, we should refer to the API documentation. For
data.gov.sg, you can find the documentation at this
link.
Let’s follow the instructions provided on this webpage, which offers an example of extracting HDB resale housing data from January 2017 onwards.
For this API, the base URL is https://data.gov.sg/api/action/datastore_search. To extract the housing dataset, we append its resource ID, d_8b84c4ee58e3cfc0ece0d773c8ca6abc, by forming the query ?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc. In the case of data.gov.sg, the resource ID is "queried" (i.e. it comes after ?).
The complete web API for this example is
"https://data.gov.sg/api/action/datastore_search?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc"
We may now pass this link to fetch the housing data using a
GET
call.
# Extracting HDB Housing Data. raw.df2 is in the JSON format
raw.df2 <- GET("https://data.gov.sg/api/action/datastore_search?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc")
# Extracting the content from the raw JSON file
df.out2 <- fromJSON(rawToChar(raw.df2$content), flatten = T)
# Extracting the data frame and saving it
df2 <- df.out2$result$records
# Look at the extracted file to see if the correct data are extracted
glimpse(df2)
## Rows: 100
## Columns: 12
## $ `_id` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
## $ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
## $ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
## $ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
## $ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
## $ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
## $ floor_area_sqm <chr> "44", "67", "67", "68", "67", "68", "68", "67", "6…
## $ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
## $ lease_commence_date <chr> "1979", "1978", "1980", "1980", "1980", "1981", "1…
## $ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
## $ resale_price <chr> "232000", "250000", "262000", "265000", "265000", …
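As the glimpse above shows, numeric fields such as resale_price and floor_area_sqm arrive as character strings, so some coercion is needed before analysis. A minimal sketch of a follow-up step (our own illustration, not part of the original document):

```r
# Coerce price and floor area to numeric, then summarize average resale price by town
df2 %>%
  mutate(resale_price   = as.numeric(resale_price),
         floor_area_sqm = as.numeric(floor_area_sqm)) %>%
  group_by(town) %>%
  summarise(mean_price = mean(resale_price, na.rm = TRUE),
            n_sales    = n()) %>%
  arrange(desc(mean_price)) %>%
  head()
```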
To fetch other data series from data.gov.sg, follow the sequence below:

1. Locate the data's resource ID. To do so, visit https://beta.data.gov.sg/collections and click on the "Dataset" tab.
2. Search for the data by entering keywords in the search bar. For instance, enter "Health" as a keyword.
3. Click the link to the dataset you want. For example, click the link to "Common health problems of students examined - Overweight, Annual". This takes us to the dataset's unique URL, https://beta.data.gov.sg/datasets/d_7c3c14c03c4737ffefed396c477cbb94/view. The resource ID is d_7c3c14c03c4737ffefed396c477cbb94.
4. Use the resource ID to form the web API, "https://data.gov.sg/api/action/datastore_search?resource_id=d_7c3c14c03c4737ffefed396c477cbb94".
5. Extract the data by using the code below:
# Extracting the common health problems data
raw.df3 <- GET("https://data.gov.sg/api/action/datastore_search?resource_id=d_7c3c14c03c4737ffefed396c477cbb94")
# Extracting the content from the raw JSON file
df.out3 <- fromJSON(rawToChar(raw.df3$content), flatten = T)
# Extracting the data frame
df3 <- df.out3$result$records # Saving the data
# Look at the extracted file
glimpse(df3)
## Rows: 48
## Columns: 5
## $ `_id` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ year <chr> "2009", "2009", "2009", "2009", "2010", "2010", "20…
## $ age_group <chr> "Primary 1 and equivalent age groups", "Primary 1 a…
## $ gender <chr> "Male", "Female", "Male", "Female", "Male", "Female…
## $ per_10000_examined <chr> "1212", "1080", "1787", "1210", "1218", "1059", "17…
For the World Development Indicators, the resource is specified in the resource path. For example, the basic API call to extract total population is:

http://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?format=json

Below, we extract the population data for 2000 to 2022, set the limit to 10000 records per page, and retrieve the data in JSON format. We use the flatten = T option to flatten nested data frames. Notice that we did not make an explicit GET() call here: fromJSON() can read directly from a URL, so the download and parsing happen in one step.
# Fetching the data from WDI
wdi.df <- fromJSON("https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2000:2022&per_page=10000&format=json", flatten = T)
# Inspecting the data
wdi.df[[2]] %>% head()
## countryiso3code date value unit obs_status decimal indicator.id
## 1 AFE 2022 720859132 0 SP.POP.TOTL
## 2 AFE 2021 702977106 0 SP.POP.TOTL
## 3 AFE 2020 685112979 0 SP.POP.TOTL
## 4 AFE 2019 667242986 0 SP.POP.TOTL
## 5 AFE 2018 649757148 0 SP.POP.TOTL
## 6 AFE 2017 632746570 0 SP.POP.TOTL
## indicator.value country.id country.value
## 1 Population, total ZH Africa Eastern and Southern
## 2 Population, total ZH Africa Eastern and Southern
## 3 Population, total ZH Africa Eastern and Southern
## 4 Population, total ZH Africa Eastern and Southern
## 5 Population, total ZH Africa Eastern and Southern
## 6 Population, total ZH Africa Eastern and Southern
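As a quick usage example (our own sketch, assuming the extraction above succeeded), the flattened table in wdi.df[[2]] can be filtered to a single country and plotted:

```r
# Filter the WDI table for Singapore and plot total population over time
wdi.df[[2]] %>%
  filter(countryiso3code == "SGP") %>%
  mutate(year = as.numeric(date)) %>%
  ggplot(aes(x = year, y = value / 1e6)) +
  geom_line() +
  labs(title = "Singapore: Total Population, 2000-2022",
       x = "", y = "Millions") +
  theme_minimal()
```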
For more details, see
https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information
We may run Python commands in R by using the reticulate package. Let's scrape data from Wikipedia using the pandas library in Python. To do so, we first load the reticulate package in R and then import the pandas library in Python.
library(reticulate)
After loading the reticulate package, we may use Python commands by tagging the code chunk with python, as shown in the example below:
```python
# Install the dependent library for data scraping, lxml.
# Uncomment the line below to install it.
# pip install lxml
# Import the pandas library under the alias pd.
import pandas as pd
```
We now scrape a table titled "List of national capitals by population" from the Wikipedia page https://en.wikipedia.org/wiki/List_of_national_capitals_by_population. This is achieved using the Python code below.
```python
# Python code here
# Use the pandas library's read_html function to extract data from the wikipedia link.
pop_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_national_capitals_by_population")
```
The Python object pop_data is a list of data frames, which R can access through the object py provided by reticulate. Therefore, to access pop_data in R, we reference it as py$pop_data. The table we want is stored as the second item in py$pop_data. To access it, we use double square brackets, py$pop_data[[2]], and save it as an R data frame called df.pop.
# Saving data from Python into R
df.pop = py$pop_data[[2]]
# View the data and clean up as needed.
head(df.pop)
## Country / dependency Capital Population % of country Source
## 1 China * Beijing 21542000 1.5% [1] 2018
## 2 Japan * Tokyo 14094034 11.3% [2] 2023
## 3 Russia * Moscow 13104177 9.0% [3] 2023
## 4 DR Congo * Kinshasa 12691000 13.2% [4] 2017
## 5 Indonesia * Jakarta 10562088 3.9% [5] 2020
## 6 Peru * Lima 10151000 30.1% [6] 2023
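The comment above suggests cleaning up the data as needed. A minimal sketch of what that might look like, where the column names are taken from the printed output and the cleaning steps are purely illustrative:

```r
# Tidy the scraped table: simpler column names, strip footnote asterisks,
# and coerce the population column to numeric (illustrative cleaning steps)
df.pop.clean <- df.pop %>%
  rename(country    = `Country / dependency`,
         capital    = Capital,
         population = Population) %>%
  mutate(country    = str_trim(str_remove(country, "\\*")),
         population = as.numeric(population))

head(df.pop.clean)
```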