RMarkdown is an RStudio notebook format that seamlessly integrates report writing with statistical or machine learning analysis, figure generation, and more. It enables a report to be completely reproducible in various output formats, such as Word, HTML, and PDF. Documents generated by RMarkdown are also fully customizable through the RMarkdown source code. To learn about RMarkdown's capabilities, please refer to the RMarkdown guide.
We can use RMarkdown to generate various types of outputs:

- HTML: HTML output can include slides and documents containing dynamic features like interactive diagrams (e.g., a leaflet widget).
- MS Word: RMarkdown source code can be knit into an MS Word document.
- PowerPoint: RMarkdown source code can be knit into a PowerPoint presentation.
The benefit of using RMarkdown over a word processor or presentation software is that it integrates data extraction, data wrangling, visualization, statistical, and machine learning computations with report writing. This removes the need for us to use different software for data analysis and report writing.
As an example of how data analysis and report writing can be integrated, the analyses in Sections 3.2 and 3.4 below connect the RMarkdown source code of this document to the data.gov.sg and World Development Indicator data servers through their respective Application Programming Interfaces (APIs). This enables us to download new data from these servers and update our data analysis simply by knitting the RMarkdown source codes of this document again.
For this course, all the slides are knitted into HTML files. Unlike PowerPoint slides, HTML slides can present dynamic visualizations, such as interactive figures, maps, and animated plots. There are primarily three types of HTML slides: ioslides, slidy, and xaringan. They differ mainly in how customizable they are. Our course slides are made with xaringan, as it allows for a higher level of customization than slidy and ioslides permit. To knit this document, we need the following packages. Please install them first before proceeding.
# For data wrangling and plotting
library(tidyverse)
library(ggrepel)
# For generating equations through LaTeX syntax
library(tinytex)
# For extracting data through a GET API call
library(httr) # To use GET() to extract data from the server in JSON format
library(jsonlite) # To use fromJSON() to convert JSON content into a data frame
# Other packages
library(stats) # For implementing logistic regression via glm()
library(reticulate) # To use Python code in R
As a final remark, while RMarkdown is a powerful tool for integrating data analysis and report writing, it is not the only option available. Quarto is another format that serves a similar purpose but offers greater flexibility in the types of documents it can generate. Additionally, Quarto depends less on external packages compared to RMarkdown, which makes it an increasingly popular choice for report writing that incorporates data analysis.
RMarkdown integrates R scripting tools (such as those for data visualization and analysis) with the creation of reports and presentations. R code can be run either inline or in an R code chunk.
To generate inline R code, enclose your code in a pair of backticks (the backtick key sits below the Esc key on most keyboards), with the opening backtick followed immediately by the letter "r" and then your R code, as shown here:
`r codes here`
For instance, to calculate 2 + 2 in the backend, we use the inline code `r 2 + 2`. When knitted, the expression 2 + 2 is rendered as its computed output, 4.
A code chunk is a block of text that is recognized as an R script. We
can do our R scripting work inside a code chunk. To create a code chunk
in the RMarkdown code, start with the backtick sequence
```{r}
and end with three backticks ```
```{r}
summary(mtcars)
```
When knitted, the "wrapper" of the code chunk is hidden in the generated report, but the code summary(mtcars) will be evaluated (unless specified otherwise). For example, the following R code produces the scatter plot shown below.
# Specify Petal.Length and Sepal.Length as the x,y global aesthetic
p <- ggplot(data = iris, mapping = aes(x = Petal.Length, y = Sepal.Length))
# Construct a scatter plot of Petal Length against Sepal Length
p + geom_point(aes(color = Species)) +
labs(title = "Petal vs Sepal Length of Iris (by Species)",
x = "Petal Length", y = "Sepal Length", color = "") +
theme_minimal()
As another example, let’s consider an artificial dataset with 3 observations:
# Set your values here
GDP.Growth.SG <- 3
GDP.Growth.MY <- 5
GDP.Growth.ID <- 4
Based on the dataset, the average GDP growth across these three ASEAN economies is 4 percent. The number 4 is calculated in the backend using R, where the actual inline code, hidden from the report, looks like:
`r (GDP.Growth.SG + GDP.Growth.MY + GDP.Growth.ID)/3`
There are several settings for code chunks, depending on whether we wish to hide the code, execute the code, report messages, etc. Some common options are listed below, followed by a sketch of how they might be combined in a chunk header.

- echo = FALSE if you want to hide the code.
- include = FALSE if you want to hide everything (nothing, including the output, will be shown).
- warning = FALSE if you want to hide warnings.
- message = FALSE if you want to hide messages.
- results = "hide" if you want to hide the R output.
- eval = FALSE if you do not want R to execute the code chunk.
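For instance, a chunk that runs quietly and shows only its figure might be declared as in the sketch below; the chunk name and the code inside it are hypothetical, purely for illustration:

```{r quiet.plot, echo = FALSE, message = FALSE, warning = FALSE}
# This hypothetical chunk hides its code, messages, and warnings,
# but its plot still appears in the knitted report.
hist(mtcars$mpg, main = "Distribution of MPG", xlab = "Miles per gallon")
```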
These settings can be declared globally, applying to all code chunks in the document (unless a specific code chunk specifies a different setting). For example, to hide all messages and warnings, show all the code, and set a default size and alignment for all figures throughout the document, we may apply these settings through the opts_chunk$set function as shown below. In this RMarkdown file, these settings are applied in a setup chunk before the introduction section; the chunk itself is hidden from the generated report.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(fig.height = 6, fig.width = 8.5, fig.align = "center")
```
To override the global settings above for a specific code chunk, we need to specify the desired parameters for that chunk. For instance, in the code chunk below, we show the code but hide the results by using the settings echo = T and results = "hide":
```{r hide.1, echo = T, results = "hide"}
# Run a regression of mpg on disp and cyl
reg.out <- lm(mpg ~ disp + cyl , data = mtcars)
# Summarize the regression results
summary(reg.out)
```
A regression model has been estimated by the code above, but the results are suppressed. Nonetheless, we may still reference results from the regression output inline. For instance, the coefficient on the displacement variable, disp, is -0.021.
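That number can itself be produced with inline R code rather than typed by hand. A minimal sketch, assuming the chunk above has already created reg.out, is to write the following in the source:
`r round(coef(reg.out)["disp"], 3)`
which extracts the disp coefficient from the fitted model and rounds it to three decimal places.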
Finally, for debugging purposes, it is a good idea to name each code chunk. For example, the code chunk below is named glm.1. If there is an error, the error message will point to the name of the offending code chunk. Note that the name of each code chunk must be unique; otherwise, the knitting process will fail.
```{r glm.1, echo = T, results = "hide"}
# Run a logistic regression of Species on the petal and sepal length and width
logistic.out <- glm(Species ~ Petal.Length + Petal.Width +
Sepal.Length + Sepal.Width,
family= binomial(link = "logit"),
data = iris)
# Summarize the regression results
summary(logistic.out)
```
RMarkdown can integrate data extraction, data wrangling, data analysis, and report writing in a programmatic manner that ensures complete reproducibility of the report. We will consider some examples below.
Imagine it is 2019 and we have data on retrenchment up to Q4 2018; let's call this dataset retrenchment_2018.csv. A year later, in 2020, we have data on retrenchment up to Q4 2019, named retrenchment_2019.csv.

Since the name of the data file changes only in the year (i.e. from retrenchment_2018.csv to retrenchment_2019.csv), we may simply declare the year as a variable and paste it together with the first portion of the file name (i.e. retrenchment_) using the paste() function.
# To use the as.yearqtr() function to convert dates specified as year-quarter into year-month-day format.
library(zoo)
# Choose year here - e.g. 2018 or 2019
YEAR = 2018
# Construct the name of the csv file by using paste() with no separator
# The output of paste("retrenchment_", YEAR, ".csv", sep = "") would be retrenchment_2018.csv
df <- read_csv(paste("retrenchment_", YEAR, ".csv", sep = ""))
To update the above analysis using 2019 data, all we need to do is change the year in the above code chunk to YEAR = 2019. The rest of the code will paste the different parts of the string together, i.e. paste("retrenchment_", YEAR, ".csv", sep = ""), into retrenchment_2019.csv, which will be read into the data frame df and plotted below.
### Data Cleaning ###
# Convert the data into a data frame
df <- as_tibble(df)
# Clean up the variables by coercing them into the correct types
df$quarter <- as.Date(as.yearqtr(df$quarter, format = "%Y-Q%q"))
df$industry1 <- as.factor(df$industry1)
df$retrench <- as.numeric(df$retrench)
# Reorder the data, filter the data for manufacturing, construction and services, and calculate the mean retrenchment numbers for these sectors
df <- df[order(df$industry1, df$quarter),]
df <- filter(df, df$industry1 %in% c("manufacturing", "construction", "services") )
df <- df %>% group_by(industry1) %>% mutate(mean_retrench = mean(retrench, na.rm = TRUE))
### Plotting ###
# Specify the data and aesthetics globally
p <- ggplot(df, aes(x = quarter, y = retrench, color = industry1))
# Visualize retrenchments for the three sectors through a line plot
p + geom_line(size = 0.6) +
geom_line(aes(x = quarter, y = mean_retrench, color =industry1), size = 0.6)+
geom_ribbon(aes(xmin = as.Date("2008-10-01"), xmax = as.Date("2010-01-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("1997-07-01"), xmax = as.Date("1998-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2000-04-01"), xmax = as.Date("2001-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2003-01-01"), xmax = as.Date("2003-10-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
geom_ribbon(aes(xmin = as.Date("2020-04-01"), xmax = as.Date("2021-07-01"), y = retrench), fill = "darkred", alpha = 0.1, inherit.aes = F) +
labs( title = "Number of Retrenchments by Industry, Quarterly", subtitle = paste("Q1 1998 to Q4", YEAR, ", Recessions in Red"), y = "", x = "", color = "") +
theme(legend.position = "top") +
theme_minimal()
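The recession bands above are drawn with geom_ribbon, which requires repeating the y aesthetic in every layer. A common alternative, shown here as a sketch rather than the approach used in the original code, is annotate("rect", ...), which shades a date range without referencing the data at all:

```r
# Hypothetical alternative: shade the 2008-2009 recession with annotate()
p + geom_line(size = 0.6) +
  annotate("rect",
           xmin = as.Date("2008-10-01"), xmax = as.Date("2010-01-01"),
           ymin = -Inf, ymax = Inf, fill = "darkred", alpha = 0.1) +
  theme_minimal()
```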
To streamline the updating process, instead of manually specifying
the YEAR
parameter, we can directly access data from the
data provider using their Application Programming
Interface (API). An API serves as a link between different
applications, such as the client (you) and the data provider. By
leveraging an API, we can automate tasks like data extraction, cleaning,
visualization, and even analysis in the background, while concealing the
complexity of code execution when generating reports from RMarkdown.
In R, to fetch data using an API, we typically make an API
GET
call using the GET()
function from the
httr
package. For instance, websites like https://data.gov.sg offer
APIs for data retrieval. The GET
call to download the data
on retrenchment from https://data.gov.sg looks something like the
following:
GET("https://data.gov.sg/api/action/datastore_search?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500")
There are three main components in the design of the above web API (a sketch of assembling them in R follows the list):

- Base URL: https://data.gov.sg/
- Resource path: api/action/datastore_search
- Query: ?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500
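To make the construction explicit, the same call can be assembled from these three pieces. A minimal sketch, where the object names (base_url, resource_path, resource_id) are our own and not part of the original code:

```r
# Assemble the web API call from its three components (illustrative object names)
base_url      <- "https://data.gov.sg/"
resource_path <- "api/action/datastore_search"
resource_id   <- "3d180571-81d3-4834-a759-8374806b731e"

url <- paste0(base_url, resource_path, "?resource_id=", resource_id, "&limit=500")
raw.df <- httr::GET(url)
```

Alternatively, httr::GET() accepts a query argument (a named list of parameters) that builds the query string for us.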
Let's extract the retrenchment data from https://data.gov.sg using the API and update the plot on retrenchments if new data are available. First, we extract the data on retrenchments from data.gov.sg. The saved object, raw.df, contains the response in JSON format.
# Required libraries: httr, jsonlite
raw.df <- httr::GET("https://data.gov.sg/api/action/datastore_search?resource_id=3d180571-81d3-4834-a759-8374806b731e&limit=500")
To extract the data frame from raw.df, we need to parse the JSON content after converting it to text (i.e. a character string). There is more than one way to do so. Here, we use the jsonlite library to parse raw.df into df.out.
# Parse the text content from raw.df
df.out <- fromJSON(rawToChar(raw.df$content), flatten = T)
Finally, extract the dataset from df.out. Note that different providers store the dataset in different locations. For data.gov.sg, the dataset can be found in $result$records:
df <- df.out$result$records
# Note: You may also try df <- df.out[[3]].
# Some trial and error is required
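One way to reduce the trial and error is to inspect the structure of the parsed object first; a quick sketch:

```r
# Inspect the top two levels of the parsed object to locate the records
str(df.out, max.level = 2)
```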
After obtaining the data, let's recycle the earlier code used for plotting trends in retrenchment. To save space, the code is hidden. Notice that the dates in the plot subtitle are updated automatically. This is done by using the paste() function to update the start and end periods in the subtitle, as shown below:
labs(title = "Number of Retrenchments by Industry, Quarterly", subtitle = paste(date.start[2], date.start[1], "to", date.end[2], date.end[1]), y = "", x ="", color = "")
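The objects date.start and date.end are created in the hidden code. A plausible sketch of how they might be built from the cleaned quarter variable, offered as our own reconstruction rather than the hidden chunk itself:

```r
# Illustrative reconstruction: date.start and date.end as (year, quarter) pairs,
# so that paste(date.start[2], date.start[1], ...) yields e.g. "Q1 1998"
first.q <- min(df$quarter, na.rm = TRUE)
last.q  <- max(df$quarter, na.rm = TRUE)

date.start <- c(format(first.q, "%Y"), quarters(first.q))
date.end   <- c(format(last.q, "%Y"), quarters(last.q))
```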
Different databases organize API calls differently. In data.gov.sg,
the resource itself is queried through the resource_id
. To
make the correct API call, we should refer to the API documentation. For
data.gov.sg, you can find the documentation at this
link.
Let’s follow the instructions provided on this webpage, which offers an example of extracting HDB resale housing data from January 2017 onwards.
For this API, the base URL is https://data.gov.sg/api/action/datastore_search. To extract the housing dataset, we append its resource ID, d_8b84c4ee58e3cfc0ece0d773c8ca6abc, by forming the query ?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc. In the case of data.gov.sg, the resource ID is "queried" (i.e. it comes after ?).
The complete web API for this example is
"https://data.gov.sg/api/action/datastore_search?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc"
We may now pass this link to fetch the housing data using a
GET
call.
# Extracting HDB Housing Data. raw.df2 is in the JSON format
raw.df2 <- GET("https://data.gov.sg/api/action/datastore_search?resource_id=d_8b84c4ee58e3cfc0ece0d773c8ca6abc")
# Extracting the content from the raw JSON file
df.out2 <- fromJSON(rawToChar(raw.df2$content), flatten = T)
# Extracting the data frame and saving it
df2 <- df.out2$result$records
# Look at the extracted file to see if the correct data are extracted
glimpse(df2)
## Rows: 100
## Columns: 12
## $ `_id` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
## $ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
## $ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
## $ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
## $ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
## $ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
## $ floor_area_sqm <chr> "44", "67", "67", "68", "67", "68", "68", "67", "6…
## $ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
## $ lease_commence_date <chr> "1979", "1978", "1980", "1980", "1980", "1981", "1…
## $ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
## $ resale_price <chr> "232000", "250000", "262000", "265000", "265000", …
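As the glimpse above shows, numeric fields such as resale_price and floor_area_sqm arrive as character strings, so some coercion is needed before analysis. A minimal sketch of a follow-up step (our own illustration, not part of the original document):

```r
# Coerce price and floor area to numeric, then summarize average resale price by town
df2 %>%
  mutate(resale_price   = as.numeric(resale_price),
         floor_area_sqm = as.numeric(floor_area_sqm)) %>%
  group_by(town) %>%
  summarise(mean_price = mean(resale_price, na.rm = TRUE),
            n_sales    = n()) %>%
  arrange(desc(mean_price)) %>%
  head()
```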
To fetch other data series from data.gov.sg, follow the sequence below:

1. Locate the data's resource ID. To do so, visit https://beta.data.gov.sg/collections and click on the "Dataset" tab.
2. Search for the data by entering keywords in the search bar. For instance, enter "Health" as a keyword.
3. Click the link to the dataset you want. For example, click the link to "Common health problems of students examined - Overweight, Annual". This takes us to the dataset's unique URL, https://beta.data.gov.sg/datasets/d_7c3c14c03c4737ffefed396c477cbb94/view. The resource ID is d_7c3c14c03c4737ffefed396c477cbb94.
4. Use the resource ID to form the web API, "https://data.gov.sg/api/action/datastore_search?resource_id=d_7c3c14c03c4737ffefed396c477cbb94".
5. Extract the data by using the code below:
# Extracting the common health problems data
raw.df3 <- GET("https://data.gov.sg/api/action/datastore_search?resource_id=d_7c3c14c03c4737ffefed396c477cbb94")
# Extracting the content from the raw JSON file
df.out3 <- fromJSON(rawToChar(raw.df3$content), flatten = T)
# Extracting the data frame
df3 <- df.out3$result$records # Saving the data
# Look at the extracted file
glimpse(df3)
## Rows: 48
## Columns: 5
## $ `_id` <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ year <chr> "2009", "2009", "2009", "2009", "2010", "2010", "20…
## $ age_group <chr> "Primary 1 and equivalent age groups", "Primary 1 a…
## $ gender <chr> "Male", "Female", "Male", "Female", "Male", "Female…
## $ per_10000_examined <chr> "1212", "1080", "1787", "1210", "1218", "1059", "17…
For the World Development Indicators, the resource is specified in the resource path. For example, the basic API call to extract total population is:

http://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?format=json

Below, we extract the population data for 2000 to 2022, set the limit to 10000 records per page, and retrieve the data in JSON format. We use the flatten = T option to flatten nested data frames. Notice that we did not make an explicit GET() call here: fromJSON() can read directly from a URL, so the download and parsing happen in one step.
# Fetching the data from WDI
wdi.df <- fromJSON("https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2000:2022&per_page=10000&format=json", flatten = T)
# Inspecting the data
wdi.df[[2]] %>% head()
## countryiso3code date value unit obs_status decimal indicator.id
## 1 AFE 2022 720859132 0 SP.POP.TOTL
## 2 AFE 2021 702977106 0 SP.POP.TOTL
## 3 AFE 2020 685112979 0 SP.POP.TOTL
## 4 AFE 2019 667242986 0 SP.POP.TOTL
## 5 AFE 2018 649757148 0 SP.POP.TOTL
## 6 AFE 2017 632746570 0 SP.POP.TOTL
## indicator.value country.id country.value
## 1 Population, total ZH Africa Eastern and Southern
## 2 Population, total ZH Africa Eastern and Southern
## 3 Population, total ZH Africa Eastern and Southern
## 4 Population, total ZH Africa Eastern and Southern
## 5 Population, total ZH Africa Eastern and Southern
## 6 Population, total ZH Africa Eastern and Southern
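As a quick usage example (our own sketch, assuming the extraction above succeeded), the flattened table in wdi.df[[2]] can be filtered to a single country and plotted:

```r
# Filter the WDI table for Singapore and plot total population over time
wdi.df[[2]] %>%
  filter(countryiso3code == "SGP") %>%
  mutate(year = as.numeric(date)) %>%
  ggplot(aes(x = year, y = value / 1e6)) +
  geom_line() +
  labs(title = "Singapore: Total Population, 2000-2022",
       x = "", y = "Millions") +
  theme_minimal()
```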
For more details, see
https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information
We may run Python commands in R by using the reticulate package. Let's scrape data from Wikipedia using the pandas library in Python. To do so, we first load the reticulate package in R and then import the pandas library in Python.
library(reticulate)
After loading the reticulate package, we may use Python commands by tagging the code chunk with python, as shown in the example below:
```python
# Install the dependent library for data scraping, lxml.
# Uncomment the line below to install it.
# pip install lxml
# Import the pandas library under the alias pd.
import pandas as pd
```
We now scrape a table titled "List of national capitals by population" from the Wikipedia page https://en.wikipedia.org/wiki/List_of_national_capitals_by_population. This is achieved using the Python code below.
```python
# Python code here
# Use the pandas library's read_html function to extract data from the wikipedia link.
pop_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_national_capitals_by_population")
```
The Python object pop_data is a list of data frames, which R can access through the object py provided by reticulate. Therefore, to access pop_data in R, we reference it as py$pop_data. The table we want is stored as the second item in py$pop_data. To access it, we use double square brackets, py$pop_data[[2]], and save it as an R data frame called df.pop.
# Saving data from Python into R
df.pop = py$pop_data[[2]]
# View the data and clean up as needed.
head(df.pop)
## Country / dependency Capital Population % of country Source
## 1 China * Beijing 21542000 1.5% [1] 2018
## 2 Japan * Tokyo 14094034 11.3% [2] 2023
## 3 Russia * Moscow 13104177 9.0% [3] 2023
## 4 DR Congo * Kinshasa 12691000 13.2% [4] 2017
## 5 Indonesia * Jakarta 10562088 3.9% [5] 2020
## 6 Peru * Lima 10151000 30.1% [6] 2023
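The comment above suggests cleaning up the data as needed. A minimal sketch of what that might look like, where the column names are taken from the printed output and the cleaning steps are purely illustrative:

```r
# Tidy the scraped table: simpler column names, strip footnote asterisks,
# and coerce the population column to numeric (illustrative cleaning steps)
df.pop.clean <- df.pop %>%
  rename(country    = `Country / dependency`,
         capital    = Capital,
         population = Population) %>%
  mutate(country    = str_trim(str_remove(country, "\\*")),
         population = as.numeric(population))

head(df.pop.clean)
```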