class: center, middle, inverse, title-slide

.title[
# Time Series Plots
]
.author[
### Nicholas Sim
]
.date[
### 03 September 2024
]

---
class: center, middle, inverse

# Introduction

---

### Topics

* `geom_line()` for plotting line graphs
* `geom_dl()` for labeling line graphs
* `group` aesthetic
* `label` aesthetic (more on this later)
* `geom_ribbon()` to shade areas
* `geom_rect()` to draw rectangles (more on this later)

---

### Required Libraries

```r
library(tidyverse)
library(socviz)
library(gapminder)
library(directlabels) # for labelling line plots

theme_set(theme_minimal()) # We will use theme_minimal throughout
```

---

### Visualizing a Time Series

Ref: Chapter 4 KH

A **time series** represents data collected on a particular observational unit over time. An example is the yearly GDP of Singapore from 1990 to 2023.

There are various ways to visualize a time series. The most common plot type is the **line plot**, which depicts the data points connected by lines.

Other plot types include the **area plot**, where the area beneath the line is filled, and the **path plot**, which traces out the path between the individual data points.

---

### Visualizing a Time Series

The `ggplot2` commands vary slightly across plot types. For instance, line plots, bar charts, histograms, and density plots require different mandatory aesthetics.

Each `geom` function may also have its own idiosyncratic features. Understanding these features helps us avoid plotting figures incorrectly.

Time series data often come as longitudinal data with a group structure, such as stock prices for Apple, Google, etc. To plot time series from a longitudinal dataset correctly, we must use a `group` aesthetic.

The `group` aesthetic specifies the group structure in the data. For this to work, the groups must be identified by a group variable in the data frame.
A `group` aesthetic is essential for plotting data with a group structure, such as panel/longitudinal data or map data (Seminar 5).

---

### Line Chart

Time series data represent changes in a variable over time (e.g., GDP per capita from 1999 to 2023 for Singapore). A common way to visualize time series data is a line chart.

The `geom` function used for plotting a line chart is `geom_line()`.

In a line chart, the x-axis typically represents time, such as the date or year of observation, while the y-axis depicts the value of the variable over time.

The required aesthetics for plotting a line chart are `x = date` and `y = series`, where the `date` variable could represent year, quarter, month, day, etc.

---
class: center, middle, inverse

# Working with Dates

---

### Declaring a Variable as a Date Class

Before plotting a line chart, make sure the date variable is properly formatted so that `ggplot2` recognizes it correctly. This matters most when the date variable is initially stored as a string.

If the `date` variable represents years, it is advisable to keep it as a numeric variable, such as an integer. You can then plot the time series variable on the y-axis against the years on the x-axis.

However, if the `date` variable represents quarters, months, etc., it needs to be declared as a `Date` class. This can be done with the `as.Date()` function.

Note that the dates displayed on the graph are merely labels. Each date is actually stored as a number, the count of days since 1 January 1970, known as a **Unix epoch date**. This underlying structure is crucial for representing time series data accurately.

---

### Declaring a Variable as a Date Class

To find out more about how R handles dates and timestamps, see https://stats.idre.ucla.edu/r/faq/how-does-r-handle-date-values/

.panelset[
.panel[.panel-name[Example 1]

.pull-left[
Suppose our date variable is formatted as "1990-07-23".
Notice that a dash "-" is used to separate the year, month, and day.

To declare our date variable in `as.Date()`, we must identify `1990` as the year, `07` as the month in the numeric format, `23` as the day, and the `-` symbol as the separator used in this date format.

To identify "1990-07-23" as having the format "year-month-day", we use `"%Y-%m-%d"`.
]

.pull-right[
```r
ex.1 <- as.Date("1990-07-23", format = '%Y-%m-%d')

# Check if it is a date class
class(ex.1)
```

```
## [1] "Date"
```
]
]

.panel[.panel-name[Example 2]

.pull-left[
Suppose our date variable is formatted as "1990/07/23", where the separator is "/".

To declare our date variable in `as.Date()`, we must identify the "/" symbol as the separator.
]

.pull-right[
```r
ex.2 <- as.Date("1990/07/23", format = '%Y/%m/%d')

# Check if it is a date class
class(ex.2)
```

```
## [1] "Date"
```
]
]

.panel[.panel-name[Example 3]

.pull-left[
Suppose our date variable is formatted as "23 Jul 1993", where the separator is a single space " ".

To declare our date variable in `as.Date()`, we must identify "Jul" as month using `%b` instead of `%m`.
]

.pull-right[
```r
ex.3 <- as.Date("23 Jul 1993", format = '%d %b %Y')

# Check if it is a date class
class(ex.3)
```

```
## [1] "Date"
```
]
]
]

---

### Notations for the Date Class
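
As a rough reference, the sketch below lists the format codes used in the examples above, plus a few other common ones from `?strptime`. The dates and object names (`d1`, `d2`) are made up for illustration.

```r
# Common format codes for as.Date() (see ?strptime for the full list)
#   %Y  four-digit year (1990)      %y  two-digit year (90)
#   %m  month as a number (07)      %d  day of the month (23)
#   %b  abbreviated month name (Jul; depends on the locale)
#   %B  full month name (July; depends on the locale)

d1 <- as.Date("23/07/90", format = "%d/%m/%y")  # two-digit year
d2 <- as.Date("19900723", format = "%Y%m%d")    # no separator at all

# Both strings describe the same day
d1 == d2

# format() goes the other way: from a Date back to a string
format(d1, "%d/%m/%Y")
```

The same codes work in both directions, so a format string that parses a date can also be used to print it.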
---

### Unix Epoch Date

Here is an example showing how a `Date` object is stored as a number: the count of days since 1 January 1970, the origin of Unix time.

```r
date.1 <- as.Date("1970-07-23", format = '%Y-%m-%d')
class(date.1)
```

```
## [1] "Date"
```

```r
# Transform date.1 into a number
date.2 <- as.numeric(date.1)
print(date.2) ### Number of days since 1970-01-01
```

```
## [1] 203
```

```r
### Transform date.2 back into a date.
date.3 <- as.Date(date.2, origin = "1970-01-01") # "1970-01-01" is the default origin

# Compare date.1 and date.3
c(date.1, date.3) # They are the same
```

```
## [1] "1970-07-23" "1970-07-23"
```

---
class: center, middle, inverse

# Example

---

### PM10 Readings in Singapore

If your dates are not represented by years alone (e.g. 01 Jul 2024 or Q1 2024), you must declare your `date` variable as a date class first.

Let's import daily pm10 readings from `pm10.csv`.

```r
pm10 <- read_csv("pm10.csv") # read_csv() is from the readr package (loaded with tidyverse)

### There are data for 5 regions (west, east, central, north, south), plus a national series.
### psi_date is the date variable. It is already declared as a date in R.
head(pm10, 6)
```

```
## # A tibble: 6 × 4
##   psi_date   psi_measures            region   index
##   <date>     <chr>                   <chr>    <dbl>
## 1 2020-04-01 pm10_twenty_four_hourly west        27
## 2 2020-04-01 pm10_twenty_four_hourly national    28
## 3 2020-04-01 pm10_twenty_four_hourly east        28
## 4 2020-04-01 pm10_twenty_four_hourly central     25
## 5 2020-04-01 pm10_twenty_four_hourly south       21
## 6 2020-04-01 pm10_twenty_four_hourly north       26
```

---

### PM10 Readings in Singapore

The `pm10` dataset contains readings for several regions. If we simply add `geom_line()` to `ggplot()`, the plot will be incorrect.

.pull-left[
```r
### Load the data globally via ggplot
p <- ggplot(data = pm10,
            mapping = aes(x = psi_date, y = index))

### Add a geom_line layer.
p + geom_line() +
  labs(title = "An Incorrect Line Plot When Data Have a Group Structure",
       subtitle = 'The dataset has PM10 readings for several regions on the same day\nand the line incorrectly connects the readings across regions') +
  theme(title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```
]

.pull-right[  ]

---

### Caution: Plotting Longitudinal Data

The `pm10` dataset is an example of longitudinal data, also known as panel data, which carries information across both cross-sections and time.

Besides the date, `pm10` records the region: central, north, south, west, east, and national. These regions serve as the "cross-sectional" units, so the data have a natural group structure in which central is one group, north is another, and so forth.

Unless we explicitly declare the cross-sectional units as groups, R will not identify any group structure within the dataset. As a result, `geom_line()` connects the `index` values of one region to those of another.

---

### Caution: Plotting with Longitudinal Data

This issue can be resolved by declaring the cross-sectional units as different groups using the `group` aesthetic, as shown in the figure below where a time trend is plotted for each region.
.pull-left[
```r
p + geom_line(mapping = aes(group = region)) +
  labs(title = "A Line Plot with Region as a Group Aesthetic",
       subtitle = 'The plot shows a time series trend for each region as the group structure based on region is now recognized by using a group aesthetic') +
  theme(title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```
]

.pull-right[  ]

---
class: center, middle, inverse

# Adding Features for Data Storytelling

---

### Plotting Separate Time Series for Different Groups

To clean up the plot, we may use `region` as a color aesthetic to differentiate the line plot for each region.

Instead of using a legend, we may label the lines using `geom_dl()` from the `directlabels` package.

Let's plot the lines for `north` and `east` and label these lines by their regions.

.panelset[
.panel[.panel-name[R Code]

```r
p2 <- ggplot(data = subset(pm10, region %in% c("north", "east")),
             mapping = aes(x = psi_date, y = index, group = region))

p2 +
  # color the line plots (linewidth needs ggplot2 >= 3.4.0; use size in older versions)
  geom_line(mapping = aes(color = region), linewidth = 0.9) +
  # label the line plots. Try method = "smart.grid"
  geom_dl(aes(label = region, color = region),
          method = list("last.points", cex = 1.8, hjust = 0.8)) +
  guides(color = "none") + # turn color legend off
  labs(x = NULL, y = NULL, title = "PSI Reading, April 2020") +
  theme(title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 14))
```
]

.panel[.panel-name[Plot]
<img src="TimeSeriesPlots_files/figure-html/line.3-out-1.png" style="display: block; margin: auto;" />
]
]

---

### Faceting

Another way to clean up the figure is to facet the line plots by region. This makes the color aesthetic redundant.

Let's add titles to the plot.
.pull-left[
```r
p + geom_line(mapping = aes(group = region)) +
  labs(y = "PM 10 Readings at 5pm", x = "",
       title = "PM 10 Readings For April 2020") +
  facet_wrap(~region)
```
]

.pull-right[  ]

---

### Ribbons

Ribbons can be used to shade areas of a time series plot that correspond to specific events.

For instance, on April 21, an extension of the circuit breaker was announced, mandating all non-essential staff to stay home from April 22 until June 2.

Let's shade the graph for the period starting from April 22 to see whether the pollution level changed following the circuit breaker extension. To do this, we use the `geom_ribbon()` function.

---

### Ribbons

To use `geom_ribbon()`, we specify the min and max values for `x` and plot `y`, or the min and max values for `y` and plot `x`. Here, we specify the min and max values for `x` and plot `y`, where `y` is the pollution index.

.pull-left[
```r
p + geom_line(mapping = aes(group = region)) +
  labs(y = "PM 10 Readings at 5pm", x = "",
       title = "PM 10 Readings For April 2020") +
  facet_wrap(~region) +
  geom_ribbon(aes(xmin = as.Date("2020-04-22"),
                  xmax = as.Date("2020-04-30"),
                  y = index),
              fill = "gray80", alpha = 0.5, inherit.aes = FALSE) +
  theme(plot.title = element_text(size = 16))
```
]

.pull-right[  ]

---

### Annotate with Rectangles

We may also use the `geom_rect()` function to overlay an area of the same size. You may not get the desired opacity, though, because `geom_rect()` draws one rectangle per row of the data and the semi-transparent rectangles stack on top of each other. The `annotate()` function draws the rectangle once (see the second code chunk below).

We will explore the `annotate()` function in Seminar 5.
.pull-left[
```r
p + geom_line(mapping = aes(group = region)) +
  labs(y = "PM 10 Readings at 5pm", x = "",
       title = "PM 10 Readings For April 2020") +
  facet_wrap(~region) +
  geom_rect(xmin = as.Date("2020-04-22"),
            xmax = as.Date("2020-04-30"),
            ymin = 15, ymax = 38,
            alpha = .01, color = "gray100") +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```

```r
# Using the annotate function
annotate(geom = "rect",
         xmin = as.Date("2020-04-22"),
         xmax = as.Date("2020-04-30"),
         ymin = 15, ymax = 38, alpha = .2)
```
]

.pull-right[  ]

---
class: center, middle, inverse

# Other Plots for Time Series

---

### Area Plot

An area plot is useful for comparing two time series. For example, using the `gapminder` dataset, the plot below shows that Australia overtook New Zealand in life expectancy somewhere in the 1970s:

.pull-left[
```r
# Construct an area plot
filter(gapminder, continent == "Oceania") %>%
  ggplot(mapping = aes(x = year, y = lifeExp, fill = country)) +
  # position = "identity" takes the y values as height of the area
  geom_area(position = "identity", alpha = .4) +
  scale_y_continuous(breaks = seq(65, 85, 5)) +
  coord_cartesian(ylim = c(65, 85)) +
  labs(title = "Life Expectancy: Australia vs New Zealand",
       fill = "",
       y = "Life Expectancy at Birth",
       x = "",
       caption = "Source: Gapminder") +
  theme_minimal() +
  theme(title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.position = 'bottom')
```
]

.pull-right[  ]

---
class: center, middle, inverse

# Further Examples

---

### Visualizing GDP Per Capita Across The World

With longitudinal data, we must declare a group aesthetic when plotting line graphs.

Here, we will use the `gapminder` dataset, which is longitudinal.

```r
library(gapminder)
```
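
To see what the group aesthetic changes, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical and not part of `gapminder`. It uses `layer_data()` to inspect how many groups `ggplot2` actually draws with and without `group`.

```r
library(ggplot2)

# Toy longitudinal data (hypothetical values): 2 units observed over 3 years
toy <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 2),
  unit  = rep(c("A", "B"), each = 3),
  value = c(1, 2, 3, 10, 11, 12)
)

p_wrong <- ggplot(toy, aes(x = year, y = value)) + geom_line()
p_right <- ggplot(toy, aes(x = year, y = value)) + geom_line(aes(group = unit))

# layer_data() returns the data that ggplot2 draws, including the group it assigned
n_groups_wrong <- length(unique(layer_data(p_wrong)$group)) # a single line through all six points
n_groups_right <- length(unique(layer_data(p_right)$group)) # one line per unit
c(wrong = n_groups_wrong, right = n_groups_right)
```

Without `group`, all six observations fall into one group, so a single line zigzags between the two units. With `group = unit`, each unit gets its own line.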
---

### Visualizing GDP Per Capita Across The World

If we plot GDP per capita across the world without declaring country as a group, `ggplot2` will connect the observations of all countries within each year, instead of tracing each country across years.

The graph shown below is incorrect as the group structure has not been declared.

.pull-left[
```r
p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = gdpPercap))

p + geom_line() +
  labs(title = "Another Incorrect Line Plot when Data Have a Group Structure",
       subtitle = 'A group aesthetic is not declared when countries are the natural groups in this dataset') +
  theme(title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```
]

.pull-right[  ]

---

### Visualizing GDP Per Capita Across The World

What R should have done is to plot a line for each country across years. For that, we need to tell R to group the data at the country level.

We pass `group = country` as an aesthetic so as to "map" the values in `country` into each line plot.

.pull-left[
```r
p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = gdpPercap))

p + geom_line(aes(group = country))
```
]

.pull-right[  ]

---

### Cleaning Up

To simplify the figure, we may break it up into groups. A convenient way of doing so is to facet the plot with a suitable categorical variable.

Let's facet our plot by `continent`, a variable in the `gapminder` dataset.

.pull-left[
```r
p + geom_line(aes(group = country)) +
  facet_wrap(~continent)
```
]

.pull-right[  ]

---

### Cleaning Up

Let's add a trendline to each plot using `geom_smooth()`.

.pull-left[
```r
### x and y-axis are self-evident. So, let us suppress their titles.
p + geom_line(aes(group = country)) +
  facet_wrap(~continent) +
  geom_smooth() +
  labs(x = "", y = "",
       title = "GDP Per Capita Across the World, 1952-2007") +
  theme(title = element_text(size = 16),
        plot.subtitle = element_text(size = 14))
```
]

.pull-right[  ]

---

### A Challenge (try it on your own)

.panelset[
.panel[.panel-name[Problem]

To overlay the national pollution time series on each subplot displaying the pollution level for each region, follow these steps:

1. Extract the national pollution time series.
2. Combine it with the original data frame, ensuring that the original data are sorted by region. The national pollution series will repeat itself for each region due to the recycling property of R.
3. Plot the regional pollution series separately from the national pollution series.

You can verify your solution using the code provided in the RMarkdown script.
]

.panel[.panel-name[Plot]
<img src="TimeSeriesPlots_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
]
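
If you get stuck, the three steps can be sketched on a toy stand-in for `pm10`. This is only one possible approach and not the solution from the RMarkdown script. The data frame `toy_pm10`, its values, and the column name `national_index` are hypothetical. The sketch also swaps in `merge()` on the date in place of the recycling described in step 2, which gives the same repetition of the national series for each region.

```r
library(ggplot2)

# Toy stand-in for pm10 (hypothetical values): two regions plus a national series
toy_pm10 <- data.frame(
  psi_date = rep(as.Date("2020-04-01") + 0:2, times = 3),
  region   = rep(c("east", "west", "national"), each = 3),
  index    = c(28, 30, 27, 27, 26, 29, 28, 28, 28)
)

# Step 1: extract the national series
national <- subset(toy_pm10, region == "national", select = c(psi_date, index))
names(national)[2] <- "national_index"

# Step 2: attach it to the regional rows by date, so it repeats for every region
regional <- subset(toy_pm10, region != "national")
combined <- merge(regional, national, by = "psi_date")

# Step 3: one layer for the regional series, one for the national series
ggplot(combined, aes(x = psi_date)) +
  geom_line(aes(y = index, group = region)) +
  geom_line(aes(y = national_index, group = region), linetype = "dashed") +
  facet_wrap(~region)
```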