class: center, middle, inverse, title-slide

.title[
# Principles of Data Storytelling and the Grammar of Graphics
]
.author[
### Nicholas Sim
]
.date[
### 09 January 2024
]

---
class: center, middle, inverse

# Data Visualisation

---

### What is Data Visualisation?

"Data visualization is the **presentation of data** in a **pictorial** or **graphical** format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns." (sas.com)

---

### A Good Example

Charles Joseph Minard’s Napoleon’s March to Moscow.

<img src="seminar2_minard.png" width="90%" style="display: block; margin: auto;" />

---

### A Not-So-Good Example

There is too much unnecessary information.

<img src="seminar2_junk.png" width="60%" style="display: block; margin: auto;" />

---

### Why Is It Wrong?

* Viewers have limited cognitive capacity
* Unnecessary information clutters key visual information
  + Shadows
  + 3D
  + Chartjunk

---
class: center, middle, inverse

# Visualisation Attributes

---

### Data-Ink Ratio

Edward Tufte (1983): Maximize the data-ink ratio

**Definition 1** (DATA-INK) The **non-erasable** core of a graphic.

**Definition 2** (DATA-INK RATIO)

`$$\begin{aligned} \text{Data-Ink Ratio} = \frac{\text{Data ink}}{\text{Total ink used to print the graphic}} \end{aligned}$$`

The proportion of a graphic’s ink devoted to non-redundant information.

Maximizing the data-ink ratio means trimming the visuals down to the necessary minimum and avoiding redundant features and information.

---

### How Do We Draw Attention To Specific Details?

Gestalt Principles of Visual Perception (see Chapter 3 of Nussbaumer-Knaflic (2015))

To understand which features we may use in a visualization, it is important to first understand how a person perceives these features.

**Gestalt principles** describe how individuals perceive order and relationships.

Six Principles

+ Proximity
+ Similarity
+ Enclosure
+ Closure
+ Continuity
+ Connection

These principles describe how our brains see things and organize elements into groups.

---

### Proximity

Objects that are physically close together are perceived as related or belonging to part of a group.

<img src="seminar2_proximity.png" width="85%" style="display: block; margin: auto;" />

---

### Similarity

Objects that share similar characteristics such as color, shape, size, or orientation are perceived as related or belonging to part of a group.

<img src="seminar2_similarity.png" width="85%" style="display: block; margin: auto;" />

---

### Enclosure

Objects that are physically enclosed together are perceived as related or belonging to part of a group.

<img src="seminar2_enclosure.png" width="85%" style="display: block; margin: auto;" />

---

### Closure

We tend to perceive a set of individual elements as a single, recognizable shape whenever we can. When parts of a whole are missing, our eyes fill in the gap.

<img src="seminar2_closure.png" width="85%" style="display: block; margin: auto;" />

---

### Continuity

When looking at objects, our eyes seek the smoothest path and naturally create continuity in what we see, even where it may not explicitly exist.

<img src="seminar2_continuity.png" width="40%" style="display: block; margin: auto;" />

---

### Connection

We tend to think of objects that are physically connected as part of a group.

<img src="seminar2_connection.png" width="85%" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# The Grammar of Graphics

---

### The Grammar of Graphics

* Leland Wilkinson (1999, 2005)
  + The Grammar of Graphics (GG) provides rules and structure for data visualization

There are seven layers in the Grammar of Graphics:

.pull-left[
* **Mandatory**
  + Data
  + Aesthetics
  + Geometry
]

.pull-right[
* **Optional**
  + Statistic
  + Coordinates (Projection)
  + Facet
  + Theme
]

---
class: center, middle, inverse

# Mandatory Layers

---

### Data

**Data**: Records of events and their attributes.

The records are organized as rows and the attributes (i.e. variables) as columns.

The variables belong to two general types: qualitative and numerical.

---

### Numerical Data

**Numerical data**: Data that are represented by numbers and are ordered.

The types of numerical data are

* Floating point values - numerical values with decimal places
* Integers - whole numbers

Examples of numerical data are GDP per capita across ASEAN countries, literacy achievement scores across the OECD, population count of a country from 1980 to 2024, etc.

---

### Qualitative Data

**Qualitative** data: Data that represent certain qualities or attributes. They could be represented by numbers or strings. They could be ordered or unordered.

A qualitative variable that represents multiple *unordered* categories is called a **categorical** or **nominal** variable. Examples are location (North, South, East, West, etc.) and ethnicity (Bosnians, Croats, Montenegrins, etc.).

A qualitative variable that represents multiple *ordered* categories is called an **ordinal** variable. Examples are years of education and income ranges such as 0-1000, 1000-2000, etc.

A qualitative variable that represents two categories, ordered or unordered, is called a **dummy** or **indicator** variable. It may be numerically represented by 1 or 0, where 1 indicates the presence of a quality.

Examples include having health insurance or not, having graduated from a university or not, playing tennis or not, etc.

---

### Cross-Sectional, Time Series and Panel Data

Tabular data come in three types:

* **Cross-sectional data**: many units observed at a single point in time
* **Time series data**: a single unit observed over multiple time periods
* **Panel data**: many units observed over multiple time periods

---

### Data Frame

In R, data should be organized into a *data frame* where

+ **Rows**: Observations (Records)
+ **Columns**: Variables

Each row in a data frame represents an observation or record.

Each column represents a variable, field, feature, or attribute.

---

### Data Frame

A dataset organized in this manner is said to be **tabular** or in the *long form*.

Our source data (e.g. data extracted from an API) may not be organized in this way.

Therefore, before conducting our analysis, we must first reshape it into the long form.

An example of long-form data is below.
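
As a minimal sketch (not necessarily the table from the original slide), R's built-in `iris` data frame is already organized in this way: each row is one flower (an observation) and each column is one variable. The choice of `iris` and the names `n_obs` and `n_vars` are assumptions made for illustration.

```r
# Built-in example data: one row per flower, one column per variable
data(iris)

# First few observations; each row is one record
head(iris)

# One column per variable: four numerical measurements and one
# qualitative variable (Species, stored as a factor)
str(iris)

# Number of observations (rows) and number of variables (columns)
n_obs <- nrow(iris)
n_vars <- ncol(iris)
```

Here `Species` is a qualitative (nominal) variable, while the four measurement columns are numerical.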
---

### Plot Types

Certain plots suit certain data types. The following is a non-exhaustive list:

* Qualitative
  - Univariate plots: bar plots, pie charts, treemaps
  - Multivariate plots: stacked bar plots, dodged bar plots
* Numerical
  - Univariate plots: histograms, density plots, box plots, line plots (for time series)
  - Multivariate plots: scatter plots, density/contour plots, waterfall charts
* Qualitative and Numerical
  - box plots (qualitative variable on one axis, numerical variable on the other), dot plots, jitter plots, lollipop plots
* Spatial
  - Choropleth maps

See https://r-graph-gallery.com/

---

### Aesthetics

**Aesthetics**: *Mapping* of the data to the output (i.e. figure).

Example: To construct a scatter plot, our data could be mapped to the following features of a data visualisation:

- **x**- and **y**-variables (mandatory)
- **color** of the data points
- **size** of the data points
- **shape** of the data points (e.g. triangle, square, circle, etc.)
- **label** of the data points (e.g. name of the country associated with each data point)

---

### Geometry

**Geometry**: How the selected data are to be presented (i.e. the type of figure).

- **point** (scatter) plot
- **line** plot
- **bar** graph
- **boxplot**
- **polygon** (maps)

---
class: center, middle, inverse

# Optional Layers

---

### Statistic

**Statistic**: Provides context for the graph (e.g. mean, regression line, etc.)

.pull-left[
The average miles per gallon is shown here.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

.pull-right[
A trend (i.e. regression) line is added here.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---

### Coordinates

**Coordinates**: How the data are projected onto the graph.

The most common coordinate system for data visualisation is the x-y (Cartesian) coordinate system.

Other coordinate systems include polar coordinates, which are used for plotting pie charts, and map projections such as the Albers projection, which are used for map construction.

---

### Facet

**Facet**: Displaying multiple plots (or different facets of the data) within a single graph.

The facet is typically represented by a qualitative variable. Faceting creates one plot for each category of that variable, using only the subset of the data that belongs to that category.

For example, the scatter plot below plots the relationship between petal length and sepal length for each of the three iris species.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

### Theme

**Theme**: General appearance/look/feel of the graph.

The theme controls elements of the visualisation such as the text size of the plot title, axis titles and axis text, the margins around the titles, and the overall style of the visualisation.

For example, here are two different themes of the same plot.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# Examples

---

### A Scatter Plot

.pull-left[
A nice plot...

<img src="seminar2_scatter1.png" width="85%" style="display: block; margin: auto;" />
]

.pull-right[
A more attractive plot...

<img src="seminar2_scatter2.png" width="85%" style="display: block; margin: auto;" />
]

Which features in this figure correspond to: Aesthetics? Geometry? Statistic? Coordinates?

Is the color distinction necessary?

---

### US Presidents, Then and Now

Let's analyse the history of US Presidential elections. Here are the first few rows of the dataset.
---

### US Presidents, Then and Now

What are the aesthetics (i.e. which visualisation elements, such as the x-variable, y-variable, color and label, are mapped from the data frame)?

What is the geometry, i.e. what sort of graph is this?

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/elections_historic-1.png" style="display: block; margin: auto;" />

---

### Retrenchments in Singapore

Here are some data on retrenchment.
---

### Retrenchments in Singapore

Since the retrenchment data are a time series, a line plot is an appropriate visualisation.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/retrenchment1-1.png" style="display: block; margin: auto;" />

---

### Retrenchments in Singapore

Here is an example of a facet plot, where the faceting is based on industry.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/retrenchment2-1.png" style="display: block; margin: auto;" />

---

### Example: US Elections, 2016

This is an example of a **choropleth** map, where the x- and y-aesthetics are the longitude and latitude, and the "Albers" coordinate projection is used (which gives the map a conical appearance). The winning party determines the fill color of each state.

<img src="PrinciplesOfDataStorytellingAndTheGrammarOfGraphics_files/figure-html/US_presidents-1.png" style="display: block; margin: auto;" />

---

### Recap

The Grammar of Graphics has three necessary layers:

+ data, aesthetics, geometry.

The rest are optional layers that we exploit for data storytelling:

+ statistic, coordinates, facet, theme

---

### Exercise

TRUE or FALSE?

* Geometry refers to the type of plot by which the data are presented.
* An aesthetic is the mapping of data to a feature in the geometry.
* A data frame that presents variables in rows and observations in columns is said to be tidy or in the long form.

---

### Exercise

TRUE or FALSE?

* The data contained in the table below are organized in the long form.
---

### Remarks On Data Shape

.pull-left[
In the previous table, the columns are years and each row holds the attribute "GDP Per capita in ASEAN" for one country.

This format is the wide form, not the long form, and is therefore unsuitable for visualisation.

To reshape the data, we may use the functions `pivot_longer()` and `pivot_wider()` from `tidyr`.

Here, we will use `pivot_longer()` to gather all the values of "GDP Per capita in ASEAN" into a single column.

The columns shown here represent attributes such as `Country`, `year`, and `GDP per Capita`.

To see how the data are reshaped into the long form shown here, refer to the RMarkdown source file for this slide deck.
]

.pull-right[
]
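
---

### Remarks On Data Shape

The reshaping step can be sketched as follows. This is a minimal illustration, not the code from the RMarkdown source: it assumes that `world_bank_pop`, a wide-form dataset shipped with the `tidyr` package (one column per year), can stand in for the GDP table, and the names `wide`, `long`, `asean`, and `p` are hypothetical. The three country codes are assumed to be present in that dataset.

```r
library(tidyr)
library(ggplot2)

# Assumption: tidyr's built-in `world_bank_pop` stands in for the GDP table.
# It is in the wide form: identifier columns `country` and `indicator`,
# followed by one column per year.
wide <- world_bank_pop

# Gather every year column into two columns: `year` and `value`
long <- pivot_longer(
  wide,
  cols = -c(country, indicator),  # all columns except the identifiers
  names_to = "year",
  values_to = "value"
)

# The former column names are character strings; convert them to integers
long$year <- as.integer(long$year)

# Keep total population for three ASEAN countries (ISO codes assumed present)
asean <- long[long$indicator == "SP.POP.TOTL" &
                long$country %in% c("IDN", "MYS", "SGP"), ]

# Data + aesthetics + geometry: the long form maps directly onto the layers
p <- ggplot(asean, aes(x = year, y = value, color = country)) +
  geom_line()
print(p)
```

Once the years sit in a single column, `year` can be mapped to the x-aesthetic and `value` to the y-aesthetic, which is not possible while each year occupies its own column.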