Data Management in Base R

class: center, middle, inverse, title-slide

.title[
# Data Management in Base R
]
.author[
### Nicholas Sim
]
.date[
### 09 January 2024
]

---

# Topics

1. Useful functions for working with data: 
    * `str()` (to check data structure), 
    * `head()` (to see the first few rows), 
    * `summary()` (to construct summary statistics), 
    * `table()` (to tabulate) 
    * `data.frame()` (to construct data frames from matrix/vectors)
    * `names()` (to assign names to vector elements)
    * `colnames()` (to assign names to columns)
    * `rownames()` (to assign names to rows)

2. Variable referencing symbol `$`  e.g.  `weather.df$days`

3. Subsetting by index selection, e.g. `weather.df[weather.df$temp > 30,]`

4. Referencing element(s) in the matrix e.g. 
    + `weather.df[2,3]` 
    + `weather.df[2,c('days','rain')]` 
    + `weather.df['temp']`

---
# Topics

5. Subsetting function `subset()` e.g. 
    + `subset(weather.df, rain==TRUE & temp<30)`

6. Membership operator `%in%`  e.g. 
    + `iris$Species %in% c("versicolor", "virginica")`

7. Order function `order()` e.g. 
    + `order(weather.df['temp'])`
  
8. Merge function `merge()`

9. Removing missing data `omit.na()`

10. Working with strings

---
class: center, middle, inverse
# The Data Frame

---
### The Data Frame

Before implementing data analysis, we need to first organize our data into a **data frame**.

Here, we will go over some data management tools in base R. Later,  we will learn how to manage data using the `dplyr` package from the `tidyverse` library.

A data frame is simply data organized as rows and columns, where the rows represent observations and the columns represent variables.

A data frame looks like a matrix, except that a matrix is merely a collection of elements of the same type. Thus, unlike a matrix, a data frame may contain variables (i.e. columns) of different types (e.g. numerical values, strings, factors).

Subsetting rules for matrices apply to data frames as well. However, there are convenient functions for subsetting data frames, such as the `subset()` function, that enable us to construct subsets based on more general conditions.

---
### Useful Functions for Exploring a Data Frame

The following are useful functions for describing data frames:

* `head()` - print out the first 6 rows of the data frame

* `str()` - print out the structure of a data frame, which tells us what the variables are such as the variables' names and types

* `summary()`  - get a quick statistical summary of all the variables (i.e. columns) of a data frame. 
* `names()`  - extract the names of the columns (see, also, `colnames()`)

* `table()` - tabulating the values in a factor variable

As an example, let's consider the dataset `mtcars`, which comes with base R (type `data()` to see the list of built-in datasets).

---
### The `head()` function

To explore first few rows of `mtcars`, we use the `head()` function, which returns the first 6 rows of `mtcars`:

```r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---
### The `str()` Function

To see the type of the variables we have in our dataset, we use the structure function `str()`.

```r
str(mtcars)
```

```
## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
```
The output shows that all the variables are numeric.

---
### The `summary()` Function

To summarize the variables in `mtcars`, we use the `summary()` function:

```r
summary(mtcars)
```

```
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
```

---
### The `names()` Function

To find the names of all variables, we use the `names()` function

```r
names(mtcars)
```

```
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
```

---
### The `table()` Function

The `table()` function is useful for tabulating factor variables. Let's consider the `iris` dataset.

```r
head(iris,3)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
```

```r
str(iris)
```

```
## 'data.frame':	150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

There are 3 numeric variables and 1 factor variable called `Species` (Column 5).

---
### The `table()` Function

To see what the levels of  `Species` are, we tabulate it using the `table()` function:

```r
table(iris[,5])
```

```
## 
##     setosa versicolor  virginica 
##         50         50         50
```

---
class: center, middle, inverse
#  Creating a Data Frame

---
### Using the `data.frame()` Function

Vectors can be combined into a data frame using the `data.frame()` function.

```r
# Create the following vectors
days <- c('Mon','Tue','Wed','Thu','Fri')
temp <- c(31.2,27.3,28.7, 32.5, 33.8)
rain <- c(TRUE, TRUE, TRUE, FALSE, FALSE)

# Pass in the vectors and assign the values to the dataframe named weather.df:
weather.df <- data.frame(days,temp,rain)
weather.df # Notice that the column names are already defined here.
```

```
##   days temp  rain
## 1  Mon 31.2  TRUE
## 2  Tue 27.3  TRUE
## 3  Wed 28.7  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

---
### Using the `data.frame()` Function

To explore the structure of `df`, we use the `str()` function:

```r
str(weather.df)
```

```
## 'data.frame':	5 obs. of  3 variables:
##  $ days: chr  "Mon" "Tue" "Wed" "Thu" ...
##  $ temp: num  31.2 27.3 28.7 32.5 33.8
##  $ rain: logi  TRUE TRUE TRUE FALSE FALSE
```
which shows that `days` is a character vector, `temp` is a numerical vector, and `rain` is a logical vector.

---
### Converting a Vector or Matrix into a Data Frame

A vector or matrix can be converted into a data frame using the `as.data.frame()` function. 
 
Let's generate a vector named `precip` containing daily precipitation records in inches, with values 1.2, 7.6, 8.3, 0.8, 0.2 for Monday to Friday.

Let's convert it into a data frame and save the output using the same name.

```r
precip <- c(1.2, 7.6, 8.3, 0.8, 0.2)
precip <- as.data.frame(precip) 
precip
```

```
##   precip
## 1    1.2
## 2    7.6
## 3    8.3
## 4    0.8
## 5    0.2
```

---
### Exercise

1. Create a vector called `humidity`, which represents the percentage of moisture in the air, with the values of 62.4, 96.3, 97.6, 73.2,	55.3 for Monday to Friday.

2. Declare it as a data frame.

3. Combine `precip` and `humidity` with the `weather.df` data frame using the `data.frame()` function. Call it `weather.df.2`.

---
class: center, middle, inverse
#  Selection and Indexing

---
###  Selection and Indexing

Elements referencing and indexing follow the same approach for a data frame as for a matrix. For example, to filter out (i.e. select) the first row of `weather.df` (while keeping all the column elements)

```r
# Everything from first row. Notice the column argument is empty.
weather.df[1,] #use square brackets
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
```

.pull-left[ 
To select the second column of `weather.df` (while keeping all the row elements)

```r
#Everything from second column. Notice the row argument is empty.
weather.df[,2]
```

```
## [1] 31.2 27.3 28.7 32.5 33.8
```
]

.pull-right[
To see if it rained on Tuesday,

```r
weather.df[2,3]
```

```
## [1] TRUE
```
 ]

---
### Omitting Row and Column Arguments in Index Referencing

If we pass in an index without specifying the row and column arguments, this will, by default, reference a column.

.pull-left[
For instance, the first column is referenced below:

```r
# Everything from first row
weather.df[1] # Notice that we omit the comma here
```

```
##   days
## 1  Mon
## 2  Tue
## 3  Wed
## 4  Thu
## 5  Fri
```
]
.pull-right[
And the second column is selected below:

```r
#Everything from second column
weather.df[2]
```

```
##   temp
## 1 31.2
## 2 27.3
## 3 28.7
## 4 32.5
## 5 33.8
```
]

---
###  Selecting using Column Names in the Column Argument

If the column names are specified, we may select the columns their names.

.pull-left[
For instance, we can select Column 3 by passing in its name, `rain`:

```r
weather.df[2,'rain']
```

```
## [1] TRUE
```
]
.pull-right[
As another example, to display the values for `day` and `rain` for Tuesday, we filter out the second row and select the `days` and `rain` columns:

```r
weather.df[2,c('days','rain')]
```

```
##   days rain
## 2  Tue TRUE
```
]

---
###  Selecting Variables Using Column Names without Row-Column Index Referencing

If we pass a variable name into the index brackets, the variable will be returned as an output.

.pull-left[
For instance, let's first use the `names()` function to retrieve the names of all variables in `weather.df`.

```r
names(weather.df)
```

```
## [1] "days" "temp" "rain"
```
]
.pull-right[
Let's pass `'temp'` into the brackets without the row-column index referencing:

```r
weather.df['temp'] # Without using row and column arguments, the column will be selected by default.
```

```
##   temp
## 1 31.2
## 2 27.3
## 3 28.7
## 4 32.5
## 5 33.8
```
]

---
###  Excluding Variables in the Selection

We may use the negation "`-`" symbol to select all variables except the one negated.

.pull-left[
For example, to select all columns except Column 3, we pass in -3 into the column argument of `weather.df`:

```r
weather.df[,-3]
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
## 4  Thu 32.5
## 5  Fri 33.8
```
]

.pull-right[
We may also leave out the comma, and not specify the row and column argument when referencing. By default, this will select all columns except Column 3 (as well as all the row elements).

```r
weather.df[-3]
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
## 4  Thu 32.5
## 5  Fri 33.8
```
]

---
###  Selecting Variables Using `$` Referencing

A data frame is an R list that can contain different data types. Thus, we may reference a variable in a data frame using the "`$`" symbol, like how we reference an item in a list.

.pull-left[
For example, we may select `days`, `temp` and `rain` from the `weather.df` data frame by

```r
weather.df$days
```

```
## [1] "Mon" "Tue" "Wed" "Thu" "Fri"
```

```r
weather.df$temp
```

```
## [1] 31.2 27.3 28.7 32.5 33.8
```

```r
weather.df$rain
```

```
## [1]  TRUE  TRUE  TRUE FALSE FALSE
```
]
.pull-right[
**Note**:  The difference between `weather.df['temp']` and `weather.df$temp` is that the former returns a data frame (i.e. in a column) while the latter returns a vector.

**Note**: We use the "`$`" symbol to reference an item in a list. The data frame is a special type of list and which therefore enable us to use "`$`" for referencing. By contrast, a matrix is not a list. Thus, the "`$`" referencing approach will not work.

]

---
###  Exercise

The formula to convert temperature from degrees Celsius into degrees Fahrenheit is `$(X°C × 9/5) + 32$`, where `$X$` is the temperature in degrees Celsius.

Convert `temp` in `weather.df` to degrees Fahrenheit and save it as `temp.F`. Combine `temp.F` with the data frame `weather.df` and call it `weather.df.3`.

Next, type in `names(weather.df.3)[2]`. What do you observe? Rename `temp` as `"temp.C"`. (Hint: Assign the new name to `names(weather.df.3)[2]`)

---
class: center, middle, inverse
#  Subsetting

---
###  Subsetting

Subsetting is the creation of a smaller dataset from the original dataset by selecting the rows (observations), the columns (variables), or both.

**Filtering** refers to *extracting observations* from the data frame using one or more conditions as filters.

**Selecting** refers to *extracting variables* from the data frame.

---
###  Filtering using Comparison Operators

Let's extract observations from `weather.df` based on the condition that temperature is greater than `$30°C$`.

```r
filter.df <- weather.df$temp > 30
filter.df # Vector of TRUE/FALSE
```

```
## [1]  TRUE FALSE FALSE  TRUE  TRUE
```

```r
weather.df[filter.df,]
```

```
##   days temp  rain
## 1  Mon 31.2  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

We have created a vector of logical values called `filter.df`, which contains the values `TRUE`, `FALSE`, `FALSE`, `TRUE`, `TRUE`, i.e. the  1st, 4th and 5th elements are `TRUE`.

This implies that `temp` is greater than `$30°C$` in the  1st, 4th and 5th elements of `weather.df$temp`, i.e. on Monday, Thursday, and Friday.

`filter.df` acts as a selector that identifies the elements in `weather.df$temp` where `temp` is greater than `$30°C$`. To select the  1st, 4th and 5th rows of `weather.df`, where `temp` is greater than `$30°C$`, we pass  `filter.df` into the row argument of `weather.df`.

---
###  Example

Rather than creating a selector vector, we may pass the selector condition `weather.df$temp > 30` directly into the row argument of `weather.df` to create a subset that contains only the days where temperature has exceeded `$30°C$`:

```r
weather.df[weather.df$temp > 30,]
```

```
##   days temp  rain
## 1  Mon 31.2  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

In the above, we select the rows in `weather.df` where `temp` is greater `$30°C$`, and keep all columns. To save the subset, just assign it to a new object:

```r
weather.df.hot <- weather.df[weather.df$temp > 30,]
```

---
###  Example

We may filter out the days that rained. To do so, we pass the condition `weather.df$rain == TRUE` into the row argument

```r
weather.df[weather.df$rain == TRUE,]
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
## 2  Tue 27.3 TRUE
## 3  Wed 28.7 TRUE
```

The above reads as "select the rows in `weather.df` where `rain` is `TRUE`, and keep all columns"

---
###  Combining Filtering Conditions

We may create subsets that satisfy two or more conditions. These conditions may be combined using the conjunctions "AND" and "OR".

The conjunction "AND" means that both conditions must be met for the joint condition to be `TRUE`. The conjunction "OR" means that either one of the conditions must be met for the condition to be `TRUE`.

In R, the "AND" conjunction is represented by the ampersand `&`. The "OR" conjunction is represented by the vertical pipe `|`.

---
###  Example

Let's filter out the data when the temperature was hot (i.e. temperature greater than `$30°C$`) **AND** rainy. To do so, we pass the joint condition `weather.df$temp > 30 & weather.df$rain == TRUE` into the **row** argument of the `weather.df` data frame

```r
weather.df[weather.df$temp > 30 & weather.df$rain == TRUE,]
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
```

---
###  Example

Next, let's filter out the days when the temperature was hot (i.e. temperature greater than `$30°C$`) **OR** when it rained. To do so, we pass the joint condition `weather.df$temp > 30 | weather.df$rain == TRUE` into the **row** argument of `weather.df`

```r
weather.df[weather.df$temp > 30 | weather.df$rain == TRUE,]
```

```
##   days temp  rain
## 1  Mon 31.2  TRUE
## 2  Tue 27.3  TRUE
## 3  Wed 28.7  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

Here, we recover the entire dataset because each day was either hot (i.e. Thursday, Friday), or rainy (i.e. Tuesday, Wednesday), or both (i.e. Monday).

---
###  Selecting

.pull-left[
Filtering refers to the extraction of rows/observations. Selecting refers to the extraction of columns/variables.

To select columns, just pass the column indices or column names into the data frame.

Note: 
  * `weather.df[,'temp']` means "keep all rows and select the column `temp`".
  * `weather.df[,c(1,2)]` means "keep all rows and select the 1st and 2nd columns".
]

.pull-right[

```r
weather.df[,'temp']
```

```
## [1] 31.2 27.3 28.7 32.5 33.8
```

```r
weather.df[,c('days','temp')]
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
## 4  Thu 32.5
## 5  Fri 33.8
```

```r
weather.df[,c(1,2)]
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
## 4  Thu 32.5
## 5  Fri 33.8
```
]

---
###  Combining Filtering and Selecting

Finally, we may combine filtering (the observations) and selecting (the variables).

For example, let's extract the days that rained and keep only the variables `days` and `temp`:

```r
weather.df[weather.df$rain == TRUE, c('days','temp')] #use the combine function
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
```

Notice that we merely called the name of the variables `days` and `temp` when we passed them into the column argument of `weather.df`.

However, to refer to the `rain` variable in the row argument, we must make a reference to the data frame, i.e.  `weather.df$rain`.

---
###  Common Pitfall in Conditional Filtering

Why must we spell out the condition `weather.df$rain == TRUE` in the row argument with the name of the data frame, and not simply use `'rain' == TRUE`?

This because to filter out a desired observation, a boolean value, `TRUE`, must be present in the row argument of the data frame *for that specific row*.

The problem with passing in the condition `'rain' == TRUE` is that the string character, `'rain'`, is not equal to `TRUE`! Therefore, `'rain' == TRUE`  will return `FALSE`.

By contrast, `weather.df$rain` references the variable `rain`, which is a boolean vector. Therefore, for the  rows in `weather.df$rain` that contain the value `TRUE`, a `TRUE` value will be returned for the rows where the condition `weather.df$rain == TRUE` is met.

---
###  Common Pitfall in Column Referencing

For column referencing, we pass in the column name or column index only. If we reference the data frame, for instance, `weather.df$temp`, it will be wrong, e.g.

`weather.df[weather.df$rain == TRUE, weather.df$temp]`

This is because `weather.df$temp` is a vector of temperature values, i.e.  31.2, 27.3, 28.7,  32.5, 33.8.

By passing `weather.df$temp` into the column argument, you are attempting to retrieve columns 31.2, 27.3, 28.7,  32.5, 33.8 from `weather.df`, which is incorrect!

---
###  Exercise

Using `weather.df.2`, create a subset (i.e. select the rows) where precipitation is more than 2 inches.

---
###  Filtering using `subset()`

In base R, subsetting requires us to write codes like

```r
weather.df[weather.df$temp > 30 & weather.df$rain == TRUE,]
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
```

Instead of passing in a set of logicals into the row argument, we may use the `subset()` command to perform conditional filtering. The `subset()` function has 3 inputs:

+ Input 1: Name of the data frame
+ Input 2: The subsetting condition(s)
+ Input 3: The columns we want to select (Optional)

---
###  Example

Suppose we want to grab the days that had rained. We may use the `subset()` function as follows:

```r
subset(weather.df, subset = rain==TRUE)
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
## 2  Tue 27.3 TRUE
## 3  Wed 28.7 TRUE
```

For the subsetting condition above, we simply pass in `rain==True` and not `weather.df$rain==True`. This is because we have declared `weather.df` as the data frame in the first input. Therefore, it is understood that the `rain` variable is from the `weather.df` data frame.

That being said, you may write the code as

```r
subset(weather.df, subset = weather.df$rain==TRUE)`
```

The code works but is unnecessarily long. This is also not recommended as you may unintentionally select a variable from a different data frame!

---
###  Example 
Once you are aware that the second input of `subset()` specifies the subsetting condition(s), there is no need to declare input name, i.e. `subset = `. For example, the following code will work just as fine:

```r
subset(weather.df, rain==TRUE) # omit "subset = " in the second argument
```

```
##   days temp rain
## 1  Mon 31.2 TRUE
## 2  Tue 27.3 TRUE
## 3  Wed 28.7 TRUE
```

---
###  Subsetting with Joint Conditions

We may combine subsetting conditions using `&`  (i.e. and) or `|` (i.e. or) and include the combined conditions in the `subset()` function. For example, let's add another condition, `temp < 30` into the function:

```r
subset(weather.df, rain==TRUE & temp<30)
```

```
##   days temp rain
## 2  Tue 27.3 TRUE
## 3  Wed 28.7 TRUE
```

---
###  Subsetting with Joint Conditions
The `subset()` function has a third input that allows us to select variables. For example, let's select the 'days' and 'temperature' for the days that rained.

```r
subset(weather.df, rain==TRUE, select = c(days, temp))
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
```

Again, once we are aware that the third input involves selecting the columns, we may drop `select = ` and pass in `c(days, temp)` only>

```r
subset(weather.df, rain==TRUE, c(days, temp)) # No argument names at all.
```

```
##   days temp
## 1  Mon 31.2
## 2  Tue 27.3
## 3  Wed 28.7
```

Notice that we did not wrap `days` and `temp` with quotation marks. This is because R knows that these are names of variables in `weather.df` and therefore "tolerates" this omission.

---
###  The `%in%` Operator

The `%in%`, or membership, operator generates a Boolean indicator to indicate if values in a variable belong to a list of pre-specified values.

For example, the variable `iris$Species`  contains three species,  "setosa", "versicolor", "virginica". To choose a subset (call it `iris.small`) containing only "versicolor", "virginica" (i.e. "versicolor" and "virginica" are contained in `Species`), we may pass

`Species %in% c("versicolor", "virginica")`

into the subsetting argument in `subset()`, which means `Species` **containing** `c("versicolor", "virginica")`.

---
###  Example

Let's compare using the base R command versus the `subset()` function for subsetting.

In base R, to choose the observations that correspond to the "versicolor" or  "virginica" species, we will write

```r
iris.small.1 <- iris[iris$Species == "versicolor" | iris$Species =="virginica",]
head(iris.small.1)
```

```
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 51          7.0         3.2          4.7         1.4 versicolor
## 52          6.4         3.2          4.5         1.5 versicolor
## 53          6.9         3.1          4.9         1.5 versicolor
## 54          5.5         2.3          4.0         1.3 versicolor
## 55          6.5         2.8          4.6         1.5 versicolor
## 56          5.7         2.8          4.5         1.3 versicolor
```

---
###  Example

Using the `subset()` function and the `%in%` operator, we may achieve the same with

```r
iris.small.2 <- subset(iris, Species %in% c("versicolor", "virginica"))
head(iris.small.2)
```

The above command reads something like: construct a subset from `iris` where  `Species` contains the elements "versicolor" and "virginica".

---
###  Exercise

1. Using the `subset()` command, create a subset from `weather.df.2` where precipitation is more than 2 inches.

2.  Use the `mtcars` dataset. Using the `subset()` command and the `%in%` operator, create a subset where  "cyl" contains 4 and 6 (note these are numbers, not strings).

3.  Use the `mtcars` dataset. Using the `subset()` command and the `%in%` operator, create a subset containing the variables "mpg",  "cyl" and "disp", where  "cyl" contains 4 and 6.

---
class: center, middle, inverse
#  Other Useful Functions

---
###  Other Useful Functions

We explore two commonly used functions:
  
  1. `order()`: for sorting observations by a certain attribute

2. `na.omit()`: for removing missing values

---
###  Sorting the Observations

We may sort the data frame by using the `order()` function. To do so, we

1) pass the variable you wish to sort the data frame by into the `order()` function. This will produce an index vector containing the order of the observations in the variable. 
      2) pass this index vector into the row argument of the data frame to be reordered.

---
###  Example

Let's sort `weather.df` by the temperature:

```r
sorted.temp <- order(weather.df['temp'])
sorted.temp
```

```
## [1] 2 3 1 4 5
```

```r
weather.df[sorted.temp,]
```

```
##   days temp  rain
## 2  Tue 27.3  TRUE
## 3  Wed 28.7  TRUE
## 1  Mon 31.2  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

Let's take a look at what is happening here.

The first object, `sorted.temp`, generates an index of the (ascending) order of the elements in `temp`. Here, the lowest temperature took place on Wednesday (i.e. Element 3); the second lowest temperature took place on Monday (i.e. Element 1), and so on.

Then, by passing `sorted.temp` into the row argument of `weather.df`, we may sort the rows of `weather.df` according to the index in `sorted.temp`, i.e. `2 3 1 4 5`, so that Row 1 becomes Row 2, Row 2 becomes Row 3, Row 3 becomes Row 1, and so on.

---
###  Example

Note: We may also use the variable selector "`$`"

```r
sort.temp <- order(weather.df$temp)
weather.df[sort.temp,]
```

```
##   days temp  rain
## 2  Tue 27.3  TRUE
## 3  Wed 28.7  TRUE
## 1  Mon 31.2  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

---
###  Sort by Descending Order

To sort by descending order of temperature, we prefix `weather.df['temp']` with a negative sign:

```r
desc.temp <- order(-weather.df['temp'])
weather.df[desc.temp,]
```

```
##   days temp  rain
## 5  Fri 33.8 FALSE
## 4  Thu 32.5 FALSE
## 1  Mon 31.2  TRUE
## 3  Wed 28.7  TRUE
## 2  Tue 27.3  TRUE
```

---
###  Remark

There are other ways of sorting a dataset using libraries, which we will explore later. One library we will be using frequently is the `tidyverse` library, which loads a package called `dplyr`, among others.

The `arrange()` function in the `dplyr` package enables us to sort a data frame easily. Thus, there is no need to generate a vector of order index, and then, use it to sort a data frame as we have done earlier.

```r
dplyr::arrange(weather.df, temp)
```

```
##   days temp  rain
## 1  Tue 27.3  TRUE
## 2  Wed 28.7  TRUE
## 3  Mon 31.2  TRUE
## 4  Thu 32.5 FALSE
## 5  Fri 33.8 FALSE
```

---
###  Exercise

Sort `weather.df.2` in ascending order based on `precip`:

---
###  Removing Missing Values (NA)

Missing values in the data frame are represented by `NA`. To remove rows containing NA from a data frame, we pass the data frame into the `na.omit()` function.

**Note**: It is not always a good idea to use `na.omit()` to remove missing values. This is because the `na.omit()` will omit a row if it contains an `NA`, regardless whether is there only one `NA` in that row or many `NA`s. Therefore, `omit.na()` may remove non-missing data that could be useful for future analysis.

---
###  Exercise

Create a vector called `v.1` with the numbers 1 to 4, NA, NA, 5 to 9 using the combine function. Calculate the mean of `v.1` by passing it into `mean()`. What do you observe?

Now, using `na.omit()`, remove the `NA`s from `v.1` and save the output as `v.2`. Then, calculate the mean of `v.2`. What is it?

---
class: center, middle, inverse
#  Working with Strings

---
### Common String Operations

Often, we need to manipulate string values in our data frame. The following are commonly encountered scenarios in string manipulations:

1. Recoding string values

2. Concatenating string values

3. Replacing a section in a string

---
### Recoding String Values

We may use the `ifelse()` function to recode strings. Later in the course, we will explore the `recode()` function from the `dplyr` package to do the same.

The `ifelse()` function has three inputs:

(1) the condition to be tested
    (2) the specified value if the condition is true
    (3) the specified value if the condition is false

As an example, suppose we wish to abbreviate the species names in `iris` from "setosa" to "SE", "virginica" to "VA", and "versicolor" to "VE".

---
### Example

The codes below recodes one string expression at a time using the `ifelse()` function:

```r
#save the iris data into iris.df
iris.df <- iris 
summary(iris.df$Species) #summarize the `Species` variable
```

```
##     setosa versicolor  virginica 
##         50         50         50
```

```r
# Recode "setosa" into "SE", and save the output
iris.df$Species.Short1 <- ifelse(iris.df$Species == "setosa", "SE",iris.df$Species)

# Recode "virginica" into "VA", and save the output
iris.df$Species.Short1 <- ifelse(iris.df$Species == "virginica", "VA", iris.df$Species.Short1)

# Recode "versicolor" into "VE", and save the output
iris.df$Species.Short1 <- ifelse(iris.df$Species == "versicolor", "VE", iris.df$Species.Short1)

# Check the output
summary(as.factor(iris.df$Species.Short1))
```

```
## SE VA VE 
## 50 50 50
```

---
### Example

We may recode all values in a single line by nesting the `ifelse()` function iteratively into the "otherwise" (i.e. 3rd) argument. As a remark, there are better ways to recode the values in a variable such as using `recode()` from `dplyr`.

```r
#save the iris data into iris.df
iris.df <- iris

# Save the recoded variables into Species.Short
iris.df$Species.Short2 <- ifelse(iris.df$Species == "setosa", "SE", 
                                 ifelse(iris.df$Species == "virginica", "VA", 
                                        ifelse(iris.df$Species == "versicolor", "VE", iris.df$Species)))
```

---
### String Concatenation

To concatenate two or more strings using a single space separator, we may use the `paste()` function. To do the same without a default space separator, we may use the `paste0()` function.

.left-pull[ 
To use a specific separator than a single space, we specify the separator in the `sep = "your separator"` option.
]

.right-pull[

```r
# a single separating space by default
paste("Hello", "World")
```

```
## [1] "Hello World"
```

```r
# no default separating space 
paste0("Hello", "World") 
```

```
## [1] "HelloWorld"
```

```r
# specifying the separator
paste("Hello", "World", sep ="") 
```

```
## [1] "HelloWorld"
```

```r
# specifying the separator, @
paste("Hello", "World", sep ="@")
```

```
## [1] "Hello@World"
```
]

You may pass in a vector of strings into the `paste()` and `paste0()` functions.

---
### Replacing a Section in a String

We may replace a section of a string, called a **substring**. For example, if the species names are "setosa flower", "virginica flower" and "versicolor flower", then " flower" (with a leading space) is redundant and should be removed (i.e. replace with a blank).

To do so, we may use the `gsub()` function, which has the syntax structure:

`gsub(pattern to be replaced, replacement, the expression/variable to apply the replacement on)`

Accompanying the `gsub()` function is the `sub()` function, which replaces only the first instance.

---
### Example

Below, we use the `gsub()` function to remove the word "flower" from the variable `iris.df$Species.Long`:

```r
iris.df$Species.Long <- paste(iris.df$Species,"flower") # appending "flower" to iris.df$Species
iris.df$Species.Original <- gsub(" flower", "", iris.df$Species) # replace " flower" with "" in iris.df$Species

head(iris.df[c('Species.Long','Species.Original')])
```

```
##    Species.Long Species.Original
## 1 setosa flower           setosa
## 2 setosa flower           setosa
## 3 setosa flower           setosa
## 4 setosa flower           setosa
## 5 setosa flower           setosa
## 6 setosa flower           setosa
```

---
### Further Remarks

Other commonly encountered tasks in string manipulation include replacing the names of a certain label, textwrapping of titles or axis labels, etc.

String manipulation in R deserves a separate discussion. In base R, working with strings can be challenging. An easier way is to use functions from the `stringr` package, which is part of the `tidyverse` library.

The link below summarizes some useful commands in `stringr` and their counterparts in base R.

https://stringr.tidyverse.org/articles/from-base.html

---
### Exercise

Construct the following vectors

* `name` with "Graham", "Lily", "Rose"
    * `gender` with "M", "F", "F"
    * `salutation` with  "Mr","Madam", "Madam"
    * `address` with "Forfar Ave", "Clarence St", "Dawson Ave"

Construct a data frame called `df.records` using these vectors.

Check whose address has "Ave" and "St".

Rename "Ave" to "Avenue" and "St" to "Street".

Rename "Madam" to "Ms" within the variable `salutation`.

Overwrite these changes in `df.records`.

---
class: center, middle, inverse
# R Packages

---
### Installing and Using R Packages

Packages are a collection of programs to solve certain tasks conveniently. The repository for R packages is called CRAN, i.e. Comprehensive R Archive Network.

For this course, we will be using the `tidyverse` library (package and library are synonymous in R). The package contains several sub-packages for data visualization, data wrangling, functional programming, etc.

For data visualization, we will be using the `ggplot2` package in the `tidyverse` library. Once we install and load `tidyverse`, the `ggplot2` package will be automatically loaded.

For data management, the `dplyr` package in the `tidyverse` library is a powerful tool. We will return to `tidyverse` later in the course.

---
### Installing `tidyverse`
Let's install and load the `tidyverse` package

```r
#install.packages("tidyverse")
library(tidyverse)
```
The first line installs the `tidyverse` package (commented out as I have already installed the package). The second line loads `tidyverse` into R. This is necessary in order for us to use functions from `tidyverse`.

---
class: center, middle, inverse
# Importing and Exporting Files into R

---
### Working Directory

The working directory is the directory that you wish to access your files from without specifying the entire filepath. For our convenience, it will be a good idea to declare where the working directory is.

To check the existing working directory, use the `getwd()` command.

```r
getwd()
```

```
## [1] "C:/Users/nicho/OneDrive/Documents/Teaching/ANL501/Course Content/course_notes/current/Seminar 2"
```

To specify the working directory, use the `setwd()` command, e.g.

```r
# setwd("C:\\Users\\nicholas\\Documents\\Learning\\rstudio\\Seminar2") # PC users
# setwd("/Users/nicholas/Documents/Learning/rstudio/Seminar2")  # Mac Users
```
(I commented out the command as the directory is only an example.) Notice that a double backslash `\\` is used instead of a single backslash `\`, which represents an `escape` (to escape a symbol or a sign, which is used to represent something in R). An alternative to `\\` is a single forward slash `/`.

---
### Importing a Data Filie

After setting the working directory, we may import the csv file `Seminar2_Data.csv` from the directory and save it as an object called `WDI.data`.

```r
WDI.data <- read.csv("Seminar2_Data.csv")
```
This dataset is constructed from the World Bank World Development Indicators.

Notice that the entire file path is omitted in `read.csv()` as  `Seminar2_Data.csv` is in the working directory.

---
### Checking the Imported File

It is a good practice to first explore the dataset after importing it. We first look at the structure of our data by using the `str()` command.

```r
str(WDI.data)
```

```
## 'data.frame':	5425 obs. of  10 variables:
##  $ Country.Code                                                       : chr  "ABW" "ABW" "ABW" "ABW" ...
##  $ Country.Name                                                       : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ Year                                                               : int  1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ...
##  $ GDP.growth..annual...                                              : num  2.55 1.19 7.05 1.99 1.24 ...
##  $ GDP.per.capita..constant.2010.US..                                 : num  26705 26088 27191 27152 26954 ...
##  $ GINI.index..World.Bank.estimate.                                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Mortality.rate..under.5..per.1.000.live.births.                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Poverty.headcount.ratio.at..1.90.a.day..2011.PPP.....of.population.: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Income.Group                                                       : chr  "High income" "High income" "High income" "High income" ...
##  $ Region                                                             : chr  "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" "Latin America & Caribbean" ...
```

Notice that this data frame has  "5425 obs. of 19 variables". As discussed, each variable can be referenced by their name by using the "`$`" symbol.

---
### Checking the Imported File

For example, to reference the variable `Country.Code`, we write the code

We may summarize our data by using the `summary()` function. For instance, let's summarize GDP growth

```r
summary(WDI.data$GDP.growth)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -62.076   1.625   3.745   3.818   6.003 149.973     639
```

---
### Tidying Up Our Data

As the column names are very long, it is good idea to rename them into something simpler. To do so, let's check the column names of `WDI.data` first. Then, we re-assign simpler names to the columns.

To rewrite the variable names from Columns 1 to 8 in `WDI.data`, we use the function `colnames()` or `names()`

```r
# Let's overwrite the column names of `WDI.data}

colnames(WDI.data)[1:8]<- c("Code", "Country", "Year", "GDP.Growth", "GDP.PerCap", "Gini", "Child.Mortality", "Poverty.Under.2" )

colnames(WDI.data) # much better!
```

```
##  [1] "Code"            "Country"         "Year"            "GDP.Growth"     
##  [5] "GDP.PerCap"      "Gini"            "Child.Mortality" "Poverty.Under.2"
##  [9] "Income.Group"    "Region"
```

---
### Exporting Our Data File

Let's export `WDI.data` back into our working directory using the `write.csv()` function from base R. The first input of `write.csv()` takes in the name of the data frame we want to export. The second argument specifies the name of the file to be exported.

For instance, the following command

```r
write.csv(WDI.data, "Seminar2_Data2.csv")
```
exports the data frame `WDI.data` as "`Seminar2_Data2.csv`" into the working directory.

The exported csv file will contain a new column containing row numbers  (i.e. which are row names) in "`Seminar2_Data2.csv`". To prevent R from creating a new column of row names, we use the option `row.names=FALSE`.

```r
write.csv(WDI.data, "Seminar2_Data2.csv", row.names  = FALSE)
```

Another way to avoid generating row names as a new variable in the exported dataset is to use the `write_csv()` function (notice the underscore) that comes from the `readr` package of `tidyverse`.

---
### Exercise

Generate a vector `vec.1` using a sequence from 1 to 10. Generate another vector `vec.2` using the first 10 elements of `letters`.

Column combine `vec.1` and `vec.2` and call the output `df`. Assign column names "Numbers" and "Letters" to the first and second column respectively.

Export `df` as a csv file named "`Seminar2_last_exercise.csv`" to your working directory and exclude row names.

To verify that you have successfully exported `df`, go to your working directory and open "`Seminar2_last_exercise.csv`".