Into the Tidyverse | Session Four

Warwick Data Science Society

Acknowledgements

DigitalOcean Logo

Recap

Statistics in R

R has many useful statistical functions:
- mean(), median(), min(), max(), quantile()[i]
- range(), diff(), IQR()
- var(), sd()
Remember to use na.rm = TRUE if you have missing values

Comparisons and Boolean operators

Comparisons in R can be done with <, <=, >, and >=
Equality can be checked with ==, !=, and near()
R has three main Boolean operators & (and), | (or), and ! (not)
All of these comparisons and Boolean operators are vectorised

The dplyr Verbs

There are five key dplyr functions referred to as verbs:
- filter() - pick observations by their values
- arrange() - reorder observations based on their values
- select() - pick variables by their names
- mutate() - create new variables as functions of existing variables
- summarise()/summarize() - collapse many values down to a single summary
These can be combined with group_by() to change the scope of aggregate functions
When using multiple verbs, combine them with the pipe (%>%) operator

Tidy Data

Introduction

Where Are We?

“Tidy datasets are all alike; but every messy dataset is messy in its own way” - Hadley Wickham
This session is all about organising your data in a consistent way called tidy data
This organisation scheme requires some work upfront but pays off in the long-run by making it much simpler to use the tools in the tidyverse

Data Analysis Map - Tidy

What is tidyr?

tidyr is the foundational package in the tidyverse
It allows you to take a dataset in a non-tidy form and transform it into tidy data
Once data is in a tidy form, you can spend much less time munging data from one representation to another, allowing more time to answer difficult analytical questions

tidyr Hex

Data Representations

You can represent the same underlying data in multiple ways
The following example shows four different ways to organise the same data
Each dataset shows the same values of four variables, country, year, population, and cases
Only one of these is tidy data. We will discuss which one this is at a later point

Data Representations (cont.)

table1

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Data Representations (cont.)

table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Data Representations (cont.)

table3

# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Data Representations (cont.)

table4a

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4b

# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

Tidy Data

These are all representations of the same underlying data
Despite this, they are not all equally easy to use
Only one of these datasets is in a tidy format
The tidy dataset will be much easier to work with inside of the tidyverse

Tidy Data (cont.)

There a three interrelated rules which make a dataset tidy:

Each variable (a value you can measure) must have its own column
Each observation (a thing which has measurable properties) must have its own row
Each value must have its own cell

We can using the following graphic to represent this visually

Tidy Data Rules

Tidy Data (cont.)

These rules are interrelated since it's impossible to only satisfy two of the three
This interrelationship leads to an even simpler set of practical instructions:

Put each dataset in tibble
Put each variable in its own column

According to these rules, only the first representation is tidy

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

We can now look at why the other representations are not tidy and how we can fix this

Gathering

An Untidy Example

Let's start by having a look at the first tibble in the fourth representation

table4a

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

This representation violates the first rule of tidy data - each variable must have its own column
Here we have two columns, 1999 and 2000, which do not represent variables
Instead they represent specific values of the year variable
As a consequence, it also violates the second rule - each observation must have its own row
This is because each row corresponds to two observations, one from each year

Tidying the Dataset

To tidy a dataset like this we need to gather those two non-variable columns into a new pair of variables
To describe this operation we need three parameters:
- The set of columns that are not variables (1999 and 2000 in this case)
- The name of the variable whose values form the column names (we said before that this is year). We refer to this as the key
- The name of the variable whose values are spread over the cells (here it is cases). We refer to this as the value

Tidying the Dataset (cont.)

We can use these parameters in a call to gather()

table4a %>%
  # use backticks so R doesn't confuse the column names for numbers
  gather(`1999`, `2000`, key = "year", value = "cases")

# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

We could perform a similar process to tidy table4b though this time our value would be "population"
To combine these two tables we would then have to use techniques covered later in this session

Diagramming the Transformation

We can use the following diagram to show the gathering process we just performed.

Gathering

Spreading

An Untidy Example

Now let's take a look at the second representation

table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

This suffers from the opposite problem of the last representation
Each observation is scattered across two rows and the variables cases and population are contained in the type column

Tidying the Dataset

We can use the spread() function to fix this issue
This time we only need two parameters:
- The column which contains the variable names (here, it's type) which we referred to as the key
- The column that contains values from multiple variables (here it's count) which we referred to as the value

Tidying the Dataset (cont.)

We can use these parameters in a call to spread()

table2 %>%
  spread(key = type, value = count)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

We don't have to use quotes for type and count since they already exist in the dataset whereas with gather() we were adding them

Diagramming the Transformation

We can use the following diagram to show the spreading process we just performed.

Spreading and Gathering

As you might be able to guess from the common key and value arguments, spread() and gather() are opposites of each other and can be used to reverse each other's effects
gather() makes wide tables narrower and longer, whereas spread() makes long tables shorter and wider
Collectively, these operations are known as pivoting (think pivot tables in Excel)
A general rule of thumb is that wide tables are best for looking at data, whereas tall tables are best for performing analysis.

Aside: Pivoting

Since tidyr 1.0.0 (September 2019), a new syntax for pivoting tables has been introduced
This uses two functions pivot_longer() and pivot_wider() which have a more intuitive syntax than gather() and spread()
As of writing this presentation, these functions are still in experimental lifecycle, and soon to move into maturing with tidyr 1.1.0
Even though this new syntax is more elegant, most examples and existing code uses the old syntax, so it is important to know this
You can, however, learn more about pivoting here

Separating

An Untidy Example

The last representation to look at is number three

table3

# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

This violates the last rule of tidy data - each value must have its own cell
As a consequence we also end up with a non-variable column rate

Tidying the Dataset

We can use the separate() function to fix this issue
We will need to specify two parameters:
- The column that we want to split into multiple columns wherever a separator character appears
- The names of the new columns to create as a character vector
By default separate() will split values whenever it sees a non-alphanumeric character though this can be specified manually using the sep argument
We also need to specify convert = TRUE if we want tidyr to automatically convert the new columns to the correct data types (by default they would still be characters)

Tidying the Dataset (cont.)

We then use these parameters in a call to separate()

table3 %>%
  separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Bonus Note: You can pass separate() a vector of integers as the sep argument to be used as the positions to split at

Diagramming the Transformation

We can use the following diagram to show the spreading process we just performed.

Separating

Uniting

An Untidy Example

Here is another representation of the dataset we've been looking that we did not include earlier

table5

# A tibble: 6 x 4
  country     century year  rate             
* <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

This violates the last rule of tidy data too but in an opposite way
Here, the value of year has been split up into two cells, one containing the century and one containing the rest

Tidying the Dataset

We can use the unite() function to fix this issue
We will need to specify the following parameters:
- The name of the new column to create
- The set of columns to combine
Optionally, we can also include a sep argument to specify the separator or use the default _

Tidying the Dataset (cont.)

We then use these parameters in a call to unite()

table5 %>%
  unite(new, century, year, sep = '')

# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

unite() does not have a convert parameter so we would have to perform this conversion manually
In this case we would use mutate(year = as.integer(year))

Diagramming the Transformation

We can use the following diagram to show the spreading process we just performed.

Separating

Relational Data

Introduction

Where Are We?

We now make a return to the transformation part of the data analysis pipeline
It's rare that a data analysis project will only concern a single table of data so in this section we learn how to combine tables to answer questions we are interested in
We will be using dplyr to help us do this using a new set of functions from the package

Data Analysis Map - Transform

Relationships

Relations are always defined between pairs of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between the pairs
There are many ways to deal with relational data but due to time constraints, we will only be looking at the most critical types - mutating joins
The other two main types are filtering joins and set operations. These can be read about in R4DS chapter 10 if you need these for your work in the DataViz battle

Mutating Joins

Introduction

A mutating join allows you to combine variables from two tables
This is performed by matching observations on a collection of columns (or single column) known as the key
There are different types of joins which handle missing values in different ways
Any join takes three main arguments
- Two tables which will be joined together
- A single column name or vector of column names to be used as the key
The output of a join contains the shared keys as well as the other columns from both tables

Band Membership

To illustrate the various join types, we use the following toy dataset made of two tables

band_members

# A tibble: 3 x 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 x 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Inner Join

The inner join is the simplest join
It matches pairs of observations whenever their keys are equal
The output of an inner join only contains rows which contained the key in both tables

inner_join(band_members, band_instruments, by = 'name')

# A tibble: 2 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass

Here, neither Mick nor Keith are included in the result as they don't appear in both tables

Left Join

A left join will result in a table containing every value of the key appearing in the piped-in table
Any missing values will be filled in with na
It is called a left join since the keys that are preserved come from the table used as the left argument

left_join(band_members, band_instruments, by = 'name')

# A tibble: 3 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass

Here the result contains all 3 musicians featured in the band_members table

Right Join

A right join is the opposite of a left join
It preserves all keys that come from the table used as the right argument

right_join(band_members, band_instruments, by = 'name')

# A tibble: 3 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

Here the result contains all 3 musicians featured in the band_instruments table

Full Join

A full join combines the behaviour of a left and right join
It creates a row in the result for every key that appears in either table

full_join(band_members, band_instruments, by = 'name')

# A tibble: 4 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

Differing Key Column Names

It is not always the case that your key columns will have the same names
Here is an altered version of the previous problem demonstrating this:

names(band_members)

[1] "name" "band"

names(band_instruments2)

[1] "artist" "plays"

We can still perform joins but we have to slightly modify our syntax to the following

inner_join(band_members, band_instruments2, by = c("name" = "artist"))

# A tibble: 2 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass

Duplicate Keys

Mutating joins also work when one table has duplicate keys
Detailed discussion of this is beyond the scope of this course but here is a brief example

employees <- tibble(name = c('John', 'Anne'), experience = c(4, 7))
sales <- tibble(item_code = c('21', '52', '35'), sold_by = c('John', 'Anne', 'Anne'))
sales %>%
  inner_join(employees, by = c('sold_by' = 'name'))

# A tibble: 3 x 3
  item_code sold_by experience
  <chr>     <chr>        <dbl>
1 21        John             4
2 52        Anne             7
3 35        Anne             7

Aside: Tidy Data in Practice

How would we create this graph?

plot of chunk unnamed-chunk-28