Into the Tidyverse | Session Three

Warwick Data Science Society

Acknowledgements

DigitalOcean Logo

Recap

Vectors and Arithmetic

  • You typically create a vector in R using the c() function
x <- c(1, 4, 9)
x
[1] 1 4 9
  • You can create a vector whose entries are an arithmetic sequence using seq()
seq(-5, 5, length.out = 6)
[1] -5 -3 -1  1  3  5
  • Most arithmetic operators and mathematical functions in R are vectorised
sqrt(x)
[1] 1 2 3

Reading CSVs

  • We read CSVs in R using the read_csv() function from the readr package
  • The first argument we pass in is the path to the file we want to import
  • This path should be relative to our current working directory
# import a csv file at ~/project_name/data/my_data.csv
setwd('~/project_name') # or use Session > Set Working Directory > ...

read_csv('data/my_data.csv')

readr Parameters

  • Sometimes you need to specify additional parameters when reading files. These include:
    • skip = n - skip the first n lines of the file
    • comment = '{char}' - ignore lines which begin with {char}
    • col_names = FALSE - the file has no column names and we don't know them
    • col_names = c('col1_name', ...) - the file has no column names but we do know them
    • na = '{char}' - import {char} as a missing value
    • col_types = cols(col1 = col_{type}(), ...) - override types of columns
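  • As a rough sketch of how these parameters combine, the example below reads a small made-up file. The file contents, the temporary path and the column names id, label and score are assumptions for illustration only

```r
library(readr)

# Hypothetical file contents, written to a temporary file for illustration
path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported by some tool",  # junk first line, removed with skip = 1
  "1,a,-",
  "# a comment line",       # ignored because comment = "#"
  "2,b,3.5"
), path)

dat <- read_csv(
  path,
  skip = 1,
  comment = "#",
  col_names = c("id", "label", "score"),  # the file has no header row
  na = "-",                               # import "-" as a missing value
  col_types = cols(
    id = col_integer(),
    label = col_character(),
    score = col_double()
  )
)
dat
```

  • Supplying col_types explicitly also stops readr from guessing the column types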

Line Plots

  • Line plots can be made using geom_line() or geom_smooth()
  • Key aesthetics are linetype and group
  • Remove the confidence band drawn by geom_smooth() with se = FALSE
  • Choose which smoothing method to use with method = 'lm'/'loess'/'gam'

Back to the Basics

Introduction

  • There are still many basic features that we have omitted learning about
  • These will be important when we start manipulating datasets
  • We therefore focus on them now

R for Statistics

  • Statistics is R's speciality
  • Despite this we've barely even used this functionality
  • A statistic is a function of data points
  • Examples include mean, median, maximum, etc.

Measures of Location

x <- c(4, 10, 10, 12, 19)
mean(x)
[1] 11
median(x)
[1] 10
quantile(x)
  0%  25%  50%  75% 100% 
   4   10   10   12   19 
quantile(x)[2]
25% 
 10 
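  • If you only need particular quantiles, a minimal sketch (using the same x as above) is to pass them through the probs argument, or to index the result by name rather than by position

```r
x <- c(4, 10, 10, 12, 19)

# ask for specific quantiles with the probs argument
quartiles <- quantile(x, probs = c(0.25, 0.75))
quartiles

# index by name, then drop the name to leave a plain number
q1 <- unname(quantile(x)["25%"])
q1
```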

Measures of Spread

x <- c(4, 10, 10, 12, 19)
# returns two values
range(x)
[1]  4 19
# use diff() to find difference
diff(range(x))
[1] 15
# returns one value
IQR(x)
[1] 2

Measures of Spread (cont.)

x <- c(4, 10, 10, 12, 19)
# variance
var(x)
[1] 29
# standard deviation = sqrt(variance)
sd(x)
[1] 5.385165
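  • To see what var() and sd() compute, the sketch below rebuilds them by hand. The names manual_var and manual_sd are only illustrative

```r
x <- c(4, 10, 10, 12, 19)

# sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```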

Handling Missing Values

  • If a vector containing a missing value (NA) has a statistical transformation applied to it then the result will always be NA
x <- c(1, 5, 10, NA, 12)
mean(x)
[1] NA
  • This can be avoided by setting na.rm = TRUE
mean(x, na.rm = TRUE)
[1] 7
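  • The same argument is accepted by many other summary functions. A short sketch, which also counts the missing values first (the variable names are illustrative):

```r
x <- c(1, 5, 10, NA, 12)

# count the missing values before deciding to drop them
n_missing <- sum(is.na(x))

# na.rm is accepted by many summary functions, not just mean()
total <- sum(x, na.rm = TRUE)
middle <- median(x, na.rm = TRUE)
spread <- sd(x, na.rm = TRUE)
```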

Comparisons and Boolean Operators

  • It is very useful to be able to compare two values and ask how they relate - Is one larger than the other? Are they the same value?
  • We can go even further by stringing these comparisons together using Boolean operators (and, or, etc.)

Ordering Comparisons

4 < 6
[1] TRUE
3 <= 3
[1] TRUE
5 < 4
[1] FALSE
5 >= 4
[1] TRUE

Equality Comparisons

# use double ='s to check for equality
# a single equals is already used for specifying parameters of a function
4 == 4
[1] TRUE
4 == 5
[1] FALSE
4 != 4
[1] FALSE
4 != 5
[1] TRUE

Comparisons on Vectors

  • All comparisons in R are vectorised. This means that they act element-wise on vectors
x <- c(2, 6, 10)
y <- c(3, 5, 10)
x <= y
[1]  TRUE FALSE  TRUE
x == y
[1] FALSE FALSE  TRUE
  • You can transform a logical vector into a single logical value using all() and any(). See their corresponding help pages for more details
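  • As a quick sketch with the vectors above, all() and any() collapse a logical vector to one value, and sum() counts how many entries are TRUE

```r
x <- c(2, 6, 10)
y <- c(3, 5, 10)

all_le <- all(x <= y)   # TRUE only if every comparison is TRUE
any_eq <- any(x == y)   # TRUE if at least one comparison is TRUE
n_le <- sum(x <= y)     # TRUE counts as 1, so this counts the TRUE entries
```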

Boolean Operators (and)

TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE
FALSE & FALSE
[1] FALSE

Boolean Operators (or)

TRUE | TRUE
[1] TRUE
TRUE | FALSE
[1] TRUE
FALSE | FALSE
[1] FALSE
  • Note that, unlike in everyday English, or is inclusive when programming, i.e. if both inputs to | are TRUE then so is the output

Boolean Operators (not)

!TRUE
[1] FALSE
!FALSE
[1] TRUE

Combining Boolean Operators

(4 > 3) & (2 == 1)
[1] FALSE
(7 != 5) | (4 < 2)
[1] TRUE
(!(4 < 3)) & (2 == 2)
[1] TRUE
  • The brackets above are not actually needed due to the order of operations in R
  • Until you have had a lot of practice, it is best to always include them, as this avoids errors that are difficult to diagnose
  • See ?Syntax for more information
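  • A small sketch of that order of operations: comparisons are evaluated first, then !, then &, then |. The variable names are illustrative

```r
# comparisons bind tighter than !, which binds tighter than &, then |
a <- 4 > 3 & 2 == 1           # same as (4 > 3) & (2 == 1)
b <- !4 < 3 & 2 == 2          # same as (!(4 < 3)) & (2 == 2)
c1 <- TRUE | FALSE & FALSE    # same as TRUE | (FALSE & FALSE)
c2 <- (TRUE | FALSE) & FALSE  # brackets change the answer
```

  • c1 and c2 differ only in their brackets, which is exactly the kind of mistake that explicit brackets guard against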

Boolean Operators on Vectors

  • Boolean operators are also vectorised
x <- c(1, 5, 9); y <- c(2, 5, 7); z <- c(1, 6, 8)

x == y
[1] FALSE  TRUE FALSE
y < z
[1] FALSE  TRUE  TRUE
(x == y) & (y < z)
[1] FALSE  TRUE FALSE
!(x == y)
[1]  TRUE FALSE  TRUE

Data Transformation

Introduction

Where Are We?

  • We now take a step backwards to look at data transformation using the tidyverse
  • This session is all about taking a dataset, transforming it into a form that meets our needs, and using it to answer questions

Data Analysis Map - Transform

What is dplyr?

  • dplyr is the third tidyverse package that we will be looking at
  • It has five main features, referred to as the dplyr verbs
  • These allow you to filter a dataset or transform it by creating new variables/summaries
  • You can also use these to reorder your observations to make a dataset easier to work with

dplyr Hex

The dplyr Verbs

  • There are five key dplyr functions referred to as verbs.
  • These alone allow you to handle the majority of data manipulation tasks
  • They are:
    • filter() - pick observations by their values
    • arrange() - reorder observations based on their values
    • select() - pick variables by their names
    • mutate() - create new variables as functions of existing variables
    • summarise()/summarize() - collapse many values down to a single summary

The dplyr Verbs (cont.)

  • These verbs can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. More on this later
  • All five main verbs plus the group_by verb work similarly:
    • The first argument is a data frame
    • The subsequent arguments describe what to do with the data frame, using the variable names (without quotes)
    • The result is a new data frame

Filter Rows

Introduction

filter(mpg, cty > 30, cyl == 4)
# A tibble: 2 x 11
  manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class  
  <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>  
1 volkswagen   jetta     1.9  1999     4 manual~ f        33    44 d     compact
2 volkswagen   new be~   1.9  1999     4 manual~ f        35    44 d     subcom~
  • Warning: Don't confuse == with = else you'll get the error

    Error: [...] must not be named, do you need '=='?
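  • Conditions separated by commas must all hold, so they behave like conditions joined with &. A minimal sketch on a made-up table (small_cars and its values are assumptions for illustration):

```r
library(dplyr)

# small made-up table, for illustration only
small_cars <- tibble(model = c("a", "b", "c", "d"),
                     cty = c(33, 35, 28, 31),
                     cyl = c(4, 4, 4, 6))

with_comma <- filter(small_cars, cty > 30, cyl == 4)
with_and <- filter(small_cars, cty > 30 & cyl == 4)

# both versions keep the same rows
all.equal(with_comma, with_and)
```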

Filtering with Boolean Operators

  • We can use Boolean operators in filter clauses
filter(mpg, model == 'land cruiser wagon 4wd' | displ > 6.8)
# A tibble: 3 x 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 chevrolet    corvette     7    2008     8 manua~ r        15    24 p     2sea~
2 toyota       land crui~   4.7  1999     8 auto(~ 4        11    15 r     suv  
3 toyota       land crui~   5.7  2008     8 auto(~ 4        13    18 r     suv  
  • Another useful operator is %in% which checks if a value is in a vector
filter(band_members, name %in% c('John', 'Paul'))
# A tibble: 2 x 2
  name  band   
  <chr> <chr>  
1 John  Beatles
2 Paul  Beatles
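  • As a sketch of what %in% is doing, it returns one TRUE/FALSE per element on its left-hand side, which makes it a shorthand for a chain of == comparisons joined with |

```r
library(dplyr)

# one logical value per element of the left-hand vector
matches <- c("Mick", "John", "Paul") %in% c("John", "Paul")

# so these two filters keep the same rows
with_in <- filter(band_members, name %in% c("John", "Paul"))
with_or <- filter(band_members, name == "John" | name == "Paul")
```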

Handling Floating-point Numbers

  • Computers cannot perform arithmetic with infinite precision. This leads to peculiar results.
sqrt(2) ^ 2 == 2
[1] FALSE
(1 / 49) * 49 == 1
[1] FALSE
  • When handling values that are not integers, use the near() function to check for equivalence.
near(sqrt(2) ^ 2, 2)
[1] TRUE
near((1 / 49) * 49, 1)
[1] TRUE
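  • near() is vectorised, so it can also be used inside filter(). A minimal sketch on a made-up table (roots, exact and close_enough are illustrative names):

```r
library(dplyr)

exact_cmp <- 0.1 + 0.2 == 0.3        # exact comparison fails
near_cmp <- near(0.1 + 0.2, 0.3)     # comparison within a small tolerance

# made-up table of square roots, for illustration only
roots <- tibble(n = c(2, 3, 4), root = sqrt(n))

exact <- filter(roots, root ^ 2 == n)             # loses rows to rounding error
close_enough <- filter(roots, near(root ^ 2, n))  # keeps every row
```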

Handling Missing Values

  • In many cases you may want to either remove missing values or look only at rows which contain a missing value
  • The is.na() function can be used for this purpose
df <- tibble(x = c(1, NA, 3), y = c(4, 5, 6))
filter(df, is.na(x))
# A tibble: 1 x 2
      x     y
  <dbl> <dbl>
1    NA     5
filter(df, !is.na(x))
# A tibble: 2 x 2
      x     y
  <dbl> <dbl>
1     1     4
2     3     6
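  • Be aware that filter() only keeps rows where the condition is TRUE, so rows where it evaluates to NA are dropped silently. A short sketch with the same df (dropped and kept are illustrative names):

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3), y = c(4, 5, 6))

# NA > 1 is NA, so the row with a missing x is removed
dropped <- filter(df, x > 1)

# ask for the missing rows explicitly to keep them
kept <- filter(df, is.na(x) | x > 1)
```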

Arrange Rows

Introduction

  • arrange() orders observations by one or more variables in ascending order
  • The observations are ordered by the first column, then ties are settled by the second, etc.
# order by class, break ties with year
arrange(mpg, class, year)
# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 chevrolet    corvette   5.7  1999     8 manual~ r        16    26 p     2sea~
 2 chevrolet    corvette   5.7  1999     8 auto(l~ r        15    23 p     2sea~
 3 chevrolet    corvette   6.2  2008     8 manual~ r        16    26 p     2sea~
 4 chevrolet    corvette   6.2  2008     8 auto(s~ r        15    25 p     2sea~
 5 chevrolet    corvette   7    2008     8 manual~ r        15    24 p     2sea~
 6 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
 7 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
 8 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
 9 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
10 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
# ... with 224 more rows

Descending Order

  • You can wrap a variable in desc() to use descending order for that column
arrange(mpg, class, desc(year))
# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 chevrolet    corvette   6.2  2008     8 manual~ r        16    26 p     2sea~
 2 chevrolet    corvette   6.2  2008     8 auto(s~ r        15    25 p     2sea~
 3 chevrolet    corvette   7    2008     8 manual~ r        15    24 p     2sea~
 4 chevrolet    corvette   5.7  1999     8 manual~ r        16    26 p     2sea~
 5 chevrolet    corvette   5.7  1999     8 auto(l~ r        15    23 p     2sea~
 6 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
 7 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
 8 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
 9 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p     comp~
10 audi         a4 quat~   2    2008     4 auto(s~ 4        19    27 p     comp~
# ... with 224 more rows
  • Note that, by convention, TRUE is greater than FALSE since the underlying representations of these values are 1 and 0 respectively

Missing Values

  • Missing values are always sorted at the end
df <- tibble(x = c(1, NA, 5))
arrange(df, x)
# A tibble: 3 x 1
      x
  <dbl>
1     1
2     5
3    NA
arrange(df, desc(x))
# A tibble: 3 x 1
      x
  <dbl>
1     5
2     1
3    NA
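  • If you want the missing values first instead, one possible workaround is to sort on is.na(x) in descending order, since TRUE sorts after FALSE. This is a sketch, not the only way to do it

```r
library(dplyr)

df <- tibble(x = c(1, NA, 5))

# is.na(x) is TRUE for missing rows; desc() puts TRUE before FALSE,
# then the remaining rows are ordered by x as usual
na_first <- arrange(df, desc(is.na(x)), x)
na_first
```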

Select Columns

Introduction

  • It is not unusual to be given a dataset with hundreds or even thousands of columns
  • In this case, you may want to narrow this down to variables you actually care about
select(iris, Petal.Length, Petal.Width, Species)
# A tibble: 150 x 3
   Petal.Length Petal.Width Species
          <dbl>       <dbl> <fct>  
 1          1.4         0.2 setosa 
 2          1.4         0.2 setosa 
 3          1.3         0.2 setosa 
 4          1.5         0.2 setosa 
 5          1.4         0.2 setosa 
 6          1.7         0.4 setosa 
 7          1.4         0.3 setosa 
 8          1.5         0.2 setosa 
 9          1.4         0.2 setosa 
10          1.5         0.1 setosa 
# ... with 140 more rows

Dropping Columns

  • You can get rid of columns by prefixing their name with a - symbol
select(iris, -Species)
# A tibble: 150 x 4
   Sepal.Length Sepal.Width Petal.Length Petal.Width
          <dbl>       <dbl>        <dbl>       <dbl>
 1          5.1         3.5          1.4         0.2
 2          4.9         3            1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5           3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
 7          4.6         3.4          1.4         0.3
 8          5           3.4          1.5         0.2
 9          4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
# ... with 140 more rows

Selecting Ranges

  • You can use the : operator to select or remove columns in a range (inclusive)
select(iris, Sepal.Length:Petal.Length)
# A tibble: 150 x 3
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# ... with 140 more rows

Selecting Ranges (cont.)

  • This also works with -
select(iris, -(Sepal.Length:Petal.Length))
# A tibble: 150 x 2
   Petal.Width Species
         <dbl> <fct>  
 1         0.2 setosa 
 2         0.2 setosa 
 3         0.2 setosa 
 4         0.2 setosa 
 5         0.2 setosa 
 6         0.4 setosa 
 7         0.3 setosa 
 8         0.2 setosa 
 9         0.2 setosa 
10         0.1 setosa 
# ... with 140 more rows

Selection Helper Functions

  • There are a number of helper functions you can use within select():
    • starts_with('abc')
    • ends_with('xyz')
    • contains('ijk')
    • num_range('var', 2:4) - matches var2, var3, var4
    • everything() - matches all variables
  • See ?select_helpers/?select for more information
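  • A brief sketch of some of these helpers applied to iris. names() is used only to keep the output short

```r
library(dplyr)

petal_cols <- names(select(iris, starts_with("Petal")))
width_cols <- names(select(iris, ends_with("Width")))
sepal_cols <- names(select(iris, contains("Sepal")))

# everything() is handy for moving a column to the front
reordered <- names(select(iris, Species, everything()))
```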

Renaming Columns

  • select() can be used to rename columns but this can be more easily achieved using rename()
rename(iris, length_of_sepal = Sepal.Length, width_of_sepal = Sepal.Width)
# A tibble: 150 x 5
   length_of_sepal width_of_sepal Petal.Length Petal.Width Species
             <dbl>          <dbl>        <dbl>       <dbl> <fct>  
 1             5.1            3.5          1.4         0.2 setosa 
 2             4.9            3            1.4         0.2 setosa 
 3             4.7            3.2          1.3         0.2 setosa 
 4             4.6            3.1          1.5         0.2 setosa 
 5             5              3.6          1.4         0.2 setosa 
 6             5.4            3.9          1.7         0.4 setosa 
 7             4.6            3.4          1.4         0.3 setosa 
 8             5              3.4          1.5         0.2 setosa 
 9             4.4            2.9          1.4         0.2 setosa 
10             4.9            3.1          1.5         0.1 setosa 
# ... with 140 more rows

Add New Variables

Introduction

  • mutate() lets you transform one or more variables currently in your dataset to create a new one
  • The new column is added at the end of the dataset or in the same place as before if it overwrites an existing column
temps <- tibble(day = c('Mon', 'Tues', 'Wed'), temp_c = c(22, 24, 19))
mutate(temps, temp_f = temp_c * 1.8 + 32)
# A tibble: 3 x 3
  day   temp_c temp_f
  <chr>  <dbl>  <dbl>
1 Mon       22   71.6
2 Tues      24   75.2
3 Wed       19   66.2

Functions of Multiple Variables

  • You can use multiple variables to create a new column
health <- tibble(name = c('Ann', 'Bob', 'Charlie'), 
                 weight = c(71, 87, 65),
                 height = c(1.68, 1.77, 1.72))
mutate(health, bmi = weight / height ^ 2)
# A tibble: 3 x 4
  name    weight height   bmi
  <chr>    <dbl>  <dbl> <dbl>
1 Ann         71   1.68  25.2
2 Bob         87   1.77  27.8
3 Charlie     65   1.72  22.0

Referencing New Columns

  • You are also allowed to reference columns that you've just created
sprint <- tibble(time_sec = c(1, 3, 5), dist_m = c(8, 26, 47))
mutate(sprint, 
       avg_speed_m_sec = dist_m / time_sec,
       avg_speed_km_hr = avg_speed_m_sec * 3.6)
# A tibble: 3 x 4
  time_sec dist_m avg_speed_m_sec avg_speed_km_hr
     <dbl>  <dbl>           <dbl>           <dbl>
1        1      8            8               28.8
2        3     26            8.67            31.2
3        5     47            9.4             33.8

Useful Creation Functions

  • Basic arithmetic operators: +, -, *, /, ^
  • Aggregate functions: x / sum(x) (proportion), y - mean(y) (centring)
  • Modular arithmetic: %/% (integer division), %% (remainder)
times <- tibble(time = c(0923, 1321, 1908))
mutate(times,
       hour = time %/% 100,
       min = time %% 100)
# A tibble: 3 x 3
   time  hour   min
  <dbl> <dbl> <dbl>
1   923     9    23
2  1321    13    21
3  1908    19     8
  • Logarithms: log() (natural), log2(), log10()
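  • A minimal sketch of the aggregate and logarithm patterns above inside mutate(). The sales table and its column names are made up for illustration

```r
library(dplyr)

# made-up sales figures, for illustration only
sales <- tibble(shop = c("A", "B", "C"), units = c(20, 30, 50))

sales_extra <- mutate(sales,
                      share = units / sum(units),     # proportion of the total
                      centred = units - mean(units),  # distance from the mean
                      log10_units = log10(units))
sales_extra
```

  • sum(units) and mean(units) each return a single value, which R recycles across every row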

Useful Creation Functions (cont.)

  • Cumulative aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()
mutate(tibble(x = c(1, 4, 0)), sum = cumsum(x), prod = cumprod(x), 
       min = cummin(x), max = cummax(x), mean = cummean(x))
# A tibble: 3 x 6
      x   sum  prod   min   max  mean
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     1  1   
2     4     5     4     1     4  2.5 
3     0     5     0     0     4  1.67
  • Logical Comparisons: <, <=, >, >=, ==, !=
mutate(tibble(time = c(1147, 1252)), afternoon = time >= 1200)
# A tibble: 2 x 2
   time afternoon
  <dbl> <lgl>    
1  1147 FALSE    
2  1252 TRUE     

Summarise Values

Introduction

  • summarise() (or US spelling summarize()) allows you to collapse a data frame to a single row based on an aggregation function
profits <- tibble(day = c('Mon', 'Tues', 'Wed', 'Thurs', 'Fri'), 
                  profit = c(323, 432, 491, NA, 402))
summarise(profits, avg_profit = mean(profit, na.rm = TRUE))
# A tibble: 1 x 1
  avg_profit
       <dbl>
1        412

Group Summaries

  • summarise() by itself is not very useful. We could have done this with mean(profits$profit, na.rm = TRUE)
  • The real power comes when it is combined with the group_by() function
profits <- tibble(day = c('Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'),
                  wkdy = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
                  profit = c(323, 432, 491, NA, 402, 631, 583))
by_wkdy <- group_by(profits, wkdy)
summarise(by_wkdy, avg_profit = mean(profit, na.rm = TRUE))
# A tibble: 2 x 2
  wkdy  avg_profit
  <lgl>      <dbl>
1 FALSE        607
2 TRUE         412

Multiple Summaries

  • You can summarise multiple variables in one go or have multiple summaries for a single variable
by_species <- group_by(iris, Species)
summarise(by_species, mean_sepal_len = mean(Sepal.Length),
          count = n(),  # use function n() to count how many in each group
          range_of_petal_width = diff(range(Petal.Width)))
# A tibble: 3 x 4
  Species    mean_sepal_len count range_of_petal_width
  <fct>               <dbl> <int>                <dbl>
1 setosa               5.01    50                  0.5
2 versicolor           5.94    50                  0.8
3 virginica            6.59    50                  1.1

Grouping by Multiple Variables

  • You are not limited to grouping by only one variable
mon_split <- mutate(airquality, mon_half = ifelse(Day <= 15, 'Start', 'End'))
airquality_grpd <- group_by(mon_split, Month, mon_half)
summarise(airquality_grpd, med_wind = median(Wind))
# A tibble: 10 x 3
# Groups:   Month [5]
   Month mon_half med_wind
   <int> <chr>       <dbl>
 1     5 End         11.8 
 2     5 Start       10.9 
 3     6 End          9.2 
 4     6 Start        9.7 
 5     7 End          8.3 
 6     7 Start        9.2 
 7     8 End          8.85
 8     8 Start        8.6 
 9     9 End         10.3 
10     9 Start       10.3 
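  • The "# Groups: Month [5]" line shows that the result is still grouped by Month: summarise() drops only the last grouping variable. A sketch of how to inspect and remove the remaining grouping (wind_summary is an illustrative name):

```r
library(dplyr)

mon_split <- mutate(airquality, mon_half = ifelse(Day <= 15, 'Start', 'End'))
airquality_grpd <- group_by(mon_split, Month, mon_half)
wind_summary <- summarise(airquality_grpd, med_wind = median(Wind))

group_vars(airquality_grpd)  # grouped by Month and mon_half
group_vars(wind_summary)     # the last grouping variable has been dropped

# ungroup() removes any remaining grouping
fully_ungrouped <- ungroup(wind_summary)
group_vars(fully_ungrouped)
```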

Aside: Piping

  • dplyr comes with an amazing tool called the pipe - %>%
  • This allows you to pass the output of one function into the first argument of the next
  • Using this, the previous code can be rewritten as
airquality %>%
  mutate(mon_half = ifelse(Day <= 15, 'Start', 'End')) %>%
  group_by(Month, mon_half) %>%
  summarise(med_wind = median(Wind))
  • Essentially, you only need to mention a single data frame at the start and never need to create temporary objects
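  • Put another way, x %>% f(y) is the same as f(x, y). The sketch below checks this on a shorter version of the airquality example (nested and piped are illustrative names)

```r
library(dplyr)

# nested calls: read from the inside out
nested <- summarise(group_by(airquality, Month), med_wind = median(Wind))

# piped calls: read from top to bottom
piped <- airquality %>%
  group_by(Month) %>%
  summarise(med_wind = median(Wind))

all.equal(nested, piped)
```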

Combining dplyr with ggplot

  • Question: of all diamonds of colour G, what is the median volume (assume approximate ellipsoidality) for each cut?
diamonds %>%
  filter(color == 'G') %>%
  # volume of ellipsoid: https://en.wikipedia.org/wiki/Ellipsoid#Volume
  # x, y and z are full lengths in mm, so halve them to get the semi-axes
  mutate(vol = 4/3 * pi * (x / 2) * (y / 2) * (z / 2)) %>%
  group_by(cut) %>%
  summarise(med_vol = median(vol)) %>%
  ggplot(aes(x = cut, y = med_vol, fill = cut)) +
    # we'll learn more about geom_col in session 5
    geom_col()
  • This is a very complex example so try to work through it bit by bit