Into the Tidyverse | Session Two

Warwick Data Science Society

Acknowledgements

DigitalOcean Logo

Recap

Basic Scatter Plots

The basic structure of the code needed to generate a scatter plot is

ggplot(<DATA>) +
  <GEOM_FUNCTION>(aes(x = <VAR1>, y = <VAR2>, 
                      <FURTHER MAPPINGS>))

Further mappings include col/colour/color, size, alpha, shape
We manually set aesthetics by placing them outside of aes()

Faceting

We can facet a plot in one of two ways

ggplot(<DATA>) +
  <GEOM_FUNCTION>(aes(<MAPPINGS>)) +
  facet_wrap(~<VAR>)

ggplot(<DATA>) +
  <GEOM_FUNCTION>(aes(<MAPPINGS>)) +
  facet_grid(<ROW_VAR> ~ <COL_VAR>)

Working with Data Frames

Useful functions for inspecting dataframes are
- head(), tail(),
- summary()
- str()
- $, [[...]]
We get more information about a dataset using ?dataset_name

Back to the Basics

Introduction

We now have some experience with running R code
But there is still a lot that we have not covered
Here we cover some fundamentals of coding with R
I'll also offer some tips and tricks for using RStudio

Coding Basics

R as a Calculator

We have already seen that R can be used as a basic calculator
This functionality extends beyond basic arithmetic

(58 + 73 * 2) / 3  # normal rules of BIDMAS apply

[1] 68

sin(pi / 2)

[1] 1

sqrt(81)

[1] 9

log(42)  # natural logarithm - also called ln()

[1] 3.73767

Objects

R is capable of storing any value in an object (also called variables)
Objects are simply a way of labelling a piece of data for later use
In R, objects are assigned using the <- operator

x <- 3 * 4

Once an object has been assigned a value, it can be referenced by using its name

[1] 12

x / 2

[1] 6

Objects (cont.)

All R statements where you create objects have the form

object_name <- value

An object will retain its value until one of the following occur:
- You close RStudio
- You delete the object manually
- You overwrite its value with another
You can see a list of all objects in the 'Environment' pane in RStudio
Many programming languages use the = symbol for assignment and R will accept this too; this will however cause confusion later, so please avoid doing so
If you are getting tired of tying <- you can use the keyboard shortcut alt-minus in RStudio to insert it

What's in a Name?

Object names must start with a letter, and can only contain letters, numbers, _, and .
A good object name will be descriptive and follow the same convention as the rest of your code
Don't be afraid of long variable names! RStudio is capable of code completion. Simply type the first few letters of a variable name (or function, for that matter), press TAB, select the correct name from the drop-down list, and press ENTER.
Beware: Object names in R are case-sensitive and are intolerant to typos!

Aside: Code History

When typing in the console, you can use the up and down arrow keys to jump between previously entered lines of code
This is especially helpful if you made a mistake in a line of code and which to correct it before re-running
You can see your full code history in the 'History' pane in RStudio

Functions

Calling Functions

R has a large collection of built-in functions that are called like this:

function_name(arg1 = val1, arg2 = val2, ...)

An example of such a function is seq() which makes a regular sequence of numbers

seq(from = 1, to = 10)

 [1]  1  2  3  4  5  6  7  8  9 10

For many functions, it is unnecessary to name the arguments as there is an obvious order that they should be entered in

seq(1, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

There are ways to check which functions allow this, but for now, trial and error plus some common sense will suffice

Common Problems

If the symbol at the start of the R console changes from a > to a + then it means that R is expecting more input. Perhaps you forgot to close a bracket. If you can't find how to make R happy, you can press the ESC key to exit the current statement and start afresh.
If you receive an object not found error then it is most likely that you've made a typo or have forgotten to assign a value to a variable you are referencing
If you are unsure of what arguments a specific function takes, have a look at the relevant help page using the ? operator

Data Import

Introduction

Where Are We?

Now that we have the basics of data visualisation under our belt, we can return to the beginning of the data science pipeline
In doing this, we are departing from the order that R4DS takes, although I think this is best so we can get our hands on the weather dataset

Data Analysis Map - Import

What is readr?

readr is another of the many packages included in the tidyverse
It allows you to import data from a wide variety of storage formats
It is designed to balance import speed, ease-of-use, and consistency

readr Hex

First Steps

CSV files

Common-separated value (CSV) files are almost certainly the most common storage format for flat data
Each observation occupies its own line and fields are separated by commas (hence the name)
CSVs can contain a header row with column names, although this is not always the case

Example of a CSV file

Importing CSV files

You can import a CSV file using the read_csv() function
Don't forget to import the tidyverse package first!
The first argument is the most important; it's the path to the file to read
The path of the file should be relative to your current working directory
This can be changed using Session > Set Working Directory > Choose Directory or by using Ctrl+Shift+H

Importing CSV files (cont.)

people_df <- read_csv("data/people.csv")

Parsed with column specification:
cols(
  earn = col_double(),
  height = col_double(),
  sex = col_character(),
  ed = col_double(),
  age = col_double(),
  race = col_character()
)

When using the read_csv() function, you are outputted a message containing the column names and types
Warning: Don't confuse readr's read_csv() function with the built-in read.csv() function

Importing CSV files (cont.)

people_df

# A tibble: 1,192 x 6
    earn height sex       ed   age race    
   <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
 1 50000   74.4 male      16    45 white   
 2 60000   65.5 female    16    58 white   
 3 30000   63.6 female    16    29 white   
 4 50000   63.1 female    16    91 other   
 5 51000   63.4 female    17    39 white   
 6  9000   64.4 female    15    26 white   
 7 29000   61.7 female    12    49 white   
 8 32000   72.7 male      17    46 white   
 9  2000   72.0 male      15    21 hispanic
10 27000   72.2 male      12    26 white   
# ... with 1,182 more rows

Aside: Tibbles

Tibbles are just standard R data frames, but they tweak some old behaviours to make life a little easier
For example, Tibbles limit how many rows and columns are printed in one go to avoid flooding the console
They also give you useful tips such a column types and the dimension of the data frame
A standard R data frame can be converted to a tibble using the as_tibble() function

mtcars_tb <- as_tibble(mtcars)

In-line CSV files

For the purposes of testing code, the read_csv() function can process an in-line CSV
This is also important for creating reproducible examples

read_csv(
  "a, b, c
   1, 2, 3
   4, 5, 6"
)

# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

readr Parameters

Skipping Rows

As seen in both previous examples, the default behaviour for read_csv() is to use the first line for column names
This convention can be circumvented by using either the skip or comment parameter

Skipping Rows (cont.)

The skip parameter can be used to skip a specified number of rows in the CSV file before starting to read data
This is useful when your CSV contains metadata at the top of the file

read_csv(
  "The first line of metadata
   The second line of metadata
   x, y, z
   1, 2, 3",
  skip = 2
)

# A tibble: 1 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

Skipping Rows (cont.)

The comment parameter can be used to drop all lines starting with a specified character

read_csv(
  "# This line is a comment
   x, y, z
   # So is this one
   1, 2, 3",
  comment = '#'
)

# A tibble: 1 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

Manual Column Names

It may be the case that the CSV file you are importing does not have column names
In this case you can use the col_names parameter to control what to do

Manual Column Names (cont.)

Setting col_names = FALSE will tell read_csv() that the first row of the CSV file is data and that the column should be labelled using generic names 'X1', 'X2', etc.

read_csv(
  "1, 2, 3
   4, 5, 6",
  col_names = FALSE
)

# A tibble: 2 x 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Manual Column Names (cont.)

Alternatively, if you know what the column names should be, you can specify then using a character vector

read_csv(
  "1, 2, 3
   4, 5, 6",
  col_names = c("x", "y", "z")
)

# A tibble: 2 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Aside: Vectors

In R, a vector is a collection of elements of the same type
There are integer vectors, character vectors, double vectors, and many more
Vectors are created using the c() function. This stands for combine

x <- c(1, 4, 9)

Most mathematical functions in R act element-wise on vectors

sqrt(x)

[1] 1 2 3

A particular value of a vector can be accessed using the [ accessor

x[2]  # R uses 1-based-indexing

[1] 4

Missing Values

In the real world, it is rare to ever have a complete data set
Missing values can be encoded using a range of symbols - ., ?, N/A, -, etc.
You can tell readr() which symbol is used in a given CSV file using the na parameter

read_csv(
  "x, y, z
   1, 2, .
   4, ., 6",
  na = '.'
)

# A tibble: 2 x 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2    NA
2     4    NA     6

Summary

With these few techniques, one is most likely capable of reading about 75% of all CSV files that come up in 'the wild'
These techniques easily be adapted for use with other readr functions

Column Types

Manual Column Types

R4DS now spends a dozen or so pages discussing how to control the parsing of column types
This is worth a read if you're interested but I will summarise the main points in a few slides
readr uses a heuristic to figure out the type of each column from the first 1000 rows
This means that it can sometimes get these wrong
In this case, it is worth re-importing but with column types specified manually
The most common issues are integers being read as doubles, factors (categorical variables) being read as strings, or dates being read as strings

Manual Column Types (cont.)

You can manually specify column types using the col_types parameter with the cols() function

people <- read_csv('data/people.csv',
                   col_types = cols(
                     earn = col_double(),
                     sex = col_factor(),
                     age = col_integer()
                   ))

Any columns not specified will be guessed
A column can be skipped by setting its type to col_skip()
Valid column types are logical, integer, double, character, factor, date, time, datetime, number (this is only used in special cases)

(A Tiny Bit More) Data Visualisation

More Geometries

Line plots

So far, we have only produced scatter plots and that's a bit boring
By changing which geometry we use, we generate a line plot
A dataset suited to this type of graph is the oranges dataset

Orange <- as_tibble(Orange)
head(Orange)

# A tibble: 6 x 3
  Tree    age circumference
  <ord> <dbl>         <dbl>
1 1       118            30
2 1       484            58
3 1       664            87
4 1      1004           115
5 1      1231           120
6 1      1372           142

How would we learn more about the Orange dataset?

Line plots (cont.)

A line plot is produced by using the geom_line() geometry
Note that I am specifying parameters by position rather than by name

ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, col = Tree), size = 2)

plot of chunk unnamed-chunk-34

Smooth Fitted Lines (cont.)

A similar geometry is geom_smooth(). This adds a smooth fitted line to a plot
This can be combined with geom_point() to create the following graph
Use can use the se parameter to control whether confidence intervals are shown. These are best turned off for now

ggplot(Orange) +
  geom_smooth(aes(x = age, y = circumference, col = Tree), size = 2, se = FALSE) +
  geom_point(aes(x = age, y = circumference, col = Tree), size = 2)

plot of chunk unnamed-chunk-35

Global Aesthetics

The previous code is repetitive as we specify the mapping for aesthetics x, y, and col twice
We can instead specify these as the second argument of ggplot()
These will be treated as global variables mappings
You can still specify local mappings in either of the geometry functions

ggplot(Orange, aes(x = age, y = circumference, col = Tree)) +
  geom_smooth(size = 2, se = FALSE) +
  geom_point(aes(size = circumference))

plot of chunk unnamed-chunk-36

Aesthetics for line graphs

Both line graph geometries both come with a linetype aesthetic to control what type of line is used
This should have a categorical variable mapped to it or manually set to a positive integer value/name of a line type
Both aesthetics also has a group() aesthetic which allows you to separate the full data set before generating the lines (see exercises for examples of this)
This can be set to -1 to use no grouping

Into the Tidyverse | Session Two

Acknowledgements

Recap

Basic Scatter Plots

Faceting

Working with Data Frames

Back to the Basics

Introduction

Coding Basics

R as a Calculator

Objects

Objects (cont.)

What's in a Name?

Aside: Code History

Functions

Calling Functions

Common Problems

Data Import

Introduction

Where Are We?

What is readr?

First Steps

CSV files

Importing CSV files

Importing CSV files (cont.)

Importing CSV files (cont.)

Aside: Tibbles

In-line CSV files

readr Parameters

Skipping Rows

Skipping Rows (cont.)

Skipping Rows (cont.)

Manual Column Names

Manual Column Names (cont.)

Manual Column Names (cont.)

Aside: Vectors

Missing Values

Summary

More readr Functions

More Delimited Files

Fixed-width files (Omitted)

Importing fixed-width files

Importing fixed-width files (cont.)

Importing fixed-width files (cont.)

Importing Foreign Data

Column Types

Manual Column Types

Manual Column Types (cont.)

(A Tiny Bit More) Data Visualisation

More Geometries

Line plots

Line plots (cont.)

Smooth Fitted Lines (cont.)

Global Aesthetics

Aesthetics for line graphs