Into the Tidyverse | Session One

Warwick Data Science Society

Accessing Resources

These slides, as well as other resources for the course, can be found on the corresponding GitHub repository:

https://github.com/warwickdatascience/into-the-tidyverse

An easier way to access these is through the course website:

https://warwickdatascience.github.io/into-the-tidyverse

Introductions

About The Course

Prerequisites

The only real expectation of this course is that you have general numerical literacy and can navigate around a computer
Previous programming experience will make learning easier but is not expected at all

Basis of the Course

Based on the book R for Data Science by Hadley Wickham & Garrett Grolemund
Many sections have been removed (originally ~500 pages)
Freely available to read at r4ds.had.co.nz

This course differs from R4DS in two ways:

More focus on immediate gains rather than complete mastery of the tidyverse
Exercises have a more practical focus

R for Data Science Cover

What Will This Course Teach You?

This course aims to teach you how to:

Set up R, RStudio, and the tidyverse on your computer
Import a data set from the web and clean it
Wrangle a data set from one form to another
Transform a data set to help answer a question
Visualise a data set using plots and graphs
Communicate data honestly and effectively

What Won't This Course Teach You?

Sections of the R4DS book not covered in this course include:

Working with complicated data types
Handling non-categorical text data
Iterative techniques (loops and maps)
Statistical modelling
Communicating results with wider audiences

Teaching Style

Fast and light
Goal is to show you what is available and where to find resources
Actual learning happens with practice

Why Bother Learning the Tidyverse?

Why learn data analytics?

Data skills are currently in high demand
In our modern economy, almost all professions process data in some way

Why use R/the tidyverse?

Far more powerful and expandable than Excel or Tableau
Open-source and free to use (unlike SAS or SPSS)
A large and beginner-friendly community
Often more intuitive for data analysis than general-purpose programming languages (Python, Julia, JavaScript, etc.)

Course Agenda

The course will be run weekly for five weeks
This will be followed by a data visualisation competition
Each session will introduce new material and work through some examples
A homework sheet will then be distributed to apply the new techniques

The Tidy-what?

Foundations

What is R?

Programming language for statistical computing and graphics
Built upon an historic language (S) which was developed in the mid-70s
Known for being highly extensible
Completely free and open-source with a large, friendly community

R is an incredibly powerful tool. It can be used for machine learning, statistical modelling, big data, and much more. It is also capable of producing websites, interactive notebooks, and presentations (such as this one!).

Setting Up R

Installation

Go to CRAN, the Comprehensive R Archive Network
Select the download link that matches your operating system
Choose the base installation and download the latest version
Install as you would any other program

Running

If the installation was successful, you will have a new entry in your start menu called R x64 [Version Number]
Clicking this will open a program in which you can write R code

Limitations of Base R

Problem: The standard R code editor is difficult to use, missing useful features, and frankly quite ugly
Solution: Install RStudio and use it on top of R

What is RStudio?

Integrated development environment (IDE) specifically made for programming in R
Contains a rich set of features to allow you to code in R with less hassle
Displays your code and its output in a much clearer way and gives you access to extra features such as help files, code history, and environment variables

Essentially, unless you are trying to provoke a headache, it is advised to do all R programming in RStudio.

Setting Up RStudio

The latest version of RStudio can easily be downloaded from www.rstudio.com/download
This can also be reached by simply searching for RStudio
Once installed, you can open the program and all of the behind-the-scenes communication with your R installation will be done automatically.

NB: RStudio does not work as a standalone program. You will still need R installed even if you intend to only write code in RStudio.

Navigating RStudio

Opening RStudio will confront you with an interface similar to this
It is likely, though, that the code editor will be hidden, with the console taking up the whole of the left-hand side
For the time being, we will type our code directly into the console

RStudio Interface

Running R Code

R code can be executed by typing directly into the console and pressing the enter key
For example, you can add together two numbers with

1 + 2

[1] 3

Here the first line shows what I typed into the console and the second shows the output that R gave us
Have a go at writing your own mathematical expressions! The main operators are +, -, *, /, and ^. Figure out what these do by trying some simple examples.
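
As a starting point, here is a short sketch of each operator in action. The numbers are arbitrary, and the text after each # is a comment that R ignores.

```r
7 + 2        # addition, gives 9
7 - 2        # subtraction, gives 5
7 * 2        # multiplication, gives 14
7 / 2        # division, gives 3.5
7 ^ 2        # exponentiation (7 squared), gives 49
(1 + 2) * 3  # brackets control the order of evaluation, gives 9
```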

The Tidyverse

What is the Tidyverse?

The power of R comes from its extensibility
Whereas the built-in capabilities of R are somewhat limited, additional packages can be installed to let you do more powerful things
A package is a collection of functions, data, and documentation which extend the usual capacity of R
The tidyverse is a collection of packages for performing clean and efficient data analysis

What is the Tidyverse? (cont.)

The packages in the tidyverse are each designed for a specific area of data analysis
For example, there are packages for data importing, visualisation, and data manipulation
They share a common philosophy of data and so work completely naturally together

Tidyverse Hex

Using the Tidyverse

The tidyverse can be installed by running the following command in the RStudio console

install.packages('tidyverse')

You only need to install this package once per computer, though running the command multiple times will do no harm
If you then wish to use the tidyverse, you must first run the command

library('tidyverse')

This must be re-run each time you open RStudio
Think of install.packages(...) as installing a new piece of software, and library(...) as clicking its icon to open it.
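
If you would rather not reinstall the package every time you run your code, one possible pattern is sketched below. It uses base R's requireNamespace() to check whether the tidyverse is already installed; this is just one way of doing it, not something the course requires.

```r
# Install the tidyverse only if it is not already available on this computer
if (!requireNamespace('tidyverse', quietly = TRUE)) {
  install.packages('tidyverse')
}

# Load it for the current session (needed each time RStudio is opened)
library('tidyverse')
```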

Data Visualisation

Introduction

Where Are We?

When learning new data analytics techniques, it is useful to know where these new skills fit into the overall 'map' of data analytics
Here is such a map (though there will be many other equally correct representations)

Data Analysis Map - Visualise

Where Are We? (cont.)

In beginning this course by learning how to perform data visualisation, we are taking things slightly out of order
The pay-off for learning data visualisation is very clear, however, so we will start there and return to earlier topics in the future
This means that, until next week, we will be forced to use datasets that come pre-packaged with the tidyverse

What is ggplot2?

ggplot2 is one of the many packages included in the tidyverse
It allows you to easily construct stylish graphs using a coherent and consistent system
A ggplot2 graph follows the grammar of graphics, in which plots are built by adding layers one at a time

ggplot2 Hex

First Steps

The mpg Data Frame

Before we plot our first graph, we need a dataset to work with
Since we do not yet know how to import our own, let's use a built-in dataset - mpg

mpg

# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
 2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
 3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
 4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
 5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
 6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
 7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
 8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
 9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
# … with 224 more rows

The mpg Data Frame (cont.)

This dataset contains observations collected by the US Environmental Protection Agency (EPA) on 38 models of cars
The variables in mpg include:
- displ - a car's engine size (in litres)
- hwy - a car's fuel efficiency on the highway (in miles per gallon)
We can use this dataset to answer the question “do cars with larger engines use more fuel than cars with smaller engines?”
You may know the answer to this already but can we make our answer more precise? Is the trend linear or non-linear? Are there any exceptions to the general trend?

Our First ggplot

We can plot the desired graph using the following code

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

Scatter plot of hwy against displ

Breaking the Code Down

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

We start by calling the ggplot() function. This creates an empty graph which we can then add layers to
We pass in the mpg data frame so that ggplot2 knows we will be using that for our plot
We then use the + symbol to add a new layer. Specifically, we add a point geometry using the geom_point() function
We need to tell ggplot2 which variables in the data frame to map to the various aesthetics of the plot. In this case, we just specify the variables mapping to the x and y coordinates
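
To make each piece explicit, the same plot can be written with the arguments named. This is a sketch of an equivalent form: data is the name of the first argument of ggplot() and mapping is the name of the first argument of geom_point(), so the shorter version above relies on their positions instead.

```r
library('tidyverse')

# Identical to the plot above, but with the argument names spelled out
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```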

A General Template

In general, our code will look something like the following

ggplot(<DATA>) +
  <GEOM_FUNCTION>(aes(<MAPPINGS>))

All we have to do is fill in the blanks
Note that the indentation and spaces are just to help readability; R doesn't care whether or not these are included, though it is advised to include them. The + must, however, come at the end of a line rather than at the start of the next one
For example, we may want to ask how highway (hwy) and city (cty) mileage are related
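
Filling in the template for this question might look like the following sketch. Putting cty on the x-axis and hwy on the y-axis is an arbitrary choice; swapping them would work just as well.

```r
library('tidyverse')

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy))
```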

Aside: Working With Data Frames

Printing Data Frames

An entire data frame can be printed to the console simply by typing its name

mpg

The first few rows of a data frame can be printed using the head() function

head(mpg)

A specific number of rows can be printed by specifying the parameter n

head(mpg, n = 3)

tail() works the same as head() but gives you the bottom few rows. It also has an optional n parameter
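
For instance, a short sketch of tail() used in the same way as head() above:

```r
library('tidyverse')

tail(mpg)         # the last six rows (the default)
tail(mpg, n = 3)  # the last three rows
```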

Data Frame Overviews

The str() function displays the structure of a data frame (column names, data types, etc.)

str(mpg)

The summary() function displays a statistical summary of each column of the data frame

summary(mpg)

Accessing Columns

A specific column of a data frame can be accessed using either the $ or [[]] accessor

mpg$displ

mpg[['displ']]

The latter approach is favoured if a column name contains spaces
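
To see why, here is a small sketch using a made-up data frame. The name example_df and the column engine size are hypothetical and exist only for this illustration; the <- symbol stores the data frame under that name, and tibble() is the tidyverse function for building one by hand.

```r
library('tidyverse')

# Hypothetical data frame whose column name contains a space
example_df <- tibble(`engine size` = c(1.8, 2.0, 2.8))

# The [[]] accessor takes the name as an ordinary character string
example_df[['engine size']]

# The $ accessor still works, but the name must be wrapped in backticks
example_df$`engine size`
```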

Further Aesthetic Mappings

Explaining Outliers

Looking at our graph, we appear to have a few outliers amongst the cars with large engines.
Perhaps introducing another variable into the plot will help us explain this
Using our template from before, this is simple. All we have to do is add one more mapping to our list
Perhaps a sensible suggestion would be to colour the data points by the type of car (SUV, compact, pick-up, etc.)
This information is contained in the class column

Explaining Outliers (cont.)

To add colour to our plot, we simply add a new mapping to the code we had before

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class))

Scatter plot of hwy against displ, coloured by class

It appears that the anomalous points correspond to 2-seater sports cars

More Aesthetics

The point geometry can accept a wide range of aesthetics including:
- size - the size of each point
- alpha - the transparency of each point
- shape - the shape of the plotting character for each point
Depending on what type of data you are using (continuous, discrete, categorical, etc.) certain mappings will be more appropriate
Sometimes, a categorical variable may be stored in a dataset as a discrete/continuous variable. In this case you must explicitly tell ggplot2 that the variable is categorical by wrapping it in the factor() function
Whenever you set an aesthetic mapping, ggplot2 will automatically scale your variable as well as create a legend for the new aesthetic
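
As an illustration of the factor() point, consider the cyl column (number of cylinders), which mpg stores as an integer. The sketch below compares the two approaches; cyl is chosen here only as a convenient example.

```r
library('tidyverse')

# cyl is stored as a number, so ggplot2 uses a continuous colour gradient
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = cyl))

# Wrapping it in factor() gives one distinct colour per cylinder count
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = factor(cyl)))
```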

Manually Setting Aesthetics

So far, we have controlled aesthetics by mapping a column of our data frame to them
What if we want to manually set an aesthetic?
We simply set the aesthetic outside of the aes() function and assign it a value rather than a variable

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = 'orange')

Scatter plot of hwy against displ with orange points

Manually Setting Aesthetics (cont.)

It is important to specify a value that makes sense for the aesthetic:
- A colour should be a character string, either a colour name or a hex colour code (e.g. '#467A9F')
- The size of a point should be a number (in mm)
- The shape of a point should be an integer chosen from the figure across or a single character to be used as the plotting character (e.g. 'P')
- The hollow shapes (0-14) have a border determined by colour; the solid shapes (15-20) are filled with colour; and the filled shapes (21-24) have a border of colour and are filled with fill
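
Several aesthetics can be set manually at once. The following sketch combines the rules above; the particular values are arbitrary choices for illustration.

```r
library('tidyverse')

ggplot(mpg) +
  geom_point(
    aes(x = displ, y = hwy),
    shape = 21,        # a filled shape: a circle with a separate border and fill
    colour = 'black',  # border colour, given as a colour name
    fill = '#467A9F',  # fill colour, given as a hex colour code
    size = 3,          # point size
    alpha = 0.5        # transparency, from 0 (invisible) to 1 (opaque)
  )
```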

Shape List

Facets

Introduction

Aesthetics offer one method of adding additional variables to a plot
Another way, particularly when using categorical variables, is to split a plot into facets
Facets are sub-plots that each show a subset of the entire dataset
Facets can be generated in two ways using either the facet_wrap() or facet_grid() function

Using facet_wrap()

facet_wrap() is used to facet a plot by a single variable
The first argument of facet_wrap() should be a formula
A formula is created using the ~ symbol followed by a variable name
An example formula would be ~ class which you read as “by class”
The variable used in the formula should be discrete
You can control the layout of the facets using the nrow or ncol parameters

Using facet_wrap() (cont.)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

Scatter plots of hwy against displ, faceted by class in two rows

Using facet_wrap() (cont.)

Facets can be combined with aesthetics
A common application of this is adding colour
In this case it is often worth specifying show.legend = FALSE in the geometry layer

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class), show.legend = FALSE) +
  facet_wrap(~ class, nrow = 2)

Scatter plots of hwy against displ, coloured and faceted by class, without a legend

Using facet_grid()

facet_grid() is often used for faceting on a combination of two variables
As with facet_wrap(), the first argument of facet_grid() is also a formula
This time, however, the formula should contain two variable names separated by a ~
The first variable will be used to facet the rows and the second, the columns

Using facet_grid() (cont.)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

Grid of scatter plots of hwy against displ, with rows for drv and columns for cyl

Using facet_grid() (cont.)

If you only wish to use facet_grid() with one variable, you can use a . instead of a variable name on one side of the formula

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Scatter plots of hwy against displ, with one column of facets for each value of cyl
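
Placing the . on the other side of the formula should facet by rows instead. A sketch of this, using drv as the row variable:

```r
library('tidyverse')

# One row of facets per value of drv, and no faceting across columns
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
```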

Getting Help

Course Resources

This course is designed to be learnt by doing, not by watching
Mistakes are expected, and it will take time to move these techniques into your memory
These slides will be available to review at any time
I am contactable on LinkedIn: https://www.linkedin.com/in/tim-hargreaves/
Feel free to work together and ask lots of questions

Online Resources

Stack Overflow is a great resource - search before asking your own question, as almost every beginner question has already been asked
Simply web-searching your problem followed by one of 'R', 'Tidyverse', or the specific tidyverse package you are using will return many helpful guides
The tidyverse has a website with help guides, tutorials, and full documentation
You can also check out the cheat sheets for each of the packages we're using, though these are quite in-depth

Help in R

The R language contains well-written and lengthy help documents
You can access these using the ? symbol

# help on a dataset
?mpg

# help on a function - don't add brackets!
?facet_wrap
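
A couple of related base R tools may also be useful. The sketch below assumes the tidyverse has been loaded so that the ggplot2 help pages can be found; the search term facet is just an example.

```r
library('tidyverse')

# help() is the long form of ?, taking the topic as a character string
help('facet_wrap')

# ?? searches all installed help pages when you don't know the exact name
??facet
```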