Warwick Data Science Society
These slides, as well as other resources for the course, can be found on the corresponding GitHub repository:
https://github.com/warwickdatascience/into-the-tidyverse
An easier way to access these is through the course website:
This course differs from R4DS in two ways:
This course aims to teach you how to:
Sections of the R4DS book not covered in this course include:
Why learn data analytics?
Why use R/the tidyverse?
R is an incredibly powerful tool. It can be used for machine learning, statistical modeling, big data, and much more. It is also capable of producing websites, interactive notebooks, and presentations (such as this one!).
Installation
base installation and download the latest version
Running
Essentially, unless you are trying to provoke a headache, it is advised to do all R programming in RStudio.
NB: RStudio does not work as a standalone program. You will still need R installed even if you intend to only write code in RStudio.
Type code into the console and press the enter key
1 + 2
[1] 3
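The console evaluates longer expressions in the same way. As a small sketch (the numbers and the names a and b are illustrative, not from the course material), the following shows that multiplication is applied before addition unless brackets say otherwise:

```r
# Multiplication binds more tightly than addition
a <- 1 + 2 * 3    # evaluated as 1 + (2 * 3)

# Brackets override the default order
b <- (1 + 2) * 3

print(a)  # [1] 7
print(b)  # [1] 9
```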
R's arithmetic operators include +, -, *, /, and ^. Figure out what these do by trying some simple examples.
install.packages('tidyverse')
library('tidyverse')
Think of install.packages(...) as installing a new piece of software, and library(...) as clicking its icon to open it.
ggplot2 is one of the many packages included in the tidyverse
A ggplot2 graph follows the grammar of graphics, in which plots are built by adding layers one at a time
mpg
# A tibble: 234 x 11
   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
 2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
 3 audi         a4           2  2008     4 manual… f        20    31 p     comp…
 4 audi         a4           2  2008     4 auto(a… f        21    30 p     comp…
 5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
 6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
 7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
 8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
 9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
10 audi         a4 quat…     2  2008     4 manual… 4        20    28 p     comp…
# … with 224 more rows
The variables in mpg include:
displ - a car's engine size (in litres)
hwy - a car's fuel efficiency on the highway (in miles per gallon)
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))
We begin with the ggplot() function. This creates an empty graph which we can then add layers to
We pass in the mpg data frame so that ggplot2 knows we will be using that for our plot
We use the + symbol to add a new layer. Specifically, we add a point geometry using the geom_point() function
We tell ggplot2 which variables in the data frame to map to the various aesthetics of the plot. In this case, we just specify the variables mapping to the x and y coordinates
ggplot(<DATA>) +
  <GEOM_FUNCTION>(aes(<MAPPINGS>))
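As a sketch of how the template is filled in, the example below substitutes the mpg data frame, the geom_point() geometry, and a mapping of displ and cty. The choice of cty for the y axis and the object name p are illustrative assumptions rather than part of the slides.

```r
library(ggplot2)  # provides ggplot(), geom_point(), aes() and the mpg data

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y variables
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = cty))

# Printing the object draws the plot
print(p)
```

Storing the plot in a variable is optional; typing the ggplot() call directly at the console draws it straight away.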
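Plot how highway (hwy) and city (cty) mileage are related
The first few rows of mpg can be viewed with the head() function
head(mpg)
The number of rows shown can be changed with the optional n parameter
head(mpg, n = 3)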
tail() works the same as head() but gives you the bottom few rows. It also has an optional n parameter
The str() function displays the structure of a data frame (column names, data types, etc.)
str(mpg)
The summary() function displays a statistical summary of each column of the data frame
summary(mpg)
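The inspection functions above can be combined into a quick first look at a data frame. The sketch below assumes ggplot2 is installed (it supplies mpg); the object names first_rows and last_rows are arbitrary.

```r
library(ggplot2)  # supplies the mpg data frame

first_rows <- head(mpg, n = 3)  # top three rows
last_rows  <- tail(mpg, n = 3)  # bottom three rows

print(first_rows)
print(last_rows)

# Structure and per-column summary of the same data frame
str(mpg)
summary(mpg)
```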
Individual columns can be extracted using the $ or [[]] accessor
mpg$displ
mpg[['displ']]
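A short sketch, assuming ggplot2 is installed, showing that the two accessors give the same result and that the extracted column can be passed directly to other functions. The variable names are illustrative.

```r
library(ggplot2)  # supplies the mpg data frame

# Both accessors return the column as a plain vector
displ_dollar  <- mpg$displ
displ_bracket <- mpg[['displ']]

# The two results are the same vector
print(identical(displ_dollar, displ_bracket))  # [1] TRUE

# A vector can be passed straight to other functions
print(mean(mpg$hwy))
```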
Points can be coloured according to the class column
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class))
Other aesthetics include:
size - the size of each point
alpha - the transparency of each point
shape - the shape of the plotting character for each point
Numeric variables can be converted to categories using the factor function
ggplot2 will automatically scale your variable as well as create a legend for the new aesthetic
To set an aesthetic manually, place it outside the aes() function and assign the aesthetic a value rather than a variable
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = 'orange')
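To make the difference between mapping and setting concrete, here is a sketch that does both in one layer: colour is mapped to the drv variable inside aes(), while alpha and size are set to fixed values outside it. The values 0.5 and 3 are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_point(
    aes(x = displ, y = hwy, colour = drv),  # mapped: varies with the data, gets a legend
    alpha = 0.5,                            # set: same transparency for every point
    size = 3                                # set: same size for every point
  )

print(p)
```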
The hollow shapes (0-14) have a border determined by colour; the solid shapes (15-18) are filled with colour; and the filled shapes (21-24) have a border of colour and are filled with fill
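As an illustration of the filled shapes, the sketch below uses shape 21 (a filled circle) with a black border and an orange interior. The specific colours are assumptions chosen only for the example.

```r
library(ggplot2)

# Shape 21 is a filled circle: colour controls the border, fill the interior
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), shape = 21, colour = 'black', fill = 'orange')

print(p)
```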
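Plots can be split into facets using the facet_wrap() or facet_grid() function
facet_wrap() is used to facet a plot by a single variable
The first argument of facet_wrap() should be a formula
A formula is created using the ~ symbol followed by a variable name
For example, ~ class which you read as “by class”
The layout can be controlled with the nrow or ncol parameters
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)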
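facet_wrap() works the same way with other variables and with ncol in place of nrow. The sketch below facets by drv and stacks the panels in a single column; the choice of drv is illustrative.

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ drv, ncol = 1)  # one panel per value of drv, stacked vertically

print(p)
```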
The legend can be hidden by setting show.legend = FALSE in the geometry layer
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class), show.legend = FALSE) +
  facet_wrap(~ class, nrow = 2)
facet_grid() is often used for faceting on a combination of two variables
Like facet_wrap(), the first argument of facet_grid() is also a formula
This time, the formula contains two variable names separated by a ~
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
To use facet_grid() with one variable, you can use a . instead of a variable name on one side of the formula
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
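Placing the . on the other side of the formula arranges the panels in rows instead of columns. A minimal sketch, using drv as an illustrative choice of variable:

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)  # rows by drv, no column variable

print(p)
```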
Help can be accessed using the ? symbol
# help on a dataset
?mpg
# help on a function - don't add brackets!
?facet_wrap