Warwick Data Science Society
mean()
, median()
, min()
, max()
, quantile()[i]
range()
, diff()
, IQR()
var()
, sd()
na.rm = TRUE
if you have missing values<
, <=
, >
, and >=
==
, !=
, and near()
&
(and), |
(or), and !
(not)dplyr
functions referred to as verbs:
filter()
- pick observations by their valuesarrange()
- reorder observations based on their valuesselect()
- pick variables by their namesmutate()
- create new variables as functions of existing variablessummarise()
/summarize()
- collapse many values down to a single summarygroup_by()
to change the scope of aggregate functions%>%
) operatortidyr
is the foundational package in the tidyversecountry
, year
, population
, and cases
table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
1999
and 2000
, which do not represent variablesyear
variable1999
and 2000
in this case)year
). We refer to this as the key
cases
). We refer to this as the value
gather()
table4a %>%
# use backticks so R doesn't confuse the column names for numbers
gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
table4b
though this time our value
would be "population"
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
cases
and population
are contained in the type
columnspread()
function to fix this issuetype
) which we referred to as the key
count
) which we referred to as the value
spread()
table2 %>%
spread(key = type, value = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
type
and count
since they already exist in the dataset whereas with gather()
we were adding themkey
and value
arguments, spread()
and gather()
are opposites of each other and can be used to reverse each other's effectsgather()
makes wide tables narrower and longer, whereas spread()
makes long tables shorter and widerpivot_longer()
and pivot_wider()
which have a more intuitive syntax than gather()
and spread()
experimental
lifecycle, and soon to move into maturing
with tidyr 1.1.0table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
rate
separate()
function to fix this issueseparate()
will split values whenever it sees a non-alphanumeric character though this can be specified manually using the sep
argumentconvert = TRUE
if we want tidyr
to automatically convert the new columns to the correct data types (by default they would still be characters)separate()
table3 %>%
separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
separate()
a vector of integers as the sep
argument to be used as the positions to split attable5
# A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
unite()
function to fix this issuesep
argument to specify the separator or use the default _
unite()
table5 %>%
unite(new, century, year, sep = '')
# A tibble: 6 x 3
country new rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
unite()
does not have a convert
parameter so we would have to perform this conversion manuallymutate(year = as.integer(year))
dplyr
to help us do this using a new set of functions from the packageband_members
# A tibble: 3 x 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
band_instruments
# A tibble: 3 x 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
inner_join(band_members, band_instruments, by = 'name')
# A tibble: 2 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
na
left_join(band_members, band_instruments, by = 'name')
# A tibble: 3 x 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
band_members
tableright_join(band_members, band_instruments, by = 'name')
# A tibble: 3 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
band_instruments
tablefull_join(band_members, band_instruments, by = 'name')
# A tibble: 4 x 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
names(band_members)
[1] "name" "band"
names(band_instruments2)
[1] "artist" "plays"
inner_join(band_members, band_instruments2, by = c("name" = "artist"))
# A tibble: 2 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
employees <- tibble(name = c('John', 'Anne'), experience = c(4, 7))
sales <- tibble(item_code = c('21', '52', '35'), sold_by = c('John', 'Anne', 'Anne'))
sales %>%
inner_join(employees, by = c('sold_by' = 'name'))
# A tibble: 3 x 3
item_code sold_by experience
<chr> <chr> <dbl>
1 21 John 4
2 52 Anne 7
3 35 Anne 7