Remember, before you can use the tidyverse, you need to load the package.
library(tidyverse)
Back to the Basics
R for statistics
- Extract the
displ
column from the mpg
dataset and assign it to the object x
- What is the mean and median of
x
?
- What are the largest and smallest values of
x
? Hint: use either min()
/max()
or range()
- Find the upper and lower quartile of
x
. What about the inter-quartile range?
- What is the variance and standard deviation of
x
?
Missing Values
- What do you think the result of
5 + NA
will be? Try it and see if you were correct
- Replace
+
with some other arithmetic operators. Is the result always the same?
- Calculate the median of the vector
c(2, 6, 3, 4, NA)
, ignoring missing values
- (From R4DS) Why is
NA ^ 0
not missing? What about NA | TRUE
or NA & FALSE
? Is there a general rule? (NA * 0
is a tricky counterexample - discuss this with me if you are curious)
Arithmetic with Boolean Values
- What is the value of
5 + TRUE
?
- What about
5 + FALSE
?
- What is the significance of the sum (
sum()
) of a logical vector?
- What about the mean?
- Why is it true that
FALSE < TRUE
? How would this affect ordering by a logical column using arrange()
Comparisons and Boolean Operators
- Predict the output of the following R statements. Run them yourself to check your thinking
((4 > 3) & (7 == 6)) | (5 <= 8)
(sqrt(2) ^ 2 == 2) | !near(sqrt(2) ^ 2, 2)
(For this one you may want to look at ?Syntax
)
FALSE & !FALSE | TRUE
The dpylr verbs
Filter
- How many automatic cars by Audi are in the
mpg
dataset?
- Who makes the only two cars with a highway mileage above 41 miles/gallon?
- Are there any flowers in the
iris
dataset with petals that are wider than they are long?
- What is the carat of the highest priced diamond in the the
diamond
dataset? (A condition such as price == max(price)
may be useful)
- For each year in the
mpg
dataset, which car has the smallest engine size? Question 4 may be helpful. Note that you can use group_by()
before filter()
to change the scope of any aggregate functions
- Select all flowers from the
iris
dataset where either the petal length is greater than 6.4 or the petal width is greater than 2.4 (there should be 7 of these)
- Which observations in the
airquality
dataset are missing a reading for Ozone
Arrange
- Now, using
arrange()
, find the car in the mpg
dataset with the best city mileage. What about the worst?
- Order the
iris
dataset by the product of the petal length and width (Note: you don’t need to create a new variable using mutate
to do this)
- Look at the column types when you print
mpg
and diamonds
to the console. What order do you get when you arrange mpg
by class
and what order do you get when you arrange diamonds
by cut
. Why is this?
- Order the cars in
mpg
by the difference between their city and highway mileage (biggest difference first) - if you wanted to do this properly you should use the abs()
function, or you can use a filter to check that there are no cars with city mileage higher than highway mileage
Select
- Re-order the columns of the
iris
dataset so that Species is first. To do this, select Species
first and then use the everything()
helper to select everything else
- Rename the
displ
column in mpg
to eng_size
- Select all columns apart from
price
in the diamonds dataset
- What does the following code do?
select(mpg, 1, 2, 4, 11)
Mutate
- Using the
mpg
dataset, create a new column called cty_km_l
which measures the city mileage in kilometres per litre (Hint: 1 mile ~ 1.6 km, 1 gallon ~ 3.8 litres)
- Create column called
max_dimension
for the diamonds
dataset which contains the maximum of the values in the columns x
, y
, z
- Read the help page for
transmutate()
- The
mpg
dataset currently has trans
stored as a character. Convert this to a factor using mutate()
and the factor()
function
- Create a column called
is_automatic
which is TRUE
if and only if a given car has automatic transmission. Don’t forget to use ==
for comparison
mutate
can be combined with group_by
to change the scope of aggregate functions. Use this to create a new variable in the mpg
dataset called best_in_class
which is TRUE
if and only if the highway mileage is the highest out of all other cars in that class. (Hint: first group by class then use hwy == max(hwy)
)
- Create a new column in the
diamonds
dataset called expensive
which is TRUE
if and only if the price of the diamond is above the upper quartile of all prices
Summarise
- What is the variance of the city mileage and highway mileage column in the
mpg
dataset?
- How many cars are there of each class in the
mpg
dataset? Group by class
and then use the n()
function to count
- What are the median values for each of the 4 numeric columns in the
iris
dataset?
- What is the price of the least expensive diamond of each cut?
- What is the mean ozone for each month in the
airquality
dataset? Make sure you ignore NA
s
- For each cut and colour in the diamonds dataset, what is the range (
diff(range(...))
) of the carats?
Pipelines
In this section, all code should be written by piping the output of each function into the next using %>%
- After removing all 2-seater cars, calculate the average of the city and highway mileage for each car and add this as a new column called
efficiency.
Group by manufacturer and the calculate the maximum efficiency for each group. Arrange these in descending order of efficiency
- Create a new column in the
iris
dataset called Petal.Area
which is the product of the petal length and width. Create a similar column called Sepal.Area
. Pipe this data frame into a call to ggplot
to create a plot of these two variables, coloured by Species. Add a line of best fit for each species
- Formulate your own question about the
diamonds
dataset and use a pipeline to answer it
Going Beyond
Ranking
- Read the help page for
ranking
. In particular try to understand the difference between row_number()
, min_rank()
, dense_rank()
. The others are of less use
- For each date in the
airquality
dataset, rank observations (think about which ranking method is most appropriate) by the intensity of solar radiation compared to other observations in the same month. Select only observations which have the 2nd highest solar radiation for each month
- For each year in the
mpg
dataset, rank the cars by city mileage so that the lowest mileage is ranked first. Select the cars that are ranked number one for each year
Un-grouping
- Try running the following code. What error message do you get?
- Run the code line by line, where do things go wrong?
as_tibble(Titanic) %>%
group_by(Class, Age) %>%
summarise(count = sum(n)) %>%
mutate(Class = reorder(Class, count))
- Read the help page for
ungroup
. How might you rectify this issue?
---
title: "Into the Tidyverse"
subtitle: "Session Three Exercises"
output: html_notebook
---

Remember, before you can use the tidyverse, you need to load the package.

```{r message=FALSE}
library(tidyverse)
```

## Back to the Basics

#### R for statistics

1. Extract the `displ` column from the `mpg` dataset and assign it to the object `x`
2. What is the mean and median of `x`?
3. What are the largest and smallest values of `x`? Hint: use either `min()`/`max()` or `range()`
4. Find the upper and lower quartile of `x`. What about the inter-quartile range?
5. What is the variance and standard deviation of `x`?

#### Missing Values

1. What do you think the result of `5 + NA` will be? Try it and see if you were correct
2. Replace `+` with some other arithmetic operators. Is the result always the same?
3. Calculate the median of the vector `c(2, 6, 3, 4, NA)`, ignoring missing values
4. (From R4DS) Why is `NA ^ 0` not missing? What about `NA | TRUE` or `NA & FALSE`? Is there a general rule? (`NA * 0` is a tricky counterexample - discuss this with me if you are curious)

#### Arithmetic with Boolean Values

1. What is the value of `5 + TRUE`?
2. What about `5 + FALSE`?
3. What is the significance of the sum (`sum()`) of a logical vector?
4. What about the mean?
5. Why is it true that `FALSE < TRUE`? How would this affect ordering by a logical column using `arrange()`

#### Comparisons and Boolean Operators

1. Predict the output of the following R statements. Run them yourself to check your thinking

```{r eval=FALSE}
((4 > 3) & (7 == 6)) | (5 <= 8)
```

```{r eval=FALSE}
(sqrt(2) ^ 2 == 2) | !near(sqrt(2) ^ 2, 2)
```

(For this one you may want to look at `?Syntax`)

```{r eval=FALSE}
FALSE & !FALSE | TRUE
```

## The dpylr verbs

### Filter

1. How many automatic cars by Audi are in the `mpg` dataset?
2. Who makes the only two cars with a highway mileage above 41 miles/gallon?
3. Are there any flowers in the `iris` dataset with petals that are wider than they are long?
4. What is the carat of the highest priced diamond in the the `diamond` dataset? (A condition such as `price == max(price)` may be useful)
5. For each year in the `mpg` dataset, which car has the smallest engine size? Question 4 may be helpful. Note that you can use `group_by()` before `filter()` to change the scope of any aggregate functions
6. Select all flowers from the `iris` dataset where either the petal length is greater than 6.4 or the petal width is greater than 2.4 (there should be 7 of these)
7. Which observations in the `airquality` dataset are missing a reading for `Ozone`

### Arrange

1. Now, using `arrange()`, find the car in the `mpg` dataset with the best city mileage. What about the worst?
2. Order the `iris` dataset by the product of the petal length and width (Note: you don't need to create a new variable using `mutate` to do this)
3. Look at the column types when you print `mpg` and `diamonds` to the console. What order do you get when you arrange `mpg` by `class` and what order do you get when you arrange `diamonds` by `cut`. Why is this?
4. Order the cars in `mpg` by the difference between their city and highway mileage (biggest difference first) - if you wanted to do this properly you should use the `abs()` function, or you can use a filter to check that there are no cars with city mileage higher than highway mileage

### Select

1. Re-order the columns of the `iris` dataset so that Species is first. To do this, select `Species` first and then use the `everything()` helper to select everything else
2. Rename the `displ` column in `mpg` to `eng_size`
3. Select all columns apart from `price` in the diamonds dataset
4. What does the following code do? 

```{r eval=FALSE}
select(mpg, 1, 2, 4, 11)
```

### Mutate

1. Using the `mpg` dataset, create a new column called `cty_km_l` which measures the city mileage in kilometres per litre (Hint: 1 mile ~ 1.6 km, 1 gallon ~ 3.8 litres)
2. Create column called `max_dimension` for the `diamonds` dataset which contains the maximum of the values in the columns `x`, `y`, `z`
3. Read the help page for `transmutate()`
4. The `mpg` dataset currently has `trans` stored as a character. Convert this to a factor using `mutate()` and the `factor()` function
5. Create a column called `is_automatic` which is `TRUE` if and only if a given car has automatic transmission. Don't forget to use `==` for comparison
6. `mutate` can be combined with `group_by` to change the scope of aggregate functions. Use this to create a new variable in the `mpg` dataset called `best_in_class` which is `TRUE` if and only if the highway mileage is the highest out of all other cars in that class. (Hint: first group by class then use `hwy == max(hwy)`)
7. Create a new column in the `diamonds` dataset called `expensive` which is `TRUE` if and only if the price of the diamond is above the upper quartile of all prices

### Summarise

1. What is the variance of the city mileage and highway mileage column in the `mpg` dataset?
2. How many cars are there of each class in the `mpg` dataset? Group by `class` and then use the `n()` function to count
3. What are the median values for each of the 4 numeric columns in the `iris` dataset?
4. What is the price of the least expensive diamond of each cut?
5. What is the mean ozone for each month in the `airquality` dataset? Make sure you ignore `NA`s
6. For each cut and colour in the diamonds dataset, what is the range (`diff(range(...))`) of the carats?

### Pipelines

In this section, all code should be written by piping the output of each function into the next using `%>%`

1. After removing all 2-seater cars, calculate the average of the city and highway mileage for each car and add this as a new column called `efficiency.` Group by manufacturer and the calculate the maximum efficiency for each group. Arrange these in descending order of efficiency
2. Create a new column in the `iris` dataset called `Petal.Area` which is the product of the petal length and width. Create a similar column called `Sepal.Area`. Pipe this data frame into a call to `ggplot` to create a plot of these two variables, coloured by Species. Add a line of best fit for each species
3. Formulate your own question about the `diamonds` dataset and use a pipeline to answer it

## Going Beyond

### Ranking

1. Read the help page for `ranking`. In particular try to understand the difference between `row_number()`, `min_rank()`, `dense_rank()`. The others are of less use
2. For each date in the `airquality` dataset, rank observations (think about which ranking method is most appropriate) by the intensity of solar radiation compared to other observations _in the same month_. Select only observations which have the 2nd highest solar radiation for each month
3. For each year in the `mpg` dataset, rank the cars by city mileage so that the lowest mileage is ranked first. Select the cars that are ranked number one for each year

### Un-grouping

1. Try running the following code. What error message do you get?
2. Run the code line by line, where do things go wrong?

```{r eval=FALSE}
as_tibble(Titanic) %>% 
     group_by(Class, Age) %>% 
     summarise(count = sum(n)) %>% 
     mutate(Class = reorder(Class, count))
```

3. Read the help page for `ungroup`. How might you rectify this issue?