Remember, before you can use the tidyverse, you need to load the package.
library(tidyverse)
Coding Basics
R as a Calculator 1
(Taken from R4DS)
- Create a sequence using the seq() function, starting at 0 and ending at 100, and assign the output to an object called integers
- Create a new vector squares which contains the square of every value in integers. Remember that most mathematical functions in R act on each element individually. You might want to use the ^ operator.
- Create a data frame with these two vectors as columns by running tibble(integers, squares). Assign the output to an object called squares_df
- Use ggplot() to produce either a scatter or line plot of these two variables (or both at the same time if you’re feeling brave!)
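As a reminder of what "acting on each element individually" means, here is a minimal sketch using a small made-up vector (the names evens, halves and cubes are illustrative and not part of the exercise):

```r
# Arithmetic operators in R are vectorised: they are applied to every element
evens <- seq(0, 10, by = 2)  # 0 2 4 6 8 10

halves <- evens / 2          # each element divided by 2
cubes <- evens^3             # each element raised to the power 3

print(halves)
print(cubes)
```

The same pattern should carry over to any other power or arithmetic operation on a numeric vector.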
R as a Calculator 2
- Create a new sequence called x containing 1000 numbers spaced uniformly between -6 and 6. You can do this by adding the length.out parameter to seq() and setting it to 1000 (this parameter must be specified by name, inside the brackets)
- Calculate the sine of each value in x using sin() and assign the result to an object called y
- Create a dataframe as before and plot a line graph
- Change the line colour to any one of your choosing, set the linetype to 2 (dashed), and size to 1.5 (in ggplot2 3.4.0 and later, line thickness is set with linewidth instead of size)
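The following is a small sketch of how length.out behaves and how fixed (non-aesthetic) settings are passed to a line geometry. The data, the object names and the colour are made up for illustration; it is deliberately not the sine curve from the exercise.

```r
library(tidyverse)

# length.out fixes how many values are produced; the spacing is worked out for you
t <- seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

# Hypothetical example data: a simple cubic curve
curve_df <- tibble(t, t_cubed = t^3)

# Settings written outside aes() apply to the whole line
p <- ggplot(curve_df, aes(x = t, y = t_cubed)) +
  geom_line(colour = "steelblue", linetype = 2)

print(p)
```

Because colour and linetype sit outside aes(), they are treated as constants rather than being mapped to a variable.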
Specifying Parameters by Position
- Which parameters from the following code can you use without specifying their name?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = factor(class)))
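To illustrate the general idea of positional matching without giving away the answer, here is a sketch using seq(), whose first three arguments are from, to and by. The object names are illustrative only.

```r
# Arguments can be matched by name...
by_name <- seq(from = 0, to = 10, by = 5)

# ...or by position, as long as they are given in the order the function defines
by_position <- seq(0, 10, 5)

print(identical(by_name, by_position))

# Positional and named arguments can be mixed: length.out is not one of the
# first three arguments of seq(), so it is given by name here
mixed <- seq(0, 10, length.out = 3)
print(mixed)
```

Checking a function's help page (for example ?seq) shows the order of its arguments, which is what determines whether a name can be dropped.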
Reading CSVs
The ‘People’ Dataset
- Read the people.csv file from this session’s data folder. Make sure you specify that sex and race are factors (categorical variables) and that earn and age are integers.
- How does level of education affect salary? Make a scatter plot to find out. You should set position = 'jitter' in the geom_point() function to avoid over-plotting, and perhaps also set the transparency (alpha) to 0.5.
- Facet the above plot by sex, colouring the points by sex too. Make sure you hide the legend with show.legend = FALSE
- Label the plot using the labs() function (see end of exercise sheet one)
- Create a new plot showing how earnings change with age. Use geom_smooth() to show the general trend
- Set the colour aesthetic in the above plot to represent the race variable. Include the error regions and set the thickness of each line to 2
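Column types can be declared when reading a CSV. The sketch below uses a made-up in-line dataset (pets, with hypothetical columns name, species and weight) rather than people.csv, and assumes readr 2.0 or later, where literal data is wrapped in I().

```r
library(tidyverse)

# Hypothetical in-line data standing in for a file path
pets_csv <- I("name,species,weight\nRex,dog,30\nTom,cat,4\nFido,dog,12\n")

# cols() lets you override the guessed type of individual columns;
# any column not mentioned is still guessed automatically
pets <- read_csv(
  pets_csv,
  col_types = cols(
    species = col_factor(),
    weight = col_integer()
  )
)

print(pets)
```

For a real file, the first argument would be the path to the CSV instead of the I() string.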
In-line CSVs
- Read the following in-line CSV into R (Unknown is used here to represent a missing value)
"
Employee Database Version 3
name, age, job
John, 34, Analyst
Ann, 44, Consultant
Barry, 24, Unknown
Freya, Unknown, Developer"
- Read the following in-line CSV into R. The columns of this dataset are City, Area, Population. Make sure you don’t try to read the comment line
"
Shanghai, 6341, 24183300
Tokyo, 627, 13515271
Seoul, 605, 9806000
>> Note: Mumbai was previously called Bombay
Mumbai, 438, 12442373"
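The read_csv() arguments that are useful for messy in-line data are skip, na, col_names and comment. The sketch below shows them on two made-up datasets (stock and scores are hypothetical and differ from the ones above), again assuming readr 2.0 or later for I() and show_col_types.

```r
library(tidyverse)

# skip ignores leading lines; na lists the strings to treat as missing values
stock <- read_csv(
  I("Inventory export v1\nitem,qty\napple,3\npear,n/a\n"),
  skip = 1,
  na = "n/a",
  show_col_types = FALSE
)

# col_names supplies headers when the data has none;
# comment drops any text from the given marker to the end of the line
scores <- read_csv(
  I("ann,7\n# a note that is not data\nbob,9\n"),
  col_names = c("player", "score"),
  comment = "#",
  show_col_types = FALSE
)

print(stock)
print(scores)
```

The comment marker can be any string, so it is worth looking at how the comment line in your own data begins.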
Datasets from the Web
- Go to the following webpage - https://gist.github.com/tiangechen/b68782efa49a16edaf07dc2cdaa855ea - which contains data on the top grossing movies between 2007 and 2011
- Select Download ZIP and extract the CSV file (if this is causing difficulties, just use the copy in the data folder)
- Import this dataset into R
- Run this code to clean up the dataset (we’ll learn how to make code like this next week)
movies <- movies %>%
  mutate(Genre = ifelse(Genre %in% c('Comdy', 'comedy'), 'Comedy', Genre)) %>%
  mutate(Genre = ifelse(Genre %in% c('Romence', 'romance'), 'Romance', Genre)) %>%
  mutate(`Worldwide Gross` = as.double(str_replace(`Worldwide Gross`, '\\$', '')))
- Make a scatter plot of Worldwide Gross against Audience score % coloured by Genre. When using a column name that contains a space, you need to surround the name with back-ticks (`)
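Here is a short sketch of back-ticked column names in practice. The films tibble and its columns are invented for illustration and are not the movies dataset.

```r
library(tidyverse)

# Hypothetical data whose column names contain spaces and a symbol
films <- tibble(
  `Box Office` = c(120, 45, 300),
  `Critic Score %` = c(70, 55, 88)
)

# Back-ticks are needed wherever the column is referred to as a bare name
p <- ggplot(films, aes(x = `Critic Score %`, y = `Box Office`)) +
  geom_point()

print(p)
print(films$`Box Office`)
```

Names without spaces or special characters, such as Genre, can be written without back-ticks.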
Line geometries
Groups
- Take a look at the built-in ChickWeight dataset. What do the columns represent?
- Make a line plot of chick weight against time. Group by chick and colour by diet
- What happens if you forget to group by chick?
- The graph in (2) is very busy. Instead of using geom_line, use geom_smooth with the colour aesthetic set to Diet. There is no need to specify the group in this case - why? Hide the error regions with se = FALSE
- Overlay the data points, also coloured by diet. Add a jitter to prevent over-plotting and make the points more transparent (lower alpha) so they don’t distract from the trend lines
- Label the plot
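The group aesthetic controls which rows are joined into a single line. The sketch below builds the same plot with and without it on a small invented dataset (growth, with hypothetical columns plant, feed, day and height), so the two versions can be compared side by side.

```r
library(tidyverse)

# Hypothetical data: four plants, two per feed level, measured on three days
growth <- tibble(
  plant = rep(c("a", "b", "c", "d"), each = 3),
  feed = rep(c("low", "high"), each = 6),
  day = rep(c(1, 2, 3), times = 4),
  height = c(1, 2, 3, 1, 3, 4, 2, 4, 7, 2, 5, 8)
)

# One line per plant, coloured by feed
p_grouped <- ggplot(growth, aes(x = day, y = height, group = plant, colour = feed)) +
  geom_line()

# Without group, rows are split only by the discrete colour variable
p_ungrouped <- ggplot(growth, aes(x = day, y = height, colour = feed)) +
  geom_line()

print(p_grouped)
print(p_ungrouped)
```

Printing both plots is a quick way to see how the grouping changes which points each line connects.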
Smoothing Methods
- Create a plot of sepal length against sepal width for each observation in the iris dataset, colouring by species
- Add a geom_smooth layer with colour still mapped to Species. Inside this call specify method = 'lm'. This tells geom_smooth to use a linear model (straight line) for smoothing.
- Other methods include loess and gam. See what these look like
- If no method is specified or method = 'auto' is included, a method is chosen automatically. In this case, which method is used?
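For reference, this is a minimal sketch of passing a smoothing method to geom_smooth(), shown on the mpg dataset rather than iris so the exercise itself is left open. The choice of variables is illustrative.

```r
library(tidyverse)

# A scatter plot with a straight-line (linear model) trend on top;
# se = FALSE hides the shaded error region
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

print(p)
```

Swapping the method string is all that is needed to try a different smoother.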
Going Beyond
Theming Graphs
- Using the mpg dataset, create any graph of your choosing
- Label the graph by adding a title, caption, and axis labels
- Add a new layer to the graph using the function theme_minimal()
- Try using other themes. Type theme_ and look at the various auto-complete options
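Labels and themes are both added to a plot with +. The sketch below uses the built-in mtcars dataset rather than mpg, and the label text is invented for illustration.

```r
library(tidyverse)

# labs() sets the title, caption and axis labels; theme_minimal() restyles the plot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Heavier cars tend to use more fuel",
    caption = "Source: built-in mtcars dataset",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  ) +
  theme_minimal()

print(p)
```

Replacing theme_minimal() with another complete theme changes the look without touching the data or labels.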
ggplot Objects
- Run the following code
p <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length, col = Species)) +
  geom_point()
- What do you think this did? What will happen if we type p? Give it a go
- Type the following code
p +
  geom_smooth(se = FALSE, method = 'lm')
- What did that do? How can you use this to re-use code?
Scripting
- In RStudio, select File > New File > R Script
- A new panel should open. Type the code required to generate any plot of the Orange dataset (perhaps circumference against age, coloured by tree; add a trend curve if you're feeling confident)
- On the top bar of the script panel select source > source. What does this do?
- Use File > Save to save this script somewhere. (Feel free to delete it afterwards)
