Warwick Data Science Society
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(x = <VAR1>, y = <VAR2>,
<FURTHER MAPPINGS>))
col/colour/color, size, alpha, shape
aes()
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>)) +
facet_wrap(~<VAR>)
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>)) +
facet_grid(<ROW_VAR> ~ <COL_VAR>)
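As a concrete instance of the two facet templates above, here is a minimal sketch. It assumes the mpg dataset that ships with ggplot2 as a stand-in; the variable choices are illustrative only and are not taken from the slides.

```r
library(ggplot2)

# mpg is bundled with ggplot2 and is used here only as example data (assumption)

# facet_wrap(): one panel per value of a single variable
p_wrap <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, col = drv)) +
  facet_wrap(~class)

# facet_grid(): rows split by one variable, columns by another
p_grid <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```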
head(), tail(), summary()
str()
$, [[...]]
?dataset_name
(58 + 73 * 2) / 3 # normal rules of BIDMAS apply
[1] 68
sin(pi / 2)
[1] 1
sqrt(81)
[1] 9
log(42) # natural logarithm, written ln in mathematics (R has no ln() function)
[1] 3.73767
Assign values using the <- operator
x <- 3 * 4
x
[1] 12
x / 2
[1] 6
object_name <- value
You can also use the = symbol for assignment and R will accept it; however, this causes confusion later, so please avoid doing so
To type <- you can use the keyboard shortcut Alt+- (Alt-minus) in RStudio
Object names can contain letters, numbers, _, and .
To autocomplete a name, press TAB, select the correct name from the drop-down list, and press ENTER.
function_name(arg1 = val1, arg2 = val2, ...)
An example is seq(), which makes a regular sequence of numbers
seq(from = 1, to = 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
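The two calls above differ only in whether the arguments are named. A short sketch of how argument matching behaves; the by argument is part of base R's seq() but is not mentioned on the slides.

```r
# Named and positional arguments give the same result
named      <- seq(from = 1, to = 10)
positional <- seq(1, 10)
identical(named, positional)

# Named arguments can be given in any order
reordered <- seq(to = 10, from = 1)
identical(named, reordered)

# by sets the step size (an extra seq() argument, shown for illustration)
evens <- seq(2, 10, by = 2)
evens
```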
If the console prompt changes from > to a + then it means that R is expecting more input. Perhaps you forgot to close a bracket. If you can't work out how to make R happy, you can press the ESC key to exit the current statement and start afresh.
If you get an object not found error then it is most likely that you've made a typo or have forgotten to assign a value to a variable you are referencing
Help pages can be opened with the ? operator
readr is another of the many packages included in the tidyverse
CSV files are read with the read_csv() function
Remember to load the tidyverse package first!
The working directory can be set via Session > Set Working Directory > Choose Directory or by using Ctrl+Shift+H
people_df <- read_csv("data/people.csv")
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
When you use the read_csv() function, a message is printed containing the column names and types
Do not confuse readr's read_csv() function with the built-in read.csv() function
people_df
# A tibble: 1,192 x 6
earn height sex ed age race
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 50000 74.4 male 16 45 white
2 60000 65.5 female 16 58 white
3 30000 63.6 female 16 29 white
4 50000 63.1 female 16 91 other
5 51000 63.4 female 17 39 white
6 9000 64.4 female 15 26 white
7 29000 61.7 female 12 49 white
8 32000 72.7 male 17 46 white
9 2000 72.0 male 15 21 hispanic
10 27000 72.2 male 12 26 white
# ... with 1,182 more rows
An existing data frame can be converted to a tibble with the as_tibble() function
mtcars_tb <- as_tibble(mtcars)
The read_csv() function can process an in-line CSV
read_csv(
"a, b, c
1, 2, 3
4, 5, 6"
)
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
The default behaviour of read_csv() is to use the first line for column names
Lines that should not be read can be handled with the skip or comment parameter
The skip parameter can be used to skip a specified number of rows in the CSV file before starting to read data
read_csv(
"The first line of metadata
The second line of metadata
x, y, z
1, 2, 3",
skip = 2
)
# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
The comment parameter can be used to drop all lines starting with a specified character
read_csv(
"# This line is a comment
x, y, z
# So is this one
1, 2, 3",
comment = '#'
)
# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
If the file has no column names, use the col_names parameter to control what to do
col_names = FALSE tells read_csv() that the first row of the CSV file is data and that the columns should be labelled using generic names 'X1', 'X2', etc.
read_csv(
"1, 2, 3
4, 5, 6",
col_names = FALSE
)
# A tibble: 2 x 3
X1 X2 X3
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
read_csv(
"1, 2, 3
4, 5, 6",
col_names = c("x", "y", "z")
)
# A tibble: 2 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
Vectors are created with the c() function. This stands for combine
x <- c(1, 4, 9)
sqrt(x)
[1] 1 2 3
Elements of a vector are accessed with the [ accessor
x[2] # R uses 1-based indexing
[1] 4
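A few more indexing forms, sketched with the same vector. Selecting several elements at once and negative indices are standard base R behaviour but are not covered on the slides.

```r
x <- c(1, 4, 9)

# Select several elements by passing a vector of positions
first_and_last <- x[c(1, 3)]

# A negative index drops that position instead of selecting it
without_first <- x[-1]

# length() gives the number of elements, so this picks the last one
last <- x[length(x)]
```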
Missing values may be encoded as ., ?, N/A, -, etc.
You can tell readr which symbol is used in a given CSV file using the na parameter
read_csv(
"x, y, z
1, 2, .
4, ., 6",
na = '.'
)
# A tibble: 2 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 NA
2 4 NA 6
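The na parameter also accepts a vector, which helps when a file mixes several missing-value symbols. A minimal sketch with made-up in-line data:

```r
library(readr)

# Both "." and "N/A" are treated as missing (the data is illustrative only)
mixed_na <- read_csv(
"x,y,z
1,N/A,.
4,5,6",
  na = c(".", "N/A")
)

# Count the missing values in each column
colSums(is.na(mixed_na))
```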
Other readr functions
read_csv2() - reads semicolon-separated files
read_tsv() - reads tab-separated files
read_delim() - reads files with any delimiter
read_delim('path/to/file.txt', '|')
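Like read_csv(), read_delim() can be tried on in-line data. A minimal sketch with a pipe delimiter; the data is made up, and col_types is given only to suppress the column specification message.

```r
library(readr)

# In-line pipe-delimited data (illustrative only)
piped <- read_delim(
"a|b|c
1|2|3
4|5|6",
  delim = "|",
  col_types = cols(.default = col_double())
)
piped
```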
These work similarly to read_csv()
Fixed-width files can be read with the read_fwf() function
This works like read_csv() although column names must be specified manually
The second argument of read_fwf() can be fwf_widths(). This takes a vector of integers representing the widths of the columns and an optional vector of column names
people <- read_fwf('data/people.txt',
fwf_widths(c(6, 9, 7, 3, 3, 9),
c('earn','height','sex','ed','age','race')),
skip = 1)
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
people
# A tibble: 1,192 x 6
earn height sex ed age race
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 50000 74.4 male 16 45 white
2 60000 65.5 female 16 58 white
3 30000 63.6 female 16 29 white
4 50000 63.1 female 16 91 other
5 51000 63.4 female 17 39 white
6 9000 64.4 female 15 26 white
7 29000 61.7 female 12 49 white
8 32000 72.7 male 17 46 white
9 2000 72.0 male 15 21 hispanic
10 27000 72.2 male 12 26 white
# ... with 1,182 more rows
Alternatives to fwf_widths() are fwf_positions(), fwf_cols(), and fwf_empty()
Run ?read_fwf to learn more
Whitespace-separated files can also be read with the read_table() function
people <- read_table('data/people.txt',
skip=1,
col_names = c('earn','height','sex','ed','age','race'))
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
This is worth a read if you're interested but I will summarise the main points in a few slides
readr uses a heuristic to figure out the type of each column from the first 1000 rows
This means that it can sometimes get these wrong
In this case, it is worth re-importing but with column types specified manually
The most common issues are integers being read as doubles, factors (categorical variables) being read as strings, or dates being read as strings
Column types are specified using the col_types parameter with the cols() function
people <- read_csv('data/people.csv',
col_types = cols(
earn = col_double(),
sex = col_factor(),
age = col_integer()
))
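To check that a manual specification took effect, the column classes can be inspected after import. A minimal sketch using in-line data in place of data/people.csv; the two rows are copied from the printed output earlier in the slides.

```r
library(readr)

# In-line stand-in for data/people.csv (three of its columns, first two rows)
typed <- read_csv(
"earn,sex,age
50000,male,45
60000,female,58",
  col_types = cols(
    earn = col_double(),
    sex = col_factor(),
    age = col_integer()
  )
)

# Show the class of each column
sapply(typed, class)
```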
Columns can be dropped with col_skip()
The available column types are logical, integer, double, character, factor, date, time, datetime, number (this is only used in special cases)
Orange <- as_tibble(Orange)
head(Orange)
# A tibble: 6 x 3
Tree age circumference
<ord> <dbl> <dbl>
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
How should we plot the Orange dataset?
One option is the geom_line() geometry
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, col = Tree), size = 2)
Another geometry is geom_smooth(). This adds a smooth fitted line to a plot
It can be combined with geom_point() to create the following graph
Use the se parameter to control whether confidence intervals are shown. These are best turned off for now
ggplot(Orange) +
geom_smooth(aes(x = age, y = circumference, col = Tree), size = 2, se = FALSE) +
geom_point(aes(x = age, y = circumference, col = Tree), size = 2)
Notice that we had to specify x, y, and col twice
Mappings shared by all layers can instead be passed to ggplot()
ggplot(Orange, aes(x = age, y = circumference, col = Tree)) +
geom_smooth(size = 2, se = FALSE) +
geom_point(aes(size = circumference))
Use the linetype aesthetic to control what type of line is used
There is also the group aesthetic which allows you to separate the full data set before generating the lines (see exercises for examples of this)
Set group to -1 to use no grouping
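A minimal sketch of the group aesthetic on the Orange data, assuming the built-in dataset converted to a plain data frame. The first plot draws one line per tree without using colour; the second turns grouping off so a single line runs through every observation.

```r
library(ggplot2)

orange_df <- as.data.frame(Orange)

# group = Tree: a separate line for each tree, all in the same colour
p_grouped <- ggplot(orange_df, aes(x = age, y = circumference)) +
  geom_line(aes(group = Tree))

# group = -1: no grouping, so all points are joined into one line
p_single <- ggplot(orange_df, aes(x = age, y = circumference)) +
  geom_line(aes(group = -1))

p_grouped
p_single
```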