Warwick Data Science Society
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(x = <VAR1>, y = <VAR2>,
<FURTHER MAPPINGS>))
Further mappings include col (also spelled colour or color), size, alpha, and shape
These must all be placed inside aes()
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>)) +
facet_wrap(~<VAR>)
ggplot(<DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>)) +
facet_grid(<ROW_VAR> ~ <COL_VAR>)
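The templates above can be filled in as follows. This is only a sketch: it assumes the mpg data set that ships with ggplot2, which is not otherwise used in these slides, and the choice of variables is arbitrary.

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here purely for illustration
# facet_wrap(): one panel per value of a single variable
p_wrap <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = class)) +
  facet_wrap(~drv)

# facet_grid(): rows defined by one variable, columns by another
p_grid <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

print(p_wrap)
print(p_grid)
```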
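View the first or last rows with head(), tail(), and get an overview of each column with summary()
Inspect the structure of an object with str()
Extract a single column with $ or [[...]]
Open the help page for a built-in dataset with ?dataset_name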
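A short sketch of these inspection tools, assuming the built-in mtcars data set as the example (any data frame would do):

```r
# mtcars is a data set built into R; used here only as an illustration
head(mtcars, 3)      # first three rows
tail(mtcars, 3)      # last three rows
summary(mtcars$mpg)  # minimum, quartiles, mean and maximum of one column
str(mtcars)          # compact overview of the column types

# $ and [[...]] both extract a single column as a vector
mpg_dollar  <- mtcars$mpg
mpg_bracket <- mtcars[["mpg"]]
identical(mpg_dollar, mpg_bracket)
```

Typing ?mtcars at the console opens the documentation for the data set.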
(58 + 73 * 2) / 3 # normal rules of BIDMAS apply
[1] 68
sin(pi / 2)
[1] 1
sqrt(81)
[1] 9
log(42) # natural logarithm, written ln in mathematical notation
[1] 3.73767
Create new objects with the <- operator
x <- 3 * 4
x
[1] 12
x / 2
[1] 6
object_name <- value
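You can also use the = symbol for assignment and R will accept it; however, this will cause confusion later, so please avoid doing so
To save typing <-, you can use the keyboard shortcut Alt+minus in RStudio to insert it
Object names must start with a letter and can only contain letters, numbers, _, and .
To autocomplete a name, start typing it, press TAB, select the correct name from the drop-down list, and press ENTER
Functions are called like this:
function_name(arg1 = val1, arg2 = val2, ...)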
For example, seq() makes a regular sequence of numbers
seq(from = 1, to = 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
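The two calls above give the same result because unnamed arguments are matched by position. A small sketch using the by argument of seq(), which is not shown in the slides, makes the difference between the two styles clearer:

```r
# Named arguments can be given in any order
by_name <- seq(to = 10, from = 1, by = 3)

# Unnamed arguments are matched by position: from, to, by
by_position <- seq(1, 10, 3)

by_name
by_position
identical(by_name, by_position)
```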
If the prompt changes from > to +, R is expecting more input. Perhaps you forgot to close a bracket. If you can't find how to make R happy, you can press the ESC key to exit the current statement and start afresh.
If you get an "object not found" error, it is most likely that you've made a typo or have forgotten to assign a value to a variable you are referencing.
To read the documentation for a function, use the ? operator
readr is another of the many packages included in the tidyverse
CSV files are read with the read_csv() function
Remember to load the tidyverse package first!
Set your working directory with Session > Set Working Directory > Choose Directory, or by using Ctrl+Shift+H
people_df <- read_csv("data/people.csv")
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
When you run the read_csv() function, a message is printed containing the column names and types
Do not confuse readr's read_csv() function with the built-in read.csv() function
people_df
# A tibble: 1,192 x 6
earn height sex ed age race
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 50000 74.4 male 16 45 white
2 60000 65.5 female 16 58 white
3 30000 63.6 female 16 29 white
4 50000 63.1 female 16 91 other
5 51000 63.4 female 17 39 white
6 9000 64.4 female 15 26 white
7 29000 61.7 female 12 49 white
8 32000 72.7 male 17 46 white
9 2000 72.0 male 15 21 hispanic
10 27000 72.2 male 12 26 white
# ... with 1,182 more rows
An existing data frame can be converted to a tibble with the as_tibble() function
mtcars_tb <- as_tibble(mtcars)
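A minimal sketch of what the conversion changes. The rownames argument and the column name "model" are additions for illustration and do not appear in the slides.

```r
library(tibble)

# A plain data frame stores the car models as row names
class(mtcars)

# as_tibble() drops row names by default
mtcars_tb <- as_tibble(mtcars)
class(mtcars_tb)

# They can be kept as an ordinary column via the rownames argument
# ("model" is a column name chosen for this example)
mtcars_named <- as_tibble(mtcars, rownames = "model")
names(mtcars_named)[1]
```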
The read_csv() function can process an in-line CSV
read_csv(
"a, b, c
1, 2, 3
4, 5, 6"
)
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
The default behaviour of read_csv() is to use the first line for column names
If the file starts with lines that are not data, use the skip or comment parameter
The skip parameter can be used to skip a specified number of rows in the CSV file before starting to read data
read_csv(
"The first line of metadata
The second line of metadata
x, y, z
1, 2, 3",
skip = 2
)
# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
The comment parameter can be used to drop all lines starting with a specified character
read_csv(
"# This line is a comment
x, y, z
# So is this one
1, 2, 3",
comment = '#'
)
# A tibble: 1 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
If the file has no header row, use the col_names parameter to control what to do
col_names = FALSE will tell read_csv() that the first row of the CSV file is data and that the columns should be labelled using generic names 'X1', 'X2', etc.
read_csv(
"1, 2, 3
4, 5, 6",
col_names = FALSE
)
# A tibble: 2 x 3
X1 X2 X3
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
Alternatively, pass a character vector to col_names to supply the column names yourself
read_csv(
"1, 2, 3
4, 5, 6",
col_names = c("x", "y", "z")
)
# A tibble: 2 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
Vectors are created with the c() function. This stands for combine
x <- c(1, 4, 9)
sqrt(x)
[1] 1 2 3
Individual elements are extracted with the [ accessor
x[2] # R uses 1-based indexing
[1] 4
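Beyond a single index, the [ accessor also accepts vectors of positions, negative positions, and logical conditions. The following sketch reuses the same x as above:

```r
x <- c(1, 4, 9)

x * 2        # arithmetic is applied element by element
x[c(1, 3)]   # select the first and third elements
x[-1]        # a negative index drops that element
x[x > 2]     # a logical condition keeps the matching elements
length(x)    # number of elements
```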
Missing values are often coded in files as ., ?, N/A, -, etc.
You can tell readr which symbol is used in a given CSV file using the na parameter
read_csv(
"x, y, z
1, 2, .
4, ., 6",
na = '.'
)
# A tibble: 2 x 3
x y z
<dbl> <dbl> <dbl>
1 1 2 NA
2 4 NA 6
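The na parameter also accepts a character vector, which helps when a file mixes several missing-value markers. A small sketch with in-line data invented for this example:

```r
library(readr)

# In-line data invented for this example; "." and "N/A" both mark missing values
df <- read_csv(
  "x,y,z
1,N/A,3
4,5,.",
  na = c(".", "N/A")
)

df
is.na(df)
```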
Other readr functions
read_csv2() - reads semicolon-separated files
read_tsv() - reads tab-separated files
read_delim() - reads files with any delimiter
read_delim('path/to/file.txt', '|')
These all work in much the same way as read_csv()
Fixed-width files can be read with the read_fwf() function
This works similarly to read_csv() although column names must be specified manually
The second argument of read_fwf() can be fwf_widths(). This then takes a vector of integers to represent the widths of the columns and an optional vector of column names
people <- read_fwf('data/people.txt',
fwf_widths(c(6, 9, 7, 3, 3, 9),
c('earn','height','sex','ed','age','race')),
skip = 1)
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
people
# A tibble: 1,192 x 6
earn height sex ed age race
<dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 50000 74.4 male 16 45 white
2 60000 65.5 female 16 58 white
3 30000 63.6 female 16 29 white
4 50000 63.1 female 16 91 other
5 51000 63.4 female 17 39 white
6 9000 64.4 female 15 26 white
7 29000 61.7 female 12 49 white
8 32000 72.7 male 17 46 white
9 2000 72.0 male 15 21 hispanic
10 27000 72.2 male 12 26 white
# ... with 1,182 more rows
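To see how the widths map onto the text, here is a minimal sketch on two in-line records. The records and the column names code and value are invented for illustration.

```r
library(readr)

# Two records invented for illustration: 3 characters of text, then 3 digits
fwf_text <- "abc123
def456"

small <- read_fwf(
  fwf_text,
  fwf_widths(c(3, 3), c("code", "value"))
)

small
```

Each row is cut after the third character, so code holds the text part and value is parsed as a number.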
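Alternatives to fwf_widths() are fwf_positions(), fwf_cols(), and fwf_empty()
See ?read_fwf to learn more
Files with whitespace-separated columns can also be read with the read_table() function
people <- read_table('data/people.txt',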
skip=1,
col_names = c('earn','height','sex','ed','age','race'))
Parsed with column specification:
cols(
earn = col_double(),
height = col_double(),
sex = col_character(),
ed = col_double(),
age = col_double(),
race = col_character()
)
This is worth a read if you're interested but I will summarise the main points in a few slides
readr uses a heuristic to guess the type of each column from the first 1000 rows
This means that it can sometimes guess a type wrongly
In this case, it is worth re-importing but with column types specified manually
The most common issues are integers being read as doubles, factors (categorical variables) being read as strings, or dates being read as strings
Column types are specified using the col_types parameter with the cols() function
people <- read_csv('data/people.csv',
col_types = cols(
earn = col_double(),
sex = col_factor(),
age = col_integer()
))
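The effect of col_types can be checked on a small in-line CSV. This is a sketch with made-up rows that mimic three columns of the people data:

```r
library(readr)

# In-line data invented for this example
people_small <- read_csv(
  "earn,sex,age
50000,male,45
60000,female,58",
  col_types = cols(
    earn = col_double(),
    sex  = col_factor(),
    age  = col_integer()
  )
)

# Check the class that each column ended up with
sapply(people_small, class)
```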
A column can be left out of the import with col_skip()
The available column types are logical, integer, double, character, factor, date, time, datetime, and number (this is only used in special cases)
Orange <- as_tibble(Orange)
head(Orange)
# A tibble: 6 x 3
Tree age circumference
<ord> <dbl> <dbl>
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
How should we plot the Orange dataset?
One option is the geom_line() geometry
ggplot(Orange) +
geom_line(aes(x = age, y = circumference, col = Tree), size = 2)
Another geometry is geom_smooth(). This adds a smooth fitted line to a plot
It can be combined with geom_point() to create the following graph
Use the se parameter to control whether confidence intervals are shown. These are best turned off for now
ggplot(Orange) +
geom_smooth(aes(x = age, y = circumference, col = Tree), size = 2, se = FALSE) +
geom_point(aes(x = age, y = circumference, col = Tree), size = 2)
In the code above we had to specify x, y, and col twice
To avoid this, mappings shared by every layer can be passed to ggplot()
ggplot(Orange, aes(x = age, y = circumference, col = Tree)) +
geom_smooth(size = 2, se = FALSE) +
geom_point(aes(size = circumference))
Lines also have a linetype aesthetic to control what type of line is used
There is also a group aesthetic which allows you to separate the full data set before generating the lines (see exercises for examples of this)
Set group to -1 to use no grouping
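A sketch of both aesthetics on the Orange data. Converting Tree to an unordered factor for linetype is a precaution added here, not something taken from the slides.

```r
library(ggplot2)

# Orange is built into R; Tree identifies each of the five trees

# group: one line per tree, without mapping Tree to colour
p_group <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line(aes(group = Tree))

# linetype: one line per tree, each drawn in a different style
# (Tree is an ordered factor; it is converted to a plain factor
# here so that a discrete linetype scale is used)
p_linetype <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line(aes(linetype = factor(Tree, ordered = FALSE)))

print(p_group)
print(p_linetype)
```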