As you may remember from the Intermediate R workshop, R is great at representing and manipulating tabular data. In “traditional” R, this was handled in data.frame, while in modern “tidyverse” R this is handled via a tibble.
A tibble is a two-dimensional table of data, organised into named columns and numbered rows.
library(tidyverse)
census <- tibble("City"=c("Paris", "Paris", "Paris", "Paris",
                          "London", "London", "London", "London",
                          "Rome", "Rome", "Rome", "Rome"),
                 "year"=c(2001, 2008, 2009, 2010,
                          2001, 2006, 2011, 2015,
                          2001, 2006, 2009, 2012),
                 "pop"=c(2.148, 2.211, 2.234, 2.244,
                         7.322, 7.657, 8.174, 8.615,
                         2.547, 2.627, 2.734, 2.627))
This has created a tibble that we have assigned to the variable census. The column names are the argument names (City, year and pop), while the data for each column is given by the corresponding values (the vectors created with c( )).
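As a quick check of the structure just described, the sketch below rebuilds the same data and inspects its column names and dimensions. It uses rep() as a shorthand for the repeated city names and loads only the tibble package (part of the tidyverse); both are choices made for this illustration, not something the workshop requires.

```r
library(tibble)

# Same census data as above; rep() repeats each city name four times
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = c(2001, 2008, 2009, 2010,
           2001, 2006, 2011, 2015,
           2001, 2006, 2009, 2012),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

names(census)   # the column names: "City" "year" "pop"
dim(census)     # number of rows, then number of columns: 12 3
```

Each column must have the same length (or length one), otherwise tibble( ) reports an error rather than building a ragged table.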
You can print a summary of the tibble via:
census
This will output
City year pop
<chr> <dbl> <dbl>
Paris 2001 2.148
Paris 2008 2.211
Paris 2009 2.234
Paris 2010 2.244
London 2001 7.322
London 2006 7.657
London 2011 8.174
London 2015 8.615
Rome 2001 2.547
Rome 2006 2.627
Rome 2009 2.734
Rome 2012 2.627
Note that R defaults to interpreting numbers as floating point (dbl). While this is correct for the pop (population) column, it is the wrong choice for the year, where an integer is a better fit. To force this, use as.integer to set the data type for the year column:
census <- tibble("City"=c("Paris", "Paris", "Paris", "Paris",
                          "London", "London", "London", "London",
                          "Rome", "Rome", "Rome", "Rome"),
                 "year"=as.integer(c(2001, 2008, 2009, 2010,
                                     2001, 2006, 2011, 2015,
                                     2001, 2006, 2009, 2012)),
                 "pop"=c(2.148, 2.211, 2.234, 2.244,
                         7.322, 7.657, 8.174, 8.615,
                         2.547, 2.627, 2.734, 2.627))
census
will print
City year pop
<chr> <int> <dbl>
Paris 2001 2.148
Paris 2008 2.211
Paris 2009 2.234
Paris 2010 2.244
London 2001 7.322
London 2006 7.657
London 2011 8.174
London 2015 8.615
Rome 2001 2.547
Rome 2006 2.627
Rome 2009 2.734
Rome 2012 2.627
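An alternative that the text above does not use is R's L suffix, which writes integer literals directly. The sketch below compares the two approaches on a few of the year values and checks the resulting column types. The names years_dbl, years_int and small are made up for this illustration.

```r
library(tibble)

years_dbl <- c(2001, 2008, 2009, 2010)       # plain numbers are doubles
years_int <- c(2001L, 2008L, 2009L, 2010L)   # the L suffix makes integers

typeof(years_dbl)   # "double"
typeof(years_int)   # "integer"

# as.integer() converts the doubles to the same integer values
identical(as.integer(years_dbl), years_int)   # TRUE

# Inside a tibble, the integer column is reported as <int>
small <- tibble(year = years_int, pop = c(2.148, 2.211, 2.234, 2.244))
sapply(small, class)
```

Either form gives the same column; as.integer is convenient when the numbers already exist as doubles, while the L suffix avoids the conversion step when typing values by hand.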
You access the contents of a tibble mostly by column, e.g.
census["City"]
will return a tibble of just a single column containing the City data.
You can also access the columns by their index, e.g.
census[1]
will return the first column, so is identical to census["City"].
You can also extract multiple columns by specifying them via c( ), e.g.
census[c("City", "year")]
will return a tibble with the City and year columns.
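To see that selecting by name and by index really give the same result, the following sketch compares the two and checks what comes back from a multi-column selection. The data is rebuilt with rep() as a shorthand, and the names by_name, by_index and two_cols are made up for this illustration.

```r
library(tibble)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

by_name  <- census["City"]   # select the column by its name
by_index <- census[1]        # select the column by its position

identical(by_name, by_index)   # TRUE
is_tibble(by_name)             # TRUE: still a tibble, not a bare vector

# Several columns at once, by name or by position
two_cols <- census[c("City", "year")]
names(two_cols)                # "City" "year"
names(census[c(1, 2)])         # "City" "year"
```

Selecting by name is usually the more robust choice, since it keeps working if the column order changes.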
To access data by rows, you need to pass in the row index followed by a comma, e.g.
census[1, ]
will return a tibble containing just the first row of data.
You can use ranges to get several rows, e.g.
census[1:5, ]
would return the first five rows, while
census[seq(2, 10, 2), ]
would return the even rows from 2 to 10.
You can access specific rows and columns via [row, column], e.g.
census[1, 1]
returns a tibble containing just the first row and first column, while
census[seq(2, 10, 2), "year"]
would return the year column of the even rows from 2 to 10, and
census[5, 2:3]
would return the second and third columns of the fifth row.
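The row and column selections above can be checked by looking at the size of what each one returns. This is a small sketch with the data rebuilt compactly via rep(); the names first_row, even_rows and cell are made up for this illustration.

```r
library(tibble)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

first_row <- census[1, ]               # one row, all columns
even_rows <- census[seq(2, 10, 2), ]   # rows 2, 4, 6, 8 and 10
cell      <- census[5, 2:3]            # row 5, columns 2 and 3

dim(first_row)   # 1 3
dim(even_rows)   # 5 3
dim(cell)        # 1 2

even_rows$year   # 2008 2010 2006 2015 2006
cell$pop         # 7.322
```

Note that every one of these results is still a tibble, even when it holds a single row or a single value.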
The above expressions all return a tibble that is a subset of the whole tibble. You can extract the data for a single column as a vector via [[ ]] or $, e.g.
census[[1]]
census[["City"]]
census$City
and can then extract data from those vectors via sub-indexing, e.g.
census$City[1]
would return the City column data for the first row.
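The difference between [ ] and [[ ]] matters when you pass a column on to other functions. The sketch below shows what each form returns, with the data rebuilt compactly via rep() as a shorthand.

```r
library(tibble)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

is_tibble(census["City"])        # TRUE:  single brackets keep the tibble
is_tibble(census[["City"]])      # FALSE: double brackets give the raw data
is.character(census[["City"]])   # TRUE:  a plain character vector

# The three extraction forms give the same vector
identical(census[[1]], census[["City"]])    # TRUE
identical(census[["City"]], census$City)    # TRUE

# A vector can be sub-indexed or passed to functions that expect one
census$City[1]     # "Paris"
max(census$pop)    # 8.615
```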
We can start to ask questions of our data using the filter function.
census %>% filter(City=="Paris")
City year pop
<chr> <int> <dbl>
Paris 2001 2.15
Paris 2008 2.21
Paris 2009 2.23
Paris 2010 2.24
(Note that we didn’t need to put double quotes around City in the filter, as it knows that this is a column name. Also, recall that the %>% operator passes the value on its left as the first argument to the function on its right.)
This has returned a new tibble, which you can then access using the same methods as above, e.g.
(census %>% filter(City=="Paris"))["year"]
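filter can also take more than one condition, in which case a row must satisfy all of them to be kept. The following sketch assumes the dplyr package (part of the tidyverse) and rebuilds the data compactly with rep(); the name recent_paris is made up for this illustration.

```r
library(tibble)
library(dplyr)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

# Conditions separated by commas are combined with "and"
recent_paris <- census %>% filter(City == "Paris", year > 2008)

nrow(recent_paris)    # 2
recent_paris$year     # 2009 2010

# The result is an ordinary tibble, so the indexing shown earlier still works
recent_paris[["pop"]]
```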
You can also test if the rows of a tibble match a condition, e.g.
census["City"] == "Paris"
returns a set of TRUE / FALSE values for each row, depending on whether the City value of that row was equal to Paris.
City
[1,] TRUE
[2,] TRUE
[3,] TRUE
[4,] TRUE
[5,] FALSE
[6,] FALSE
[7,] FALSE
[8,] FALSE
[9,] FALSE
[10,] FALSE
[11,] FALSE
[12,] FALSE
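A set of TRUE / FALSE values like this can itself be used to select rows. The sketch below builds the condition from the column vector (census$City) rather than the one-column tibble, since a plain logical vector is the simplest thing to pass as a row index. The name is_paris is made up for this illustration, and the data is rebuilt compactly with rep().

```r
library(tibble)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

is_paris <- census$City == "Paris"   # logical vector, one value per row

sum(is_paris)                # 4, since TRUE counts as 1
paris <- census[is_paris, ]  # keep only the rows where the value is TRUE

nrow(paris)     # 4
paris$year      # 2001 2008 2009 2010
```

This gives the same rows as the filter call shown earlier, written with base R indexing instead.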
New columns can be added to a tibble simply by assigning to a new column name (as you would add an element to a named list):
census["continental"] <- census["City"] != "London"
census
City year pop continental
<chr> <int> <dbl> <lgl>
Paris 2001 2.15 TRUE
Paris 2008 2.21 TRUE
Paris 2009 2.23 TRUE
Paris 2010 2.24 TRUE
London 2001 7.32 FALSE
London 2006 7.66 FALSE
London 2011 8.17 FALSE
London 2015 8.62 FALSE
Rome 2001 2.55 TRUE
Rome 2006 2.63 TRUE
Rome 2009 2.73 TRUE
Rome 2012 2.63 TRUE
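The same column can also be added from a plain vector with [[ ]] or $, or with dplyr's mutate function, which returns a new tibble instead of changing the existing one. This is a sketch of those alternatives rather than the approach used above; it assumes the dplyr package and rebuilds the data compactly with rep(). The name with_mutate is made up for this illustration.

```r
library(tibble)
library(dplyr)

# Same census data as above, built compactly
census <- tibble(
  City = rep(c("Paris", "London", "Rome"), each = 4),
  year = as.integer(c(2001, 2008, 2009, 2010,
                      2001, 2006, 2011, 2015,
                      2001, 2006, 2009, 2012)),
  pop  = c(2.148, 2.211, 2.234, 2.244,
           7.322, 7.657, 8.174, 8.615,
           2.547, 2.627, 2.734, 2.627)
)

# mutate() leaves census untouched and returns a copy with the extra column
with_mutate <- census %>% mutate(continental = City != "London")

# Assigning a logical vector with [[ ]] changes census itself
census[["continental"]] <- census[["City"]] != "London"

names(census)              # "City" "year" "pop" "continental"
sum(census$continental)    # 8 rows are not London

identical(census$continental, with_mutate$continental)   # TRUE
```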
EXERCISE
Create the tibble containing the census data for the three cities.
Select the data for the year 2001. Which city had the smallest population that year?