Tokenising

The first step to analysing text in R is to convert it into a form that is easier to process: tidying the text and arranging it into a tidy tibble.

Let’s start by importing the tidyverse and also the tidytext library.

library(tidyverse)
library(tidytext)

Next, we need to create some text that we will analyse. Let’s start with the first few lines of Hamlet’s famous soliloquy. We will create this as a vector, with each element in the vector being one line of the speech.

text <- c("To be, or not to be--that is the question:",
          "Whether 'tis nobler in the mind to suffer",
          "The slings and arrows of outrageous fortune",
          "Or to take arms against a sea of troubles",
          "And by opposing end them.")

Our next step is to convert this vector into a tibble. We want the tibble to have two columns:

  1. line will hold the line number. There are five lines, so this will be the sequence 1:5.
  2. text will hold each line of text.
text <- tibble(line=1:5, text=text)
text

We now want to rearrange this tibble into a tidy text format. Remember, tidy data is data where;

  1. every variable has its own column,
  2. every observation has its own row,
  3. each value must have its own cell.

So what is the “observation” when analysing text? In tidy text, an observation is called a “token”. A token is the smallest unit of text that you want to analyse. For example, a token could be;

  1. individual letters
  2. individual words
  3. combinations of neighbouring words (n-grams)
  4. individual lines
  5. individual paragraphs

and so on. The token is the smallest unit of text that you need for your analysis, i.e. the atomic unit. It is up to you to decide which token type is appropriate for your analysis. For example, if we were counting the number of times each word appeared in the text, then the token would be a word. If we were analysing average line lengths, then the token would be a line.
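To make this concrete, here is a sketch of how the same line could be tokenised at different granularities. The token argument values below ("characters", "ngrams") are options that unnest_tokens passes through to the underlying tokenizers package:

```r
library(dplyr)
library(tibble)
library(tidytext)

one_line <- tibble(line = 1, text = "To be, or not to be")

# Tokenise by word (the default) - gives one row per word
one_line %>% unnest_tokens(word, text)

# Tokenise by individual character - one row per letter
one_line %>% unnest_tokens(letter, text, token = "characters")

# Tokenise by bigram - one row per pair of neighbouring words
one_line %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
```

Each choice produces a different number of rows from the same input, because each choice defines a different "observation".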

Let’s count the number of times each word appears. For this analysis, the token (and thus observation) is a word. Each observation must have its own row, meaning that we need to transform the tibble so that there is one word per row. But what about the first rule that every variable has its own column? What are the variables for this text?

There are two variables:

  1. The line number, which is in the column line
  2. The actual token (word), which we will put in a column called word.

We therefore need to transform our tibble so that we have two columns; line and word. And each word should be on a separate row.

We could do this manually. Fortunately the tidytext library supplies the function unnest_tokens, which can automatically do this for us.
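For comparison, a rough sketch of the manual approach using only base R string functions is shown below. It lower-cases, strips punctuation and splits on whitespace by hand (note that it handles punctuation slightly differently from unnest_tokens, e.g. words like "'tis" lose their apostrophe):

```r
library(tibble)

# A small two-column tibble like the one above (line number + text)
text <- tibble(line = 1:2,
               text = c("To be, or not to be--that is the question:",
                        "Whether 'tis nobler in the mind to suffer"))

# Lower-case, replace punctuation with spaces, then split each line on whitespace
words <- strsplit(trimws(gsub("[[:punct:]]", " ", tolower(text$text))), "\\s+")

# One word per row, repeating each line number once per word on that line
manual <- tibble(line = rep(text$line, lengths(words)),
                 word = unlist(words))
manual
```

As you can see, even this simple version takes several fiddly steps, which is why letting unnest_tokens do the work is preferable.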

tokens <- text %>% unnest_tokens(word, text)
tokens

By default unnest_tokens will tokenise by words. You can change this by passing different options to the function (as we will do later when we tokenise by n-grams). The first argument is the name of the column in which to place the tokens, while the second argument is the name of the column that contains the text to tokenise.

Note that unnest_tokens has automatically lower-cased the words and removed punctuation.
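If you want to keep the original capitalisation (for example, to pick out proper nouns), unnest_tokens accepts a to_lower argument. A quick sketch:

```r
library(dplyr)
library(tibble)
library(tidytext)

# to_lower = FALSE keeps capitalisation; punctuation is still stripped
tibble(line = 1, text = "To be, or not to be") %>%
  unnest_tokens(word, text, to_lower = FALSE)
```

Bear in mind that if you keep capitalisation, "To" and "to" will be counted as different words, which is usually not what you want when counting word frequencies.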

Now that the text is tidy, it is easy to count the number of occurrences of each word. We can do this using the count function from the tidyverse’s dplyr package. We will pass sort=TRUE so that the resulting tibble is sorted with the most common words at the top.

tokens %>% count(word, sort=TRUE)

Reading text from a file

Tidying the text makes all subsequent analysis significantly easier. But, of course, before we can tidy the text, we have to load it into R. Typing it out into a vector, as we did above, would not be practical for large amounts of text!

Fortunately R comes with many functions to read text from files. The readLines function from core R reads text from a file into a vector of lines. For example, I’ve put the whole of Hamlet’s soliloquy online at https://chryswoods.com/text_analysis_r/hamlet.txt. We can load this via;

lines <- readLines("https://chryswoods.com/text_analysis_r/hamlet.txt")
lines
##  [1] "HAMLET"                                            
##  [2] "A monologue from the play by William Shakespeare"  
##  [3] ""                                                  
##  [4] "To be, or not to be--that is the question:"        
##  [5] "Whether 'tis nobler in the mind to suffer"         
##  [6] "The slings and arrows of outrageous fortune"       
##  [7] "Or to take arms against a sea of troubles"         
##  [8] "And by opposing end them. To die, to sleep--"      
##  [9] "No more--and by a sleep to say we end"             
## [10] "The heartache, and the thousand natural shocks"    
## [11] "That flesh is heir to. 'Tis a consummation"        
## [12] "Devoutly to be wished. To die, to sleep--"         
## [13] "To sleep--perchance to dream: ay, there's the rub,"
## [14] "For in that sleep of death what dreams may come"   
## [15] "When we have shuffled off this mortal coil,"       
## [16] "Must give us pause. There's the respect"           
## [17] "That makes calamity of so long life."              
## [18] "For who would bear the whips and scorns of time,"  
## [19] "Th' oppressor's wrong, the proud man's contumely"  
## [20] "The pangs of despised love, the law's delay,"      
## [21] "The insolence of office, and the spurns"           
## [22] "That patient merit of th' unworthy takes,"         
## [23] "When he himself might his quietus make"            
## [24] "With a bare bodkin? Who would fardels bear,"       
## [25] "To grunt and sweat under a weary life,"            
## [26] "But that the dread of something after death,"      
## [27] "The undiscovered country, from whose bourn"        
## [28] "No traveller returns, puzzles the will,"           
## [29] "And makes us rather bear those ills we have"       
## [30] "Than fly to others that we know not of?"           
## [31] "Thus conscience does make cowards of us all,"      
## [32] "And thus the native hue of resolution"             
## [33] "Is sicklied o'er with the pale cast of thought,"   
## [34] "And enterprise of great pitch and moment"          
## [35] "With this regard their currents turn awry"         
## [36] "And lose the name of action. -- Soft you now,"     
## [37] "The fair Ophelia! -- Nymph, in thy orisons"        
## [38] "Be all my sins remembered."

We will now convert this into a tibble. One column, called line, will have the line number (from 1 to length(lines)), while the lines themselves will be placed into the column called text.

hamlet <- tibble(line=1:length(lines), text=lines)
hamlet

We can now tokenise the text using unnest_tokens again, and then count the word occurrences;

hamlet_tokens <- hamlet %>% unnest_tokens(word, text)
hamlet_tokens %>% count(word, sort=TRUE)

Analysis is made difficult because the text contains lots of short words, like "the", "of" and "and", which form the scaffolding of the sentences without carrying much meaning in and of themselves. These words, often called "stop words", are usually not needed for textual analysis and should be removed. Fortunately the tidytext library provides a data set of English stop words;

data(stop_words)
stop_words

We can remove these stop words from hamlet_tokens by performing an anti-join between hamlet_tokens and stop_words. An anti-join combines two tibbles, returning only the rows in the first tibble that are NOT in the second tibble. We use the anti_join function that is part of dplyr.

important_hamlet_tokens <- hamlet_tokens %>% anti_join(stop_words)
important_hamlet_tokens

Now, when we count the words, we will only get the meaningful words;

important_hamlet_tokens %>% count(word, sort=TRUE)

The top words, “sleep”, “death”, “die”, “life”, are a good insight into the meaning behind this speech.
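Stop-word lists are not one-size-fits-all: for an early-modern text like this you might also want to drop archaic function words. One approach (sketched below; the words "thou", "thee" and "thy" are just illustrative choices) is to append your own rows to stop_words with dplyr's bind_rows before the anti-join:

```r
library(dplyr)
library(tibble)
library(tidytext)

data(stop_words)

# Add custom stop words; the lexicon column just records where they came from
my_stop_words <- stop_words %>%
  bind_rows(tibble(word = c("thou", "thee", "thy"),
                   lexicon = "custom"))
```

You would then use anti_join(my_stop_words) in place of anti_join(stop_words) in the pipeline.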

We can combine all of the above steps into a single pipeline;

hamlet %>% unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort=TRUE)   

Downloading texts

To do more text analysis we need to find some larger texts. Fortunately, the gutenbergr package makes it easy to download books from Project Gutenberg.

library(gutenbergr)

Let's start by downloading "The Adventures of Sherlock Holmes". The Gutenberg ID of this book is 1661 (found via the search engine).

sherlock <- gutenberg_download(1661)
sherlock

We can now perform the same analysis as above, namely tokenising by word, removing stop words, and then counting words to find the most common;

sherlock %>% 
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>% 
  count(word, sort=TRUE)

This has shown all words, including words that only appeared once. We can narrow this down by using the filter function from dplyr to keep only words that appear more than 75 times;

counts <- sherlock %>% 
            unnest_tokens(word, text) %>% 
            anti_join(stop_words) %>% 
            count(word, sort=TRUE) %>%
            filter(n > 75)
counts

To plot this data, we need to convert the words from text into factors. We do this using the core R reorder function, which will order the factors according to the number of occurrences of each word. As we are changing the tibble, we need to use dplyr's mutate to edit the column in place;

counts <- counts %>% mutate(word = reorder(word, n))

The data is now ready to be rendered as a bar graph. We can do this using ggplot, with an aesthetic that places the number of occurrences on the x axis and the word on the y axis, a geom_col() column plot, and no label on the y-axis;

counts %>% ggplot(aes(n, word)) + geom_col() + labs(y=NULL)

EXERCISE

Go onto Project Gutenberg and find a book that you want to analyse. Download the book and create a graph of the most common words (ignoring stop words).

Example answer
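One possible answer is sketched below, using "Pride and Prejudice" (Gutenberg ID 1342). The count threshold of 300 is an arbitrary choice that you would tune by eye for your own book:

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(gutenbergr)

data(stop_words)

# Download the book by its Gutenberg ID (1342 is "Pride and Prejudice";
# use the Project Gutenberg search engine to find the ID of your book)
book <- gutenberg_download(1342)

# Tokenise, remove stop words, count, keep frequent words, and plot
book %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 300) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) + geom_col() + labs(y = NULL)
```

Note that gutenberg_download needs a working internet connection; the rest of the pipeline is identical to the Sherlock Holmes analysis above.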