At the beginning of this workshop we looked at tokenising texts. The “token” is the smallest unit of text that you want to analyse, e.g. it could be a word, a line, a paragraph etc.
It is very common to want to tokenise using tokens that represent pairs of words. This is because words correlate with one another, and the choice of first word can significantly change the meaning of the second word. For example, putting “not” before “happy” significantly changes the meaning! A pair of words is called a “bigram”. More generally, a token comprising n words is called an “n-gram” (or “ngram”); for example, the sentence “I am not happy” contains the bigrams “I am”, “am not” and “not happy”. Tokenising on bigrams or n-grams enables you to examine these correlations and, more importantly, to capture the immediate context around each word.
It is very simple to use the unnest_tokens function to tokenise by n-grams. Let’s start by first loading up The Adventures of Sherlock Holmes;
library(tidyverse)
library(tidytext)
library(gutenbergr)
sherlock <- gutenberg_download(1661) %>%
mutate(linenumber=row_number())
The unnest_tokens function accepts an argument token that specifies the type of token to use. This can have a range of values, including but not limited to;
- words: tokenise based on words (this is the default)
- ngrams: tokenise on n-grams. Use n to specify the number of words
- lines: tokenise by lines
- sentences: tokenise by sentences
- tweets: tokenise by word, preserving usernames, hashtags and URLs

To tokenise by bigrams, we set token equal to ngrams and n equal to 2.
sherlock_bigrams <- sherlock %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
sherlock_bigrams
In this case we have placed the bigrams into the column bigram.
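The other values of token listed above work in exactly the same way. For example, here is a quick sketch (not needed for the rest of this workshop) that tokenises the same book by sentence;
sherlock_sentences <- sherlock %>%
unnest_tokens(sentence, text, token = "sentences")
sherlock_sentences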
Now we want to remove pairs of words where one in the pair is a stop word. To do this we must first separate the bigram column into two, which we will call word1 and word2. Note that the separator of a bigram is always a space, so sep = " " is safe to use.
sherlock_bigrams <- sherlock_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
sherlock_bigrams
In this case we have placed the two words into the columns word1 and word2 respectively. Next we need to keep only the rows where neither word1 nor word2 is a stop word. We can’t use anti_join directly, so instead we have to apply filter twice: first keeping rows where word1 is not in the word column of stop_words, and then keeping rows where word2 is not in the word column of stop_words. Note that %in% is the operator that tests whether a value appears in a column (vector) of values.
data(stop_words)
sherlock_bigrams <- sherlock_bigrams %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
sherlock_bigrams
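To see what %in% is doing, here is a quick standalone illustration (the two test words here are just examples);
# "the" is in the word column of stop_words, "baker" is not
c("the", "baker") %in% stop_words$word
## [1]  TRUE FALSE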
Next, we can see that there are a lot of NAs in the tibble. These are caused by blank lines or other tokenisation issues. We can remove them by filtering twice again, including rows only where word1 is not NA, and where word2 is not NA.
sherlock_bigrams <- sherlock_bigrams %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2))
sherlock_bigrams
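If you want to reassure yourself that all of the NAs have gone, a quick sanity check is to count them (this should print 0);
sum(is.na(sherlock_bigrams$word1) | is.na(sherlock_bigrams$word2))
## [1] 0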
Finally(!) we will rejoin the word1 and word2 columns into a single “bigram” column, using the “unite” function. In this case, we join using a single space (sep=" "), as this is the default separator for n-grams.
sherlock_bigrams <- sherlock_bigrams %>%
unite(bigram, word1, word2, sep=" ")
sherlock_bigrams
Note that we could have written all of the above steps as a single pipeline;
sherlock_bigrams <- sherlock %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
unite(bigram, word1, word2, sep=" ")
sherlock_bigrams
Also note that you would follow a similar process, but including word3, word4 etc., if you wanted to examine n-grams with a larger number of words.
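For example, a sketch of the equivalent pipeline for trigrams (tokens of three words) would look like this;
sherlock_trigrams <- sherlock %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
filter(!is.na(word3)) %>%
unite(trigram, word1, word2, word3, sep = " ")
sherlock_trigrams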
We can now count the bigrams as before, e.g.
bigram_counts <- sherlock_bigrams %>% count(bigram, sort=TRUE)
bigram_counts
Counting bigrams is a very good way of finding the names of important people and places in a text, plus commonly used descriptive phrases (e.g. “speckled band”). This is because the two words that make up a name or common description are strongly correlated, so the pair appears together many times and rises to the top of the counts.
We could examine this in more detail. For example, we know that all of the names of important streets in the book will be called “something st” or “something street”. We could find these by filtering on word2, using a regular expression to look for st or street;
sherlock_streets <- sherlock %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
filter(str_detect(word2, "^st(reet)?$")) %>%
unite(bigram, word1, word2, sep=" ") %>%
count(bigram, sort=TRUE)
sherlock_streets
(see how the regular expression “^st(reet)?$” enabled us to match both “st” and “street” in a single line: the ^ and $ anchor the pattern to the whole word, and (reet)? makes the “reet” optional).
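You can convince yourself of this by testing the regular expression on a few example words using str_detect;
# only exact matches of "st" or "street" are TRUE
str_detect(c("st", "street", "stone", "first"), "^st(reet)?$")
## [1]  TRUE  TRUE FALSE FALSE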
We can now easily see that the most important street in The Adventures of Sherlock Holmes is Baker Street. Be careful with “lord st” and “neville st” though: these are not streets, but come from the character names “Lord St. Simon” and “Neville St. Clair”, where “St.” abbreviates “Saint”. Regular expression matches always deserve a quick manual sanity check.
We can go beyond just counting, and can draw graphs that visualise the connections and correlations between words in a text. To do this, we tokenise on bigrams as before, but this time we do not unite the individual word1 and word2 back into a single bigram column. Instead, we leave word1 and word2 as separate columns, and then ask count to count the number of times each word1-word2 pair appears in the text;
bigram_counts <- sherlock %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2)) %>%
count(word1, word2, sort=TRUE)
bigram_counts
This table can now be converted into a directed graph. A directed graph is one where nodes (in this case individual words) are connected via edges (in this case, an edge records how many times word1 is followed by word2). The graph_from_data_frame function from the igraph package can create a directed graph from a tibble.
To prevent the graph becoming too large, we will only graph pairs of words that appear seven or more times (by filtering on n >= 7).
library(igraph)
bigram_graph <- bigram_counts %>%
filter(n >= 7) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 0728adc DN-- 51 30 --
## + attr: name (v/c), n (e/n)
## + edges from 0728adc (vertex names):
## [1] sherlock->holmes st ->simon baker ->street lord ->st
## [5] red ->headed st ->clair hosmer ->angel miss ->hunter
## [9] neville ->st irene ->adler miss ->stoner stoke ->moran
## [13] briony ->lodge copper ->beeches remarked->holmes scotland->yard
## [17] boscombe->pool dear ->fellow frock ->coat henry ->baker
## [21] holmes ->sat jabez ->wilson swandam ->lane dr ->grimesby
## [25] fuller's->earth headed ->league holmes ->laughing lysander->stark
## [29] opium ->den orange ->pips
The graph_from_data_frame function takes a tibble where the first two columns name the nodes and specify their connections, while the third (and subsequent) columns provide edge attributes (e.g. here the weight of an edge is the number of times that pair of words appears).
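For example, a minimal sketch using a made-up tibble of counts shows the expected shape of the input;
toy_counts <- tibble(word1 = c("sherlock", "baker"),
word2 = c("holmes", "street"),
n = c(10, 5))
graph_from_data_frame(toy_counts)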
Now that we have built the directed graph, we can visualise it using the ggraph function from the ggraph package.
There is a grammar for these graphs, similar to that of ggplot2: we pass the data into ggraph, and then add layers, such as geom_edge_link to draw the edges, geom_node_point to draw the nodes (words), and geom_node_text to add text to each node, where the label comes from the name of each node.
library(ggraph)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
This graph isn’t very pretty. By looking at the options available in ggraph, and comparing against graphs you like online (e.g. this one), we can improve this to;
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = grid::arrow(type = "closed", length = unit(2, "mm")),
end_cap = circle(1, "mm")) +
geom_node_point(color = "lightblue", size = 2) +
geom_node_text(aes(label = name), size = 2) +
theme_void()
The arrows show the direction of each bigram (which word is followed by which), with the transparency of the arrows showing the number of times each bigram was used. We can clearly see that “Sherlock Holmes” is a heavily-used bigram. But we can also get more information about surrounding words, e.g. following the arrows shows that “holmes sat”, “holmes laughing” and “remarked holmes” were also common pairs. This has also surfaced extra information, e.g. we have found the Red Headed League!
Also, with this finer control we can reduce the font sizes and other details sufficiently that we can look at more bigram associations, e.g.
bigram_graph <- bigram_counts %>%
filter(n >= 3) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = grid::arrow(type = "closed", length = unit(1, "mm")),
end_cap = circle(0.5, "mm")) +
geom_node_point(color = "lightblue", size = 0.5) +
geom_node_text(aes(label = name), size = 1) +
theme_void()
Cool? :-)
How useful is this kind of graph to help you understand the contents of the text?