Sentiment Analysis

Sentiment analysis involves trying to analyse the feelings or sentiments expressed within a piece of text. Sentiments could be interpreted broadly, e.g. with words being classed as “positive” or “negative”, or more nuance could be included, e.g. with words classed as expressing “joy”, “fear”, “sadness” etc.

Sentiment analysis depends on building a dictionary of words and classifying those words according to sentiment. Several such dictionaries have been created, typically via crowdsourcing or other large-scale analysis. The tidytext library provides the get_sentiments function, which loads one of several sentiment dictionaries into a tibble.

Amongst others, tidytext provides access to three general-purpose sentiment dictionaries:

  1. AFINN from Finn Årup Nielsen,
  2. bing from Bing Liu and collaborators, and
  3. nrc from Saif Mohammad and Peter Turney.

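For example, we can download the nrc dictionary using:

# tidyverse provides dplyr, tidyr and ggplot2, which are used throughout
library(tidyverse)
library(tidytext)
# Load the nrc dictionary into a tibble (the first call may ask you to
# install the textdata package and accept the dictionary's license)
sentiments <- get_sentiments("nrc")
sentiments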

You can see that this really is just a dictionary of single words, with each word assigned one or more sentiments (e.g. “sadness”, “anger” etc.).

Note that this assignment of words to sentiments is based on individual words only, and based on a fairly conservative modern usage of those words. For example, “badly” is classified as both “negative” and “sadness”, which, in my opinion, wouldn’t capture the sentiment in sentences such as “I badly want that!”. Equally, “backwards” is classified as “disgust” and “negative”, which is not how I would see it in sentences such as “they skillfully drove backwards around the course”. So, when using these dictionaries, be mindful of how they were created, the linguistic style that they were generated against, and whether or not they would be appropriate for your text. You should also check the license of each dictionary, as different dictionaries are distributed under different licenses, with different obligations on their use (e.g. citation, payment etc.).

Note also that different dictionaries will use different sentiments, and will classify them in different ways. For nrc the sentiments are in the sentiment column. We can get the full list of sentiments and the number of words classified to each one via:

sentiments %>% count(sentiment)
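To see how the set of sentiments differs between dictionaries, the same count can be run on the bing dictionary. This is a minimal sketch; it assumes that bing is bundled with your installed version of tidytext, so that no separate download is needed. The names bing and bing_counts are just illustrative variable names.

```r
library(dplyr)
library(tidytext)

# Load the bing dictionary (assumed to ship with tidytext itself)
bing <- get_sentiments("bing")

# bing uses only two labels, unlike the finer-grained categories in nrc
bing_counts <- bing %>% count(sentiment)
bing_counts
```

AFINN, by comparison, assigns each word a numeric score rather than a label, so it would need to be summarised differently.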

Matching sentiments to words

We can perform a sentiment analysis by attaching a sentiment to every word in a block of text. For example, let’s recreate the tidy text tibble of all of the non-stopwords in the Adventures of Sherlock Holmes:

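library(gutenbergr)
data(stop_words)

# Download the book (Project Gutenberg ID 1661), split it into one word per
# row, then remove common stop words
sherlock <- gutenberg_download(1661) %>%
              unnest_tokens(word, text) %>%
              anti_join(stop_words, by = "word")
sherlock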

We can classify each word by sentiment by joining the tidy text sherlock tibble with the dictionary in sentiments. We do this using dplyr’s inner_join function, which joins two tibbles via their common column (in this case, word). Only words that appear in both tibbles are kept, so words that are not in the dictionary are dropped, and a word is repeated on a new row for each sentiment that the dictionary assigns to it.

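# Join on the shared word column. Recent versions of dplyr may warn about a
# many-to-many relationship here, which is expected for this join.
sherlock_sentiments <- sherlock %>% inner_join(sentiments, by = "word")
sherlock_sentiments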

Here you can see that this has added the sentiment column, and repeated words where they have multiple sentiments, e.g. “scandal” being both “fear” and “negative”.
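The behaviour of the join is easier to see on a tiny hand-made example. This is only a sketch; the tibbles toy_words and toy_dictionary are invented for illustration and are not taken from the real nrc dictionary.

```r
library(dplyr)
library(tibble)

# Invented text: three words, one of which is not in the dictionary
toy_words <- tibble(word = c("scandal", "street", "happy"))

# Invented dictionary: "scandal" carries two sentiments
toy_dictionary <- tibble(
  word      = c("scandal", "scandal", "happy"),
  sentiment = c("fear", "negative", "joy")
)

# Keep only words found in both tibbles, one row per (word, sentiment) pair
toy_joined <- toy_words %>% inner_join(toy_dictionary, by = "word")
toy_joined
```

Here “street” is dropped because it has no entry in the dictionary, while “scandal” appears on two rows, one for each of its sentiments.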

We can then count how many words of different sentiments there are using filter and count.

sherlock_sentiments %>% 
  filter(sentiment=="positive") %>%
  count(word, sort=TRUE)      

You could get the total number of words of a particular sentiment (here “joy”) using:

sherlock_sentiments %>%
  filter(sentiment=="joy") %>%
  count(word) %>% 
  summarise(total=sum(n))

Alternatively (and much more simply) we could just count the number of occurrences of each sentiment in the sentiment column.

sherlock_sentiments %>% count(sentiment)
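As a quick check that the two approaches agree, here is a minimal sketch on a small hand-made tibble. The contents of toy_sentiments are invented for illustration, and joy_total and all_totals are just illustrative variable names.

```r
library(dplyr)
library(tibble)

# Invented data in the same shape as sherlock_sentiments
toy_sentiments <- tibble(
  word      = c("happy", "happy", "smile", "afraid", "happy"),
  sentiment = c("joy", "joy", "joy", "fear", "positive")
)

# Route 1: filter to one sentiment, count each word, then sum the counts
joy_total <- toy_sentiments %>%
  filter(sentiment == "joy") %>%
  count(word) %>%
  summarise(total = sum(n)) %>%
  pull(total)

# Route 2: count every sentiment in a single step
all_totals <- toy_sentiments %>% count(sentiment)

joy_total
all_totals
```

The “joy” row of all_totals should hold the same number as joy_total, with the totals for every other sentiment obtained at the same time.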

We could plot this using a similar technique to the last section (this time using reorder to convert the sentiments into a factor ordered by count):

sherlock_sentiments %>%
  count(sentiment) %>%
  mutate(sentiment = reorder(sentiment, n)) %>%
  ggplot(aes(n, sentiment)) + geom_col() + labs(y=NULL)

Exercise

  1. Perform a similar sentiment analysis for your chosen book from Project Gutenberg. How does the number of words of different sentiments in your book compare to that for the Adventures of Sherlock Holmes? Does this difference match what you would intuitively expect?

Example answer here

Investigating change of sentiment through a text

You can also use sentiment analysis to investigate how sentiment changes throughout a text. To do this, we will first need to add a line number to each word in our text. We can do this by mutating the tibble downloaded from Project Gutenberg and adding a linenumber column which is just the row number in the tibble (obtained via the row_number function).

# Number the lines before tokenising, so every word keeps its line number
sherlock <- gutenberg_download(1661) %>%
              mutate(linenumber = row_number()) %>%
              unnest_tokens(word, text) %>%
              anti_join(stop_words, by = "word")
sherlock

We can now add in the sentiments…

sherlock_sentiments <- sherlock %>%
                         inner_join(sentiments, by = "word")
sherlock_sentiments

What we want to do now is to count the number of words with different sentiments for different blocks of the text. One way to do this is to divide the text up into blocks of, e.g., 80 lines. We can count the number of sentiments in each block, and then track how this count changes from block to block in the text. Note that our choice of block size of 80 lines is arbitrary. It is best to experiment with different block sizes to see what makes sense.

We can divide the text into blocks using linenumber %/% 80, which performs integer division of each line number by 80 (dividing and discarding the remainder). Because row_number starts counting at 1, lines 1 to 79 will be assigned to block 0, lines 80 to 159 to block 1 etc.

sherlock_blocks <- sherlock_sentiments %>% 
                     count(block=linenumber%/%80, sentiment)
sherlock_blocks
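To check how %/% assigns lines to blocks, it can help to try it on a handful of values. This is a small sketch in base R; the values in line_numbers are chosen purely for illustration, to sit either side of the block boundaries.

```r
# Illustrative line numbers around the boundaries of 80-line blocks
line_numbers <- c(1, 79, 80, 159, 160)

# Integer division: which block each line falls into
blocks <- line_numbers %/% 80

# Modulo (the remainder), for comparison: the position within the block
offsets <- line_numbers %% 80

blocks
offsets
```

Note that %/% and %% are different operators. It is %/% that gives the block number, while %% gives the remainder.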

Next, we can pivot the tibble so that the sentiments are the columns, and we have one row per block of text. We do this using the pivot_wider function from the tidyr package. In this case, we want to pivot the tibble so that the names of the new columns come from the text values in the current “sentiment” column, and the values in those columns come from the current “n” column (which contains the number of words of each sentiment). It may be that some sentiments don’t appear in a block, so we need to ensure that any missing sentiments are filled in with a default value of 0.

sherlock_blocks <- sherlock_blocks %>% 
                     pivot_wider(names_from = sentiment, 
                                 values_from = n, 
                                 values_fill = 0)
sherlock_blocks
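The effect of values_fill is easiest to see on a tiny example. This is only a sketch; the counts in toy_counts are invented, and block 1 deliberately has no “negative” row.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented long-format counts: block 1 contains no negative words
toy_counts <- tibble(
  block     = c(0, 0, 1),
  sentiment = c("positive", "negative", "positive"),
  n         = c(3, 1, 2)
)

# One row per block, one column per sentiment, missing counts set to 0
toy_wide <- toy_counts %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0)
toy_wide
```

Without values_fill = 0, the negative count for block 1 would be NA, and any arithmetic on that column would also give NA.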

We can now do things like calculate the difference between the number of positive and negative sentiments in each block, which could be placed into a new column called “sentiment”;

sherlock_blocks <- sherlock_blocks %>%
                     mutate(sentiment = positive - negative)
sherlock_blocks

This could then be graphed, with the “sentiment” column on the y axis, and the block number on the x axis;

sherlock_blocks %>% ggplot(aes(block, sentiment)) + geom_col()
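Since the block size is arbitrary, one way to experiment is to wrap the counting, pivoting and subtraction steps in a function that takes the block size as an argument. The following is a sketch under stated assumptions: net_sentiment is a hypothetical helper (not part of tidytext), toy_sentiments is invented data, and the function assumes that both “positive” and “negative” occur at least once in the data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented data: one row per dictionary-matched word, with its line number
toy_sentiments <- tibble(
  linenumber = c(1, 2, 3, 4, 5, 6, 7, 8),
  sentiment  = c("positive", "negative", "positive", "positive",
                 "negative", "negative", "negative", "positive")
)

# Hypothetical helper: positive minus negative count per block
net_sentiment <- function(data, block_size) {
  data %>%
    filter(sentiment %in% c("positive", "negative")) %>%
    count(block = linenumber %/% block_size, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    mutate(sentiment = positive - negative)
}

small_blocks <- net_sentiment(toy_sentiments, 2)
large_blocks <- net_sentiment(toy_sentiments, 4)
small_blocks
large_blocks
```

Smaller blocks give more rows and a noisier picture, while larger blocks smooth out the variation. The overall balance of positive and negative words is the same either way.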

This shows that the Adventures of Sherlock Holmes was more positive at the beginning, but had more periods of negativity as the text progressed.

You can also look at the trajectory of specific sentiments, e.g.

sherlock_blocks %>% ggplot(aes(block, joy)) + geom_col()

Exercise

  1. Draw a graph of the difference between positive and negative sentiments for your chosen book from Project Gutenberg.

Example answer here

  2. Explore different numbers of lines within each block. How does this affect the graph?

Example answer here