Topic 17 Text Mining using R
“We Facebook users have been building a treasure lode of big data that government and corporate researchers have been mining to predict and influence what we buy and for whom we vote. We have been handing over to them vast quantities of information about ourselves and our friends, loved ones and acquaintances” - Douglas Rushkoff
17.1 Introduction to Text Mining
Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Welbers et al. (2017) provide a gentle introduction to text analytics using R.
Text mining has gained momentum and is used in analytics worldwide, for example in:
- Sentiment Analysis
- Predicting Stock Market and other Financial Applications
- Customer Influence
- News Analytics
- Social Network Analysis
- Customer Service and Help Desk
17.1.1 Text Data
Text data is ubiquitous in social media analytics. It comes from traditional media, social media, survey data, and numerous other sources.
- Twitter, Facebook, Surveys, Reported Data (Incident Reports)
The modern information age produces a massive quantity of text.
The mounting availability of, and interest in, text data has driven the development of a variety of statistical approaches for analysing it.
17.1.2 Generic Text Mining System
- The following figure shows a generic text mining system (Source: Feldman & Sanger (2007))
knitr::include_graphics("fig-2.png")
17.2 Mining Twitter Text Data using R
Twitter is one of the most popular social media platforms for information sharing. (Example tweets from the rforresearch account were embedded here.)
17.2.1 Obtaining Twitter Data
Twitter provides API access to its data feed.
Users are required to create an app and obtain access credentials for the API.
See here https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html for a detailed introduction on obtaining API credentials.
Data access is limited in various ways (days, size of data etc.).
See here https://developer.twitter.com/en/docs.html for full documentation.
R packages such as twitteR and rtweet provide a programming interface for accessing the data.
Twitter doesn't allow sharing of the raw data, but the API setup is straightforward. Twitter does allow sharing of the tweet IDs (available in the data folder).
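Since access is limited, it can help to check the remaining quota for the search endpoint before a large download. Below is a minimal sketch using rtweet's rate_limit() function (it assumes API credentials have already been configured as described above; the returned column names may vary across rtweet versions).
library(rtweet)
# remaining requests for the search endpoint under the current token
rl <- rate_limit(query = "search/tweets")
rl[, c("query", "limit", "remaining", "reset")]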
17.3 Download Data
- We will download tweets with ‘#auspol’, a popular hashtag used in Australia to discuss current affairs and socio-political issues
- The example here uses the rtweet package to download the data
- After setting up the API, create the token
- search_tweets can be used without creating the token; it will create a token using the default app for your Twitter account
library(rtweet)
token <- create_token(app = "your_app_name", consumer_key = "your_consumer_key",
    consumer_secret = "your_consumer_secret", access_token = "your_access_token",
    access_secret = "your_access_secret")
- Use the search_tweets function to download the data, convert it to a data frame and save it
rt = search_tweets(q = "#auspol", n = 15000, type = "recent", include_rts = FALSE,
    since = "2020-09-14")
rt2 = as.data.frame(rt)
saveRDS(rt2, file = "tweets_auspol.rds")
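Before pre-processing, a quick look at what the query returned is worthwhile. A small sketch follows (the created_at, screen_name and text columns follow the pre-1.0 rtweet return format; adjust the names if your rtweet version differs).
# quick sanity checks on the downloaded tweets
nrow(rt2)  # number of tweets returned
range(rt2$created_at)  # date range actually covered
head(rt2[, c("created_at", "screen_name", "text")], 3)  # a few example rows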
17.4 Data Pre-processing
We will conduct data pre-processing at this stage.
Some common steps (depending on the type of problem analysed):
- Create a corpus
- Change encoding
- Convert to lower case
- Remove hashtags
- Remove URLs
- Remove @ mentions
- Remove punctuations
- Remove stop words
- Stemming can also be conducted (avoided in this example; a short sketch is given at the end of this section)
library(tm)
rt = readRDS("tweets_auspol.rds")

# encoding
rt$text <- sapply(rt$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# build a corpus, and specify the source to be character vectors
myCorpus = Corpus(VectorSource(rt$text))

# convert to lower case
myCorpus = tm_map(myCorpus, content_transformer(tolower))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)

# remove URLs
removeURL = function(x) gsub("http[^[:space:]]*", "", x)
removeURLs = function(x) gsub("https[^[:space:]]*", "", x)

# remove hashtags
removehash = function(x) gsub("#\\S+", "", x)

# remove @ mentions
removeats <- function(x) gsub("@\\w+", "", x)

# remove numbers and punctuations
removeNumPunct = function(x) gsub("[^[:alpha:][:space:]]*", "", x)

# leading and trailing white spaces
wspace1 = function(x) gsub("^[[:space:]]*", "", x)  ## remove leading whitespaces
wspace2 = function(x) gsub("[[:space:]]*$", "", x)  ## remove trailing whitespaces
wspace3 = function(x) gsub(" +", " ", x)  ## remove extra whitespaces

# remove the string 'im'
removeIms <- function(x) gsub("im", "", x)

myCorpus = tm_map(myCorpus, content_transformer(removeURL))  # urls
myCorpus = tm_map(myCorpus, content_transformer(removeURLs))  # urls
myCorpus <- tm_map(myCorpus, content_transformer(removehash))  # hashtags
myCorpus <- tm_map(myCorpus, content_transformer(removeats))  # mentions
myCorpus = tm_map(myCorpus, content_transformer(removeNumPunct))  # numbers and punctuation (just in case some are left over)
myCorpus = tm_map(myCorpus, content_transformer(removeIms))  # ims
myCorpus = tm_map(myCorpus, content_transformer(wspace1))
myCorpus = tm_map(myCorpus, content_transformer(wspace2))
myCorpus = tm_map(myCorpus, content_transformer(wspace3))  # other white spaces

# remove extra whitespace
myCorpus = tm_map(myCorpus, stripWhitespace)

# remove stopwords (plus some extra, domain-specific ones)
myStopwords = c(stopwords("english"), stopwords("SMART"), "rt", "ht", "via", "amp",
    "the", "australia", "australians", "australian", "auspol")
myCorpus = tm_map(myCorpus, removeWords, myStopwords)

# generally a good idea to save the processed corpus now
save(myCorpus, file = "auspol_sep.RData")

# convert the corpus back to a data frame with the tweet ID and date
data_tw2 = data.frame(text = get("content", myCorpus), row.names = NULL)
data_tw2 = cbind(data_tw2, ID = rt$status_id)
data_tw2 = cbind(Date = as.Date(rt$created_at), data_tw2)

# look at the data frame; still some white space left, so let's get rid of it
data_tw2$text = gsub("\r?\n|\r", "", data_tw2$text)
data_tw2$text = gsub(" +", " ", data_tw2$text)
head(data_tw2)

# save data
saveRDS(data_tw2, file = "processed_data.rds")
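The pre-processing list above noted that stemming was avoided in this example. If it were wanted, it could be applied to the same corpus with tm's stemDocument transformation; the minimal sketch below assumes the SnowballC package is installed and uses a new object name (myCorpusStemmed) so the unstemmed corpus is kept.
# optional: stem the corpus (not applied in this example)
library(SnowballC)
myCorpusStemmed <- tm_map(myCorpus, stemDocument)
inspect(myCorpusStemmed[1:3])  # look at a few stemmed documents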
17.5 Some Visualisation
- Bar Chart of top words
library(tm)
load("auspol_sep.RData")

# Build TDM
tdm = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
m = as.matrix(tdm)
word.freq = sort(rowSums(m), decreasing = T)

# plot term freq
term.freq1 = rowSums(as.matrix(tdm))
term.freq = subset(term.freq1, term.freq1 >= 50)
df = data.frame(term = names(term.freq), freq = term.freq)
df = transform(df, term = reorder(term, freq))

library(ggplot2)
library(ggthemes)
m2 = ggplot(head(df, n = 20), aes(x = reorder(term, -freq), y = freq)) + geom_bar(stat = "identity",
    aes(fill = term)) + theme(legend.position = "none") + ggtitle("Top 20 words in tweets #auspol \n (14 Sep to 20 Sep 2020)") +
    theme(axis.text = element_text(size = 12, angle = 90, face = "bold"), axis.title.x = element_blank(),
        title = element_text(size = 15))
m2 = m2 + xlab("Words") + ylab("Frequency") + theme_wsj() + theme(legend.position = "none",
    text = element_text(face = "bold", size = 10))
m2
- Word Cloud 1
library(wordcloud)
library(RColorBrewer)
pal = brewer.pal(7, "Dark2")
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 5, max.words = 1000,
    random.order = F, colors = pal)
- Word Cloud 2
library(wordcloud2)
# some data re-arrangement
term.freq2 = data.frame(word = names(term.freq1), freq = term.freq1)
term.freq2 = term.freq2[term.freq2$freq > 5, ]
wordcloud2(term.freq2)
The word cloud reflects the discussion around the Federal Government's new Energy Policy, announced during the period covered by the data.
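To check what sits behind a prominent word in the cloud, term associations can be pulled from the same term-document matrix. A minimal sketch follows (the term "energy" and the 0.2 correlation limit are only illustrative choices):
# terms most correlated with 'energy' in the TDM built above
findAssocs(tdm, terms = "energy", corlimit = 0.2)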
17.6 Sentiment Analysis
Classification of Sentiment Analysis Methods
Figure 17.6 shows the various classifications of sentiment analysis methods (Collomb, Costea, Joyeux, Hasan, & Brunie (2014)).
17.6.1 Method
A lexicon (or dictionary) based method is used in the following illustration.
Sentence level classification
Eight emotions to classify: “anger, fear, anticipation, trust, surprise, sadness, joy, and disgust”
Lexicon Used: NRC Emotion Lexicon.
See http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
See Mohammad & Turney (2010) for more details on the lexicon.
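Before scoring all tweets, it can help to see what the lexicon returns for a single sentence. A small illustration on a made-up sentence follows; the output is a data frame with one column per NRC emotion plus negative and positive counts.
library(syuzhet)
# counts of NRC emotion words in one toy sentence
get_nrc_sentiment("the new policy is a disaster and people are angry")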
# required libraries sentiment analysis
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
library(qdap)

# convert data to dataframe for analysis
data_sent = readRDS("processed_data.rds")
data_sent = data_sent[!apply(data_sent, 1, function(x) any(x == "")), ]  # remove rows with empty values
data_sent = data_sent[wc(data_sent$text) > 4, ]  # more than 4 words per tweet
tw2 = data_sent$text
- Conduct Sentiment Analysis and Visualise
mySentiment = get_nrc_sentiment(tw2)  # this can take some time
tweets_sentiment = cbind(data_sent, mySentiment)
# save the results
saveRDS(tweets_sentiment, file = "tw_sentiment.rds")
- Plot
sentimentTotals = data.frame(colSums(tweets_sentiment[, c(4:13)]))
names(sentimentTotals) = "count"
sentimentTotals = cbind(sentiment = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) = NULL
# plot
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) + geom_bar(aes(fill = sentiment),
stat = "identity") + theme(legend.position = "none") + xlab("Sentiment") + ylab("Total Count") +
ggtitle("Sentiment Score for Sample Tweets") + theme_minimal() + theme(axis.text = element_text(size = 15,
face = "bold"))
17.7 Topic Modelling
This section uses the biterm topic modelling method to demonstrate a topic modelling exercise.
Biterm Topic Modelling (BTM) (Yan, Guo, Lan, & Cheng (2013)) is useful for short texts such as the Twitter data in this example.
- BTM is a word co-occurrence based topic model that learns topics by modelling word-word patterns (biterms)
- BTM models biterm occurrences in a corpus; for example, the short text "energy policy debate" contains the biterms (energy, policy), (energy, debate) and (policy, debate)
- More details here https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
- A good example here http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts
# load packages and rearrange data
library(udpipe)
library(data.table)
library(stopwords)
library(BTM)
library(textplot)
library(ggraph)

# rearrange to get doc id
data_tm = data_sent[, c(3, 2)]
colnames(data_tm)[1] = "doc_id"

# use parts of speech (nouns, adjectives, verbs) for TM. The method is
# computationally intensive and can take several minutes.
anno <- udpipe(data_tm, "english", trace = 1000)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma, relevant = upos %in% c("NOUN", "ADJ",
    "VERB") & nchar(lemma) > 2 & !lemma %in% stopwords("en"), skipgram = 3), by = list(doc_id)]

set.seed(999)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "VERB") & !lemma %in% stopwords("en") &
    nchar(lemma) > 2)
traindata <- traindata[, c("doc_id", "lemma")]

# fit 10 topics (other parameters are mostly default)
model <- BTM(traindata, biterms = biterms, k = 10, iter = 2000, background = FALSE,
    trace = 2000)

# extract biterms for plotting
biterms1 = terms(model, type = "biterms")$biterms

# The model, biterms and biterms1 were saved to create the plot in this markdown
# document.
- Plot the topics with the top 20 terms, labelled by topic proportion
plot(model, subtitle = "#auspol 14-20 Sep 2020", biterms = biterms1, labels = paste(round(model$theta *
100, 2), "%", sep = ""), top_n = 20)
- Other analyses which can be conducted include clustering analysis, co-word clusters, network analysis, etc. Other topic modelling methods can also be implemented.