Topic 17 Text Mining using R
“We Facebook users have been building a treasure lode of big data that government and corporate researchers have been mining to predict and influence what we buy and for whom we vote. We have been handing over to them vast quantities of information about ourselves and our friends, loved ones and acquaintances”-Douglas Rushkoff
17.1 Introduction to Text Mining
Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Welbers et al. (2017) provides a mild introduction to Text Analytics using R
Text mining has gained momentum and is used in analytics worldwide
Predicting Stock Market and other Financial Applications
Social Network Analysis
Customer Service and Help Desk
17.1.1 Text Data
Text data is ubiquitous in social media analytics.
Traditional media, social media, survey data, and numerous other sources.
- Twitter, Facebook, Surveys, Reported Data (Incident Reports)
Massive quantity of text in the modern information age.
The mounting availability of and interest in text data has been the development of a variety of statistical approaches for analysing this data.
17.1.2 Generic Text Mining System
- Following figure demonstrates a general text mining system (Source: Feldman & Sanger (2007))
17.2 Mining Twitter Text Data using RTwitter is one of the most popular social media platform for information sharing. Some examples from rforresearch tweets Follow (rforresearch?) Tweets by rforresearch
17.2.1 Obtaining Twitter Data
Twitter provides an API access for their data feed.
User is required to create an app and obtain access codes for the API.
See here https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html for a detailed introduction on obtaining API credentials.
Data access is limited in various ways (days, size of data etc.).
See here https://developer.twitter.com/en/docs.html for full documentation.
R provides packages like twitteR and rtweet provide programming interface using R to access the data.
Twitter doesnt allow data sharing but the API setup is straight forward. Twitter does allow sharing of the tweet IDs (available in the data folder)
17.3 Download Data
- We will download tweets with ‘#auspol’; a popular hashtag used in Australia to talk about current affairs and socio-political issues
- Example here uses rtweet package to download data
- After setting up the API
- Create the token
search_tweetscan be used without creating the token, it will create a token using the default app for your Twitter account
library(rtweet) <- create_token(app = "your_app_name", consumer_key = "your_consumer_key", token consumer_secret = "your_consumer_secret", access_token = "your_access_token", access_secret = "your_access_secret")
- Use the
search_tweetsfunction to download the data, convert to data frame and save
= search_tweets(q = "#auspol", n = 15000, type = "recent", include_rts = FALSE, rt since = "2020-09-14") = as.data.frame(rt) rt2 saveRDS(rt2, file = "tweets_auspol.rds")
17.4 Data Pre-processing
We will conduct data pre-processing in this stage
Some steps (depend on the types of problems analysed)
- Create a corpus
- Change encoding
- Convert to lower case
- Remove hashtags
- Remove URLs
- Remove @ mentions
- Remove punctuations
- Remove stop words
- Stemming can also be conducting (avoided in this example)
library(tm) = readRDS("tweets_auspol.rds") rt # encoding $text <- sapply(rt$text, function(row) iconv(row, "latin1", "ASCII", sub = "")) rt # build a corpus, and specify the source to be character vectors = Corpus(VectorSource(rt$text)) myCorpus # convert to lower case = tm_map(myCorpus, content_transformer(tolower)) myCorpus # remove punctuation <- tm_map(myCorpus, removePunctuation) myCorpus # remove numbers <- tm_map(myCorpus, removeNumbers) myCorpus # remove URLs = function(x) gsub("http[^[:space:]]*", "", x) removeURL = function(x) gsub("https[^[:space:]]*", "", x) removeURLs # remove hashtags = function(x) gsub("#\\S+", "", x) removehash # remove @ <- function(x) gsub("@\\w+", "", x) #only removes the '@' removeats # remove numbers and punctuations = function(x) gsub("[^[:alpha:][:space:]]*", "", x) removeNumPunct # leading and trailing white spaces = function(x) gsub("^[[:space:]]*", "", x) ## Remove leading whitespaces wspace1 = function(x) gsub("[[:space:]]*$", "", x) ## Remove trailing whitespaces wspace2 = function(x) gsub(" +", " ", x) ## Remove extra whitespaces wspace3 <- function(x) gsub("im", "", x) removeIms = tm_map(myCorpus, content_transformer(removeURL)) #url myCorpus = tm_map(myCorpus, content_transformer(removeURLs)) #url myCorpus <- tm_map(myCorpus, content_transformer(removehash)) #hashtag myCorpus <- tm_map(myCorpus, content_transformer(removeats)) #mentions myCorpus = tm_map(myCorpus, content_transformer(removeNumPunct)) #number and punctuation (just in case some are left over) myCorpus = tm_map(myCorpus, content_transformer(removeIms)) #Ims myCorpus = tm_map(myCorpus, content_transformer(wspace1)) myCorpus = tm_map(myCorpus, content_transformer(wspace2)) myCorpus = tm_map(myCorpus, content_transformer(wspace3)) #other white spaces myCorpus # remove extra whitespace = tm_map(myCorpus, stripWhitespace) myCorpus # remove extra stopwords = c(stopwords("English"), stopwords("SMART"), "rt", "ht", "via", "amp", myStopwords "the", "australia", "australians", "australian", "auspol") = tm_map(myCorpus, removeWords, myStopwords) myCorpus # generally a good idea to save the processed corpus now save(myCorpus, file = "auspol_sep.RData") = data.frame(text = get("content", myCorpus), row.names = NULL) data_tw2 = cbind(data_tw2, ID = rt$status_id) data_tw2 = cbind(Date = as.Date(rt$created_at), data_tw2) data_tw2 # look at the data frame, still some white spaces left so let's get rid of them $text = gsub("\r?\n|\r", "", data_tw2$text) data_tw2$text = gsub(" +", " ", data_tw2$text) data_tw2head(data_tw2) # save data saveRDS(data_tw2, file = "processed_data.rds")
17.5 Some Visualisation
- Bar Chart of top words
library(tm) load("auspol_sep.RData") # Build TDM = TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf))) tdm = as.matrix(tdm) m = sort(rowSums(m), decreasing = T) word.freq # plot term freq = rowSums(as.matrix(tdm)) term.freq1 = subset(term.freq1, term.freq1 >= 50) term.freq = data.frame(term = names(term.freq), freq = term.freq) df = transform(df, term = reorder(term, freq)) df library(ggplot2) library(ggthemes) = ggplot(head(df, n = 20), aes(x = reorder(term, -freq), y = freq)) + geom_bar(stat = "identity", m2 aes(fill = term)) + theme(legend.position = "none") + ggtitle("Top 20 words in tweets #auspol \n (14 Sep to 20 Sep 2020)") + theme(axis.text = element_text(size = 12, angle = 90, face = "bold"), axis.title.x = element_blank(), title = element_text(size = 15)) = m2 + xlab("Words") + ylab("Frequency") + theme_wsj() + theme(legend.position = "none", m2 text = element_text(face = "bold", size = 10)) m2
- Word Cloud 1
library(wordcloud) library(RColorBrewer) = brewer.pal(7, "Dark2") pal wordcloud(words = names(word.freq), freq = word.freq, min.freq = 5, max.words = 1000, random.order = F, colors = pal)
- Word Cloud 2
library(wordcloud2) # some data re-arrangement = data.frame(word = names(term.freq1), freq = term.freq1) term.freq2 = term.freq2[term.freq2$freq > 5, ] term.freq2 # figure-3.4 wordcloud2(term.freq2)
The word cloud reflects the discussion around the Federal Government’s new Energy Policy announced during the data time period
17.6 Sentiment Analysis
Classification of Sentiment Analysis Methods
17.6 shows various classification of sentiment analysis methods (Collomb, Costea, Joyeux, Hasan, & Brunie (2014))
Lexicon (or Dictionary) based method used in the following illustration.
Sentence level classification
Eight emotions to classify: “anger, fear, anticipation, trust, surprise, sadness, joy, and disgust”
Lexicon Used: NRC Emotion Lexicon.
See Mohammad & Turney (2010) for more details on the lexicon.
# required libraries sentiment analysis library(syuzhet) library(lubridate) library(ggplot2) library(scales) library(reshape2) library(dplyr) library(qdap) # convert data to dataframe for analysis = readRDS("processed_data.rds") data_sent = data_sent[!apply(data_sent, 1, function(x) any(x == "")), ] #remove rows with empty values data_sent = data_sent[wc(data_sent$text) > 4, ] #more than 4 words per tweet data_sent = data_sent$texttw2
- Conduct Sentiment Analysis and Visualise
= get_nrc_sentiment(tw2) #this can take some time mySentiment = cbind(data_sent, mySentiment) tweets_sentiment # save the results saveRDS(tweets_sentiment, file = "tw_sentiment.rds")
= data.frame(colSums(tweets_sentiment[, c(4:13)])) sentimentTotals names(sentimentTotals) = "count" = cbind(sentiment = rownames(sentimentTotals), sentimentTotals) sentimentTotals rownames(sentimentTotals) = NULL # plot ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) + geom_bar(aes(fill = sentiment), stat = "identity") + theme(legend.position = "none") + xlab("Sentiment") + ylab("Total Count") + ggtitle("Sentiment Score for Sample Tweets") + theme_minimal() + theme(axis.text = element_text(size = 15, face = "bold"))
17.7 Topic Modelling
This section will use bi-term topic modelling method to demonstrate topic modelling exercise.
Biterm Topic Modelling (BTM) (Yan, Guo, Lan, & Cheng (2013)) is useful for short text like the twitter data we have in this example.
- A BTM is a word co-occurance based topic model that learns topics by modelling word-word patterns (biterms)
- BTM models biterm occurences in a corpus
- More details here https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
- A good example here http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts
# load packages and rearrange data library(udpipe) library(data.table) library(stopwords) library(BTM) library(textplot) library(ggraph) # rearrange to get doc id = data_sent[, c(3, 2)] data_tm colnames(data_tm) = "doc_id"
# use parts of sentence (Nouns, Adjectives, Verbs for TM) Method is # computationally intensive and can take several minutes. <- udpipe(data_tm, "english", trace = 1000) anno <- as.data.table(anno) biterms <- biterms[, cooccurrence(x = lemma, relevant = upos %in% c("NOUN", "ADJ", biterms "VERB") & nchar(lemma) > 2 & !lemma %in% stopwords("en"), skipgram = 3), by = list(doc_id)] set.seed(999) <- subset(anno, upos %in% c("NOUN", "ADJ", "VERB") & !lemma %in% stopwords("en") & traindata nchar(lemma) > 2) <- traindata[, c("doc_id", "lemma")] traindata # fit 10 topics (other parameters are mostly default) <- BTM(traindata, biterms = biterms, k = 10, iter = 2000, background = FALSE, model trace = 2000) # extract biterms for plotting = terms(model, type = "biterms")$biterms biterms1 # The model, biterms, biterms1 were saved to create the plot in this markdown # document.
- Plot the topics with 20 terms and labelled by the proportion
plot(model, subtitle = "#auspol 14-20 Sep 2020", biterms = biterms1, labels = paste(round(model$theta * 100, 2), "%", sep = ""), top_n = 20)
- Other analysis which can be conducted may include, clustering analysis, co-word clusters, network analysis etc. Other Topic Modelling methods can also be implemented.