By Dayne Sorvisto
June 13th, 2016.
Over a century ago there was a gold rush in the Klondike. People were excited about the prospect of travelling north and finding treasure troves of gold hidden in rivers and streams in the mountains, gold that could be panned out in a process called placer mining. Placer mining is the simplest way to extract gold from raw sediment in water. Fast forward to 2016 and there’s a new kind of gold out there, buried deep inside mountains of unstructured raw data, that can be combined with your organization’s internal data to understand customer behavior on a much more granular level.
Sentiment analysis (the science and art of summarizing what customers are feeling about your brand or product) is one of the many ways unstructured data on the web, like Twitter feeds, can be combined with data stored in your warehouses, ERP or CRM systems to gain valuable insights. The nice part is that it doesn’t change the way you analyze data; it just makes it a whole lot more exciting, because it’s in real time and you can learn how your customers are reacting to your marketing and promotions in a timely fashion.
In this blog I am going to compare the results of an in-house sentiment analysis algorithm with some of the sentiment analysis services available in the market, such as IBM Watson’s sentiment analysis as a service via its AlchemyAPI on Bluemix.
The first step of real-time sentiment analysis is gathering data. In these examples we used Spark Streaming to capture tweets. Our in-house sentiment analyzer was developed in Scikit-learn, my favorite Python machine learning library, and uses Support Vector Machines, one of the most popular machine learning algorithms for text mining tasks.
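As an aside, the core of such a text-classification pipeline can be sketched in a few lines of scikit-learn. This is a minimal illustration, not our production code: the toy tweets, labels and pipeline settings below are stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled tweets standing in for the real training corpus.
train_texts = [
    "love this store great deal",
    "amazing shoes so happy",
    "terrible service never again",
    "worst experience very disappointed",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features (unigrams and bigrams) feeding a linear SVM.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["great shoes love them"]))
```

In practice the `CountVectorizer` step would be replaced by the custom feature extraction described below, but the fit/predict shape of the pipeline stays the same.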
The first step in building our algorithm was translating the tweets we streamed into a numerical representation that the Support Vector Machine could process. Deciding which features of the data are important (for example, length of the tweet or number of “positive” words) is called feature extraction; it also reduces the dimensionality of the data we want to analyze, making it far more efficient to process than a text file full of tweets. We used two approaches to improve the accuracy of our algorithm. The first was collecting as much data as possible from Twitter to train our model. We were very data focused during this project and collected over a gigabyte of labelled training data from an ensemble of online sources. It is important to explore this data: some stop words like “not” and “can’t” give strong clues about the sentiment of a tweet, and it’s important to reflect this in your feature extraction process. We also noticed a performance increase from removing @ symbols, brackets and URLs appearing in the tweet. The second way we improved accuracy was by modifying how we exposed the data to the machine learning algorithm (the feature extraction step mentioned previously).
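The cleanup step mentioned above (removing @ symbols, brackets and URLs) can be sketched with Python’s `re` module; the patterns below are illustrative rather than the exact ones from our pipeline.

```python
import re

def clean_tweet(text):
    """Strip the noise we found hurt accuracy: URLs, @-mentions, brackets."""
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)           # @-mentions
    text = re.sub(r"[\[\]()]", "", text)       # brackets
    return " ".join(text.split())              # collapse extra whitespace

print(clean_tweet("@fan check this out (wow) http://t.co/abc great deal"))
# -> check this out wow great deal
```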
We chose five major features that we found correlated well with the sentiment of short texts like tweets. Bi-grams (pairs of words that occur beside each other in the tweet) are also important for high performance. Features included the count of “positive” tokens in the tweet, based on a pre-computed database of scored words called the open-source MPQA Subjectivity Lexicon (each word is rated between 0 and 1, where 1 means very positive and 0 means very negative). It is also critical that the same feature extraction process be applied in exactly the same way to both the training data and the streamed data, so that the algorithm can learn the features and then apply what it learned. Determining which words in a tweet are positive depends on context: if a customer tweets “I just bought a pair of basketball shoes from #Sportchek wicked deal” then clearly the customer is expressing a positive sentiment, but the slang word “wicked” by itself would be interpreted as negative with no context and can throw off the algorithm. Other features we chose include the last “positive” token in the tweet, the strongest polarity among substrings of a hashtag, the number of negative words (weighted less than the number of positive words) and more sophisticated measures like the number of smiley faces in the tweet. These numbers are assembled into a numerical array that is fed to the algorithm in both the training and prediction phases; training requires one additional input, the label, which should be encoded in the least biased way possible. A unique thing about our algorithm is that we used two classifiers instead of one. The first classifier determines whether the text expresses sentiment at all and to what degree. The important part is that if a tweet carries no sentiment we can label it “neutral”, which lets us filter out fact-based tweets and significantly improves the accuracy of the algorithm.
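To make the feature extraction concrete, here is a toy sketch in plain Python. The tiny `LEXICON` dictionary stands in for the real MPQA Subjectivity Lexicon, and the feature set is a simplified subset of the features described above, not the exact production set.

```python
# Toy stand-in for the MPQA Subjectivity Lexicon: word -> score in [0, 1],
# where 1 is very positive and 0 is very negative.
LEXICON = {"great": 0.9, "wicked": 0.2, "deal": 0.7, "bad": 0.1, "love": 1.0}

SMILEYS = (":)", ":-)", ":D")

def features(tweet):
    """Turn one tweet into the kind of numeric vector an SVM can consume."""
    tokens = tweet.lower().split()
    scores = [LEXICON.get(t, 0.5) for t in tokens]   # 0.5 = unknown/neutral
    positives = sum(1 for s in scores if s > 0.5)    # count of positive tokens
    negatives = sum(1 for s in scores if s < 0.5)    # count of negative tokens
    hashtag_polarity = max(                          # strongest hashtag polarity
        [LEXICON.get(t.lstrip("#"), 0.5) for t in tokens if t.startswith("#")],
        default=0.5,
    )
    smileys = sum(tweet.count(s) for s in SMILEYS)   # smiley-face count
    return [positives, negatives, hashtag_polarity, smileys]

print(features("love this #deal :)"))
# -> [1, 0, 0.7, 1]
```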
Our results were very positive. We benchmarked our Python algorithm against TextBlob, another Python library for sentiment analysis, and summarized the results in the following table.
Our algorithm agreed with TextBlob (our benchmark) 86% of the time on predicting negative sentiment and 96% on neutral tweets.
However, our algorithm offered noticeable improvements, especially in minimizing false negatives. Drilling down into the table, we see a clear improvement on false negative tweets (tweets that appear negative but are actually expressing positive sentiment). This is significant because many false negatives result from slang. Here’s a simple example of a real tweet TextBlob labelled incorrectly.
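For reference, per-class agreement figures like the 86% and 96% above can be computed by lining the two sets of predictions up tweet by tweet; the labels in this sketch are made up for illustration.

```python
from collections import defaultdict

def per_class_agreement(benchmark, ours):
    """For each label the benchmark assigned, the fraction we matched."""
    total = defaultdict(int)
    matched = defaultdict(int)
    for b, o in zip(benchmark, ours):
        total[b] += 1
        if b == o:
            matched[b] += 1
    return {label: matched[label] / total[label] for label in total}

# Hypothetical predictions, tweet by tweet, for two classifiers.
textblob = ["negative", "negative", "neutral", "positive", "neutral"]
in_house = ["negative", "positive", "neutral", "positive", "neutral"]
print(per_class_agreement(textblob, in_house))
# -> {'negative': 0.5, 'neutral': 1.0, 'positive': 1.0}
```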
We also performed sentiment analysis using IBM Watson on Bluemix, which offers sentiment analysis as a cloud service through the AlchemyAPI; our goal was to compare Watson’s results against our in-house algorithm.
The AlchemyAPI can perform a wide variety of natural language processing tasks, including targeted sentiment analysis for situations where we want to know the sentiment of a tweet with respect to a specific entity or dimension. That is not necessarily the same as the sentiment of the tweet as a whole, especially for tweets that express mixed sentiment.
IBM Watson is accurate and reliable, but there are some downsides, one being pricing. You’re charged per API call, and the number of API calls you can make depends on which Bluemix account you purchase (an enterprise plan is available for under $200 per month). Depending on your needs, this may or may not constrain the amount of data you can analyze, and occasional downtime is possible.
The exact details of how the Watson algorithm does sentiment analysis are not publicly available; it’s a “black box”, and for the security minded this can be an issue. One of the advantages of our in-house sentiment analyzer is that we designed it for transparency and customizability. We chose SVMs not just because of their high accuracy on natural language processing tasks, but because many machine learning algorithms tend to be “black boxes”: they make decisions based on statistical inferences from the training phase that might not even be known to the designers themselves.
As expected, Bluemix was highly accurate, agreeing with our intuition when we visualized and drilled down into the data. One of the major pros of Bluemix is that it does real-time sentiment analysis (although combining this with internal data may not be real time) and is easy to set up and use. It took under 30 minutes to run the Alchemy demo from the IBM website, which involves installing Python and a MongoDB database to store the results.
We produced the following visualization searching for recent “Sportchek” tweets using IBM Watson.
The colors represent a statistical sample (the volume of data was too large to display in full) that is, statistically speaking, representative of the population of all tweets.
We used Excel’s rand() formula to generate our sample. The resulting table suggests our predictions were similar to Watson’s in aggregate, with approximately two-thirds of tweets being positive, although there was no correlation between our individual results and Watson’s; our sample was also considerably larger and broader, since we used Spark Streaming, which may explain the discrepancy.
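The same kind of uniform sample that the Excel rand() trick produces can be drawn in Python with the standard library’s `random` module; the tweet IDs here are placeholders, not real data.

```python
import random

random.seed(42)  # fixed seed so the sample is reproducible

# Placeholder population standing in for the IDs of all streamed tweets.
all_tweet_ids = list(range(10_000))

# A simple random sample without replacement: the Python analogue of
# sorting rows on a rand() column and taking the top k.
sample = random.sample(all_tweet_ids, k=500)
print(len(sample))
# -> 500
```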
We also explored some of the packages available in R for batch sentiment analysis. R is a statistical programming language, similar to Python, with a very active open-source development community. R is very popular among data scientists and is particularly good at visualization. Here is a word cloud created using a simple R script: we searched for the hashtag #Sportchek and displayed the top results.
We tried the following R packages: twitteR, tm, RColorBrewer and sentR. The last package provides a simple pre-trained Naïve Bayes classifier, but it could be swapped out for a custom in-house algorithm.
The Twitter API’s rate limits are a major bottleneck for any application that depends on it, and collecting enough data to train a model can be a real problem. If you’re using Watson you will be limited by both IBM’s and Twitter’s APIs and must agree to their policies and terms of service, which might influence your decision if you need a more robust enterprise solution. For production use cases I’d suggest the Twitter firehose.
The tm package in R is very popular among data scientists for doing text mining because of its corpus function which makes it trivial to transform text into a bag of words model that can be used as input into most machine learning algorithms available with R.
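For readers working in Python rather than R, the bag-of-words transform that tm’s corpus tooling performs has a simple standard-library analogue; this sketch shows the idea.

```python
from collections import Counter

def bag_of_words(docs):
    """Map each document to token counts over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted({token for c in counts for token in c})
    # One row per document, one column per vocabulary word.
    return vocab, [[c[word] for word in vocab] for c in counts]

vocab, matrix = bag_of_words(["great deal great", "bad deal"])
print(vocab)   # -> ['bad', 'deal', 'great']
print(matrix)  # -> [[0, 1, 2], [1, 1, 0]]
```

The resulting matrix is exactly the kind of numeric input most machine learning algorithms expect, in R or in Python.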
The code we wrote was under 50 lines and could easily be ported to Spark Streaming with SparkR.
We also tried a couple of open-source projects and proprietary solutions, with mixed results, including Rosette and Sentiment140, which are free for academics but charge for commercial use.
With all these different options available for doing sentiment analysis, it’s hard to keep track and find the best solution for your business needs, so we summarized our analysis in a table:
| | In-house (Python) | TextBlob | Watson on Bluemix |
|---|---|---|---|
| Accuracy | Very accurate | Accurate | Very accurate |
| Security | On-prem | On-prem | SaaS with SSL optional |
| Real-time | Yes, if used with Spark Streaming | No | Yes, if used with the Spark Streaming API |
| Comments | High learning curve | Benchmark | Easy installation |
I hope this blog has given you some insight into real-time sentiment analysis and the options available. There are pros and cons to choosing an in-house algorithm over something like Bluemix, which offers sentiment analysis as a cloud service, but ultimately your choice will depend on your business needs.