As I was thinking about an appropriate topic for my first Medium post, an opportunity came along from a company (whose name cannot be disclosed) with a business problem. We were mailed a dataset, and on a Zoom call we were given instructions for solving the problem. This blog post discusses the Twitter sentiment analysis I performed on a humongous dataset (more than 2.8 million tweets) and my two-day journey through it. So let's begin!!
It was a typical Twitter-mined dataset with no label values, which makes our problem unsupervised. Here's a snapshot from the data, which is more than sufficient to understand everything about it.
Problem Statement and Focus
Given the above dataset, my main task was to follow a simplistic approach to predict whether a user's sentiment is positive or negative. I'll be focusing mainly on two attributes: "created_at" and "text". Also, I won't go deep into the coding details in this post and bore you, since the approach matters more than the code itself.
Time Series Analysis
Whenever I see a date/time column in my data, it cheers me up a little: after working on a few projects, I know you can get a good amount of insight from data that depends on date/time.
So I broke my "created_at" column into "hour" and "date" fields and created a new dataset with an additional "count" field (just a plain counter). Then I simply grouped my data by the hour field to see how many tweets fall in each hour, and got these curves.
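To sketch that step, here is roughly what the hour/date split and the per-hour counts look like in pandas (the four timestamps are hypothetical stand-ins for the real 2.8-million-row column):

```python
import pandas as pd

# Hypothetical sample of the "created_at" column
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-03-01 15:10:00", "2017-03-01 15:45:00",
        "2017-03-01 21:05:00", "2017-03-02 03:30:00",
    ])
})

# Break the timestamp into "date" and "hour" fields
df["date"] = df["created_at"].dt.date
df["hour"] = df["created_at"].dt.hour

# Tweets per hour (the "count" field), ready to plot as a curve
tweets_per_hour = df.groupby("hour").size().rename("count")
```

Plotting `tweets_per_hour` over the full dataset is what produces the curves below.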
This shows a nice sinusoidal pattern of Twitter posts across the hours of the day. One can easily see from this curve that the posting frequency peaks between 15 and 22 hrs. I also exported the data to CSV and tried to get some insights with Tableau, but didn't find much! Here's the table, which clearly shows that most of the tweets in our data are from 2017.
This was the section that required the most computation and effort. Note that I won't go deep into the implementation details of any model or vector representation. Let's begin then!!
I followed the standard approach of removing stop-words first. I didn't use any fancy stop-words list, just the standard NLTK list, which was good enough. Then I initialized the stemmer I've been using for a while now, the "SnowballStemmer". You could also go with the "PorterStemmer", but "SnowballStemmer" works fine too. Finally, to store the lemmas of our sentences, I used the "WordNetLemmatizer".
We always have special symbols, numbers, http/https links, etc. in our text data, which are nothing but garbage for our model. It's better to remove them, and nothing beats Python's regex at getting this job done in no time. You can write complex regex patterns, but here as well I kept it simple and initialized it as:
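The post's exact pattern isn't shown here, so the following is an assumed, similarly simple version: strip URLs first, then drop every non-alphabetic character, then normalize whitespace and case:

```python
import re

# Assumed patterns -- simple, in the spirit of the post
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http/https links
NON_ALPHA_RE = re.compile(r"[^a-zA-Z\s]")       # symbols, digits, etc.

def clean_tweet(text):
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

clean_tweet("Loved it!!! 10/10 https://t.co/abc #win")
```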
I also want to highlight one extra task: handling the emoticons in our Twitter feeds, which is very important. Most people remove emoticons just like stop-words, but I think emoticons are an essential source of information and should be converted to normal text and used. I didn't deal with emoticons for this problem due to time and resource constraints. Finally, I ended my pre-processing by removing HTML tags from the text and doing a spell check with the "spellchecker" package. Spell checking is very important for tweets, as people have a habit of tweeting in annoyingly weird ways (e.g. "Thnks 4 ur rply"); if it's skipped, the model's input makes little sense. With all that garbage removed from our data, the total number of unique words in our corpus was down to only 397,761!!
With all these words, we created our vector vocabulary with gensim's Word2Vec model and saved it for later. This saved model will be used to train K-Means.
A completely different approach that I also opted for was TF-IDF, which focuses more on the important words in the corpus. Here I created two vectorizers: one with default parameters and the other with 2-grams.
The data used here was the cleansed data (processed with all the techniques we discussed in the "Word2Vec" section). This vector representation of the data was again kept aside for the K-Means clustering model.
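The two vectorizers can be sketched with scikit-learn as follows (the three documents are made up, and I'm assuming "with 2-grams" means adding bigrams on top of unigrams via `ngram_range=(1, 2)`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "flight delayed angry",
    "great service happy",
    "delayed again terrible",
]

# Vectorizer 1: default parameters (unigrams only)
tfidf_uni = TfidfVectorizer()
X_uni = tfidf_uni.fit_transform(docs)

# Vectorizer 2: unigrams plus 2-grams
tfidf_bi = TfidfVectorizer(ngram_range=(1, 2))
X_bi = tfidf_bi.fit_transform(docs)
```

Adding bigrams grows the vocabulary, so the second matrix has more columns than the first.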
Before jumping right into the model, I wanted to play around with the words in our corpus. I took the "Word2Vec" vocabulary and created word clouds with the "wordcloud" library. Here are a couple of them:
K-Means Clustering Algorithm
We did all the above hard work just to get this boy running and give us the best results. Before moving on to what I did, I'd like to give some honorable mentions to other techniques you could follow: "DBSCAN", or Transformer-based approaches such as "BERT" and "ALBERT" by Google AI, which are currently state-of-the-art for NLP problems. But I wanted to keep things extremely simple and focus on people who've just started to get their hands dirty with NLP.
Okay, so "K-Means" is a very simple centroid-based clustering algorithm. I used "K-Means++", which ensures smart initialization of the centroids; without it, the model can be very sensitive to the initial centroid placement.
I trained the model on three vector representations of my corpus (one from "Word2Vec" and two from the "TF-IDF" vectorizers). This gave me pretty nice results in figuring out whether a word in the corpus depicts a positive or a negative sentiment.
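A minimal sketch of that clustering step with scikit-learn, where random points stand in for the real embeddings and `n_clusters=2` assumes the two sentiment groups (positive/negative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "embeddings": two well-separated groups of 5-d points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, (20, 5)),   # stand-in "negative" vectors
    rng.normal(3.0, 0.1, (20, 5)),   # stand-in "positive" vectors
])

# k-means++ initialization (scikit-learn's default) avoids the
# centroid-sensitivity of purely random starts
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
```

On real Word2Vec or TF-IDF vectors, the resulting cluster labels are what get interpreted as positive vs. negative.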
There are some things that could've led to even better model results. More rigorous "Word2Vec" tuning would've improved the predictions, and running the K-Means model for a larger number of iterations might have produced a better fit. In short, I could've played with the hyperparameters at every stage of the project, which would definitely have improved the overall performance. I encourage you to look at "grid search" and other advanced tuning methods such as "hyperopt" and "hyperas".