Twitter Sentiment Analysis

Problem Statement

The aim of this project is to classify tweets by polarity into three categories: positive, negative, or neutral.

Sentiment Analysis

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
Consumers can use sentiment analysis to research products and services before a purchase. Companies can use public opinion to gauge the acceptance of their products and the demand for them. Movie-goers can decide whether to watch a movie after reading other people’s reviews.

Goal

Input: A tweet in text form, downloaded through the Twitter API.
Output: A label that specifies the polarity (positive, negative, or neutral) of the given tweet.

Challenges

Usernames: Tweets mention usernames more often than not. They usually consist of letters and numbers and contribute little to sentiment classification, apart from increasing the size of the feature vector.
URLs: These are also not needed for our task.
Repeated letters: People often repeat letters in a word to stress a particular emotion, for example sad, saaaad, saaaddd. All of these mean the same thing, yet a classifier guided only by spelling treats them as different words.
Hashtags: A word inside a hashtag may be read differently from the same word without the hash sign.
Punctuation and additional spaces.

Approach/Flow

Download Tweets

The tweets are downloaded using the Twitter API.

Parsing

We have a total of 9684 tweets for training the algorithm and 8987 tweets for testing.
A few of these tweets are incomplete and therefore not useful.
We removed those tweets and updated the data set.
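
A minimal sketch of how incomplete tweets could be filtered out. The record layout (a label and the tweet text separated by a tab) and the definition of "incomplete" (a missing field, an unknown label, or empty text) are assumptions for illustration; the actual data set may be laid out differently.

```python
from typing import Iterable, List, Tuple

# Assumed label names; the data set may encode them differently.
VALID_LABELS = {"positive", "negative", "neutral"}

def parse_tweets(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (label, text) pairs, skipping records that look incomplete."""
    tweets = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # a field is missing or there are extra fields
        label, text = fields[0].strip().lower(), fields[1].strip()
        if label not in VALID_LABELS or not text:
            continue  # unknown label or empty tweet text
        tweets.append((label, text))
    return tweets
```

Only complete records survive, so the filtered list can be written back out as the updated data set.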

Tokenizing

We did not use the ARK Tokenizer for tokenizing the tweets.
Instead, we wrote our own tokenizer following the steps given in the paper "Sentiment Analysis of Twitter Data" by Apoorv Agarwal et al.
After parsing, the original tweets are processed by these steps and written to a separate file, which is used in the later stages.
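
To give a feel for what a tweet-aware tokenizer does, here is a small regex-based sketch. It is not the tokenizer from the paper or from this project; the pattern and the short emoticon set are assumptions chosen only to show that URLs, usernames, hashtags and emoticons are kept as single tokens.

```python
import re
from typing import List

# Illustrative pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                      # URLs
    | @\w+                            # @usernames
    | \#\w+                           # hashtags
    | [:;=][\-']?[)(\]\[DPp/\\|]      # a few common emoticons
    | \w+(?:'\w+)?                    # words, optionally with an apostrophe
    | [^\w\s]                         # any other single symbol
    """,
    re.VERBOSE,
)

def tokenize(tweet: str) -> List[str]:
    """Split a tweet into tokens, keeping Twitter-specific items intact."""
    return TOKEN_RE.findall(tweet)
```

For example, `tokenize("@bob I loooove this :) #happy")` keeps `@bob`, `:)` and `#happy` as single tokens instead of splitting them at the punctuation.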

Preprocessing

Steps involved in pre-processing:

Replacing emoticons with weights that preserve their polarity.
Replacing usernames and URLs with symbols such as ||T|| and ||U||.
Replacing runs of more than 3 consecutive repeated characters with exactly 3 occurrences.
Removing all stop words and punctuation.
Replacing each hashtag with the word it contains.
Replacing all abbreviations with their full forms.
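
The steps above can be sketched roughly as follows. The lookup tables are tiny hypothetical stand-ins, and the ||POS|| and ||NEG|| markers are an assumed substitute for the emoticon weights; only ||T|| and ||U|| come from the description above. The order of the steps is also an assumption.

```python
import re
from typing import List

# Hypothetical lookup tables, kept tiny for illustration.
EMOTICONS = {":)": "||POS||", ":-)": "||POS||", ":D": "||POS||",
             ":(": "||NEG||", ":-(": "||NEG||"}
ABBREVIATIONS = {"gr8": "great", "lol": "laughing out loud", "u": "you"}
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "this"}

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(.)\1{3,}")  # one character repeated 4 or more times
PLACEHOLDER_RE = re.compile(r"\|\|[A-Z]+\|\|")
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True)))

def preprocess(tweet: str) -> List[str]:
    """Apply the pre-processing steps and return the cleaned tokens."""
    text = URL_RE.sub(" ||U|| ", tweet)          # URLs -> ||U||
    text = USER_RE.sub(" ||T|| ", text)          # usernames -> ||T||
    text = EMOTICON_RE.sub(lambda m: " " + EMOTICONS[m.group(0)] + " ", text)
    text = HASHTAG_RE.sub(r"\1", text)           # "#happy" -> "happy"
    text = REPEAT_RE.sub(r"\1\1\1", text)        # "sooooo" -> "sooo"
    tokens = []
    for raw in text.split():
        if PLACEHOLDER_RE.fullmatch(raw):
            tokens.append(raw)                   # keep markers untouched
            continue
        word = re.sub(r"[^\w\s]", "", raw).lower()   # strip punctuation
        for part in ABBREVIATIONS.get(word, word).split():
            if part not in STOP_WORDS:
                tokens.append(part)
    return tokens
```

URLs and usernames are replaced first so that later steps, such as punctuation removal, do not break them apart.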

Feature Vectors

With the processed dataset, we created feature vectors.
The basic feature considered was unigrams.
A list of all unique unigrams (tokenized words) across the training set was constructed, and it formed the feature vector for every tweet.
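
A minimal sketch of this step, assuming each tweet has already been reduced to a list of tokens. Whether the vector stores counts or only presence is not stated above; counts are used here as an assumption, and the function names are hypothetical.

```python
from typing import Dict, List

def build_vocabulary(tokenized_tweets: List[List[str]]) -> Dict[str, int]:
    """Map every unique unigram in the training set to a vector position."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def unigram_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary unigram occurs in one tweet."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # unigrams unseen in training are ignored
            vec[idx] += 1
    return vec
```

Every tweet, in training or testing, is mapped onto the same vocabulary, so all vectors have the same length.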

Enhancing Feature Vectors

To improve the performance of the classifier, we added a few more features to the existing ones.
In addition to the unique unigrams, the following features were added:
- Count of hashtags.
- Bigrams.
- Trigrams.
- Count of emoticons.
- Count of negative words.
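
The extra features listed above could be computed along these lines. The negative-word list and the emoticon pattern are small hypothetical stand-ins, and counting hashtags and emoticons on the raw tweet (before pre-processing removes them) is an assumption.

```python
import re
from typing import Dict, List

# Hypothetical resources, kept tiny for illustration.
NEGATIVE_WORDS = {"bad", "sad", "hate", "not", "never"}
EMOTICON_RE = re.compile(r"[:;=][\-']?[)(\]\[DPp/\\|]")
HASHTAG_RE = re.compile(r"#\w+")

def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extra_features(raw_tweet: str, tokens: List[str]) -> Dict[str, object]:
    """Compute the additional features for one tweet."""
    return {
        "hashtag_count": len(HASHTAG_RE.findall(raw_tweet)),
        "emoticon_count": len(EMOTICON_RE.findall(raw_tweet)),
        "negative_word_count": sum(1 for t in tokens if t.lower() in NEGATIVE_WORDS),
        "bigrams": ngrams(tokens, 2),
        "trigrams": ngrams(tokens, 3),
    }
```

The bigrams and trigrams can be added to the vocabulary in the same way as unigrams, while the three counts each take one extra position in the feature vector.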

SVM

The feature vectors were written to a file in the format expected by the libsvm classifier.
An SVM classifier was trained with the training file as input.
The trained classifier is then used to predict the polarity of the tweets in the testing file.
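
libsvm reads one example per line in the sparse form `label index:value index:value ...`, with 1-based indices in ascending order and zero entries left out. A minimal sketch of writing one such line, assuming the numeric label ids below (the mapping is a hypothetical choice, not taken from this project):

```python
from typing import Dict, List

# Assumed mapping from polarity names to numeric class labels.
LABEL_IDS: Dict[str, int] = {"positive": 1, "negative": -1, "neutral": 0}

def to_libsvm_line(label: str, vector: List[float]) -> str:
    """Format one feature vector as a line in libsvm's sparse format."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(LABEL_IDS[label])] + pairs)
```

A file made of such lines can be passed to libsvm's svm-train tool to build the model, and a testing file in the same format can be passed to svm-predict together with that model.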
