Twitter Sentiment Analysis
The aim of this project is to classify tweets by polarity into three categories: positive, negative, or neutral.
- Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.
- Consumers can use sentiment analysis to research products and services before a purchase. Production companies can use public opinion to gauge the acceptance of, and demand for, their products. Movie-goers can decide whether to watch a movie after going through other people’s reviews.
- Input: a tweet in textual format, downloaded using the Twitter API.
- Output: a label that specifies the polarity (positive, negative, or neutral) of the given tweet.
- Usernames are mentioned more often than not. They usually consist of letters and numbers and contribute little to sentiment classification, other than increasing the size of the feature vector.
- URLs are likewise not useful for our task.
- Repeated letters: People often repeat letters in a word to stress a particular emotion, for example sad, saaaad, saaaddd. All of these mean the same thing, yet a classifier guided only by spelling treats them as different words.
- Hashtags: A word in a hashtag may be read differently from the same word without the hash sign.
- Punctuation and additional spaces.
The tweets are downloaded using the Twitter API.
- We have a total of 9684 tweets for training the algorithm and 8987 tweets for testing.
- A few of these tweets are incomplete and therefore not useful.
- We removed those tweets and updated the data set.
- We did not use the ARK Tokenizer to tokenize the tweets.
- Instead, we wrote our own tokenizer, following the steps given in the paper "Sentiment Analysis of Twitter Data" by Apoorv Agarwal et al.
- After parsing, the original tweets are run through the pre-processing steps and written to a separate file, which is used in the later stages.
Steps involved in pre-processing:
- Replacing emoticons with weights without disturbing their polarity.
- Replacing usernames and URLs with symbols such as ||T|| and ||U||.
- Reducing any run of more than 3 consecutive repeated characters in a word to exactly 3 occurrences.
- Removing all the stop words and punctuation.
- Replacing each hashtag with the word it contains.
- Replacing all the abbreviations with their full-forms.
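
The steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual tokenizer: the emoticon table, abbreviation dictionary, stop-word list, the `||POS||`/`||NEG||` placeholders standing in for emoticon weights, and the function name `preprocess` are all assumptions made for the example.

```python
import re

# Hypothetical lookup tables: the project's real emoticon weights,
# abbreviation dictionary, and stop-word list are not given in the text.
EMOTICONS = {":)": "||POS||", ":(": "||NEG||"}
ABBREVIATIONS = {"gr8": "great", "u": "you"}
STOP_WORDS = {"a", "an", "the", "is", "to"}

def preprocess(tweet):
    """Return the list of cleaned tokens for one raw tweet."""
    tokens = []
    for tok in tweet.split():  # split() also discards extra spaces
        if tok in EMOTICONS:
            tokens.append(EMOTICONS[tok])
            continue
        if tok.startswith("@") and len(tok) > 1:
            tokens.append("||T||")  # username placeholder
            continue
        if re.match(r"(https?://|www\.)", tok):
            tokens.append("||U||")  # URL placeholder
            continue
        tok = tok.lower().lstrip("#")  # hashtag -> the word it contains
        # Runs of more than 3 identical characters are cut down to 3.
        tok = re.sub(r"(.)\1{3,}", r"\1\1\1", tok)
        tok = re.sub(r"[^\w\s]", "", tok)  # strip punctuation
        tok = ABBREVIATIONS.get(tok, tok)  # expand abbreviations
        if tok and tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens

print(preprocess("@bob this is soooooo gr8 :) #Fun http://t.co/x !!!"))
# ['||T||', 'this', 'sooo', 'great', '||POS||', 'fun', '||U||']
```

Handling emoticons, usernames, and URLs before lowercasing and punctuation stripping keeps the placeholder symbols intact.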
- With the processed dataset, we created feature vectors.
- The basic features considered were unigrams.
- A list of all unique unigrams (tokenized words) across the training set was constructed, and it formed the feature vector for all the tweets.
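
A unigram feature vector of this kind might be built as in the sketch below. The text does not say whether the vector holds presence flags or counts, so the 0/1 encoding here is an assumption, as are the helper names `build_vocabulary` and `unigram_vector`.

```python
def build_vocabulary(tokenized_tweets):
    """Map each unique unigram in the training set to a fixed index."""
    vocab = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def unigram_vector(tokens, vocab):
    """Presence (0/1) vector over the vocabulary; unseen words are ignored."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] = 1
    return vec

train = [["good", "movie"], ["bad", "movie"]]
vocab = build_vocabulary(train)
print(vocab)                                              # {'good': 0, 'movie': 1, 'bad': 2}
print(unigram_vector(["bad", "unseen", "movie"], vocab))  # [0, 1, 1]
```

Because the vocabulary is built from the training set only, words that appear only in the test tweets contribute nothing to the vector.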
Enhancing Feature Vectors
- To improve the performance of the algorithm, we added a few more features to the existing ones.
- In addition to the unique unigrams, the following features were added:
- Count of hashtags.
- Count of emoticons.
- Count of Negative words.
- The feature vector was written to a file in the format expected by the libsvm classifier.
- An SVM classifier was trained with the training file as input.
- That classifier was then used to predict the polarity of the tweets in the testing file.
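
As a rough sketch of the extra count features and the libsvm input format: each line of a libsvm data file has the form `<label> <index>:<value> ...`, with 1-based feature indices in ascending order, and zero-valued features may be omitted. The negative-word list, emoticon set, label encoding, and function names below are assumptions for illustration, not the project's actual choices.

```python
# Hypothetical resources: the project's negative-word list, emoticon list,
# and numeric label encoding are not given in the text.
NEGATIVE_WORDS = {"not", "never", "bad", "hate"}
EMOTICONS = {":)", ":(", ":D"}
LABELS = {"positive": 1, "negative": -1, "neutral": 0}

def extra_features(raw_tweet):
    """Count hashtags, emoticons, and negative words in a raw tweet."""
    tokens = raw_tweet.split()
    hashtags = sum(1 for t in tokens if t.startswith("#") and len(t) > 1)
    emoticons = sum(1 for t in tokens if t in EMOTICONS)
    negatives = sum(
        1 for t in tokens if t.lower().strip(".,!?") in NEGATIVE_WORDS
    )
    return [hashtags, emoticons, negatives]

def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based indices)."""
    pairs = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(LABELS[label])] + pairs)

unigrams = [0, 1, 1]  # e.g. a presence vector over a 3-word vocabulary
counts = extra_features("I hate this #movie :( not #good")
print(counts)                                       # [2, 1, 2]
print(to_libsvm_line("negative", unigrams + counts))  # -1 2:1 3:1 4:2 5:1 6:2
```

Appending the counts after the unigram entries keeps the unigram indices stable, so the same vocabulary can be reused when writing the testing file.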
Information Retrieval and Extraction Course
Natural Language Processing
Support Vector Machine