Data science, in a crude sense, is about making sense of raw data of any kind, from numbers to text. Data acquisition, data preprocessing and data visualisation are some of its major parts. In this information age, with an ever-growing flow of information and data, it has become a necessity to handle that flow and draw useful insights and inferences from it.
Natural Language Processing (NLP) is one application of data science. NLP, in layman’s terms, is the process of manipulating textual data. It’ll make a lot more sense once we take a practical example, so in this article we’ll look at one of the most widely used applications of NLP: sentiment analysis. We’ll take one of Reddit’s subreddits and analyse the sentiment of the posts in it.
Before we go further, a few prerequisites:
- Python programming (beginner or intermediate)
- Installation and basic understanding of the nltk package.
Step 1: Starting with the basic imports of packages
- IPython’s display module lets us control the clearing of printed output inside loops.
- The pprint module lets us pretty-print JSON and lists.
- matplotlib and seaborn let us visualise the data and style the plots, respectively.
- praw is the Reddit API wrapper, which lets us loop through the posts in a subreddit.
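The imports listed above might look like the following sketch. The environment check at the top is an addition of mine rather than part of the walkthrough: it reports missing packages instead of failing on the first import. The seaborn style name is also just an assumption.

```python
import importlib.util
from pprint import pprint  # standard library: readable printing of JSON and lists

# Third-party packages named in the list above.
REQUIRED = ["IPython", "matplotlib", "seaborn", "praw"]


def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install these packages first:", ", ".join(missing))
else:
    from IPython import display      # clear printed output inside loops
    import matplotlib.pyplot as plt  # plotting
    import seaborn as sns            # plot styling
    import praw                      # Reddit API wrapper

    sns.set(style="darkgrid")  # assumed style; pick whichever you prefer
```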
To use the API wrapper and pull data from a subreddit, we first have to create what’s called a ‘developer account’.
- Log into your Reddit account
- Navigate to https://www.reddit.com/prefs/apps/
- Click on the button that says “are you a developer? create an app…”
1. Enter a name (username works)
2. Select “script”
3. Use the localhost address of your Jupyter notebook as the redirect URL
- Once you click “create app”, you’ll see where your Client ID and Client Secret are.
Alright, once that’s done, we have to create a Reddit client.
Fill in your own client_id and client_secret between the quotation marks (yes, keep the quotation marks).
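A minimal sketch of creating the client with praw. The helper name `make_reddit_client` and the placeholder strings are hypothetical. Note that praw also expects a `user_agent` string, which isn't mentioned above; any short descriptive text (your username works) should do.

```python
def make_reddit_client(client_id, client_secret, user_agent):
    """Create a read-only praw.Reddit client from your app credentials."""
    if not (client_id and client_secret and user_agent):
        raise ValueError("client_id, client_secret and user_agent must all be non-empty")
    # Imported here so the check above still works on a machine without praw.
    import praw
    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


# Usage (placeholders, replace with your own values):
# reddit = make_reddit_client("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "YOUR_USERNAME")
```

Keeping the credentials as function arguments also makes it easy to read them from environment variables later instead of pasting them into the notebook.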
We’re going to store the headlines in a set(), so that running the code multiple times doesn’t give us duplicates. The loop then iterates over the “new” posts in the chosen subreddit (I have chosen /r/politics, but you may choose some other topic), and by setting the limit to None we can get up to 1000 headlines. When I ran it there were 974 new ones.
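Here is a sketch of that loop. The function name `collect_headlines` and the variable `reddit` are assumptions. The stand-in objects at the bottom exist only so the snippet can run without network access.

```python
from types import SimpleNamespace


def collect_headlines(submissions, headlines=None):
    """Add the title of every submission to a set and return the set."""
    headlines = set() if headlines is None else headlines
    for submission in submissions:
        headlines.add(submission.title)
    return headlines


# With a real client (assumed to be called `reddit`):
# headlines = collect_headlines(reddit.subreddit("politics").new(limit=None))

# Offline demonstration with stand-in objects that only have a `title`:
fake_posts = [
    SimpleNamespace(title="A"),
    SimpleNamespace(title="B"),
    SimpleNamespace(title="A"),  # duplicate, ignored by the set
]
demo = collect_headlines(fake_posts)
print(len(demo))  # 2
```

Passing the existing set back in as `headlines` lets you re-run the collection later without losing or duplicating what you already have.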
The simplicity of the tasks up to this point is thanks to the praw package. It handles a lot of work in the background, such as rate limiting and organizing the JSON responses, and gives us a really simple interface.
For the curious minds: there are tricks for obtaining more than 1000 headlines, which I’ll leave for you to explore. Hint: learn about the streaming implementation.
Step 2: Labeling our Data
NLTK’s built-in Vader Sentiment Analyzer will simply rank a piece of text as positive, negative or neutral using a lexicon of positive and negative words.
We can utilize this tool by first creating a Sentiment Intensity Analyzer (SIA) to categorize our headlines, then we’ll use the polarity_scores method to get the sentiment.
We’ll append each sentiment dictionary to a results list, which we’ll transform into a dataframe.
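The labeling step could be sketched as follows. The function names are hypothetical, and the scorer is passed in as an argument so the loop can be tried without NLTK; with NLTK installed and the `vader_lexicon` data downloaded, `vader_polarity_scores()` supplies the real one.

```python
def score_headlines(headlines, polarity_scores):
    """Build one dict per headline: the sentiment scores plus the headline text."""
    results = []
    for line in headlines:
        scores = dict(polarity_scores(line))  # copy the scorer's dict before adding to it
        scores["headline"] = line
        results.append(scores)
    return results


def vader_polarity_scores():
    """Return the polarity_scores method of NLTK's VADER analyzer."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores


# Usage, assuming nltk.download("vader_lexicon") has been run once
# and `headlines` is the set collected earlier:
# results = score_headlines(headlines, vader_polarity_scores())
# import pandas as pd
# df = pd.DataFrame.from_records(results)
```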
Along with the ‘headline’, there are four other labels in the output: ‘compound’, ‘neg’, ‘neu’ and ‘pos’. The compound label represents the overall sentiment on a scale from -1 (extremely negative) to 1 (extremely positive). The other three give the proportion of the headline that falls into each category.
We will consider posts with a compound value greater than 0.2 as positive and less than -0.2 as negative. There’s some testing and experimentation that goes into choosing these thresholds, and there is a trade-off to be made here. If you choose a higher threshold, you might get cleaner results (fewer false positives and false negatives), but the number of labeled headlines will decrease significantly.
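The thresholding rule can be written as a small function. Encoding the labels as 1, -1 and 0 is an assumption made for this sketch, as is the function name.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a compound score to 1 (positive), -1 (negative) or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0


print(label_from_compound(0.6))    # 1
print(label_from_compound(-0.35))  # -1
print(label_from_compound(0.1))    # 0

# With the dataframe of scores (assumed to be called `df`):
# df["label"] = df["compound"].apply(label_from_compound)
```

Keeping the threshold as a parameter makes it easy to experiment with the trade-off: try a few values and see how many headlines end up neutral.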
Step 3: Stats
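A minimal sketch of the statistic this step looks at: the share of headlines carrying each label. It assumes labels are encoded as 1 (positive), 0 (neutral) and -1 (negative), which is my own convention here. With a pandas dataframe, `df.label.value_counts(normalize=True) * 100` should give the same numbers.

```python
from collections import Counter


def label_shares(labels):
    """Return the percentage of items carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up labels, for illustration only:
print(label_shares([1, 0, 0, -1]))  # {1: 25.0, 0: 50.0, -1: 25.0}
```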
The large number of neutral headlines is due to two main reasons:
- The assumption that we made earlier, where headlines with a compound value between -0.2 and 0.2 are considered neutral. The wider the margin, the larger the number of neutral headlines.
- We used a general lexicon to categorize political news. The more correct way is to use a politics-specific lexicon, but for that we would either need a human to manually label data, or we would need to find a custom lexicon already made.
Another interesting observation is the number of negative headlines, which could be attributed to the media’s behavior, such as the exaggeration of titles for clickbait. Another possibility is that our analyzer wrongly labeled a lot of headlines as negative.
There are definitely areas to explore for improvements, but let’s move on for now.
Step 4: Tokenizers and Stopwords
Now that we’ve gathered and labeled the data, let’s talk about some of the basics of preprocessing data to help us get a clearer understanding of our dataset.
First of all, let’s talk about tokenizers. Tokenization is the process of breaking a stream of text up into meaningful elements called tokens. You can tokenize a paragraph into sentences, a sentence into words and so on.
In our case, we have headlines, which can be considered sentences, so we will use a word tokenizer:
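As a rough illustration of word tokenization, here is a stand-in built on Python’s `re` module rather than NLTK itself, so it runs without any extra data. The function name is hypothetical. NLTK’s `RegexpTokenizer` with the same pattern should behave much the same way.

```python
import re


def simple_word_tokenize(text):
    """Split text into runs of letters, digits and underscores, dropping punctuation."""
    return re.findall(r"\w+", text)


print(simple_word_tokenize("Senate passes bill, 52-48!"))
# ['Senate', 'passes', 'bill', '52', '48']

# NLTK version, assuming nltk is installed:
# from nltk.tokenize import RegexpTokenizer
# tokens = RegexpTokenizer(r"\w+").tokenize("Senate passes bill, 52-48!")
```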
Next come stopwords. NLTK ships a simple English stopword list that contains most of the common filler words, which just add to our data size without adding information. Further down the line, you’ll most likely use a more advanced stopword list that’s ideal for your use case, but NLTK’s is a good start.
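A sketch of loading the list and filtering tokens with it. The function names are hypothetical, and the tiny hand-made list in the demonstration is only there so the filtering can be tried without downloading NLTK’s data.

```python
def load_english_stopwords():
    """Load NLTK's English stopword list (needs nltk.download("stopwords") once)."""
    from nltk.corpus import stopwords
    return set(stopwords.words("english"))


def remove_stopwords(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stop_words]


# Tiny hand-made list, for illustration only:
demo_stop_words = {"the", "a", "of", "to"}
print(remove_stopwords(["The", "state", "of", "the", "union"], demo_stop_words))
# ['state', 'union']

# With NLTK's list:
# stop_words = load_english_stopwords()
```

Storing the stopwords in a set keeps the membership test fast when there are many tokens.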
We can grab all of the positive label headlines from our dataframe, hand them over to a function that tokenizes them and removes the stopwords, then call NLTK’s `FreqDist` function to get the most common words in the positive headlines:
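One way this could look, as a sketch: `process_text` and `most_common_words` are hypothetical names, and `collections.Counter` stands in for `FreqDist` (which is itself a `Counter` subclass) so the snippet runs without NLTK.

```python
import re
from collections import Counter


def process_text(headlines, stop_words):
    """Lowercase and tokenize every headline, dropping stopwords."""
    tokens = []
    for line in headlines:
        words = re.findall(r"\w+", line.lower())
        tokens.extend(w for w in words if w not in stop_words)
    return tokens


def most_common_words(headlines, stop_words, n=20):
    """Return the n most frequent (word, count) pairs."""
    return Counter(process_text(headlines, stop_words)).most_common(n)


# With the labeled dataframe (assumed columns `label` and `headline`):
# pos_lines = list(df[df.label == 1].headline)
# print(most_common_words(pos_lines, stop_words))
# NLTK equivalent of the counting step:
# nltk.FreqDist(process_text(pos_lines, stop_words)).most_common(20)
```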
That covers the positive words; for negative words the code is exactly the same, just with the appropriate variables swapped in. You can come up with more variables and inferences of your own.
This was a quick and easy dive into the application of NLP for Sentiment Analysis. Sentiment Analysis is a vast universe of its own, and I encourage you to explore the domain yourself. I’ll be posting more on NLP and other Data Science topics. The complete code of this project is on my GitHub.