Improving Projects — Part III: NLP

Qingchuan Lyu
5 min read · Jan 10, 2021

This post is part of a series on improving my technical projects. Each post describes how I identified suspicious spots and updated them. You can find Part I: Regressors (Price) and Part II: Time Series (Sales) here. If you want to discuss anything, please feel free to leave a comment or message me on LinkedIn.

Brief Background

This project extracts the support phrases behind English tweets’ sentiment labels. The training data includes 27,481 tweets with sentiment labels and selected texts (support phrases). The sentiments break down into 40% of tweets being neutral, 31% positive and 28% negative.

Updates (01/09/2021)

  • Investigate top common words, special characters, and the proportional length of selected texts
  • Remove some stop words, special characters and punctuation according to the investigation results
  • Compare Named-entity Recognition results trained on cleaned vs. raw data for each sentiment
  • Train a four-layer neural network with RoBERTa on raw data

Code: https://github.com/QingchuanLyu/Tweet-Sentiment-Extraction

Investigate and Clean Data

Previously, I hadn’t thought about cleaning the data: since my goal was to extract support phrases, retaining special characters and punctuation in the texts seemed like the way to achieve a higher Jaccard Index. This time, I decided to verify that assumption.
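
For context, the Jaccard Index throughout this post is the word-level overlap between a predicted support phrase and the actual one. A minimal sketch of the usual definition:

    def jaccard(str1, str2):
        # word-level Jaccard similarity: |A ∩ B| / |A ∪ B|
        a = set(str1.lower().split())
        b = set(str2.lower().split())
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))

Note that punctuation stays attached to its word after split(), so keeping or dropping it directly changes the score; hence the verification below.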

First, I checked whether all tweets were in English with a language detector (the langdetect library). It flagged 2,064 tweets as non-English, but on inspection they were just non-standard English, such as slang.
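
A minimal sketch of that check with langdetect (the tweets list name is hypothetical):

    from langdetect import detect, DetectorFactory

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

    def is_english(text):
        try:
            return detect(text) == "en"
        except Exception:  # empty or symbol-only tweets raise LangDetectException
            return False

    non_english = [t for t in tweets if not is_english(t)]  # "tweets": hypothetical list of strings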

Then, I checked the top ten common words in the support phrases of positive tweets. A bar graph showed that the top three were all stop words: “to,” “I” and “a.” To better understand the common words in each sentiment, I cleaned the data in three steps (a code sketch follows the list):

I. Removed a customized stop word list: I only removed stop words that couldn’t signal any sentiment. For example, I dropped “no,” “but” and “why” from NLTK’s stop word list because they can signal negative sentiment, and added “I” and “with” because they don’t signal any sentiment.

II. Stemming: chopped off affixes. I didn’t use a lemmatizer here, because a lemmatizer often needs a part-of-speech tag (“pos”) telling it whether a word is an adjective or a noun to perform accurately (the default is “noun”).

III. Removed some special characters and punctuation. I didn’t remove asterisks, because a repetition of * usually stands for a curse word. In fact, one tweet consisting of four asterisks was labeled “negative.” This made sense to me.
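
Here is a minimal sketch of the three steps above with NLTK; the exact word lists and regex are simplified for illustration:

    import re
    from nltk.corpus import stopwords  # requires nltk.download("stopwords") once
    from nltk.stem import PorterStemmer

    # (I) customized stop word list: keep sentiment-bearing words, drop neutral ones
    custom_stops = (set(stopwords.words("english")) - {"no", "but", "why"}) | {"i", "with"}
    stemmer = PorterStemmer()

    def clean(text):
        text = re.sub(r"[^\w\s*]", " ", text.lower())               # (III) drop punctuation, keep asterisks
        kept = [w for w in text.split() if w not in custom_stops]   # (I) drop neutral stop words
        return " ".join(stemmer.stem(w) for w in kept)              # (II) stem what's left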

With the cleaning steps above, I recomputed the top common words of the support phrases for each sentiment.

Interestingly, “love” is the most common word in positive tweets, “miss” in negative tweets, and “but” in neutral tweets.

Next, I checked the difference between the length of the full texts and the length of their support phrases (measured in word counts). A simple computation showed that over 90% of neutral tweets have a support phrase identical to the whole text, while the proportion was less than a quarter for positive and negative tweets. This is why I later trained separate models for each sentiment.
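
A quick way to compute this share with pandas, assuming the competition’s column names:

    import pandas as pd

    train = pd.read_csv("train.csv")  # assumed columns: text, selected_text, sentiment
    same = train["text"].str.strip() == train["selected_text"].str.strip()
    print(same.groupby(train["sentiment"]).mean())
    # neutral comes out above 0.9; positive and negative are each below 0.25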

Train data with Named-entity Recognition

Besides CNN and NER, another popular NLP technique is the bag-of-words approach. I didn’t use it because it is built on word frequency, and frequency counts are easily thrown off by a tweet with repetitive words, such as “I… I… I am ok,” where “I” doesn’t signal any sentiment.
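
To illustrate the problem (this is not part of my pipeline), a tiny scikit-learn bag-of-words count:

    from sklearn.feature_extraction.text import CountVectorizer

    # token_pattern keeps single-character tokens like "i", which the default pattern drops
    vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    counts = vec.fit_transform(["I... I... I am ok"])
    print(dict(zip(vec.get_feature_names(), counts.toarray()[0])))
    # {'am': 1, 'i': 3, 'ok': 1} -- "i" dominates despite carrying no sentiment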

The key idea of using NER here is to treat support phrases as “entities.” Since special characters and punctuation accounted for only a small share of the words in the texts, I wanted to see whether NER would recognize entities more often without this noise, which didn’t contribute to sentiment most of the time. It would also be interesting to see whether NER did better with stemmed words. Therefore, I trained NER on both the cleaned and the raw text data, with an 80/20 train-test split. The process of building and training NER with spaCy was standard.

Friendly reminder: remove leading and trailing spaces in texts for NER, since spaCy can skip entity spans that don’t align with token boundaries.

Each row of training data was stored as (text, {“entities”: [[start_position, end_position, ‘selected_text’]]}), where start_position/end_position marked the beginning and end of the support phrase.
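
With that format, a minimal sketch of the standard spaCy 2.x training loop (the epoch count and dropout are illustrative, and train_data holds rows in the format above):

    import random
    import spacy

    nlp = spacy.blank("en")            # start from a blank English pipeline
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("selected_text")     # support phrases are the only entity type

    optimizer = nlp.begin_training()
    for epoch in range(30):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)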

Training NER models for each sentiment gave a 0.68 average Jaccard Index with cleaned texts and 0.78 with raw texts. This showed that spaCy’s language class was able to handle punctuation and special characters by itself, which spaCy’s official documentation confirms. Another finding was that NER’s performance on the unseen 3K additional test tweets was not as good (a 0.60 Jaccard Index). This was not a complete shock: one known problem with NER is that it only works well on words it has seen in training data, so it requires a rich training set. How about a CNN with RoBERTa?

Train data with CNN embedded with RoBERTa

RoBERTa is a bi-directional encoder pre-training system that already handles special characters, stop words and punctuation on its own. Therefore, I trained a convolutional neural network embedded with RoBERTa on the raw data, one model per sentiment. The training process with Keras and TensorFlow was also standard.

The sentiment was encoded as part of each tweet’s tokenized ids, along with the total count of tweets in that sentiment category.
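
A minimal sketch of this kind of model with Hugging Face’s TFRobertaModel; the sequence length, conv heads and learning rate are illustrative rather than my exact architecture:

    import tensorflow as tf
    from transformers import TFRobertaModel

    MAX_LEN = 96  # illustrative sequence length

    def build_model():
        ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
        mask = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

        roberta = TFRobertaModel.from_pretrained("roberta-base")
        x = roberta(ids, attention_mask=mask)[0]  # (batch, MAX_LEN, hidden) token embeddings

        # small 1-D conv heads predict the start and end positions of the support phrase
        start = tf.keras.layers.Conv1D(1, 1)(tf.keras.layers.Dropout(0.1)(x))
        start = tf.keras.layers.Activation("softmax")(tf.keras.layers.Flatten()(start))
        end = tf.keras.layers.Conv1D(1, 1)(tf.keras.layers.Dropout(0.1)(x))
        end = tf.keras.layers.Activation("softmax")(tf.keras.layers.Flatten()(end))

        model = tf.keras.Model(inputs=[ids, mask], outputs=[start, end])
        model.compile(loss="categorical_crossentropy",
                      optimizer=tf.keras.optimizers.Adam(3e-5))
        return model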

The runtime of such a model was pretty long, an estimated 1 hour 48 minutes on my laptop, so I had to move the model to an online API with a GPU accelerator. This gave a 0.74 average Jaccard Index, only slightly lower than NER’s. However, the CNN embedded with RoBERTa did outperform NER on the unseen test data, with a 0.712 average Jaccard Index.

Further Thought

A few ideas I will try later in this project:

  • Try Time Series techniques with NLP models
  • Try Question-Answering models
  • Instead of training models to recognize support phrases, maybe try training models to identify words/phrases that don’t signal sentiment

Code: https://github.com/QingchuanLyu/Tweet-Sentiment-Extraction

Thanks for reading! 🤗
