Fictometer: A simple and explainable algorithm for sentiment analysis

Kushal Shah
Mar 26, 2022

Our ability to solve NLP problems using AI algorithms has advanced a lot, but our understanding of human language has yet to catch up. Most AI algorithms operate like a black box: even though we know the model architecture and all the parameter values, figuring out which underlying features they use for a given task is very challenging. For many practical applications, however, the AI needs to be explainable; otherwise these fancy algorithms remain an academic exercise.

In 2019, we found a very simple algorithm for text classification that can identify whether a given piece of text belongs to the fiction or non-fiction genre. We call it the Fictometer! If you take an excerpt from your favourite novel and one from your favourite research paper, Fictometer will be able to tell the difference between the two. And it does so with amazingly high accuracy (~96%)!

Of course, other algorithms can perform the same task. What is special about our work is that we identified two very simple features that are enough to achieve this accuracy: the adjective/pronoun ratio and the adverb/adjective ratio of the given document.

In the plot, the circles represent non-fiction texts and the crosses represent fiction. The samples are taken from the Brown corpus. The two genres are clearly separated on this graph, which makes classification very easy.

Let's see how to implement this algorithm using the Brown corpus data from NLTK. The code has two parts: the first prepares the data from the NLTK Brown corpus, and the second applies Logistic Regression.

Data Preparation using NLTK Brown Corpus

First come the libraries we need to import for the data. The Brown corpus is available through NLTK.
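A minimal setup sketch, assuming NLTK is installed and that the corpus data still needs to be fetched. The variable names `brown_ok` and `tagset_ok` are illustrative; the `universal_tagset` resource is included because it is what lets NLTK map the Brown tags to universal POS tags.

```python
import nltk

# Fetch the Brown corpus and the mapping to the universal tagset.
# Each call returns True on success and False if the download failed.
brown_ok = nltk.download("brown", quiet=True)
tagset_ok = nltk.download("universal_tagset", quiet=True)

# The corpus object is loaded lazily, on first access.
from nltk.corpus import brown

print("Brown corpus ready:", brown_ok and tagset_ok)
```

Once both resources are present, `brown.categories()` and `brown.fileids()` list the available categories and files.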

The Fictometer algorithm is essentially based on Parts-Of-Speech (POS) tagging, which is a fundamental aspect of NLP. POS tagging can be done in various ways for a given text, but broadly speaking we can either use universal POS (UPOS) tags (noun, adjective, adverb, pronoun, etc.) or finer tags that differentiate between various types of nouns, adjectives, etc. For our task the UPOS tags are good enough, so next we write a function to count the number of occurrences of each UPOS tag in a given text.
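One way such a counting function could look is sketched below. It assumes the text has already been tagged, i.e. that it arrives as (word, tag) pairs using the 12 tags of NLTK's "universal" tagset. The names `UPOS_TAGS` and `count_upos_tags` are illustrative, and the sample sentence is tagged by hand.

```python
from collections import Counter

# The 12 tags of NLTK's "universal" tagset.
UPOS_TAGS = ("ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
             "NUM", "PRON", "PRT", "VERB", ".", "X")


def count_upos_tags(tagged_words):
    """Return {tag: count} for an iterable of (word, tag) pairs."""
    counts = Counter(tag for _word, tag in tagged_words)
    # Report every tag, so that absent ones show up as 0.
    return {tag: counts.get(tag, 0) for tag in UPOS_TAGS}


# Hand-tagged toy sentence, for illustration only.
sample = [("She", "PRON"), ("quickly", "ADV"), ("read", "VERB"),
          ("the", "DET"), ("long", "ADJ"), ("report", "NOUN"), (".", ".")]
counts = count_upos_tags(sample)
print(counts["PRON"], counts["ADJ"], counts["ADV"])
```

With the corpus data installed, the same function should accept the output of `brown.tagged_words(fileids=file_id, tagset="universal")` directly.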

Next we start reading text from the NLTK Brown corpus and create a DataFrame which contains information about the number of different POS tags for each text in the corpus.
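The sketch below shows one possible shape for this step, assuming pandas is available. `build_tag_table` and `brown_documents` are hypothetical helper names: the first turns (file id, category, tagged words) triples into one row per text, the second yields those triples from the Brown corpus.

```python
from collections import Counter

import pandas as pd

UPOS_TAGS = ["ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
             "NUM", "PRON", "PRT", "VERB", ".", "X"]


def build_tag_table(documents):
    """Build one row of UPOS counts per (file_id, category, tagged_words)."""
    rows = []
    for file_id, category, tagged_words in documents:
        counts = Counter(tag for _word, tag in tagged_words)
        row = {"file_id": file_id, "category": category}
        # A Counter returns 0 for tags that never occur in the text.
        row.update({tag: counts[tag] for tag in UPOS_TAGS})
        rows.append(row)
    return pd.DataFrame(rows)


def brown_documents():
    """Yield one triple per Brown file (requires the NLTK corpus data)."""
    # Imported here so that build_tag_table works without NLTK installed.
    from nltk.corpus import brown
    for file_id in brown.fileids():
        category = brown.categories(fileids=file_id)[0]
        words = brown.tagged_words(fileids=file_id, tagset="universal")
        yield file_id, category, words


# Hand-tagged toy documents, for illustration only.
toy_documents = [
    ("f1", "news", [("The", "DET"), ("big", "ADJ"), ("news", "NOUN")]),
    ("f2", "romance", [("She", "PRON"), ("smiled", "VERB")]),
]
table = build_tag_table(toy_documents)
print(table[["file_id", "category", "ADJ", "PRON"]])
```

For the real corpus, `build_tag_table(brown_documents())` would produce one row per Brown file.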

Once we have all the UPOS tag information in our DataFrame, we need to calculate the two ratios that are our model features: adjective/pronoun and adverb/adjective.
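A small sketch of the ratio computation, assuming a DataFrame with `ADJ`, `ADV` and `PRON` count columns. The function name and the output column names `adj_pron_ratio` and `adv_adj_ratio` are illustrative, and treating a zero denominator as a missing value is a choice made for this example.

```python
import pandas as pd


def add_ratio_features(table):
    """Return a copy of table with the two ratio features added."""
    out = table.copy()
    # where() keeps positive denominators and turns zeros into NaN,
    # so a text without pronouns or adjectives gives NaN, not infinity.
    out["adj_pron_ratio"] = out["ADJ"] / out["PRON"].where(out["PRON"] > 0)
    out["adv_adj_ratio"] = out["ADV"] / out["ADJ"].where(out["ADJ"] > 0)
    return out


# Made-up counts, for illustration only.
toy_counts = pd.DataFrame({"ADJ": [80, 40], "ADV": [30, 50], "PRON": [20, 100]})
features = add_ratio_features(toy_counts)
print(features[["adj_pron_ratio", "adv_adj_ratio"]])
```

In the toy data, the first row has four adjectives per pronoun and the second has fewer adjectives than pronouns, so the two rows end up far apart on the first feature.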

The Brown corpus has several sub-categories and so we need to identify each of them as “fiction” or “non-fiction” depending on its contents.
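The labelling could be sketched as a simple lookup. The split below is an assumption made for illustration, not necessarily the one used in the original work: the five narrative genres are treated as fiction, the informative categories as non-fiction, and "humor" is left unlabelled.

```python
# Assumed split of the Brown categories (illustrative, see the note above).
FICTION = {"adventure", "fiction", "mystery", "romance", "science_fiction"}
NON_FICTION = {"belles_lettres", "editorial", "government", "hobbies",
               "learned", "lore", "news", "religion", "reviews"}


def genre_label(category):
    """Return 1 for fiction, 0 for non-fiction, None if left unlabelled."""
    if category in FICTION:
        return 1
    if category in NON_FICTION:
        return 0
    return None  # e.g. "humor", which this sketch does not assign


for name in ("romance", "news", "humor"):
    print(name, genre_label(name))
```

On a DataFrame with a `category` column, `table["category"].map(genre_label)` would give the label column, and rows with a missing label can then be dropped.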

Training and Testing using Logistic Regression

Great! Our data is now ready for training and testing with any ML algorithm. We choose Logistic Regression for its relative simplicity. And it works amazingly well!

Next we drop all the unnecessary columns from our DataFrame, extract the input and output values, split them into training and testing sets, and fit the Logistic Regression model using the training data.
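A hedged sketch of this step with scikit-learn follows. The column names (`adj_pron_ratio`, `adv_adj_ratio`, `is_fiction`), the 70/30 split and the fixed seed are assumptions for the example, and the toy ratios are made up rather than taken from the Brown corpus.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FEATURES = ["adj_pron_ratio", "adv_adj_ratio"]


def train_fictometer(table, test_size=0.3, seed=0):
    """Fit logistic regression on the two ratios; return model and test split."""
    # Keep only the columns the model needs and drop incomplete rows.
    data = table[FEATURES + ["is_fiction"]].dropna()
    X = data[FEATURES].to_numpy()
    y = data["is_fiction"].astype(int).to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    model = LogisticRegression().fit(X_train, y_train)
    return model, X_test, y_test


# Made-up ratios, for illustration only.
toy = pd.DataFrame({
    "adj_pron_ratio": [4.0, 3.5, 5.0, 4.5, 3.8, 0.4, 0.6, 0.5, 0.3, 0.7],
    "adv_adj_ratio": [0.3, 0.4, 0.2, 0.3, 0.35, 1.2, 1.0, 1.1, 1.3, 0.9],
    "is_fiction": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
model, X_test, y_test = train_fictometer(toy)
print("test accuracy:", model.score(X_test, y_test))
```

Because the model has only two input features, its two coefficients in `model.coef_` and the intercept in `model.intercept_` fully describe the decision boundary, which is what keeps the classifier explainable.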

Finally, it's time to see the results of our hard work!
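One way to summarise the results is with accuracy and a confusion matrix from scikit-learn, as in the sketch below. The helper name `summarize` and the toy labels are illustrative. In practice `y_pred` would come from calling `predict` on the held-out test inputs.

```python
from sklearn.metrics import accuracy_score, confusion_matrix


def summarize(y_true, y_pred):
    """Return overall accuracy and the 2x2 confusion matrix."""
    accuracy = accuracy_score(y_true, y_pred)
    # Rows are true classes, columns are predicted classes,
    # both in the order [0 (non-fiction), 1 (fiction)].
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return accuracy, matrix


# Toy labels, for illustration only: one fiction text is misclassified.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
accuracy, matrix = summarize(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
print(matrix)
```

The off-diagonal entries of the matrix show which kind of mistake the classifier makes, which a single accuracy number hides.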

There can be many practical applications of this Fictometer algorithm, but I believe the most important one is in news media, to help catch manipulative news articles that use emotional language instead of being straightforward and simply fact-based. According to a short study we did using news articles from reputable news sources, about 15% of the articles published belong to this emotionally manipulative category.

Jupyter notebook:

News App based on our work

Fictometer Python Package

Fictometer Portal
