Sentiment Analysis in Pharo using a real data set

You are a movie reviewer, and a colleague has just sent you a set of files with hundreds of reviews to determine their sentiments, for example to classify them as positive or negative. You read that machine learning can help here by processing massive amounts of data with a classifier. But computers are not good with textual data, so all these reviews need to be converted into a machine-friendly format (hint: vectors of numbers). How do we go from hundreds of text files to an object that can make predictions on new inputs? Meet TF-IDF + Naive Bayes: an algorithm that penalizes words appearing frequently in most of the texts, plus a machine learning classifier that has proven useful for natural language processing.

The whole idea behind TF-IDF is to measure the importance of words in a collection of documents (a so-called “corpus” in NLP vocabulary). So if we can just “teach” the machine which words are important for sentiment analysis, then we can classify the sentiments in your colleague’s reviews. Teaching implies that something was learned beforehand: this is our dataset, enriched with knowledge. Fortunately, there are people who already annotated the sentiment of IMDB reviews to help with our task.
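In symbols, the standard textbook TF-IDF weighting (implementations, including pharo-ai's, may differ in normalization and smoothing details) is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the number of occurrences of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. A word that appears in every document gets log(N/N) = 0, i.e. no importance at all.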

You would probably also like to do other high-level analyses of text, like hot-topic detection, or any quantitative analysis (meaning: one whose results can be ranked). Although there is no all-in-one recipe, chances are there is a standardized workflow for you, which could include: lowercasing words, removing stop words, punctuation, abbreviations, apostrophes and single characters, stemming, or term recognition.

So the basic idea is to go from text to vectors (with TF-IDF) that can be fed to a prediction algorithm. Later, in a second part, we will use Naive Bayes as the classifier and, of course, you can try to generalize to other types of algorithms such as SVMs or neural networks.

Dataset

We are going to use the IMDB Large Movie Review Dataset with 50,000 reviews, where 1 review = 1 file. They are divided into two folders: one for training (25k) and another for testing (25k). Additionally, both the training and the testing sets are sub-divided into positive (12.5k) and negative (12.5k) annotated reviews. The reviews are rated between 1 and 10 stars. A review is considered positive if it has 7 or more stars, and negative if it has 4 or fewer stars; there are no reviews with 5 or 6 stars.
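Since each file name carries the star rating (in the pattern reviewID_rating.txt), the stars-to-polarity rule can be sketched as follows (plain Python here, just to state the rule outside Pharo; the function name is made up):

```python
def polarity_from_filename(filename):
    """Map a review file name like '123_8.txt' to a sentiment label.

    Ratings of 7 or more stars are positive (1); 4 or fewer are
    negative (0). The dataset contains no 5- or 6-star reviews.
    """
    stars = int(filename.rsplit('_', 1)[1].split('.')[0])
    return 1 if stars >= 7 else 0

print(polarity_from_filename('123_8.txt'))  # 1 (positive)
print(polarity_from_filename('456_2.txt'))  # 0 (negative)
```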

The IMDB dataset is commonly used in a Natural Language Processing (NLP) task named “binary sentiment classification”. In short: it is used when you want to build something that distinguishes between two types or classes of “sentiment”, positive or negative. You could also expand the classification into as many classes as you can get. In this case you could consider classifying with up to 8 classes (one per star rating actually present in the dataset: 1–4 and 7–10).

To start working with the dataset, download and uncompress the files to the Pharo image directory (which is where your .image file is located) as follows:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz

Now you should have a folder named aclImdb/ with all the files ready to analyze.

Setup

Now launch Pharo. For this article we use Pharo 10, but the process should work in Pharo 11 too:

./pharo-ui Pharo.image &

Let’s first install the AI packages in Pharo:

EpMonitor disableDuring: [ 
  Metacello new
    baseline: 'AIPharo';
    repository: 'github://pharo-ai/ai/src';
    onWarningLog;
    load ]

Data Exploration

To bring some context, we could say that in the Data Science pipeline there are some typical steps for classification tasks. They can be grouped into 3 big stages: Data Engineering (Exploration, Wrangling, Cleansing, Preparation), Machine Learning (Model Learning, Model Validation) and Operations (Model Deployment, Data Visualization).

Now let’s begin with the stage commonly named “data wrangling”. This is what popular libraries like pandas do. A first step here is data exploration and data sourcing. The uncompressed dataset has the following directory structure:

aclImdb/
    test/
        neg/
        pos/
    train/
        neg/
        pos/

With the following expression, open a Pharo Inspector on the result of the train reviews (highlight and evaluate with Cmd + I or Ctrl + I):

('aclImdb/train/' asFileReference childrenMatching: 'neg;pos')
  collect: [ : revFileDir | revFileDir children collect: #contents ].

The result contains two main collections: one with the negative reviews (very funny to read indeed), and the other one with the positive ones. Hold on to this information for later.

Sourcing the annotations

Classification tasks include some kind of annotation somewhere, which you can use as the “predictor” (target) to train a model. Ideally, your raw data includes a column with it. In this case the stars (i.e. the classes) are encoded in the file name of each review (which has the pattern reviewID_reviewStarRating.txt), so if you want to enrich your classifier with more classes, you could read the star rating from each file name. But for binary classification we do not need to check the star rating at all: the polarity is already available in the directory name. So we adapt our previous expression to associate a sentiment polarity with each review: 1 if positive, 0 if negative:

| reviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].

In a real project, now could be a good time to create a ReviewsCollector class with a basic protocol for loading and reading reviews. You could also consider using a DataFrame internally instead of “plain” Collections, especially if you want to augment each review in the dataset with computed features. Here we will concentrate on the raw workflow rather than on building an object model.

Note: A Pharo/Smalltalk session typically involves evaluating expressions directly in the Inspector evaluator. You can copy & paste the scripts from this post and re-evaluate the whole workflow from the start each time (if you have enough time), but I encourage you to use the Inspector, which is more in line with Exploratory Data Analysis (EDA). At the end of your working session you can save the image, or just build a script for reproducibility. In this post we will also checkpoint each step for better reproducibility, using the built-in Pharo serializer.

Duplicates removal

To start cleaning the dataset, one of the first tasks we could do is to check if there are duplicates, and remove them from our dataset. We use the message #asSet to remove duplicates:

| reviews dedupReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.

Special artifacts removal

After manual inspection we can see that our dataset contains artifacts, such as HTML tags. This means the data was scraped from HTML web pages. Such tags would not be removed by our word tokenizer, which recognizes separators and special characters but not HTML markup. You can discover tags by exploring with the Pharo Inspector (Cmd + I or Ctrl + I) using a script like this:

| reviews dedupReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.
dedupReviews anySatisfy: [ :assoc |
	| reviewText |
	reviewText := assoc key.
	(reviewText findTokens: ' ') anySatisfy: [ :word | word beginsWith: '<br' ] ]

So, if we pick a random review, our idea is to go from:

Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.

to

Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.

And we can do it with a simple expression which splits the whole String by the HTML BR pattern and then joins the resulting substrings:

| reviews dedupReviews cleanedReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.
cleanedReviews := dedupReviews collectDisplayingProgress: [ :docAssoc |
	(docAssoc key findBetweenSubstrings: #('<br />')) joinUsing: '' ].

So #findBetweenSubstrings: can detect multiple patterns and tokenize the receiver, and then we join the parts again to get rid of the noise patterns. Of course, you can adapt the expression to your own needs; I feel it is a good starting point and it avoids nasty regular expressions. Other nonsense text artifacts you might want to check for are: ‘\n’, EOL, ‘^M’, ‘\r’.
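For comparison, the same split-and-join trick looks like this in Python (a language-neutral sketch, not part of the Pharo workflow):

```python
review = ("whilst still appealing to a wider audience.<br />Branagh "
          "steals the film from under Fishburne's nose")

# Split on the noise pattern, then join the pieces back without it.
cleaned = ''.join(review.split('<br />'))

print(cleaned)
```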

To generalize for other artifacts, use the #removeSpecialArtifacts: method.

| reviews dedupReviews cleanedReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.
cleanedReviews := dedupReviews collectDisplayingProgress: [ :docAssoc |
	docAssoc key removeSpecialArtifacts -> docAssoc value ].

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: cleanedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts.fuel'.

cleanedReviews

Notice that you cannot clean such artifacts directly with a (typical) tokenizer, because tokenization involves detecting punctuation characters: if you apply tokenization first, you could lose common (written) language expressions which include punctuation, for example a smiley :-)

Punctuation, Special characters (Tokenization)

The next logical step is to transform each cleaned review in the Collection (which is composed of String “rows”, where a row = a document) into a sequence of words, a process called whitespace tokenization, so that reviews only contain words without “noise”.

When it comes to the analysis of special characters and punctuation, things become very interesting. From a naïve point of view, just removing all separators would be simple, clean, and enough. But language systems are much more complicated, especially when you bring into the analysis variables such as idiom, alphabet type, or even noise. For example, if you are doing finer semantic (linguistic) analysis, then punctuation could be significant, because it affects the meaning of a sentence in the target language.
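To make that trade-off concrete, here is a minimal word tokenizer in Python (an illustration only; the pharo-ai tokenizer's actual rules may differ). It keeps in-word apostrophes such as “there's” but drops all other punctuation:

```python
import re

def tokenize(text):
    # Words are runs of letters/digits, optionally followed by an
    # apostrophe suffix (e.g. "there's"); everything else is dropped.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

print(tokenize("there's a talented cast, on good form!"))
```

A naïve split on separators would either keep the trailing comma of "cast," or break "there's" into two tokens; the apostrophe rule avoids both.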

Removal of punctuation and special characters is done by sending the #tokenize message to a String. We can see it in action by evaluating the following expression:

| reviews dedupReviews wordTokenizer cleanedReviews tokenizedReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.
wordTokenizer := AIWordTokenizer specialArtifacts.
cleanedReviews := dedupReviews collectDisplayingProgress: [ :docAssoc |
	(docAssoc key removeSpecialArtifacts: wordTokenizer) -> docAssoc value ].
tokenizedReviews := cleanedReviews collectDisplayingProgress: [ :docAssoc |
	docAssoc key tokenize -> docAssoc value ].

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: tokenizedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

tokenizedReviews

Stopwords

Words such as “the”, “of”, “a”, etc. can be removed in two ways: by hand (using premade stopword lists) or through the automagical (statistical) behavior of TF-IDF. There are excellent, differing opinions on the pros and cons of removing stop words. TL;DR: whether to remove stopwords before TF-IDF depends on the context and the goal of your task. We can check whether the TF-IDF algorithm will “automatically” give a low rank to the very frequent terms that appear in many documents.
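We can see that “automatic” down-ranking with a toy IDF computation (plain Python, using the standard log(N/df) formula; the pharo-ai implementation may use a smoothed variant):

```python
import math

docs = [
    ['the', 'film', 'was', 'great'],
    ['the', 'plot', 'was', 'terrible'],
    ['the', 'cast', 'saved', 'the', 'film'],
]

def idf(term):
    df = sum(1 for doc in docs if term in doc)  # document frequency
    return math.log(len(docs) / df)

print(idf('the'))    # 0.0 -- appears in every document, so no weight
print(idf('great'))  # log(3): rare across documents, so informative
```

A stopword like “the” appears in every document, so its IDF (and hence its TF-IDF weight) collapses to zero without any explicit stopword list.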

If you decide to go with stopword removal, Pharo has a stopwords package which provides multiple premade stopword lists. We will use a default list of stopwords, but you can use any other one you prefer:

AIStopwords forEnglish.
AIStopwords forSpanish.

To explore other lists:

AIStopwords listSummary.

So our script so far with stopword removal:

| reviews dedupReviews wordTokenizer cleanedReviews tokenizedReviews |
reviews := #('train' 'test') collect: [ :setName |
	(('aclImdb/' , setName , '/') asFileReference childrenMatching: 'neg;pos')
		collect: [ :revFileDir |
			| polarity |
			polarity := (revFileDir basename endsWith: 'neg')
				ifTrue: [ 0 ]
				ifFalse: [ 1 ].
			revFileDir children
				collectDisplayingProgress: [ :file | file contents -> polarity ] ] ].
dedupReviews := reviews deepFlatten asSet.
wordTokenizer := AIWordTokenizer specialArtifacts.
cleanedReviews := dedupReviews collectDisplayingProgress: [ :docAssoc |
	(docAssoc key removeSpecialArtifacts: wordTokenizer) -> docAssoc value ].
tokenizedReviews := cleanedReviews collectDisplayingProgress: [ :docAssoc |
	docAssoc key tokenizeWithoutStopwords -> docAssoc value ].

"This serialization step is optional and could take some time to complete"
FLSerializer 
	serialize: tokenizedReviews 
	toFileNamed: 'acImdb_49582_nodups_noartfcts_tokenized.fuel'.

tokenizedReviews

To ignore stopwords removal just replace #tokenizeWithoutStopwords with #tokenize.

That covers the first part: reading and cleaning data for a classification task. In the next article we will see how to classify these reviews with a classifier.
