Natural Language Processing - Sentiment Analysis


This page gives an introduction to one of the techniques of Natural Language Processing: Sentiment Analysis.
The goals of this assignment are:

  • Providing an overview of this field
  • Showing three different implementations of a Sentiment Analysis engine and comparing their performance
  • Analyzing the limitations that a Sentiment Analysis engine is subject to


Sentiment analysis (also known as opinion mining) is a field of natural language processing that deals with building systems that identify and extract opinions from text. It is one of the best-known NLP techniques.
On this page, “Sentiment Analysis” means determining whether a piece of writing is positive or negative. This may look like a trivial task that any person can do without machine learning or AI, but in the era of big data, being able to automate it and apply it to a huge quantity of data can be a turning point from many points of view.
The main applications of Sentiment Analysis are social media monitoring, customer support, brand monitoring, product analysis, and market research.

Let's see a basic example to understand what a Sentiment Analysis engine does. Consider two sentences coming from different reviews of a phone: the engine must be able to analyze each text and assign it the label “positive” or “negative”, in order to mine the sentiment behind the text.

Input text | Real Sentiment | Predicted Sentiment
“This phone is simply incredible, it takes beautiful pictures and has a big display” | positive | positive
“This phone is ugly, doesn't ring and has a small and fragile display” | negative | positive

In this example, the engine correctly classified the first sentence but failed to recognize that the second review was negative.

For this assignment, I decided to implement three different Sentiment Analysis engines based on different approaches. I decided to use Python because it is user-friendly and well supported for machine learning, NLP, and classification algorithms: there are hundreds of powerful libraries that provide great support to the developer. The dataset that I selected contains reviews of cellphones and related accessories in JSON format, which is easy to handle in Python. Since the dataset contains several hundred thousand reviews, I created a script that reduces the number of reviews to about twenty thousand: a larger quantity of data is generally better, but too many reviews would have led to very long running times for the algorithms. The script and the link for downloading the dataset are available in the last section of the page. Since the algorithms' performance must be evaluated, I chose a dataset that contains not only the text of each review but also the actual rating the user assigned to the product. This makes it easy to check whether the prediction for a review is correct.
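The reduction script itself is not reproduced on this page. As a rough sketch of the idea only, and assuming that the original file stores one JSON review per line (an assumption about the input format), a filter along these lines would keep a bounded number of usable reviews. The names reduce_reviews and limit are hypothetical and are not taken from the downloadable script.

```python
import json

def reduce_reviews(lines, limit):
    """Keep at most `limit` reviews that carry both a text and a rating.

    `lines` is any iterable of strings, each assumed to hold one JSON object.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        # keep only the two fields that the engines need
        if "reviewText" in review and "overall" in review:
            kept.append({"reviewText": review["reviewText"],
                         "overall": review["overall"]})
        if len(kept) >= limit:
            break
    return kept

# made-up input lines, only for illustration
sample = [
    '{"reviewText": "Great case", "overall": 5.0, "asin": "X"}',
    '',
    '{"reviewText": "Broke in a week", "overall": 1.0}',
    '{"summary": "no text here"}',
    '{"reviewText": "Fine", "overall": 3.0}',
]
reduced = reduce_reviews(sample, limit=2)
print(len(reduced))
```

The resulting list could then be written with json.dump() so that it can be read back as a single JSON list.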

Engine Implementation

In this section, three different engine implementations are shown and compared. The three implementations are:

  • Lexicon Based
  • Machine Learning
  • Turney's Algorithm

These approaches have been selected to show how Sentiment Analysis can be done, from a very easy and intuitive way to a more complex and sophisticated one. Note that these are just three possibilities among many others. I selected them because they are based on different theories and have different strengths and weaknesses.
The data coming from the dataset is a list of JSON reviews, each having the following format:

{"reviewerID": "A30TL5EWN6DFXT", "asin": "120401325X", "reviewerName": "christina", "helpful": [0, 0],
"reviewText": "They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again",
"overall": 4.0, "summary": "Looks Good", "unixReviewTime": 1400630400, "reviewTime": "05 21, 2014"}

For the purposes of the assignment, the useful fields are reviewText, which contains the text of the review, and overall, which contains the customer's rating on a scale from 1.0 to 5.0.
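As a small illustration of how these two fields can be read, the following sketch parses a single review with the standard json module. The review string is an abbreviated copy of the example above, and the variable names are only illustrative.

```python
import json

# abbreviated version of the example review shown above
raw = ('{"reviewerID": "A30TL5EWN6DFXT", "asin": "120401325X", '
       '"reviewText": "They look good and stick good!", '
       '"overall": 4.0, "summary": "Looks Good"}')

review = json.loads(raw)
text = review["reviewText"]        # the text that the engines analyze
rating = float(review["overall"])  # the real rating used to evaluate them

print(text, rating)
```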

Data Preparation

Before diving into the details of the algorithms, it's important to discuss data preparation. Sentiment Analysis can be seen as a classification problem, and this means the algorithms must be fed with pre-processed data. In this case, this is valid only for the first two approaches, since Turney's Algorithm doesn't need the input text to be cleaned and prepared.
Data processing is needed in the lexicon-based and machine learning approaches because these algorithms work on single words, which are then compared or classified. The process applied to each review is the following:

  • Review tokenization
  • Stop-word and punctuation removal
  • Stem extraction

All these techniques can be easily applied in Python using the NLTK toolkit.
In both algorithms the data is cleaned in the same way: the function process_review() receives a text as input and then sequentially calls tokenize_review(), remove_stopwords_punctuations(), and get_stem(), which respectively divide the review into tokens (words), remove each token that appears in the stop-word list or in the punctuation set, and finally extract the stem of every remaining word (e.g. the word “waiting” becomes “wait”).
The code of these functions is shown below:

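import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def process_review(t):
    # tokenize, drop stop-words and punctuation, then stem
    reviews_tokens = tokenize_review(t)
    reviews_clean = remove_stopwords_punctuations(reviews_tokens)
    reviews_stem = get_stem(reviews_clean)
    return reviews_stem
def tokenize_review(t):
    review_tokens = word_tokenize(t, language='english')
    return review_tokens
stopwords_english = stopwords.words('english')
punctuations = string.punctuation
def remove_stopwords_punctuations(review_tokens):
    reviews_clean = []
    for word in review_tokens:
        # keep only the tokens that are neither stop-words nor punctuation
        if word not in stopwords_english and word not in punctuations:
            reviews_clean.append(word)
    return reviews_clean
stemmer = PorterStemmer()
def get_stem(reviews_clean):
    reviews_stem = []
    for word in reviews_clean:
        # the Porter stemmer also lowercases the word
        stem_word = stemmer.stem(word)
        reviews_stem.append(stem_word)
    return reviews_stem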

As an example, let's consider the text:

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

After the tokenization the result is:

['They', 'look', 'good', 'and', 'stick', 'good', '!', 'I', 'just', 'do', "n't", 'like', 'the', 'rounded', 'shape', 'because', 'I', 'was', 'always', 'bumping', 'it', 'and', 'Siri', 'kept', 'popping', 'up', 'and', 'it', 'was', 'irritating', '.', 'I', 'just', 'wo', "n't", 'buy', 'a', 'product', 'like', 'this', 'again'] 

Which, after the removal of stop words and punctuation, becomes:

['They', 'look', 'good', 'stick', 'good', 'I', "n't", 'like', 'rounded', 'shape', 'I', 'always', 'bumping', 'Siri', 'kept', 'popping', 'irritating', 'I', 'wo', "n't", 'buy', 'product', 'like']

And finally, after stemming, it looks like:

['they', 'look', 'good', 'stick', 'good', 'i', "n't", 'like', 'round', 'shape', 'i', 'alway', 'bump', 'siri', 'kept', 'pop', 'irrit', 'i', 'wo', "n't", 'buy', 'product', 'like']

This is what the chosen algorithms analyze and process in order to understand the sentiment behind the text.



Lexicon Based

The first algorithm that has been implemented is the most intuitive and basic one. The concept is simple: each word of a review is looked up in a list of positive words and in a list of negative words. If more words are found in the positive list than in the negative list, the whole review is considered positive; otherwise it is considered negative. The limits of this solution are large and evident, but in order to appreciate more advanced approaches it is important to understand the basic ones too. Lists of positive and negative words are widespread and easy to find. The selected ones were picked because they contain a large number of words and because they include misspelled words, which can improve the performance and robustness of the engine. The links to the lists of words are available in the last section. It is important to notice that only the last 50 positive and the last 50 negative reviews are used to evaluate the model. This may seem strange, since this approach can be considered an unsupervised algorithm with no need for training data, but in order to compare it with the supervised algorithms presented later, all algorithms must be evaluated on the same test data.

The main steps of the algorithm are the following. Two lists of reviews are obtained using loadReviews(). From these lists, 50 positive and 50 negative reviews are extracted to evaluate the model. Then a label is assigned to each review to record its real sentiment (1 = positive, 0 = negative). The lists of positive and negative words are loaded using load_words() and saved in two lists. Finally, the actual algorithm is run: each word of each review is looked up in the two lists, and the corresponding counter is incremented when the word is found. In the end, the sentiment of the review is decided by comparing the two counters, and some statistical measures are printed.

import numpy as np

if __name__ == "__main__":
    # select the set of positive and negative reviews
    all_positive_reviews, all_negative_reviews = loadReviews("./Cell_Phones_and_Accessories_5_filter.json")
    # IMPORTANT - this algorithm is an unsupervised algorithm! No need for train data
    # so I use the last 50 reviews to be compliant with the other 2 algorithms
    test_pos = all_positive_reviews[-50:]
    test_neg = all_negative_reviews[-50:]
    test_x = test_pos + test_neg
    test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
    # load and clean positive and negative words
    positive_words = load_words("positive-words.txt")
    negative_words = load_words("negative-words.txt")
    correct = 0
    incorrect = 0
    i = 0
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for r in test_x:
        p = 0
        n = 0
        # clean the review
        tmp_r = process_review(r)
        # for each word of the review check if present in positive or negative list of words
        for elem in tmp_r:
            if elem in positive_words:
                p += 1
            elif elem in negative_words:
                n += 1
        # depending on the number of pos and neg words, check if the prediction is correct
        if p > n and test_y[i] == 1:
            correct += 1
            tp += 1
        elif p > n and test_y[i] == 0:
            incorrect += 1
            fp += 1
        elif n >= p and test_y[i] == 0:
            correct += 1
            tn += 1
        elif n >= p and test_y[i] == 1:
            incorrect += 1
            fn += 1
        i += 1
    print("Accuracy: " + str(correct / (correct + incorrect)))
    print("Recall: " + str(tp / (tp + fn)))
    print("Precision: " + str(tp / (tp + fp)))
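The counting rule at the heart of the loop above can be isolated in a few lines. The following is only a sketch on made-up data: the two word sets are tiny hypothetical stand-ins for the real lists, the tokens are not stemmed for readability, and the function name lexicon_sentiment is illustrative.

```python
def lexicon_sentiment(tokens, positive_words, negative_words):
    """Return 1 (positive) if positive hits outnumber negative hits, else 0."""
    p = 0
    n = 0
    for token in tokens:
        if token in positive_words:
            p += 1
        elif token in negative_words:
            n += 1
    # a tie is treated as negative, as in the loop above
    return 1 if p > n else 0

# tiny hypothetical word lists
positive_words = {"good", "beautiful", "incredible", "like"}
negative_words = {"ugly", "fragile", "irritating"}

print(lexicon_sentiment(["phone", "incredible", "beautiful", "pictures"],
                        positive_words, negative_words))
print(lexicon_sentiment(["phone", "ugly", "small", "fragile", "display"],
                        positive_words, negative_words))
# a neutral sentence is scored as positive only because it contains "like"
print(lexicon_sentiment(["would", "like", "go", "home"],
                        positive_words, negative_words))
```

Using sets instead of lists for the word collections is a design choice of this sketch: membership tests on a set do not slow down as the lists grow.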

In the following portion of code, the loading of the lists of positive and negative words is shown. The function opens the file and reads it line by line, since each word is on a separate line. It is important to notice that empty lines and lines starting with ';' (which marks a comment) are skipped. In all other cases, the word is added to a list unless it is a stop-word or a punctuation symbol.

def load_words(fileName):
    words_buffer = []
    # read the file line by line: each word is on a separate line
    with open(fileName, 'r') as file:
        for l in file:
            # strip the newline and tab characters
            l = l.strip('\n')
            l = l.strip('\t')
            # skip blank lines and comments starting with ;
            if l == '' or l[0] == ';':
                continue
            if l not in stopwords_english and l not in punctuations:
                words_buffer.append(l)
    return words_buffer

In the following piece of code, the function that loads the reviews from the JSON file is shown. It uses the json library of Python and fills two lists, one with positive and one with negative reviews. The distinction is made using the field “overall”: if the rating is greater than or equal to 3.0 the review is considered positive, otherwise it is considered negative.

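import json

def loadReviews(fileName):
    list_pos = []
    list_neg = []
    with open(fileName) as file:
        data = json.load(file)
    for elem in data:
        # ratings of 3.0 or more are treated as positive, the others as negative;
        # only the review text is stored, since it is what the engines process
        if float(elem["overall"]) >= 3.0:
            list_pos.append(elem["reviewText"])
        else:
            list_neg.append(elem["reviewText"])
    return list_pos, list_neg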

Machine Learning - Naive Bayes

The second approach is based on machine learning. Sentiment Analysis is basically a classification problem, so it's possible to exploit well-known solutions to create the engine. I decided to use the Naive Bayes classifier because it is widely reported in the literature to be a robust and accurate classifier for text.
The Naive Bayes approach is based on the posterior probability, which can be computed with the following formula:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
The posterior probability is the likelihood (the probability of observing the element “B” given the class “A”) multiplied by the prior (the probability of the class “A”), divided by the evidence (the probability of the element “B”).
Transposed to the Sentiment Analysis case, “A” is “positive” or “negative”, while “B” represents a word. The posterior probability becomes the probability of finding the word in a positive or negative review, multiplied by the probability of having a positive or negative review, all divided by the probability of having the given word.
The most complex element to compute is the likelihood, since it requires the construction of a frequency dictionary.
In simpler words, the algorithm uses the training data to count how many times a word appears in positive and negative reviews. This information is then used to understand whether the words of a review are more likely to appear in a positive or a negative sentence. Note that the evidence is not used in the algorithm, since it is the same for both classes and therefore does not affect the comparison.
An example is shown in order to clarify the theory. Assume the training data is composed of the following sentences; the “clean” version is reported after the arrow:

  • “The phone is good” ⇒ [“phone”,“good”]
  • “This item is terrible” ⇒ [“item”,“terrible”]
  • “This telephone is really good” ⇒ [“telephone”, “really”, “good”]

Of course, the first and last sentences are positive reviews while the second one is negative. When we create the frequency dictionary, its content is the following:

Word | Positive | Negative
phone | 1 | 0
good | 2 | 0
item | 0 | 1
terrible | 0 | 1
telephone | 1 | 0
really | 1 | 0

When a new review must be predicted, for each word of the review a lookup in the frequency dictionary is made and the polarity of the single word is evaluated. For example, the review “The item I bought is really really good” will be evaluated as positive since the average sentiment of the words is positive.
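The frequency dictionary of this example can be reproduced with a few lines of Python. This is only a sketch on the three toy sentences: the tokens are written by hand in their “clean” form, and collections.Counter is used here for brevity instead of a plain dictionary.

```python
from collections import Counter

# cleaned training sentences from the example, label 1 = positive, 0 = negative
training = [
    (["phone", "good"], 1),
    (["item", "terrible"], 0),
    (["telephone", "really", "good"], 1),
]

freq = Counter()
for tokens, label in training:
    for word in tokens:
        freq[(word, label)] += 1

# print one row per word: word, positive count, negative count
for word in ["phone", "good", "item", "terrible", "telephone", "really"]:
    print(word, freq[(word, 1)], freq[(word, 0)])
```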

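In the real implementation, the logarithms of the likelihood and of the prior are used: they turn products of many small probabilities into sums of values of manageable size, without changing which class wins the comparison. When the algorithm is fed with hundreds of thousands of data items, it is crucial to work with values of a reasonable magnitude. This leads to a different formula from the one shown above, since the properties of logarithms can be exploited. The prior is called logprior and is computed in this way:
$$\text{logprior} = \log\frac{D_{pos}}{D_{neg}} = \log D_{pos} - \log D_{neg}$$
where $D_{pos}$ and $D_{neg}$ are the numbers of positive and negative training reviews. The likelihood of a word $w$ becomes loglikelihood and is computed using this formula:
$$\text{loglikelihood}(w) = \log\frac{P(w \mid pos)}{P(w \mid neg)}, \qquad P(w \mid pos) = \frac{freq_{pos}(w) + 1}{N_{pos} + V}, \qquad P(w \mid neg) = \frac{freq_{neg}(w) + 1}{N_{neg} + V}$$
where $freq_{pos}(w)$ and $freq_{neg}(w)$ are the counts of $w$ in positive and negative reviews, $N_{pos}$ and $N_{neg}$ are the total numbers of word occurrences in positive and negative reviews, and $V$ is the number of unique words. Finally, the polarity of a review is computed by adding the logprior to the sum of the loglikelihood of each word $w_i$ that composes the review:
$$\text{polarity} = \text{logprior} + \sum_{i} \text{loglikelihood}(w_i)$$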
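The numerical reason for working with logarithms can be seen in a short experiment. The numbers below are hypothetical (a single-word probability of 1e-5 and a review of 100 words); the sketch only shows that the plain product of the probabilities becomes zero in floating point, while the sum of the logarithms stays an ordinary number.

```python
import math

p = 1e-5        # hypothetical probability of a single word
n_words = 100   # hypothetical review length

product = 1.0
for _ in range(n_words):
    product *= p  # underflows to 0.0 before the loop ends

log_sum = n_words * math.log(p)  # same quantity in log space

print(product)
print(log_sum)
```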

Since this is a supervised algorithm, the dataset must be divided into training and testing sets. This is done after the call to loadReviews(). The first 24343 positive and the first 24343 negative reviews are used for training while, as shown before, the last 50 of each are used to test the model. Then the frequency dictionary is created and the model is trained using the function train_naive_bayes(), which returns the logprior and the loglikelihood. Finally, each test review is used to evaluate the algorithm using naive_bayes_predict().

import numpy as np

if __name__ == "__main__":
    # select the set of positive and negative reviews
    all_positive_reviews, all_negative_reviews = loadReviews("./Cell_Phones_and_Accessories_5_filter.json")
    # split the data to train and test sets
    test_pos = all_positive_reviews[-50:]
    train_pos = all_positive_reviews[:24343]
    test_neg = all_negative_reviews[-50:]
    train_neg = all_negative_reviews[:24343]
    train_x = train_pos + train_neg
    test_x = test_pos + test_neg
    # set label to 1 for positive reviews and 0 for negative ones
    train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
    test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
    # create the frequency counting how many times a word appears in a positive/negative review
    frequency = create_frequency(train_x, train_y)
    # train the model
    logprior, loglikelihood = train_naive_bayes(frequency, train_x, train_y)
    i = 0
    correct = 0
    incorrect = 0
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for review in test_x:
        p = naive_bayes_predict(review, logprior, loglikelihood)
        # count true positive, true negative, false positive and false negative -> recall, precision
        # (a score of exactly 0 is treated as a negative prediction)
        if p > 0 and test_y[i] == 1:
            tp += 1
        elif p > 0 and test_y[i] == 0:
            fp += 1
        elif p <= 0 and test_y[i] == 0:
            tn += 1
        else:
            fn += 1
        # count correct and incorrect prediction -> accuracy
        if (p > 0 and test_y[i] == 1) or (p <= 0 and test_y[i] == 0):
            correct += 1
        else:
            incorrect += 1
        i = i + 1
    print("Accuracy: " + str(correct / (correct + incorrect)))
    print("Recall: " + str(tp / (tp + fn)))
    print("Precision: " + str(tp / (tp + fp)))

The following piece of code shows the function used to create the frequency dictionary. The key of the dictionary is a tuple (word, pos/neg), while the value is the number of occurrences of the word. create_frequency() is a very basic function: for each word of each review, it checks whether the tuple (word, pos/neg) is already present and, if so, increases its counter; otherwise it adds the tuple with a count of 1. The second element of the tuple is the sentiment label of the review.

def create_frequency(reviews, labels):
    freq_d = {}
    # Create frequency dictionary
    # ZIP creates a tuple (text, pos/neg)
    for review, y in zip(reviews, labels):
        # Before counting the frequency we preprocess the text
        for word in process_review(review):
            pair = (word, y)
            # if present increase the count else add to the dict
            if pair in freq_d:
                freq_d[pair] += 1
            else:
                freq_d[pair] = 1
    return freq_d

This part of the code is the function used to train the model. It computes the loglikelihood and the logprior. The first step is to calculate the number of unique words in the frequency dictionary, then the total number of word occurrences in positive and in negative reviews. The numbers of positive and negative reviews are also counted and used to compute the prior, which is the number of positive reviews divided by the number of negative ones. Using the properties of logarithms, its logarithm is computed as a difference. Finally, the probability of a word being positive and negative is computed, and this is used to compute the loglikelihood of that word.

def train_naive_bayes(freq, reviews, review_labels):
    loglikelihood = {}
    # calculate the number of unique words in vocab
    unique_words = set([pair[0] for pair in freq.keys()])
    V = len(unique_words)
    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freq.keys():
        # if the word was in a positive review
        if pair[1] > 0:
            N_pos += freq[pair]
        # else the word was in a negative review
        else:
            N_neg += freq[pair]
    # Calculate the number of documents (reviews)
    # shape[0] return the # of rows of an array
    D = review_labels.shape[0]
    # Calculate D_pos, the number of positive documents (reviews)
    D_pos = sum(review_labels)
    # Calculate D_neg, the number of negative documents (reviews)
    D_neg = D - sum(review_labels)
    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)
    # for each word
    for word in unique_words:
        # get the positive and negative frequency of the word
        freq_pos = freq.get((word, 1), 0)
        freq_neg = freq.get((word, 0), 0)
        # calculate the probability that word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood

Finally, the function naive_bayes_predict() is used to understand the sentiment of a single review using the trained model. It looks up the loglikelihood of every single word of the review and adds the obtained values to the logprior; words that never appeared in the training data are ignored. The returned value is used to understand if a review is positive (prob > 0) or negative (prob <= 0).

def naive_bayes_predict(review, logprior, loglikelihood):
    # Process the review to get a list of words
    word_l = process_review(review)
    prob = 0
    # Add the logprior
    prob += logprior
    for word in word_l:
        # words never seen during training have no loglikelihood and are skipped
        if word in loglikelihood:
            prob += loglikelihood[word]
    return prob
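As a check of the whole chain, the toy sentences of the earlier example can be pushed through the same computations. The following sketch is a condensed re-implementation, not the code above: the counts are typed in by hand, the review is assumed to be already cleaned, and the names score and new_review are illustrative.

```python
import math

# toy counts: (word, label) -> occurrences, label 1 = positive, 0 = negative
freq = {("phone", 1): 1, ("good", 1): 2, ("item", 0): 1,
        ("terrible", 0): 1, ("telephone", 1): 1, ("really", 1): 1}
d_pos, d_neg = 2, 1  # two positive training sentences, one negative

vocab = {word for word, _ in freq}
V = len(vocab)
n_pos = sum(c for (_, label), c in freq.items() if label == 1)
n_neg = sum(c for (_, label), c in freq.items() if label == 0)

logprior = math.log(d_pos) - math.log(d_neg)
# log of the ratio of the smoothed word probabilities in the two classes
loglikelihood = {
    w: math.log(((freq.get((w, 1), 0) + 1) / (n_pos + V)) /
                ((freq.get((w, 0), 0) + 1) / (n_neg + V)))
    for w in vocab
}

def score(tokens):
    """Add the log prior to the log likelihood of every known word."""
    return logprior + sum(loglikelihood[w] for w in tokens if w in loglikelihood)

# "The item I bought is really really good", already cleaned by hand
new_review = ["item", "bought", "really", "really", "good"]
print(score(new_review))
```

The score is greater than zero (about 1.21), so the review is classified as positive even though “item” only appeared in a negative sentence; “bought” is unknown and does not contribute.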

Turney's Algorithm

The last approach that is presented is based on the work of professor Peter D. Turney, presented in his paper. The main idea is to work on sets of two or three words instead of working on single words as in the previous algorithms. These sets of words are compared with two keywords: “excellent” and “poor”. The comparison is made through some statistical indicators that will be explained later. The reference words “excellent” and “poor” were chosen because, in the five-star review rating system, it is common to define one star as “poor” and five stars as “excellent”. During the development, other keywords were tried (e.g. “great” instead of “excellent”), but the best results were obtained with the words selected by professor Turney.
The basic steps of this algorithm are:

  • Extract phrasal lexicon from the reviews
  • Learn the polarity of each phrase
  • Rate the review based on the average polarity of the phrases

First, a part-of-speech tagger is applied to the review. Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table shown below. The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs. The second pattern, for example, means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. The table was defined by professor Turney in order to extract the meaningful parts of the review: according to his study, these are the parts of the text where the real sentiment information is typically stored.

First word | Second word | Third word (not extracted)
JJ | NN or NNS | anything
RB, RBR, or RBS | JJ | not NN nor NNS
JJ | JJ | not NN nor NNS
NN or NNS | JJ | not NN nor NNS
RB, RBR, or RBS | VB, VBD, VBN, or VBG | anything

The extraction of part-of-speech elements is not a trivial task and it's part of the NLP world. During the development, this task was done using the NLTK toolkit, which provides a function for it.

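PMI-IR uses Pointwise Mutual Information (PMI), a measure of association between pairs of words or phrases. It is computed according to this formula:
$$\text{PMI}(word_1, word_2) = \log_2\frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$

The semantic orientation (SO) of a phrase is then computed as
$$\text{SO}(phrase) = \text{PMI}(phrase, \text{"excellent"}) - \text{PMI}(phrase, \text{"poor"})$$

Finally, an average of the SO of each phrase is made to obtain the sentiment of the review.
In practice, the SO computation is made using the NEAR operation and, using the properties of the logarithm, becomes:
$$\text{SO}(phrase) = \log_2\frac{\text{hits}(phrase \text{ NEAR "excellent"}) \cdot \text{hits}(\text{"poor"})}{\text{hits}(phrase \text{ NEAR "poor"}) \cdot \text{hits}(\text{"excellent"})}$$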
This change was made by professor Turney because of the search engine he was using. In Python, the NEAR operator can be easily emulated using a regular expression. Let's now see the algorithm implementation.
The main part of the program is really simple. It calls make_datasets() and then creates an instance of the Turney class, passing the dataset as a parameter. A class has been used to make it easier to share some variables and structures. Finally, the turney() method is called, which actually runs the algorithm.

if __name__ == "__main__":
    FILE_PATH = './Cell_Phones_and_Accessories_5_filter.json'
    datasets = make_datasets(FILE_PATH)
    turney = Turney(datasets)
    # run the algorithm on the test reviews
    turney.turney()

The function make_datasets() receives as a parameter the file name, which is passed to the loadReviews() function shown in the previous methods. Then the dataset is created. It is a dictionary with two keys: “train”, which contains the training data, and “test”, which contains the reviews to be evaluated (50 positive and 50 negative). Each of these two keys has as its value another dictionary with two keys, “pos” for the positive reviews and “neg” for the negative ones. The dataset is filled in the same way as in the machine learning approach.

def make_datasets(fileName):
    all_positive_reviews, all_negative_reviews = loadReviews(fileName)
    dataset = {'train': {'neg': [], 'pos': []}, 'test': {'neg': [], 'pos': []}}
    dataset['train']['pos'] = (all_positive_reviews[:24343])
    dataset['train']['neg'] = (all_negative_reviews[:24343])
    dataset['test']['pos'] = (all_positive_reviews[-50:])
    dataset['test']['neg'] = (all_negative_reviews[-50:])
    return dataset

This section of the code shows how the phrasal lexicon is extracted. The postag parameter is the output of nltk.pos_tag(nltk.word_tokenize(text)), which tokenizes a review and then recognizes the part of speech of each token. The parameter is scanned to find the patterns shown in the table above. Every time a phrase is found, it is appended to the tag_pattern list, which is returned to the calling function.

def find_pattern(postag):
    tag_pattern = []
    for k in range(len(postag) - 2):
        if postag[k][1] == "JJ" and (postag[k + 1][1] == "NN" or postag[k + 1][1] == "NNS"):
            tag_pattern.append("".join(postag[k][0]) + " " + "".join(postag[k + 1][0]))
        elif ((postag[k][1] == "RB" or postag[k][1] == "RBR" or postag[k][1] == "RBS") and postag[k + 1][1] == "JJ" and
              postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS"):
            tag_pattern.append("".join(postag[k][0]) + " " + "".join(postag[k + 1][0]))
        elif postag[k][1] == "JJ" and postag[k + 1][1] == "JJ" and postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS":
            tag_pattern.append("".join(postag[k][0]) + " " + "".join(postag[k + 1][0]))
        elif (postag[k][1] == "NN" or postag[k][1] == "NNS") and postag[k + 1][1] == "JJ" and postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS":
            tag_pattern.append("".join(postag[k][0]) + " " + "".join(postag[k + 1][0]))
        elif ((postag[k][1] == "RB" or postag[k][1] == "RBR" or postag[k][1] == "RBS") and (
                postag[k + 1][1] == "VB" or postag[k + 1][1] == "VBD" or postag[k + 1][1] == "VBN" or postag[k + 1][1] == "VBG")):
            tag_pattern.append("".join(postag[k][0]) + " " + "".join(postag[k + 1][0]))
    return tag_pattern
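The chain of conditions above can also be read as a small table of patterns. The sketch below is an alternative, table-driven formulation and not the code used in the assignment; the names PATTERNS and extract_phrases are illustrative, and the example tokens are tagged by hand instead of being produced by a tagger. Like the function above, it never examines the last pair of tokens, because it always looks at a third word.

```python
ADJ = {"JJ"}
NOUN = {"NN", "NNS"}
ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

# (tags of first word, tags of second word, third word must not be a noun)
PATTERNS = [
    (ADJ, NOUN, False),
    (ADV, ADJ, True),
    (ADJ, ADJ, True),
    (NOUN, ADJ, True),
    (ADV, VERB, False),
]

def extract_phrases(tagged):
    """Return the two-word phrases whose tags match one of PATTERNS."""
    phrases = []
    for k in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (_, t3) = tagged[k], tagged[k + 1], tagged[k + 2]
        for first, second, no_noun_after in PATTERNS:
            if t1 in first and t2 in second and not (no_noun_after and t3 in NOUN):
                phrases.append(w1 + " " + w2)
                break  # at most one pattern per position
    return phrases

# hand-tagged tokens (the tags are written by hand for illustration)
tagged = [("very", "RB"), ("nice", "JJ"), ("and", "CC"), ("big", "JJ"),
          ("display", "NN"), ("works", "VBZ"), ("well", "RB")]
print(extract_phrases(tagged))
```

On this input the sketch extracts “very nice” (adverb followed by adjective, not followed by a noun) and “big display” (adjective followed by noun).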

The NEAR operator used by professor Turney in his studies has been implemented using Python's regular expression library. It searches the text string for occurrences of the word parameter (which will be “excellent” or “poor”) near one of the phrases extracted by the previous function, in either order.

import re

def near_operator(phrase, word, text):
    # escape both inputs so regex metacharacters in a phrase are matched literally
    phrase, word = re.escape(phrase), re.escape(word)
    string = word + r'\W+(?:\w+\W+){0,400}?' + phrase + r'|' + phrase + r'\W+(?:\w+\W+){0,400}?' + word
    return len(re.findall(string, text))
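To see how the regular expression above behaves, the following self-contained sketch rebuilds the same pattern with a configurable window. The function name count_near and the max_gap parameter are hypothetical additions for illustration, and the example sentence is made up.

```python
import re

def count_near(phrase, word, text, max_gap=400):
    """Count matches where `phrase` and `word` are at most `max_gap` words apart."""
    p, w = re.escape(phrase), re.escape(word)
    # one separator, then lazily up to max_gap intermediate words
    gap = r'\W+(?:\w+\W+){0,%d}?' % max_gap
    pattern = w + gap + p + '|' + p + gap + w
    return len(re.findall(pattern, text))

text = "An excellent phone with a big display. Sadly the battery is poor."

print(count_near("big display", "excellent", text))
print(count_near("big display", "poor", text))
# with a window of 3 words, "poor" is too far from the phrase
print(count_near("big display", "poor", text, max_gap=3))
```

With the default window of 400 words the phrase counts as near both keywords in this short text, which suggests that the size of the window has a strong influence on the result. Note also that the match is case-sensitive and works on substrings, so “poorly” would also be counted as “poor”.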

Finally, the implementation of the class Turney is shown. When an object is instantiated, the __init__() method is called, receiving the dataset. It saves the dataset to an internal variable and initializes all the variables needed by the algorithm. Note that some variables are set to 0.01 in order to avoid divisions by 0, as explained in professor Turney's paper.
Let's now analyse the core of the algorithm: the turney() method. It consists of a loop over all test reviews, first the positive ones and then the negative ones, which extracts the phrasal lexicon of each review using the function previously explained. If no phrase is found, the review is not classified. Otherwise, all the positive and negative training reviews are scanned, and the value of the NEAR operator is computed with respect to “excellent” and “poor” for each phrase of the test review. The occurrences of “excellent” and “poor” in the training reviews are counted as well. At this point the calculate_sentiment() function can be called, and the mathematical formula can be computed to evaluate the average sentiment of the review. In the end, it's checked whether the prediction was correct and the accuracy is computed.

import math
import nltk

class Turney(object):
    def __init__(self, dataset):
        self.datasets = dataset
        self.pos_phrases_hits = []
        self.neg_phrases_hits = []
        self.pos_hits = 0.01
        self.neg_hits = 0.01
        self.accuracy = 0
    def turney(self):
        tp = 0
        fp = 0
        tn = 0
        fn = 0
        for boolean, test_klass in enumerate(['pos', 'neg']):
            for i, data in enumerate(self.datasets['test'][test_klass]):
                print(str(i) + " out of " + str(len(self.datasets['test'][test_klass])) + " --> round " + str(boolean))
                phrases = find_pattern(nltk.pos_tag(nltk.word_tokenize(data)))
                if len(phrases) == 0:
                    # no phrase matched the patterns: the review is not classified
                    continue
                self.pos_phrases_hits = [0.01] * len(phrases)
                self.neg_phrases_hits = [0.01] * len(phrases)
                self.pos_hits = 0.01
                self.neg_hits = 0.01
                for train_klass in ['pos', 'neg']:
                    for text in self.datasets['train'][train_klass]:
                        for ind, phrase in enumerate(phrases):
                            self.pos_phrases_hits[ind] += near_operator(phrase, "excellent", text)
                            self.neg_phrases_hits[ind] += near_operator(phrase, "poor", text)
                            self.pos_hits += text.count("excellent")
                            self.neg_hits += text.count("poor")
                res = self.calculate_sentiment(boolean)
                # res == 1 means the prediction was correct; boolean is 0 for
                # positive test reviews and 1 for negative ones
                if res == 1 and boolean == 0:
                    tp += 1
                elif res == 1 and boolean == 1:
                    tn += 1
                elif res == 0 and boolean == 0:
                    fn += 1
                elif res == 0 and boolean == 1:
                    fp += 1
        print("Accuracy: " + str(self.accuracy/100))
        print("Recall: " + str(tp / (tp + fn)))
        print("Precision: " + str(tp / (tp + fp)))
    def calculate_sentiment(self, is_negative=0):
        polarities = [0] * len(self.pos_phrases_hits)
        for i in range(len(self.pos_phrases_hits)):
            polarities[i] = math.log(
                (self.pos_phrases_hits[i] * self.neg_hits) / (self.neg_phrases_hits[i] * self.pos_hits), 2)
        avg = sum(polarities) / len(polarities)
        if (avg > 0 and is_negative == 0) or (avg < 0 and is_negative == 1):
            self.accuracy += 1
            return 1
        return 0
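To make the formula computed in calculate_sentiment() concrete, the sketch below applies it to two phrases with made-up hit counts. All the numbers are hypothetical, and the function name semantic_orientation and its smoothing parameter are illustrative; the 0.01 added to every count plays the same role as the initial values of the counters in the class.

```python
import math

def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor,
                         smoothing=0.01):
    """log2 ratio of the smoothed hit counts; `smoothing` avoids division by zero."""
    return math.log(
        ((near_excellent + smoothing) * (hits_poor + smoothing)) /
        ((near_poor + smoothing) * (hits_excellent + smoothing)), 2)

# hypothetical hit counts for two phrases of one review
phrase_hits = [(8, 2), (0, 3)]        # (near "excellent", near "poor")
hits_excellent, hits_poor = 100, 100  # hypothetical totals in the training data

scores = [semantic_orientation(e, p, hits_excellent, hits_poor)
          for e, p in phrase_hits]
average = sum(scores) / len(scores)
label = "positive" if average > 0 else "negative"
print(scores, average, label)
```

In this made-up case the first phrase has a positive orientation and the second a strongly negative one, because it was never found near “excellent”; the average is below zero, so the review would be labelled negative.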


In this section, the performance of the three methods is analyzed and compared using statistical metrics and personal considerations coming from the development.
The metrics used for the evaluations are:

  • Accuracy ⇒ the fraction of all evaluations that are correct (percentage of correctly classified reviews)
  • Recall ⇒ the fraction of the items that actually belong to class “C” to which the algorithm assigns the label “C”
  • Precision ⇒ the fraction of the items labeled as “C” that actually belong to class “C”

Recall and precision must be computed for each class, so this section reports both the positive and the negative values of these indicators.
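As a minimal sketch of how these metrics follow from a 2x2 confusion matrix, the helper below computes accuracy and the per-class recall and precision. The function name `binary_metrics` and its argument names are illustrative and are not part of the code on this page; the sample counts are those of the lexicon-based confusion matrix reported further down.

```python
def binary_metrics(pos_as_pos, pos_as_neg, neg_as_pos, neg_as_neg):
    """Return accuracy and per-class recall/precision from a 2x2 confusion matrix.

    Arguments are counts named <actual class>_as_<predicted class>.
    Every class is assumed to have at least one actual and one predicted
    item, otherwise a division by zero occurs.
    """
    total = pos_as_pos + pos_as_neg + neg_as_pos + neg_as_neg
    return {
        "accuracy": (pos_as_pos + neg_as_neg) / total,
        "recall_pos": pos_as_pos / (pos_as_pos + pos_as_neg),
        "precision_pos": pos_as_pos / (pos_as_pos + neg_as_pos),
        "recall_neg": neg_as_neg / (neg_as_neg + neg_as_pos),
        "precision_neg": neg_as_neg / (neg_as_neg + pos_as_neg),
    }


if __name__ == "__main__":
    # Counts taken from the lexicon-based confusion matrix on this page.
    for name, value in binary_metrics(42, 8, 30, 20).items():
        print(f"{name}: {value:.1%}")
```

Recall divides by the number of items that really belong to the class (a row of the matrix), while precision divides by the number of items predicted as that class (a column).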


Lexicon-Based Approach

The lexicon-based approach achieved the following values:

  • Accuracy = 62.0 %
  • Recall Positive = 84.0 %
  • Precision Positive = 58.3 %
  • Recall Negative = 40.0 %
  • Precision Negative = 71.4 %

The confusion matrix is:

Actual class | Predicted Positive | Predicted Negative
Positive | 42 | 8
Negative | 30 | 20

The metrics above make clear that the algorithm struggles to recognize negative reviews. Many reviews express a negative opinion without using strongly negative words. Moreover, a negative word is often replaced by a negated positive one: “ugly” can be written as “not beautiful”, in which case the engine counts one positive and one negative word, and this completely changes many results. Finally, the meaning of a word in its context is not taken into account. For example, the sentence “I would like to go home” is neutral, but the algorithm evaluates it as positive because it contains the word “like”. On the other hand, the high precision on the negative class shows that when a review is predicted to be negative, the prediction is correct most of the time. For positive reviews the behavior is the opposite: almost all positive reviews are classified as such, but a large share of the reviews predicted as positive (30 out of 72) are actually negative. As stated before, this method is not very reliable, as the metrics clearly show, but it is a good example for understanding the limits of such a basic algorithm.
A remaining strength of this approach is its speed: it takes only a few seconds to produce an output, since no model has to be trained.
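The two failure cases described above (“not beautiful” and “I would like to go home”) can be reproduced with a minimal word-counting sketch. The word lists and the function name `naive_lexicon_label` are assumptions made for illustration; they are not the lists or the code used for the results on this page.

```python
import re

# Tiny illustrative word lists (assumptions, not the lists used on this page).
# "not" is placed in the negative list so that a negation counts as a negative word.
POSITIVE_WORDS = {"beautiful", "like", "good", "incredible"}
NEGATIVE_WORDS = {"ugly", "fragile", "bad", "poor", "not"}


def naive_lexicon_label(text):
    """Label a text by counting positive and negative words, ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE_WORDS for word in words)
    neg = sum(word in NEGATIVE_WORDS for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(naive_lexicon_label("This phone is ugly"))           # negative
    print(naive_lexicon_label("This phone is not beautiful"))  # neutral
    print(naive_lexicon_label("I would like to go home"))      # positive
```

With these lists, “not beautiful” yields one positive and one negative hit that cancel out, so the clearly negative sentence is not recognized as such, while the neutral sentence is labeled positive only because of the word “like”.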

Machine Learning - Naive Bayes

The Naive Bayes classifier achieved the following values:

  • Accuracy = 79.0 %
  • Recall Positive = 68.0 %
  • Precision Positive = 87.1 %
  • Recall Negative = 90.0 %
  • Precision Negative = 73.7 %

The confusion matrix is:

Actual class | Predicted Positive | Predicted Negative
Positive | 34 | 16
Negative | 5 | 45

As expected, the results of the machine learning algorithm are much more robust and reliable than those of the lexicon-based one. Having more than 20,000 reviews to train the model yields a solid algorithm that predicts the sentiment of most reviews correctly. In contrast with the previous method, here the errors are concentrated on the positive reviews: 16 of the 50 positive reviews are classified as negative, while only 5 of the 50 negative reviews are classified as positive. Recall and precision therefore show the opposite pattern to the lexicon-based approach: high recall on the negative class and high precision on the positive class. Since the algorithm still works on single words, it can make mistakes similar to those described for the previous method. The algorithm is much slower than the previous solution, since thousands of reviews must be analyzed to build the frequency dictionary and train the model. The time needed ranges from 20-30 seconds on a powerful PC to a few minutes on older hardware. For the sake of completeness, the running time can be reduced drastically if the trained model is saved to disk and loaded when needed: no time is spent on training, and only the new review has to be evaluated.
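Saving the trained model to disk, as suggested above, could be sketched with the standard `pickle` module. The helper name `load_or_train`, the file name, and the `train_fn` callback are hypothetical; the sketch assumes that the trained model is a picklable Python object.

```python
import pickle
from pathlib import Path


def load_or_train(model_path, train_fn):
    """Load a pickled model if the file exists, otherwise train it and save it."""
    path = Path(model_path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    model = train_fn()  # the slow step: runs only when no saved model is found
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return model
```

A call such as `load_or_train("naive_bayes.pickle", train_classifier)`, where `train_classifier` stands for whatever function builds the model, trains on the first run and only reads the file on later runs. Pickle files should be loaded only if they were produced locally, since unpickling untrusted data can execute arbitrary code.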

Turney's Algorithm

Turney's algorithm achieved the following values:

  • Accuracy = 57.0 %
  • Recall Positive = 70.0 %
  • Precision Positive = 36.8 %
  • Recall Negative = 41.9 %
  • Precision Negative = 74.2 %

The confusion matrix is:

Actual class | Predicted Positive | Predicted Negative
Positive | 21 | 9
Negative | 36 | 26

The results of Turney's algorithm are the worst obtained. The accuracy is not acceptable and the number of wrongly predicted reviews is large. This may seem strange, since this approach is the only one that works with groups of words instead of single words, but other limitations arise. Above all, the algorithm needs far more reviews than were used in this assignment: Turney worked with the AltaVista Advanced Search engine, which indexed approximately 350 million web pages, against the 20 thousand reviews used here. As a consequence, a phrase extracted from a test review is rarely found in the training data, and no information can be extracted from it. For example, if the phrase “very inconvenient” in a test review occurs nowhere else in the whole dataset, the engine cannot assign a value to it, even though its meaning is clearly negative. This risk is greatly reduced when working with millions of reviews, but a standard PC could not analyze that much data. Moreover, the method was developed almost 20 years ago, so it is understandable that its quality is much lower than that of more advanced techniques.
The running time of the algorithm is also very poor, since every phrase of every test review must be searched for in every training review and checked against “excellent” and “poor”. This takes from 5 minutes to more than half an hour.
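The effect of a phrase that never occurs in the training data can be made concrete with a small sketch of the log-ratio used in `calculate_sentiment`. The function name `semantic_orientation` and the hit counts in the example are illustrative assumptions; the 0.01 smoothing term mirrors the initial value of the counters in the listing above.

```python
import math

SMOOTHING = 0.01  # same initial value the listing above gives to every counter


def semantic_orientation(near_excellent, near_poor, excellent_hits, poor_hits):
    """Base-2 log ratio of smoothed hit counts, as in calculate_sentiment.

    near_excellent / near_poor: co-occurrences of the phrase with each reference word.
    excellent_hits / poor_hits: occurrences of each reference word in the training data.
    """
    return math.log(
        ((near_excellent + SMOOTHING) * (poor_hits + SMOOTHING))
        / ((near_poor + SMOOTHING) * (excellent_hits + SMOOTHING)),
        2,
    )


if __name__ == "__main__":
    # Made-up corpus-wide counts, for illustration only.
    excellent_hits, poor_hits = 120, 80
    # Two different phrases that never co-occur with either reference word.
    print(semantic_orientation(0, 0, excellent_hits, poor_hits))
    print(semantic_orientation(0, 0, excellent_hits, poor_hits))
    # A phrase seen three times near "poor" and never near "excellent".
    print(semantic_orientation(0, 3, excellent_hits, poor_hits))
```

For an unseen phrase the smoothing terms cancel, so the score depends only on the corpus-wide ratio between “poor” and “excellent” and is identical for every such phrase: the phrase itself contributes no information, which is the situation described above for “very inconvenient”.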

Limitations and Challenges

The three algorithms shown above are only a few of the possible approaches to Sentiment Analysis; many more powerful and robust algorithms exist. Even though this branch of NLP keeps improving, some challenges related to natural language and semantics are not yet solved. This section presents a brief list of these limitations.

The first limitation is the presence of irony and sarcasm. Irony consists in stating the opposite of what one thinks in order to ridicule or emphasize a concept. Irony implies criticism, but it differs from sarcasm, which also implies contempt. Both often use positive words to express a negative concept, which makes them very hard for standard Sentiment Analysis engines to detect. The following example clarifies the concept.
A review of an airline could be: “Thank you Company! I'm so happy to wait 3 hours on the plane before take-off. I can't wait to fly again with this beautiful flight company”.
A human reader easily labels this review as negative. This is not trivial for a computer, which extracts positive words from the text: words like “thank”, “happy”, and “beautiful” make the algorithm believe that the sentiment of the review is positive.
The solution to this limitation is not trivial at all: researchers are developing AI systems and supervised learning approaches able to identify sarcasm and irony, and this challenge is giving rise to a dedicated branch of NLP.

Another challenge is the use of idioms. Machine learning programs do not necessarily understand a figure of speech. For example, an idiom like “not my cup of tea” confuses the algorithm because it interprets the words literally. Hence, when an idiom is used in a comment or a review, the sentence can be misconstrued by the algorithm or even ignored. To overcome this problem, a Sentiment Analysis platform needs to be trained to understand idioms. With multiple languages, the problem becomes even more complex and difficult to manage.

Negations, as explained previously, are very hard to handle. Words such as “not”, “never”, “cannot”, and “were not” can confuse the model. For example, an algorithm needs to understand that the phrase “I can’t not go to the gym” means that the person intends to go to the gym. This issue can be addressed with advanced machine learning algorithms trained to understand that double negatives cancel each other out and turn a sentence into a positive.
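To illustrate the idea that negations flip polarity and that double negatives cancel out, the sketch below uses a simple rule-based heuristic rather than the trained models mentioned above. The word lists and the function name `score_with_negation` are assumptions made for this example.

```python
import re

# Illustrative word lists (assumptions, not the lists used on this page).
POSITIVE_WORDS = {"beautiful", "good", "happy"}
NEGATIVE_WORDS = {"ugly", "bad", "fragile"}
NEGATIONS = {"not", "never", "no", "cannot", "can't", "don't", "doesn't"}


def score_with_negation(text):
    """Sum word polarities, flipping the sign once per directly preceding negation."""
    score = 0
    pending_negations = 0
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            pending_negations += 1
            continue
        polarity = (word in POSITIVE_WORDS) - (word in NEGATIVE_WORDS)
        if polarity:
            # An odd number of negations flips the polarity, an even number cancels out.
            score += polarity if pending_negations % 2 == 0 else -polarity
        # Negations only affect the word that directly follows them.
        pending_negations = 0
    return score


if __name__ == "__main__":
    print(score_with_negation("This phone is beautiful"))      # 1
    print(score_with_negation("This phone is not beautiful"))  # -1
    print(score_with_negation("This phone is never not good")) # 1
```

The heuristic is deliberately limited: a negation separated from the sentiment word by another word, as in “not very beautiful”, is not handled, which shows why rule-based fixes only go so far.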

Finally, a Sentiment Analysis engine cannot understand the context of a sentence. For example, the word “unpredictable” can be very negative in the review of a self-driving car, where it means that the car is unsafe and unreliable. The same word can be very positive when it describes the plot of a movie, where it means that the movie is neither predictable nor boring. This is again a very complex problem that remains to be solved, and it again requires artificial intelligence able to understand the context of the whole text.

In general, the field of Sentiment Analysis is very deep and much of it is still unexplored. In the coming years, new technologies and algorithms will allow computers to evaluate text more and more precisely.

Downloads and Instructions

The following links allow downloading the .zip files containing the algorithms and the script mentioned above.

The lists of positive and negative words can be downloaded from here, by right-clicking and choosing “Save as” (“Salva con nome” in Italian).

The dataset can be downloaded from the following link. IMPORTANT: registration on the website is required in order to access the dataset.

The procedure to run an algorithm is the following:

  1. Download and install Python 3 on your PC (if not already present)
  2. Download the algorithm zip archive, the script, the dataset and, for the lexicon-based approach, also the lists of positive and negative words
  3. Extract the zip files
  4. If the lexicon-based approach is the one chosen then copy the positive and negative words list into the directory
  5. Copy the dataset file (Cell_Phones_and_Accessories_5.json) inside the script directory
  6. Open the terminal, browse to the script directory and launch the command
    1. python.exe .\ .\Cell_Phones_and_Accessories_5.json on WINDOWS
    2. python3 Cell_Phones_and_Accessories_5.json on MAC/LINUX
  7. The script will produce the file Cell_Phones_and_Accessories_5_filter.json
  8. Copy the new file and paste it to the directory of the algorithm
  9. Browse with the terminal to the directory of the algorithm and install the required libraries with the following commands. NOTE: this operation needs to be done only once, not every time an algorithm is launched
    1. python -m pip install numpy and python -m pip install nltk on WINDOWS
    2. sudo apt install python3-pip (on Debian/Ubuntu, only if pip is not already present on the PC) and then pip3 install nltk numpy on MAC/LINUX
  10. Finally, execute the program by typing
    1. python.exe .\ on WINDOWS
    2. python3 on MAC/LINUX
  11. The output will be the statistical metrics shown in the comparison paragraph.
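For readers who want to see what the reduction step (steps 6 and 7) might look like, here is a minimal sketch. It is not the script distributed on this page: the function names, the random sampling strategy, and the default limit of 20000 are assumptions, and the dataset file is assumed to contain one JSON object per line.

```python
import json
import random


def sample_reviews(lines, limit=20000, seed=0):
    """Parse JSON-lines reviews and return at most `limit` of them, chosen at random."""
    reviews = [json.loads(line) for line in lines if line.strip()]
    if len(reviews) <= limit:
        return reviews
    # A fixed seed keeps the reduced dataset reproducible between runs.
    return random.Random(seed).sample(reviews, limit)


def filter_file(source_path, target_path, limit=20000):
    """Write a reduced copy of `source_path` to `target_path`, one JSON object per line."""
    with open(source_path, encoding="utf-8") as source:
        kept = sample_reviews(source, limit)
    with open(target_path, "w", encoding="utf-8") as target:
        for review in kept:
            target.write(json.dumps(review) + "\n")
```

Under these assumptions, `filter_file("Cell_Phones_and_Accessories_5.json", "Cell_Phones_and_Accessories_5_filter.json")` would produce a reduced file with the name used in step 7.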

If you find any error, or if you want to participate in the editing of this wiki, please contact: admin [at]

You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url:
Last modified: 2021/08/13 17:18 by zioskenz