If you found any error, or if you want to participate in the editing of this wiki, please contact: admin [at] skenz.it
You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url: https://www.skenz.it/compilers/nlp_sentiment_analysis
====== Natural Language Processing - Sentiment Analysis ======
===== Introduction =====
This page gives an introduction to one of the various techniques of Natural Language Processing: Sentiment Analysis.\\
The goals of this assignment are:
  * Providing an overview of this field
  * Showing three different implementations of a Sentiment Analysis engine and comparing their performances
  * Analyzing the limitations that a Sentiment Analysis engine is subject to

===== Overview =====
Sentiment analysis (also known as opinion mining) is a field of natural language processing that deals with building systems for identifying and extracting opinions from text. It is one of the best-known NLP techniques.\\
On this page, the term "sentiment" indicates the polarity of a piece of text, i.e., whether it expresses a positive or a negative opinion.\\
The main applications of Sentiment Analysis are social media monitoring, customer support, brand monitoring, product analysis, and market research.\\ \\
Let's see a basic example to understand what a Sentiment Analysis engine does. Consider two sentences coming from different reviews of a phone: the engine must be able to analyze the text and assign the label "positive" or "negative" to each of them.
^ Input text ^ Real Sentiment ^ Engine Output ^
| "This phone is simply incredible, it takes beautiful pictures and has a big display" | Positive | Positive |
| "This phone is ugly, doesn't ..." | Negative | Positive |
In this example, the engine was able to correctly classify the first sentence, while it didn't understand that the second review was a negative one. \\ \\
For this assignment, I decided to implement three different Sentiment Analysis engines based on different approaches. I decided to use Python since it is user-friendly and very well suited to machine learning, NLP, and classification algorithms: there are hundreds of powerful libraries that provide great support to the developer.
The dataset that I selected comes from amazon.com reviews of cell phones and related accessories, in JSON format, which is really easy to manage using Python. Since the dataset contains several hundred thousand reviews, I created a script that reduces the number of reviews to about twenty thousand, in order to avoid huge running times for the algorithms. Even if having a large quantity of data is better, too many reviews would have made the algorithms impractically slow. The script can be downloaded in the last section of the page.
The link for downloading the dataset is shown in the last section.
Since the algorithms' results must be comparable, all of them are evaluated on the same test set.

===== Engine Implementation =====
In this section, three different engine implementations are shown and compared. The three implementations are:
  * Lexicon Based
  * Machine Learning (Naive Bayes)
  * Turney's Algorithm
These different approaches have been selected to show how Sentiment Analysis can be done, from a very easy and intuitive way to a more complex and sophisticated one. It must be noticed that these are just three possibilities among many others. I selected these ones because they are all based on different theories and have different strengths and weaknesses. \\
The data coming from the dataset is a list of JSON reviews, each having the following format (other fields omitted): \\ \\
<code>
{
  "overall": 5.0,
  "reviewText": "This phone is simply incredible!",
  ...
}
</code>
For the assignment purpose, the useful fields are the //overall// score, i.e. the star rating given by the user, and the //reviewText//, which contains the text of the review.

==== Data Preparation ====
Before diving into the details of the algorithms, it's very important to discuss the data preparation. Sentiment Analysis can be seen as a classification problem, and this leads to the need of pre-processing the data the algorithms are fed with. In this case, this is valid only for the first two approaches, since Turney's algorithm works on the raw text of the reviews.\\
The data processing in the lexicon-based and machine learning approaches is needed since these algorithms work on single words, which are then compared or classified. The process applied to each review is the following:
  * Review tokenization
  * Stop-words and punctuation removal
  * Stem extraction
All these techniques can be easily applied in Python thanks to the NLTK library.\\
In both the proposed algorithms the applied method for data cleaning is the following: the function process_review() receives a text as input and then sequentially calls tokenize_review(), remove_stopwords_punctuations() and get_stem().
The code of the explained functions is shown below:
<code python>
def process_review(t):
    reviews_tokens = tokenize_review(t)
    reviews_clean = remove_stopwords_punctuations(reviews_tokens)
    reviews_stem = get_stem(reviews_clean)
    return reviews_stem
</code>
<code python>
from nltk.tokenize import word_tokenize

def tokenize_review(t):
    review_tokens = word_tokenize(t)
    return review_tokens
</code>
<code python>
import string
from nltk.corpus import stopwords

stopwords_english = stopwords.words('english')
punctuations = string.punctuation

def remove_stopwords_punctuations(review_tokens):
    reviews_clean = []

    for word in review_tokens:
        if word not in stopwords_english and word not in punctuations:
            reviews_clean.append(word)
    return reviews_clean
</code>
<code python>
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def get_stem(reviews_clean):
    reviews_stem = []

    for word in reviews_clean:
        stem_word = stemmer.stem(word)
        reviews_stem.append(stem_word)

    return reviews_stem
</code>

\\
As an example, let's consider the text: ''"this phone is simply incredible, it takes beautiful pictures"''.\\
After the tokenization the result is: ''['this', 'phone', 'is', 'simply', 'incredible', ',', 'it', 'takes', 'beautiful', 'pictures']''\\
Which, after the removal of stop words and punctuation, becomes: ''['phone', 'simply', 'incredible', 'takes', 'beautiful', 'pictures']''\\
And finally, after stemming, looks like: ''['phone', 'simpli', 'incred', 'take', 'beauti', 'pictur']''\\
This is what the chosen algorithms will analyze and process in order to understand the sentiment behind the text.
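For readers who want to try the pipeline without installing NLTK, the three steps can be approximated in pure Python. The tiny stop-word set and the crude suffix-stripping function below are simplified stand-ins for NLTK's ''stopwords'' corpus and ''PorterStemmer'', used only to keep the sketch self-contained:

```python
import re

STOPWORDS = {"this", "is", "it", "a", "the", "and", "has"}  # tiny subset of NLTK's list

def tokenize(text):
    # lowercase and split on non-word characters (word_tokenize is smarter)
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(tokens):
    # strip a few common suffixes; a rough stand-in for Porter stemming
    return [re.sub(r"(ing|ly|es|s)$", "", t) for t in tokens]

def process(text):
    return crude_stem(remove_stopwords(tokenize(text)))
```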
+ | |||
+ | |||
+ | ==== Implementations ==== | ||
+ | |||
=== Lexicon-Based ===
The first algorithm that has been implemented is the most intuitive and basic one. The concept is really simple: each word of a review is searched inside a list of positive and a list of negative words. If the number of words found in the positive list is higher than the number found in the negative list, the whole review is considered positive; vice versa, the review is considered negative. The limits of this solution are evident, but in order to appreciate the more advanced approaches it is important to understand the basic ones as well.
Lists of positive and negative words are widespread and easy to find. The selected ones have been picked because they contain a great number of words and also include common misspellings, which improves the performance and robustness of the engine. The links to the lists of words are available in the last section.
It is important to notice that only the last 50 reviews are used to evaluate the model. This can seem strange, since this approach can be considered an unsupervised algorithm with no need for training data; but in order to compare its properties with the supervised algorithms that will be presented later, it is necessary to have the same testing data for all algorithms.\\ \\
The main steps of the algorithm are the following: two lists of reviews (positive and negative) are obtained using //loadReviews()//, the last 50 of each are kept as test data, the lists of positive and negative words are loaded with //load_words()//, and then each test review is cleaned and its words are counted against the two lists to produce a prediction.

<code python>
import numpy as np

if __name__ == "__main__":
    # select the set of positive and negative reviews
    all_positive_reviews, all_negative_reviews = loadReviews("dataset.json")  # reduced dataset produced by the script

    # IMPORTANT - this algorithm is an unsupervised algorithm! No need for train data
    # so I use the last 50 reviews to be compliant with the other 2 algorithms
    test_pos = all_positive_reviews[-50:]
    test_neg = all_negative_reviews[-50:]

    test_x = test_pos + test_neg
    test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

    # load and clean positive and negative words
    positive_words = load_words("positive-words.txt")
    negative_words = load_words("negative-words.txt")

    correct = 0
    incorrect = 0
    i = 0
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for r in test_x:
        p = 0
        n = 0
        # clean the review
        tmp_r = process_review(r)
        # for each word of the review check if present in positive or negative list of words
        for elem in tmp_r:
            if elem in positive_words:
                p += 1
            elif elem in negative_words:
                n += 1

        # depending on number of pos and neg words check if correct prediction
        if p > n and test_y[i] == 1:
            correct += 1
            tp += 1
        elif p > n and test_y[i] == 0:
            incorrect += 1
            fp += 1
        elif n >= p and test_y[i] == 0:
            correct += 1
            tn += 1
        elif n >= p and test_y[i] == 1:
            incorrect += 1
            fn += 1

        i += 1

    print("Accuracy: " + str(correct / (correct + incorrect)))
    print("Recall: positive = " + str(tp / (tp + fn)) + " negative = " + str(tn / (tn + fp)))
    print("Precision: positive = " + str(tp / (tp + fp)) + " negative = " + str(tn / (tn + fn)))
</code>
In the following portion of code, the loading of the lists of positive and negative words is shown. The function opens the file and reads it line by line, since each word is on a separate line. It is important to notice that empty lines and lines starting with '';'' (comments) are skipped.
<code python>
def load_words(fileName):
    # work with files
    file = open(fileName, "r")
    Lines = file.readlines()
    words_buffer = []
    for l in Lines:
        # skip blank lines and comments starting with ;
        if l.strip() == '' or l[0] == ';':
            continue
        # strip the newline character and surrounding spaces
        l = l.strip('\n')
        l = l.strip(' ')

        if l not in stopwords_english and l not in punctuations:
            words_buffer.append(l)
    file.close()
    return words_buffer
</code>
In the following piece of code, the function that loads the reviews from the JSON file is explained. It exploits the JSON library present in Python and fills two lists, one with positive and one with negative reviews. The distinction is made using the field //overall//: reviews with more than 3 stars are considered positive, the others negative.
<code python>
import json

def loadReviews(fileName):
    file = open(fileName)
    list_pos = []
    list_neg = []
    data = json.load(file)
    for elem in data:
        # reviews with more than 3 stars are considered positive
        if float(elem["overall"]) > 3.0:
            list_pos.append(elem["reviewText"])
        else:
            list_neg.append(elem["reviewText"])
    return list_pos, list_neg
</code>
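As a quick usage check, the same splitting logic can be exercised on a tiny two-review dataset written to a temporary file. The ''load_reviews'' name and the 3-star threshold mirror the function above, rewritten here as a standalone sketch:

```python
import json
import tempfile

def load_reviews(file_name, threshold=3.0):
    """Split reviews into positive/negative lists based on the star rating."""
    with open(file_name) as f:
        data = json.load(f)
    pos = [r["reviewText"] for r in data if float(r["overall"]) > threshold]
    neg = [r["reviewText"] for r in data if float(r["overall"]) <= threshold]
    return pos, neg

# usage: write a two-review dataset and load it back
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"overall": 5.0, "reviewText": "great"},
               {"overall": 1.0, "reviewText": "awful"}], f)
    path = f.name
pos, neg = load_reviews(path)
```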
=== Machine Learning - Naive Bayes ===
The second approach is based on machine learning. Sentiment Analysis is basically a classification problem, so it's possible to exploit well-known solutions to create the engine. I decided to use the Naive Bayes classifier because it is simple, fast, and known to perform very well on text classification tasks. \\
The Naive Bayes approach is based on the posterior probability, which can be computed with the following formula: \\
''P(class | review) = P(review | class) * P(class) / P(review)''\\
The posterior probability is the likelihood (the probability of observing the review given the class) times the prior (the probability of the class itself), divided by the evidence (the probability of the review). \\
Transposed to the Sentiment Analysis case, the "class" is the sentiment (positive or negative) and the "review" is the text to be classified. \\
The most complex element to be computed is the likelihood, since it requires the construction of a frequency dictionary. \\
In simpler words, the algorithm uses the training data to count how many times a word appears in positive and in negative reviews. This information can then be used to understand whether the words of a review are more likely to appear in a positive or in a negative sentence. It must be noted that the evidence is not used in the algorithm, since it is constant for all elements.\\
An example is shown in order to clarify the theory. Assume the following sentences compose the training data; after the arrow, the result of the pre-processing is shown:
  * "The phone is good" => ["phone", "good"]
  * "This item is terrible" => ["item", "terrible"]
  * "This telephone is really good" => ["telephone", "really", "good"]
Of course, the first and last sentences are positive reviews while the second one is negative. When we create the frequency dictionary, its content is the following:
|           ^ Positive ^ Negative ^
^ phone     | 1 | 0 |
^ good      | 2 | 0 |
^ item      | 0 | 1 |
^ terrible  | 0 | 1 |
^ telephone | 1 | 0 |
^ really    | 1 | 0 |

When a new review must be classified, for each word of the review a lookup in the frequency dictionary is made and the polarity of the single word is evaluated.
For example, the review "The item I bought is really really good" will be evaluated as positive since the average sentiment of its words is positive. \\ \\
In the real implementation, logarithms are used in order to avoid numerical underflow when many small probabilities are multiplied. The prior becomes a //logprior//: \\ \\
''logprior = log(D_pos / D_neg) = log(D_pos) - log(D_neg)''\\ \\
where D_pos and D_neg are the numbers of positive and negative training reviews. The likelihood of a word becomes a //loglikelihood//, computed with Laplacian smoothing (V is the number of unique words in the vocabulary, N_pos and N_neg the total word occurrences in each class): \\ \\
''P(w | pos) = (freq_pos + 1) / (N_pos + V)'', ''P(w | neg) = (freq_neg + 1) / (N_neg + V)'', ''loglikelihood(w) = log(P(w | pos) / P(w | neg))''\\ \\
Eventually, the polarity of a review is computed by summing the logprior and the loglikelihood of each word that composes the review: \\ \\
''p = logprior + Σ loglikelihood(w)''\\
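The toy example above can be verified numerically. This sketch computes the logprior and loglikelihood from the frequency table of the three training sentences and scores the review "The item I bought is really really good":

```python
import math

# toy frequency dictionary built from the three training sentences above
freq = {("phone", 1): 1, ("good", 1): 2, ("telephone", 1): 1, ("really", 1): 1,
        ("item", 0): 1, ("terrible", 0): 1}

vocab = {w for (w, _) in freq}
V = len(vocab)                                            # 6 unique words
N_pos = sum(c for (w, y), c in freq.items() if y == 1)    # 5 positive word occurrences
N_neg = sum(c for (w, y), c in freq.items() if y == 0)    # 2 negative word occurrences
logprior = math.log(2) - math.log(1)                      # 2 positive docs, 1 negative doc

loglikelihood = {}
for w in vocab:
    # Laplacian smoothing, as in the formulas above
    p_w_pos = (freq.get((w, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (freq.get((w, 0), 0) + 1) / (N_neg + V)
    loglikelihood[w] = math.log(p_w_pos / p_w_neg)

# score the review's known words: "item", "really" (twice), "good"
score = logprior + loglikelihood["item"] + 2 * loglikelihood["really"] + loglikelihood["good"]
# score > 0, so the review is classified as positive
```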
+ | |||
+ | Since this is a supervised algorithm there is the need of dividing the dataset into training and testing classes. This is done after the call to // | ||
+ | |||
<code python>
import numpy as np

if __name__ == "__main__":
    # select the set of positive and negative reviews
    all_positive_reviews, all_negative_reviews = loadReviews("dataset.json")  # reduced dataset produced by the script

    # split the data to train and test sets
    test_pos = all_positive_reviews[-50:]
    train_pos = all_positive_reviews[:-50]
    test_neg = all_negative_reviews[-50:]
    train_neg = all_negative_reviews[:-50]

    train_x = train_pos + train_neg
    test_x = test_pos + test_neg
    # set label to 1 for positive reviews and 0 for negative ones
    train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
    test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

    # create the frequency dictionary counting how many times a word appears in positive/negative reviews
    frequency = create_frequency(train_x, train_y)

    # train the model
    logprior, loglikelihood = train_naive_bayes(frequency, train_y)

    i = 0
    correct = 0
    incorrect = 0
    tp = 0
    tn = 0
    fp = 0
    fn = 0
    for review in test_x:
        p = naive_bayes_predict(review, logprior, loglikelihood)

        # count true positive, true negative, false positive and false negative -> recall, precision
        if p > 0 and test_y[i] == 1:
            tp += 1
        elif p > 0 and test_y[i] == 0:
            fp += 1
        elif p < 0 and test_y[i] == 0:
            tn += 1
        else:
            fn += 1

        # count correct and incorrect prediction -> accuracy
        if (p > 0 and test_y[i] == 1) or (p <= 0 and test_y[i] == 0):
            correct += 1
        else:
            incorrect += 1
        i = i + 1

    print("Accuracy: " + str(correct / (correct + incorrect)))
    print("Recall: positive = " + str(tp / (tp + fn)) + " negative = " + str(tn / (tn + fp)))
    print("Precision: positive = " + str(tp / (tp + fp)) + " negative = " + str(tn / (tn + fn)))
</code>
The following piece of code shows the function used to create the frequency dictionary. The key of the dictionary is a tuple (word, pos/neg) while the value is the number of repetitions of the word. //process_review()// is called on each review, so the counting is done on the cleaned tokens.
<code python>
def create_frequency(reviews, labels):
    freq_d = {}
    # create the frequency dictionary
    # zip creates tuples (text, pos/neg label)
    for review, y in zip(reviews, labels):
        # before counting the frequency we preprocess the text
        for word in process_review(review):
            pair = (word, y)
            # if present increase the count, else add to the dict
            if pair in freq_d:
                freq_d[pair] += 1
            else:
                freq_d[pair] = 1
    return freq_d
</code>
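A condensed, runnable version of the same counting logic (using a plain ''split()'' in place of //process_review()//, just to keep the sketch standalone):

```python
def create_frequency(reviews, labels):
    """Count how many times each (word, label) pair occurs in the training data."""
    freq_d = {}
    for review, y in zip(reviews, labels):
        # the real engine would call process_review() here instead of split()
        for word in review.lower().split():
            pair = (word, y)
            freq_d[pair] = freq_d.get(pair, 0) + 1
    return freq_d

freqs = create_frequency(["good phone", "terrible item", "really good phone"], [1, 0, 1])
```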
This part of the code is the function that is used to train the model. It computes the loglikelihood and the logprior. The first step is to calculate the number of unique words in the frequency dictionary, then the total numbers of positive and negative word occurrences. Also, the numbers of positive and negative reviews are counted and used to compute the prior, which is the ratio between the number of positive reviews and the number of negative ones; using the properties of logarithms, this ratio becomes a difference of logarithms. Finally, the probabilities of each word being positive and negative are computed and used to obtain the loglikelihood of that word.
<code python>
def train_naive_bayes(freq, review_labels):
    loglikelihood = {}

    # calculate the number of unique words in the vocabulary
    unique_words = set([pair[0] for pair in freq.keys()])
    V = len(unique_words)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freq.keys():
        # if the word was in a positive review
        if pair[1] > 0:
            N_pos += freq[pair]
        else:
            N_neg += freq[pair]

    # calculate the number of documents (reviews)
    # shape[0] returns the number of rows of an array
    D = review_labels.shape[0]

    # calculate D_pos, the number of positive documents (reviews)
    D_pos = sum(review_labels)

    # calculate D_neg, the number of negative documents (reviews)
    D_neg = D - sum(review_labels)

    # calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # for each word
    for word in unique_words:
        # get the positive and negative frequency of the word
        freq_pos = freq.get((word, 1), 0)
        freq_neg = freq.get((word, 0), 0)

        # calculate the probability that the word is positive, and negative (Laplacian smoothing)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    return logprior, loglikelihood
</code>
Finally, the function //naive_bayes_predict()// computes the polarity of a review by summing the logprior and the loglikelihood of each word of the (pre-processed) review that is present in the dictionary. If the result is greater than zero the review is labeled as positive, otherwise as negative.
<code python>
def naive_bayes_predict(review, logprior, loglikelihood):

    # process the review to get a list of words
    word_l = process_review(review)
    prob = 0

    # add the logprior
    prob += logprior

    for word in word_l:
        if word in loglikelihood:
            prob += loglikelihood[word]

    return prob
</code>

=== Turney's Algorithm ===
The last approach that is presented is based on the work of professor Peter D. Turney, described in his 2002 paper "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews".\\
The basic steps of this algorithm are:
  * Extract the phrasal lexicon from the reviews
  * Learn the polarity of each phrase
  * Rate the review based on the average polarity of its phrases

First, a part-of-speech tagger is applied to the review. Two consecutive words are extracted from the review if their tags conform to any of the patterns in the table shown below. The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are verbs. The second pattern, for example, means that two consecutive words are extracted if the first word is an adverb and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. The table has been defined by professor Turney in order to extract the meaningful parts of the review: he observed that these are the parts of the text where the real sentiment information is typically stored.
^ First word ^ Second word ^ Third word (not extracted) ^
| JJ | NN or NNS | anything |
| RB, RBR, or RBS | JJ | not NN nor NNS |
| JJ | JJ | not NN nor NNS |
| NN or NNS | JJ | not NN nor NNS |
| RB, RBR, or RBS | VB, VBD, VBN, or VBG | anything |
The extraction of part-of-speech tags is not a trivial task, and it is an NLP problem in itself. For this development, the //pos_tag// function of the NLTK library has been used.

The polarity of each extracted phrase is estimated with PMI-IR, which uses Pointwise Mutual Information (PMI), a measure of association between pairs of words or phrases. It is computed according to this formula: \\ \\
''PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) * p(word2)) )''\\ \\
The semantic orientation of a phrase is then computed as \\ \\
''SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")''\\ \\
Finally, the SO values of the phrases are averaged to obtain the sentiment of the review.\\
In practice, the SO computation is made using the NEAR operator and, using the properties of logarithms, becomes: \\ \\
''SO(phrase) = log2( (hits(phrase NEAR "excellent") * hits("poor")) / (hits(phrase NEAR "poor") * hits("excellent")) )''\\ \\
This change was made by professor Turney because of the search engine he was using. In Python, the NEAR operator can be easily emulated using a regular expression. Let's now see the algorithm implementation.\\
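A toy computation of the semantic orientation formula, with made-up hit counts, shows how the sign of SO encodes the polarity:

```python
import math

def semantic_orientation(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor):
    # SO = log2( (hits(phrase NEAR "excellent") * hits("poor"))
    #          / (hits(phrase NEAR "poor") * hits("excellent")) )
    return math.log2((hits_near_excellent * hits_poor) /
                     (hits_near_poor * hits_excellent))

# a phrase seen 8 times near "excellent" and 2 times near "poor",
# with the two reference words equally frequent overall
so = semantic_orientation(8, 2, 100, 100)   # log2(4) = 2.0 -> positive phrase
```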
The main of the program is really simple: it builds the datasets with //make_datasets()//, instantiates a //Turney// object and runs the algorithm.
<code python>
if __name__ == "__main__":
    FILE_PATH = 'dataset.json'  # reduced dataset produced by the script
    datasets = make_datasets(FILE_PATH)
    turney = Turney(datasets)
    turney.turney()
</code>
The function //make_datasets()// loads the reviews with //loadReviews()// and builds a dictionary with the train and test splits, keeping (as for the other algorithms) the last 50 positive and the last 50 negative reviews for testing.
<code python>
def make_datasets(fileName):
    all_positive_reviews, all_negative_reviews = loadReviews(fileName)
    dataset = {'train': {}, 'test': {}}
    dataset['train']['pos'] = all_positive_reviews[:-50]
    dataset['train']['neg'] = all_negative_reviews[:-50]
    dataset['test']['pos'] = all_positive_reviews[-50:]
    dataset['test']['neg'] = all_negative_reviews[-50:]
    return dataset
</code>
This section of the code shows how the phrasal lexicons are extracted. The //postag// parameter is the output of the NLTK //pos_tag// function, i.e. a list of (word, tag) tuples; the function scans it and appends to the result the two-word phrases whose tags match the patterns of the table above.
<code python>
def find_pattern(postag):
    tag_pattern = []
    for k in range(len(postag) - 2):
        if postag[k][1] == "JJ" and postag[k + 1][1] in ("NN", "NNS"):
            tag_pattern.append(postag[k][0] + " " + postag[k + 1][0])

        elif (postag[k][1] in ("RB", "RBR", "RBS") and postag[k + 1][1] == "JJ" and
              postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS"):
            tag_pattern.append(postag[k][0] + " " + postag[k + 1][0])

        elif (postag[k][1] == "JJ" and postag[k + 1][1] == "JJ" and
              postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS"):
            tag_pattern.append(postag[k][0] + " " + postag[k + 1][0])

        elif (postag[k][1] in ("NN", "NNS") and postag[k + 1][1] == "JJ" and
              postag[k + 2][1] != "NN" and postag[k + 2][1] != "NNS"):
            tag_pattern.append(postag[k][0] + " " + postag[k + 1][0])

        elif (postag[k][1] in ("RB", "RBR", "RBS") and
              postag[k + 1][1] in ("VB", "VBD", "VBN", "VBG")):
            tag_pattern.append(postag[k][0] + " " + postag[k + 1][0])
    return tag_pattern
</code>
The NEAR operator used by professor Turney in his studies has been implemented with the Python regular expression library. The function counts, inside the //text// string, the occurrences of the //phrase// parameter within a window of ten words from the //word// parameter (which will be "excellent" or "poor"), in either order.
<code python>
import re

def near_operator(phrase, word, text):
    try:
        # phrase within 10 words of word, in either order
        string = (word + r'\W+(?:\w+\W+){0,10}?' + phrase +
                  r'|' + phrase + r'\W+(?:\w+\W+){0,10}?' + word)
        freq_phrase_near_word = len(re.findall(string, text))
        return freq_phrase_near_word
    except re.error:
        return 0
</code>
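A quick check of the regular-expression NEAR emulation (a compact, standalone copy of the function above, with the ten-word window made an explicit parameter):

```python
import re

def near(phrase, word, text, window=10):
    """Count occurrences of `phrase` within `window` words of `word`, in either order."""
    pattern = (word + r'\W+(?:\w+\W+){0,' + str(window) + r'}?' + phrase +
               r'|' + phrase + r'\W+(?:\w+\W+){0,' + str(window) + r'}?' + word)
    return len(re.findall(pattern, text))

hits = near("big display", "excellent", "an excellent phone with a big display")   # 1
```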
Finally, the implementation of the class //Turney// is shown. When an object is instantiated, the constructor stores the datasets and initializes the hit counters; the counters start from 0.01 instead of 0 to avoid divisions by zero inside the logarithm.\\
Let's now analyse the core of the algorithm: //turney()// iterates over the positive and negative test reviews, extracts the phrasal lexicon of each review and, for every training review, counts how many times each phrase appears near "excellent" and near "poor", together with the total occurrences of the two reference words. //calculate_sentiment()// then averages the semantic orientation of the phrases and checks whether the prediction is correct.
<code python>
import math
import nltk

class Turney(object):

    def __init__(self, dataset):
        self.datasets = dataset
        self.pos_phrases_hits = []
        self.neg_phrases_hits = []
        self.pos_hits = 0.01
        self.neg_hits = 0.01
        self.accuracy = 0

    def turney(self):
        tp = 0
        fp = 0
        tn = 0
        fn = 0
        # boolean is 0 for positive test reviews, 1 for negative ones
        for boolean, test_klass in enumerate(['pos', 'neg']):
            for i, data in enumerate(self.datasets['test'][test_klass]):
                print(str(i) + " out of " + str(len(self.datasets['test'][test_klass])))

                phrases = find_pattern(nltk.pos_tag(nltk.word_tokenize(data)))
                if len(phrases) == 0:
                    continue
                self.pos_phrases_hits = [0.01] * len(phrases)
                self.neg_phrases_hits = [0.01] * len(phrases)
                self.pos_hits = 0.01
                self.neg_hits = 0.01

                for train_klass in ['pos', 'neg']:
                    for text in self.datasets['train'][train_klass]:
                        for ind, phrase in enumerate(phrases):
                            self.pos_phrases_hits[ind] += near_operator(phrase, "excellent", text)
                            self.neg_phrases_hits[ind] += near_operator(phrase, "poor", text)
                        self.pos_hits += text.count("excellent")
                        self.neg_hits += text.count("poor")
                res = self.calculate_sentiment(boolean)
                # res is 1 when the prediction is correct
                if res == 1 and boolean == 0:
                    tp += 1
                elif res == 1 and boolean == 1:
                    tn += 1
                elif res == 0 and boolean == 0:
                    fn += 1
                elif res == 0 and boolean == 1:
                    fp += 1

        print("Accuracy: " + str((tp + tn) / (tp + tn + fp + fn)))
        print("Recall: positive = " + str(tp / (tp + fn)) + " negative = " + str(tn / (tn + fp)))
        print("Precision: positive = " + str(tp / (tp + fp)) + " negative = " + str(tn / (tn + fn)))

    def calculate_sentiment(self, is_negative=0):
        polarities = [0] * len(self.pos_phrases_hits)
        for i in range(len(self.pos_phrases_hits)):
            polarities[i] = math.log(
                (self.pos_phrases_hits[i] * self.neg_hits) / (self.neg_phrases_hits[i] * self.pos_hits), 2)
        avg = sum(polarities) / len(polarities)
        if (avg > 0 and is_negative == 0) or (avg < 0 and is_negative == 1):
            self.accuracy += 1
            return 1
        return 0
</code>

==== Comparison ====
In this section, the performances of the three methods are analyzed and compared using statistical indexes and personal considerations coming from the development.\\
The metrics used for the evaluations are:
  * Accuracy => the fraction of correct evaluations over the total number of evaluations (percentage of correctly evaluated reviews)
  * Recall => the ability of the algorithm to assign the items that really belong to a class to that class (e.g., the fraction of truly positive reviews that are labeled as positive)
  * Precision => the fraction of items assigned to a class that really belong to that class (e.g., the fraction of reviews labeled as positive that are truly positive)
The recall and precision metrics must be computed for each possible evaluation class, so in this section both the positive and the negative values of these indicators are reported.
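All three indicators are derived from the confusion-matrix counts; a small helper makes the relationship explicit. The counts in the usage line (tp=42, fp=30, tn=20, fn=8) are the ones derivable from the lexicon-based engine's accuracy and recall values reported below:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall and precision for both classes, from raw confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall_pos = tp / (tp + fn)        # truly positive reviews labeled positive
    precision_pos = tp / (tp + fp)     # reviews labeled positive that really are
    recall_neg = tn / (tn + fp)
    precision_neg = tn / (tn + fn)
    return accuracy, recall_pos, precision_pos, recall_neg, precision_neg

acc, rp, pp, rn, pn = metrics(tp=42, fp=30, tn=20, fn=8)
```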
=== Lexicon-Based ===
The lexicon-based approach has achieved the following values:
  * Accuracy = 62.0 %
  * Recall Positive = 84.0 %
  * Precision Positive = 58.3 %
  * Recall Negative = 40.0 %
  * Precision Negative = 71.4 %
The confusion matrix is:
|                 ^ Predicted Positive ^ Predicted Negative ^
^ Actual Positive | 42 | 8 |
^ Actual Negative | 30 | 20 |
The metrics shown above make clear that the algorithm has big problems in distinguishing the negative reviews. This is due to the fact that many reviews describe a negative concept without using strongly negative words. Moreover, a negative judgment is often expressed by negating a positive word: a review containing ''"not good"'' is counted towards the positive class, because "good" is in the positive list and the negation is invisible to the algorithm. \\
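The negation weakness is easy to reproduce with the counting scheme (the two tiny word sets below are stand-ins for the real lists):

```python
positive_words = {"good", "great", "incredible"}   # tiny stand-ins for the real lists
negative_words = {"bad", "ugly", "terrible"}

def lexicon_verdict(text):
    tokens = text.lower().split()
    p = sum(t in positive_words for t in tokens)
    n = sum(t in negative_words for t in tokens)
    return "positive" if p > n else "negative"

verdict = lexicon_verdict("this phone is not good")   # negation is invisible to the counter
```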
One last strength of this approach is the time needed for the analysis: it takes only a few seconds to produce an output, since it doesn't require any training phase.
+ | |||
=== Machine Learning - Naive Bayes ===
The Naive Bayes classifier has achieved the following values:
  * Accuracy = 79.0 %
  * Recall Positive = 68.0 %
  * Precision Positive = 87.2 %
  * Recall Negative = 90.0 %
  * Precision Negative = 73.7 %

The confusion matrix is:
|                 ^ Predicted Positive ^ Predicted Negative ^
^ Actual Positive | 34 | 16 |
^ Actual Negative | 5  | 45 |

As could be expected, the results of the machine learning algorithm are much more robust and reliable than those of the lexicon-based one. Having more than 20'000 reviews to train the model allows building a solid algorithm, able to correctly predict the sentiment of the reviews. The recall values show that negative reviews are now detected much better, while the precision on the negative class is lower because several positive reviews are labeled as negative. Since the algorithm still works on single words, mistakes of the kind discussed for the lexicon-based method remain possible.
The algorithm is much slower than the previous solution, since thousands of reviews must be analyzed to create the frequency dictionary and train the model. The needed time can go from 20-30 seconds on a powerful PC to some minutes on older hardware. For the sake of completeness: almost all of this time is spent building the frequency dictionary and training, while predicting a single review is immediate.

=== Turney's Algorithm ===
Turney's algorithm has achieved the following values:
  * Accuracy = 57.0 %
  * Recall Positive = 70.0 %
  * Precision Positive = ...
  * Recall Negative = 41.9 %
  * Precision Negative = 74.2 %

The confusion matrix is:
|                 ^ Predicted Positive ^ Predicted Negative ^
^ Actual Positive | ... | ... |
^ Actual Negative | ... | ... |

The results of Turney's algorithm are the worst among the three approaches, and even in this case the negative reviews are the hardest ones to classify, as the low negative recall shows. \\
Moreover, the running time of the algorithm is by far the longest of the three, since every phrasal lexicon of every test review must be searched, with the NEAR operator, inside every training review, for both "excellent" and "poor".

===== Limitations and Challenges =====
The algorithms shown before are just three possible approaches to the world of Sentiment Analysis, and dozens of other powerful and robust algorithms exist. Even if this branch of NLP is improving day after day, some challenges related to natural languages and to semantics are not yet solved. In this section, a brief list of these limitations is presented. \\ \\
The first limitation is related to the presence of **irony** and **sarcasm**. Irony consists in stating the opposite of what one thinks in order to ridicule or underline concepts. Irony implies criticism, but it differs markedly from sarcasm, which also implies contempt. This often implies the use of positive words to underline a negative concept. Irony and sarcasm are very complex to detect for standard Sentiment Analysis engines. An example clarifies the concept. \\
The review of an airline company could be: ''"Thanks for the wonderful trip, I really enjoyed waiting three hours on the runway!"''\\
If this review is read by a human, it can easily be labeled as negative. This is not trivial for a computer, which extracts positive words from the text: words like "thanks", "wonderful" and "enjoyed" push the evaluation towards a positive verdict, while the real sentiment is the opposite. \\
The solution to this limitation is not trivial at all: researchers are developing AI systems and supervised learning approaches able to find and identify sarcasm and irony. This challenge of Sentiment Analysis is leading to a new branch of NLP dedicated to this issue. \\ \\
Another challenge is the use of **idioms** inside a text. Machine learning programs don't necessarily understand a figure of speech: for example, an idiom like ''"not my cup of tea"'' expresses a negative opinion without containing any explicitly negative word, so an engine working on single words will likely misjudge it. \\ \\
The use of **negations**, as already discussed for the lexicon-based approach, can invert the polarity of a sentence while keeping positive words inside it (e.g. ''"not good at all"''), and engines that work on single words cannot capture this. \\ \\
Eventually, a Sentiment Analysis engine cannot understand the **context** of a sentence. For example, the word ''"unpredictable"'' is negative in the review of a car ("unpredictable steering") but positive in the review of a movie ("unpredictable plot"). \\ \\
In general, the Sentiment Analysis field is very deep and much of it is still unexplored. In the coming years, new technologies and algorithms will allow more and more precise evaluation of text by means of a computer.

===== Downloads and Instructions =====
The following links allow downloading some .zip files containing the algorithms and the script mentioned above.
  * Lexicon-based engine
  * Naive Bayes engine
  * Turney's algorithm engine
  * Dataset-reduction script

From here the positive and negative lists of words can be downloaded. It can be done by clicking the right mouse button and then "Save as" or "Salva con nome".
  * Positive words list
  * Negative words list

The dataset can be downloaded from the following link. IMPORTANT: in order to access the dataset a **registration** to the website is needed. \\
  * Amazon "Cell Phones and Accessories" reviews dataset

The procedure to run an algorithm is the following:
  - Download and install [[https://www.python.org|Python 3]]
  - Download the algorithm zip archive, the script, the dataset and, for the lexicon-based approach, also the lists of positive and negative words
  - Extract the zip files
  - If the lexicon-based approach is the chosen one, copy the positive and negative words lists into its directory
  - Copy the dataset file into the //script// directory
  - Open the terminal, browse to the //script// directory and launch the command (the exact script name is inside the archive)
    - ''python3 <script_name>.py'' (Linux/macOS)
    - ''python <script_name>.py'' (Windows)
  - The script will produce the reduced dataset file
  - Copy the new file and paste it into the directory of the algorithm
  - Browse with the terminal to the directory of the algorithm and install the required libraries. NOTE: this operation needs to be done only once, not every time an algorithm is launched
    - ''pip3 install nltk numpy'' (Linux/macOS)
    - ''pip install nltk numpy'' (Windows)
  - Finally, execute the program by typing (the exact file name is inside the archive)
    - ''python3 <algorithm_name>.py'' (Linux/macOS)
    - ''python <algorithm_name>.py'' (Windows)
  - The output will be the statistical metrics shown in the comparison paragraph.
