Sentiment analysis is an important field of natural language processing (NLP). It is used to identify the sentiment of a given piece of text – whether it is positive, negative, or neutral. Sentiment analysis has many applications, such as brand monitoring, market research, and customer service analysis. In this tutorial, we will learn how to build a sentiment analysis model with Python.
What is Sentiment Analysis?
Sentiment analysis is the process of identifying and extracting subjective information from a piece of text. This subjective information is usually related to emotions or opinions, and can be classified as positive, negative, or neutral. Sentiment analysis can be used to analyze customer feedback, social media posts, news articles, and other types of text.
Building a Sentiment Analysis Model with Python
In this tutorial, we will use the Natural Language Toolkit (NLTK) library along with the Python programming language to build a sentiment analysis model. NLTK is a powerful library that provides tools for NLP tasks such as tokenization, stemming, and tagging.
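As a quick illustration of those tools, here is a minimal sketch (newer NLTK versions may ask you to download 'punkt_tab' instead of 'punkt'):

import nltk
from nltk.stem import PorterStemmer

nltk.download('punkt')  # one-time download of the tokenizer models

tokens = nltk.word_tokenize("The movies were surprisingly good.")
print(tokens)  # ['The', 'movies', 'were', 'surprisingly', 'good', '.']

stemmer = PorterStemmer()
print([stemmer.stem(token) for token in tokens])  # ['the', 'movi', 'were', 'surprisingli', 'good', '.']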
Step 1: Installing NLTK
The first step is to install NLTK. You can install NLTK using pip, which is a package manager for Python. Open your command prompt or terminal and type the following command:
pip install nltk
Step 2: Importing Libraries
After installing NLTK, we need to import the libraries that we will use in our sentiment analysis model. We will import the following libraries:
- nltk – This is the Natural Language Toolkit library.
- re – This library provides regular expression matching operations.
Let’s import these libraries:
import nltk
import re
Step 3: Loading the Dataset
The next step is to load the dataset that we will use to build our sentiment analysis model. We will use NLTK's movie_reviews corpus, a collection of 2,000 movie reviews labeled as positive or negative. You can download this dataset through NLTK with the following command:
nltk.download('movie_reviews')
After downloading the dataset, we can load it using the following code:
from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]
The above code will load the movie reviews corpus into the variable named ‘reviews’. Each review is stored as a list of words, and the sentiment of each review is labeled as either ‘pos’ or ‘neg’.
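A quick sanity check on what we just loaded (a minimal sketch):

print(len(reviews))        # 2000 reviews in total
words, sentiment = reviews[0]
print(sentiment)           # 'neg' (negative reviews come first in the corpus)
print(words[:10])          # the first ten tokens of the first review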
Step 4: Cleaning the Data
Before we can build our sentiment analysis model, we need to clean the data by removing punctuation, converting all text to lowercase, and removing stop words. We can do this using the NLTK library.
The NLTK library provides a stop words list, which is a list of common words like ‘the’, ‘a’, and ‘an’ that can be removed from text because they do not carry much meaning. The stop words list is itself an NLTK corpus, so it must be downloaded once before use. We will use this list to remove stop words from our dataset, a regular expression to strip punctuation, and Python's built-in str.lower() to convert the text to lowercase.
Let’s define a function that will clean our data:
nltk.download('stopwords')  # one-time download of the stop words corpus

stop_words = set(nltk.corpus.stopwords.words('english'))

def clean_text(text):
    # Keep only alphabetic characters, replacing everything else with spaces
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase and split into individual words
    words = text.lower().split()
    # Drop common stop words (the set is built once above for speed)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)
The above function will remove all non-alphabetic characters, convert the text to lowercase, split the text into words, remove stop words, and then join the remaining words into a string.
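For example (the exact output depends on the stop word list shipped with your NLTK version):

print(clean_text("This movie was GREAT, wasn't it?!"))  # -> 'movie great'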
Step 5: Creating Feature Sets
The next step is to create feature sets that our sentiment analysis model can use to make predictions. We will do this by converting the cleaned movie reviews into feature sets. Each feature set will be a dictionary that maps each word appearing in the review to True; words that do not appear are simply left out of the dictionary.
Let’s define a function that will create our feature sets:
def create_feature_list(words):
    # Map every word in the review to True (a simple bag-of-words presence feature)
    features = {}
    for word in words:
        features[word] = True
    return features
The above function will create a dictionary for each review, with every word in the review mapped to the boolean True.
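For instance, a short cleaned review becomes:

print(create_feature_list(['movie', 'great']))  # -> {'movie': True, 'great': True}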
Step 6: Splitting the Data
Before we can train our sentiment analysis model, we need to split our dataset into training and testing sets. The training set will be used to train our model, and the testing set will be used to evaluate its performance. One caveat: the corpus lists all 1,000 negative reviews before the 1,000 positive ones, so we must shuffle before splitting; otherwise the test set would contain only positive reviews.

We will use the following code to split our dataset:

import random

random.seed(42)          # fix the seed so the split is reproducible
random.shuffle(reviews)  # mix positive and negative reviews together

train_set = reviews[:1500]
test_set = reviews[1500:]

The above code shuffles the dataset and then splits it into a training set of 1,500 reviews and a testing set of the remaining 500 reviews.
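To confirm the shuffled split is reasonably balanced, you can count the labels in each set (a quick sanity check):

from collections import Counter

print(Counter(sentiment for _, sentiment in train_set))  # roughly 750 of each label
print(Counter(sentiment for _, sentiment in test_set))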
Step 7: Training the Model
Now that we have created our feature sets and split our dataset, we can train our sentiment analysis model. We will use the Naive Bayes algorithm to train our model. The Naive Bayes algorithm is a probabilistic algorithm that uses Bayes’ theorem to make predictions.
Let’s train our model using the following code:
classifier = nltk.NaiveBayesClassifier.train([
    # Join the words back into a string so we can reuse clean_text,
    # then turn the cleaned words into a feature dictionary
    (create_feature_list(clean_text(' '.join(words)).split()), sentiment)
    for (words, sentiment) in train_set
])
The above code will train our model using the training set. For each training review it cleans the text, builds a feature set, and then uses the Naive Bayes algorithm to learn which words are associated with positive and negative reviews. Note that we apply clean_text here so that the features seen at training time match those we will build at prediction time.
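Once training finishes, it is worth inspecting which words the classifier found most predictive; NLTK provides a built-in method for this:

classifier.show_most_informative_features(10)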
Step 8: Testing the Model
After training our sentiment analysis model, we need to test it using the testing set. We will use the following code to test our model:
test_features = [
    # Apply the same cleaning and feature extraction as at training time
    (create_feature_list(clean_text(' '.join(words)).split()), sentiment)
    for (words, sentiment) in test_set
]

accuracy = nltk.classify.accuracy(classifier, test_features)
print(accuracy)
The above code will evaluate the performance of our model using the testing set. It will create feature sets for each review in the testing set, and then compare the predicted sentiment with the actual sentiment. It will then calculate the accuracy of our model.
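Accuracy alone can hide label-specific mistakes. NLTK's ConfusionMatrix shows how often each label is confused with the other, reusing the test_features list built above:

gold = [sentiment for _, sentiment in test_features]
predicted = [classifier.classify(features) for features, _ in test_features]
print(nltk.ConfusionMatrix(gold, predicted))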
Step 9: Making Predictions
Now that we have trained and tested our sentiment analysis model, we can use it to make predictions on new reviews. We will use the following code to make predictions:
def predict_sentiment(text):
    # Clean the raw text, then build the same kind of feature
    # dictionary that the classifier was trained on
    cleaned_text = clean_text(text)
    feature_list = create_feature_list(cleaned_text.split())
    return classifier.classify(feature_list)
The above function accepts a new text input, cleans it, builds a feature set from the cleaned words, and then uses our trained classifier to predict its sentiment.
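For example (the returned label is one of the corpus categories, ‘pos’ or ‘neg’; the exact predictions depend on the trained model):

print(predict_sentiment("An absolutely wonderful film with a brilliant cast."))  # likely 'pos'
print(predict_sentiment("A dull, boring waste of two hours."))                   # likely 'neg'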
In this tutorial, we learned how to build a sentiment analysis model with Python using the Natural Language Toolkit library. We cleaned the movie reviews dataset by removing punctuation, converting text to lowercase, and removing stop words. We created feature sets for the cleaned reviews and trained our sentiment analysis model using the Naive Bayes algorithm. We then tested the model on a held-out testing set, which should yield an accuracy of around 75% (the exact figure varies with the train/test split). Finally, we used our model to make predictions on new text inputs.
Want to learn more about Python? Check out the official Python documentation for details.