Sentiment Analyzer – A Recipe

1. Main Dish (Idea) – The goal is to build a system that automatically determines whether a movie review expresses positive or negative sentiment, using a dataset of movie reviews and their star ratings.

2. Ingredients (Concepts & Components) –

Data Source (Amazon Instant Video Reviews): A file containing a large collection of movie reviews and star ratings.
pandas Library: A kitchen tool for loading, cleaning, and structuring the data.
Natural Language Processing (NLP): The core technique for extracting meaning from text.
Text Cleaning (Text Processing): Cleaning up the raw text data to remove noise and inconsistencies:
- Removing HTML tags.
- Removing special characters.
- Lowercasing (changing all text to lowercase).
TF-IDF Vectorization (Spices): Transforms text reviews into numerical representations using the TF-IDF (Term Frequency-Inverse Document Frequency) method. Each review becomes a numeric vector in which words that appear often in that review but rarely across the whole dataset receive higher weights, giving the model numbers it can learn from.
Logistic Regression (Oven): A machine learning algorithm used for sentiment classification.
LightGBM (Oven): A gradient boosting framework, also used for sentiment classification.
sklearn.model_selection (Testing and Training Platform): A tool to split the data into training and testing sets.
sklearn.metrics.classification_report: A tool to check how good the dish is, measuring the precision, recall, and F1 score.
Confusion Matrix (Heatmap): A graphical tool to visualize the performance of the model.
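Imbalanced-learn (Balancer): A library for resampling imbalanced datasets. Its RandomUnderSampler is used here to even out the number of positive and negative reviews.
K-Means: A clustering algorithm, used here to plot cluster assignments against the sentiment labels.
Model Persistence (Packaging): Saving the trained model for future use.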

3. Cooking Process (How It Works) –

Preparation Phase (Data Collection and Preprocessing):
- Load the movie review data from the text file.
- Extract the text of the reviews and their corresponding ratings.
- Create a DataFrame with columns for the text of the reviews and the labels.
- Filter the dataset to include only positive (5-star) and negative (1-star) reviews.
- Clean the text: remove HTML tags and special characters, and convert everything to lowercase.
- Use RandomUnderSampler to balance the dataset.
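
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the column names `text` and `rating`, the helper names `clean_text` and `prepare`, and the exact regular expressions are assumptions. To keep the example dependency-free, the balancing step downsamples with plain pandas; it stands in for what imbalanced-learn's RandomUnderSampler does (randomly dropping rows from the larger class).

```python
import re

import pandas as pd


def clean_text(text):
    """Strip HTML tags and special characters, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = text.lower()                        # lowercase everything
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace


def prepare(df, seed=0):
    """Keep 1-star and 5-star reviews, label them, clean them, balance them."""
    df = df[df["rating"].isin([1, 5])].copy()
    df["label"] = (df["rating"] == 5).astype(int)   # 1 = positive, 0 = negative
    df["text"] = df["text"].map(clean_text)

    # Random undersampling: shrink every class to the size of the smallest one.
    smallest = df["label"].value_counts().min()
    parts = [group.sample(n=smallest, random_state=seed)
             for _, group in df.groupby("label")]
    return pd.concat(parts).reset_index(drop=True)


# Tiny made-up sample, for illustration only.
raw = pd.DataFrame({
    "text": ["<b>Great!</b> Loved it", "Awful...", "It was ok", "Best EVER", "Superb"],
    "rating": [5, 1, 3, 5, 5],
})
balanced = prepare(raw)
print(balanced[["text", "label"]])
```

After this step the 3-star review is gone and both classes have the same number of rows.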
Flavor Infusion (TF-IDF Vectorization):
- Use TF-IDF to create vector representations of the cleaned review text, giving numerical meaning to each word.
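
As a small sketch of this step, scikit-learn's `TfidfVectorizer` can turn cleaned reviews into a sparse matrix with one row per review and one column per vocabulary word. The three sample reviews are made up, and the default vectorizer settings are an assumption; the original may have used different options.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned reviews, for illustration only.
reviews = [
    "great movie loved it",
    "terrible movie hated it",
    "great acting great story",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)   # sparse matrix: reviews x vocabulary

print(X.shape)
print(sorted(vectorizer.vocabulary_))   # the words that became columns
```

Words shared by many reviews (such as "movie") end up with a lower inverse-document-frequency weight than words that appear in only one review.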
Baking (Model Training):
- Split the vector data into training and testing sets.
- Train a logistic regression model on the training data.
- Train a LightGBM model on the same training data.
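
A rough sketch of the baking step, assuming scikit-learn's `train_test_split` and `LogisticRegression` plus LightGBM's `LGBMClassifier`. The toy data, the 75/25 split, and every hyperparameter below are assumptions chosen so the example runs quickly; they are not the settings behind the reported results. LightGBM is imported inside a `try` block so the sketch still runs where it is not installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up, perfectly separable toy data, for illustration only.
texts = ["loved it great film"] * 60 + ["hated it awful film"] * 60
labels = [1] * 60 + [0] * 60

# Vectorize first, then split the vectors (the order described above).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic regression accuracy:", log_reg.score(X_test, y_test))

try:
    from lightgbm import LGBMClassifier
except ImportError:
    lgbm = None   # LightGBM not installed; skip the second oven
else:
    lgbm = LGBMClassifier(n_estimators=50, min_child_samples=5, random_state=42)
    lgbm.fit(X_train, y_train)
    print("LightGBM accuracy:", lgbm.score(X_test, y_test))
```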
Evaluation (Testing and Plating):
- Apply each trained model to the test data to predict sentiments.
- Generate the classification report to show the accuracy, precision, recall, and F1-score.
- Generate a confusion matrix heatmap to visually assess the model’s performance.
- Plot the K-Means clusters against the sentiment labels.
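
The report and the confusion matrix can be sketched with `sklearn.metrics`. The true labels and predictions below are made up so the numbers are easy to check by hand; in the real pipeline they would come from the test split and the trained model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Precision, recall, and F1 per class, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

To plate the matrix as a heatmap, one option is to pass `cm` to seaborn's `heatmap` function with `annot=True`, which draws the counts inside each cell.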
Preservation (Model Deployment):
- Store the trained model and TF-IDF vectorizer using the pickle library for future use or deployment.
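
A minimal sketch of the packaging step with the standard-library `pickle` module. The file name `sentiment_model.pkl`, the dictionary layout, and the toy training data are assumptions for illustration; the example writes to a temporary directory so it leaves nothing behind. Saving the vectorizer together with the model matters, because new reviews must be transformed with the same vocabulary the model was trained on.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data, for illustration only.
texts = ["loved it great film"] * 20 + ["hated it awful film"] * 20
labels = [1] * 20 + [0] * 20

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_model.pkl")  # hypothetical file name

    # Store the model and the vectorizer side by side.
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

    # Later (or in another process): load both and classify a new review.
    with open(path, "rb") as f:
        bundle = pickle.load(f)

new_review = ["great film"]
prediction = bundle["model"].predict(bundle["vectorizer"].transform(new_review))
print(prediction)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.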

4. Serving Suggestion (Outcome) –

A trained model that classifies movie reviews as positive (5-star) or negative (1-star), with an accuracy of about 93% using Logistic Regression and about 90% using LightGBM.
A confusion matrix heatmap showing how many reviews of each class the model classified correctly and incorrectly.
The model and vectorizer are ready to use and easy to deploy.
