1. Main Dish (Idea) – The goal is to create a system that can automatically determine whether a movie review expresses a positive or negative sentiment, utilizing a dataset of movie reviews and their corresponding star ratings.
2. Ingredients (Concepts & Components) –
- Data Source (Amazon Instant Video Reviews): A file containing a large collection of movie reviews and star ratings.
- pandas Library: A kitchen tool for preparing, cleaning, and structuring the data.
- Natural Language Processing (NLP): The core technique for extracting meaning from text.
- Text Cleaning (Text Processing): Cleaning up the raw text data to remove noise and inconsistencies:
- Removing HTML tags and special characters.
- Lowercasing (changing all text to lowercase).
- TF-IDF Vectorization (Spices): Transforms each review into a numerical vector using the TF-IDF (Term Frequency-Inverse Document Frequency) method, which weights a word by how often it appears in a review and how rare it is across all reviews. The model works with these vectors instead of raw text.
- Logistic Regression (Oven): A machine learning algorithm used for sentiment classification.
- LightGBM (Oven): A gradient boosting framework, also used for sentiment classification.
- sklearn.model_selection (Testing and Training Platform): A tool to split the data into training and testing sets.
- sklearn.metrics.classification_report: A tool to check how good the dish is, measuring the precision, recall, and F1 score.
- Confusion Matrix (Heatmap): A graphical tool to visualize the performance of the model.
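- Imbalanced-learn (Balancer): A library for resampling imbalanced datasets; its RandomUnderSampler is used here to balance the classes.
- K-Means: A clustering algorithm, used here to plot cluster assignments against the sentiment labels.
- Model Persistence (Packaging): Saving the trained model for future use.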
3. Cooking Process (How It Works) –
- Preparation Phase (Data Collection and Preprocessing):
- Load the movie review data from the text file.
- Extract the text of the reviews and their corresponding ratings.
- Create a DataFrame with columns for the text of the reviews and the labels.
- Filter the dataset to include only positive (5-star) and negative (1-star) reviews.
- Clean the text by removing HTML tags and special characters and converting everything to lowercase.
- Use RandomUnderSampler to balance the dataset.
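
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the rows in `raw`, the column names, and the `clean_text` helper are all hypothetical, and the file-parsing step is skipped because the file layout is not described here. The project balances the classes with RandomUnderSampler from imbalanced-learn; to keep the sketch dependency-light, the same idea (randomly shrinking every class to the size of the smallest one) is done with plain pandas instead.

```python
import html
import re

import pandas as pd

# Toy stand-in for the parsed review file (hypothetical rows).
raw = [
    ("I <b>LOVED</b> this movie!!!", 5),
    ("Great cast &amp; a great story.", 5),
    ("A wonderful film, would watch again.", 5),
    ("Terrible. Total waste of time...", 1),
    ("It was okay, nothing special.", 3),
]
df = pd.DataFrame(raw, columns=["text", "rating"])

# Keep only 5-star and 1-star reviews; label them 1 (positive) and 0 (negative).
df = df[df["rating"].isin([1, 5])].copy()
df["label"] = (df["rating"] == 5).astype(int)


def clean_text(text: str) -> str:
    """Strip HTML tags and special characters, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip().lower()


df["clean_text"] = df["text"].apply(clean_text)

# Random undersampling (pandas stand-in for RandomUnderSampler):
# sample every class down to the size of the smallest class.
n_min = df["label"].value_counts().min()
balanced = (
    df.groupby("label")
    .sample(n=n_min, random_state=42)
    .reset_index(drop=True)
)
print(balanced[["clean_text", "label"]])
```

Undersampling throws away reviews from the larger class, which is acceptable when the dataset is large; the fixed `random_state` only makes the toy run repeatable.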
- Flavor Infusion (TF-IDF Vectorization):
- Use TF-IDF to create vector representations of the cleaned review text, giving numerical meaning to each word.
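
As a rough illustration of this step, the sketch below runs scikit-learn's `TfidfVectorizer` with default settings on three made-up, already-cleaned reviews. The settings the project actually used (vocabulary size, n-grams, stop words) are not stated, so the defaults here are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned reviews.
cleaned_reviews = [
    "i loved this movie",
    "great cast and a great story",
    "terrible total waste of time",
]

vectorizer = TfidfVectorizer()
# Sparse matrix with one row per review and one column per vocabulary word.
X = vectorizer.fit_transform(cleaned_reviews)

print(X.shape)
print(sorted(vectorizer.vocabulary_))

# Weights of the second review: "great" occurs twice, so it dominates the row.
row = X[1].toarray().ravel()
for word, column in sorted(vectorizer.vocabulary_.items()):
    if row[column] > 0:
        print(f"{word}: {row[column]:.3f}")
```

With the default tokenizer, one-letter tokens such as "i" and "a" are dropped, and each row is scaled to unit length, so the weights are comparable across short and long reviews.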
- Baking (Model Training):
- Split the vector data into training and testing sets.
- Train a logistic regression model on the training data.
- Train a LightGBM classifier on the same training data.
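
The baking step might look something like the sketch below. The corpus is a toy one (a handful of made-up phrases repeated so there is enough data to split), and every hyperparameter shown (`test_size`, `max_iter`, `n_estimators`, `min_child_samples`) is an assumption chosen for the toy data, not a value taken from the project. LightGBM is imported defensively so the sketch still runs where it is not installed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

try:  # LightGBM is treated as optional in this sketch
    from lightgbm import LGBMClassifier
except (ImportError, OSError):
    LGBMClassifier = None

# Toy corpus (hypothetical); 1 = positive, 0 = negative.
positive = ["loved it", "great movie", "wonderful film", "excellent acting"]
negative = ["hated it", "terrible movie", "awful film", "boring acting"]
texts = positive * 10 + negative * 10
labels = np.array([1] * 40 + [0] * 40)

# Vectorize first, then split the vectors, as in the steps above.
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42
)

# Oven 1: logistic regression.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("Logistic Regression accuracy:", log_reg.score(X_test, y_test))

# Oven 2: LightGBM, trained on the same split.
if LGBMClassifier is not None:
    lgbm = LGBMClassifier(
        n_estimators=50,
        min_child_samples=1,  # small value only because the toy data is tiny
        random_state=42,
        verbosity=-1,
    )
    lgbm.fit(X_train, y_train)
    print("LightGBM accuracy:", lgbm.score(X_test, y_test))
```

Because the toy reviews are repeated, the test set contains phrases the model has already seen, so the printed accuracies say nothing about real performance; they only show that the pipeline runs end to end.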
- Evaluation (Testing and Plating):
- Apply the trained logistic regression model to the test data to predict sentiments.
- Generate the classification report to show the accuracy, precision, recall, and F1-score.
- Generate a confusion matrix heatmap to visually assess the model’s performance.
- Plot the K-Means clusters against the true labels.
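
A hedged sketch of the tasting step follows. It rebuilds a small toy model so that it runs on its own, then produces the classification report, the confusion matrix, and a count of true labels per K-Means cluster. How the project actually drew its cluster plot is not described, so tabulating clusters against labels and drawing the table as a heatmap is an assumption, as is the hypothetical `plot_heatmap` helper (which needs matplotlib).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical); 1 = positive, 0 = negative.
positive = ["loved it", "great movie", "wonderful film", "excellent acting"]
negative = ["hated it", "terrible movie", "awful film", "boring acting"]
texts = positive * 10 + negative * 10
labels = np.array([1] * 40 + [0] * 40)

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Cluster the vectors without using the labels, then count labels per cluster.
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
cluster_vs_label = confusion_matrix(labels, clusters, labels=[0, 1])
print(cluster_vs_label)


def plot_heatmap(matrix, xlabel, ylabel, path):
    """Save a matrix of counts as an annotated heatmap (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, cmap="Blues")
    for (i, j), count in np.ndenumerate(matrix):
        ax.text(j, i, str(count), ha="center", va="center")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.colorbar(image)
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_heatmap(cm, "Predicted", "True", "confusion_matrix.png")` would save the confusion matrix as an image, and the same helper works for `cluster_vs_label`. K-Means cluster numbers are arbitrary, so cluster 0 need not correspond to the negative class.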
- Preservation (Model Deployment):
- Store the trained model and TF-IDF vectorizer using the pickle library for future use or deployment.
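
One way the packaging step could look is sketched below. The file name, the choice to store both objects in a single dictionary, and the `predict_sentiment` helper are all hypothetical; the only thing stated above is that the model and the vectorizer are saved with pickle. The file goes into a temporary directory so the sketch leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy model (hypothetical data); 1 = positive, 0 = negative.
texts = ["loved it", "great movie", "wonderful film", "excellent acting",
         "hated it", "terrible movie", "awful film", "boring acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentiment_model.pkl"  # hypothetical file name

    # Save the vectorizer and model together so they cannot get out of sync.
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

    # Later, or in another process: load the bundle back.
    with open(path, "rb") as f:
        bundle = pickle.load(f)


def predict_sentiment(review: str) -> int:
    """Return 1 for positive and 0 for negative, using the restored bundle."""
    features = bundle["vectorizer"].transform([review.lower()])
    return int(bundle["model"].predict(features)[0])


print(predict_sentiment("Loved it"))
print(predict_sentiment("Hated it"))
```

The vectorizer has to be saved alongside the model because new reviews must be mapped onto exactly the same vocabulary the model was trained on. Pickle files should only be loaded from trusted sources, and ideally with the same library versions that wrote them.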
4. Serving Suggestion (Outcome) –
- A trained model that can classify movie reviews as expressing either a positive (5-star) or negative (1-star) sentiment, with an accuracy of about 93% using Logistic Regression and about 90% using LightGBM.
- A confusion matrix visualizing the accuracy of the model.
- The model and vectorizer are ready to use and easy to deploy.

