Sentiment Analyzer – A Recipe

1. Main Dish (Idea) – The goal is to build a system that automatically determines whether a movie review expresses positive or negative sentiment, using a dataset of movie reviews and their star ratings.

2. Ingredients (Concepts & Components)

  • Data Source (Amazon Instant Video Reviews): A file containing a large collection of movie reviews and star ratings.
  • pandas Library: A kitchen tool for loading, cleaning, and structuring the data.
  • Natural Language Processing (NLP): The core technique for extracting meaning from text.
  • Text Cleaning (Text Processing): Removing noise and inconsistencies from the raw text:
    • Stripping HTML tags and special characters.
    • Lowercasing (changing all text to lowercase).
  • TF-IDF Vectorization (Spices): Transforms each review into a numerical vector using TF-IDF (Term Frequency-Inverse Document Frequency). A word scores higher the more often it appears in a review and the rarer it is across all reviews, which gives the model numeric features to learn from.
  • Logistic Regression (Oven): A machine learning algorithm used for sentiment classification.
  • LightGBM (Oven): A gradient boosting framework, also used for sentiment classification.
  • sklearn.model_selection (Testing and Training Platform): The scikit-learn module whose train_test_split function splits the data into training and testing sets.
  • sklearn.metrics.classification_report: A tool to check how good the dish is, measuring the precision, recall, and F1 score.
  • Confusion Matrix (Heatmap): A graphical tool to visualize the performance of the model.
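  • Imbalanced-learn (Balancer): A library for resampling imbalanced datasets; its RandomUnderSampler is used here to even out the number of positive and negative reviews.
  • K-Means: An unsupervised clustering algorithm, used here to plot the clusters it finds against the true sentiment labels.
  • Model Persistence (Packaging): Saving the trained model for future use.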
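As a small taste of the Text Cleaning ingredient, here is a minimal sketch that uses only the Python standard library. The function name `clean_review` and the exact regular expressions are assumptions made for illustration; the recipe itself only says that HTML tags and special characters are removed and the text is lowercased.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # everything except letters, digits, whitespace


def clean_review(text: str) -> str:
    """Lowercase a review and strip HTML tags and special characters."""
    text = html.unescape(text)          # turn entities such as &amp; back into characters
    text = TAG_RE.sub(" ", text)        # drop tags such as <br />
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)  # replace punctuation and symbols with spaces
    return " ".join(text.split())       # collapse runs of whitespace


print(clean_review("GREAT movie!<br />Loved it &amp; would watch again."))
# → great movie loved it would watch again
```

One consequence of this particular choice: apostrophes are treated as special characters, so "don't" becomes "don t". Whether that matters depends on the vectorizer settings used afterwards.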

3. Cooking Process (How It Works)

  1. Preparation Phase (Data Collection and Preprocessing):
    • Load the movie review data from the text file.
    • Extract the text of the reviews and their corresponding ratings.
    • Create a DataFrame with columns for the text of the reviews and the labels.
    • Filter the dataset to include only positive (5-star) and negative (1-star) reviews.
    • Perform text cleaning to remove HTML tags, special characters and convert everything to lowercase.
    • Use RandomUnderSampler to balance the dataset.
  2. Flavor Infusion (TF-IDF Vectorization):
    • Use TF-IDF to create vector representations of the cleaned review text, giving numerical meaning to each word.
  3. Baking (Model Training):
    • Split the vector data into training and testing sets.
    • Use the training data to train a logistic regression model.
    • Train a LightGBM classifier on the same training data for comparison.
  4. Evaluation (Testing and Plating):
    • Apply the trained logistic regression model to the test data to predict sentiments.
    • Generate the classification report to show the accuracy, precision, recall, and F1-score.
    • Generate a confusion matrix heatmap to visually assess the model’s performance.
    • Plot the K-Means clusters against the true sentiment labels.
  5. Preservation (Model Deployment):
    • Store the trained model and TF-IDF vectorizer using the pickle library for future use or deployment.
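The cooking steps above can be sketched end to end as follows. This is a rough sketch under several assumptions, not the original code: the review rows are invented toy data, the column names `text`, `stars`, and `label` are hypothetical, and a plain pandas downsample stands in for imbalanced-learn's RandomUnderSampler so the example needs only pandas and scikit-learn. LightGBM is left out for the same reason; a LightGBM classifier could be trained on the same `X_train` and `y_train`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Invented toy rows standing in for the real review file: (text, star rating).
rows = [
    ("Loved every minute of it", 5),
    ("A wonderful and moving film", 5),
    ("Great acting and a great story", 5),
    ("Brilliant, would watch again", 5),
    ("Fantastic from start to finish", 5),
    ("Terrible, a waste of time", 1),
    ("Boring plot and awful acting", 1),
    ("Worst film I have ever seen", 1),
    ("Dull, predictable and far too long", 1),
    ("It was okay, nothing special", 3),
]
df = pd.DataFrame(rows, columns=["text", "stars"])

# 1. Preparation: keep only 1-star and 5-star reviews, label them, lowercase.
df = df[df["stars"].isin([1, 5])].copy()
df["label"] = (df["stars"] == 5).astype(int)  # 1 = positive, 0 = negative
df["text"] = df["text"].str.lower()

# Balance the classes by downsampling each one to the size of the smallest
# (a pandas stand-in for RandomUnderSampler).
n_min = df["label"].value_counts().min()
df = df.groupby("label").sample(n=n_min, random_state=0)

# 2. Flavor infusion: turn the review text into TF-IDF vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])
y = df["label"].to_numpy()

# 3. Baking: split the vectors, then train logistic regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Evaluation: classification report and confusion matrix on the test set.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)  # this array is what a heatmap would visualize

# Compare unsupervised K-Means clusters with the true labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(y, clusters, rownames=["label"], colnames=["cluster"]))
```

The sketch follows the recipe's order and vectorizes before splitting. Fitting the vectorizer on the training portion only, and then calling `transform` on the test portion, would keep test-set vocabulary statistics out of training.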

4. Serving Suggestion (Outcome)

  • A trained model that classifies movie reviews as positive (5-star) or negative (1-star), with an accuracy of about 93% using Logistic Regression and about 90% using LightGBM.
  • A confusion matrix heatmap showing how many reviews of each class were classified correctly and incorrectly.
  • The model and vectorizer are ready to use and easy to deploy.
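To illustrate what "ready to use" can look like, here is a hedged sketch of packaging the model and vectorizer together with pickle and reusing them on new reviews. The toy training texts, the dictionary keys, and the helper `predict_sentiment` are assumptions for illustration. The round trip happens in memory with `pickle.dumps` and `pickle.loads`; writing to a file would use `pickle.dump` and `pickle.load` on a file opened in binary mode.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data; 1 = positive, 0 = negative.
texts = [
    "loved it great film",
    "wonderful and moving",
    "terrible waste of time",
    "boring and awful",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Package the vectorizer and model together so they cannot drift apart.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# Later, or in another process: restore the bundle.
bundle = pickle.loads(blob)


def predict_sentiment(reviews):
    """Return 1 (positive) or 0 (negative) for each review string."""
    # Use transform, not fit_transform, so the stored vocabulary is reused.
    features = bundle["vectorizer"].transform(reviews)
    return bundle["model"].predict(features)


new_reviews = ["great film, loved it", "awful and boring"]
restored = predict_sentiment(new_reviews)
print(restored)
```

Saving the vectorizer alongside the model matters because the model only understands vectors built from the same vocabulary it was trained on. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.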
