Utilizing NLP Techniques to Predict E-commerce Recommendation & Rating

Association:

iSchool, University of Toronto

Duration:

2 weeks

Machine Learning

Data Science

Python

Introduction

Within e-commerce, customer reviews are essential for guiding consumer behavior and informing business decisions. These reviews provide valuable insights into customer satisfaction, product quality, and areas for improvement. This project leverages a comprehensive dataset of women's clothing e-commerce reviews to develop predictive machine learning models. The dataset, sourced from Kaggle, contains 23,486 customer reviews alongside essential information regarding the purchases. By analyzing this rich dataset, the project aims to enhance the understanding of customer preferences and improve product recommendations.

Problem Statement

The primary objective of this project is to develop predictive machine learning models for two classification tasks: 

  • Recommendation Prediction: Predict whether a user recommends a product. This is a binary classification task where the target variable is "Recommended IND" (1 for recommended, 0 for not recommended). 

  • Star Rating Prediction: Predict the user's given star rating (1 to 5 stars) for the items. This task involves ordinal classification, where the model must accurately predict the specific star rating assigned by the customer. 

The insights from this project can help to improve customer satisfaction, enhance product recommendations, and ultimately drive better customer engagement and sales for e-commerce businesses.

Methodologies

The process begins with fundamental data preparation techniques optimized for natural language processing of customers’ reviews. Subsequently, the strategy employs a comprehensive text classification pipeline to transform unstructured review data into structured features suitable for machine learning. Multiple models, including Logistic Regression, Random Forest, XGBoost, and LightGBM, are evaluated. The approach balances computational efficiency and predictive accuracy, leveraging advanced techniques to address the subtleties of clothing reviews.

The key feature in this dataset is customers’ reviews; thus, the strategy for task 1 encompasses a text classification pipeline, including:

  • Sentiment Analysis: Using VADER to extract sentiment scores (positive, negative, neutral, compound), which quantify the emotional tone of review titles and text.

  • Basic Text Statistics: Adding text length and word count as features, which can indicate review thoroughness/detail.

  • BERT Embeddings: Converting text into rich semantic vectors that capture contextual meaning and nuance.

This approach transforms unstructured text data into structured features while preserving semantic meaning, captures multiple dimensions of the review titles and text (sentiment, length, semantic content), and prepares the data in a format ready for machine learning algorithms. BERT embeddings are suitable for this challenge because clothing reviews often contain subtle language about fit, quality, and style that simpler bag-of-words approaches might miss, while sentiment analysis directly captures customer satisfaction signals.

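import pandas as pd
from sentence_transformers import SentenceTransformer
# Assumption: VADER comes from the vaderSentiment package; nltk.sentiment.vader exposes a class with the same name
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Build the analyzer once instead of on every call
vader = SentimentIntensityAnalyzer()

# Function to get VADER sentiment scores (positive, neutral, negative, compound)
def get_sentiment(text):
    sentiment = vader.polarity_scores(str(text))
    return sentiment["pos"], sentiment["neu"], sentiment["neg"], sentiment["compound"]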

# Apply sentiment analysis to training and test data
train_data[["Sent_Pos_Review", "Sent_Neu_Review", "Sent_Neg_Review", "Sent_Compound_Review"]] = train_data["Review Text Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))
train_data[["Sent_Pos_Title", "Sent_Neu_Title", "Sent_Neg_Title", "Sent_Compound_Title"]] = train_data["Title Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))

test_data[["Sent_Pos_Review", "Sent_Neu_Review", "Sent_Neg_Review", "Sent_Compound_Review"]] = test_data["Review Text Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))
test_data[["Sent_Pos_Title", "Sent_Neu_Title", "Sent_Neg_Title", "Sent_Compound_Title"]] = test_data["Title Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))

# Calculate review length and word count
train_data["Review_Length"] = train_data["Review Text Cleaned"].apply(lambda x: len(str(x)))
train_data["Word_Count_Review"] = train_data["Review Text Cleaned"].apply(lambda x: len(str(x).split()))
train_data["Title_Length"] = train_data["Title Cleaned"].apply(lambda x: len(str(x)))
train_data["Word_Count_Title"] = train_data["Title Cleaned"].apply(lambda x: len(str(x).split()))

test_data["Review_Length"] = test_data["Review Text Cleaned"].apply(lambda x: len(str(x)))
test_data["Word_Count_Review"] = test_data["Review Text Cleaned"].apply(lambda x: len(str(x).split()))
test_data["Title_Length"] = test_data["Title Cleaned"].apply(lambda x: len(str(x)))
test_data["Word_Count_Title"] = test_data["Title Cleaned"].apply(lambda x: len(str(x).split()))

# Create copies without text columns and "Rating" for final feature set
train_features = train_data.drop(columns=["Title", "Review Text", "Title Cleaned", "Review Text Cleaned", "Rating"])
test_features = test_data.drop(columns=["Title", "Review Text", "Title Cleaned", "Review Text Cleaned", "Rating"])

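# Load Sentence Transformer model (all-MiniLM-L6-v2 is a compact BERT-style encoder that returns 384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate sentence embeddings for review text and titles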
train_review_embeddings = model.encode(train_data["Review Text Cleaned"].tolist())
train_title_embeddings = model.encode(train_data["Title Cleaned"].tolist())
test_review_embeddings = model.encode(test_data["Review Text Cleaned"].tolist())
test_title_embeddings = model.encode(test_data["Title Cleaned"].tolist())

# When creating DataFrames from embeddings, explicitly name the columns
embed_dim = train_review_embeddings.shape[1]
review_embed_cols = [f'review_embed_{i}' for i in range(embed_dim)]
title_embed_cols = [f'title_embed_{i}' for i in range(embed_dim)]

X_train_review_bert = pd.DataFrame(train_review_embeddings, columns=review_embed_cols)
X_train_title_bert = pd.DataFrame(train_title_embeddings, columns=title_embed_cols)
X_test_review_bert = pd.DataFrame(test_review_embeddings, columns=review_embed_cols)
X_test_title_bert = pd.DataFrame(test_title_embeddings, columns=title_embed_cols)

# Reset index on all DataFrames before concatenation
train_features_reset = train_features.reset_index(drop=True)
test_features_reset = test_features.reset_index(drop=True)
X_train_review_bert_reset = X_train_review_bert.reset_index(drop=True)
X_train_title_bert_reset = X_train_title_bert.reset_index(drop=True)
X_test_review_bert_reset = X_test_review_bert.reset_index(drop=True)
X_test_title_bert_reset = X_test_title_bert.reset_index(drop=True)

# Concatenate with explicit column references
X_train_final = pd.concat([train_features_reset, X_train_review_bert_reset, X_train_title_bert_reset], axis=1)
X_test_final = pd.concat([test_features_reset, X_test_review_bert_reset, X_test_title_bert_reset], axis=1)

# Verify that both datasets have the same number of columns
print(f"X_train_final shape: {X_train_final.shape}")
print(f"X_test_final shape: {X_test_final.shape}")
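
The index reset before concatenation matters because pd.concat with axis=1 aligns rows on index labels rather than on position. The sketch below illustrates this with two tiny hypothetical frames (the column values and the index labels 5 and 9 are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-ins: the tabular features keep the row labels left over
# from an earlier split, while the embedding frame is freshly built and is
# therefore labelled 0..n-1.
features = pd.DataFrame({"Age": [34, 51]}, index=[5, 9])
embeddings = pd.DataFrame({"review_embed_0": [0.12, -0.40]})

# concat(axis=1) aligns on index labels, so mismatched labels give the
# union of both indexes, padded with NaN.
misaligned = pd.concat([features, embeddings], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([features.reset_index(drop=True), embeddings], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

Without the reset, the combined frame has four rows and four missing values instead of two complete rows, which is why checking the printed shapes is a useful sanity test.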

For task 2, the approach is adjusted slightly as there are fewer features involved. This task only uses review title and text. Accordingly, a comprehensive text processing pipeline is applied:

  • Sentiment Analysis: Using TextBlob to map the polarity to negative, neutral, and positive categories.

  • TF-IDF Vectorization: The cleaned 'Review Text' is converted into numerical features using word n-grams from bigrams through 4-grams (ngram_range=(2, 4)), limited to the top 100 features.

  • Sentiment Intensity Calculation: The program uses VADER to calculate sentiment intensity scores for the cleaned 'Review Text' and 'Title' columns.

  • Emotion Score Extraction with Feature Engineering: First, NRCLex is used to calculate emotion scores for the cleaned 'Review Text' and 'Title' columns. Afterward, the program expands the emotion scores into separate columns for each emotion.

These steps collectively help in creating a robust feature set that can effectively capture the nuances in review data, such as the reviewer's feelings, making it suitable for predicting ratings accurately.
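
To make the n-gram range concrete, here is a minimal sketch of which phrases a range of 2 to 4 produces. It is a simplification under stated assumptions: word_ngrams is a hypothetical helper that splits on whitespace, whereas scikit-learn's TfidfVectorizer applies its own tokenizer and then weights each phrase by TF-IDF. The example sentence is made up.

```python
def word_ngrams(text, min_n=2, max_n=4):
    """Return all word n-grams of length min_n..max_n (hypothetical helper)."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


example = "runs small order size up"
ngrams = word_ngrams(example)
print(ngrams)
```

Single words such as "small" never appear as features in this range; only phrases like "runs small" or "order size up" do, which is how the pipeline picks up fit and quality expressions that span several words.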

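import pandas as pd
from textblob import TextBlob
from nrclex import NRCLex
from sklearn.feature_extraction.text import TfidfVectorizer
# Assumption: VADER comes from the vaderSentiment package; nltk.sentiment.vader exposes a class with the same name
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Function to get sentiment polarity and map it to one-hot (negative, neutral, positive) flags
def get_sentiment(text):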
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity

    # Map polarity to negative, neutral, positive
    if polarity < -0.05:
        return 1, 0, 0  # Negative
    elif polarity > 0.05:
        return 0, 0, 1  # Positive
    else:
        return 0, 1, 0  # Neutral

# Apply sentiment analysis to training and test data
train_data_task2[["Sent_Neg_Review", "Sent_Neu_Review", "Sent_Pos_Review"]] = train_data_task2["Review Text Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))
train_data_task2[["Sent_Neg_Title", "Sent_Neu_Title", "Sent_Pos_Title"]] = train_data_task2["Title Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))

test_data_task2[["Sent_Neg_Review", "Sent_Neu_Review", "Sent_Pos_Review"]] = test_data_task2["Review Text Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))
test_data_task2[["Sent_Neg_Title", "Sent_Neu_Title", "Sent_Pos_Title"]] = test_data_task2["Title Cleaned"].apply(lambda x: pd.Series(get_sentiment(x)))

# Calculate review length and word count
train_data_task2["Review_Length"] = train_data_task2["Review Text Cleaned"].apply(lambda x: len(str(x)))
train_data_task2["Word_Count_Review"] = train_data_task2["Review Text Cleaned"].apply(lambda x: len(str(x).split()))
train_data_task2["Title_Length"] = train_data_task2["Title Cleaned"].apply(lambda x: len(str(x)))
train_data_task2["Word_Count_Title"] = train_data_task2["Title Cleaned"].apply(lambda x: len(str(x).split()))

test_data_task2["Review_Length"] = test_data_task2["Review Text Cleaned"].apply(lambda x: len(str(x)))
test_data_task2["Word_Count_Review"] = test_data_task2["Review Text Cleaned"].apply(lambda x: len(str(x).split()))
test_data_task2["Title_Length"] = test_data_task2["Title Cleaned"].apply(lambda x: len(str(x)))
test_data_task2["Word_Count_Title"] = test_data_task2["Title Cleaned"].apply(lambda x: len(str(x).split()))

# TF-IDF over bigrams through 4-grams of the review text
tfidf_review = TfidfVectorizer(ngram_range=(2, 4), max_features=100)  # Adjust max_features as needed
train_review_tfidf = tfidf_review.fit_transform(train_data_task2["Review Text Cleaned"])
test_review_tfidf = tfidf_review.transform(test_data_task2["Review Text Cleaned"])

# Convert to DataFrames
train_review_tfidf_df = pd.DataFrame(train_review_tfidf.toarray(), columns=[f"review_tfidf_{i}" for i in range(train_review_tfidf.shape[1])])
test_review_tfidf_df = pd.DataFrame(test_review_tfidf.toarray(), columns=[f"review_tfidf_{i}" for i in range(test_review_tfidf.shape[1])])

# Initialize VADER
vader = SentimentIntensityAnalyzer()

# Function to get sentiment intensity
def get_sentiment_intensity(text):
    sentiment = vader.polarity_scores(text)
    return sentiment["compound"]  # Compound score

# Apply sentiment intensity to review and title
train_data_task2["Sentiment_Intensity_Review"] = train_data_task2["Review Text Cleaned"].apply(get_sentiment_intensity)
train_data_task2["Sentiment_Intensity_Title"] = train_data_task2["Title Cleaned"].apply(get_sentiment_intensity)

test_data_task2["Sentiment_Intensity_Review"] = test_data_task2["Review Text Cleaned"].apply(get_sentiment_intensity)
test_data_task2["Sentiment_Intensity_Title"] = test_data_task2["Title Cleaned"].apply(get_sentiment_intensity)

# Function to get emotion scores using NRCLex
def get_emotion_scores_nrclex(text):
    emotion = NRCLex(text)
    return emotion.affect_frequencies

# Apply emotion scores to review and title
train_data_task2["Emotion_Scores_Review"] = train_data_task2["Review Text Cleaned"].apply(get_emotion_scores_nrclex)
train_data_task2["Emotion_Scores_Title"] = train_data_task2["Title Cleaned"].apply(get_emotion_scores_nrclex)

test_data_task2["Emotion_Scores_Review"] = test_data_task2["Review Text Cleaned"].apply(get_emotion_scores_nrclex)
test_data_task2["Emotion_Scores_Title"] = test_data_task2["Title Cleaned"].apply(get_emotion_scores_nrclex)

# List of emotions in the NRC lexicon
emotions = ["anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust"]

# Function to expand emotion scores into separate columns
def expand_emotion_scores(df, column_prefix):
    for emotion in emotions:
        df[f"{column_prefix}_{emotion}"] = df[column_prefix].apply(lambda x: x.get(emotion, 0))
    return df.drop(columns=[column_prefix])

# Create copies without text columns for final feature set
train_features_2 = train_data_task2.drop(columns=["Title", "Review Text", "Title Cleaned", "Review Text Cleaned"])
test_features_2 = test_data_task2.drop(columns=["Title", "Review Text", "Title Cleaned", "Review Text Cleaned"])

# Reset index on all DataFrames before concatenation
train_features_2_reset = train_features_2.reset_index(drop=True)
test_features_2_reset = test_features_2.reset_index(drop=True)

# Final dataframes
X_train_task2 = train_features_2_reset
X_test_task2 = test_features_2_reset

# Concatenate with existing features
X_train_task2 = pd.concat([X_train_task2, train_review_tfidf_df], axis=1)
X_test_task2 = pd.concat([X_test_task2, test_review_tfidf_df], axis=1)

# The sentiment intensity and emotion score columns are already carried over from
# train_data_task2/test_data_task2 (they were added before the text columns were dropped).
# Re-assign them by position: X_*_task2 has a reset index, so assigning the original
# Series directly would align on index labels and could misplace rows or introduce NaN
# if the source frames kept a non-sequential index (e.g. after a train/test split).
for col in ["Sentiment_Intensity_Review", "Sentiment_Intensity_Title",
            "Emotion_Scores_Review", "Emotion_Scores_Title"]:
    X_train_task2[col] = train_data_task2[col].to_numpy()
    X_test_task2[col] = test_data_task2[col].to_numpy()

# Expand emotion scores for review and title
X_train_task2 = expand_emotion_scores(X_train_task2, "Emotion_Scores_Review")
X_train_task2 = expand_emotion_scores(X_train_task2, "Emotion_Scores_Title")
X_test_task2 = expand_emotion_scores(X_test_task2, "Emotion_Scores_Review")
X_test_task2 = expand_emotion_scores(X_test_task2, "Emotion_Scores_Title")

# Verify the final dataframes
print("X_train_task2 shape:", X_train_task2.shape)
print("X_test_task2 shape:", X_test_task2.shape)

Results

LightGBM for Recommendation Prediction (accuracy: 0.89, weighted F1-score: 0.90)

XGBoost for Rating Prediction (accuracy: 0.60, weighted F1-score: 0.56)

For the recommendation prediction task, sentiment scores (positive, negative, and compound) and the BERT embeddings of the review text and title are influential features, followed by other demographic and categorical features. This highlights the importance of the emotional tone, semantic meaning, and background context of customer reviews. For the star rating prediction task, the sentiment expressed in the review is again the most important predictor, alongside specific dimensions of the TF-IDF vectors. Together, these results suggest that how a customer feels about a product is a strong proxy for both their recommendation behavior and their star rating.

Both tasks faced challenges due to class imbalance. For example, in the star rating prediction task, the model performed well for the majority class (5-star ratings) but struggled with the minority classes (1-star to 4-star ratings). This is a common issue in real-world datasets, where extreme ratings (very positive or very negative) are easier to detect than moderate or neutral ratings.
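
The weighted F1-score reported above averages the per-class F1 values, weighting each class by its share of the true labels. The sketch below works through the arithmetic on ten hypothetical ratings (not the project's data) in which the classifier over-predicts 5 stars; weighted_f1 is an illustrative helper, and scikit-learn's f1_score with average="weighted" follows the same definition.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its number of true samples
        score += f1 * support[label] / total
    return score


# Hypothetical labels: six 5-star reviews and a handful of lower ratings
y_true = [5, 5, 5, 5, 5, 5, 4, 4, 3, 1]
y_pred = [5, 5, 5, 5, 5, 5, 5, 4, 5, 5]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = weighted_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, weighted F1={score:.2f}")
```

In this toy case the 3-star and 1-star reviews are never predicted, so their F1 is zero and the weighted F1 falls below accuracy, the same direction as the gap seen in the rating task.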

Discussion & Implications

The analysis of the LightGBM and XGBoost models for predicting customer recommendations and star ratings provides valuable insights into the strengths and limitations of these approaches in a business setting. First, insights from sentiment analysis can help businesses identify specific pain points or areas of improvement. For instance, negative sentiment in reviews about product quality, fit, or delivery can guide product development and operational changes. Second, accurately predicting customer recommendations and star ratings can help businesses identify satisfied and dissatisfied customers. Specifically, customers who give low ratings or do not recommend a product may be at risk of churn. Addressing their concerns (e.g., through targeted outreach or product improvements) can improve customer retention and loyalty. Furthermore, understanding the factors that drive recommendations and high ratings can inform marketing strategies. For example, businesses can highlight features or aspects of products that are frequently praised in reviews, or they can personalize recommendations based on customer sentiment and preferences. However, the models' poor performance on low ratings and non-recommendations is a significant limitation because failing to identify dissatisfied customers can result in missed opportunities to address issues and improve customer experience.

To unlock the full potential of these models to improve customer satisfaction, drive product improvements, and enhance marketing strategies, businesses can:

  • Address class imbalance by oversampling minority classes or undersampling majority classes, using class weights during model training, or applying advanced methods like SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples for minority classes.

  • Leverage ordinal classification techniques, such as ordinal logistic regression or models with ordinal-aware loss functions.

  • Invest in more advanced feature engineering, including domain-specific sentiment analysis and topic modeling, to capture additional nuances in customer reviews while incorporating customer demographics (e.g., age, location, purchase history) to provide additional context for predictions.
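
The first two suggestions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not part of the project's code: balanced_class_weights uses the common n_samples / (n_classes * class_count) heuristic, and the two cumulative helpers show one well-known way to make a classifier ordinal-aware by predicting "rating > t" for each threshold t. All function names and the sample ratings are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def to_cumulative_targets(rating, n_levels=5):
    """Encode a 1..n_levels rating as binary answers to 'rating > t'."""
    return [1 if rating > t else 0 for t in range(1, n_levels)]


def from_cumulative_probs(probs, threshold=0.5):
    """Decode predicted P(rating > t) values back into a single rating."""
    return 1 + sum(1 for p in probs if p > threshold)


# Hypothetical, imbalanced ratings dominated by 5 stars
ratings = [5, 5, 5, 5, 5, 5, 4, 4, 3, 1]
weights = balanced_class_weights(ratings)
print(weights)

encoded = to_cumulative_targets(4)
decoded = from_cumulative_probs([0.9, 0.8, 0.6, 0.2])
print(encoded, decoded)
```

Rare ratings receive larger weights, so errors on them cost more during training, while the cumulative encoding lets one binary model per threshold respect the fact that a 4-star review is closer to 5 stars than to 1 star.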

© 2025. All rights Reserved.
