Amazon Fake Review Machine Learning Classification
Project Overview
For this project, the student team (Aditya Agrawal, Ester Tsai, Kelly Park, Sukanya Krishna) was given manay large datasets of Amazon product reviews, which contain details such as the review title, review text body, star rating, how many people voted “helpful,” and the time the review was posted.
Verified = product was bought on Amazon
Unverified = product not bought on Amazon but might be bought from 3rd party seller
We sought to answer the question:
How well can we predict whether a review comes from a verified purchase or not?
Dataset Description
The team used US Amazon Customer Reviews datasets from Amazon archives. The reviews were taken over the span of the years 2014 - 2015.
The collection of reviews is organized into sub datasets for each product category. The collection contained datasets for 46 different product categories, and each dataset contains 15 features for each review.
Analysis
KNN Classification
- Assigned weights to parameters or the contributions of each data point referencing a review.
- Used weights so that nearer neighbors contribute more than distant ones (weight based on distance formula)
- Applied VADER sentiment analysis to compute the polarity scores (sentiment) for the reviews
- Cons: “majority voting” when class distribution is skewed so need to weight classification
Random Forest Classification + Bigrams
- Cleaned review body (remove stop words, html tags, punctuations, numbers; lowercase + lemmatize)
- Converted review into list of bigrams (each pair of words)
- Engineered features to quantify “fakeness”
- Random forest classifier: each node of tree considers a certain number of features → prediction made by averaging over the predictions of each tree
- Pros: the ONLY input required is the review body. Does not need star rating or number of helpful votes.
- Cons: low accuracy + cannot predict well for one-word reviews
EXAMPLE UNVERIFIED REVIEW:

What the Features Mean:

Jupyter Notebook Demo

Steamlit Demo

Results
Final Poster Presentation
