Lab 5: Transcend
OIDD 245 Tambe
Objective:
To provide practice with string manipulation and text mining, and to explore questions of competitive industry analysis using text analysis methods.
Due Date and Deliverables:
Prepare either an R-script and Word document or R notebook output (“knitted to HTML”) with answers to the following questions. You are encouraged to work in small groups, but submit your own, individual solutions to Canvas. You are not expected to work on the Lab outside of the two class sessions. Submit what you have completed by the end of class on Thursday and any reasonable progress receives full credit.
Background:
The VP of US eCommerce at Transcend (a Taiwanese electronics company) has witnessed sales for its SD (Secure Digital) card product on Amazon eroding from 2013 to 2015 (the “present”). This VP says that although the star score has increased, sales have been falling. The VP claims that to his knowledge, Transcend has been the category sales leader on Amazon since it joined Amazon in 2010. He has approached your team to diagnose what may be going on in the Amazon ecosystem, and to start with a competitive analysis. A goal of this exercise is to use text analysis to learn something about the marketplace in which Transcend is competing and the strengths and weaknesses of its competitors.
Data:
The data to be analyzed were originally taken from Julian McAuley’s curated Amazon Review Dataset (http://jmcauley.ucsd.edu/data/amazon/). In that data set, there are over 1.6 million reviews for electronics products from May 1996 – July 2014, and its key features include: i) Amazon ASIN number (i.e. product /product family ID), ii) review text, iii) star rating score, iv) the review date, and v) helpfulness of the review. This is a large data set, probably too large to work with effectively. A smaller and more tractable data set to work with is available here in csv form.
Part 1. Identify competitors in the Amazon SD marketplace
We are interested in reviews about SD cards because our client is an SD card manufacturer. We do not know which SD cards are the top selling competitors, but we can quickly identify them by searching for “SD” in the review text. Then, after isolating potential candidates, we can use an Amazon search to find product details.
- Step 1: Read in the .csv file of electronic reviews.
- Step 2: Filter for reviews containing the term “sd” using stringr functions and regular expressions as needed. To do this, find reviews that have “sd” or “SD” with a word boundary before and after. The word-boundary restriction avoids erroneously matching strings like “ssd” or “misdetection”. Then, group, count, and sort the reviews you identify by ASIN to identify the ASINs that are likely to be related to SD cards.
- Step 3: Of the top three of these ASINs, one is a Transcend product and the other two are from competitors. Search for each of these three ASINs on Amazon (or through Google) to identify the product manufacturers.
Feel free to try this section on your own, but because we will only be working on this lab in class, solutions for this part are provided in the space below so you can get going on the next sections relatively quickly.
library(stringr)
library(dplyr)
library(magrittr)
library(readr)
# modify your directory path to wherever you downloaded the file in the line below
reviews = read_csv("~/Downloads/electronics_downsample.csv")
Parsed with column specification:
cols(
X1 = col_integer(),
asin = col_character(),
helpful = col_character(),
overall = col_integer(),
reviewText = col_character(),
reviewTime = col_character(),
reviewerID = col_character(),
reviewerName = col_character(),
summary = col_character(),
unixReviewTime = col_integer()
)
Warning message:
Missing column names filled in: 'X1' [1]
rows = str_detect(reviews$reviewText, "\\b(sd|SD)\\b")
rows[is.na(rows)] = FALSE  # reviews with missing text should not match
reviews[rows,] %>%
group_by(asin) %>%
tally() %>%
arrange(desc(n)) %>%
data.frame() %>%
slice(1:3)
asin n
1 B007WTAJTO 576
2 B002WE6D44 214
3 B000VX6XL6 192
Part 2. Exploratory analysis:
For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these three products, not just the ones you identified in your regex match in Part 1.
- A. What are the average overall number of stars for each of the three products identified in the previous part?
- B. What are the average sentiment scores for the reviews of each of these three products? For this part, use a sentiment analysis package (e.g. syuzhet or tidytext) to analyze the sentiment of the review text and compute the average sentiment score for each of the three products. Choose a sentiment computation method that produces a numeric score (e.g. -5 to 5) rather than a category (happy, angry, sad).
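A minimal sketch for Parts A and B, assuming the reviews data frame from Part 1 is still loaded and the syuzhet package is installed. The three ASINs are copied from the Part 1 output above; the "afinn" method scores words from -5 to 5 and sums them per review, so it meets the numeric-score requirement.

```r
library(dplyr)
library(syuzhet)

# the three top ASINs identified in Part 1
top3 <- c("B007WTAJTO", "B002WE6D44", "B000VX6XL6")
sd_reviews <- reviews %>% filter(asin %in% top3)

# A. average star rating per product
sd_reviews %>%
  group_by(asin) %>%
  summarize(avg_stars = mean(overall, na.rm = TRUE))

# B. average sentiment per product (AFINN word scores summed per review)
sd_reviews %>%
  mutate(sentiment = get_sentiment(reviewText, method = "afinn")) %>%
  group_by(asin) %>%
  summarize(avg_sentiment = mean(sentiment, na.rm = TRUE))
```

Comparing the two tables side by side is often informative: a product can have a higher star average but a lower text-sentiment average, or vice versa.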
Part 3: Text exploration with wordclouds
For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these products, not just the ones you identified in your regex match in Part 1. The goal of this section is to develop word clouds that summarize the review words most highly correlated with positive and negative scores for this product category.
- Step 1: Convert the relevant reviews into a text corpus using VCorpus (located in the tm package). Clean the reviews by eliminating “stopwords”, removing whitespace, and converting words to lowercase. You may also choose to make other adjustments, such as removing punctuation and numbers.
- Step 2: Generate a document-term matrix from this corpus. Remove sparse terms using removeSparseTerms and a threshold for the sparsity parameter that leaves you with no more than 300 words. Then, attach a column of data to this matrix that includes the overall star score for each review.
- Step 3: Extract the 30 words that are most highly positively correlated with the number of stars, and the 30 words most negatively correlated with the number of stars.
- Step 4: Plot two wordclouds: One wordcloud for your list of positively correlated words and another wordcloud for your list of negatively correlated words. For each of these word clouds, the size of the words that appear in the cloud should be in proportion to the strength of the correlation between that word and the number of stars.
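One possible sketch of Steps 1–4, assuming sd_reviews holds all reviews for the three ASINs (as in Part 2) and that the tm and wordcloud packages are installed. The 0.95 sparsity threshold is only a starting guess; adjust it until dim(dtm) reports 300 or fewer terms. Rather than literally binding the star column onto the matrix, this version correlates each term's counts against sd_reviews$overall directly, which amounts to the same computation.

```r
library(tm)
library(wordcloud)

# Step 1: build and clean the corpus
corpus <- VCorpus(VectorSource(sd_reviews$reviewText))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Step 2: document-term matrix, trimmed of sparse terms
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.95)  # tune until <= 300 terms remain
dim(dtm)

# Step 3: correlate each term's counts with the star score
m <- as.matrix(dtm)
cors <- apply(m, 2, function(col) cor(col, sd_reviews$overall))
pos <- sort(cors, decreasing = TRUE)[1:30]
neg <- sort(cors)[1:30]

# Step 4: word size proportional to correlation strength
wordcloud(names(pos), pos, scale = c(3, 0.5))
wordcloud(names(neg), abs(neg), scale = c(3, 0.5))
```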
Part 4: Predicting review helpfulness
For this question, use all of the review text, not just the reviews for the three SD products.
Put yourself in the shoes of Amazon's platform designers. How do web platforms let reviewers know when they have written a good review? Amazon solves this problem by allowing reviews to be voted "Helpful" by other readers. This question asks you to build a predictive model for whether a review has at least one helpful vote or not. In the data, the number of helpful votes is the first of the two numbers in the helpful column.
Unlike previous examples we have done in this class, however, you are not given the “x” variables to use for prediction. Instead, you are asked to generate features from the review text that can help predict this binary variable. This is a process we have referred to in class as “feature engineering”. Identifying specific words that are predictive of helpfulness is one approach, but there are many others. For example, the sentiment of the review could be another possible feature, or the number of capital letters.
This question is left purposely open-ended, as there are a large number of new variables you may be able to create, and it is meant to illustrate some of the challenges that arise when you have to engineer features from raw data. You will get full credit for this question as long as you generate at least two “features” from the review data and use them to generate predictions. The deliverable for this question is the model and accuracy levels for the best feature set you can create (in terms of prediction accuracy).
- Step 1: First create a new binary variable indicating whether a review has at least one helpful vote.
- Step 2: From the review text, create “features” that you can use in a predictive logistic regression model to predict your new binary dependent variable.
- Step 3: Divide the sample into a training and test data set (first 80% and last 20%), and using the training data, build a logistic regression model to predict the helpfulness variable using the features you created.
- Step 4: If you predicted that every row was a 0 (not helpful), you would be close to 60% accurate. How much better does your model do?
- Step 5: Go back and adjust your features to see if you can build an improved model!
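The steps above can be sketched as follows, assuming the full reviews data frame from Part 1 and that the helpful column is stored as text like "[3, 5]", with the first number being the helpful-vote count as the lab states. The two features shown (review length and exclamation count) are placeholders; substitute your own engineered features.

```r
library(dplyr)
library(stringr)

# Step 1: binary outcome -- does the review have at least one helpful vote?
reviews <- reviews %>%
  mutate(helpful_votes = as.integer(str_match(helpful, "\\[(\\d+),")[, 2]),
         is_helpful = as.integer(helpful_votes >= 1))

# Step 2: two illustrative features engineered from the text
reviews <- reviews %>%
  mutate(n_chars = str_length(reviewText),
         n_exclaim = str_count(reviewText, "!"))

# Step 3: first 80% train, last 20% test; fit a logistic regression
n <- nrow(reviews)
train <- reviews[1:floor(0.8 * n), ]
test  <- reviews[(floor(0.8 * n) + 1):n, ]
model <- glm(is_helpful ~ n_chars + n_exclaim, data = train, family = binomial)

# Step 4: test-set accuracy versus the all-zeros baseline
pred <- as.integer(predict(model, test, type = "response") > 0.5)
mean(pred == test$is_helpful, na.rm = TRUE)   # model accuracy
mean(test$is_helpful == 0, na.rm = TRUE)      # baseline accuracy
```

For Step 5, iterate: add, drop, or transform features and re-check test accuracy against the baseline.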