Lab 5: Transcend

OIDD 245 Tambe


The objective of this lab is to practice string manipulation and text mining, and to explore questions of competitive industry analysis using text analysis methods.

Due Date and Deliverables:

Prepare either an R script and Word document or R notebook output (“knitted to HTML”) with answers to the following questions. You are encouraged to work in small groups, but submit your own individual solutions to Canvas. You are not expected to work on the Lab outside of the two class sessions. Submit what you have completed by the end of class on Thursday; any reasonable progress receives full credit.


The VP of US eCommerce at Transcend (a Taiwanese electronics company) has seen sales of its SD (Secure Digital) card product on Amazon erode from 2013 to 2015 (the “present”). The VP says that although the product's star score has increased, sales have been falling. The VP also claims that, to his knowledge, Transcend has been the category sales leader on Amazon since it joined Amazon in 2010. He has approached your team to diagnose what may be going on in the Amazon ecosystem, starting with a competitive analysis. A goal of this exercise is to use text analysis to learn something about the marketplace in which Transcend is competing and the strengths and weaknesses of its competitors.


The data to be analyzed were originally taken from Julian McAuley’s curated Amazon Review Dataset. That data set contains over 1.6 million reviews for electronics products from May 1996 – July 2014, and its key features include: i) Amazon ASIN number (i.e., product/product family ID), ii) review text, iii) star rating score, iv) the review date, and v) helpfulness of the review. The full data set is probably too large to work with effectively. A smaller and more tractable data set to work with is available here in CSV form.

Part 1. Identify competitors in the Amazon SD marketplace

We are interested in reviews about SD cards because our client is an SD card manufacturer. We do not know which SD cards are the top selling competitors, but we can quickly identify them by searching for “SD” in the review text. Then, after isolating potential candidates, we can use an Amazon search to find product details.

Feel free to try this section on your own, but because we will only be working on this lab in class, solutions for this part are provided in the space below so you can get going on the next sections relatively quickly.


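# load the packages used below: readr (read_csv), stringr (str_detect), dplyr (%>%, group_by, tally, arrange)
library(readr)
library(stringr)
library(dplyr)

# modify your directory path to wherever you downloaded the file in the line below
reviews = read_csv("~/Downloads/electronics_downsample.csv")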
Parsed with column specification:
  X1 = col_integer(),
  asin = col_character(),
  helpful = col_character(),
  overall = col_integer(),
  reviewText = col_character(),
  reviewTime = col_character(),
  reviewerID = col_character(),
  reviewerName = col_character(),
  summary = col_character(),
  unixReviewTime = col_integer()
Warning message:
Missing column names filled in: 'X1' [1]
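# flag reviews whose text mentions "sd" or "SD" as a whole word
rows = str_detect(reviews$reviewText, "\\b(sd|SD)\\b")

# count flagged reviews per product and show the three most-mentioned ASINs
reviews[rows,] %>% 
   group_by(asin) %>% 
   tally() %>% 
   arrange(desc(n)) %>% 
   data.frame() %>% 
   head(3)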
        asin   n
1 B007WTAJTO 576
2 B002WE6D44 214
3 B000VX6XL6 192
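
To see what the word-boundary pattern in the solution above does and does not catch, here is a minimal sketch on a few invented strings. The toy strings are made up for illustration and are not from the data set, and base R's grepl() is used as a stand-in for str_detect() so the snippet runs without any packages.

```r
# Invented example strings (not taken from the review data)
toy_text <- c("Great SD card",
              "works in my sd slot",
              "microSDHC adapter",
              "HSDPA modem",
              "I bought an SD.")

# \\b requires a word boundary on both sides, so "SD" embedded in a
# longer token such as "microSDHC" or "HSDPA" is not counted
matches <- grepl("\\b(sd|SD)\\b", toy_text, perl = TRUE)

print(data.frame(toy_text, matches))
```

Note that the pattern only covers the all-lowercase and all-uppercase spellings, so a review that writes "Sd" would be missed.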

Part 2. Exploratory analysis

For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these three products, not just the ones you identified in your regex match in Part 1.
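
One way to build the working subset described above is sketched below. The small data frame is an invented stand-in for the reviews table (only the asin and overall column names come from the real file), and "NOT_AN_SD_ASIN" is a hypothetical placeholder ID.

```r
# The three ASINs identified in Part 1
top_asins <- c("B007WTAJTO", "B002WE6D44", "B000VX6XL6")

# Invented stand-in for the full reviews table
toy_reviews <- data.frame(
  asin    = c("B007WTAJTO", "NOT_AN_SD_ASIN", "B000VX6XL6",
              "B002WE6D44", "B007WTAJTO"),
  overall = c(5, 3, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Keep every review for the three products, whether or not its text
# matched the "SD" pattern
sd_reviews <- toy_reviews[toy_reviews$asin %in% top_asins, ]

# Review count and mean star rating per product
review_counts <- table(sd_reviews$asin)
mean_stars    <- aggregate(overall ~ asin, data = sd_reviews, FUN = mean)
print(review_counts)
print(mean_stars)
```

Filtering on asin rather than reusing the rows vector from Part 1 is what keeps the reviews that never mention "SD" in their text.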

Part 3. Text exploration with wordclouds

For this section, use only the reviews for the three product ASINs identified above. Use all the reviews for these products, not just the ones you identified in your regex match in Part 1. The goal of this section is to develop word clouds that summarize the review words most highly correlated with positive and negative scores for this product category.
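
As a rough sketch of what "words correlated with scores" can mean, the base R snippet below builds a word-presence matrix for six invented reviews and correlates each word's presence with the star rating. The review texts and ratings are made up, and this is only one possible approach, not necessarily the one intended for the lab.

```r
# Invented reviews and star ratings (not from the data set)
toy <- data.frame(
  overall = c(5, 5, 4, 1, 1, 2),
  reviewText = c("fast and reliable card",
                 "fast transfer, great price",
                 "great card, fast enough",
                 "card failed after a week",
                 "failed and corrupted my photos",
                 "slow and it failed"),
  stringsAsFactors = FALSE
)

# Lowercase and split on anything that is not a letter
tokens <- strsplit(tolower(toy$reviewText), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
vocab  <- vocab[nchar(vocab) > 0]

# One row per review, one column per word: 1 if the word appears, else 0
dtm <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(dtm) <- vocab

# Drop words that appear in every review (zero variance), then
# correlate each word's presence with the star rating
varies   <- apply(dtm, 2, sd) > 0
word_cor <- cor(dtm[, varies, drop = FALSE], toy$overall)[, 1]

top_positive <- head(sort(word_cor, decreasing = TRUE), 3)
top_negative <- head(sort(word_cor), 3)
print(top_positive)
print(top_negative)
```

The resulting named vectors could then be passed to a plotting function such as wordcloud(words, freq) from the wordcloud package, with the absolute correlation standing in for frequency. On the real data you would likely also want to remove stop words and very rare words first.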

Part 4. Predicting review helpfulness

For this question, use all of the review text, not just the reviews for the three SD products.

Put yourself in the shoes of Amazon's platform designers. How do web platforms let reviewers know when they have written a good review? Amazon solves this problem by allowing reviews to be voted "Helpful" by other readers. This question asks you to build a predictive model for whether a review has at least one helpful vote or not. In the data, the number of helpful votes is the first of the two numbers in the helpful column.
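
The binary target can be derived from the helpful column along the lines of the sketch below. It assumes each entry is a string of the form "[2, 3]", which is an assumption about the file's format; check a few rows of your own data before relying on it.

```r
# Invented values in the assumed "[helpful_votes, total_votes]" format
helpful <- c("[0, 0]", "[2, 3]", "[15, 20]", "[0, 4]")

# Pull out the first run of digits, i.e. the number of helpful votes
helpful_votes <- as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", helpful))

# 1 if the review has at least one helpful vote, 0 otherwise
is_helpful <- as.integer(helpful_votes >= 1)

print(data.frame(helpful, helpful_votes, is_helpful))
```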

Unlike previous examples we have done in this class, however, you are not given the “x” variables to use for prediction. Instead, you are asked to generate features from the review text that can help predict this binary variable. This is a process we have referred to in class as “feature engineering”. Identifying specific words that are predictive of helpfulness is one approach, but there are many others. For example, the sentiment of the review could be another possible feature, as could the number of capital letters.
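
To make the idea concrete, here is a minimal base R sketch of a few simple text-derived features. The three strings are invented, and the particular features (length, word count, capital letters, exclamation marks) are illustrative choices only.

```r
# Invented review texts (not from the data set)
review_text <- c("GREAT card!!",
                 "ok",
                 "Bought this SD card for my camera. Works fine.")

# Length of the review in characters
n_chars <- nchar(review_text)

# Number of whitespace-separated words
n_words <- lengths(strsplit(trimws(review_text), "\\s+"))

# Number of capital letters (perl = TRUE keeps [A-Z] locale-independent)
n_caps <- lengths(regmatches(review_text,
                             gregexpr("[A-Z]", review_text, perl = TRUE)))

# Number of exclamation marks
n_exclaim <- lengths(regmatches(review_text,
                                gregexpr("!", review_text, fixed = TRUE)))

features <- data.frame(n_chars, n_words, n_caps, n_exclaim)
print(features)
```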

This question is left purposely open-ended, as there are a large number of new variables you may be able to create, and it is meant to illustrate some of the challenges that arise when you have to engineer features from raw data. You will get full credit for this question as long as you generate at least two “features” from the review data and use them to generate predictions. The deliverable for this question is the model and accuracy levels for the best feature set you can create (in terms of prediction accuracy).
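
The overall workflow (features, model, held-out accuracy) might look like the sketch below. The data are simulated purely so the code runs on its own: the link between word count and helpfulness is built into the simulation and says nothing about the real reviews. Logistic regression via glm() is just one possible model choice.

```r
set.seed(245)

# Simulated features for 400 hypothetical reviews
n       <- 400
n_words <- rpois(n, 40)
n_caps  <- rpois(n, 3)

# Simulated label: longer reviews are made more likely to be "helpful"
# (an assumption of this toy simulation only)
is_helpful <- rbinom(n, 1, plogis(-2 + 0.05 * n_words))
toy <- data.frame(is_helpful, n_words, n_caps)

# Hold out 30% of the rows to estimate out-of-sample accuracy
train_idx <- sample(n, size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Logistic regression on the two engineered features
fit <- glm(is_helpful ~ n_words + n_caps, data = train_set, family = binomial)

# Classify as helpful when the predicted probability exceeds 0.5
pred_prob  <- predict(fit, newdata = test_set, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Accuracy, compared against always predicting the majority class
accuracy <- mean(pred_class == test_set$is_helpful)
baseline <- max(mean(test_set$is_helpful), 1 - mean(test_set$is_helpful))
cat("accuracy:", accuracy, " baseline:", baseline, "\n")
```

Comparing against the majority-class baseline is a useful sanity check, since a model can post a high accuracy simply because most reviews fall in one class.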