Lab 5: Yelp

OIDD 2450 Tambe

Objective:

To text-mine online reviews from Yelp to identify restaurants that can be recommended for lunch, and thereby improve ad targeting effectiveness. This lab is typical of the type of project data scientists do "in the wild".

Background:

Yelp earns revenue from business owners who pay for online advertising. Therefore, the effectiveness of its ad targeting is a key differentiator. Its 148 million restaurant reviews make it unique among existing ad platforms.

This massive database of online reviews is useful for improving ad targeting effectiveness.

Assume that Yelp prioritizes restaurants that are known to be "good for lunch" when showing ads to users who visit the website after breakfast but before 2 pm.

For most establishments in its data, Yelp has access to a binary (i.e. 0/1) variable called "GoodforLunch" that specifies whether the restaurant is a suitable lunch destination. This variable is provided (i.e. self-reported) by business owners.

However, around 15% of restaurants in its database have missing values for this attribute because business owners have not provided a value. One goal of this lab is to use the online text review data – i.e. what consumers have said about their experiences at the restaurant – to predict whether restaurants with missing values are likely to fall into the "good for lunch" category. You are therefore asked to 1) process the text in Yelp's reviews, 2) summarize these text reviews and create quantitative measures from them, and 3) use those measures to predict whether restaurants with missing data for this field are "good for lunch" based on what customers have said in their reviews (and thereby improve ad targeting effectiveness).

Data:

A sample of 20,000 online restaurant reviews from Yelp.com is available here (15 MB).

Due Date and Deliverables:

Prepare an R notebook with answers to the following questions (or an R script with an accompanying Word file containing your responses). You are encouraged to work in groups, but submit your own individual solution to Canvas.

Part 1. Summary Statistics

Provide the following summary statistics about the data set.
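
Whatever statistics are requested, a minimal starting point is to load the data and inspect it. The sketch below assumes the sample is saved as a CSV file; the file name (yelp_reviews.csv) and column names (text, stars, GoodforLunch) are placeholders, so substitute the names used in the actual download.

    # Load the review sample (file and column names are assumptions)
    reviews <- read.csv("yelp_reviews.csv", stringsAsFactors = FALSE)

    nrow(reviews)                                 # total number of reviews
    str(reviews)                                  # column names and types
    summary(reviews$stars)                        # distribution of star ratings
    mean(nchar(reviews$text))                     # average review length in characters
    table(reviews$GoodforLunch, useNA = "ifany")  # 0/1 GoodforLunch flag, including missing values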

Part 2. Exploratory Text Analysis
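
Part 3 below refers to a document-term matrix generated in this part, so a natural exploratory step is to clean the review text and tabulate word frequencies. A minimal sketch using the tm package, again assuming the review text is in a column called text:

    library(tm)

    # Build a corpus from the review text and apply standard cleaning steps
    corpus <- VCorpus(VectorSource(reviews$text))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stripWhitespace)

    # Document-term matrix; dropping very sparse terms keeps it manageable
    dtm <- DocumentTermMatrix(corpus)
    dtm <- removeSparseTerms(dtm, 0.99)

    # Most frequent words across all reviews
    word_counts <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    head(word_counts, 20)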

Part 3. Text Analytics and Prediction

We will be using a "unigram" text model (i.e. only using single words, not word combinations) to predict the value of the variable "GoodforLunch". For this exercise, the document-term matrix you generated above contains your predictor variables – i.e., the words used in the reviews – and GoodforLunch is the outcome variable to be predicted. Note: many of the following steps can be executed by calling a single function, but they involve data manipulation that may require you to think carefully about how to access the right rows and columns. Be prepared to sketch out what you are doing from step to step, and don't be afraid to try a few different things and check the output at each step along the way.
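
One possible way to set this up, assuming the dtm and reviews objects from the sketches above and that the label column is called GoodforLunch, is to combine the predictors and outcome into a single modeling data frame:

    # Keep only reviews where the owner-reported label is available
    labeled <- which(!is.na(reviews$GoodforLunch))

    # Word counts as predictors, GoodforLunch as the outcome
    X <- as.data.frame(as.matrix(dtm[labeled, ]))
    model_data <- cbind(GoodforLunch = factor(reviews$GoodforLunch[labeled]), X)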

Partition the matrix into training and test rows so you can use the test data to evaluate your model performance. Set the last 20% of your rows aside for testing, and use the first 80% to build your model as specified below.
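
A minimal sketch of this split, using one possible model (a classification tree via rpart; a logistic regression or another classifier would also work), assuming the model_data data frame built above:

    library(rpart)

    # First 80% of rows for training, last 20% held out for testing
    n      <- nrow(model_data)
    cutoff <- floor(0.8 * n)
    train  <- model_data[1:cutoff, ]
    test   <- model_data[(cutoff + 1):n, ]

    # Fit a classification tree on the word-count predictors
    fit  <- rpart(GoodforLunch ~ ., data = train, method = "class")
    pred <- predict(fit, newdata = test, type = "class")

    # Out-of-sample accuracy on the held-out 20%
    mean(pred == test$GoodforLunch)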

Part 4. Sentiment Analysis

How do web platforms let reviewers know when they have written a good review? Yelp solves this problem by allowing reviews to be voted “Useful”, “Funny”, or “Cool” by other readers. Identifying which characteristics of reviews tend to garner these “UFC” votes from other readers has been the subject of some discussion among Yelp reviewers (e.g. 1, 2, 3). Using a sentiment analysis package (e.g. syuzhet or tidytext), analyze whether reviews that have more emotional valence (i.e. sentiment) tend to garner more UFC votes.
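
A minimal sketch using syuzhet (the text, useful, funny, and cool column names are assumptions; the choice of abs(sentiment) to capture emotional intensity is one option among several):

    library(syuzhet)

    # Sentiment score for each review; abs() treats strongly negative and
    # strongly positive reviews as equally "emotional"
    reviews$sentiment <- get_sentiment(reviews$text, method = "afinn")
    reviews$ufc_votes <- reviews$useful + reviews$funny + reviews$cool

    # Do more emotionally charged reviews attract more UFC votes?
    cor(abs(reviews$sentiment), reviews$ufc_votes)
    summary(lm(ufc_votes ~ abs(sentiment), data = reviews))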

Part 5. Feature Engineering to Predict Review Helpfulness

In this section, you are asked to use the review text to predict whether or not a review garners at least one “Useful” vote. However, here you are not told which “x” variables to use for prediction. Instead, you are asked to generate, from the review text, features that can help predict this binary variable. These can include the sentiment measure from Part 4, but your feature set should not be limited to sentiment.

This is a process known as “feature engineering”. Identifying specific words that are predictive of helpfulness is one approach, but there are many others. For example, you could use a measure of whether there are typographical errors, or you could count the number of capital letters. This question is left purposely open-ended, as there are a large number of new variables you may be able to create, and it is meant to illustrate the challenges of engineering features from raw data.
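
As an illustration, a hedged sketch of this workflow with a few simple features and a logistic regression (the text and useful column names are assumptions, and the specific features are examples only, not the ones you are required to use):

    # Example engineered features
    reviews$n_words    <- sapply(strsplit(reviews$text, "\\s+"), length)  # review length in words
    reviews$n_caps     <- nchar(gsub("[^A-Z]", "", reviews$text))         # number of capital letters
    reviews$n_exclaim  <- nchar(gsub("[^!]", "", reviews$text))           # number of exclamation marks
    reviews$got_useful <- as.integer(reviews$useful >= 1)                 # outcome: at least one "Useful" vote

    # Logistic regression on the engineered features plus the Part 4 sentiment score
    fit <- glm(got_useful ~ n_words + n_caps + n_exclaim + sentiment,
               data = reviews, family = binomial)

    # Accuracy at a 0.5 cutoff (use a train/test split as in Part 3 for your deliverable)
    pred <- as.integer(predict(fit, type = "response") > 0.5)
    mean(pred == reviews$got_useful)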

You should generate at least two “features” from the review data (beyond sentiment) and use them to make predictions. The deliverable for this question is the model and accuracy level for the best feature set you can create (in terms of prediction accuracy). After you have generated features that you like, please add your name, accuracy score, and the features you used to this spreadsheet. Feel free to go back, add new features to your model, and further improve the accuracy, but add each iteration as a new row in the spreadsheet and change the “Submission Number” field accordingly.