Lab 5: Yelp
OIDD 2450 Tambe

Objective:
To text mine online reviews from Yelp to identify restaurants that can be recommended for lunch, and thereby improve ad targeting effectiveness. This lab is typical of the type of project done by data scientists “in the wild”.
Background:
Yelp earns revenue from business owners who pay for online advertising, so the effectiveness of its ad targeting is a key differentiator. Among existing ad platforms, Yelp is unique in holding 148 million restaurant reviews, and this massive database of online reviews can be used to improve ad targeting effectiveness.
Assume that Yelp prioritizes restaurants that are known to be "good for lunch" when showing ads to users who visit the website after breakfast but before 2 pm.
For most establishments in their data, Yelp has access to a binary (i.e. 0/1) variable called "GoodforLunch" that specifies whether the restaurant is a suitable lunch destination. This variable is provided (i.e. self-reported) by business owners.
However, around 15% of restaurants in their database have missing values for this attribute because business owners have not provided a value. One goal of this lab is to use the online review text – i.e. what consumers have said about their experiences at the restaurant – to predict whether restaurants with missing values are likely to fall into the "good for lunch" category. You are therefore asked to 1) process the text in Yelp's reviews, 2) summarize these reviews and create quantitative measures from them, and 3) use these measures to predict whether restaurants with missing data for this field are "good for lunch" based on what customers have said in their reviews (and thereby improve ad targeting effectiveness).
Data:
A sample of 20,000 online restaurant reviews from Yelp.com is available here (15 MB).
Due Date and Deliverables:
Prepare an R notebook with answers to the following questions (or an R script with an accompanying Word file with responses to the questions). You are encouraged to work in groups, but submit your own, individual solutions to Canvas.
Part 1. Summary Statistics
Provide the following summary statistics about the data set.
- Generate a histogram of all restaurant ratings (i.e. "stars") given by users in the dataset. You can use `hist`, or `geom_histogram` if you prefer to use ggplot. (A sketch covering the Part 1 summaries appears after this list.)
- How many reviews has the average restaurant in this sample received? There are many ways to summarize a value by a group such as a restaurant id. You can use the `aggregate` function (or the `summarise` and `group_by` functions if you are using the `dplyr` package). You can also use the `table` command, convert the output to a data frame, and take means of the appropriate columns.
- How many reviews has the average user in this sample contributed?
- On average, do restaurants that have been marked as “GoodforLunch” receive a greater number of reviews?
- Do they receive a higher number of stars?
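A minimal sketch of these summaries is below, assuming the reviews have been loaded into a data frame called `yelp` with columns `stars`, `business_id`, `user_id`, and `goodforlunch` (these column names are assumptions; adjust them to match the actual file):

```r
# Assumed: 'yelp' is a data frame with columns stars, business_id,
# user_id, and goodforlunch (hypothetical names -- check your file).

# Histogram of all star ratings
hist(yelp$stars, main = "Distribution of Ratings", xlab = "Stars")

# Reviews per restaurant, then the average across restaurants
reviews_per_rest <- aggregate(stars ~ business_id, data = yelp, FUN = length)
mean(reviews_per_rest$stars)

# Reviews per user
reviews_per_user <- aggregate(stars ~ user_id, data = yelp, FUN = length)
mean(reviews_per_user$stars)

# Review counts and mean stars by the GoodforLunch flag
rest_level <- aggregate(stars ~ business_id + goodforlunch, data = yelp,
                        FUN = length)
aggregate(stars ~ goodforlunch, data = rest_level, FUN = mean)  # review counts
aggregate(stars ~ goodforlunch, data = yelp, FUN = mean)        # mean stars
```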
Part 2. Exploratory Text Analysis
- Convert the reviews into a text corpus using `VCorpus` (located in the `tm` package).
- Clean the reviews by eliminating "stopwords", removing whitespace, and converting words to lowercase. You may also choose to make other adjustments, such as removing punctuation and numbers.
- Generate a document-term matrix from the Corpus.
- Use the document-term matrix to provide answers to the following questions.
- Provide a list of the ten most frequently appearing words among all reviews. To do this, compute the column sums of the document-term matrix for each word, and report the ten words with the highest sums.
- Generate a word cloud using the document-term matrix, where each word is sized by its frequency in the reviews (this sizing is the default). Limit the cloud to roughly 100 words (an option that can be passed to the `wordcloud` function). A sketch of these steps appears below.
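As a sketch, the Part 2 pipeline might look like the following, assuming the review text lives in a `yelp$text` column (an assumption):

```r
library(tm)
library(wordcloud)

# Build and clean the corpus
corpus <- VCorpus(VectorSource(yelp$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix and word frequencies
dtm <- DocumentTermMatrix(corpus)
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing = TRUE), 10)   # ten most frequent words

# Word cloud sized by frequency, capped at ~100 words
wordcloud(names(freq), freq, max.words = 100)
```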
Part 3. Text analytics and prediction
We will be using a "unigram" text model (i.e. only using single words, not word combinations) to predict the value of the variable "GoodforLunch". For this exercise, the document-term matrix you generated above contains your predictor variables – i.e., the words used in the reviews – and GoodforLunch is the outcome variable to be predicted. Note: many of the following steps can be executed by calling a single function, but they involve data manipulation that may require you to think carefully about how to access the right rows and columns. Be prepared to sketch out what you are doing from step to step, and don't be afraid to try a few different things and check the output at each step along the way.
- How many unique terms are in your document-term matrix?
- Use the following steps, which we discussed in class, to further narrow the list to the 200 words with the most predictive power (a sketch of these steps appears after this list).
- Remove sparse terms using `removeSparseTerms` and a 0.990 threshold for the sparsity parameter.
- Calculate the correlations between the words in the document-term matrix and GoodforLunch.
- Keep only the 200 terms with the highest correlation magnitudes (i.e. positive or negative)
- Create a new document-term matrix containing only these words.
- Using the top 20 positive and negative words in terms of correlation strength, generate a wordcloud where size corresponds to correlation strength, and where positive and negative terms are shown in different colors. Explore the `wordcloud` help and the options the `wordcloud` function takes to figure out how to do this.
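A sketch of these narrowing steps, assuming the `dtm` object from Part 2 and a 0/1 `yelp$goodforlunch` vector whose rows line up with the document-term matrix (both assumptions):

```r
# Drop rare terms, then correlate each remaining word with GoodforLunch
dtm_dense <- removeSparseTerms(dtm, 0.990)
X <- as.matrix(dtm_dense)
cors <- cor(X, yelp$goodforlunch)[, 1]   # named vector of correlations

# Keep the 200 terms with the largest correlation magnitudes
top200 <- names(sort(abs(cors), decreasing = TRUE))[1:200]
X200 <- X[, top200]

# Word cloud of the 20 strongest positive and 20 strongest negative terms,
# sized by |correlation| and colored by sign
pos <- head(sort(cors[cors > 0], decreasing = TRUE), 20)
neg <- head(sort(cors[cors < 0]), 20)
both <- c(pos, neg)
wordcloud(names(both), abs(both),
          colors = ifelse(both > 0, "forestgreen", "firebrick"),
          ordered.colors = TRUE, random.order = FALSE)
```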
Partition the matrix into training and test rows so you can use the test data to evaluate your model performance. Set the last 20% of your rows aside for testing, and use the first 80% to build your model as specified below.
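For example, the split might be set up as follows (continuing with the hypothetical `X200` matrix from the previous sketch):

```r
# 80/20 split on row order: first 80% train, last 20% test
n <- nrow(X200)
cut <- floor(0.8 * n)
train_df <- data.frame(goodforlunch = yelp$goodforlunch[1:cut], X200[1:cut, ])
test_df  <- data.frame(goodforlunch = yelp$goodforlunch[(cut + 1):n],
                       X200[(cut + 1):n, ])
# Note: data.frame() may alter some term names to make them syntactically valid
```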
- Fit a logistic regression model to the selected variables in the training data (a sketch of these modeling steps follows this list).
- A positive coefficient predicts that a restaurant is good for lunch; a negative coefficient suggests it is not. You can use the `coef` command to access the model coefficients. Produce a word cloud that separates the top 15 positive words and the top 15 negative words, plotting the two groups in different colors. You can pass your color choices into the `wordcloud` function.
- Using the model you have generated, choose a probability threshold to maximize accuracy and classify the restaurants in your training data as 1 or 0 according to whether they are "GoodForLunch". How well does this model perform on the training data in terms of classification accuracy (i.e. the percentage of GoodForLunch values that you get correct)?
- Predict values for "GoodforLunch" in your test data. How well does the model perform in terms of classification accuracy (i.e. the percentage of GoodForLunch values that you get correct)?
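A sketch of the modeling steps under the same assumptions:

```r
# Logistic regression on the training data
fit <- glm(goodforlunch ~ ., data = train_df, family = binomial)

# Coefficient-based word cloud: top 15 positive and top 15 negative words
b <- coef(fit)[-1]                      # drop the intercept
b <- b[!is.na(b)]
top_pos <- head(sort(b[b > 0], decreasing = TRUE), 15)
top_neg <- head(sort(b[b < 0]), 15)
cc <- c(top_pos, top_neg)
wordcloud(names(cc), abs(cc),
          colors = ifelse(cc > 0, "forestgreen", "firebrick"),
          ordered.colors = TRUE, random.order = FALSE)

# Training accuracy at a chosen threshold (0.5 here; try others to maximize)
p_train <- predict(fit, type = "response")
mean((p_train > 0.5) == train_df$goodforlunch)

# Test accuracy
p_test <- predict(fit, newdata = test_df, type = "response")
mean((p_test > 0.5) == test_df$goodforlunch)
```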
Part 4. Sentiment Analysis
How do web platforms let reviewers know when they have written a good review? Yelp solves this problem by allowing reviews to be voted “Useful”, “Funny”, or “Cool” by other readers. Identifying the characteristics of reviews that tend to garner these “UFC” votes has been the subject of some discussion among Yelp reviewers (e.g. 1, 2, 3). Using a sentiment analysis package (e.g. `syuzhet` or `tidytext`), analyze whether reviews that have more emotional valence (i.e. sentiment) tend to garner more UFC votes.
- For each review, create a combined “UFC” score by adding the funny, useful, and cool votes for each review.
- Use a sentiment analysis package (e.g. `syuzhet` or `tidytext`) to analyze the sentiment of the text of each review. Choose a sentiment computation method that produces a numeric score (e.g. -5 to 5) rather than a category (happy, angry, sad). A sketch of these steps appears after this list.
- Using this sentiment score, create a new column that identifies whether reviews in your data are above or below the median sentiment score for all of the reviews in your sample.
- Finally, create a bar chart that compares the mean UFC score for reviews that are above and below the median sentiment score. It should have only two bars!
- Optional: For a more statistically rigorous comparison, use the `t.test` command to test whether the mean UFC score differs between the high and low sentiment groups at a statistically significant level.
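One way to sketch Part 4 with `syuzhet`, assuming the vote counts are stored in columns named `useful`, `funny`, and `cool` (these names are assumptions):

```r
library(syuzhet)

# Combined UFC score per review (column names are assumptions)
yelp$ufc <- yelp$useful + yelp$funny + yelp$cool

# Numeric sentiment score per review (AFINN sums word scores in -5..5)
yelp$sentiment <- get_sentiment(yelp$text, method = "afinn")

# Flag reviews above/below the median sentiment
yelp$high_sent <- yelp$sentiment > median(yelp$sentiment)

# Two-bar comparison of mean UFC scores
means <- tapply(yelp$ufc, yelp$high_sent, mean)
barplot(means, names.arg = c("Below median", "Above median"),
        ylab = "Mean UFC score")

# Optional significance test
t.test(ufc ~ high_sent, data = yelp)
```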
Part 5: Feature engineering to predict review helpfulness
In this section, you are asked to use the review text to predict whether or not a review garners at least one “Useful” vote. Here, however, you are not told which “x” variables to use for prediction. Instead, you are asked to generate features from the review text that can help predict this binary variable. These can include the sentiment measure from Part 4, but they should not be limited to sentiment.
This process is known as “feature engineering”. Identifying specific words that are predictive of helpfulness is one approach, but there are many others. For example, you could use a measure of whether there are typographical errors, or you could count the number of capital letters. This question is left purposely open-ended, as there are a large number of new variables you may be able to create, and it is meant to illustrate the challenges that arise when you have to engineer features from raw data.
You should generate at least two “features” from the review data (beyond sentiment) and use them to generate predictions. The deliverable for this question is the model and accuracy levels for the best feature set you can create (in terms of prediction accuracy). After you have generated features that you like, please add your name, accuracy score, and the features you used to this spreadsheet. Feel free to go back and add new features to your model to further improve the accuracy, but add each iteration to a new row in the spreadsheet and change the “Submission Number” field accordingly.
- Step 1: First create a new binary variable indicating whether a review has at least one useful vote.
- Step 2: From the review text, create “features” that you can use in a logistic regression model to predict your new binary dependent variable (one possible feature set is sketched after this list).
- Step 3: Divide the sample into a training and test data set (first 80% and last 20%), and using the training data, build a logistic regression model to predict the helpfulness variable using the features you created.
- Step 4: Consider that if you predicted that every row was a 0 (not helpful) or that every row was a 1 (helpful), you would already have a baseline level of accuracy. Relative to the better of those two baselines, how much better does your model do?
- Step 5: Go back and adjust your features to see if you can build an improved model!
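A sketch of one possible Part 5 workflow; the specific features here (review length, capital letters, exclamation marks) are illustrations, not required choices:

```r
# Binary outcome: at least one "Useful" vote
yelp$helpful <- as.numeric(yelp$useful >= 1)

# Example engineered features (illustrative choices)
yelp$n_chars   <- nchar(yelp$text)                      # review length
yelp$n_caps    <- nchar(gsub("[^A-Z]", "", yelp$text))  # capital letters
yelp$n_exclaim <- nchar(gsub("[^!]", "", yelp$text))    # exclamation marks

# 80/20 split by row order
n <- nrow(yelp)
cut <- floor(0.8 * n)
train <- yelp[1:cut, ]
test  <- yelp[(cut + 1):n, ]

fit <- glm(helpful ~ sentiment + n_chars + n_caps + n_exclaim,
           data = train, family = binomial)

# Model accuracy on the test data vs. the all-0 / all-1 baselines
p <- predict(fit, newdata = test, type = "response")
mean((p > 0.5) == test$helpful)                        # model accuracy
max(mean(test$helpful == 0), mean(test$helpful == 1))  # best naive baseline
```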