Lab 4: Lending Club
OIDD 2450 Tambe

Objective and Background
One industry in which machine learning has had a significant impact is financial lending (e.g. credit-scoring models). In this lab, you will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use debtor attributes to predict the grades given to loan applications by LendingClub. This data set may be familiar, for example from a finance or statistics class, and it is a particularly good context in which to try some of the prediction concepts we have covered in the last few sessions.
You will be exploring how to build prediction models for this industry. To accomplish this objective, you are asked to build models using historical data from 2013 to 2014 (which will be used as the training data) and test the quality of your prediction model using new loan requests that came in during 2015 (the test data) to evaluate how well it performs.
Deliverables
Please submit both your R code and the answers to the questions posed below. You can choose to submit a notebook file that contains both code and answers or you can submit an R-script and a separate .pdf file that contains answers.
Data
- The training data is the LendingClub historical data on loans and chargeoffs from 2013 through 2014. (175 MB)
- The test data is LendingClub applications that were received in 2015. (309 MB) (this data set is only required for Parts E and F)
- You will also need the data dictionary which explains the meanings of the different fields in the files.
Exercises
1. Data loading and cleanup.
- The data have a messy first row that will cause errors when you read in the file. To discard the first row, use the `skip=1` option in `read.csv()` to tell it to skip the first row when reading the file. There is a similar option if you are using the `read_csv` command. Given the size of these files, I would recommend using the `read_csv` function, because it is much faster.
- Note that there are two empty rows at the end of each file. This messy data will cause you problems if you run regressions that try to use these rows. Remove the last two rows from each data set before getting started.
- Finally, since we will be running models that rely on random number generation, add the `set.seed()` command somewhere near the top of your file, using any number you choose as the argument. This will ensure you get the same results every time you run your code. If you are working with others, using the same seed value should ensure that your models produce the same results, assuming everything else you do is also the same.
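Putting these steps together, a minimal loading sketch might look like the following (the file names are placeholders; substitute the names of the files you downloaded):

```r
# readr provides the fast read_csv function
library(readr)

# Set a seed so results that rely on random number generation are reproducible
set.seed(42)

# skip = 1 discards the messy first row of each file
train = read_csv("LoanStats_2013_2014.csv", skip = 1)
test  = read_csv("LoanStats_2015.csv", skip = 1)

# Drop the two empty rows at the end of each data set
train = train[1:(nrow(train) - 2), ]
test  = test[1:(nrow(test) - 2), ]
```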
2. Descriptive statistics.
- Using the training data, create a new binary variable called `highgrade` that takes the value 1 when a loan has received an "A" or "B" grade and 0 otherwise. This is the response variable that you will try to predict. What proportion of loans in the training data receive either an "A" or "B" grade?
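One way to construct this variable, assuming the letter grade column is named `grade` as in the data dictionary:

```r
# highgrade = 1 for loans graded "A" or "B", 0 otherwise
train$highgrade = ifelse(train$grade %in% c("A", "B"), 1, 0)

# The mean of a 0/1 variable is the proportion of 1s
mean(train$highgrade)
```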
3. Build a logistic regression classifier using the training data.
- Using the `glm` command as discussed in an earlier class session, perform a logistic regression that predicts the `highgrade` outcome variable using the variables: annual income, home ownership, loan amount, verification status, and purpose. Use the `summary` command to view the output of your regression.
- Next, use the `predict` command to generate a vector of the probabilities predicted by your logistic regression. Each probability can be interpreted as the likelihood that `highgrade` is 1 for that row. When using the `predict` function, remember to use `type='response'` so that its output can be interpreted as probabilities between 0 and 1. If you forget to do that, `predict` will produce log-odds, which can be negative values.
- Create a new column that contains your predictions that classifies loans as being highgrade or not (1 or 0), based on your predicted probabilities. To do this, you will need to choose a probability threshold above which to classify loans as being high grade. Although – as we discussed in class – there are a variety of criteria to use when choosing this value, for now, try different values to select one that roughly optimizes the accuracy of the classifier (you do not need to go beyond the two-digit level), where accuracy is defined in the next bullet.
- To evaluate how well this logistic regression-based classifier performs, we can measure its accuracy, defined as the proportion of answers in the training data that it gets correct. In other words, this is the proportion of rows in which the classifier's prediction equals the actual `highgrade` value, as assessed by the original loan officers.
- What is the accuracy of this classifier on the training data?
- For comparison, what is the accuracy of a classifier that assigns a value of 0 to all rows for the predicted class?
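A sketch of the full workflow for this exercise follows. The predictor column names (e.g. `annual_inc`, `loan_amnt`) are taken from the data dictionary and may differ slightly in your file, and the 0.5 threshold is only a starting point to tune:

```r
# Fit the logistic regression on the training data
logit_fit = glm(highgrade ~ annual_inc + home_ownership + loan_amnt +
                  verification_status + purpose,
                data = train, family = binomial)
summary(logit_fit)

# Predicted probabilities that highgrade == 1 (type = "response" gives
# probabilities rather than log-odds)
train$prob = predict(logit_fit, type = "response")

# Classify using a probability threshold; try different values here
threshold = 0.5
train$pred = ifelse(train$prob > threshold, 1, 0)

# Accuracy: proportion of rows where the prediction matches the label
mean(train$pred == train$highgrade)

# Baseline: a classifier that predicts 0 for every row
mean(train$highgrade == 0)
```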
4. Build a classification tree on the training data.
As an alternative to logistic regression, we can use a "classification tree" to classify loans according to whether or not they are likely to have an "A" or "B" grade. A widely used implementation of decision trees in R is the `rpart` library, whose name stands for "Recursive Partitioning and Regression Trees". Install the rpart package and don't forget to use the `library` command to include the library wherever you are executing your code.
library(rpart)
The syntax for running a classification tree using the rpart library is similar to that of a regression. For example, if your data set is called loans and, within that data set, you have a binary dependent variable y that you are trying to predict using two predictors x1 and x2, the R command to build a classification model using `rpart` would have the following syntax.
fit = rpart(y ~ x1 + x2, data = loans, method = "class")
In this example, the `method` option tells `rpart` that this is a classification tree, which has binary outcomes, not a regression tree, which has continuous outcomes. To visualize the classification tree you have built:
plot(fit)
text(fit)
If you are having trouble with the tree visualization, you may want to explore other packages, such as `rpart.plot`, that do a much better job with decision tree visualization.
Now that you have built this classification tree, you can predict values in the same way as with regression, using the `predict` command. The result (in this case, stored in the vector z) will hold the binary values indicating whether your classification tree predicts that a loan will be high grade.
z = predict(fit, type="class")
Using the accuracy metric described above, is this decision tree classifier more or less accurate than the one using logistic regression?
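The tree's accuracy can be computed with the same metric as before. One caveat: with `type="class"`, `predict` returns a factor, so convert it before comparing against the 0/1 label. A sketch, using the same (assumed) predictor names as the logistic model:

```r
library(rpart)

# Fit the classification tree on the training data
tree_fit = rpart(highgrade ~ annual_inc + home_ownership + loan_amnt +
                   verification_status + purpose,
                 data = train, method = "class")

# Predicted classes on the training data (a factor with levels "0"/"1")
z = predict(tree_fit, type = "class")

# Accuracy of the tree classifier
mean(as.numeric(as.character(z)) == train$highgrade)
```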
5. Build a Naive Bayes model on the training data.
As another alternative, we can use a ["Naive Bayes"](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) algorithm to classify loans according to whether or not they are likely to have an "A" or "B" grade. A widely used implementation of Naive Bayes in R is the `naivebayes` library. Install the package and don't forget to use the `library` command to include the library wherever you are executing your code. In this question, the functions to use are purposefully left unspecified. You are expected to look up the appropriate documentation to learn which functions from the library to use, and how to call them.
6. Compute predictor performance on the test data.
Classifiers such as the ones described above are built on training data (e.g. historical data), where the "correct" answers are available so that the model can be calibrated. Once you are happy with a model's performance, it is then assessed on test data, where the correct answers are generally not available (e.g. future data). In general, a classifier will not perform as well on the test data as it did on the training data, because it was calibrated for optimal performance on the training data.
- Evaluate the accuracy of all three of the classifiers you built above (logistic regression, classification tree, and Naive Bayes) on the test data. Note that there is an extra `purpose` category in the test data called "educational". You should remove those rows from the test data to get your models to fit the new data set. To report your results, create a table reporting accuracy for each algorithm. Which algorithm generates the highest test data accuracy for this data context?
- As a benchmark to use as a comparison, what is the accuracy of a classifier that simply assigns a value of 0 to all rows of the test data?
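A sketch of the test-data evaluation, assuming the fitted models from earlier are named `logit_fit` and `tree_fit` (your Naive Bayes model would be evaluated the same way) and that the column names match the training data:

```r
# Drop the "educational" purpose category, which never appears in training
test = test[test$purpose != "educational", ]

# Recreate the response variable on the test data
test$highgrade = ifelse(test$grade %in% c("A", "B"), 1, 0)

# Test-set predictions from each model (newdata applies the model to test)
threshold = 0.5
logit_pred = ifelse(predict(logit_fit, newdata = test,
                            type = "response") > threshold, 1, 0)
tree_pred = predict(tree_fit, newdata = test, type = "class")

# Accuracy table for the classifiers
data.frame(model = c("logistic", "tree"),
           accuracy = c(mean(logit_pred == test$highgrade),
                        mean(as.numeric(as.character(tree_pred)) ==
                               test$highgrade)))

# Benchmark: the all-zeros classifier on the test data
mean(test$highgrade == 0)
```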
7. Beyond accuracy as a performance metric.
The accuracy metric described above is the fraction of responses the algorithm correctly predicts. Suppose, instead, that the lending organization is able to quantify the different returns to originating good and bad loans and decides to focus on which algorithms generate the highest overall returns given these figures.
a. Suppose that highgrade loans that are correctly identified as such earn the lender $40, but that loans that are incorrectly predicted to be highgrade cost them $15. Based on test data performance, which classifier would the lender want to use?
b. What if these numbers were reversed - i.e. a highgrade loan that is correctly identified as highgrade earns the lender $15, but incorrectly classifying a loan as highgrade costs $40. Under these circumstances, based on test data performance, which classifier would the lender want to use?
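One way to compute the expected return for a given classifier from its test-set predictions; here `pred` is a placeholder for that classifier's 0/1 predictions on the test data:

```r
# True positives: predicted highgrade and actually highgrade
tp = sum(pred == 1 & test$highgrade == 1)
# False positives: predicted highgrade but actually not
fp = sum(pred == 1 & test$highgrade == 0)

# Total return under the payoffs in part (a): +$40 per correct highgrade
# prediction, -$15 per incorrect one
return_a = 40 * tp - 15 * fp

# Payoffs reversed, as in part (b)
return_b = 15 * tp - 40 * fp
```

Comparing `return_a` (and then `return_b`) across the three classifiers identifies which one the lender would prefer under each payoff structure.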