Lab 4: Lending Club

OIDD 245 Tambe

Objective and Background

One industry in which machine learning has had a significant impact is financial lending. In this lab, you will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use borrower attributes to predict the grades LendingClub assigns to loan applications. Many of you may have seen this data set before, for example in a finance or statistics class, and it is a particularly good context in which to try some of the prediction concepts we have covered in the last few sessions.

You will be exploring how to build prediction models for this industry. To accomplish this objective, you will build models using historical data from 2013 to 2014 (the training data) and then evaluate the quality of your prediction models on new loan requests that came in during 2015 (the test data).

Deliverables

You are required to complete and submit the lab, but it will be graded on completion rather than on individual questions. The lab looks long, but much of it is explanation; if you work with others, it is certainly possible to finish the lab in a single class period.

Data

Exercises

1. Data loading and cleanup.
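A minimal loading sketch is below. The file names are placeholders (the actual files for this lab are not named in this handout), and the cleanup step is just one common approach:

```r
# Hypothetical file names -- substitute the actual files provided for the lab
train = read.csv("loans_2013_2014.csv", stringsAsFactors = FALSE)
test  = read.csv("loans_2015.csv", stringsAsFactors = FALSE)

# One simple cleanup step: drop rows with missing values
train = train[complete.cases(train), ]
```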

2. Descriptive statistics.
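As a starting point for the descriptive statistics, you might tabulate the outcome and summarize a numeric predictor. The column names here (grade, loan_amnt) are assumptions based on typical LendingClub exports; adjust them to match your data:

```r
# Distribution of loan grades in the training data
table(train$grade)

# Summary statistics for a numeric predictor, e.g. the loan amount
summary(train$loan_amnt)
```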

3. Build a logistic regression based classifier using the training data.
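A sketch of one way to set this up is below. The variable names (highgrade, x1, x2) are placeholders for the outcome and predictors you choose, and the 0.5 cutoff is just a common default:

```r
# Hypothetical outcome: 1 if the loan grade is A or B, else 0
train$highgrade = as.numeric(train$grade %in% c("A", "B"))

# Fit a logistic regression; x1 and x2 stand in for your chosen predictors
logit_fit = glm(highgrade ~ x1 + x2, data = train, family = "binomial")

# Convert predicted probabilities to 0/1 predictions with a 0.5 cutoff
p = predict(logit_fit, type = "response")
pred = as.numeric(p > 0.5)

# Training accuracy: the share of loans classified correctly
mean(pred == train$highgrade)
```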

4. Build a classification tree on the training data.

As an alternative to logistic regression, we can use a "classification tree" to classify loans according to whether or not they are likely to have an "A" or "B" grade. A widely used implementation of decision trees in R is the rpart library, short for "recursive partitioning and regression trees". Install the rpart package, and don't forget to use the library command to load it wherever you execute your code.

library(rpart)

The syntax for running a classification tree using the rpart library is similar to that of a regression. For example, if your data set is called loans and, within that data set, you have a binary dependent variable y that you are trying to predict using two predictors x1 and x2, the R command to build a classification model using rpart would have the following syntax.

fit = rpart(y ~ x1 + x2, data = loans, method = "class")

In this example, the method option tells rpart that this is a classification tree, which has binary outcomes, rather than a regression tree, which has continuous outcomes. To visualize the classification tree you have built:

plot(fit) 
text(fit)

If you are having trouble with the tree visualization, you may want to explore other packages, such as rpart.plot, that do a much better job with decision tree visualization.
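For instance, rpart.plot provides a drop-in plotting function of the same name:

```r
# install.packages("rpart.plot")  # one-time install
library(rpart.plot)

# Draws the fitted tree with readable split and leaf labels
rpart.plot(fit)
```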

Now that you have built this classification tree, you can predict values just as with regression, using the predict command. The result (here stored in the vector z) holds the binary values indicating whether or not your classification tree predicts that a loan will be high grade.

z = predict(fit, type="class")
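One way to compute accuracy from these predictions is sketched below, assuming (as in the earlier example) that the data set is called loans and the binary outcome is y:

```r
# Confusion table: predicted class vs. actual class
table(predicted = z, actual = loans$y)

# Accuracy: the fraction of training loans classified correctly
mean(z == loans$y)
```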

Using the accuracy metric described above, is this machine learning based classifier more or less accurate than the one based on logistic regression?

5. Compute performance on the test data.

Classifiers such as the ones described above are built on training data (e.g. historical data), where the "correct" answers are available so that the model can be calibrated. Once you are happy with a model's performance on the training data, it is assessed on test data (e.g. future data), where the correct answers are generally not available. In general, a classifier will not perform as well on the test data as it did on the training data, because it was calibrated for optimal performance on the training data.
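In this lab, the 2015 grades are known, so you can score the tree on the test data directly. A sketch, assuming a data frame named test with the same columns as the training data and a binary outcome y:

```r
# Predict classes for the 2015 loans the model has never seen
z_test = predict(fit, newdata = test, type = "class")

# Test-set accuracy -- compare this to the training accuracy above
mean(z_test == test$y)
```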