Lab 4: Lending Club

OIDD 2450 Tambe

Objective and Background

One industry in which machine learning has had a significant impact is financial lending (e.g. credit-scoring models). In this lab, you will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use debtor attributes to predict the grades LendingClub assigns to loan applications. This data set may be familiar, for example from a finance or statistics class, and it is a particularly good context in which to try some of the prediction concepts we have covered in the last few sessions.

You will explore how to build prediction models for this industry. Specifically, you are asked to build models using historical data from 2013 to 2014 (the training data) and then evaluate how well they perform on new loan requests that came in during 2015 (the test data).

Deliverables

Please submit both your R code and your answers to the questions posed below. You can either submit a notebook file that contains both code and answers, or submit an R script and a separate .pdf file containing your answers.

Data

Exercises

1. Data loading and cleanup.

2. Descriptive statistics.

3. Build a logistic regression classifier using the training data.

4. Build a classification tree on the training data.

As an alternative to logistic regression, we can use a "classification tree" to classify loans according to whether or not they are likely to have an "A" or "B" grade. A widely used implementation of decision trees in R is the rpart library, which stands for "Recursive partitioning and regression trees". Install the rpart package, and don't forget to use the library command to load it wherever you are executing your code.

library(rpart)

The syntax for running a classification tree using the rpart library is similar to that of a regression. For example, if your data set is called loans and, within that data set, you have a binary dependent variable y that you are trying to predict using two predictors x1 and x2, the R command to build a classification model using rpart would have the following syntax.

fit = rpart(y ~ x1 + x2, data = loans, method = "class")

In this example, the method option in this call tells rpart that this is a classification tree, which has binary outcomes, not a regression tree, which has continuous outcomes. To visualize the classification tree you have built:

plot(fit)
text(fit)

If you are having trouble with the tree visualization, you may want to explore other packages, such as rpart.plot, that do a much better job with decision tree visualization.
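To make these steps concrete, here is a minimal end-to-end sketch. It uses a small synthetic data frame (the variable names income, dti, and highgrade are placeholders, not the lab's actual columns), fits a tree with rpart, and draws it with the base-graphics plot and text commands described above.

```r
# Sketch: fit and visualize a small classification tree.
# The data frame and column names below are made up for illustration;
# in the lab you would use your LendingClub training data instead.
library(rpart)

set.seed(1)
n <- 500
loans <- data.frame(
  income = rnorm(n, mean = 60, sd = 15),   # synthetic predictor
  dti    = runif(n, min = 0, max = 40)     # synthetic predictor
)
# Synthetic binary outcome loosely related to the predictors
loans$highgrade <- as.factor(ifelse(loans$income - loans$dti + rnorm(n, 0, 10) > 30, 1, 0))

fit <- rpart(highgrade ~ income + dti, data = loans, method = "class")

# Base-graphics visualization; rpart.plot::rpart.plot(fit) is prettier
plot(fit)
text(fit, use.n = TRUE)
```

If you install rpart.plot, replacing the last two lines with rpart.plot(fit) typically produces a much more readable figure.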

Now that you have built this classification tree, you can predict values in the same way as with regression, using the predict command. The result (in this case, stored in the vector z) will hold the predicted classes, indicating whether or not your classification tree predicts each loan will be high grade.

z = predict(fit, type="class")

Using the accuracy metric described above, is this decision tree classifier more or less accurate than the one using logistic regression?
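As a sketch of how that accuracy comparison can be computed: once you have a vector of predicted classes, accuracy is simply the share of predictions that match the observed labels. The data below is synthetic (x1, x2, and y are stand-ins for the lab's variables); the same mean(z == ...) pattern applies to both the tree and the logistic regression predictions.

```r
# Sketch: in-sample accuracy of a classification tree, on synthetic
# stand-in data (column names are placeholders for the lab's variables).
library(rpart)

set.seed(2)
n <- 400
loans <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
loans$y <- as.factor(ifelse(loans$x1 + 0.5 * loans$x2 + rnorm(n, 0, 0.7) > 0, 1, 0))

fit <- rpart(y ~ x1 + x2, data = loans, method = "class")
z   <- predict(fit, type = "class")   # predicted classes, as in the text

# Accuracy = fraction of predictions that match the observed labels
accuracy <- mean(z == loans$y)
accuracy
```

Computing the same quantity for the logistic regression predictions lets you compare the two classifiers directly.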

5. Build a Naive Bayes model on the training data.

As another alternative, we can use a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) algorithm to classify loans according to whether or not they are likely to have an "A" or "B" grade. A widely used implementation of Naive Bayes in R is the naivebayes library. Install the package and don't forget to use the library command to load it wherever you are executing your code. In this question, the functions to use are purposely left unspecified. You are expected to look up the appropriate documentation to learn which functions from the library to use, and how to call them.

6. Compute predictor performance on the test data.

Classifiers such as the ones described above are built on training data (e.g. historical data), where the "correct" answers are available so that the model can be calibrated. Once you are happy with a model's performance on the training data, it is assessed on test data, where the correct answers are generally not available (e.g. future data). In general, a classifier will not perform as well on the test data as it did on the training data, because it was calibrated for optimal performance on the training data.
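The key mechanical step is passing the test set to predict via its newdata argument. The sketch below uses synthetic stand-in data frames named train and test (in the lab, these would be your 2013-14 and 2015 loans, respectively) and compares accuracy on the two sets.

```r
# Sketch: train on one period, evaluate on a held-out period.
# Data frames and column names are placeholders for the lab's data.
library(rpart)

set.seed(3)
make_loans <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- as.factor(ifelse(d$x1 - d$x2 + rnorm(n, 0, 1) > 0, 1, 0))
  d
}
train <- make_loans(600)   # stands in for the 2013-14 training data
test  <- make_loans(200)   # stands in for the 2015 test data

fit <- rpart(y ~ x1 + x2, data = train, method = "class")

# newdata= scores observations the model has never seen
train_acc <- mean(predict(fit, type = "class") == train$y)
test_acc  <- mean(predict(fit, newdata = test, type = "class") == test$y)
c(train = train_acc, test = test_acc)
```

Typically you will see test accuracy at or below training accuracy, for the calibration reason described above.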

7. Beyond accuracy as a performance metric.

The accuracy metric described above is the fraction of responses the algorithm correctly predicts. Suppose, instead, that the lending organization is able to quantify the different returns to originating good and bad loans and decides to focus on which algorithms generate the highest overall returns given these figures.

a. Suppose that highgrade loans that are correctly identified as such earn the lender $40, but that loans that are incorrectly predicted to be highgrade cost them $15. Based on test data performance, which classifier would the lender want to use?

b. What if these numbers were reversed, i.e. a highgrade loan that is correctly identified as highgrade earns the lender $15, but incorrectly classifying a loan as highgrade costs $40? Under these circumstances, based on test data performance, which classifier would the lender want to use?
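One way to answer both parts is to convert a classifier's confusion counts into dollars. The sketch below uses part (a)'s payoffs ($40 per correctly identified highgrade loan, -$15 per loan wrongly predicted highgrade) on a small made-up pair of actual/predicted vectors; for part (b) you would simply swap the two dollar amounts and recompute.

```r
# Sketch: turning classifier output into an expected payoff.
# The actual/predicted vectors here are invented for illustration; in the
# lab they would come from the test labels and each model's predictions.
actual    <- factor(c(1, 1, 0, 0, 1, 0, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 1, 0, 1, 1, 0, 0), levels = c(0, 1))

true_pos  <- sum(predicted == 1 & actual == 1)  # correctly flagged highgrade
false_pos <- sum(predicted == 1 & actual == 0)  # wrongly flagged highgrade

# Part (a) payoffs: $40 per true positive, -$15 per false positive
profit <- 40 * true_pos - 15 * false_pos
profit   # 40 * 3 - 15 * 2 = 90
```

Running this for each classifier on the test data, under both payoff schemes, shows which model the lender would prefer in each scenario: heavy false-positive penalties favor more conservative classifiers.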