Lab 4: Lending Club
OIDD 245 Tambe
Objective and Background
One industry in which machine learning has been having a significant impact is financial lending. In this lab, you will use data from LendingClub, a well-known peer-to-peer lending platform based in San Francisco, California, to build statistical models that use debtor attributes to predict the grades that LendingClub assigns to loan applications. This is an application and data set that many of you may have seen before, for example in a finance or statistics class, and it is a particularly good context in which to try some of the prediction concepts we have covered in the last few sessions.
You will be exploring how to build prediction models for this industry. To accomplish this objective, you are asked to build models using historical data from 2013 to 2014 (the training data) and then evaluate how well your models perform on new loan requests that came in during 2015 (the test data).
You are required to complete and submit the lab, but grading will be based on completion (as opposed to grading of individual questions). The lab looks long, but much of it is explanation and if working with others, it is certainly possible to complete the lab in a single class period.
- The training data is the LendingClub historical data on loans and chargeoffs from 2013 through 2014. (175 MB)
- The test data is LendingClub applications that were received in 2015. (309 MB) (this data set is only required for Parts E and F)
- You will also need the data dictionary which explains the meanings of the different fields in the files.
1. Data loading and cleanup.
- The data have a messy first row that will cause errors when you read in the file. To discard it, use the `skip=1` option in `read.csv()` to tell it to skip the first row when reading the file. There is a similar option if you are using the `read_csv` command. Given the size of these files, I would recommend using the `read_csv` function, because it is much faster.
- Note that there are two empty rows at the end of each file. This messy data will cause you problems if you run regressions that try to use these rows. Remove the last two rows from each data set before getting started.
- Finally, since we will be running models that rely on random number generation, add the `set.seed()` command somewhere near the top of your file, using any number you choose as the argument. This will ensure you get the same results every time you run your code. If you are working with others, using the same seed value should ensure that your models produce the same results, assuming everything else you do is also the same.
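The loading steps above can be sketched as follows. The file name and the stand-in CSV below are hypothetical; in the lab you would point `read_csv` at the actual LendingClub file you downloaded.

```r
library(readr)

set.seed(123)  # any fixed number works; keeps random results reproducible

# A tiny stand-in file so this sketch is self-contained: a messy first row,
# a header, two data rows, and two junk rows at the end (as in the real file).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Notes offered by Prospectus",
             "loan_amnt,grade",
             "1000,A",
             "2000,C",
             "Total amount funded,",
             ","), tmp)

# skip = 1 discards the messy first row
train <- read_csv(tmp, skip = 1)

# remove the two non-data rows at the end of the file
train <- train[1:(nrow(train) - 2), ]
```

After trimming, `train` holds only the two real data rows.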
2. Descriptive statistics.
- Using the training data, create a new binary variable called `highgrade` that takes the value 1 when a loan has received an "A" or "B" grade and 0 otherwise. This is the response variable that you will try to predict. What proportion of loans in the training data receive either an "A" or "B" grade?
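A minimal sketch of this step, using a toy data frame in place of the training data and assuming the letter grade is stored in a column named `grade` (check the data dictionary for the exact field name):

```r
# Toy stand-in for the training data
train <- data.frame(grade = c("A", "B", "C", "D", "A"))

# 1 if the loan received an "A" or "B" grade, 0 otherwise
train$highgrade <- ifelse(train$grade %in% c("A", "B"), 1, 0)

# proportion of high-grade loans
mean(train$highgrade)  # -> 0.6 in this toy example
```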
3. Build a logistic regression based classifier using the training data.
- Using the `glm` command as discussed in the class session, perform a logistic regression that predicts the `highgrade` outcome variable using annual income, home ownership, loan amount, verification status, and purpose as the predictors. Use the `summary` command to view the output of your regression.
- Next, use the `predict` command to generate a vector of the probabilities predicted by your logistic regression. Each probability can be interpreted as the likelihood that `highgrade` is a "1" for that row. Remember to pass `type='response'` when calling `predict` so that its output can be interpreted as probabilities between 0 and 1. If you forget to do that, `predict` will produce log-odds, which can be negative values.
- Create a new column of predictions that classifies each loan as high grade or not (1 or 0), based on your predicted probabilities. To do this, you will need to choose a probability threshold above which to classify loans as high grade. Although – as we discussed in class – there are a variety of criteria to use when choosing this value, for now, try different values and select one that roughly optimizes the accuracy of the classifier (you do not need to go beyond two decimal places), where accuracy is defined in the next bullet.
- To evaluate how well this logistic regression-based classifier performs, we can measure its accuracy, defined as the proportion of answers in the training data that it gets correct. In other words, this is the proportion of rows in which the classifier's prediction equals the actual `highgrade` value, as assessed by the original loan officers.
- What is the accuracy of this classifier on the training data?
- For comparison, what is the accuracy of a classifier that assigns a value of 0 to all rows for the predicted class?
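The full workflow for this section can be sketched on simulated data. The predictor names and coefficients below are invented for the simulation; in the lab, fit the model on the real training data using the field names from the data dictionary.

```r
set.seed(1)

# Simulated stand-in for the training data (hypothetical columns/effects)
n <- 1000
train <- data.frame(
  annual_inc = rlnorm(n, 11, 0.5),
  loan_amnt  = runif(n, 1000, 35000)
)
train$highgrade <- rbinom(n, 1, plogis(0.00002 * train$annual_inc -
                                       0.0001  * train$loan_amnt + 1))

# Logistic regression: family = binomial is what makes glm logistic
fit <- glm(highgrade ~ annual_inc + loan_amnt, data = train, family = binomial)
summary(fit)

# type = "response" returns probabilities rather than log-odds
p <- predict(fit, type = "response")

# Classify with a threshold; try several values and keep the most accurate
train$pred <- ifelse(p > 0.5, 1, 0)

# Accuracy: share of rows where the prediction matches the true label
mean(train$pred == train$highgrade)

# Benchmark: a classifier that always predicts 0
mean(0 == train$highgrade)
```

Comparing the model's accuracy to the all-zeros benchmark tells you how much the predictors actually help.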
4. Build a classification tree on the training data.
As an alternative to logistic regression, we can use a "classification
tree" to classify loans according to whether or not they are likely to
have an "A" or "B" grade. A widely used implementation of decision trees
in R is the rpart library, where rpart stands for "Recursive Partitioning
and Regression Trees". Install the rpart package, and don't forget to use
the library command to load it wherever you are executing your code.
The syntax for running a classification tree using the rpart library is
similar to that of a regression. For example, if your data set is called
loans and, within that data set, you have a binary dependent variable
y that you are trying to predict using two predictors x1 and x2,
the R command to build a classification model using
rpart would have
the following syntax.
fit = rpart(y ~ x1 + x2, data = loans, method = "class")
In this example, the method option in this call tells rpart that
this is a classification tree, which has binary outcomes, not a
regression tree, which has continuous outcomes. To visualize the
classification tree you have built, you can use plot(fit) followed
by text(fit).
If you are having trouble with the tree visualization, you may want to
explore other packages, such as
rpart.plot, that do a much better job
with decision tree visualization.
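A sketch of both visualization options, fit on the kyphosis data set that ships with rpart so it runs as-is (substitute your loans tree for `fit`):

```r
library(rpart)

# Toy tree on rpart's built-in kyphosis data, just to demonstrate plotting
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Base-graphics visualization that ships with rpart
plot(fit)
text(fit, use.n = TRUE)

# rpart.plot (install.packages("rpart.plot")) draws a much cleaner tree:
# rpart.plot::rpart.plot(fit)
```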
Now that you have built this classification tree, you can predict values
in the same way as with regression, using the predict command. The
result (in this case, stored in the vector z) will hold the predicted
classes indicating whether or not your classification tree predicts a
loan to be high grade.
z = predict(fit, type="class")
Using the accuracy metric described above, is this machine learning-based classifier more or less accurate than the one based on logistic regression?
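The tree's training accuracy is computed exactly as before. A self-contained sketch using the kyphosis data that ships with rpart (swap in your loans tree and the highgrade column):

```r
library(rpart)

# Stand-in tree so the sketch runs as-is
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# type = "class" returns predicted classes rather than class probabilities
z <- predict(fit, type = "class")

# Accuracy of the tree on its own training data
mean(z == kyphosis$Kyphosis)
```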
5. Compute performance on the test data.
Classifiers such as the ones described above are built on training data (e.g. historical data), where the "correct" answers are available so that the model can be calibrated. Once you are happy with a model's performance on the training data, it is assessed on test data, where the correct answers are generally not available (e.g. future data). In general, the classifier will not perform as well on the test data as it did on the training data, because it was calibrated for optimal performance on the training data.
- Evaluate the accuracy of both of the classifiers you built above
(logistic regression & classification tree) on the test data.
Note that there is an extra `purpose` category in the test data called "educational". You should remove those rows from the test data to get your models to fit the new data set.
- As a benchmark, what is the accuracy of a classifier that simply assigns a value of 0 to all rows of the test data?
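The test-data steps can be sketched with simulated frames; the column names and the all-character "educational" handling mirror the lab data but the values below are invented. The key calls are dropping the unseen `purpose` level and passing `newdata =` to `predict`.

```r
set.seed(2)

# Simulated stand-ins for the training and test data (hypothetical values)
train <- data.frame(
  purpose   = sample(c("car", "credit_card"), 500, replace = TRUE),
  loan_amnt = runif(500, 1000, 35000)
)
train$highgrade <- rbinom(500, 1, 0.4)

test <- data.frame(
  purpose   = sample(c("car", "credit_card", "educational"), 200,
                     replace = TRUE),
  loan_amnt = runif(200, 1000, 35000),
  highgrade = rbinom(200, 1, 0.4)
)

logit <- glm(highgrade ~ purpose + loan_amnt, data = train,
             family = binomial)

# Drop the purpose level the model never saw during training
test <- test[test$purpose != "educational", ]

# newdata = test scores the held-out rows; threshold as before
p    <- predict(logit, newdata = test, type = "response")
pred <- ifelse(p > 0.5, 1, 0)

# Test-set accuracy, and the all-zeros benchmark for comparison
mean(pred == test$highgrade)
mean(0 == test$highgrade)
```

The same `predict(..., newdata = test, type = "class")` pattern applies to the rpart tree.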