# Lab 4: Lending Club

#### OIDD 245 Tambe

# Objective and Background

One industry in which machine learning has been having a significant
impact is financial lending. In this lab, you will use data from
**LendingClub**, a well-known peer-to-peer
lending platform based in San Francisco, California, to build
statistical models that use debtor attributes to predict the grades
given to loan applications by the Lending Club. This is an
application/data set that many of you may have seen before, for example
in a finance or statistics class, and it is a particularly good context
to try some of the prediction concepts we have covered in the last few
sessions.

You will be exploring how to build prediction models for this industry.
To accomplish this objective, you are asked to *build* models using
historical data from 2013 to 2014 (which will be used as the training
data) and *test* the quality of your prediction model using new loan
requests that came in during 2015 (the test data) to evaluate how well
it performs.

# Deliverables

You are required to complete and submit the lab, but grading will be based on completion (as opposed to grading of individual questions). The lab looks long, but much of it is explanation, and if you are working with others, it is certainly possible to complete the lab in a single class period.

# Data

- The **training data** is the LendingClub historical data on loans and chargeoffs from 2013 through 2014 (175 MB).
- The **test data** is LendingClub applications that were received in 2015 (309 MB). This data set is only required for Parts E and F.
- You will also need the data dictionary, which explains the meanings of the different fields in the files.

# Exercises

## 1. Data loading and cleanup.

- The data have a messy first row that will cause errors when you read
in the file. To discard the first row, use the `skip = 1` option in
`read.csv()` to tell it to skip the first row when reading the file.
The `read_csv()` function from the readr package takes the same `skip`
argument. Given the size of these files, I would recommend using
`read_csv()`, because it is much faster.

- Note that there are two empty rows at the end of each file. This messy data will cause you problems if you run regressions that try to use these rows. Remove the last two rows from each data set before getting started.

- Finally, since we will be running models that rely on random number
generation, add the `set.seed()` command somewhere near the top of your
file, using any number you choose as the argument. This will ensure you
get the same results every time you run your code. If you are working
with others, using the same seed value should ensure that your models
produce the same results, assuming everything else you do is also the
same.
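The cleanup steps above can be sketched as follows. A tiny stand-in file is used here so the code runs on its own; for the lab, point `read.csv()` (or the faster `read_csv()`) at the actual LendingClub CSVs instead.

```r
set.seed(42)   # any number works; just keep it fixed across runs

# Stand-in file mimicking the real layout: a messy first row,
# a header, data rows, and two empty rows at the end.
tmp = tempfile(fileext = ".csv")
writeLines(c("Notes offered by Prospectus (messy first row)",
             "id,loan_amnt,grade",
             "1,5000,A",
             "2,12000,C",
             ",,",
             ",,"),
           tmp)

# skip = 1 discards the messy first row
loans = read.csv(tmp, skip = 1)

# drop the two empty rows at the end of the data set
loans = loans[1:(nrow(loans) - 2), ]
nrow(loans)
```

The same `skip = 1` and row-trimming pattern applies unchanged to both the training and test files.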

## 2. Descriptive statistics.

- Using the training data, create a new binary variable called
`highgrade` that takes the value 1 when a loan has received an "A" or
"B" grade and 0 otherwise. This is the response variable that you will
try to predict. What proportion of loans in the training data receive
either an "A" or "B" grade?
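One way to build the binary variable is with `%in%` and `ifelse()`; a small stand-in `grade` column is used here in place of the real training data.

```r
# Stand-in for the training data's grade column
train = data.frame(grade = c("A", "B", "C", "D", "A"))

# highgrade = 1 for "A" or "B" loans, 0 otherwise
train$highgrade = ifelse(train$grade %in% c("A", "B"), 1, 0)

# proportion of loans receiving an "A" or "B" grade
mean(train$highgrade)
```

Because `highgrade` is coded 0/1, its mean is exactly the proportion of high-grade loans.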

## 3. Build a logistic regression based classifier using the training data.

- Using the `glm` command as discussed in the class session, perform a
logistic regression that predicts the `highgrade` outcome variable
using annual income, home ownership, loan amount, verification status,
and purpose as the predictors. Use the `summary` command to view the
output of your regression.

- Next, use the `predict` command to generate a vector of the
probabilities predicted by your logistic regression. This probability
can be interpreted as the likelihood that `highgrade` is a "1" for that
row. Remember to use `type = 'response'` when calling your predict
function so that its output can be interpreted as probabilities between
0 and 1. If you forget to do that, `predict` will produce log-odds,
which can be negative values.

- Create a new column that contains your predictions that classifies
loans as being highgrade or not (1 or 0), based on your predicted
probabilities. To do this, you will need to choose a probability
threshold above which to classify loans as being high grade. Although –
as we discussed in class – there are a variety of criteria to use when
choosing this value, for now, try different values to select one that
roughly optimizes the *accuracy* of the classifier (you do not need to
go beyond the two-digit level), where accuracy is defined in the next
bullet.

- To evaluate how well this logistic regression-based classifier
performs, we can measure its accuracy, defined as the proportion of
answers in the training data that it gets correct. In other words, this
would be the proportion of rows in which the classifier prediction is
equal to the actual `highgrade` value, as assessed by the original loan
officers.

- What is the accuracy of this classifier on the training data?

- For comparison, what is the accuracy of a classifier that assigns a
value of 0 to all rows for the predicted class?
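The full workflow above can be sketched as follows. Simulated stand-in data is used here so the code is self-contained; for the lab, fit the model on your LendingClub training data with the predictors named above, and the variable names (`annual_inc`, `loan_amnt`) are assumptions about the data dictionary's column names.

```r
set.seed(1)
n = 1000

# Simulated stand-in for the training data: higher income and
# smaller loans are made more likely to be high grade.
train = data.frame(annual_inc = rlnorm(n, meanlog = 11, sdlog = 0.5),
                   loan_amnt  = runif(n, 1000, 35000))
logit = 1.5 * scale(log(train$annual_inc)) - 1.0 * scale(train$loan_amnt)
train$highgrade = rbinom(n, 1, plogis(logit))

# logistic regression of highgrade on the predictors
fit = glm(highgrade ~ annual_inc + loan_amnt,
          data = train, family = binomial)
summary(fit)

# type = "response" gives probabilities in [0, 1], not log-odds
probs = predict(fit, type = "response")

# classify using a probability threshold, then compute accuracy
threshold = 0.5   # try different values to roughly optimize accuracy
train$pred = ifelse(probs > threshold, 1, 0)
accuracy = mean(train$pred == train$highgrade)

# benchmark: a classifier that assigns 0 to every row
baseline = mean(train$highgrade == 0)
```

Comparing `accuracy` to `baseline` shows how much the model improves on the naive always-predict-0 rule.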

## 4. Build a classification tree on the training data.

As an alternative to logistic regression, we can use a "classification
tree" to classify loans according to whether or not they are likely to
have an "A" or "B" grade. A widely used implementation of decision
trees in R is the `rpart` library. `rpart` stands for "Recursive
partitioning and regression trees". Install the rpart package and don't
forget to use the `library` command to load it wherever you are
executing your code.

```
library(rpart)
```

The syntax for running a classification tree using the rpart library is
similar to that of a regression. For example, if your data set is called
*loans* and, within that data set, you have a binary dependent variable
*y* that you are trying to predict using two predictors *x1* and *x2*,
the R command to build a classification model using `rpart` would have
the following syntax.

```
fit = rpart(y ~ x1 + x2, data = loans, method = "class")
```

In this example, the `method` option in this call tells `rpart` that
this is a classification tree, which has binary outcomes, not a
regression tree, which has continuous outcomes. To visualize the
classification tree you have built:

```
plot(fit)
text(fit)
```

If you are having trouble with the tree visualization, you may want to
explore other packages, such as `rpart.plot`, that do a much better job
with decision tree visualization.

Now that you have built this classification tree, you can predict
values in the same way as with regression, using the `predict` command.
The result (in this case, stored in the vector *z*) will hold the
binary values indicating whether or not your classification tree
predicts a loan to be high grade.

```
z = predict(fit, type="class")
```

Using the accuracy metric described above, is this machine learning based classifier more or less accurate than the one based on logistic regression?
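Putting the `rpart` pieces together, the tree workflow looks like the sketch below. Simulated *x1*/*x2* data stands in for the loan predictors; for the lab, use the same LendingClub predictors as in your logistic regression.

```r
library(rpart)
set.seed(1)
n = 1000

# Simulated stand-in data: y depends mostly on x1, plus noise
loans = data.frame(x1 = rnorm(n), x2 = rnorm(n))
loans$y = factor(ifelse(loans$x1 + rnorm(n, sd = 0.5) > 0, 1, 0))

# method = "class" requests a classification (not regression) tree
fit = rpart(y ~ x1 + x2, data = loans, method = "class")

# predicted classes on the training data
z = predict(fit, type = "class")

# accuracy, computed the same way as for the logistic classifier
tree_accuracy = mean(z == loans$y)
```

Because `rpart` ships with base R, `library(rpart)` works without a separate install in most setups.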

## 5. Compute performance on the test data.

Classifiers such as the ones described above are built on *training*
data (e.g. historical data), where the “correct” answers are available
so that the model can be calibrated. Once you are happy with a
classifier's performance, it is assessed on *test* data, where the
correct answers are generally not available (e.g. future data). In
general, the classifier will not perform as well on the test data as it
did on the training data, because it was calibrated for optimal
performance on the training data.

- Evaluate the accuracy of both of the classifiers you built above
(logistic regression & classification tree) on the **test** data. Note
that there is an extra `purpose` category in the test data called
“educational”. You should remove those rows from the test data to get
your models to fit the new data set.

- As a benchmark, what is the accuracy of a classifier that simply
assigns a value of 0 to all rows of the **test** data?
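The test-set steps can be sketched as follows, again on stand-in data. The key differences from training are dropping the “educational” `purpose` rows and passing the test set via `newdata` so `predict` scores the new observations rather than the training rows.

```r
set.seed(1)

# Stand-in for the 2015 test data, including the extra purpose category
test = data.frame(
  purpose    = sample(c("credit_card", "car", "educational"),
                      50, replace = TRUE),
  annual_inc = rlnorm(50, meanlog = 11, sdlog = 0.5))
test$highgrade = rbinom(50, 1, 0.4)

# remove rows whose purpose level is absent from the training data
test = test[test$purpose != "educational", ]

# With a fitted model `fit` and chosen `threshold`, you would then run:
# probs = predict(fit, newdata = test, type = "response")
# test$pred = ifelse(probs > threshold, 1, 0)
# accuracy = mean(test$pred == test$highgrade)

# benchmark on the test data: always predict 0
baseline = mean(test$highgrade == 0)
```

The same `newdata` pattern works for the classification tree, using `predict(fit, newdata = test, type = "class")`.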