Lab 1. Diabetes Prediction

OIDD 255, Tambe



Learning Objectives for this lab

Data Context

The setting for the exercise is healthcare. Machine learning models are becoming widespread in healthcare diagnostics, and using machine learning for diabetes prediction is becoming fairly common. [1, 2]

This lab uses a popular machine learning model — “logistic regression” — to predict outcomes from patient data. In this context, given access to other indicators of patient health (that may be readily available or easy to collect), the goal is to predict whether or not a patient has diabetes (or will develop it soon).

We have provided an Excel spreadsheet with a logistic regression model for the PIMA Diabetes data built in. For a large sample of patients, these data contain information on a variety of health indicators as well as whether or not each patient has diabetes, denoted by a 1 or a 0. If you would like to read more about the diabetes and health issues associated with the PIMA people (a community of Native Americans from Arizona and Northwestern Mexico), read here (optional).

Modeling

It is not essential that you understand ML concepts or the statistics behind logistic regression models for this exercise. This lab is meant to get you hands-on with some key concepts before we formally cover them in class.

Regression and prediction

It is important to know, however, that regression models can be used to make predictions. You may have come across regressions in the context of fitting a line (or another shape) to a set of known data points. Once these models are fitted, they can allow you to take new data and predict the value of an outcome. In other words, we can use a set of known data points (x’s) with outcomes (y’s) to fit a regression model, and then use the fitted model to predict outcomes for data points (x’s) where we do not know the outcomes.
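To make the fit-then-predict idea concrete, here is a minimal Python sketch (all numbers are made up for illustration): a straight line is fitted to a handful of known (x, y) points by least squares, and the fitted line is then used to predict the outcome for a new x whose y we do not know.

```python
# Known data points: x's with known outcomes y (made-up numbers)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Fit a line y = a + b*x by least squares
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Use the fitted model to predict the outcome for a new x
x_new = 5.0
y_pred = a + b * x_new  # roughly 9.85 for these numbers
```

The same two-phase pattern (fit on labeled data, then predict on unlabeled data) carries over to the logistic regression model in the spreadsheet; only the shape being fitted differs.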

In the diabetes context, we use data on patient indicators and known diabetes outcomes to find how the indicators can be combined to predict whether or not a patient is likely to have diabetes. We can then use the fitted model to predict whether new patients, for whom we do not know a diabetes diagnosis, might have diabetes.

Logistic regression

Logistic regression is used when the outcome variable that is to be predicted is “binary” (i.e. it takes either the value 0 or 1). For instance, you might use logistic regression to predict whether someone is likely to be approved for a home loan (which can only take the value 0 or 1), but not for predicting a house price (which can take any positive value).

Logistic regression-based prediction in this context proceeds in several steps: the model is first fitted to data with known outcomes, producing a set of coefficients; the fitted model is then used to compute, for each patient, a predicted probability of having diabetes; finally, a threshold value converts each probability into a 1/0 prediction.
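For readers who like to see the mechanics, the middle step (turning a patient's indicators into a probability) can be sketched in a few lines of Python. The intercept, coefficients, and patient values below are hypothetical placeholders, not the values fitted in the lab spreadsheet:

```python
import math

# Hypothetical intercept and coefficients, one per health indicator x1..x8
# (these are NOT the values fitted in the lab spreadsheet).
intercept = -5.0
coefs = [0.12, 0.03, -0.01, 0.005, 0.001, 0.09, 0.9, 0.015]

def predict_probability(indicators):
    """Combine the indicators linearly, then squash the result through the
    logistic (sigmoid) function to get a probability between 0 and 1."""
    z = intercept + sum(c * x for c, x in zip(coefs, indicators))
    return 1.0 / (1.0 + math.exp(-z))

patient = [6, 148, 72, 35, 0, 33.6, 0.627, 50]  # made-up indicator values
p = predict_probability(patient)  # a probability strictly between 0 and 1
```

Whatever the inputs, the logistic function always returns a value strictly between 0 and 1, which is what makes it suitable for predicting binary outcomes.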

In this lab, we use a logistic regression-based prediction model to understand the tradeoffs that arise when building machine learning models. If you would like to learn more about logistic regression, there are many resources available online.

Spreadsheet

For this lab, you are given a spreadsheet (click here to access it). The spreadsheet contains the data on the PIMA people, including the health indicators for each patient (labeled x1 - x8 in columns A-H) as well as whether or not they have diabetes (column I).

You are also given a column with predictions from the fitted model (column M) and another column with the predicted 1/0 value for whether or not the patient has diabetes (column Q).

Notice that the predictions from the fitted model are all probabilities that an individual has diabetes. As described above, a threshold value is required in order to convert these probabilities into a binary flag of whether an individual is predicted to have diabetes. You can see that in this spreadsheet a threshold value of 0.5 is used (cell Q2), so under this model all patients with a predicted probability greater than 0.5 will be predicted to have diabetes.
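The probability-to-flag conversion in column Q amounts to a simple comparison. A Python sketch of the same rule, using made-up probabilities:

```python
# Convert predicted probabilities into 1/0 diabetes flags, as the
# spreadsheet does with the threshold in cell Q2 (probabilities are made up).
threshold = 0.5
probabilities = [0.82, 0.31, 0.50, 0.07, 0.64]

# "Greater than the threshold" means a probability of exactly 0.50
# is NOT flagged as diabetic under this rule.
predictions = [1 if p > threshold else 0 for p in probabilities]
print(predictions)  # → [1, 0, 0, 0, 1]
```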

The data scientist's or ML engineer's challenge is to find models that fit the data well, and to adjust parameters, like the threshold value, to make the most accurate predictions. Here, we have given you the model (logistic regression) and will be asking you to adjust the threshold value.

Deliverables

Please submit a response document (e.g. a Word or Google Doc) with answers to each of the questions below. You do not need to submit the Excel document in which you do the work, or to paste the original question into your response document, and your answers need not be lengthy. For instance, when a question asks for a number, simply reporting the number is fine. When it asks for an explanation, one or two sentences is usually sufficient.

Some Excel functions are recommended in the text for those who are new to Excel, but you are welcome to use other methods. Assignments are to be submitted through Canvas. See Canvas for the due date.

Please answer the following questions

Question 1.

You are already provided with the fitted logistic regression model, which was built using the PIMA people's health data (see rows 1-2 in columns A-I). In no more than a few sentences, describe in what sense the machine has “learned” during the model-fitting process. There is no need to go into the technical details behind logistic regression, but provide a high-level view of what the model is “learning”, what it is using to learn, and how it is learning it.

Question 2.

What fraction of patients in the data you are provided have a known diagnosis of diabetes? The Excel function AVERAGE, which returns the mean value of a column of numbers, may be useful here.

Question 3.

The spreadsheet that you are given allows you to adjust the threshold used to convert the logistic regression probabilities from the fitted model to predictions of whether or not a patient has diabetes. When using a threshold of 0.5, what fraction of 1/0 predictions is correct (i.e. matches the patient’s actual condition)? This metric is known as “accuracy”. For this question, you will first need to generate a new column which indicates whether or not the prediction matches the patient’s condition. The IF function may be useful here.
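If it helps to see the comparison outside of Excel, here is the same accuracy logic as a short Python sketch (the actual and predicted values below are made up, not taken from the lab data):

```python
# Accuracy: the fraction of 1/0 predictions that match the actual condition.
actual    = [1, 0, 0, 1, 1, 0]
predicted = [1, 0, 1, 1, 0, 0]

# One "match" flag per patient, like an IF column in Excel
matches = [1 if a == p else 0 for a, p in zip(actual, predicted)]
accuracy = sum(matches) / len(matches)  # 4 of 6 match here
```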

Question 4.

As you saw in the previous question (question 3), the model does not always predict the correct answer. Assuming we chose the best modeling technique and fit the data appropriately, why might the model still produce incorrect predictions for certain individuals?

Question 5.

Read here about false positives and false negatives. (These terms have now entered the public conversation thanks to COVID-19 tests!) In language that might be relevant to a medical practitioner, what do these numbers mean for this diabetes diagnostic context?

Question 6.

Put yourself in the shoes of a senior healthcare manager in a hospital system that is using this prediction model for making diagnoses.

As a healthcare manager, under what conditions might you be more concerned about the false negative rate?

As a healthcare manager, under what conditions might you be more concerned about the false positive rate?

Question 7.

What threshold value maximizes the accuracy of the prediction model? (Excel ninjas can use “Solver” to find this value, but feel free to use trial-and-error, changing the threshold by hand until you find which value produces the highest accuracy.) Specify this value to two decimal places.
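For the curious, the trial-and-error search can also be expressed in a few lines of Python; the data below are invented, so this only illustrates the search logic, not the answer for the lab spreadsheet:

```python
# Scan candidate thresholds 0.00 .. 1.00 and keep the most accurate one
# (this mirrors what Solver, or patient hand-tuning, does in Excel).
actual        = [1, 0, 0, 1, 1, 0, 1, 0]   # made-up outcomes
probabilities = [0.9, 0.4, 0.55, 0.7, 0.45, 0.2, 0.85, 0.35]

best_threshold, best_accuracy = 0.0, 0.0
for i in range(101):
    t = i / 100
    preds = [1 if p > t else 0 for p in probabilities]
    acc = sum(a == pr for a, pr in zip(actual, preds)) / len(actual)
    if acc > best_accuracy:
        best_threshold, best_accuracy = t, acc
```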

Question 8.

In the previous question, you found a threshold that optimized accuracy. How did the new threshold impact false positive and false negative rates when compared to the original threshold of 0.5? Why do you think these changes occurred?