Homework 1: Rats!

OIDD 2450 Tambe

Due Dates

This assignment should be submitted through Canvas.

You may NOT work with others on this assignment. However, you can ask the TAs or I for help as needed. If using Slack, directly message us rather than posting to a public channel. It is best not to put this assignment off until very close to the due date. These data sets are large, so it can sometimes take more time than anticipated to process and work through them.

This is a good assignment on which to use R scripts or R notebooks, and a good opportunity to learn about them if you have not been using them already. You should complete the entire assignment using either R scripts or an R notebook. If you use an R-script, please also create a separate document (which you should ultimately convert to a pdf) that includes answers to the questions that are being asked, including any relevant charts, and submit both documents to Canvas. If you are using an R notebook, you have the option of trying to knit the notebook to create a Word document (which you should convert to a pdf) or .html file that you can submit along with your R notebook. You are encouraged to use any and all R packages that you think might be useful for completing this assignment, even if we have not covered them in class.

As you read through the assignment, I would recommend following each of the links (i.e, clicking on them). They provide important background context.

Data Context

An interesting application of big data in the public sector is for optimizing the use of constrained government resources.

A recent example of this from NYC was a Department of Health initiative to target restaurant inspections by data mining Yelp reviews to find which ones mentioned roaches. More generally, government agencies and corporations are realizing that bottom-up, crowd-sourced data collection is a good way to gather some types of important data that were previously difficult or too expensive to collect accurately.

A more recent scourge on cities has been RATS. Surprisingly, NYC is NOT the most rat-infested city in the US. Perhaps you can guess which one is. For this exercise, however, we will use data from NYC, simply because it collects particularly good city data through their open data initiatives. In NYC, rodents have commanded attention from government offices and from the press and the community (in part fueled by viral videos like pizza rat; more recently chubby sewer rat). In New York, the resurgence of rats in the city has been a result of many forces, including population growth, construction, and trash pickup problems, and even global warming, as many would argue, that a major redistribution of water rats that occurred as a result of Hurricane Sandy, which hit New York City on October 29th, 2012. (In Philadelphia, some have argued that gentrification-related rebuilding is having a similar effect in terms of displacing rats around the city).

Like other vermin, rats are difficult to detect and they vastly outnumber rodent inspectors and the resources that the city has available for monitoring. In fact, some estimates have put the number of rats in New York city at about one per person (i.e. slightly over 8 million) although these estimates are controversial. Nevertheless, there is little debate that rats remain a significant problem for residents of the five boroughs. As in the Yelp-roach-restaurant inspection example provided above, the goal of this assignment is uncover some temporal patterns in rat problems in the city, and to use prediction based techniques to investigate if the city's restaurant inspection activities (which are separate from their rodent inspection activities) can be useful for targeting rodent inspections. Other major cities are also using crowd-sourcing efforts to fight this menace.

In addition to the NYC rodent inspection data, we will use data on NYC 311 calls, which are a fascinating source of information on New York City as well as the NYC restaurant health inspection data. All of these stakeholders–rodent inspectors, restaurant inspectors, and citizens–can be valuable sources of information about which parts of the city have rat problems. Note that these data sources are not "independent". Many of the inspections in the rodent inspection database are initially undertaken due to 311 complaints.

Deliverables

There are two deliverables for this assignment. You should submit an R-script or R-notebook with the R code that you used to complete the project. You should also knit the output of the code into a report in .html format or if using an R-script, create a Word document with your answers and supporting figures. The R files should be well-documented with comments so that it is clear what you are trying to do. Although you will not be graded on the efficiency of your code, this project is an opportunity to spend a few extra minutes to think about how to keep your code organized and easy to follow. This includes choosing good variable and file names, which will ultimately save you time in a project such as this one. A key learning objective of this homework is gaining experience building and managing a larger R-based project.

Data sources

You will need to use the following three data sources to complete this homework. These data sources are all available through NYC Open Data, but are also made available through the following links.

NYC Rodent Inspection Results (211 MB). This is the main data source and there is information about these data located here.

311 calls that occurred around the time of Hurricane Sandy (38 MB).

3a. If you are using a Mac, use these NYC Restaurant Inspection Results (169 MB). If you are using a PC, use these NYC Restaurant Inspection Results (102 MB).

Objectives

Part 0: File setup

You should begin by running the following code snippet.

install.packages(TeachingDemos) library(TeachingDemos) library(dplyr) library(readr) rod.inspection = read_csv("<put in location of the rodent inspection data file on your computer here>") # In the line below, change "Jack Black" to your own name myname = "Jack Black" # Then run the following lines set.seed(char2seed(myname)) rod.inspection = sample_frac(rod.inspection, .8) rod.inspection = rod.inspection[, sample(1:ncol(rod.inspection))]

Now you are ready to begin the assignment. Please use the rod.inspection data frame for the parts below. Feel free to rename it if it is convenient.

PART 1. Descriptive Statistics and Figures.

a. Using the rodent inspection data frame that you just generated, create a line chart illustrating, for each of the five NYC boroughs, how rat sightings (inspections yielding Active Rat Signs) have been changing from month to month over the most recent five years that are available in the data set (i.e. each tick mark on the x-axis should be a single month for a single year, so with five years of data, there will be sixty ticks on the x-axis). (The lubridate and zoo packages in R are useful for working with dates. stringr can help when working with strings). You can create one line chart with a different line for each borough, or you can create five different line charts, one for each borough.

b. In general, have rat sightings been increasing, decreasing, or remained steady over this five year window? No need for analysis for this question. Just eyeball the charts you created and provide your answer.

c. Are there seasonal trends in rat sightings? Are there more rat sightings in some months than others? Again, just eyeball the charts you created in part (a) to provide this answer.

d. Generate a similar plot illustrating “efficiency” of rat inspections where efficiency for any given month is computed as the number of inspections yielding “Active Rat Signs” in that month divided by the total number of inspections that take place in that month. Again, you can create a line chart for each borough or you can put the lines for all five boroughs on the same chart.

e. Generate a list of the top ten zip codes with the largest number of inspections yielding active rat signs. You do NOT need to create a chart or map, but you are welcome to try. If you want to do so, there are a number of R packages that enable mapping, including ggmaps, rmaps, leaflet, and maps.

PART 2. The 311 data.

a. Create a chart of phone calls about rat/vermin sightings in the weeks before and after Hurricane Sandy (Oct 29, 2012). To identify these, use the ‘Complaint Type’ tag “Rodent”. There is not a lot of data from the before period, but based on what you have, do these data suggest that Hurricane Sandy led to an increase in the daily numbers of rodent sightings on the island? No formal statistics are needed in your answer to this part.

b. We will now begin to explore the idea of using other “proxy” indicators as a predictor of Rodent problems.

Imagine that it would be useful, to know what other types of complaints are most highly correlated with Rodent complaints. This would allow us to identify which other complaint types are the best predictors of Rodent sightings when the Rodent data are unavailable. Using the 311 data, report a table of correlations between Rodent complaints and the top fifteen types of non-Rodent related complaint types, where the correlations of interest are between the counts of complaints of each type across the different zip codes in the data set (and where top fifteen is according to total counts of complaint types in the 311 data set). Across all zip codes, what two types of other complaints are most highly correlated with Rodent sightings? Although you can approach this question anyway you like, some suggested steps are shown below if you are having trouble.

Step 1. Identify the top 15 non-rodent complaint types by the number of complaints.

Step 2. Using the 311 data, for the top 15 complaint types as well as the Rodent complaint type, create a data frame of counts by complaint type and zip code. The top few rows of such a data frame are shown below.

Zip code	Complaint type	Counts
10001	NONCONST	# Counts
10001	SEWER	# Counts
10001	PLUMBING	# Counts
10001	RODENT	# Counts

Step 3. Convert your data into the type of format shown in the table below (of course, with all fifteen complaint types and a longer list of zip codes). The pivot_wider function in the tidyr library can be useful here.

	NONCONST	SEWER	PLUMBING	RODENT
Zip code 1	# Counts	# Counts	# Counts	# Counts
Zip code 2	# Counts	# Counts	# Counts	# Counts
Zip code 3	# Counts	# Counts	# Counts	# Counts

Step 4. Then, remove the zip code column and convert any missing values to zeroes and make sure you have columns full of numbers. Finally, use cor or a similar function to compute correlations across each of the columns. The column we are interested in is the one between Rodent sightings and all the other Complaint types. From these correlations, you can answer the question posed above, i.e. what two other complaint types are most highly correlated with Rodent sightings?

PART 3. Rodent and Restaurant Inspection.

Imagine that the NYC Department of Pest Control has been experiencing budget cuts, and thinks it may be useful to predict which 311 calls they receive are most likely to yield “Active Rat Signs”. They ask you to investigate whether the restaurant violation database, which is created and maintained by a different agency, can be useful for predicting which property inspections are most likely to yield Active Rat Signs. For the purposes of this assignment, we will estimate a simple predictive model of the following form:

ActiveRatSigns = log(RestaurantRatViolations+1) + month + year + \epsilon

To do this, you will need to use both the Restaurant Violation and Rodent Inspection databases. In this regression, each row corresponds to an entry in the Rodent Inspection database. RestaurantRatViolations is a new variable you should create from the restaurant inspection database which is the number of rodent-related restaurant violations in that zip code, year, and month (The relevant inspection codes for a restaurant rat violation are: 04L, 04K, and 08A). The dependent variable, ActiveRatSightings, is a binary measure you create from the rodent inspection database that should be computed as a 1 or 0 depending on whether an inspection yielded Active Rat Signs. Month and year should be included into the regression as dummy variables (i.e. convert them to factor variables, like the example in class). The specific steps you should follow are:

Step 1. From the restaurant inspection database, generate a data set which, for each month-year-zip code combination, contains the total number of rodent-related inspection violations.

Step 2. Once you have created this data set, join the measure with the rodent inspection data so that there is a row for each rodent inspection that also contains a measure of the number of restaurant violations for the zip code, month, and year in which the inspection occurred. For example, if a property in 10012 was inspected in June 2013, you would merge into that row the total number of rodent related restaurant violations for the zip code 10012 in June 2013 computed from the data set generated in Step 1. Depending on the type of join you did, you may wish at this point to convert any missing restaurant violation numbers to zero.1 Add one to this restaurant violation measure and log it. Adding one accounts for zero values, which would otherwise yield the value negative Infinity. Adding one to all the values you are logging offsets this potential issue.

Step 3. In this new data set, create a binary variable that takes the value 1 or 0 depending on whether an inspection yields Active Rat Signs.

Step 4. Run a logistic regression on the new data set to test whether the estimated coefficient on the restaurant sightings variable has a statistically significant relationship with whether or not an inspection yields Active Rat Signs and–if it is significant–whether the relationship is positive or negative. Include month and year in your regression as factor (dummy) variables. You should include a single dummy variable for each month (Jan, Feb, etc) and for each year, not for each month-year combination. This should leave you with about 15 or 16 dummy variables in your regression output.

Does this evidence indicate that there is a statistically significant relationship between the restaurant inspection based measures and whether a rodent inspection yields Active Rat Signs? Show the regression statistics and justify your answer.

If you need, there are a number of resources available online to help you understand how to interpret logistic regression coefficients.

PART 4. Beyond inspections

Provide one example of a story you might be able to tell by connecting the NYC rats data with one or more data sources that we did NOT use here. For example, you might think about connecting the rats data to data sets on disease and health, urban transportation, housing information, demographics, or many other possibilities. You do NOT have to implement the idea, but your answer should include the following three things:

An overview of the question you might like to answer in your data analysis – no more than one or two sentences.

Pointers to a specific data set you would use and the variables you would use to join it with any of the rats data sets described above. This can be a URL link that points to a data set or you can describe how you might collect the data using web scraping or an API.

A short description of the type of supporting analysis you might conduct (one or two sentences). This hypothetical analysis can be a chart, a graph, or a regression.

Doing a left join and converting missing values to zero is a more correct approach, but for this exercise, doing an inner join that drops missing values is also okay. ↩