Datathon 2: House Prices
OIDD 245 Tambe
In this datathon, you will compete in an entry-level, in-class Kaggle competition that asks you to predict house prices. You have until the end of class today to achieve the best score you can. Real estate markets are a hot and sometimes controversial application of analytics.
Link to the Kaggle competition on predicting house prices
As a reminder, the general flow of these competitions is to build and refine a model using the provided 'training' data set, use that model to predict outcomes (e.g., house sale prices) on the provided 'test' data set, and then upload your predictions to Kaggle for evaluation and scoring. Keep in mind that you get a limited number of submissions, so use them wisely (10 per day for this competition).
As discussed in class, there are two ways to improve a model: use a better tool, or improve your predictors. As in many competitions, we have limited knowledge about these predictors. We have not covered in class how to think about combining predictors into a smaller number of features, or how to choose which predictors to keep in the model and which to omit, but do the best you can. You are not limited to using linear models. Much of the point of this exercise is simply to get a feel for some of the day-to-day challenges involved in data science tasks, and for the blend of art and science that goes into developing solutions to these problems.
In terms of what you can do to improve your model, you can:
- Try to make informed guesses on which variables will generate the best model.
- Try to use better models (e.g. classification trees, random forest, etc.).
- Transform data or clean up missing data.
- Combine or transform variables to create new variables.
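As a sketch of the last two ideas, here is what transforming and combining variables might look like in R. The specific column names (YrSold, YearBuilt, FullBath, HalfBath, LotFrontage) are standard Ames housing fields but are assumptions here; check names(train) against the actual files before using them.

```r
library(dplyr)

# Hypothetical feature engineering; the column names below are
# assumptions -- verify them against your copy of the data.
train = train %>%
  mutate(
    LogLotArea = log(LotArea),              # tame a right-skewed predictor
    Age        = YrSold - YearBuilt,        # age of the house at sale
    TotalBath  = FullBath + 0.5 * HalfBath  # combine two predictors into one
  )

# One simple way to clean up missing data: replace NAs in a
# numeric column with that column's median
train$LotFrontage[is.na(train$LotFrontage)] =
  median(train$LotFrontage, na.rm = TRUE)
```

Remember to apply the same transformations to the test data before predicting, or predict() will not find the new columns.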
If required, more information about the individual data fields is available here.
The deadline is at the end of class today.
There will be no presentations or voting this time. The winner will be the team with the lowest score (i.e., Private Leaderboard score) at the deadline.
To get you started
Here is some R code to get you started. The code below runs a simple but workable linear-regression-based prediction using one of the independent variables and creates a submission file that can be uploaded to Kaggle.
You can cut and paste this into an R-script and change the file paths as needed to get started. Remember that you should iteratively improve a model on your training data before making a submission.
library(readr)

# Step 1: Read in data
train = read_csv("~/Downloads/oidd245housinga/ames_train.csv") # training set
test = read_csv("~/Downloads/oidd245housinga/ames_test.csv")   # test set

# Step 2: Try a basic linear regression model based on some variables
hp = lm(SalePrice ~ LotArea, data = train)

# Step 3: Predict on the test data
pred = predict(hp, newdata = test)

# Step 4: Output for uploading to Kaggle
output = data.frame(Id = as.character(test$Id), SalePrice = pred)
# Use the output csv and submit to Kaggle
write_csv(output, "~/Desktop/lm_submission.csv")
Finally, submit the file you produced to Kaggle, and you should receive a score and a position on the leaderboard. Your goal is to improve the model in Step 2, either by using a different model or an alternative set of predictor variables. You are not restricted to linear models; models of any type are acceptable.
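For example, one alternative to lm() is a random forest. This sketch assumes the randomForest package is installed (install.packages("randomForest")) and that OverallQual and GrLivArea exist in your copy of the data; substitute whichever predictors your data actually contains.

```r
library(readr)
library(randomForest)

# A random forest in place of the linear model in Step 2.
# OverallQual and GrLivArea are assumed column names -- check names(train).
rf = randomForest(SalePrice ~ LotArea + OverallQual + GrLivArea,
                  data = train, ntree = 500)

# Predict on the test data, exactly as in Step 3
pred = predict(rf, newdata = test)

# Output for uploading to Kaggle, exactly as in Step 4
output = data.frame(Id = as.character(test$Id), SalePrice = pred)
write_csv(output, "~/Desktop/rf_submission.csv")
```

Note that randomForest() cannot handle rows with missing values in its predictors, so clean those up first.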
Rather than upload predictions to Kaggle after every change to your model, a good strategy is to divide your training data into a fabricated training and test portion (e.g., 70:30) and then modify and test the performance of new models using those two data sets on your laptop. The performance of your model can be assessed by computing the mean square error of your predictions, which is the evaluation metric used for this competition.
There are a number of mean-square-error functions available in R packages, or you can just write the code yourself. For example, if you were trying to compute the mean square prediction error on the training data, it might look something like this:
library(dplyr)

train$predict = predict(hp)  # fitted values on the training data
performance = train %>%
  mutate(diff_sq = (predict - SalePrice)^2) %>%
  summarise(mse = mean(diff_sq))
When you have sufficiently improved a model on your computer, you can run it against the true test sample and determine performance by uploading the predictions to Kaggle.