# Datathon 2: House Prices

#### OIDD 245 Tambe

## Rules

In this datathon, you will compete in an entry-level, **in-class**
Kaggle competition that asks you to predict house prices. You have
until the end of class today to achieve the best score you can. Real
estate markets are a hot and sometimes controversial application of
analytics.

Link to the Kaggle competition on predicting house prices

As a reminder, the general flow of these competitions is to build and refine a model using the provided ‘training’ data set, use that model to predict outcomes (e.g., house sale prices) on the provided ‘test’ data set, and then upload your predictions to Kaggle for evaluation and scoring. Keep in mind that you get a limited number of submissions (10 per day for this competition), so use them wisely.

## Tips

As discussed in class, there are two ways to improve a model: we can
either use a better tool or improve our predictors. As in many
competitions, we have limited knowledge about these predictors. We have
not covered in class how to combine predictors into a smaller number of
features or how to choose which predictors to keep in the model and
which to omit, but do the best you can. You are **not** limited to
using linear models. Much of the point of this exercise is simply to
get a feel for some of the day-to-day challenges involved in data
science tasks, and for the blend of art and science that goes into
developing solutions to these problems.

In terms of what you can do to improve your model, you can:

- Try to make informed guesses on which variables will generate the best model.
- Try to use better models (e.g. classification trees, random forest, etc.).
- Transform data or clean up missing data.
- Combine or transform variables to create new variables.
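To make the last two tips concrete, here is a minimal sketch of transforming and combining variables before fitting a model. It uses a small simulated data frame in place of the Ames files so it runs on its own; the column names and coefficients are illustrative, not the real data fields.

```
# Simulated stand-in for the training data (illustrative only)
set.seed(1)
n = 200
fake = data.frame(
  LotArea   = runif(n, 2000, 20000),
  YearBuilt = sample(1900:2010, n, replace = TRUE)
)
fake$SalePrice = 50 * fake$LotArea^0.7 +
  500 * (fake$YearBuilt - 1900) + rnorm(n, sd = 5000)

# Transformed predictor: log of lot area
# Combined predictor: house age derived from the year built
fake$logLotArea = log(fake$LotArea)
fake$Age = 2025 - fake$YearBuilt

m1 = lm(SalePrice ~ LotArea, data = fake)           # raw predictor only
m2 = lm(SalePrice ~ logLotArea + Age, data = fake)  # transformed + combined

# Compare in-sample fit; the richer model explains more variance
summary(m1)$r.squared
summary(m2)$r.squared
```

With the real data you would build the new columns on `train` after reading it in, then compare models the same way on your own held-out split rather than by in-sample R-squared alone.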

If required, more information about the individual data fields is available here.

## Submissions

The deadline is **at the end of class today**.

There will be no presentations or voting this time. The winning team will be the one with the lowest score (i.e., Private Leaderboard score) at the deadline.

## To get you started

**Here is some R code to get you started**. The code below runs a simple
but workable linear-regression-based prediction using a single
arbitrarily chosen independent variable and creates a submission file
that can be uploaded to Kaggle.

You can cut and paste this into an R-script and change the file paths as needed to get started. Remember that you should iteratively improve a model on your training data before making a submission.

```
library(readr)

# Step 1: Read in the data
train = read_csv("~/Downloads/oidd245housinga/ames_train.csv") # training set
test = read_csv("~/Downloads/oidd245housinga/ames_test.csv")   # test set

# Step 2: Try a basic linear regression model based on some variables
hp = lm(SalePrice ~ LotArea, data = train)

# Step 3: Predict on the test data
pred = predict(hp, newdata = test)

# Step 4: Build the output for uploading to Kaggle
# (building the data frame by column keeps SalePrice numeric)
output = data.frame(Id = as.character(test$Id), SalePrice = pred)

# Write the output csv and submit it to Kaggle
write_csv(output, "~/Desktop/lm_submission.csv")
```

Finally, submit the file you produced to Kaggle and you should receive a score and a position on the leaderboard. Your goal is to improve the model in Step 2, either by using a different model or an alternative set of predictor variables. You are not restricted to linear models; models of any type are acceptable.
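As one hedged sketch of a non-linear alternative for Step 2, the snippet below swaps `lm()` for a regression tree from the `rpart` package (a recommended package shipped with most R installations). It uses simulated data so it is self-contained; with the real files you would fit on `train` and predict on `test` exactly as in the starter code.

```
library(rpart)

# Simulated stand-in for the training data (illustrative only)
set.seed(42)
n = 300
train_sim = data.frame(
  LotArea   = runif(n, 2000, 20000),
  YearBuilt = sample(1900:2010, n, replace = TRUE)
)
train_sim$SalePrice = 10 * train_sim$LotArea +
  300 * (train_sim$YearBuilt - 1900) + rnorm(n, sd = 10000)

# Fit a regression tree; rpart uses the same formula interface as lm()
tree = rpart(SalePrice ~ LotArea + YearBuilt, data = train_sim)

# predict() also works the same way, so the rest of the starter-code
# pipeline (building and writing the submission file) is unchanged
pred = predict(tree, newdata = train_sim)
head(pred)
```

Because the formula and `predict()` interfaces match `lm()`, trying a different model class is often a one-line change in Step 2.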

Rather than upload predictions to Kaggle after every change to your
model, a good strategy is to divide your **training** data into
fabricated training and test portions (e.g., 70:30) and then to modify
and test the performance of new models using those two data sets on
your laptop. The performance of your model can be assessed by computing
the mean square error of your predictions, which is the evaluation
metric used for this competition.
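A minimal sketch of that 70:30 split, using only base R. A small simulated data frame stands in for the real training set so the sketch runs on its own; with the real data you would start from the `train` object read in by the starter code.

```
# Simulated stand-in for the training data (illustrative only)
set.seed(123)
train = data.frame(LotArea = runif(100, 2000, 20000))
train$SalePrice = 12 * train$LotArea + rnorm(100, sd = 8000)

# Randomly assign 70% of rows to the fabricated training portion
idx = sample(nrow(train), size = 0.7 * nrow(train))
fit_part  = train[idx, ]   # fabricated "training" portion (70%)
eval_part = train[-idx, ]  # fabricated "test" portion (30%)

# Fit on the 70%, then compute mean square error on the held-out 30%
m   = lm(SalePrice ~ LotArea, data = fit_part)
mse = mean((predict(m, newdata = eval_part) - eval_part$SalePrice)^2)
mse
```

Refitting and re-scoring on this split after each change lets you compare candidate models locally and save your Kaggle submissions for genuine improvements.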

There are a number of mean square error functions available in R packages, or you can write the code yourself. For example, computing the mean square prediction error on the training data might look something like this:

```
library(dplyr)

# Add the fitted model's predictions (here, the `hp` model from the
# starter code) as a column on the training data
train$predict = predict(hp)

performance = train %>%
  mutate(diff_sq = (predict - SalePrice)^2) %>%
  summarise(mean(diff_sq))
```

When you have sufficiently improved a model on your computer, you can
run it against the *true* test sample and determine performance by
uploading the predictions to Kaggle.

Good luck!