Data project 3

OIDD 245 Tambe

Overview

Data project 3 is to be completed individually. It gives you a platform to apply your R skills to a project whose structure and context you are free to specify. One goal is to complete a data project that may help you in getting a future job (i.e., something you can talk about in an interview or show to a future employer). Another is to let you spend time on a project grounded in an industry context of specific interest to you, since as a group you are going to be entering different industries (e.g., consulting, finance, real estate, healthcare, technology). This can also be a “passion” project on a topic you are particularly interested in, which could be anything from protecting endangered species to Game of Thrones.

You are highly encouraged to pursue rich and creative data sources. There are many that are freely available on the web. Moreover, data from sources such as Facebook, Twitter, Yelp, and other companies can be harvested for analysis. In this class, we have covered web scraping, using APIs, text mining, and prediction, and surveyed some visualization techniques; you are also welcome to use packages and methods that we have not covered in class.

Choosing an audience and specifying an “interesting” question are important parts of the assignment (and of any good data science work). What makes for an interesting question can be subjective and is often domain specific. It is a good idea, therefore, to avail yourself of feedback from friends, family, TAs, or me, if you are stuck. I am happy to provide feedback over the next few weeks as you develop your projects.

Deadlines:

You are asked to briefly share an idea you have for Data project 3 in class on either April 21st or April 26th.
The project must be submitted by May 3rd at 11:59 pm.

Project Requirements

Your project must be new and completed specifically for this assignment.
You must use R for the project.
Your project must use R to acquire external data, either through web scraping or through an API. This does not have to be the key data source for your analysis. It can play a secondary role to data acquired in other ways (e.g., downloaded).
Your output should include at least one data visualization (static or interactive).
You must incorporate some level of analysis. This can be prediction, regression, machine learning, or some form of natural language processing. You can also combine these, e.g., use sentiment-based measures for prediction.
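
As a rough illustration of how the requirements above can fit together in one script, here is a minimal sketch, not a template you must follow. It assumes the rvest package is installed. The HTML table, its id ("listings"), and the numbers in it are all made up for illustration and stand in for a page you would actually scrape; in a real project you would pass a URL to read_html() instead of a string.

```r
# Minimal sketch: acquire data by scraping, run a simple analysis, draw a plot.
# Assumption: rvest is installed. The page content below is hypothetical
# and exists only so the example runs without a network connection.
library(rvest)

page_html <- '
<html><body>
<table id="listings">
  <tr><th>sqft</th><th>price</th></tr>
  <tr><td>600</td><td>1500</td></tr>
  <tr><td>750</td><td>1800</td></tr>
  <tr><td>900</td><td>2100</td></tr>
  <tr><td>1100</td><td>2600</td></tr>
  <tr><td>1300</td><td>2900</td></tr>
</table>
</body></html>'

# 1. Acquire: parse the page and pull the table out by its CSS selector.
#    In a real project, replace page_html with the URL of the page.
page <- read_html(page_html)
listings <- html_table(html_element(page, "#listings"))

# 2. Analyze: a simple linear regression of price on square footage.
fit <- lm(price ~ sqft, data = listings)
print(coef(fit))

# 3. Visualize: scatter plot of the scraped data with the fitted line.
plot(listings$sqft, listings$price,
     xlab = "Square feet", ylab = "Monthly price",
     main = "Price vs. size (illustrative data)")
abline(fit)
```

The same three steps apply if you use an API instead of scraping: only step 1 changes, and the scraped or API data can be a secondary source alongside data you downloaded.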

Deliverables

The R-scripts or R-notebooks used for your analysis.
A URL for a website (e.g. Wix or Weebly) OR Medium post in which you present your results.
We will share ideas in class on Apr 21st and Apr 26th. The presentations are not graded, but you are required to give one and the feedback you provide to others is factored into your participation scores. You are encouraged to use feedback from the presentations to improve your projects. In your discussion, you should plan to use no more than one slide and to cover the following points:
- What question is your data project answering and why is it useful to you or others?
- What data sources might you use?

Grading Criteria (125 pts)

Idea share in class (10 pts, full credit awarded as long as you do it)
Data partially acquired through scraping or through an API (10 pts)
Quality of data visualization (20 pts)
Quality of analysis (i.e. prediction, ML, NLP, or other) (20 pts)
Utility of project, e.g. how clearly it makes a useful point (20 pts)
Creativity & originality (20 pts)
Clarity of presentation of results on website (25 pts)

Some sample projects from past years