Data project 2
OIDD 245 Tambe
Overview
Data project 2 is to be completed individually. It gives you a platform to apply your R skills to a project whose structure and context you specify. One goal is to complete a data project that may be of specific use to you in getting a future job (i.e. something you can talk about in an interview or showcase to a future employer). Another is to let you spend time on a project based in an industry context that is of specific interest to you, since, as a group, you are going to be entering different industries (e.g. consulting, finance, real estate, healthcare, technology, etc.). This can also be a “passion” project on a topic you are particularly interested in, which could be anything from protecting endangered species to Game of Thrones.
You are highly encouraged to pursue rich and creative data sources. Many are freely available on the web. Moreover, data from sources such as Facebook, Twitter, Yelp, and other companies can be harvested for analysis. In this class, we have covered web scraping, using APIs, text mining, and prediction, and surveyed some visualization techniques. You are also welcome (and encouraged) to use packages and methods that we have not covered in class. If you are unsure whether a project you are considering is a good candidate, please ask me!
Choosing an audience and specifying an “interesting” question are important parts of the assignment (and of any good data science work). What makes for an interesting question can be subjective and is often domain specific. If you are stuck, it is therefore a good idea to ask friends, family, TAs, or me for feedback. I am happy to provide feedback over the next few weeks as you develop your projects.
Learning objectives
- Gain experience with data skills and with working with large data sets using R.
- Learn to appreciate the combinatorial nature of the possibilities that arise when combining data sources.
- Try to be creative in your projects. Many data scientists would argue that creativity, domain knowledge, and storytelling are equally or even more important than skills such as R or Python when developing data products.
Project Requirements
- Your project must be new and completed specifically for this assignment.
- You must use R for the project.
- Your project must use R to acquire external data, either through web scraping or through an API. This does not have to be a key data source for your analysis. It can play a secondary role to data acquired in other ways (e.g. downloaded).
- Your output should include at least one data visualization (static or interactive).
- You should incorporate some level of analysis. This can be prediction, regression, machine learning, or some form of natural language processing. You can also combine these, e.g. use sentiment-based measures for prediction.
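As a rough illustration of how the acquisition, analysis, and visualization requirements can fit together in one R script, here is a minimal sketch. It assumes the rvest and ggplot2 packages are installed. The table contents, the `listings` name, and the sqft/price columns are all hypothetical, and the HTML is inlined so that the sketch runs without a network connection. In a real project you would point `read_html()` at an actual page, or call an API instead.

```r
# Minimal sketch: scrape a table, fit a simple model, draw one plot.
# All data and names below are made up for illustration.
library(rvest)    # HTML parsing
library(ggplot2)  # plotting

# Stand-in for a page you would normally fetch with read_html("<some URL>").
page <- minimal_html('
  <table id="listings">
    <tr><th>sqft</th><th>price</th></tr>
    <tr><td>500</td><td>1200</td></tr>
    <tr><td>750</td><td>1500</td></tr>
    <tr><td>1000</td><td>2100</td></tr>
    <tr><td>1250</td><td>2400</td></tr>
  </table>')

# Acquisition: select the table by its id and convert it to a data frame.
listings <- html_table(html_element(page, "#listings"))

# Analysis: a simple linear regression of price on square footage.
fit <- lm(price ~ sqft, data = listings)

# Visualization: scatter plot with the fitted regression line.
p <- ggplot(listings, aes(x = sqft, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Hypothetical listings: price vs. square footage")
```

Running `summary(fit)` shows the regression output, and `print(p)` (or `ggsave()`) renders the plot. Your own project will likely need more data cleaning and a richer analysis than this.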
Deliverables
- The deliverable for this project is a link (URL) to a Medium post where your project can be viewed, together with the R scripts or notebooks used for the analysis.
- Building a Medium post: A goal of this project is to provide you with the space to take another step towards building a project portfolio that you can share with others. When submitting your Medium post link to Canvas, make sure you get the ‘friend link’ from the top right of the page as shown below, so that anyone with that link can access it even if it’s behind a paywall.
In addition to your project information, you should also include the following in your post:
- Who you are (so we can directly see on the site who the submission is from)
- Credit for the data sources you use
Some examples of Medium posts that incorporate data visualizations to tell a story (unrelated to the “Analytics for Good” theme):
-
We will share ideas in class on Apr 20th and Apr 25th. The presentations are not graded, but you are required to give one, and the feedback you provide to others is factored into your participation score. You are encouraged to use feedback from the presentations to improve your project. For your presentation, plan to use no more than two slides and to cover the following points:
- What question is your data project answering and why is it useful to you or others?
- What data sources might you use?
Grading Criteria (125 pts)
- Idea share in class (10 pts, full credit awarded as long as you do it)
- Data partially acquired through scraping or through an API (10 pts)
- Quality of data visualization (20 pts)
- Quality of analysis (i.e. prediction, ML, NLP, or other) (20 pts)
- Utility of project, e.g. how clearly it makes a useful point (20 pts)
- Creativity & originality (20 pts)
- Clarity of presentation of results on website (25 pts)