Datathon 1: Speed Dating
OIDD 245 Tambe
What is a datathon?
A datathon is a timed workshop that asks researchers to turn information into knowledge. It’s a format modeled after hackathons. The difference is that datathons use research questions and datasets to advance knowledge, not to launch apps or new software. At a datathon, participants work in teams to frame a research question, create and implement a research design, mobilize data resources and present their findings.
They are becoming increasingly common for the following reasons:
- Companies want to get people to work on their data and data problems
- Companies want to find smart people to hire
- Companies want to brand themselves as having interesting data and problems on which to work
- They are a good learning and community building exercise
Key objectives of our datathons
- To reinforce technical skills in a realistic setting.
- To practice an *open* question format that simulates a data interview (see below).
- To consider the mix of skills required for data science work.
The data-driven interview is an interview format that is becoming increasingly common. In such contexts, or in data consulting exercises in general, you are not given structured goals. Rather, you are given data sets and asked to do something with them. Deciding what to do can be as hard as or harder than the technical data work, and often draws heavily on your domain expertise.
At Jawbone, Rogati said, each applicant for data science jobs gets three hours to make sense of mixed-up company data sets. The test can reveal if candidates possess “applied skills,” she said, not just statistical know-how.
Applied skills are becoming increasingly important. The following article (read it outside of class if you are interested) lists these important skills for a data scientist:
- Statistical thinking
- Technical acumen
- Multi-modal communication skills
Note the importance of non-technical skills on this list. The answers come from the data, but the questions have to be formulated by the data scientist, and that requires some knowledge or expertise about the domain.
- Use only the data sets you are given.
- It is recommended that you use Tableau for this exercise. Future datathons will be focused on R, but this exercise was designed to get you accustomed to the datathon idea. It is about conducting exploratory data analysis; it does not require any statistical tests or data modeling.
The data set for this exercise was generated from “speed dating” events conducted at Columbia University. In general, data on dating has provided significant raw material for data analysis. For instance, the data science team at OkCupid maintains a widely read blog of what it finds in its data. Obviously, people are interested.
In this exercise, you are asked to generate and illustrate a finding that others might find interesting. In other words, generate a finding (i.e., a story and supporting visualization) that might be "blog-worthy". The output should be a single, well-labeled image or set of images (collected into one graphic) and a brief description of what it shows.
Please submit your entry by 11:45 am for the 10:30 am section or 2:45 pm for the 1:30 pm section. Submit your visualization by uploading a screenshot of your image to a slide at the appropriate link (to be announced in class), and include a brief title and the names of the people you worked with on your slide. Alternatively, you can “export” the image from Tableau and upload it.
You will likely be asked to give an informal presentation of your entry, less than 60 seconds long, at the beginning of the next class session. A winning entry will be chosen from the presentations through an anonymous class poll.
There is only one data source to be used for this exercise, which is the speed dating data. However, you will also need an additional Word document that contains information about the fields in the data.
Participants in these speed dating trials engaged in four-minute conversations to determine whether or not they would be interested in meeting one another again. There is a row for each meeting between two individuals. Each row includes partner ratings, the correlation in interests between the two people, whether a match was made, and where the meeting fell in the sequence of meetings. Moreover, the participants were surveyed, so there is data about their major, hobbies, preferences, hometown, race, and so on, as well as how they rate themselves in terms of attractiveness, how often they go out on dates, what they value in a date, and other similar assessments.
For this exercise, working in teams is required!
There is too much data here to evaluate all the possibilities, given the time constraints. A good strategy is, along with your team, to settle on two or three “stories” in the data that seem promising. Narrow your focus to the few fields required to support those stories and ignore everything else.
Start with simple ideas, try to execute them in the data, and progressively add complexity as time permits.
Good stories matter. In general, a clever idea and a little bit of “domain expertise” are at least as important as data skills, and probably much more so!
Some say that effective data science teams require a journalist, a data scientist, and a data visualization expert. For efficiency, you may want to think about having some of these tasks done in parallel (to the extent possible) by members of your team.
Good visualization matters! Try to leave yourself time to present your findings in an attractive and compelling way.
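For teams curious about what a "start simple" first pass might look like outside Tableau, here is a minimal Python sketch. It is only an illustration: the rows are made-up toy data, and the field names `order` and `match` are hypothetical stand-ins for whatever the field documentation actually calls them. It computes the share of meetings that ended in a match for each position in the meeting sequence, which is the kind of single, narrow question a team might settle on before adding complexity.

```python
from collections import defaultdict

# Toy rows standing in for the real data (one row per meeting).
# The field names "order" and "match" are hypothetical.
meetings = [
    {"order": 1, "match": 1},
    {"order": 1, "match": 0},
    {"order": 2, "match": 0},
    {"order": 2, "match": 0},
    {"order": 3, "match": 1},
    {"order": 3, "match": 1},
]


def match_rate_by(rows, field):
    """Return {field value: share of meetings that ended in a match}."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for row in rows:
        key = row[field]
        totals[key] += 1
        matches[key] += row["match"]
    return {key: matches[key] / totals[key] for key in sorted(totals)}


rates = match_rate_by(meetings, "order")
for order, rate in rates.items():
    print(f"meeting #{order}: {rate:.0%} matched")
```

The same grouping idea carries over directly to Tableau: one dimension (here, meeting order), one measure (here, the average of the match flag), and nothing else until the simple version works.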