Lab 3: Baby Names

Each year, after the Census Bureau releases its list of the names given to babies born in the most recent year, the press commonly publishes a series of stories on baby names and naming trends. The data set has also recently become popular for teaching introductory data science. In fact, one year after I created the original version of this lab in 2016, a babynames package was introduced to R with the data pre-loaded. However, it is good practice to try loading these files on our own.

This lab is meant for you to practice using R. It is NOT for submission. Sample solutions will be provided on the second day of the lab.

Data Downloads:

For this lab, you will need to download and unzip a file containing the Census Bureau baby names data.

Data Description

The data set is a zipped folder of comma-separated text files that contain the Census Bureau counts of names of baby boys and girls born each year from 1880 to 2014. These data include every name-gender combination for which at least five babies were born in that year. There is a separate text file for each year from 1880 to 2014. For example, the file named yob1942.txt contains the counts of names for all boy and girl babies born in 1942.

Each row in a text file is of the format [name, gender, count]. For instance, some sample lines in the text file yob1942.txt might be:

The first line would indicate that 7,312 girls born in 1942 were given the name Mary. This lab asks you to use these data, along with R, to analyze how naming patterns in America have changed over the last 150 years.

Remember to keep all of your work in R notebooks or R scripts and to save frequently! NOTE: Many of the steps below can be combined, but I have written them out separately to make them clearer, and I have tried to rely only on "core" R functions, which is what we have covered so far in class. There are also multiple ways to complete most of these objectives. Feel free to experiment with others.


Objective 1. Load the data files into an R data frame

To get you started quickly, I have provided the code to load in the files. If this works and makes sense, you can begin with Objective 2.

Set your working directory to the folder where the name files are located on YOUR computer.

setwd("~/Dropbox/Teaching/oidd245/labs/baby/data/")

Because we are going to append rows to this dataframe, we need to start with an empty initial dataframe.

babynames = NULL

Use a for loop to read in each file and append it to a “running total dataframe” called babynames. If loading the data set is taking too long, you may want to change the start year (1970 in the code below) to something more recent.

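for (year in (1970:2014)) {
    # The files have no header row, so read.csv names the columns V1, V2, V3
    foo = read.csv(paste("yob", toString(year), ".txt", sep=''), header=FALSE)
    # Add a year column to this file's rows, then append them to the running data frame
    babynames = rbind(babynames, cbind(foo, year))
}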
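The loop above leaves the first three columns with read.csv's default names (V1, V2, V3). As a minimal sketch, assuming you would like more descriptive names, the example below renames them. The column names name, gender, and count are my own choice (they follow the [name, gender, count] row format described earlier), and the two tiny files written to a temporary folder are made up so that the example runs without the real data.

```r
# Write two tiny made-up files to a temporary folder so the sketch is self-contained.
tmp = tempdir()
writeLines(c("Mary,F,30", "John,M,25"), file.path(tmp, "yob2001.txt"))
writeLines(c("Mary,F,28", "John,M,27", "Alex,M,6"), file.path(tmp, "yob2002.txt"))

# Same loop as in the lab, pointed at the temporary folder.
babynames = NULL
for (year in 2001:2002) {
    foo = read.csv(file.path(tmp, paste("yob", toString(year), ".txt", sep='')), header=FALSE)
    babynames = rbind(babynames, cbind(foo, year))
}

# Replace the default V1, V2, V3 with descriptive (assumed) column names.
names(babynames) = c("name", "gender", "count", "year")
print(babynames)
```

The later sketches in this lab assume these same column names; if you keep V1, V2, V3, adjust accordingly.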

Objective 2. Plot how the popularity of your own name has been changing over the years

You should now have available in your RStudio environment a data frame containing the counts of all name-gender combinations in each year from your chosen start year to 2014.
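One possible way to approach this objective is sketched below: subset the data frame to a single name-gender combination and draw a line plot of the count against the year. The data frame here is a toy stand-in with made-up counts, and the column names (name, gender, count, year) are assumptions; substitute your own data frame and your own name.

```r
# Toy stand-in for the babynames data frame (made-up counts, assumed column names).
babynames = data.frame(
    name   = c("Mary", "John", "Mary", "John", "Mary", "John"),
    gender = c("F", "M", "F", "M", "F", "M"),
    count  = c(30, 25, 28, 27, 22, 31),
    year   = c(2001, 2001, 2002, 2002, 2003, 2003)
)

# Keep only the rows for one name-gender combination.
mine = babynames[babynames$name == "Mary" & babynames$gender == "F", ]

# Line plot of the yearly count for that name.
plot(mine$year, mine$count, type="l",
     xlab="Year", ylab="Number of babies", main="Babies named Mary (F)")
```

Filtering on gender as well as name matters because some names appear for both boys and girls in the same year.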


Objective 3. Visualize the growth in girl names from 1880-2014

Create a plot of the number of different baby girl names in each year from 1880 (or whatever you have chosen as the initial year in your data set) to 2014. Plot the number of different names there are for girls reported by the Census Bureau in the data each year. In other words, no matter how many girls have a given name, it should only count as one name.
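As a rough sketch of the counting step, the example below uses tapply to count the distinct girl names in each year and then plots the result. The data are made up and the column names (name, gender, count, year) are assumptions; other core R functions such as table or aggregate would work as well.

```r
# Toy stand-in for the babynames data frame (made-up counts, assumed column names).
babynames = data.frame(
    name   = c("Mary", "Anna", "John", "Mary", "Anna", "Emma", "John"),
    gender = c("F", "F", "M", "F", "F", "F", "M"),
    count  = c(30, 12, 25, 28, 11, 9, 27),
    year   = c(2001, 2001, 2001, 2002, 2002, 2002, 2002)
)

# Keep only the girl names.
girls = babynames[babynames$gender == "F", ]

# For each year, count how many different names appear (the counts themselves are ignored).
n_names = tapply(girls$name, girls$year, function(x) length(unique(x)))

# The years are stored as the names of the result, so convert them back to numbers.
plot(as.numeric(names(n_names)), as.vector(n_names), type="l",
     xlab="Year", ylab="Number of different girl names")
```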

It may be interesting to visually compare the results with the following image.


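Objective 4. Find the top ten names for a given gender and year

Write a function toptennames that takes two arguments, year and gender (in that order, matching the test calls below), and returns the top ten names for that gender in that year. This function should: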

After running your function, test it by calling it on girl names for the first and last years that appear in your data set:

> print(toptennames(1880, "F"))
> print(toptennames(2014, "F"))
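A minimal sketch of such a function is shown below, assuming the data frame is called babynames and has columns named name, gender, count, and year (these names are assumptions). The argument order follows the sample calls above, with the year first. The toy data are made up and contain fewer than ten names per year, so the function simply returns as many as it finds.

```r
# Toy stand-in for the babynames data frame (made-up counts, assumed column names).
babynames = data.frame(
    name   = c("Mary", "Anna", "John", "Anna", "Mary", "Emma", "John"),
    gender = c("F", "F", "M", "F", "F", "F", "M"),
    count  = c(30, 12, 25, 11, 28, 9, 27),
    year   = c(2001, 2001, 2001, 2002, 2002, 2002, 2002)
)

toptennames = function(year, gender) {
    # Rows for the requested year and gender only.
    sub = babynames[babynames$year == year & babynames$gender == gender, ]
    # Sort by count, largest first.
    sub = sub[order(sub$count, decreasing=TRUE), ]
    # Return at most ten names as a character vector.
    head(as.character(sub$name), 10)
}

print(toptennames(2002, "F"))
```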

Objective 5. Compute the most gender-neutral names

Generate a list of the ten names that, in 2014, were relatively popular for both baby boys and baby girls. To do this, use the following condition: for names given to at least 1000 babies in that year (boys and girls combined), compute the difference between the number of boys and the number of girls with the name. Then, we will call the names with the smallest magnitude of difference the most “gender neutral” names. This is not the best definition of gender-neutral, but it is a definition and is straightforward to implement.

To complete this objective:
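One way the computation could go is sketched here, using merge to line up the girl and boy counts for each name. The data are made up, and the column names (name, gender, count, year) are assumptions. Names that appear for only one gender are given a count of 0 for the other, which is my reading of the definition above rather than something the lab states.

```r
# Toy 2014 data (made-up counts, assumed column names).
babynames = data.frame(
    name   = c("Riley", "Riley", "Avery", "Avery", "Emma", "Emma", "Sky", "Sky", "Liam"),
    gender = c("F", "M", "F", "M", "F", "M", "F", "M", "M"),
    count  = c(900, 800, 1500, 400, 2000, 10, 300, 290, 1800),
    year   = 2014
)

yr    = babynames[babynames$year == 2014, ]
girls = yr[yr$gender == "F", c("name", "count")]
boys  = yr[yr$gender == "M", c("name", "count")]

# One row per name, with the girl count and boy count side by side.
both = merge(girls, boys, by="name", all=TRUE, suffixes=c(".F", ".M"))
# A name missing for one gender gets a count of 0 there (assumption).
both$count.F[is.na(both$count.F)] = 0
both$count.M[is.na(both$count.M)] = 0

both$total = both$count.F + both$count.M
both$diff  = abs(both$count.F - both$count.M)

# Keep names with at least 1000 babies, then take the ten smallest differences.
popular = both[both$total >= 1000, ]
neutral = head(as.character(popular$name[order(popular$diff)]), 10)
print(neutral)
```

In the toy data, Sky has the smallest difference but is excluded because it falls below the 1000-baby threshold.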


Objective 6. Compute the expected age of people in this class, given the list of class names. (OPTIONAL)

You will need the first names of students in this class.

For a given year, the goal is to compute the likelihood of this distribution of names. For any particular name in a given year, this likelihood would be the number of people with that name divided by the total number of babies named in that year in the data file. For example, the likelihood of being named Joseph in 1992 is the number of babies given the name Joseph in 1992 divided by the total number of babies named in the 1992 data file.

If a name does not appear in this data set, you can assign it a probability of 0. Recall that names given to fewer than five babies in a year are not included in the data set for that year.

Then, for a given year, the likelihood of observing the distribution of names from our class is the sum of these probabilities over all of the class names. Plot how this likelihood changes from year to year, and find the birth year that maximizes it.
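The calculation described above might be sketched as follows. The data and the classnames vector are made up, and the column names (name, gender, count, year) are assumptions. The sketch also adds the boy and girl counts together for each name, which is one reading of "the number of people with that name".

```r
# Toy stand-in for the babynames data frame (made-up counts, assumed column names).
babynames = data.frame(
    name   = c("Mary", "John", "Alex", "Mary", "John", "Alex"),
    gender = c("F", "M", "M", "F", "M", "M"),
    count  = c(50, 40, 10, 20, 30, 50),
    year   = c(1991, 1991, 1991, 1992, 1992, 1992)
)

# Hypothetical class list; "Zed" is not in the data, so it contributes 0.
classnames = c("Alex", "John", "Zed")

years = sort(unique(babynames$year))

# For each year, sum (count for the name / total babies that year) over the class names.
score = sapply(years, function(y) {
    yr = babynames[babynames$year == y, ]
    total = sum(yr$count)
    sum(sapply(classnames, function(n) sum(yr$count[yr$name == n]) / total))
})

plot(years, score, type="b", xlab="Birth year", ylab="Sum of name probabilities")

# Year with the highest score.
best = years[which.max(score)]
print(best)
```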