Lab 3: Baby Names

Each year, after the Census Bureau releases its list of the names given to babies born in the most recent year, the press commonly runs a series of stories on baby names and naming trends. The data set has also recently become popular for teaching introductory data science. In fact, one year after I created the original version of this lab in 2016, a babynames package was introduced to R which has the data pre-loaded for you. However, it is good practice for us to try loading these files on our own.

This lab is meant for you to practice using R. You must submit it, but sample solutions will be provided on the second day of the lab, so completing and submitting the lab should be straightforward. You should focus in this lab on becoming more comfortable with R.

Data Downloads:

For this lab, you will need to download and unzip a file containing the Census Bureau baby names data.

Data Description

The data set is a zipped folder of comma-separated text files that contain the Census Bureau counts of names of baby boys and girls born each year from 1880 to 2014. These data include any name-gender combination for which there were at least five babies born in that year. There is a separate text file for each year. For example, the file named yob1942.txt contains the counts of names for all boy and girl babies born in 1942.

Each row in a text file is of the format [name, gender, count]. For instance, the first sample line in the text file yob1942.txt might be:

Mary,F,7312

This line would indicate that 7,312 girls born in 1942 were given the name Mary. This lab asks you to use these data, along with R, to analyze how naming patterns in America have changed since 1880.

Remember to keep all of your work in R notebooks or R scripts and to save frequently! NOTE: Many of the steps below can be combined, but I have written them out separately to make them clearer, and I have tried to rely only on "core" R functions, which is what we have covered so far in class. There are also multiple ways to complete most of these objectives. Feel free to experiment with others.


Objective 1. Load the data files into an R data frame

To get you started quickly, I have provided the code to load in the files. If this works and makes sense, you can begin with Objective 2.

Set your working directory to the folder where the name files are located on YOUR computer.

setwd("~/Dropbox/Teaching/oidd245/labs/baby/data/")

Because we are going to append rows to this data frame, we need to start with an empty initial data frame.

babynames = NULL

The for loop below reads in each year's file and appends it to a “running total” data frame called babynames. If loading the data set takes too long, you may want to change the starting year (1970 in the loop below) to something more recent.

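for (year in 1970:2014) {
  # Read one year's file; the files have no header row
  foo = read.csv(paste("yob", toString(year), ".txt", sep=''), header=FALSE)
  # Add the year as a column and append to the running data frame
  babynames = rbind(babynames, cbind(foo, year))
}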
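To see what the loop produces before running it on the full data, here is a minimal sketch of the same pattern on two tiny made-up files. The temporary folder, the file contents, and the column names (name, gender, count, year) are all assumptions for illustration; the lab does not prescribe column names.

```r
# Hypothetical toy files in a temporary folder, standing in for the real yobYYYY.txt files
data_dir = tempdir()
writeLines(c("Mary,F,10", "John,M,8"), file.path(data_dir, "yob2013.txt"))
writeLines(c("Emma,F,12", "Noah,M,9", "Mary,F,7"), file.path(data_dir, "yob2014.txt"))

# Same append pattern as the lab code, pointed at the toy folder
babynames = NULL
for (year in 2013:2014) {
  path = file.path(data_dir, paste("yob", toString(year), ".txt", sep = ""))
  foo = read.csv(path, header = FALSE)
  babynames = rbind(babynames, cbind(foo, year))
}

# Assumed column names (read.csv would otherwise call them V1, V2, V3)
names(babynames) = c("name", "gender", "count", "year")

# Quick sanity checks: number of rows and columns, and the first few rows
print(dim(babynames))
print(head(babynames))
```

Checking dim() and head() right after loading is a cheap way to confirm that every year was read and that the year column was attached.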


Objective 2. Plot how the popularity of your own name has been changing over the years

You should now have available in your RStudio environment a data frame containing the counts of all name-gender combinations in each year from your chosen start year to 2014.
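One possible way to approach this plot, sketched on a tiny made-up data frame: subset the rows for a single name-gender combination, then plot count against year. The column names (name, gender, count, year) and all counts here are assumptions for illustration, not the real data.

```r
# Toy stand-in for the loaded data; column names and counts are made up
babynames = data.frame(
  name   = c("Mary", "John", "Mary", "John", "Mary", "John"),
  gender = c("F", "M", "F", "M", "F", "M"),
  count  = c(100, 90, 80, 95, 60, 97),
  year   = c(2012, 2012, 2013, 2013, 2014, 2014)
)

# Keep only the rows for one name-gender combination
myname = babynames[babynames$name == "Mary" & babynames$gender == "F", ]

# Sort by year so the line is drawn from left to right
myname = myname[order(myname$year), ]

# Line plot of the yearly counts
plot(myname$year, myname$count, type = "l",
     xlab = "Year", ylab = "Number of babies",
     main = "Babies named Mary (F) per year")
```

Filtering on gender as well as name matters because the same name can appear in both the F and M rows of a given year.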


Objective 3. Visualize the growth in girl names from 1880-2014

Create a plot of the number of different baby girl names reported by the Census Bureau in each year from 1880 (or whatever you have chosen as the initial year in your data set) to 2014. In other words, no matter how many girls have a given name, it should only count as one name in that year.
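The counting step can be sketched with core R as follows. This is only one of several possible approaches, shown on a made-up data frame; the column names (name, gender, count, year) are assumptions.

```r
# Toy stand-in for the loaded data; column names and counts are made up
babynames = data.frame(
  name   = c("Mary", "Anna", "John", "Mary", "Anna", "Emma", "John"),
  gender = c("F", "F", "M", "F", "F", "F", "M"),
  count  = c(100, 50, 90, 80, 55, 20, 95),
  year   = c(2013, 2013, 2013, 2014, 2014, 2014, 2014)
)

# Keep only the girl rows
girls = babynames[babynames$gender == "F", ]

# For each year, count distinct names; the count column is ignored on purpose
names_per_year = tapply(girls$name, girls$year, function(x) length(unique(x)))

# The years are stored as the names of the result, so convert them back to numbers
plot(as.numeric(names(names_per_year)), as.vector(names_per_year), type = "l",
     xlab = "Year", ylab = "Number of different girl names")
```

In the toy data, 2013 has two distinct girl names and 2014 has three, so the line rises from 2 to 3.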

It may be interesting to visually compare the results with the following image.


Objective 4. Write a function to generate the most popular names for a given gender and year

Write a function toptennames that takes two arguments, year and gender (in that order, to match the test calls below), and returns the top ten names for that gender in that year. This function should:

After running your function, test it by calling it on girl names for the first and last years that appear in your data set:

> print(toptennames(1880, "F"))
> print(toptennames(2014, "F"))
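As a rough sketch of what such a function might look like, the version below filters to one year and gender, sorts by count, and keeps the first ten names. It assumes a global data frame called babynames with columns name, gender, count, and year; those column names and the counts in the toy data are made up for illustration.

```r
# Toy data: twelve girl names with made-up counts, plus one boy row
girl_names = c("Emma", "Olivia", "Sophia", "Isabella", "Ava", "Mia",
               "Emily", "Abigail", "Madison", "Charlotte", "Harper", "Sofia")
babynames = data.frame(
  name   = c(rev(girl_names), "Noah"),
  gender = c(rep("F", 12), "M"),
  count  = c(109:120, 115),   # stored in ascending order so the sort is doing real work
  year   = 2014
)

# Argument order (year, gender) matches the calls shown in the lab text
toptennames = function(year, gender) {
  # Rows for the requested year and gender only
  rows = babynames[babynames$year == year & babynames$gender == gender, ]
  # Largest counts first
  rows = rows[order(rows$count, decreasing = TRUE), ]
  # At most ten names, returned as plain character strings
  head(as.character(rows$name), 10)
}

print(toptennames(2014, "F"))
```

Using head() rather than indexing 1:10 keeps the function from returning NA entries when fewer than ten names match.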


Objective 5. Compute the most gender-neutral names

Generate a list of the ten names that, in 2014, were relatively popular for both baby boys and baby girls. To do this, use the following condition: for names given to at least 1,000 babies in that year (boys and girls combined), compute the absolute difference between the number of boys and the number of girls with the name. Then, we will call the names with the smallest difference the most “gender neutral” names. This is not the best definition of gender-neutral, but it is a definition and is straightforward to implement.

To complete this objective:
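One possible sequence of steps, sketched on made-up 2014 data, is to split the rows by gender, merge the two halves by name, apply the 1,000-baby threshold, and sort by the absolute difference. The column names (name, gender, count, year) and all counts are assumptions for illustration.

```r
# Toy stand-in for the loaded data; all counts are made up
babynames = data.frame(
  name   = c("Riley", "Riley", "Avery", "Avery", "Emma", "Emma", "Noah", "Quinn", "Quinn"),
  gender = c("F", "M", "F", "M", "F", "M", "M", "F", "M"),
  count  = c(700, 650, 1500, 400, 2000, 10, 1900, 300, 250),
  year   = 2014
)

# Restrict to 2014, then split into girl and boy tables
y = babynames[babynames$year == 2014, ]
girls = y[y$gender == "F", c("name", "count")]
boys  = y[y$gender == "M", c("name", "count")]

# One row per name, with girl and boy counts side by side;
# all = TRUE keeps names that appear for only one gender
both = merge(girls, boys, by = "name", suffixes = c(".F", ".M"), all = TRUE)
both$count.F[is.na(both$count.F)] = 0
both$count.M[is.na(both$count.M)] = 0

# Keep names with at least 1000 babies in total, then rank by absolute difference
both$total = both$count.F + both$count.M
popular = both[both$total >= 1000, ]
popular$diff = abs(popular$count.F - popular$count.M)
popular = popular[order(popular$diff), ]

neutral = head(as.character(popular$name), 10)
print(neutral)
```

In this toy data Quinn is nearly balanced but falls below the 1,000-baby threshold, which is why the threshold is applied before ranking.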


Objective 6. Convert your code into Python

Ask ChatGPT to take the code you developed for Objective 1 (including setting your directory path to the proper place and running through the for loop) and convert it into Python. Then put it in a Python script file in RStudio, as will be demonstrated in class, and test the output to ensure that it works. You can check the length of a list in Python using the len() function.

Now, for fun (!), use ChatGPT to convert your code into Julia and into Scala, which are other popular data science languages (but there is no need to test if they work).


EXTRA CREDIT.

Now, turning back to R, attempt to “prompt engineer” ChatGPT to aid you in creating the R-based solutions for Objectives 1 through 5 above from scratch (i.e. without asking it to convert existing code). For credit, provide screenshots of the prompts you use and the R-based output it generates, and demonstrate that the code — or an edited version of it — works.