Lab 3: Baby Names sample solutions

Data Downloads:

For this lab, you will need to download and unzip the following file.

Data Description

The data set is a zipped folder of comma-separated text files that contain the Census Bureau counts of names of baby boys and girls born each year from 1880 to 2014. These data include any name-gender combination for which there were at least five babies born in that year. There is a separate text file for each year from 1880 to 2014. For example, the file named yob1942.txt contains the counts of names for all boy and girl babies born in 1942.

Each row in a text file is of the format [name, gender, count]. For instance, some sample lines in the text file yob1942.txt might be

The first line would indicate that 7,312 girls born in 1942 were given the name Mary. This lab asks you to use these data, along with R, to analyze how naming patterns in America have changed over the last 150 years. Remember to keep all of your work in R-scripts and to save frequently!
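As a quick illustration of the row format, the snippet below parses the single Mary line described above using `read.csv`. The column labels are chosen here for readability (the files themselves have no header row), and `colClasses` is set explicitly because a lone "F" value would otherwise be read as the logical value FALSE.

```r
# One line in the [name, gender, count] format, taken from the Mary example above.
sample_line <- "Mary,F,7312"

# header = FALSE because the files have no header row.
# colClasses keeps gender as text; without it, a column holding only "F"
# would be converted to logical FALSE.
parsed <- read.csv(text = sample_line, header = FALSE,
                   col.names = c("name", "gender", "count"),
                   colClasses = c("character", "character", "integer"))
print(parsed)
```

In the full files the gender column contains both "F" and "M", so the default type guessing leaves it as text, but setting `colClasses` makes that explicit.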

NOTE: Many of the steps below can be combined, but I have tried to write them out to make them more clear, and I have tried to rely only on “core” R functions, which is what we have covered so far in class. There are also multiple ways to complete most of these objectives. Feel free to experiment with others.

Objective 1. Load the data files into an R data frame

# To get you started quickly, I have provided the initial code to load in the files.  If this works, you can begin with Objective 2.

# Set your working directory to the folder where the name files are located.
setwd("data")
babynames = NULL

# Loop over the years, read in each file, and append it to the running data frame babynames. This solution starts at 1950; change the start year to 1880 to load the full range.
for (year in (1950:2014)) {
    foo = read.csv(paste("yob", toString(year), ".txt", sep=''), header=FALSE)
    babynames = rbind(babynames, cbind(foo, year))  
}
# After the data frame has been created (and you have checked it to make sure it is correct), give it sensible column names using the `names` command.  See `help(names)` for more information on how that works.  In this case, the data frame has four columns, so you will need to pass it a character vector of four text strings to name the four columns.

names(babynames) = c("name", "gender", "count", "year")
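Calling `rbind` inside a loop builds a new data frame on every pass, which can get slow with many files. One alternative, sketched here with core R only, reads every file into a list and binds the pieces once at the end. To keep the sketch self-contained, it writes two tiny made-up files to a temporary folder; the helper name `read_year`, the folder `toy_dir`, and the toy rows are assumptions for illustration, not part of the lab data.

```r
# Toy setup: two small files in a temporary folder (made-up rows, not real counts).
toy_dir <- tempfile("names")
dir.create(toy_dir)
writeLines(c("Ann,F,20", "Bob,M,15"), file.path(toy_dir, "yob2001.txt"))
writeLines(c("Ann,F,25", "Cal,M,10"), file.path(toy_dir, "yob2002.txt"))

# Hypothetical helper: read one year's file and tag each row with its year.
read_year <- function(year, dir) {
  path <- file.path(dir, paste0("yob", year, ".txt"))
  foo <- read.csv(path, header = FALSE,
                  colClasses = c("character", "character", "integer"))
  cbind(foo, year)
}

# Read all years into a list, then bind the pieces together once.
pieces <- lapply(2001:2002, read_year, dir = toy_dir)
toy_babynames <- do.call(rbind, pieces)
names(toy_babynames) <- c("name", "gender", "count", "year")
print(toy_babynames)
```

The result has the same four columns as the loop version, so the later objectives would work unchanged on it.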

Objective 2. Plot how the popularity of your own name has been changing over the years

You should now have available in your RStudio environment a data frame containing the counts of all name-gender combinations in each year from your chosen start year to 2014.

# Create a data frame _mynames_ which only keeps rows in _babynames_ that match your name and gender. Note the use of the '&' which is how to do 'and' in R.  There are syntactically shorter ways to do this in R, which we can try later in the semester.

mynames = babynames[which(babynames$name=="Ishmael" & babynames$gender=="M"),]
# Use the `plot` command to plot counts vs. year, label the axes, and give it a title.

plot(mynames$year, mynames$count, xlab="Year", ylab="Counts", main="How the popularity of my name has changed")
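The comment above mentions that there are shorter ways to filter rows. One such option in core R is `subset()`, shown here on a small made-up data frame (`toy_babynames` and its values are illustrative assumptions, not the lab data).

```r
# Made-up rows for illustration only.
toy_babynames <- data.frame(
  name   = c("Ishmael", "Ishmael", "Ann", "Ishmael"),
  gender = c("M", "M", "F", "F"),
  count  = c(12, 15, 30, 5),
  year   = c(2001, 2002, 2001, 2002),
  stringsAsFactors = FALSE
)

# Same filter as the which() version, written with subset().
# Column names can be used directly inside the condition.
toy_mynames <- subset(toy_babynames, name == "Ishmael" & gender == "M")
print(toy_mynames)
```

Both forms keep the same rows; `subset()` just saves typing the data frame name repeatedly.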

Objective 3. Visualize the growth in unique girl names from 1880-2014

Create a plot of the number of unique baby girl names in each year from 1880 to 2014.

It may be interesting to compare the results with the following image.

# Create a data frame that only contains girl names.

female = babynames[which(babynames$gender=="F"),]

# Generate counts of the number of rows for each year that appears in the data frame.  A fast way to do this is to use the `table` command which does exactly this - it creates counts by year (or whatever other dimension you choose).  

girlnamecount = as.data.frame(table(female$year))

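# Name the columns, then convert year from a factor (which is what `table` produces) back to a number so that `plot` draws a scatter plot rather than a box plot.

names(girlnamecount) = c("year", "namecounts")
girlnamecount$year = as.numeric(as.character(girlnamecount$year))

# Plot the number of unique names by year.

plot(girlnamecount$year, girlnamecount$namecounts, xlab="Year", ylab="Number of unique girl names", main="Growth in number of unique girl names by year")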
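As noted in the introduction, there are several ways to reach the same counts. A possible alternative to `table` is `aggregate()`, which counts the distinct names within each year and keeps year as a number. The data frame `toy_female` and its rows are made up for this sketch.

```r
# Made-up rows for illustration only.
toy_female <- data.frame(
  name = c("Ann", "Bea", "Ann", "Bea", "Cat"),
  year = c(2001, 2001, 2002, 2002, 2002),
  stringsAsFactors = FALSE
)

# For each year, count how many distinct names appear.
toy_counts <- aggregate(name ~ year, data = toy_female,
                        FUN = function(x) length(unique(x)))
names(toy_counts) <- c("year", "namecounts")
print(toy_counts)
```

Because each name-gender combination appears once per year in the lab files, counting rows and counting distinct names should give the same answer there.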

Objective 4. Write a function that returns the top ten names for a given year and gender

Write a function toptennames that takes two arguments, year and gender, and returns the top ten unique names for that gender in that year. This function should:

After running your function, test it by calling:

print(toptennames(1880, "F"))

print(toptennames(2014, "F"))

# Create a function that takes two arguments, the year and gender.  The function should use the values passed to it to create a data frame _foo_ that only contains rows of that gender and year.  Order the resulting data frame by count, and then return only the top 10 rows.

toptennames <- function(year, gender) {
  foo = babynames[which(babynames$gender==gender & babynames$year==year),]
  top_names = foo[order(foo$count, decreasing=TRUE),]
  return(top_names[(1:10),])
}

# Remember to run the function in your R-script before calling it, or it will not be recognized in the R environment. 

# Test your function by calling it to make sure it does what you think it should do!  The function should return a data frame of 10 names, so it can be printed directly.  You could also have assigned it to a new R object (e.g. `a = toptennames("2012", "F")`) and then called `print(a)`.

print(toptennames("2012", "F"))
##              name gender count year
## 1263582    Sophia      F 22267 2012
## 1263583      Emma      F 20902 2012
## 1263584  Isabella      F 19058 2012
## 1263585    Olivia      F 17277 2012
## 1263586       Ava      F 15512 2012
## 1263587     Emily      F 13619 2012
## 1263588   Abigail      F 12662 2012
## 1263589       Mia      F 11998 2012
## 1263590   Madison      F 11374 2012
## 1263591 Elizabeth      F  9674 2012
print(toptennames("2014", "F"))
##              name gender count year
## 1330469      Emma      F 20799 2014
## 1330470    Olivia      F 19674 2014
## 1330471    Sophia      F 18490 2014
## 1330472  Isabella      F 16950 2014
## 1330473       Ava      F 15586 2014
## 1330474       Mia      F 13442 2014
## 1330475     Emily      F 12562 2014
## 1330476   Abigail      F 11985 2014
## 1330477   Madison      F 10247 2014
## 1330478 Charlotte      F 10048 2014
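One edge case to be aware of: indexing with `[(1:10),]` pads the result with NA rows when fewer than ten rows match (for example, when the requested year was never loaded). A small variation using `head()` returns only the rows that exist. The function name `toptennames_safe`, its extra `data` argument, and the toy data are assumptions made for this sketch.

```r
# Made-up rows for illustration only.
toy_babynames <- data.frame(
  name   = c("Ann", "Bea", "Cat", "Bob"),
  gender = c("F", "F", "F", "M"),
  count  = c(20, 35, 10, 15),
  year   = c(2001, 2001, 2001, 2001),
  stringsAsFactors = FALSE
)

# Hypothetical variant: same filtering and ordering, but head() stops at the
# last available row instead of padding with NA.
toptennames_safe <- function(year, gender, data) {
  foo <- data[which(data$gender == gender & data$year == year), ]
  top_names <- foo[order(foo$count, decreasing = TRUE), ]
  head(top_names, 10)
}

print(toptennames_safe(2001, "F", toy_babynames))  # only three girl names exist
print(toptennames_safe(1999, "F", toy_babynames))  # no rows for a missing year
```

With the full lab data every loaded year has far more than ten names per gender, so both versions should agree there.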

Objective 5. Compute the most gender-neutral names

Generate a list of the ten names that, in 2014, were relatively popular for both baby boys and baby girls. To do this, use the following condition: for names given to at least 1000 babies in that year (boys and girls combined), compute the difference between the number of boys and the number of girls with the name. Then, we will call the names with the smallest absolute difference the most “gender neutral” names. This is not the best definition of gender-neutral, but it is a definition and is straightforward to implement.

To complete this objective:

# Create separate data frames that contain male and female names for 2014, merge them by name and year, and drop rows where the combined count of boys and girls is below 1000.

male = babynames[which(babynames$gender=="M" & babynames$year==2014),]
female = babynames[which(babynames$gender=="F"  & babynames$year==2014),]
mf = merge(male, female, by=c("name", "year"), suffixes=c(".m", ".f"))
mf = mf[which((mf$count.m + mf$count.f)>=1000),]

# Take the absolute value of the difference, sort, and show ten with smallest absolute difference.

mf$diff = abs(mf$count.m - mf$count.f)
head(mf[order(mf$diff),], 10)
##         name year gender.m count.m gender.f count.f diff
## 2184  Skyler 2014        M     911        F    1070  159
## 514  Charlie 2014        M    1670        F    1432  238
## 1217 Justice 2014        M     518        F     756  238
## 605   Dakota 2014        M     876        F    1136  260
## 1902 Phoenix 2014        M     901        F     629  272
## 1740   Milan 2014        M     748        F     424  324
## 2258   Tatum 2014        M     462        F     828  366
## 109    Amari 2014        M     970        F     585  385
## 2048    Rory 2014        M     741        F     326  415
## 2095    Sage 2014        M     399        F     834  435
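To see what the merge step does, here is a small made-up example. By default, `merge()` keeps only the names that appear in both data frames, so names given to just one gender drop out before the threshold is applied. The toy names, counts, and the threshold of 100 are illustrative assumptions, not values from the lab data.

```r
# Made-up 2014 rows for illustration only.
toy_male <- data.frame(name = c("Sky", "Lee", "Kai", "Bob"), year = 2014,
                       gender = "M", count = c(60, 30, 80, 500),
                       stringsAsFactors = FALSE)
toy_female <- data.frame(name = c("Sky", "Lee", "Kai", "Ann"), year = 2014,
                         gender = "F", count = c(70, 20, 50, 400),
                         stringsAsFactors = FALSE)

# Bob and Ann appear for one gender only, so the merge drops them.
toy_mf <- merge(toy_male, toy_female, by = c("name", "year"),
                suffixes = c(".m", ".f"))

# Lee has a combined count of 50, below the toy threshold of 100.
toy_mf <- toy_mf[which((toy_mf$count.m + toy_mf$count.f) >= 100), ]

# Smallest absolute difference first.
toy_mf$diff <- abs(toy_mf$count.m - toy_mf$count.f)
toy_result <- toy_mf[order(toy_mf$diff), ]
print(toy_result)
```

Note that this definition favors names near the popularity threshold, since small counts make small differences more likely.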
