For loop explanation

This page explains how the initial for loop reads in all the data from the baby names files. The data are stored in 100+ files with names like "yob1989.txt", "yob1990.txt", "yob1991.txt" and so on. The objective is to read these files ONE BY ONE and combine them into one big data frame that contains the data from all of the files. RStudio needs to know where the files are located on your computer, so the first step is to set your working directory to the folder where the name files are located on YOUR computer. The line below is for my own computer; you will need to change the file path to wherever you downloaded the files.

setwd("~/Dropbox/Teaching/oidd245/labs/baby/data/")
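A quick way to confirm that the working directory is right is to list the files R can see there. The sketch below is only an illustration: it builds a throwaway folder (the name baby_demo and the extra file notes.md are made up) with empty files named in the yob pattern so that it runs anywhere. On your own machine you would skip that part and call list.files() right after setwd().

```r
# Hypothetical demo folder so the example runs without the real data
demo_dir = file.path(tempdir(), "baby_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("yob1970.txt", "yob1971.txt", "notes.md"))))

# Keep only the files whose names look like yobYYYY.txt
found = list.files(demo_dir, pattern = "^yob[0-9]{4}\\.txt$")
print(found)
length(found)
```

If length(found) comes back as 0 on the real folder, the working directory is probably pointing at the wrong place.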

First, to make sure we are clear on what we are doing, let’s think about how we might do this without using a for loop.

without using a for loop

Step 1: read in each file (the contents are comma-separated, so read.csv works even though the file names end in .txt) and put it in a separate data frame.

foo0 = read.csv("yob1970.txt", header=FALSE)
foo1 = read.csv("yob1971.txt", header=FALSE)
foo2 = read.csv("yob1972.txt", header=FALSE)
foo3 = read.csv("yob1973.txt", header=FALSE)
foo4 = read.csv("yob1974.txt", header=FALSE)
foo5 = read.csv("yob1975.txt", header=FALSE)
...
...
foo44 = read.csv("yob2014.txt", header=FALSE)

To show the beginning of one of these:

head(foo0,5)

        V1 V2    V3
1 Jennifer  F 46160
2     Lisa  F 38960
3 Kimberly  F 34142
4 Michelle  F 34053
5      Amy  F 25212

Step 2: The structure of these files is “Name”, “Gender”, “Counts”.

Although the year is in the name of the file, it is not in the data itself. If we put all these files together now, we would lose track of which year each entry is from, so we need to add the year to each data frame before combining them. We can use cbind and overwrite the existing data frame:

foo0 = cbind(foo0, 1970)
foo1 = cbind(foo1, 1971)
...
...
foo44 = cbind(foo44, 2014)

To show the beginning of one of these:

head(foo0,5)

        V1 V2    V3 1970
1 Jennifer  F 46160 1970
2     Lisa  F 38960 1970
3 Kimberly  F 34142 1970
4 Michelle  F 34053 1970
5      Amy  F 25212 1970

We can change the names of the columns in each data frame so it's easier to keep track of things. The data frames also need identical column names before they can be stacked with rbind.

names(foo0) = c("Name", "Gender", "Counts", "Year")
names(foo1) = c("Name", "Gender", "Counts", "Year")
names(foo2) = c("Name", "Gender", "Counts", "Year")
...
...
head(foo0,5)

      Name Gender Counts Year
1 Jennifer      F  46160 1970
2     Lisa      F  38960 1970
3 Kimberly      F  34142 1970
4 Michelle      F  34053 1970
5      Amy      F  25212 1970
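The add-then-rename steps can also be collapsed, because cbind accepts a named argument. Here is a minimal sketch on a hand-typed copy of the first three rows printed above (the variable name toy is just for illustration and is not part of the lab):

```r
# First three rows of the 1970 data as printed above, typed in by hand
toy = data.frame(V1 = c("Jennifer", "Lisa", "Kimberly"),
                 V2 = c("F", "F", "F"),
                 V3 = c(46160, 38960, 34142))

# Naming the argument gives the new column the name "Year" directly;
# the single value 1970 is recycled to fill every row
toy = cbind(toy, Year = 1970)

# Rename the three original columns
names(toy)[1:3] = c("Name", "Gender", "Counts")
print(toy)
```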

Now each of these 45 data frames has the columns "Name", "Gender", "Counts", "Year". Finally, we need to combine these 45 data frames into a single data frame, which we will call babynames, by using rbind. Each call appends a data frame onto whatever is already in babynames, so each call of rbind adds one more year's worth of data to the bottom of the data frame. The number of rows should grow each time we add a new year's worth of data. I have added the nrow commands just to show that babynames is growing after each rbind. They are not necessary.

babynames = NULL
babynames = rbind(babynames, foo0)
nrow(babynames)
babynames = rbind(babynames, foo1)
nrow(babynames)
babynames = rbind(babynames, foo2)
nrow(babynames)
...
...
babynames = rbind(babynames, foo44)


[1] 14777
[1] 30065
[1] 45477

At the end of this process, babynames should have all the combined data from 45 text files.
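To see the stacking behaviour of rbind in isolation, here is a small sketch with two made-up data frames standing in for two years of data. The names a, b and combined, and all of the values, are invented for illustration.

```r
# Two tiny made-up data frames with the same four columns
a = data.frame(Name = c("Aaa", "Bbb"), Gender = c("F", "M"),
               Counts = c(10, 8), Year = 1970)
b = data.frame(Name = c("Ccc", "Ddd", "Eee"), Gender = c("F", "F", "M"),
               Counts = c(7, 6, 5), Year = 1971)

combined = NULL
combined = rbind(combined, a)   # rbind with NULL simply gives back a
nrow(combined)                  # 2 rows so far
combined = rbind(combined, b)   # b's rows are placed underneath a's
nrow(combined)                  # 5 rows now
```

Starting from NULL is what lets the very first rbind call look exactly like all the later ones.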

with a for loop

The approach shown above is repetitive and requires a lot of typing (although it should work just fine)! A for loop can take care of the repetition and make things easier for us. The following loop runs once for each year between 1970 and 2014, so 45 times. On the first pass through, year takes the value 1970; on the second pass it takes the value 1971, and so on. On the last pass it takes the value 2014, and then the loop finishes.

babynames = NULL

for (year in (1970:2014)) {

    # On the first pass through the loop, year takes the value 1970
    # The next line creates a string with a filename like "yob1989.txt" or "yob1990.txt", depending on how far it has looped
    # On the first pass, it creates the string "yob1970.txt"

    filename = paste("yob", toString(year), ".txt", sep='')

    # The variable called filename should now hold "yob1970.txt"
    # The next line reads in the csv file and puts it in a new variable called "foo"
    # The name foo is arbitrary. It could be cat or dog or any other variable name

    foo = read.csv(filename, header=FALSE)

    # Now the dataframe foo has the data from "yob1970.txt" stored in it

    # Add a column for the year for that file so we can keep track of it in the data
    foonewyear = cbind(foo, year)

    # Add it on to the existing dataframe
    babynames = rbind(babynames, foonewyear)  

    # After the first pass, babynames will only have the data for "yob1970.txt"
    # Then it will go back and repeat this loop for "yob1971.txt" and so on until it completes "yob2014.txt" and then it will stop
}
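If it is unclear what the loop variable is doing, it can help to run a stripped-down loop that only builds and prints the filenames, without reading anything. This is a small sketch limited to three years to keep the output short; the vector filenames is an extra added here just to collect the results.

```r
filenames = c()

for (year in 1970:1972) {
    # Same filename construction as in the loop above
    filename = paste("yob", toString(year), ".txt", sep = '')
    print(filename)

    # Keep each filename so we can look at them all afterwards
    filenames = c(filenames, filename)
}

filenames
```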

Combining some of these lines further gets us:

babynames = NULL

for (year in (1970:2014)) {
    foo = read.csv(paste("yob", toString(year), ".txt", sep=''), header=FALSE)
    babynames = rbind(babynames, cbind(foo, year))  
}

names(babynames) = c("name", "gender", "count", "year")
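To try the whole loop end to end without the real files, the following sketch writes two tiny made-up files to a temporary folder and runs the same loop over them. The folder name baby_loop_demo and every name and count in the files are invented for illustration; file.path() is used to build full paths so that no setwd() is needed here.

```r
# Made-up toy data written to a temporary folder (not the real name files)
demo_dir = file.path(tempdir(), "baby_loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Aaa,F,10", "Bbb,M,8"), file.path(demo_dir, "yob1970.txt"))
writeLines(c("Aaa,F,12", "Ccc,M,9", "Ddd,F,7"), file.path(demo_dir, "yob1971.txt"))

babynames = NULL

for (year in 1970:1971) {
    filename = paste("yob", toString(year), ".txt", sep = '')
    foo = read.csv(file.path(demo_dir, filename), header = FALSE)
    babynames = rbind(babynames, cbind(foo, year))
}

names(babynames) = c("name", "gender", "count", "year")

# Sanity checks: number of rows loaded per year, and the range of years
table(babynames$year)
range(babynames$year)
```

On the real data, the same two checks are a quick way to confirm that every year from 1970 to 2014 made it into babynames.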