For loop explanation
This page explains how the initial for loop reads in all the data from
the baby names files. The data are stored in 100+ files with names like
"yob1989.txt", "yob1990.txt", "yob1991.txt" and so on. The objective is
to read these files ONE BY ONE and combine them into one big data frame
that contains the data from all of the files. RStudio needs to know
where the files are located on your computer, so the first step is to
set your working directory to the folder where the name files are
located on YOUR computer. The line below is for my own computer; you
will need to change the file path to wherever you downloaded the files.
setwd("~/Dropbox/Teaching/oidd245/labs/baby/data/")
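Before reading anything, it can help to confirm that R is looking in the right folder. The sketch below is only an illustration: it builds a throwaway folder with a few empty placeholder files (the folder name and the variables demo_dir and found are made up for this example) and lists the files whose names look like "yobYYYY.txt". On your own machine you would instead call list.files() with the same pattern after running setwd() on your real data folder.

```r
# Hypothetical stand-in for the real data folder, so this sketch runs anywhere
demo_dir = file.path(tempdir(), "baby_demo_check")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("yob1970.txt", "yob1971.txt", "notes.md")))

# List only the files named like "yobYYYY.txt"; other files are ignored
found = list.files(demo_dir, pattern = "^yob[0-9]{4}\\.txt$")
print(found)
length(found)
```

If length(found) comes back as 0 when you point this at your own folder, the working directory is probably not the folder that holds the name files.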
First, to make sure we are clear on what we are doing, let’s think
about how we might do this without using a for loop.
without using a for loop
Step 1: read each file into its own data frame. The files end in .txt,
but their contents are comma-separated, so read.csv can read them.
foo0 = read.csv("yob1970.txt", header=FALSE)
foo1 = read.csv("yob1971.txt", header=FALSE)
foo2 = read.csv("yob1972.txt", header=FALSE)
foo3 = read.csv("yob1973.txt", header=FALSE)
foo4 = read.csv("yob1974.txt", header=FALSE)
foo5 = read.csv("yob1975.txt", header=FALSE)
...
...
foo44 = read.csv("yob2014.txt", header=FALSE)
To show the beginning of one of these:
head(foo0,5)
        V1 V2    V3
1 Jennifer  F 46160
2     Lisa  F 38960
3 Kimberly  F 34142
4 Michelle  F 34053
5      Amy  F 25212
Step 2: The structure of these files is “Name”, “Gender”, “Counts”.
Although the year is in the name of the file, it is not in the data
itself. If we put all these files together now, we would lose track of
which year each entry is from, so we need to add the year to each data
frame before combining them. We can use cbind and overwrite the
existing data frame, so …
foo0 = cbind(foo0, 1970)
foo1 = cbind(foo1, 1971)
...
...
foo44 = cbind(foo44, 2014)
To show the beginning of one of these:
head(foo0,5)
        V1 V2    V3 1970
1 Jennifer  F 46160 1970
2     Lisa  F 38960 1970
3 Kimberly  F 34142 1970
4 Michelle  F 34053 1970
5      Amy  F 25212 1970
We can change the names of the columns in each data frame so it's easier to keep track of things.
names(foo0) = c("Name", "Gender", "Counts", "Year")
names(foo1) = c("Name", "Gender", "Counts", "Year")
names(foo2) = c("Name", "Gender", "Counts", "Year")
# ... and the same for the remaining data frames
head(foo0,5)

      Name Gender Counts Year
1 Jennifer      F  46160 1970
2     Lisa      F  38960 1970
3 Kimberly      F  34142 1970
4 Michelle      F  34053 1970
5      Amy      F  25212 1970
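As an aside, cbind also accepts a column name directly, which avoids ending up with a column called 1970 in the first place. A minimal sketch using a tiny stand-in data frame (the variable small is hypothetical; its two rows are copied from the output shown earlier):

```r
# Tiny stand-in for foo0, holding only two rows
small = data.frame(V1 = c("Jennifer", "Lisa"),
                   V2 = c("F", "F"),
                   V3 = c(46160, 38960))

# Naming the argument gives the new column a readable name right away
small = cbind(small, Year = 1970)
names(small)
```

The first three columns would still need renaming with names(), so this only saves a step for the year column.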
Now each of these 45 data frames has the columns "Name", "Gender",
"Counts", "Year". Finally, we need to combine these 45 data frames into
a single data frame, which we will call babynames, by using rbind. This
appends a data frame onto whatever else is already in babynames, so each
call of rbind adds one more year's worth of data to the bottom of the
data frame. The number of rows should grow each time we add a new
year's worth of data. I have added the nrow commands just to show that
babynames is growing after each rbind. They are not necessary.
babynames = NULL
babynames = rbind(babynames, foo0)
nrow(babynames)
babynames = rbind(babynames, foo1)
nrow(babynames)
babynames = rbind(babynames, foo2)
nrow(babynames)
...
...
babynames = rbind(babynames, foo44)

[1] 14777
[1] 30065
[1] 45477
At the end of this process, babynames should have all the combined
data from 45 text files.
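The two binding functions are easy to mix up, so here is a small sketch of the difference using two made-up data frames (a and b are hypothetical and not part of the baby names data). rbind stacks rows, so the row count grows; cbind places data frames side by side, so the column count grows instead.

```r
# Two tiny made-up data frames with the same columns
a = data.frame(Name = c("Amy", "Lisa"), Counts = c(10, 20))
b = data.frame(Name = c("John", "Mark"), Counts = c(30, 40))

stacked = rbind(a, b)   # rows of b go underneath the rows of a
nrow(stacked)           # row count is the sum of both inputs
ncol(stacked)           # column count is unchanged

side = cbind(a, b)      # columns of b go to the right of a
nrow(side)              # row count is unchanged
ncol(side)              # column count is the sum of both inputs

# rbind ignores a NULL argument, which is why babynames = NULL
# works as an empty starting point
grown = rbind(NULL, a)
nrow(grown)
```

Stacking years of data is a row operation, which is why rbind is the function that makes babynames longer with each year.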
with a for loop
The approach shown above is repetitive and requires a lot of typing
(although it should work just fine)! We can use a for loop to take care
of the repetition and make things easier for ourselves. The following
loop runs once for each year between 1970 and 2014, so 45 times. On each
pass the loop variable year takes the next value: in the first pass it
takes the value 1970, in the second it takes the value 1971, and so on.
In the last pass it takes the value 2014, and then the loop finishes.
babynames = NULL
for (year in (1970:2014)) {
  # So first pass through the loop, year takes the value "1970"
  # Then, the next line creates a string with a filename like "yob1989.txt", "yob1990.txt", depending on how far it has looped.
  # So first pass through, it creates the string "yob1970.txt"
  filename = paste("yob", toString(year), ".txt", sep='')

  # The variable called filename now should hold "yob1970.txt"
  # Now, the next line reads in the csv file and puts it in a new variable called "foo"
  # The name foo is random. It could be cat or dog or any other variable name
  foo = read.csv(filename, header=FALSE)

  # Now the dataframe foo has the data from "yob1970.txt" stored in it
  # Add a column for the year for that file so we can keep track of it in the data
  foonewyear = cbind(foo, year)

  # Add it on to the existing dataframe
  babynames = rbind(babynames, foonewyear)

  # After the first pass, babynames will only have the data for "yob1970.txt"
  # Then it will go back and repeat this loop for "yob1971.txt" and so on until it completes "yob2014.txt" and then it will stop
}
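To check just the filename step in isolation, the loop can be run over a short range of years with a print in place of the read. A small sketch (the three-year range is chosen only to keep the output short):

```r
for (year in 1970:1972) {
  # Same filename construction as in the loop above
  filename = paste("yob", toString(year), ".txt", sep = '')
  print(filename)
}

# paste0 is shorthand for paste(..., sep = ''); it also converts
# the numbers to text itself and works on the whole range at once
paste0("yob", 1970:1972, ".txt")
```

Printing the filenames like this is a cheap way to catch a typo in the prefix or extension before R complains that a file cannot be opened.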
Combining some of these lines further gets us:
babynames = NULL
for (year in (1970:2014)) {
  foo = read.csv(paste("yob", toString(year), ".txt", sep=''), header=FALSE)
  babynames = rbind(babynames, cbind(foo, year))
}
names(babynames) = c("name", "gender", "count", "year")
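To try the loop without the full dataset, the sketch below writes two tiny files into a temporary folder and runs the same logic on them. The folder demo_dir, the invented sample rows, and the use of file.path() in place of setwd() are all assumptions made for illustration only; the counts are not real data.

```r
# Hypothetical folder and invented sample rows, just to exercise the loop
demo_dir = file.path(tempdir(), "baby_demo_loop")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Jennifer,F,100", "Michael,M,90"),
           file.path(demo_dir, "yob1970.txt"))
writeLines(c("Jennifer,F,110", "Michael,M,95", "Lisa,F,80"),
           file.path(demo_dir, "yob1971.txt"))

babynames = NULL
for (year in 1970:1971) {
  # Build the full path so the working directory does not matter here
  filename = file.path(demo_dir, paste("yob", toString(year), ".txt", sep = ''))
  foo = read.csv(filename, header = FALSE)
  babynames = rbind(babynames, cbind(foo, year))
}
names(babynames) = c("name", "gender", "count", "year")

nrow(babynames)        # 2 rows from 1970 plus 3 rows from 1971
table(babynames$year)  # number of rows contributed by each year
```

Checking table(babynames$year) on the real data is a quick way to confirm that every year from 1970 to 2014 made it into the combined data frame.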