HW 2: News Classification
OIDD 245 Tambe
You may NOT work with others on this assignment. However, you can ask the TAs or me for help as needed. If using Slack, please message us directly rather than posting to a public channel. It is best not to put this assignment off until very close to the due date. These data sets are large, so processing them can take more time than anticipated.
Deliverables: You should submit your R scripts or your R notebook as well as a PDF document with your findings. The document can be created as a Word document, an HTML document, or slides, but in any of these cases it should be converted to a PDF before submitting. The following notes and instructions are designed to guide you through this process.
Major online media channels increasingly rely on algorithms to retrieve breaking news and to classify it into categories. Google News and Apple News are leading examples of this type of algorithmic news curation. More broadly, automatic topic assignment is increasingly being used on SEC filings, investor reports, online reviews, legal documents, executive interview transcripts, and many other sources of business text.
In this assignment, you are asked to analyze news articles from CNBC, a major American business news channel. In 2006, it relaunched CNBC.com as its main digital news outlet. The news content on the website is edited 24 hours a day during the business week. The site documents almost everything that happens in U.S. financial markets that deserves investor attention. As on most traditional news websites, visitors to CNBC.com navigate pre-defined categories, such as ECONOMY or FINANCE, to find content of interest. Accurate assignment of articles into categories is important. For instance, a reader may skip an article about Tesla or Uber that appears in the AUTO section, even though they would have read it had it appeared in the TECHNOLOGY section.
You have access to a database of 20,000 news articles. The objective is to use LDA topic modeling to discover the underlying thematic structure in the document collection, and then to automatically classify new, incoming news articles (that you will scrape) into an existing category.
Step 1. Fit a topic model to the existing news data archive
- Preprocess the archival data. That is, transform the data in the content column into a corpus, clean the text, and generate a document-term matrix. This step is similar to exercises we have done in class.
- Use the LDA() function in the topicmodels package to train a topic model with between 10 and 15 topics. Fitting the model is computationally intensive, so with 20,000 documents it may take a while. Try beginning with only about 1,000 documents, and then slowly increase the number until the model takes about five minutes to run. Using more documents provides better accuracy, so use as many as your computer can handle in about five minutes, but please try to use no fewer than 1,000. Note that the LDA model will generate different results every time you run it unless you pass it a “seed” as outlined in the following post.
- List 10 words that appear in each of these topics. By eyeballing the words in each topic, provide a name for each topic category. This requires human judgment, and some topics will be much more clear than others. Do your best to assign a sensible category name to each group based on the words assigned to that topic.
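The three bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `articles` data frame stands in for the real archive, the particular cleaning steps and the seed value 245 are arbitrary choices, and `k = 2` is used only because the toy data is tiny (the assignment asks for 10 to 15 topics). The names `dtms` and `ldaOut` are chosen to match the snippet in Step 3.

```r
library(tm)           # corpus handling and document-term matrices
library(topicmodels)  # LDA()

# Toy stand-in for the archive (an assumption for illustration). In the
# assignment this would be the archive data with its text in `content`.
articles <- data.frame(
  content = c(
    "Stocks rallied as investors cheered strong bank earnings.",
    "The central bank kept interest rates steady amid inflation worries.",
    "The automaker unveiled a new electric car and raised sales targets.",
    "Oil prices fell as crude supply climbed and energy demand slowed.",
    "The tech company launched a new phone and cloud software service.",
    "Retail sales rose as shoppers spent more during the holiday season."
  ),
  stringsAsFactors = FALSE
)

# Start with a random subset (about 1,000 documents on the real archive).
set.seed(245)
n_use    <- min(1000, nrow(articles))
articles <- articles[sample(nrow(articles), n_use), , drop = FALSE]

# Build a corpus and clean the text.
corp <- VCorpus(VectorSource(articles$content))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# Document-term matrix. Documents left empty after cleaning are dropped,
# because LDA() needs at least one term in every row.
dtm  <- DocumentTermMatrix(corp)
dtms <- dtm[rowSums(as.matrix(dtm)) > 0, ]

# Fit the topic model. Fixing the seed makes repeated runs reproducible.
ldaOut <- LDA(dtms, k = 2, control = list(seed = 245))

# Top words per topic, one column per topic (use 10 in the assignment).
top_words <- terms(ldaOut, 5)
print(top_words)
```

Wrapping the `LDA()` call in `system.time()` is one way to check how long a given number of documents takes before increasing the sample size.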
Step 2. Retrieve new articles
Scrape the text of each of the articles that appear on the first CNBC News page (i.e., about 20 articles). UPDATE: To shorten this assignment, rather than having you scrape the article headlines and URLs from the front page, we have simply provided a list of recent article URLs you can work from at this link. Download this file and read its contents into your R session.
Collect and clean the text from one of these news articles. For one of the URLs in the list, visit the page in a browser and gather the text on that page. Unlike our earlier scraping exercises, this one requires some data cleaning, as outlined in the following steps.
- The page text is not one continuous block, so it will have to be read in as several blocks (i.e., the html_text command will produce a character vector with 4 or 5 different elements, each being a separate block of text from the article). You can use the paste function with the “collapse” option to compress this into a single text block per article.
- There is whitespace, and there are characters, that you may not want in your text. You might explore the use of functions like gsub to clean the text and remove extra spaces and carriage returns. Clean the text to eliminate extra whitespace and character strings such as “\t”.
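As a minimal sketch of the two cleaning steps above, using only base R: the `blocks` vector below is made-up text standing in for what html_text might return for one article, and the choice of which characters to strip is an assumption that should be adjusted to what the real pages contain.

```r
# Hypothetical html_text() output: one article split across several blocks.
blocks <- c("\n\t  Markets rose on Monday. ",
            "Investors\tcheered   strong earnings.\n",
            "  Analysts expect more gains.\r\n")

# Collapse the blocks into a single string for the article.
article_text <- paste(blocks, collapse = " ")

# Replace tabs, carriage returns and newlines with spaces, squeeze runs of
# spaces into one, and trim both ends.
article_text <- gsub("[\t\r\n]", " ", article_text)
article_text <- gsub(" +", " ", article_text)
article_text <- trimws(article_text)

print(article_text)
```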
Once this is working for a single news article, create a for loop that puts the above pieces together. It should automatically scrape and clean the text for each of the 20 URLs you collected. The end result should be a vector or list with 20 elements, where each element is the cleaned text from one of these articles.
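One possible shape for such a loop is sketched below. Everything named here is an assumption for illustration: the helper names (`fetch_blocks`, `clean_blocks`, `scrape_all`), the use of the rvest package, and especially the `"p"` CSS selector, which is a guess that should be checked against the actual page structure. The fetching step is passed in as an argument so the loop can be tried offline with a stand-in function first.

```r
# Fetch one page and return its text blocks. Assumes rvest is installed;
# the "p" selector is a guess at where the article text lives.
fetch_blocks <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_text(rvest::html_elements(page, "p"))
}

# Collapse the blocks into one string and strip stray whitespace.
clean_blocks <- function(blocks) {
  txt <- paste(blocks, collapse = " ")
  txt <- gsub("[\t\r\n]", " ", txt)
  trimws(gsub(" +", " ", txt))
}

# Loop over the URLs, returning one cleaned string per article.
scrape_all <- function(urls, fetch = fetch_blocks, pause = 1) {
  texts <- character(length(urls))
  for (i in seq_along(urls)) {
    texts[i] <- clean_blocks(fetch(urls[i]))
    Sys.sleep(pause)  # brief pause between requests
  }
  texts
}

# Offline trial with a stand-in fetcher and placeholder URLs.
fake_fetch <- function(url) c(paste0("\tText from ", url), " more text\n")
demo_texts <- scrape_all(c("url-1", "url-2"), fetch = fake_fetch, pause = 0)
print(demo_texts)
```

With the real list of URLs, `scrape_all(urls)` would use the rvest-based fetcher by default.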
Step 3. Classify news articles using your topic model
- Using the topic model you built in Step 1, assign a topic to each news article that you scraped. This computes the probability that a document belongs to a particular topic, based on the words in the topic and the words in the document. To do this, you will need to create a document-term matrix for the new documents and then apply the topic model that you developed above to the new document-term matrix. (This is similar to earlier exercises where we built a logistic regression model using training data, and applied it to test data.)
One “catch” is that the new document should not contain any words that were not contained in the original archival data. The code snippet below illustrates how to create the new Document-Term Matrix, handle this issue, and apply the topic model that you created earlier.
    dic = Terms(dtms)

    # Specify this dictionary when creating the dtm for the new articles, which
    # will limit the dtm it creates to only the words that also appeared in the
    # archive. In the example below, 'ldaOut' would be the name assigned to the
    # topic model you created in Step 1.
    new_dtm = DocumentTermMatrix(corp.foo, control = list(dictionary = dic))

    # Drop any new article that has no words in common with the archive.
    new_dtm = new_dtm[rowSums(as.matrix(new_dtm)) != 0, ]

    topic_probabilities = posterior(ldaOut, new_dtm)
The probability of a document appearing in each topic will be located in topic_probabilities$topics. Using these data, generate a vector that assigns to each document the topic for which it has the highest probability of appearing.
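A minimal base R sketch of that last step follows. The `topic_probs` matrix is a made-up stand-in for topic_probabilities$topics (one row per document, one column per topic), and the topic names are hypothetical placeholders for the names you chose in Step 1.

```r
# Made-up stand-in for topic_probabilities$topics.
topic_probs <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.15, 0.80,
    0.30, 0.45, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2", "3"))
)

# For each row (document), find the column (topic) with the largest probability.
best_topic <- apply(topic_probs, 1, which.max)

# Map topic numbers to the names chosen in Step 1 (hypothetical names here).
topic_names <- c("MARKETS", "ECONOMY", "TECHNOLOGY")
assigned    <- topic_names[best_topic]
names(assigned) <- rownames(topic_probs)

print(assigned)
```

Keeping the row names attached is useful because the snippet in Step 3 drops any article that shares no words with the archive, so the result may have fewer rows than the number of articles scraped.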
Finally, in a table, print the contents of any ten news articles (or just the first few words of each of these articles) and the categories you generated for them, where you should identify these categories by the names assigned to them at the end of Step 1. You can assemble this table by “hand”. You do not need to write any additional code to generate it.