HW 2: News Classification

OIDD 245 Tambe

Due Dates

You may NOT work with others on this assignment. However, you can ask the TAs or me for help as needed. If using Slack, please directly message us rather than posting to a public channel. It is best not to put this assignment off until very close to the due date. These data sets are large, so processing them can take more time than anticipated.

Deliverables: You should submit your R scripts or your R notebook, as well as a .pdf document with your findings. The document can be created as a Word document, an HTML document, or slides, but in any of these cases it should be converted to a PDF before submitting. The following notes and instructions are designed to guide you through this process.

Business Context:

Major online media channels increasingly rely on algorithms to retrieve breaking news and to classify it into categories. Google News and Apple News are leading examples of this type of algorithmic news curation. More broadly, automatic topic assignment is increasingly being used on SEC filings, investor reports, online reviews, legal documents, executive interview transcripts, and many other sources of business text.

In this assignment, you are asked to analyze news articles from CNBC, a major American business news channel. In 2006, CNBC relaunched CNBC.com as its main digital news outlet. The news content on the website is edited 24 hours a day during the business week, and the site documents almost everything that happens in U.S. financial markets that deserves investor attention. As on most traditional news websites, CNBC visitors navigate pre-defined categories to find content of interest, such as ECONOMY or FINANCE. Accurate assignment of articles into categories is therefore important. For instance, a reader may skip an article about Tesla or Uber that appears in the AUTO section that they would have otherwise read had it been in the TECHNOLOGY section.

You have access to a database of 20,000 news articles. The objective is to use LDA topic modeling to discover the underlying thematic structure in the document collection, and then automatically classify new, incoming news articles (that you will scrape) into an existing category.

Step 1. Fit a topic model to the existing news data archive
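A minimal sketch of this step, using the tm and topicmodels packages, is below. The object name `articles` is a placeholder for however you load the archive text, and the cleaning steps, number of topics `k = 10`, and seed are illustrative choices, not requirements.

```r
library(tm)
library(topicmodels)

# 'articles' is a placeholder: a character vector with one element per
# archived article. Substitute the object you built when loading the data.
corp = VCorpus(VectorSource(articles))

# Standard text cleaning before building the Document-Term Matrix
corp = tm_map(corp, content_transformer(tolower))
corp = tm_map(corp, removePunctuation)
corp = tm_map(corp, removeNumbers)
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, stripWhitespace)

dtms = DocumentTermMatrix(corp)

# Drop any documents left empty after cleaning (LDA cannot handle them)
dtms = dtms[rowSums(as.matrix(dtms)) != 0, ]

# Fit an LDA topic model; the number of topics k is a modeling choice
ldaOut = LDA(dtms, k = 10, control = list(seed = 123))

# Inspect the top 10 terms per topic to interpret and label the topics
terms(ldaOut, 10)
```

Try a few values of `k` and inspect the top terms until the topics look coherent and interpretable as news categories.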

Step 2. Retrieve new articles

Scrape the text of each of the articles that appear on the first CNBC News page (i.e., about 20 articles). UPDATE: To shorten this assignment, rather than having you scrape the article headlines and URLs from the front page, we have simply provided a list of recent article URLs you can work from at this link. Download this file and read its contents into your session using the read_lines function.
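One way to sketch this step, using readr to read the URL list and rvest to fetch each article. The file name `article_urls.txt` is a placeholder for the file you downloaded, and the CSS selector `"p"` is an assumption; inspect the CNBC page source and adjust the selector if it pulls in unwanted text.

```r
library(readr)
library(rvest)

# Placeholder file name: use the URL list file you downloaded
urls = read_lines("article_urls.txt")

# Download one article and collapse its paragraph text into a single string.
# The "p" selector is a generic assumption about the page structure.
get_article_text = function(u) {
  page = read_html(u)
  paste(html_text(html_elements(page, "p")), collapse = " ")
}

# Apply the helper to every URL to get one text string per article
article_texts = sapply(urls, get_article_text)
```

After scraping, clean these texts with the same steps you used on the archive so the new documents are comparable.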

Step 3. Classify news articles using your topic model

One “catch” is that the new documents must not contain any words that did not appear in the original archival data, because the topic model can only score terms it was trained on. The code snippet below illustrates how to create the new Document-Term Matrix, handle this issue, and apply the topic model that you created earlier.

# dtms in the line below is the Document Term Matrix from Step 1
dic = Terms(dtms)

# Specify this dictionary when creating the dtm for the new articles, 
# which will limit the dtm it creates to only the words that also appeared in
# the archive. In the example below, 'ldaOut' would be the name assigned to 
# the topic model you created in Step 1 and 'corp.foo' is the cleaned corpus 
# of news articles.

new_dtm = DocumentTermMatrix(corp.foo, control = list(dictionary = dic))

# Drop any new articles that contain no terms after applying the dictionary
new_dtm = new_dtm[rowSums(as.matrix(new_dtm)) != 0, ]

# Estimate the topic probabilities of the new documents under the fitted model
topic_probabilities = posterior(ldaOut, new_dtm)
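To turn these probabilities into a classification, you can assign each new article to its most probable topic. A short sketch, assuming `topic_probabilities` comes from the `posterior()` call above:

```r
# posterior() returns a list; $topics is a matrix with one row per new
# document and one column per topic, holding topic probabilities.
# which.max picks the most probable topic for each article.
predicted_topic = apply(topic_probabilities$topics, 1, which.max)

# Inspect the first few predicted topic numbers
head(predicted_topic)
```

You can then map each topic number back to the label you gave it when inspecting the top terms in Step 1.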