Text mining the New York Times

Posted by Marshall on July 27, 2006

Roland Piquepaille writes for Smartmobs.com, as do I. Roland reported on ZDNet yesterday that researchers at the University of California, Irvine (UCI) have found a new way to mine large batches of text, in this case a couple of years of New York Times articles, and to pull more intelligence out of them than was possible in the past. This is good information to be aware of.

Here is a quote from one of the researchers.

"We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

Now, let's look at a real example and see how the team discovered links between topics and people. Below is a graph showing "topic-model-based relationships between entities and topics. A link is present when the likelihood of an entity in a particular topic is above a threshold." (Credit: UCI)

Discovering topics in the NYT archives
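To make the link rule quoted above concrete, here is a minimal sketch in Python. The probabilities, the "Baseball" topic, the "Yankees" entity and the 0.04 threshold are all invented for illustration; none of them come from the UCI paper. The idea is simply that an entity gets linked to a topic when its likelihood in that topic is above the cutoff.

```python
# Hypothetical P(entity | topic) values, invented for illustration only.
entity_given_topic = {
    "Tour de France": {"Lance Armstrong": 0.35, "Jan Ullrich": 0.05},
    "Baseball": {"Yankees": 0.30, "Lance Armstrong": 0.001},
}


def build_links(probabilities, threshold):
    """Return (topic, entity) pairs whose likelihood is above the threshold."""
    return [
        (topic, entity)
        for topic, entities in probabilities.items()
        for entity, p in entities.items()
        if p > threshold
    ]


# The threshold is an assumed value; the article does not say which one was used.
links = build_links(entity_given_topic, threshold=0.04)
for topic, entity in links:
    print(f"{topic} <-> {entity}")
```

Raising the threshold thins out the graph and lowering it adds weaker links, so the cutoff is a readability choice as much as a statistical one.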

Here is another example picked from the UCI news release.

For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich." From this, researchers were easily able to identify that topic as the Tour de France. By examining the probability of words appearing in stories about the Tour de France, researchers learned that Armstrong was written about seven times as much as Ullrich.
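The "seven times as much" comparison boils down to a ratio of two word probabilities within one topic. Here is a small sketch of that arithmetic. The counts are made up so that the ratio comes out to seven; they are not the researchers' actual numbers.

```python
# Hypothetical word counts assigned to a "Tour de France" topic.
# These numbers are invented so the ratio matches the article's "seven times".
topic_word_counts = {
    "rider": 120,
    "bike": 100,
    "race": 90,
    "Lance Armstrong": 70,
    "Jan Ullrich": 10,
}

total = sum(topic_word_counts.values())

# Estimate P(word | topic) as the word's share of all words in the topic.
word_given_topic = {word: count / total for word, count in topic_word_counts.items()}

ratio = word_given_topic["Lance Armstrong"] / word_given_topic["Jan Ullrich"]
print(f"Armstrong is mentioned about {ratio:.1f} times as often as Ullrich")
```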

But what exactly is ‘topic modeling’?

Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning. In contrast, topic modeling, a type of unsupervised learning, doesn’t need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.
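For readers who want to see what "unsupervised" means in practice, here is a minimal sketch of one common topic model, latent Dirichlet allocation fitted with collapsed Gibbs sampling. That this matches the UCI team's exact method is an assumption on my part, and the toy documents, function name and parameter values below are all made up for illustration. Notice that nobody tells the program what the topics are; it only sees which words appear together.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1

                # Resample: favor topics common in this document
                # and topics that already use this word a lot.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Made-up miniature "archive": two cycling stories and two politics stories.
docs = [
    "rider bike race rider stage".split(),
    "bike race stage rider bike".split(),
    "vote senate election vote bill".split(),
    "election bill senate vote election".split(),
]

doc_topic, topic_word = fit_topics(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = [word for word, n in counts.most_common(3) if n > 0]
    print(f"Topic {k}: {top}")
```

On a corpus this small the result depends on the random seed, so treat the printed topics as a demonstration of the mechanics rather than a finding.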

Newman and his colleagues presented this research at the IEEE Intelligence and Security Informatics Conference (ISI 2006), which was held in May in San Diego. Here is a link to their technical paper, "Analyzing Entities and Topics in News Articles Using Statistical Topic Models" (PDF format, 12 pages, 248 KB). The above graph has been extracted from this paper.

For more information about the topic modeling technique used by these scientists, you can look at the work done by Mark Steyvers and his Memory and Decisions Laboratory (MADLAB).

In particular, you can try the software available from this Topic Modeling Toolbox. Since you probably don't have the archives of the New York Times at your disposal for experiments, start with something smaller and see what kind of topics you discover, using the contents of this blog for example.

Sources: University of California - Irvine, July 26, 2006; and various web sites



