An experiment in writing in public, one page at a time, by S. Graham, I. Milligan, & S. Weingart

Topic Modeling: A Hands-On Adventure in Big Data

In this chapter, we discuss in some depth different ways of building topic models of the patterns of discourse in your source documents. We explore what ‘topic models’ are, how topic modeling works, and why some tools might be better than others in certain circumstances.

Keywords have their limitations, as they require that we know what to search for. Topic modeling, on the other hand, allows us to come in with an open mind: in this approach, the documents ‘tell’ us what topics they contain. The ‘model’ in a topic model is an idea of how texts get written: authors compose texts by selecting words from distributions of words (or ‘bags of words’ or ‘buckets of words’) that describe various thematic topics. Got that? In the beginning there was the topic. The entire universe of writing is one giant warehouse whose aisles are lined with bins of words: here are the bins of Canadian History, there are the bins for major league sports (a very small aisle indeed). All documents (your essay, my dissertation, this book) are composed of words plucked from the various topic bins and combined. If that describes how the author actually writes, then the process is reversible: it is possible to decompose the entire collection of words into the original distributions held in those bags and buckets.
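The warehouse story can be made concrete with a small sketch. What follows is only an illustration of the assumed writing process, not a topic modeling tool: the two topic bins, their words, and their probabilities are invented for the example, and the function name `generate_document` is hypothetical.

```python
import random

# Hypothetical topic "bins": each topic is a probability distribution over words.
# The words and numbers are made up purely for illustration.
TOPICS = {
    "canadian_history": {"confederation": 0.4, "railway": 0.3, "fur": 0.2, "trade": 0.1},
    "sports": {"hockey": 0.5, "league": 0.3, "playoff": 0.2},
}


def generate_document(topic_mix, topics, n_words, rng):
    """Write a 'document' by repeatedly picking a topic bin, then a word from it."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: walk to an aisle (choose a topic according to the document's mix).
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: pluck a word from that topic's bin (choose a word by its probability).
        vocab = list(topics[topic])
        word_weights = [topics[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# An essay that is mostly about Canadian history, with a little sport mixed in.
rng = random.Random(42)
essay = generate_document({"canadian_history": 0.8, "sports": 0.2}, TOPICS, 20, rng)
print(" ".join(essay))
```

Note that the output is a jumble with no grammar or word order: this is the ‘bag of words’ view of writing. Topic modeling tries to run this process backwards, starting from many such jumbles and recovering plausible bins.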

In this section, we explore various ways of creating topic models, what they might mean, and how they might be visualized. We work through a number of examples, so that the reader might find a model to adapt to his or her own work. The essence of a topic model is in its input and its output: a corpus (a collection of texts) goes in, and a list of the topics that comprise those texts comes out the other side. Its mechanism rests on a deeply flawed assumption about how writing works, and yet the results of this mechanism are often surprisingly cogent and useful.

What is a topic, anyway? If you are a literary scholar, you will perhaps understand a ‘topic’ rather differently than a librarian might: as discourses rather than as subject headings. Then there is the question of how mathematicians and computer scientists understand what a ‘topic’ might be. To answer that question, we first have to ask what a ‘document’ is. To the developers of these algorithms, a ‘document’ is simply a collection of words that are found in differing proportions (thus it could be, in the real world, a blog post, a paragraph, a chapter, a ledger entry, or an entire book). To decompose a document into its constituent ‘topics’, we have to imagine a world in which every conceivable topic of discussion exists and is well defined, and each topic is perfectly represented as a distribution of particular words. Each distribution of words is unique, and thus you can infer a document’s topicality by carefully comparing its distribution of words to the set of ideal topics we already know exist.
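The idealized comparison described above can be sketched in a few lines. This is a toy under stated assumptions, not how topic modeling software actually works: it pretends the ‘ideal’ topics are already known (the two distributions below are invented), and it uses cosine similarity as one convenient way of comparing word proportions. The names `word_proportions`, `cosine_similarity`, and `score_topics` are hypothetical.

```python
import math
from collections import Counter

# Invented 'ideal' topics, each a distribution of words. In real topic modeling
# these are not known in advance; they are what the algorithm has to infer.
IDEAL_TOPICS = {
    "canadian_history": {"confederation": 0.4, "railway": 0.3, "fur": 0.2, "trade": 0.1},
    "sports": {"hockey": 0.5, "league": 0.3, "playoff": 0.2},
}


def word_proportions(words):
    """Turn a document (a list of words) into proportions of each word."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Compare two word distributions: 0 means no overlap, 1 means identical shape."""
    shared = set(p) & set(q)
    dot = sum(p[word] * q[word] for word in shared)
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def score_topics(words, topics):
    """Score a document against every known topic distribution."""
    doc = word_proportions(words)
    return {name: cosine_similarity(doc, dist) for name, dist in topics.items()}


document = "railway confederation railway trade hockey fur confederation".split()
scores = score_topics(document, IDEAL_TOPICS)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

The document scores far higher against the Canadian history bin than the sports bin, even though ‘hockey’ appears once. The hard part, which this sketch skips entirely, is that in practice nobody hands us the ideal topics.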


Source: http://www.themacroscope.org/?page_id=788