Topic modeling is one antidote to the overwhelming volume of content created every day that marketers must understand. In this series, we’ll explore what topic modeling is, why it’s important, how it works, and some practical applications for marketing.
Part 3: The Topic Modeling Process
Now that we have a base understanding of what a topic model is, let’s look at the process of building one. Building a topic model resembles many other statistical and data engineering processes, with four discrete stages: extract, transform, load, and process.
Let’s dig into each stage.
As with any form of data analysis, we first need to extract our data. Data for topic modeling can come from any body, or corpus, of text. Topic models are limited only by our access to computational power; in principle, we could build a model spanning the entirety of written human language. For practical purposes, though, chances are we’ll be working with a gigabyte of text or so at most.
How much text is a gigabyte? It’s about a million social media posts, about a half million news articles, or about ten thousand books.
Topic modeling also requires text at volume. A handful of social media posts will generate a poor-quality model. My general rule of thumb is that if it takes you less than an hour to read the entirety of what you’re trying to analyze, it’s not worth building a topic model. For example, fifty tweets aren’t enough to justify building a model. Neither are five articles.
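To make the one-hour rule of thumb concrete, here’s a minimal sketch that estimates how long a corpus would take to read. The reading speed of 250 words per minute and the function names are my own assumptions for illustration; adjust them to taste.

```python
def reading_time_minutes(texts, words_per_minute=250):
    """Estimate the minutes needed to read every document in `texts`.

    Assumption: 250 words per minute is a rough adult reading speed.
    """
    total_words = sum(len(text.split()) for text in texts)
    return total_words / words_per_minute


def worth_modeling(texts, threshold_minutes=60, words_per_minute=250):
    """Apply the one-hour rule of thumb to a corpus."""
    return reading_time_minutes(texts, words_per_minute) >= threshold_minutes


# Fifty short posts of 30 words each: only a few minutes of reading.
small_corpus = ["word " * 30] * 50
print(reading_time_minutes(small_corpus))  # 6.0
print(worth_modeling(small_corpus))        # False
```

Anything that comes back False is probably faster to read by hand than to model.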
When extracting text, we need to store it in a machine-friendly format. For the English language, ASCII text is the preferred format. For non-English languages, UTF-8 is the preferred format.
Once extracted, text data is typically stored either in a flat file or in a text-friendly database.
As an example file, let’s use a corpus of tweets mentioning me.
The transformation of text is one of the most important parts of topic modeling. Text must be well-formed and free of junk that will cause software to reject it. If our text didn’t come out clean in extraction, we must convert it to either ASCII or UTF-8. Popular text editors such as BBEdit for the Mac, UltraEdit for Windows, or Atom for cross-platform use can help reformat text correctly.
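If you’d rather script the conversion than do it in a text editor, here’s a rough sketch using only Python’s standard library. The fallback encoding (cp1252, common in files exported from Windows software) is a guess on my part, and the function names are made up for this example.

```python
import unicodedata


def to_utf8_text(raw_bytes, fallback_encoding="cp1252"):
    """Decode bytes as UTF-8, falling back to a guessed legacy encoding."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the fallback encoding is only a guess, so bytes it
        # cannot decode are replaced rather than raising an error.
        return raw_bytes.decode(fallback_encoding, errors="replace")


def to_ascii_text(text):
    """Strip accents, then drop any character with no ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_utf8_text("café".encode("cp1252")))  # café
print(to_ascii_text("café"))                  # cafe
```

Be aware that forcing text to ASCII is lossy: characters such as curly quotes and emoji are simply dropped.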
One type of junk that’s prevalent in digital text is HTML-encoded entities. These are special codes used in HTML (the language of web pages) that show up in topic models as weird-looking terms like emsp and nbsp. I maintain a running list of these entities and use find/replace scripts to clean them up.
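As an alternative to a hand-maintained find/replace list, here’s a small sketch that leans on Python’s built-in html module to decode entities. This isn’t the script I use; the function name is invented for the example.

```python
import html
import re


def clean_html_entities(text):
    """Decode HTML entities and collapse the whitespace they leave behind."""
    # html.unescape turns "&amp;" into "&" and "&nbsp;" / "&emsp;" into
    # their Unicode space characters.
    decoded = html.unescape(text)
    # \s matches Unicode whitespace, including non-breaking and em spaces.
    return re.sub(r"\s+", " ", decoded).strip()


print(clean_html_entities("Great&nbsp;post&emsp;&amp; thanks"))
# Great post & thanks
```

Decoding first and normalizing whitespace second keeps stray tokens like nbsp out of the model.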
In our example dataset, here’s what it looks like once I clean up a lot of the obvious junk:
Once our corpus is clean of the most obvious garbage, we must undertake the next stage of transformation using our software.
A Word on Topic Modeling Software
For topic modeling, we have a number of software choices. Some software, such as IBM Watson Natural Language Understanding or Google Cloud NL, does much of the transformation, loading, and processing automatically, in exchange for a per-use fee. For beginners dealing with relatively small quantities of text, I recommend starting here.
For more technical users with quantities of text large enough to be either cost-prohibitive or technically challenging for on-demand services, the software of choice is a topic modeling library in R or Python. I’m fond of packages such as topicmodels and quanteda in R.
For users with very high volume or specialized customization, dedicated academic software packages such as MALLET from the University of Massachusetts are the way to go.
For the remainder of this post, we’ll look at a topic model built using R.
After we’ve obtained a clean corpus, we need to load it for further processing. Our software conducts a process known as tokenization, where we break up our text into individual words and assign them numerical values, as we explored in the previous post.
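Here’s a bare-bones sketch of what tokenization might look like under the hood. Real packages do far more; the regular expression and the idea of numbering words in order of first appearance are simplifying assumptions of mine, not how any particular library works.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(tokens):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


tokens = tokenize("Topic models love text. Text at volume!")
vocab = build_vocabulary(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['topic', 'models', 'love', 'text', 'text', 'at', 'volume']
print(ids)     # [0, 1, 2, 3, 3, 4, 5]
```

Notice that both occurrences of text receive the same ID, which is what lets the software count and compare words numerically.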
Once our document has been digested into tokens, we need to normalize the words using a process called stemming. Stemming reduces words down to their root forms, for more straightforward understanding. For example, read the following paragraph:
The taste of a red wine shines best in tastings paired with aged hard cheeses. A Merlot’s tastiness comes across boldly in a pairing with a smoked gouda.
Stemming would reduce taste, tastings, and tastiness to the root word taste. Paired and pairing would be reduced to pair. By reducing these words to their root forms, we obtain a clearer understanding of the paragraph at a summary level and elevate the prominence of these key words.
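To show the mechanics, here’s a deliberately crude suffix-stripping stemmer. It’s a toy, not the Porter or Snowball algorithms that real packages use, and the suffix list is my own invention. Like many real stemmers, it produces truncated stems (tast rather than taste); what matters is that related words collapse to the same token.

```python
# Toy suffix list, checked longest first. Hypothetical, for illustration only.
SUFFIXES = ("iness", "ings", "ing", "ed", "es", "s", "e")


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["taste", "tastings", "tastiness", "paired", "pairing"]:
    print(word, "->", toy_stem(word))
# taste -> tast
# tastings -> tast
# tastiness -> tast
# paired -> pair
# pairing -> pair
```

The minimum stem length keeps short words such as red and the from being mangled.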
Once we’ve done tokenization and stemming, we can visualize our corpus of tokens using simple tools like word clouds as a quick quality check:
The word cloud above shows what the tweets about me look like.
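A word cloud sizes each word by how often it appears, so a plain frequency count gives the same quality check in text form. Here’s a minimal sketch; the function name, the tokens, and the stopword list are made up for the example.

```python
from collections import Counter


def top_terms(tokens, n=10, stopwords=frozenset()):
    """Return the n most frequent tokens, ignoring any stopwords."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


sample_tokens = ["pair", "wine", "tast", "pair", "tast", "tast", "the"]
print(top_terms(sample_tokens, n=2, stopwords={"the"}))
# [('tast', 3), ('pair', 2)]
```

If junk such as nbsp or a stray URL fragment tops the list, it’s time to go back and clean the corpus again.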
Loading and Processing Text
Once we’ve completed tokenization and stemming, we’re ready to generate the actual model. The gold standard algorithm used in topic modeling today is called Latent Dirichlet Allocation, or LDA. Virtually every major topic modeling software package uses LDA to build its models.
LDA looks at the mathematical relationships described in the previous post (which words frequently appear near other words, how words interact, and so on) to build our actual topic models.
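For the curious, here’s a stripped-down sketch of one common way to fit LDA, called collapsed Gibbs sampling. It is not how I built the model in this post (that was done in R), and production packages are far more efficient. The function name, the settings (alpha, beta, and the number of iterations), and the tiny corpus are all assumptions for illustration. Documents are lists of integer word IDs, like the ones tokenization produces.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model with collapsed Gibbs sampling.

    docs: list of documents, each a list of integer word IDs.
    Returns (doc_topic, topic_word) as lists of probability rows.
    """
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][w] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1

                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][w] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    # Convert smoothed counts into probabilities.
    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * num_topics) for c in row]
        for row, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[t] + beta * vocab_size) for c in row]
        for t, row in enumerate(topic_word_counts)
    ]
    return doc_topic, topic_word


# Four tiny documents: two use word IDs 0-2, two use word IDs 3-5.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
```

Each row of doc_topic says how much of a document belongs to each topic, and each row of topic_word says how strongly each word is tied to a topic. Note that this sketch only ever sees which words share a document, not the order they appear in.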
The output of LDA processing varies based on what we ask of our software. Sometimes we’ll want a visualization of the different topics, such as this chart of graphs:
Other times, we’ll want a straight-up spreadsheet:
Still other times, we might rely on advanced visualizations or interactive displays.
The important takeaway is that we must be able to interpret and glean insight from our LDA outputs. Does looking at the above outputs confuse, rather than clarify? It should – for now. In the next post in this series, we’ll explore how to interpret topic models and transform these raw outputs into important insights we can act on. Stay tuned!
Want to read more like this from Christopher Penn? Get updates here: