Using topic modelling to find meaning in text-based data

KAE presents its proprietary text analytics capabilities and explains how the text-based data available online can be used to guide business decisions

The challenge

In today’s information-saturated world, there is an abundance of data available that has the potential to offer incredible insight. We’re all good with numbers by now, but people write a lot down too. Reading it all is impossible, yet there is value in that text. This poses a challenge for brands: how can we best collect unstructured text-based data and extract insight from it?

Brands actively collect written feedback via customer feedback surveys, customer service channels, and transaction-based experience trackers. But there is also a wealth of unsolicited feedback that is largely untapped. Over the last decade, brands have used social media to collect organic conversations about their brands, with variable results, but this range of collected data has widened still further to include sources such as blogs, discussion forums, review sites and even threads of comments on news articles. There is more and more information available for brands to use to guide business decisions – if they know how to extract the insight.

With such large volumes of text, sorting through the data is a repetitive, time-consuming and expensive process if done manually. But text analysis is not an easy job for a machine either and multiple challenges remain:

  • understanding what is relevant and what is not
  • understanding the semantic meaning of text rather than just the words used
  • picking up on subtle differences in language and nuances like cultural capital

As a result, text analysis has evolved from basic techniques, such as word frequency, collocation and concordance, to more advanced techniques, including sentiment analysis, topic modelling and clustering. As with most forms of analysis though, there is no one size that fits all – for example generic sentiment algorithms still struggle for accuracy – and the most valuable exercises are built to answer specific business problems.

KAE has developed its proprietary text analytics capabilities using a blend of these techniques, applied to data collected from a wide range of sources. Using this approach, we’ve helped our clients extract insight for a variety of use cases, often at a lower cost and in a shorter timeframe than traditional research.

The process

Our first step is to identify the most useful data sources for the business problem. Blogs, discussion forums, review sites and social media are all considered, taking into account the quality of the data from each (avoiding ‘fake news’). Web-scraping techniques can be used to harvest these organic conversations about brands, often alongside some sort of ‘performance’ metric; this can be a review score, a count of interactions (comments, likes, shares), or a ‘sentiment score’ calculated using sentiment classifier algorithms. Collecting bespoke data sets, often identified using simple secondary research and industry knowledge, allows the machine learning to yield far more relevant insights.

Despite all of our efforts, a proportion of the text will not be relevant to the business problem. This text is filtered out, much as unsuitable respondents are screened out of a survey. We then clean the remaining text to extract its core meaning using several techniques:

  • Stop-word removal: removing common words that carry little meaning on their own, such as “the”
  • Lemmatisation: grouping together the inflected forms of a word, such as “buying,” “buys,” “bought”
  • Word-vector simplification: each word is represented as a vector that captures its semantic meaning, and words are clustered based on vector similarity. As a basic example, the words ‘queen’ and ‘king’ would have similar values on a ‘royalty’ dimension, but opposite values on a ‘gender’ dimension.
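
As an illustration only, the cleaning steps above could be sketched in Python as follows. This is not KAE's implementation: the stop-word list, the lemma lookup table and the two-dimensional word vectors are tiny, made-up stand-ins for the full resources a real pipeline would take from an NLP library.

```python
import math
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "i", "it", "to", "of"}
LEMMAS = {"buying": "buy", "buys": "buy", "bought": "buy", "fees": "fee"}


def clean(text):
    """Lower-case, tokenise, drop stop words and map each word to its lemma."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up word vectors with two dimensions: (royalty, gender).
VECTORS = {
    "king": (0.9, 1.0),
    "queen": (0.9, -1.0),
    "monarch": (1.0, 0.0),
}

print(clean("The customer bought the card"))
# 'king' and 'queen' are equally close to 'monarch' (shared royalty)...
print(cosine(VECTORS["king"], VECTORS["monarch"]))
print(cosine(VECTORS["queen"], VECTORS["monarch"]))
# ...but differ from each other because their gender components are opposite.
print(cosine(VECTORS["king"], VECTORS["queen"]))
```

In practice the vectors would have hundreds of dimensions learned from data rather than two hand-picked ones, but the similarity calculation is the same.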

The output of this data cleaning step is an organised term matrix (terms being the cleaned versions of words), to which topic modelling can be applied. The machine learning algorithm identifies topics by looking at which terms commonly appear together. Using the results, we look at how often each topic comes up in conversation, as well as which topics often overlap. Using an available performance metric, we can also get a view of which topics are being discussed in a positive or negative manner.
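
To make those three views concrete, here is a minimal Python sketch of how they might be computed once a topic model has assigned topic weights to each review. Everything in it is assumed for illustration: the topic names, the weights, the star ratings used as the performance metric, and the 0.3 threshold for saying a review "mentions" a topic.

```python
from itertools import combinations

# Hypothetical topic-model output: one row per review, one weight per topic.
TOPICS = ["rewards", "customer service", "fees"]
DOC_TOPIC_WEIGHTS = [
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]
REVIEW_SCORES = [5, 4, 2, 1]  # made-up star ratings (the performance metric)
THRESHOLD = 0.3               # assumed cut-off for "this review mentions the topic"


def topic_prevalence(weights):
    """Average weight of each topic across all reviews."""
    n_docs = len(weights)
    return [sum(row[k] for row in weights) / n_docs for k in range(len(weights[0]))]


def topic_overlap(weights, threshold):
    """Number of reviews in which both topics of a pair reach the threshold."""
    return {
        (i, j): sum(1 for row in weights if row[i] >= threshold and row[j] >= threshold)
        for i, j in combinations(range(len(weights[0])), 2)
    }


def topic_scores(weights, scores):
    """Performance metric averaged per topic, weighted by topic weight."""
    result = []
    for k in range(len(weights[0])):
        total = sum(row[k] for row in weights)
        result.append(sum(row[k] * s for row, s in zip(weights, scores)) / total)
    return result


prevalence = topic_prevalence(DOC_TOPIC_WEIGHTS)
overlap = topic_overlap(DOC_TOPIC_WEIGHTS, THRESHOLD)
scores = topic_scores(DOC_TOPIC_WEIGHTS, REVIEW_SCORES)

for name, share, score in zip(TOPICS, prevalence, scores):
    print(f"{name}: share of conversation {share:.2f}, weighted score {score:.2f}")
for (i, j), count in overlap.items():
    print(f"{TOPICS[i]} + {TOPICS[j]}: {count} overlapping review(s)")
```

A topic with a low weighted score is one that tends to dominate poorly rated reviews, which is a simple way to flag themes being discussed negatively.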

To bring depth of insight, we manually interrogate each topic using the key words that are strongly associated with each topic and reviews/posts that are heavily weighted in a topic. The process is often iterative; if we find that the topics have ambiguous meaning, we return to the topic modelling step and tweak the number of topics that the machine learning identifies.
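
The loop of fitting, reading the key words and adjusting the number of topics can be sketched as below. The specific algorithm is an assumption: non-negative matrix factorisation (NMF) is used here purely as a common stand-in for a topic model, implemented with NumPy, and the term list and document-term counts are made up.

```python
import numpy as np

TERMS = ["reward", "point", "cashback", "call", "agent", "wait"]
# Made-up document-term counts: rows are reviews, columns follow TERMS.
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 2, 2],
    [1, 1, 0, 1, 1, 0],
], dtype=float)


def nmf(X, n_topics, n_iter=200, seed=0):
    """Factorise X ~ W @ H with non-negative W (review-topic) and H (topic-term)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_topics)) + 0.1
    H = rng.random((n_topics, X.shape[1])) + 0.1
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        # Multiplicative updates that reduce the squared reconstruction error.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_terms(H, terms, n=3):
    """Most strongly weighted terms for each topic, for manual review."""
    return [[terms[i] for i in np.argsort(row)[::-1][:n]] for row in H]


def reconstruction_error(X, n_topics):
    """How far the fitted topics are from reproducing the original counts."""
    W, H = nmf(X, n_topics)
    return float(np.linalg.norm(X - W @ H))


# Try several topic counts and inspect the key words for each.
for k in (1, 2, 3):
    W, H = nmf(X, k)
    print(k, round(reconstruction_error(X, k), 3), top_terms(H, TERMS))
```

The error figure only says how well the topics reproduce the data. Whether a given number of topics yields themes with a clear meaning still has to be judged by reading the key words and the most heavily weighted reviews.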

Additionally, we can apply a technique called spectral clustering, which uses topic weights to identify distinctive types of comments, such as reviews about rewards and customer service. This allows us to give a view of the entire online sphere, showing the relative volumes of each type of conversation.
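
A stripped-down version of the idea is sketched below using NumPy. It is a two-group special case under stated assumptions (made-up topic weights, cosine similarity as the affinity measure, a split on the sign of the Laplacian's second eigenvector), not the production method; a full spectral clustering would use several eigenvectors followed by a clustering step such as k-means.

```python
import numpy as np

# Hypothetical topic weights: rows are comments,
# columns are the topics (rewards, customer service).
WEIGHTS = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.15, 0.85],
])


def spectral_bipartition(weights):
    """Split comments into two groups using the graph Laplacian's Fiedler vector."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    affinity = unit @ unit.T         # cosine similarity between comments
    np.fill_diagonal(affinity, 0.0)  # ignore self-similarity
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]     # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)


labels = spectral_bipartition(WEIGHTS)

# Relative volume of each type of comment.
for k in (0, 1):
    print(f"type {k}: {np.mean(labels == k):.0%} of comments")
```

Which group is labelled 0 and which is labelled 1 is arbitrary, so the groups are identified afterwards by looking at the topics that dominate each one.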

The outcome

The outcome is an overview of online conversations, outlining what is being discussed and in what volume. We can also see how themes are positioned relative to one another: by looking at how themes overlap and interplay, we get a clearer view of the story. To supplement this overview, we create an in-depth view of each theme, with a breakdown of the nuanced issues within a topic and example conversations so that the topics are easily digestible. So far, we’ve used this process to help our clients answer a range of critical questions:

  • What features do our customers see as missing from our credit card proposition?
  • What are the main pain points in our customer journey?
  • How might our customers react to the introduction of a fee on a previously free product?
  • What card features are our competitors offering that are boosting customer satisfaction?
  • Who are our key competitors, where are they hurting our business, where are they weaker, and how are they performing?

Our proprietary text analytics capabilities can harness the power of organic customer conversation to provide a qualitative view without resorting to more time- and cost-intensive methods such as focus groups. We see many applications of the approach. To name a few:

  • Providing a low-cost ‘first pass’ on a research topic to inform additional research activity
  • Analysing customer reaction (for example to a product or price change) in order to shape future decision making
  • Refining marketing and promotional material

If you’d like to hear more about our text analytics capabilities, or have a question that we can help answer, please get in touch.