The N in Text Analytics: Text Mining with Different Sample Sizes
[Interview Reposted with Permission From Jeffrey Henning's ResearchAccess]
I recently had the opportunity to interview Tom H. C. Anderson, the founder of Anderson Analytics, about his ongoing application of text analytics to market research.
Q: What’s the process for optimally using text analytics with survey verbatim responses?
A: Well, that patented process is something that we’ve obviously put a lot of time and thought into with OdinText, and something that continues to evolve.
Generally speaking, though, I can say it’s important to look beyond the individual sentences and not to get wrapped up in linguistically derived sentiment. The mistake I see made most often is that text analytics is approached as a replacement for human coding. In our view they are apples and oranges. Yes, text analytics can replace human coding. But coding is just a small part of what we do: our real focus is on analytics, and often that means the optimal use of survey verbatim responses is predictive analytics.
Q: Is there a minimal sample size this makes sense for?
A: I wouldn’t say that there’s a minimum size per se, though I would say that the ROI of text analytics increases exponentially with the size of the data. From our point of view, “Natural Language Processing”, “Text Analytics”, “Text Mining” and even “Data Mining” are all synonyms, the last two of which are a better description of the process. What that means is that without a certain minimum size of data there will be no meaningful patterns to find (to mine).
Focus group data generally is not suitable for text analytics. That is partly because the n is so small, but also because, although focus groups can produce a large amount of text in total, this text is heavily influenced by the moderator. It very much depends on the data, though. The smallest data set ever analyzed in OdinText had a sample size of n=2: the Obama/Romney debates, in which each candidate spoke for about 45 minutes. More typically, though, text analytics is used to analyze tens of thousands, or hundreds of thousands, of records. These data come from customer satisfaction/loyalty survey trackers, customer service center telephone transcripts or emails, or, yes, social media.
Many of our customers do find text analytics useful for smaller ad-hoc survey data with sample sizes around n = ~1,000 as well. Once you are up and running with text analytics, it’s very easy and fast to get insights from data such as this. You are somewhat more limited in the kinds of analysis you can do with these smaller data sets, but if you do enough of these ad-hoc projects, text analytics can certainly provide relatively good ROI here too.
Q: Is it better suited for tracking studies rather than one-off surveys?
A: Better ROI with bigger, better data. If you only do 5 to 10 ad-hoc surveys per year with an average of n=300, then text analytics may not be worth it. As you move beyond this, it becomes more and more valuable.
Q: My initial impression after first hearing about your NPS work was simply that you improved the value of the survey by adding text analytics. But it seems like you are really about a holistic process, using CRM data and other information to build a predictive model. What are the data sources that you find produce the best value? While I think of OdinText as text analytics, is it actually a predictive analytics solution whose differentiation is its text analytics capabilities?
A: Well, yes, you are right that OdinText is a text analytics system. We are not trying to become the next SAS or SPSS, per se; both of them have some good packages for basic statistics. Where OdinText is best is when there is also text data, and when the data gets bigger. Our clients are often working with data sets so large that they would take too long to run in, or more typically would crash, SPSS and the like. Working with text data requires more computing power. That’s something we are able to offer through our SaaS model.
In the case you mentioned, Shell was using OdinText to analyze their n = ~400,000 Jiffy Lube Net Promoter survey data. We suggested that they add some data from their CRM database, so they added actual behavioral data: visits as well as sales.
This is a unique strength of OdinText. We don’t believe it makes much sense to analyze text in isolation. We are currently building more analytics capabilities into OdinText.
Q: The text analytics space is very crowded; I’ve personally looked at over 20 platforms. What sets OdinText apart from other systems?
A: Three things, really, all tied to our patented approach to text analytics:
- The way we allow you to use mixed data, not just text.
- The way we filter out ‘noise’ and alert the analyst to things they might not have considered.
- And finally, our approach, while powerful, is also intuitive. We recognized early on that most clients don’t have any relevant training data, and when they do, using it to build models would just be mimicking inferior human coding. So unlike other enterprise solutions that require a lot of custom setup, our approach was developed to work very well off the shelf: it’s far more nimble in dealing with different data sources.
Jeffrey Henning, PRC, is president of Researchscape International, which provides “Do It For You” custom surveys at Do It Yourself prices. He is a Director at Large on the Marketing Research Association’s Board of Directors. You can follow him on Twitter @jhenning.