Posts tagged quantitative research
Celebrating Four Consumer Insights Industry Awards for Text Analytics!

A Note of Reflection on Text Analytics and Thanks to the Marketing Research Industry

Just posting a short video today to celebrate our recent awards in an industry that has always been so near and dear to OdinText (and Anderson Analytics before that).

If you would’ve told me 25 years ago when I began my career in consumer insights that one day I would be running my own text analytics software company, I might have laughed. But the field has changed so dramatically since I was a freshly-minted market researcher.

Back when I started out, online research did not exist. There was no social media. Tweeting was something only birds did. And teenagers bugged their parents for their own land lines.

As a young researcher, the primary way we collected information and gleaned insights from consumers was by asking them questions, but one of the most pronounced shifts in research today is that so much of the data at our disposal is collected passively.

Today, in addition to the tried-and-true qualitative and quantitative tools of our trade, we have oceans of data flooding into our organizations from non-traditional sources.

And so our job has become less about accumulating information from consumers, and more about connecting the dots and making sense of it all. That’s a pretty significant shift, I think.

And it’s a monumental challenge for those of us in consumer insights.

Many companies now are turning to data scientists for answers, but I would argue that the onus is on those of us in market research to find a way to use these data for competitive advantage.

I got into this because I was conducting advanced text analytics for clients and none of the tools available did what we needed them to do as research analysts.

I did not set out to be a software developer; I just wanted something that worked for my team and our clients. It needed to be fast and easy to use for people who are not data scientists.

We are honored and humbled to be part of such a smart, creative and closely knit community of professionals.  To be recognized by peers for contributing to our progress as a profession is enormously gratifying.

Thank you all again! We hope and plan to continue to innovate and give back to our industry!

 

 

Yours faithfully,

@TomHCAnderson @OdinText

Peaks and Valleys or Critical Moments Analysis

Peaks and Valleys or Critical Moments Analysis – Text Analytics Tips by Gosia

 

How can you gain interesting insights just from looking at descriptive charts based on your data? Select a key metric of interest, like Overall Satisfaction (scale 1-5), and, using text analytics software that lets you plot text data as well as numeric data longitudinally (e.g., OdinText), view your metric averages across time. Next, view the plot using different time intervals (e.g., daily, weekly, bi-weekly, or monthly overall satisfaction averages) and look for obvious “peaks” (sudden increases in the average score) or “valleys” (sudden decreases in the average score). Note down the time periods in which you observe any peaks or valleys and try to identify reasons or events associated with these trends, e.g., changes in management, a new advertising campaign, customer service quality, etc. The next step is to plot average overall satisfaction scores for selected themes and see how they relate to the identified “peaks” or “valleys,” as these themes may provide potential answers to the critical moments in your longitudinal analysis.
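To make the mechanics concrete, here is a minimal Python sketch of the peaks-and-valleys step. It assumes a hypothetical survey.csv with date, satisfaction (1-5) and comment columns; OdinText does all of this interactively, so the code is purely illustrative:

```python
# Minimal sketch: plot metric averages over time at different intervals
# and flag sudden jumps/drops. 'survey.csv' and its columns are assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv", parse_dates=["date"])
daily = df.set_index("date")["satisfaction"].resample("D").mean()
weekly = df.set_index("date")["satisfaction"].resample("W").mean()

# Flag "peaks" and "valleys": periods whose average moves by more than
# one scale point relative to the previous period.
change = daily.diff()
print("Valleys:\n", daily[change < -1.0])
print("Peaks:\n", daily[change > 1.0])

daily.plot(marker="o", title="Overall Satisfaction by Day")
plt.show()
```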

In the figure below you can see how the average overall satisfaction of a sample company varied over approximately one month (each data point/column represents one day in that month). Whereas no “peaks” were found in the average overall satisfaction curve, there was one significant “valley” visible at the beginning of the studied month (see plot 1 in Figure 1). It represented a sudden drop from an average satisfaction of 5.0 (day 1) to 3.1 (day 2) and 3.5 (day 3) before rising again and oscillating around an average satisfaction of 4.3 for the rest of the month. So what could be the reason for this sudden and deep drop in customer satisfaction?


Figure 1. Annotated OdinText screenshots showing an example of an exploratory analysis using longitudinal data (Overall Satisfaction).

Whereas a definitive answer requires more advanced predictive analyses (also available in OdinText), a quick and very easy way to explore potential answers is simply to plot the average satisfaction scores associated with a few themes identified earlier. In this sample scenario, average satisfaction scores among customers who mentioned “customer service” (green bars; second plot) overlap very well with the overall satisfaction trendline (orange line), suggesting that customer service complaints may have been the reason for the lowered satisfaction ratings on days 2 and 3. Another theme plotted, “fast service” (see plot 3), did not follow the overall satisfaction trendline at all, as customers mentioning this theme were highly satisfied on almost every day except day 6.
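For readers who want to see this step outside of OdinText, a rough sketch of the theme overlay, again using the hypothetical survey.csv and a naive keyword match in place of real theme coding:

```python
# Sketch: overlay average satisfaction among records mentioning a theme
# against the overall daily trendline. Data and columns are assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv", parse_dates=["date"])
overall = df.set_index("date")["satisfaction"].resample("D").mean()

# Naive keyword match stands in for proper theme coding.
mask = df["comment"].str.contains("customer service", case=False, na=False)
theme = df[mask].set_index("date")["satisfaction"].resample("D").mean()

ax = overall.plot(color="orange", label="Overall")
theme.plot(ax=ax, label="'customer service' mentions")
ax.legend()
plt.show()
```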

This kind of simple exploratory analysis can be very powerful in showing you what factors might have effects on customer satisfaction and may serve as a crucial step for subsequent quantitative analysis of your text and numeric data.

 

Text Analytics Tips with Gosia

 

[NOTE: Gosia is a Data Scientist at OdinText Inc. Experienced in text mining and predictive analytics, she is a Ph.D. with extensive research experience in mass media’s influence on cognition, emotions, and behavior.  Please feel free to request additional information or an OdinText demo here.]

Text analysis answers: Is the Quran really more violent than the Bible? (3 of 3)

Text analysis answers: Is the Quran really more violent than the Bible? by Tom H. C. Anderson


Part III: The Verdict

To recap…

President Obama in his State of the Union last week urged Congress and Americans to “reject any politics that target people because of race or religion”—clearly a rebuke of presidential candidate Donald Trump’s call for a ban on Muslims entering the United States.

This exchange, if you will, reflects a deeper and more controversial debate that has wended its way into not only mainstream politics but the national discourse: Is there something inherently and uniquely violent about Islam as a religion?

It’s an unpleasant discussion at best; nonetheless, it is occurring in living rooms, coffee shops, places of worship and academic institutions across the country and elsewhere in the world.

Academics of many stripes have interrogated the texts of the great religions and no doubt we’ll see more such endeavors in the service of one side or the other in this debate moving forward.

We thought it would be an interesting exercise to subject the primary books of these religions—arguably the core of their philosophy and tenets—to comparison using the advanced data mining technology that Fortune 500 corporations, government agencies and other institutions routinely use to comb through large sets of unstructured text to identify patterns and uncover insights.

So, we’ve conducted a surface-level comparative analysis of the Quran and the Old and New Testaments using OdinText to uncover with as little bias as possible the extent to which any of these texts is qualitatively and/or quantitatively distinct from the others using metrics associated with violence, love and so on.

Again, some qualifiers…

First, I want to make very clear that we have not set out to prove or disprove that Islam is more violent than other religions.

Moreover, we realize that the Old and New Testaments and the Quran are neither the only literature in Islam, Christianity and Judaism, nor do they constitute the sum of these religions’ teachings and protocols.

I must also reemphasize that this analysis is superficial and the findings are by no means intended to be conclusive. Ours is a 30,000-ft, cursory view of three texts: the Quran and the Old and New Testaments, respectively.

Lastly, we recognize that this is a deeply sensitive topic and hope that no one is offended by this exercise.

 

Analysis Step: Similarities and Dissimilarities

Author’s note: For more details about the data sources and methodology, please see Part I of this series.

In Part II of the series, I shared the results of our initial text analysis for sentiment—positive and negative—and then broke that down further across eight primary human emotion categories: Joy, Anticipation, Anger, Disgust, Sadness, Surprise, Fear/Anxiety and Trust.

The analysis determined that of the three texts, the Old Testament was the “angriest,” which obviously does not appear to support an argument that the Quran is an especially violent text relative to the others.

The next step was, again staying at a very high level, to look at the terms frequently mentioned in the texts to see what, if anything, these three texts share and where they differ.

Similarity Plot


This is yet another iterative way to explore the data from a bottom-up, data-driven approach and identify key areas for more in-depth text analysis.

For instance—and not surprisingly—“Jesus” is the most unique and frequently mentioned term in the New Testament, and when he is mentioned, he is mentioned positively (color coding represents sentiment).

“Jesus” is also mentioned a few times in the Quran, and, for obvious reasons, not mentioned at all in the Old Testament. But when “Jesus” is mentioned in the New Testament, terms that are more common in the Old Testament—such as “God” and “Lord”—often appear with his name; therefore the placement of “Jesus” on the map above, though definitely most closely associated with the New Testament, is still more closely related to the Old Testament than the Quran because these terms appear more often in the former.

Similarly, it may be surprising to some that “Israel” is mentioned more often in the Quran than the New Testament, and so the Quran and the Old Testament are more textually similar in this respect.
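The intuition behind such a map can be sketched in a few lines: build a term-frequency profile for each text and compare the profiles. To be clear, this is only an approximation; OdinText’s similarity plot is produced by its own algorithm, and the file names below are assumptions:

```python
# Rough sketch: pairwise similarity of three texts from term frequencies.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

files = ("old_testament.txt", "new_testament.txt", "quran.txt")  # assumed names
texts = [open(f, encoding="utf-8").read() for f in files]

tf = CountVectorizer(stop_words="english").fit_transform(texts)
print(cosine_similarity(tf))  # 3x3 matrix of pairwise text similarity
```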

So…Is the Quran really more violent than the Old and New Testaments?

Old Testament is Most Violent

A look into the verbatim text suggests that the content in the Quran is not more violent than its Judeo-Christian counterparts. In fact, of the three texts, the content in the Old Testament appears to be the most violent.

Killing and destruction are referenced slightly more often in the New Testament than in the Quran (2.8% vs. 2.1%), but the Old Testament clearly leads—more than twice that of the Quran—in mentions of destruction and killing (5.3%).

New Testament Highest in ‘Love’, Quran Highest in ‘Mercy’

The concept of ‘Love’ is more often mentioned in the New Testament (3.0%) than either the Old Testament (1.9%) or the Quran (1.26%).

But the concept of ‘Forgiveness/Grace’ actually occurs more often in the Quran (6.3%) than the New Testament (2.9%) or the Old Testament (0.7%). This is partly because references to “Allah” in the Quran are frequently accompanied by “The Merciful.” Some might dismiss this as a tag or title, but we believe it’s meaningful because mercy was chosen above other attributes like “Almighty” that are arguably more closely associated with deities.
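For the curious, the kind of metric reported in this section (the share of a text mentioning a concept) can be approximated with a small dictionary-based sketch. The term lists below are illustrative stand-ins, not the categories OdinText actually used:

```python
# Sketch: percent of verses containing any term from a concept dictionary.
import re

categories = {  # illustrative term lists only
    "love": ["love", "loving", "beloved"],
    "mercy": ["mercy", "merciful", "forgive", "grace"],
    "violence": ["kill", "destroy", "slay", "slaughter"],
}

def category_share(verses, terms):
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\w*", re.IGNORECASE)
    return 100.0 * sum(bool(pattern.search(v)) for v in verses) / len(verses)

verses = open("quran.txt", encoding="utf-8").read().splitlines()  # assumed file
for name, terms in categories.items():
    print(f"{name}: {category_share(verses, terms):.1f}%")
```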


‘Belief/Faith’, ‘Non-Members’ and ‘Enemies’

A key difference emerged immediately among the three texts around the concept of ‘Faith/Belief’.

Here the Quran leads with references to ‘believing’ (7.6%), followed by the New Testament (4.8%) and the Old Testament a distant third (0.2%).

Taken a step further, OdinText uncovered what appears to be a significant difference with regard to the extent to which the texts distinguish between ‘members’ and ‘non-members’.

Both the Old and New Testaments use the term “gentile” to signify those who are not Jewish, but the Quran is somewhat distinct in referencing the concept of the ‘Unbeliever’ (e.g., “disbelievers,” “disbelieve,” “unbeliever,” “rejectors,” etc.).

And in two instances, the ‘Unbeliever’ is mentioned together with the term “enemy”:

“And when you journey in the earth, there is no blame on you if you shorten the prayer, if you fear that those who disbelieve will give you trouble. Surely the disbelievers are an open enemy to you.”

 An-Nisa 4:101

“If they overcome you, they will be your enemies, and will stretch forth their hands and their tongues towards you with evil, and they desire that you may disbelieve.”

Al-Mumtahina 60:2

That said, the concept of “Enemies” actually appears most often in the Old Testament (1.8%).

And while the concept of “Enemies” occurs more often in the Quran than in the New Testament (0.7% vs. 0.5%, respectively), there is extremely little difference in how they are discussed (i.e., who they are and how to deal with them), with one exception: the Quran is slightly more likely than the New Testament to mention “the Devil” or “evil” as being an enemy (0.2% vs. 0.1%).

Conclusion

While A LOT MORE can be done with text analytics than what we’ve accomplished here, it appears safe to conclude that some commonly-held assumptions about and perceptions of these texts may not necessarily hold true.

Those who have not read or are not fairly familiar with the content of all three texts may be surprised to learn that no, the Quran is not really more violent than its Judeo-Christian counterparts.

Personally, I’ll admit that I was a bit surprised that the concept of ‘Mercy’ was most prevalent in the Quran; I expected that the New Testament would rank highest there, as it did in the concept of ‘Love’.

Overall, the three texts rated similarly in terms of positive and negative sentiment as well, but from an emotional read, the Quran and the New Testament appear more similar to one another than either is to the significantly “angrier” Old Testament.

Of course, we’ve only scratched the surface here. A deep analysis of unstructured data of this complexity requires contextual knowledge, and, of course, some higher level judgment and interpretation.

That being said, I think this exercise demonstrates how advanced text analytics and data mining technology may be applied to answer questions or make inquiries objectively and consistently outside of the sphere of conventional business intelligence for which our clients rely on OdinText.

I hope you found this project as interesting as I did and I welcome your thoughts.

Yours fondly,

Tom @OdinText


 

Text Analytics Tips

Text Analytics Tips, with your Hosts Tom & Gosia: Introductory Post

Today, we’re blogging to let you know about a new series of posts starting in January 2016 called ‘Text Analytics Tips’. This will be an ongoing series, and our main goal is to help marketers understand text analytics better.

We realize Text Analytics is a subject with incredibly high awareness, yet sadly also a subject with many misconceptions.

The first generation of text analytics vendors overhyped the importance of sentiment as a tool, as well as ‘social media’ as a data source, often preferring the even vaguer term ‘Big Data’ (usually just referring to tweets). They offered no evidence of the value of either, and usually ignored the much richer techniques and sources of data for text analysis. Little to no information or training was offered on how to actually gain useful insights via text analytics.

What are some of the biggest misconceptions in text analytics?

  1. “Text Analytics is Qualitative Research”

FALSE – Text Analytics IS NOT qualitative. Text Analytics = Text Mining = Data Mining = Pattern Recognition = Math/Stats/Quant Research

  2. “It’s Automatic (artificial intelligence): you just press a button and look at the report/wordcloud”

FALSE – Text Analytics is a powerful technique made possible thanks to tremendous processing power. It can be easy with the right tool, but like any other powerful analytical tool, it is limited by the quality of your data and the resourcefulness and skill of the analyst.

  3. “Text Analytics is a Luxury” (i.e., structured data analysis is of primary importance and unstructured data is an extra)

FALSE – Nothing could be further from the truth. In our experience, when text data is available, it almost always outperforms the standard available quant data in explaining and/or predicting the outcome of interest! (See the sketch below.)
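To make that last point concrete, here is a purely hypothetical sketch comparing a model built on structured fields alone against one built on simple text features; the file, column names and outcome are assumptions, not results from a real study:

```python
# Hypothetical comparison: structured-only vs. text-only predictors.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("survey.csv")  # assumed columns: age, spend, comment, churned
y = df["churned"]

X_struct = df[["age", "spend"]]
X_text = TfidfVectorizer(max_features=500).fit_transform(df["comment"].fillna(""))

model = LogisticRegression(max_iter=1000)
print("structured only:", cross_val_score(model, X_struct, y).mean())
print("text only:      ", cross_val_score(model, X_text, y).mean())
```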

There are several other text analytics misconceptions of course and we hope to cover many of them as well.

While various OdinText employees and clients may be posting in the ‘Text Analytics Tips’ series over time, Senior Data Scientist, Gosia, and our Founder, Tom, have volunteered to post on a more regular basis…well, not so much volunteered as drawing the shortest straw (our developers made it clear that “Engineers don’t do blog posts!”).

Kidding aside, we really value education at OdinText, and it is our goal to make sure OdinText users become proficient in text analytics.

Though Text Analytics, and OdinText in particular, are very powerful tools, we will aim to keep these posts light and fun, yet interesting and insightful. If you’ve just started using OdinText or are interested in applied text analytics in general, these posts are certainly a good start for you.

During this long-running series we’ll be posting tips, interviews, and various fun short analyses. Please come back in January for our first post, which will deal with the analysis of a very simple unstructured survey question.

Of course, if you’re interested in more info on OdinText, no need to wait, just fill out our short Request Info form.

Happy New Year!

Your friends @OdinText


[NOTE: Tom is Founder and CEO of OdinText Inc. A long-time champion of text mining, in 2005 he founded Anderson Analytics LLC, the first consumer insights/marketing research consultancy focused on text analytics. He is a frequent speaker and data science guest lecturer at university and research industry events.

Gosia is a Senior Data Scientist at OdinText Inc. A Ph.D. with extensive experience in content analytics, especially psychological content analysis (i.e., sentiment analysis and emotion in text), as well as predictive analytics using unstructured data, she is fluent in German, Polish and Spanish.]

 

The Text Analytics Opportunity

Text Analytics remains an opportunity for those wishing to gain an information advantage.

[Note: This is an ongoing series of interviews on analytics ahead of the Useful Business Analytics Summit. Feel free to check out the rest of the interviews beginning here]

 

My favorite topic of any analytics conference is, of course, the mining and analysis of unstructured data. Whether you call it Natural Language Processing (NLP), Text Mining, or the more recently popular Text Analytics, chances are you’ve heard of it. Since I started Anderson Analytics ten years ago as the first consumer insights firm to leverage text analytics, the discipline seems to have gone from unknown to mainstream ubiquity.

That said, because it is such a quickly evolving field, many of the main players from 10 years ago have faded into relative obscurity. Some have been purchased by other companies; many have simply not been able to keep up with advancements or to prove adequate value.

Therefore perhaps I shouldn’t have been surprised that this was the one area where our analytics experts were a bit less sure of themselves, and I received relatively few responses to my questions.

That said, consensus is that there certainly are text analytics software options on the market that do provide strong value. Personally I think the main challenge is that there are still too few analysts with experience in text analytics, and too little time allocated to prove just how amazing unstructured data insights can be!

Q. What is your opinion on the current state of the unstructured/text analytics field?

 

[Jonathan Isernhagen – Travelocity]

I am the wrong guy to ask, because I was already blown away six years ago when Attensity was boiling down conversations into subject-verb pairs, and things have only gotten better since then. I think there’s a point in the life of each

 

[Farouk Ferchichi – Toyota]

At this moment, I believe for the kind of experiments that people have started to leverage it for, it is good enough.

 

[Sofia Freyder – MasterCard]

For me personally it’s more supplemental data. Structured data is easier to utilize, slice and dice.

Unstructured data can be a very useful source of qualitative data, supplemental to quantitative analysis.

Also, there are tools that can create structured analytics from unstructured data.

 

[Deepak Tiwari – Google]

A lot of solutions exist in the marketplace, but it is a complex problem and we have a long way to go.

 

Q. What if anything in text analytics have you found that really works well? What doesn’t?

 

[Jonathan Isernhagen – Travelocity]

I don’t have direct experience with text mining beyond what we’ve done with Attensity.

 

[Farouk Ferchichi – Toyota]

What works well is the flexibility and the ability to change and implement once you have the engine built. What doesn’t work well is the overpriced text analytics tools, which make many develop their own and miss the opportunity to focus on analytics instead of transforming the unstructured data.

 

[Sofia Freyder – MasterCard]

Works well: Qualitative data, opinion-based data.

Doesn’t: Certain KPIs without a benchmark.

 

[Deepak Tiwari – Google]

High-level sensitivity analysis and high-level signaling work well. But the solutions are not yet at a place for granular, actionable insights. In other words, use them as an indicator and not as an actionable solution.

 

Stop by for the next blog post as I ask our experts for tips on selecting a software vendor and how much software should cost. I’ll even be asking how our client-side speakers like to be sold to…

 

@TomHCAnderson

@OdinText

 

[Full Disclosure: Tom H. C. Anderson is Managing Partner of Anderson Analytics, developers of patented Next Generation Text Analytics™ software platform OdinText. For more information and to inquire about software licensing visit ODINTEXT INFO REQUEST]

Are All Data Created Equal?

A tweet, a transaction, an email or a phone call – do you have a preference?

[Note: This is an ongoing series of interviews on analytics ahead of the Useful Business Analytics Summit. Feel free to check out the rest of the interviews beginning here]

I thought this was an important question, and one I knew the answer to. My thinking, based on experience, has been: certainly not; some data is far richer and more important than other data. For instance, one tweet (or 10,000 tweets, for that matter) is nowhere near as important as one good data record of an actual customer calling or emailing your customer service center with a specific complaint, praise or suggestion.

That said, as I posed the question to our panel of client-side analytics experts, I began to think maybe the question itself was a mistake: the all too common mistake of putting the data before the question.

Curious to hear your thoughts. Can we legitimately ask this question about data without first deciding what question is to be answered? And if we can, on which side of the spectrum do you fall: all data is created equal, or some data are priceless and others almost useless?

 

Q. WHAT TYPE OF DATA, IF ANY, DO YOU FIND MORE IMPORTANT? WHICH TYPES LESS IMPORTANT, AND WHY?

ThomasSpeidelSuncor

[Thomas Speidel - Suncor Energy]

 It depends on what we are trying to find out. For mission critical decisions, it's important to have data that was intentionally captured for that or a similar purpose (usually structured).

For exploration or low consequence questions, any data will do so long as we understand the limitations of our findings.


[Sofia Freyder – MasterCard]

I think all data is important: structured and unstructured, quantitative and qualitative, online and offline, behavioral or opinion-based. Each specific situation will define which data should be considered more accurate and precise.


[Deepak Tiwari - Google]

It depends. We use all types of data (structured, unstructured) and, depending on the problem, use them to varying degrees.


[Jonathan Isernhagen – Travelocity]

I’m a finance guy at heart, and believe in the idea of net present value… the idea that every allocation decision we make can be thought of as a project that should pay out more than the investment. I’m interested in any data which directly inform such “project” decisions… the ROI stuff. I’m less interested in other data. There’s a school of thought that I’d call “Pathism” or “Funnelism” which rejects channel attribution. If you don’t have the marketing budget to justify investing in an algorithmic attribution model, that’s one thing. If you imagine that knowing your fourth-most-popular path to conversion is SEO-to-Direct is better than knowing your individual channel ROIs… I would beg to differ.


[Farouk Ferchichi - Toyota Financial Services]

I don’t believe there is data that is not important. All data is important given the appropriate context. Internal and external structured data, in the form of financials or customer data, is important for analyzing histories and developing models, but internal and external unstructured data is equally important for discovering and accessing new types of information. The question becomes how to access data and what to acquire/store, and for that you need a data discovery and acquisition strategy.


[Anthony Palella - Angie's List]

Importance is determined by the high-value questions that need to be answered. When I start working with a business partner, I don’t ask about KPIs. I ask, “What are the 10-12 questions you need answers to in order to successfully run your business?” The data needed to answer these questions is “important”.


[Larry Shiller - Yale]

This is a "meta" answer... "Type" means a way to slice and dice. If you are slicing data only one way, that way may be a shiny object: Look for other ways (i.e., other dimensions) to slice your data. For example, the most common dimension is time: Look for other dimensions/pivots.

 

Thanks to our speakers at the upcoming Useful Business Analytics Summit for their thoughtful answers to the above question. This Q&A is part of an ongoing series focusing on big data and business analytics in general. Feel free to check out some of our past questions on Big Data, How to Keep Up to Date on Analytics, Top 10 Analytics Tips. Our next post will be on my favorite topic, text analytics!

@TomHCAnderson

@OdinText

 

 

[Full Disclosure: Tom H. C. Anderson is Managing Partner of Anderson Analytics, developers of a patented Next Generation approach to text analytics known as OdinText. For more information and to inquire about software licensing visit OdinText INFO Request.]

 

Forget Big Data, Think Mid Data

Stop chasing Big Data; Mid Data makes more sense.

After attending the American Marketing Association’s first conference on Big Data this week, I’m even more convinced of what I already suspected from speaking to hundreds of Fortune 1000 marketers over the last couple of years: extremely few are working with anything approaching what would be called “Big Data” – and I believe they don’t need to – but many should start thinking about how to work with Mid Data!


“Big Data”, “Big Data”, “Big Data”. It seems like everyone is talking about it, but I find extremely few researchers are actually doing it. Should they be?

If you’re reading this, chances are that you’re a social scientist or business analyst working in consumer insights or a related area. I think it’s high time that we narrowed the definition of ‘Big Data’ a bit and introduced a new, more meaningful and realistic term, “MID DATA,” to describe what is really the beginning of Big Data.

If we introduce this new term, it only makes sense that we refer to everything that isn’t Big or Mid data as Small Data (I hope no one gets offended).

Small Data

I’ve included a chart, and for simplicity will think of size here as the number of records, or sample size if you prefer.

‘Small Data’ can include anything from one individual interview in qualitative research to several thousand survey responses in longitudinal studies. At this level of size, quantitative and qualitative data can technically be lumped together, as neither currently fits the generally agreed upon (and admittedly loose) definition of “Big Data”. You see, rather than referring to a specific size, the current definition of Big Data varies depending on the capabilities of the organization in question. The general rule is that Big Data is data which cannot be analyzed by commonly used software tools.

As you can imagine, this definition is an IT/hardware vendor’s dream, as it describes a situation where a firm does not have the resources to analyze (supposedly valuable) data without spending more on infrastructure, usually a lot more.

Mid Data

What then is Mid Data? At the beginning of Big Data, some of the same data sets we might call Small Data can quickly turn into Big Data. Take, for instance, the 30,000-50,000 records from a customer satisfaction survey, which can usually be analyzed in commonly available analytical software like IBM-SPSS without crashing. Add text comments to this same data set, however, and performance slows considerably; these same data sets will now often take too long to process or, more typically, crash.

If these same text comments are also coded, as is the case in text mining, the additional variables may increase the size of the dataset significantly. This is currently viewed as Big Data, for which more powerful software is needed. However, I believe a more accurate description would be Mid Data, as it is really the beginning of Big Data, and there are many relatively affordable approaches to dealing with data of this size. But more about this in a bit…
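A toy illustration of the coding step just described, showing how each detected theme becomes an additional (usually sparse) column alongside the original survey fields:

```python
# Toy example: coding text comments into theme indicator columns.
import pandas as pd

df = pd.DataFrame({
    "satisfaction": [5, 2, 4],
    "comment": ["fast service", "rude staff, long wait", "long wait but friendly"],
})
themes = ["fast service", "rude staff", "long wait", "friendly"]
for t in themes:  # one 0/1 indicator column per theme
    df[t.replace(" ", "_")] = df["comment"].str.contains(t).astype(int)

print(df.shape)  # 2 original columns -> 2 + len(themes)
```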

Big Data

Now that we’ve taken a chunk out of Big Data and called it Mid Data, let’s redefine Big Data, or at least agree on where Mid Data ends and when ‘Really Big Data’ begins.

To understand the differences between Mid Data and Big Data we need to consider a few dimensions. Gartner analyst Doug Laney famously referred to Big Data as being 3-Dimensional; that is having increasing volume, variety, and velocity (now commonly referred to as the 3V model).

To understand the difference between Mid Data and Big Data though, only two variables need to be considered, namely Cost and Value. Cost (whether in time or dollars) and expected value are of course what make up ROI. This could also be referred to as the practicality of Big Data Analytics.

While we often know that some data is inherently more valuable than other data (100 customer complaints emailed to your office should be more relevant than 1,000 random tweets about your category), one thing is certain: data that is not analyzed has absolutely no value.

Really Big Data, to the far right of Mid Data, is the point beyond which an investment in analysis no longer makes sense, due to cost (which includes the risk of not finding insights worth more than the dollars invested in the Big Data). Somewhere after Mid Data, big data analytics becomes impractical both theoretically and, for your firm, in very real economic terms.

Mid Data, on the other hand, can be viewed as the sweet spot of Big Data analysis: that which is currently possible, worthwhile and within budget.

So What?

Mid Data is where many of us in market research have a great opportunity. It is where very real and attainable insight gains await.

Really Big Data, on the other hand, may be well past a point of diminishing returns.

On a recent business trip to Germany I had the pleasure of meeting a scientist working on a real Big Data project, the famous Large Hadron Collider project at CERN. Unlike CERN, consumer goods firms will not fund the software and hardware needed to analyze this level of Big Data. Data magnitudes common at the Collider (the output of 150 million sensors delivering data 40 million times per second) are neither economically feasible for business nor needed. In fact, scientists at CERN do not analyze this amount of Big Data; instead, they filter out 99.999% of collisions, focusing on just 100 “collisions of interest” per second.

The good news for us in business is that, if we’re honest, customers really aren’t that difficult to understand. There are now many affordable and excellent Mid Data software options available, for both data and text mining, that do not require exabytes of data or massively parallel software running on thousands of servers. While magazines and conference presenters like to reference Amazon, Google and Facebook, even these somewhat rare examples sound more like IT sales science fiction and omit the sampling of data that occurs even at these companies.

As the scientists at CERN have already discovered, it’s more important to properly analyze the fraction of the data that is important (“of interest”) than to process all the data.

At this point some of you may be wondering, well if Mid Data is more attractive than Big Data, then isn’t small data even better?

The difference of course is that as data increases in size we can not only be more confident in the results, but we can also find relationships and patterns that would not have surfaced in traditional small data. In marketing research this may mean the difference between discovering a new niche product opportunity or quickly countering a competitor’s move. In Pharma, it may mean discovering a link between a smaller population subgroup and certain high cancer risk, thus saving lives!

Mid Data could benefit from further definition and best practices. Ironically some C-Suite executives are currently asking their IT people to “connect and analyze all our data” (specifically the “varied” data in the 3-D model), and in the process they are attempting to create Really Big (often bigger than necessary) Data sets out of several Mid Data sets. This practice exemplifies the ROI problem I mentioned earlier. Chasing after a Big Data holy grail will not guarantee any significant advantage. Those of us who are skilled in the analysis of Small or Mid Data clearly understand that conducting the same analysis across varied data is typically fruitless.

It makes as much sense to compare apples to cows as accounting data to consumer respondent data. Comparing your customers in Japan to your customers in the US makes no sense for various reasons ranging from cultural differences to differences in very real tactical and operational options.

No, for most of us, Mid Data is where we need to be.

@TomHCAnderson

[Full Disclosure: Tom H. C. Anderson is Managing Partner of Anderson Analytics which develops and sells patent pending data mining and text analytics software platform OdinText]

 

Text Analysis of 2012 Presidential Debates

Obama more certain and positive – Romney more negative and direct

Lately there's been a craze for analyzing 140-character Tweets to make all sorts of inferences about everything from brand affinity to political opinion. While I'm generally of the position that the best return on investment in text analytics comes from large volumes of comments, I fear we often overlook other interesting data sources in favor of what a small percentage (about 8%) of the population says in tweets or blogs.

When the speakers are the current and possibly next president of the US, looking at what, if anything, can be gained by leveraging text analytics on even very small data sets starts to become more interesting.

Therefore, ahead of the final presidential debate between Obama and Romney, we uploaded the last two presidential debates into our text analytics software, OdinText, to see what, if anything, political pundits and strategists might find useful. OdinText read and coded the debates in well under a minute, and below are some brief top-line findings for those interested.

[Note, typically text analytics should not be used in isolation from human domain expert analysis. However, in the spirit of curiosity, and in hopes of providing a quick and unbiased analysis we're providing these findings informally ahead of tonight's debate.]

The Devil in the Detail

Comments from sources like a debate are heavily influenced by the questions asked by the moderator. Unlike the analysis of free-flowing, unguided comments by the many, where the primary benefit of text analytics is often to understand what is being discussed and by how many, the benefit of analyzing a carefully moderated discussion between just two people is more likely to lie in the detail. Rather than focusing on the typical charts quantifying exactly which issues are discussed (which are largely controlled by the moderator), the focus of text analytics on these smaller data sets is on the details of exactly how things are said, as well as what often isn't said or is avoided.

"That's not the right answer for America. I'll restore the vitality that gets America working again." (Governor Romney, Debate #2)

In text analysis of the debates, the first findings often reveal frequency differences in specific terms and issues, such as the fact that Governor Romney was far more likely than President Obama to mention "America" when speaking (88 vs. 42 times across the first two debates). We make no assumptions in this analysis about whether this is a strategic consideration during the debates or a matter of personal style, nor about whether it has a beneficial impact on the audience.

However, certainly the differences in frequency and repetition of certain terms mentioned by a speaker such as "Millions looking for work" obviously do reflect how important the speaker believes these issues may be. How Obama and Romney refer to the audience, the moderator and to US citizens is easy to quantify and may also play a role in how they are perceived. For instance Romney prefers the term "people" (used 77 times in the second debate vs. Obama's 26 times), whereas Obama prefers the term "folks" (19 times vs. Romney's 2 times). Text Analytics also quickly identified that unlike the case in the first debate, Obama was twice as likely as Romney to mention the moderator "Candy" by name in the second debate.

Certain terms like "companies", "taxes" and "families" were favored by Obama and avoided by Romney. Conversely, Romney was significantly more likely to mention measuring terms, though many were rather indefinite, such as "number", "high" and "half" (e.g., "...unemployment at chronically high level..."); we did, however, also see an attempt by Romney to reference specific percentages. Obviously, text analytics cannot fact-check quantitative claims; this is where domain expertise by a human analyst comes into play.
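For those curious about the mechanics, per-speaker term counting of this sort can be sketched in a few lines, assuming a plain-text transcript in which each turn begins with a speaker tag (the transcript format here is an assumption):

```python
# Sketch: count selected terms per speaker in a debate transcript.
from collections import Counter
import re

counts = {"OBAMA": Counter(), "ROMNEY": Counter()}
speaker = None
for line in open("debate2.txt", encoding="utf-8"):  # assumed file/format
    m = re.match(r"(OBAMA|ROMNEY):", line)
    if m:
        speaker, line = m.group(1), line[m.end():]
    if speaker:
        counts[speaker].update(re.findall(r"[a-z']+", line.lower()))

for term in ("america", "people", "folks", "candy"):
    print(term, counts["OBAMA"][term], counts["ROMNEY"][term])
```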

From Specific Terms to General Linguistic Differences

Taking text analytics a step beyond the specifics to analyze emotion and linguistic measures of speech can also be interesting...

Volume and Complexity (Obama more complex - Romney more verbose)

In both debates, Romney spoke approximately 500 more words than Obama (7% and 6% more words, respectively); this greater talkativeness sometimes reflects more competitive/aggressive behavior. Obama, on the other hand, used more sophisticated language than Romney in the first debate (7% more words with 6 or more letters; see the chart presenting percentage differences in the use of certain types of language by the two candidates, with comparisons done separately for the first and second debates). However, he reduced the use of such language during the second debate.
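Both measures are easy to reproduce; a sketch, assuming one plain-text file per candidate per debate:

```python
# Sketch: total word count and share of "long" (6+ letter) words.
import re

def complexity(text):
    words = re.findall(r"[A-Za-z']+", text)
    long_share = sum(len(w) >= 6 for w in words) / len(words)
    return len(words), round(100 * long_share, 1)

text = open("obama_debate1.txt", encoding="utf-8").read()  # assumed file
print(complexity(text))  # (word count, % of words with 6+ letters)
```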

Past, Present and Future Tense (Obama explains past - Romney focuses on future)

Both candidates were equally likely to speak in the present tense. However, Obama was significantly more likely than Romney to speak in the past tense in both debates (55% and 18% more often, respectively). Romney, on the other hand, was more likely to speak in the future tense in both debates (60% and 34% more often). This contrast between past and future orientation is of course partly explained by their differing status: Obama's prior presidential experience and Romney's aspiration to be elected to this office in the future.

Personal Pronouns (Obama Collectivist - Romney Direct)

Whereas both candidates expressed an individualistic tone equally often in their speech (i.e., the frequency of 1st person singular pronouns, e.g., I, me, mine), Obama was more likely in both debates to use a collectivist tone (42% and 60% more, respectively). This use of 1st person plural pronouns (e.g., we, us, our) often suggests a stronger identification with a group, team or nation. In part this may coincide with Obama's slogan from the first election ("Yes, we can."), which may reflect collectivist rather than individualist values.

In the second debate, Romney used direct language more often than Obama, by addressing the president and/or the moderator. Romney was 57% more likely than Obama to use 2nd person pronouns (e.g., you, your), for instance in phrases like "Let me give you some advice. Look at your pension. You also have investments in Chinese companies (...)" or "Thank you Kerry for your question." Obama, on the other hand, reduced his use of such language from the first to the second debate (using 38% more direct language in the first debate as compared to the second).
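A LIWC-style pronoun tally can be sketched with simple word lists; the lists below are illustrative, not the dictionary used for this analysis:

```python
# Sketch: share of 1st singular, 1st plural and 2nd person pronouns.
import re

PRONOUNS = {  # illustrative word lists
    "1st singular": {"i", "me", "my", "mine", "myself"},
    "1st plural": {"we", "us", "our", "ours", "ourselves"},
    "2nd person": {"you", "your", "yours", "yourself"},
}

def pronoun_shares(text):
    words = re.findall(r"[a-z']+", text.lower())
    return {k: round(100 * sum(w in v for w in words) / len(words), 2)
            for k, v in PRONOUNS.items()}

print(pronoun_shares(open("romney_debate2.txt", encoding="utf-8").read()))
```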

Emotion (Obama more positive - Romney more negative)

The analysis of the emotional content of the debates revealed that the candidates' speech was often emotionally charged, but the focus on positive or negative affect differed between the candidates. Both candidates used positive emotions equally often in the first debate, and they used negative emotions equally often in the second debate.

The emotional tone of the candidates' speech could have had an important impact on how the audience perceived them. In particular, Romney's heavier use of negative affect in the first debate could have made voters pay more attention to him and possibly offer more support.

In the first debate, Romney used significantly more negative emotions in his speech (54% more often than Obama) and in particular he expressed more words pertaining to sadness (169% more often than Obama). Conversely, in the second debate, Obama's speech was significantly more likely to contain positive emotions than Romney's (12% more).

Complexity (Obama Ideas - Romney Details)

In both debates, Obama used cognitive language more often than Romney (10% and 13% more in the first and second debates, respectively). Cognitive language contains references to knowledge, awareness, thinking, etc. Obama was also more likely to use language pertaining to causation (75% and 30% more often in the first and second debates), and in the second debate he was also 47% more likely than Romney to express certainty in his speech. The latter may be partly reflective of Obama's more confident tone during the second debate, in which his performance was deemed better than in the first.

In this same debate, Romney was 47% more likely than Obama to make references to insight and sources of knowledge. Relatedly, in both debates Romney's speech indicated a greater insistence on numbers/quantitative data and details (75% and 65% more often).

General Issues Focus (Obama Society & Family - Romney Healthcare & Jobs)

Even though the topics discussed during the debates were prompted and moderated, some patterns of heavier focus on certain issues by the two candidates emerged. Romney made significantly more references to health issues than Obama did during the first debate (43% more). In the first debate, Romney was also more likely to mention occupational issues (26% more often) as well as achievement (36%). Obama, on the other hand, referred to social relationships and family significantly more often than Romney in both debates (social relationships - 9% and 6% more often; family - 104% and 138% more often). Both candidates referred to financial issues equally often in both debates, though this area was mentioned less often during the second debate.

Linguistic Summary (Key Differences by Speaker in Debates)

As mentioned earlier, whether the specific use of language by the two candidates was intentional or not, whether it was part of a candidate's tactics, or a mere reflection of character and demographic background, is unclear without deeper analysis by a domain expert. Nevertheless, some of the above linguistic differences may certainly have contributed to a candidate winning over more audience support in one or both of the debates. The diagram above presents in visual form which parts of speech differed significantly between the two candidates. Those marked in bold highlight speech categories that were used by a candidate significantly more often during only one of the debates, hinting at debate-specific language style. For instance, unique to the first debate was Obama's use of sophisticated language, whereas Romney relied more on negative emotions and sadness and focused more on health, occupational and achievement issues. These speech categories were not used significantly more often by either candidate in the second debate. In the latter debate, Obama relied more on positive emotions and certainty in his language, whereas Romney used more direct language and references to insight.

Conclusion (Negative VS Positive Emotion and Certainty Related to Specific Issues)

Debates are certainly a unique type of unstructured data. The debate follows a predetermined outline, is moderated, and we can assume both participants have invested time anticipating and practicing responses which their team believes will have the maximum possible effect for their side. To what extent the types of speech used were intentional, or simply related to the different questions and political positions of the candidates, is hard to say without further research and analysis.

However, if I were on either candidate's political team, I think even this rather quick text analysis would be useful. As the general consensus is that Romney performed better in the first debate and Obama in the second, a strategic recommendation might be for Romney to counter Obama's sophistication on certain issues with negativity, and to focus on areas where Obama seems to want to focus less, such as health care and jobs. Conversely, I might counsel Obama to counter Romney's negative emotion with even greater positive emotion when possible, and, should Romney continue to go into detail, to counter it with the certainty present in his own speech from debate #2.

Further analysis would be needed to better understand exactly what impact the various speech patterns had in the debates. That said, it seems some tactics known to be successful in social and business situations were used during the debates. For instance, Obama, by using more 1st person plural pronouns (e.g., we, our, us), may have identified better with the entire nation and thus created a feeling of unity and of shared goals and beliefs with the public.

This simple tactic has been used by managers and orators for a long time. Sometimes the use of more individualistic language may lead to too much separation and a loss of potential support. However, we also need to acknowledge that different strategies are successful for candidates at different stages. For instance, the negative emotions likely result from Romney's critique of the current state of affairs and of Obama's actions. Negative emotion here, in moderation, may well be an appropriate choice of language for someone aspiring to change things.

Conversely, for Obama, responding to and reflecting on his past 4 years in office using more positive affect is an obvious way of presenting his experience and work as president in a better light.

A very exciting line of further research could explore the candidates' facial expressions during the debates. These may match the findings from the text analysis (e.g., the amount of positive versus negative emotion) but may also reveal interesting discrepancies and tendencies. Such an analysis would be interesting because body language can be as important a source of information as spoken language, and it can be a very powerful tool in winning over support. This new avenue of research could help us understand which candidate received more support and whether it was influenced by the political attitudes, language, or body language of the candidates, or a combination of the three.

Ideally further analysis combining text analytics with other data from people meters, facial expressions, or other biometric measures could help answer some of these questions more definitively and provide insight into exactly how powerful language choice and style can be.

@TomHCAnderson @OdinText

PS. Special thanks to my colleague Dr. Gosia Skorek for indulging my idea and helping me run these data so quickly on a Saturday! ;)

[NOTE: There are several ways to text analyze this type of data. The power of text analytics depends on the volume and quality of data available, domain expertise, time invested and creativity of the analyst, as well as other methodological considerations on how the data can be processed using the software. Anderson Analytics - OdinText is not a political consultancy, and our focus is generally on much larger volumes of comments within the consumer insights and CRM domain. Those interested in more detail regarding the analysis may contact us at odintext.com]

Text Analytics World 2012

The Future of Text Analytics

I look forward to participating in the closing panel on the future of text analytics at Text Analytics World in Boston tomorrow afternoon (agenda here).

This year Text Analytics World dovetailed with the Predictive Analytics World event, so it should be a nice quantitative group.

Please come up and say hello if you'll be there. I look forward to catching up with other Next Gen Market Researchers working with unstructured data!

@TomHCAnderson @OdinText