The Movie “Arrival,” Text Analytics and Machine Translation
When I speak with prospective OdinText users who’ve been exposed to other text analytics software providers, I find they tend to mention and ask about things like POS tagging, taxonomies, ontologies, etc.
These terms come from linguistics, the discipline upon which many of the text analytics software platforms in the market today are predicated.
But you may be surprised to learn that as a basis for text analytics, linguistics is shockingly inefficient compared to approaches that rely on mathematics/statistics.
One of the most popular movies in theaters right now, “Arrival,” inadvertently makes this case rather well.
Understanding Alien Languages is Easy (Provided You’re Not a Linguist)
“Arrival” begins with a flock of spaceships touching down in locations around the world. Linguistics professor Louise Banks (Amy Adams) is then recruited to lead an elite team of experts in a race against time to find a way to communicate with the extraterrestrial visitors and avert a global war.
The film proceeds to build a lot of drama around a pretty minor problem of language analysis and translation, conveniently consuming several months during which the plot can thicken, when in fact the task of understanding an alien language like the one in the movie would be quite EASY.
I daresay in all modesty that I could have done this in a fraction of the time with OdinText and with a much smaller team than Adams’ character had!
It Only Takes a Few Words
In her first conversation with the aliens, Louise introduces herself by writing the word “human” on a little whiteboard she carries, to which the aliens respond by introducing themselves in their language.
After this initial exchange, in the real world, only a few more words would be necessary to start creating and applying a code book (a taxonomy or ontology in linguistics speak), which would allow one to quickly translate anything else said and to then communicate via a small, imperfect but highly effective vocabulary.
For example, a little later in the movie, one of the aliens tells Louise that another alien who is missing from their meeting that day is “in the death process,” which, of course, means the other alien is absent because he is dying.
Everyone in the audience gets what the alien means by “in the death process.” Indeed, communicating successfully with a small, imperfect vocabulary like this is far more efficient and reliable than one might assume. My two-year-old son and I are quite good at communicating in these sorts of two- or three-word phrases. And no part-of-speech tagging is necessary (nor would it be very helpful here).
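To make the code book idea concrete, here is a minimal sketch in Python. The alien tokens and their meanings are entirely made up for illustration (the film never gives a word list), and this is not how OdinText works internally; it simply shows how far a word-for-word lookup gets you without any grammar or tagging.

```python
# Hypothetical code book: these "alien" tokens are invented for illustration.
CODE_BOOK = {
    "zorp": "human",
    "mira": "in",
    "klee": "death",
    "vuun": "process",
}

def gloss(message, code_book):
    """Translate word by word; tokens missing from the code book stay in brackets."""
    tokens = message.lower().split()
    return " ".join(code_book.get(tok, f"[{tok}]") for tok in tokens)

print(gloss("mira klee vuun", CODE_BOOK))  # in death process
print(gloss("zorp taaq", CODE_BOOK))       # human [taaq]
```

The bracketed leftovers are the useful part: each one is a candidate for the next entry in the code book, which is how a small vocabulary grows with every exchange.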
I’ll come back to this idea of small, imperfect but surprisingly efficient vocabularies in a bit. But first, let’s consider a related but more challenging matter: breaking code.
How the Allies Used Text Analytics to Break the German Code
Using OdinText today, cracking the Nazi Enigma code (the breaking of which helped the Allies win WWII) would be only slightly more difficult than translating an alien language, and honestly not that much more difficult.
Why more difficult? Because unlike the aliens in “Arrival,” who actually want the humans to learn their language in order to communicate, the Nazis wanted their encrypted language to stay indecipherable.
In the 2014 movie “The Imitation Game,” Benedict Cumberbatch stars as Alan Turing, the genius British mathematician, logician, cryptologist and computer scientist who led the effort to crack the German code.
In contrast to “Arrival,” the drama in “The Imitation Game” centers on Turing’s determination to build a decryption machine, instead of attempting to decode Enigma by hand like every other scientist assigned to the task.
When his boss refuses to fund his machine’s construction, Turing writes to Churchill, who arranges the funding and names him team leader. Turing subsequently fires the key linguists from the project and the linguistic approach to this text analysis (i.e., code breaking) is chucked in favor of computational mathematics.
Turing’s machine is, of course, critical to the solution (though the technology is simple by today’s standards), but the real breakthrough happens when the scientists realize that the machine can be sped up by recognizing routinely used phrases like “Heil Hitler” (again providing a basic code frame or taxonomy).
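The value of a known phrase can be shown with a toy example. The sketch below is a deliberate simplification: it uses a simple letter-shift (Caesar) cipher, not Enigma, whose rotor mechanics are far more complex, and the sample message is invented. The function names are my own. What it illustrates is the principle of testing every candidate key and keeping only those that produce the expected phrase.

```python
import string

ALPHABET = string.ascii_uppercase

def shift_text(text, key):
    """Shift each letter A-Z by `key` positions; leave spaces and other characters alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            out.append(ch)
    return "".join(out)

def crack_with_crib(ciphertext, crib):
    """Try all 26 keys and keep the decryptions that contain the known phrase."""
    hits = []
    for key in range(26):
        candidate = shift_text(ciphertext, -key)
        if crib in candidate:
            hits.append((key, candidate))
    return hits

# Invented message, encrypted with key 7 for the demonstration.
secret = shift_text("WEATHER REPORT CLEAR SKIES HEIL HITLER", 7)
print(crack_with_crib(secret, "HEIL HITLER"))
```

With only 26 keys a brute-force search is trivial either way, but the same filter is what makes a search over a vastly larger key space tractable: the known phrase lets the machine discard wrong settings without a human reading each candidate.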
The Turing Test: Did You Know You Were Talking to a Computer?
In computer engineering classes on artificial intelligence there is an oft-mentioned thought experiment called “The Chinese Room,” which is used to think about the differences between human and computer cognition. It’s often referenced when discussing the Turing Test, which assesses computer intelligence based on whether a human being can distinguish between a computer’s and a human being’s replies to the same questions.
Let’s go back now to my earlier point about a small taxonomy being sufficient for communication. Keep in mind that today’s far more powerful computers running Google Translate or OdinText can process unstructured text data in any language orders of magnitude faster than any human or Turing’s machine. Given that, I think The Chinese Room analogy is not just an interesting AI thought experiment, but a good way to explain why translating the alien language in “Arrival” should have been so much easier than the film made it out to be.
The Chinese Room
Imagine for a moment a room with no windows, only a door with a small mail slot.
In the room, we find an average English speaker recruited randomly off the street, someone without any advanced education or background in foreign languages or linguistics.
This person has been paid to spend the day in this room and given a code book for a “squiggly language” he/she has been tasked with translating. In the story, it’s typically Chinese, but it could be any foreign language with which the person is totally unfamiliar. Let’s assume Chinese to stay close to the original story.
After giving him/her this code book—basically an English-to-Chinese/Chinese-to-English dictionary—we tell this person that on occasion we may pass them a note written in Chinese and that they will need to use the code book to figure out what the message means in English. Likewise, if they need anything—water, food, bathroom break, etc.—they will need to pass the request in a note written in Chinese back through the mail slot to us.
Note that this person has ABSOLUTELY NO TRAINING in the syntax or grammar of Chinese. His/her notes may be rudimentary, but certainly they will still be understood.
What’s more, if a native Chinese speaker walked by and observed the notes coming out, they would probably assume that there was a Chinese speaker in the room.
Now, instead of a code book, suppose the person in the room was using a computer program like Google Translate or OdinText, which can instantaneously translate or otherwise process any number of words coming out of the room, making it even more likely that the Chinese-speaking passerby assumes the person in the room speaks Chinese.
Think about this the next time you’re wondering whether data translated by machine—which is so much faster and cheaper than human translation—is sufficient for text analytics purposes (i.e. understanding what hundreds or hundreds of thousands of humans are saying in some foreign language).
My strong belief is yes, definitely. Whether I’m looking at Swedish or Chinese, I’m always rather impressed by how on point today’s computer translation is, and how irrelevant any nuance is, especially at the aggregate level, which is usually where we need to be.
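Here is a rough sketch of what “aggregate level” means in practice. The comments below are invented stand-ins for machine-translated feedback (the slightly off grammar is deliberate), and the code frame and function name are hypothetical. It is a simple keyword count, not OdinText’s method, but it shows why imperfect translation barely matters once you are counting themes rather than parsing sentences.

```python
import re
from collections import Counter

# Invented examples of machine-translated comments; awkward wording is intentional.
translated = [
    "Delivery was late and the box arrived damaged",
    "The delivery come too late, very unhappy",
    "Price is good but delivery late again",
    "Good price, friendly staff",
    "Staff was friendly, price good",
]

# Hypothetical code frame: each theme is triggered by any of its keywords.
code_frame = {
    "late delivery": {"late"},
    "price": {"price"},
    "staff": {"staff"},
}

def theme_counts(comments, frame):
    """Count how many comments mention each theme at least once."""
    counts = Counter()
    for comment in comments:
        tokens = set(re.findall(r"[a-z]+", comment.lower()))
        for theme, triggers in frame.items():
            if tokens & triggers:
                counts[theme] += 1
    return counts

print(theme_counts(translated, code_frame))
```

Note that “The delivery come too late” is counted just like its grammatical neighbors. The translation error never touches the keyword, so it has no effect on the totals.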
You don’t need a team of NASA scientists, nor a month to do it. You can have it ready by morning! The technology is already here!
To learn more about how OdinText can help you learn what really matters to your customers and predict real behavior here on Earth, please contact us or request a FREE demo using your own data!
Tom H. C. Anderson OdinText Inc. www.odintext.com
OdinText is a patented SaaS (software-as-a-service) platform for advanced analytics. Fortune 500 companies such as Disney and Shell Oil use OdinText to mine insights from complex, unstructured text data. The technology is available through the venture-backed Stamford, CT firm of the same name founded by CEO Tom H. C. Anderson, a recognized authority and pioneer in the field of text analytics with more than two decades of experience in market research. Anderson is the recipient of numerous awards for innovation from industry associations such as ESOMAR, CASRO, the ARF and the American Marketing Association. He tweets under the handle @tomhcanderson.