“Fine. You can stay out all night with your friends if you want.” If it’s your significant other speaking, you may want to tread lightly. Language isn’t easy. Scratch that – communication isn’t easy. People are complex, using multiple intricate cues to convey meaning. Words that seem ‘safe’ on their own may not convey the true emotion behind them.
While most of us native English speakers are able to use our senses (and common sense) to navigate complex communication patterns, those same patterns can be very challenging for non-native speakers – and I include compute platforms in that category. This problem domain is fascinating to all of us at Tetra Insights, because we believe Natural Language Processing (NLP), along with the strategic application of Machine Learning (ML), can go a long way toward helping us (and ultimately our customers) extract more value from qualitative data.
As a specialty at the intersection of linguistics, computer science and artificial intelligence, NLP is focused on helping computers process and analyze natural language data. NLP runs the gamut, from (relatively) simple word, keyword or phrase enumeration, to sentiment analysis (inferring the emotive tone of a block of text) and other complex, compute-intensive analytics. For these more complex NLP operations, platforms such as AWS, Google and IBM Watson leverage massive compute resources and voluminous training data to sharpen the ML behind their NLP services.
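To make the simpler end of that gamut concrete, here’s a minimal sketch of lexicon-based sentiment analysis using NLTK’s VADER model – one of many off-the-shelf options, and not necessarily what any of the platforms above use. Notice how a purely lexical model tends to score my opening line as neutral-to-positive, even though the intent may be anything but:

```python
# pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon VADER relies on

sia = SentimentIntensityAnalyzer()

# The opening line of this post: the words look harmless on paper.
text = "Fine. You can stay out all night with your friends if you want."

# Returns neg/neu/pos/compound scores; a word-level model has no way
# to pick up on the sarcasm, so the scores lean neutral-to-positive.
print(sia.polarity_scores(text))
```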
But there are limits to what current ML algorithms can do. Consider my opening, somewhat passive-aggressive statement. For any computer (including the human, organic kind), determining the correct (or ‘more correct’) meaning requires:
- As much context as possible (the more, the better).
- An understanding of the language structure being used (how the speaker is using language).
Machines and humans both require context and language understanding. If you join a conversation among friends midway, it’s easier to misunderstand what’s being said (you might not have all the context). If you walk into a conversation with people you don’t know, it’s doubly hard (with no knowledge of how they use language).
These challenges are real for any NLP service performing complex analysis (sentiment, speaker identification, etc.), and to be effective there’s no substitute for text, text and more text. From a larger block of text (more of a conversation), context and language structure can be built up to improve comprehension accuracy.
Given the ease with which we as humans can misunderstand one another, it can seem miraculous when a computer ‘gets it right’. But they can, and with constantly improving algorithms fed by large and dynamic learning sets, a computer’s ability to comprehend intended meaning keeps getting better. So much so that the emotive insights available through NLP services can now help our customers better understand what their customers are really saying.
But first, we must ensure we have quality textual data as input. Accurate transcription of audio and video (A/V) recordings – essential for meaningful analysis – can present both technical and tactical challenges. Technically, a qualitative data tool such as Tetra Insights must be format-agnostic (consuming most, if not all, popular A/V formats) while efficiently managing the storage, rendering and manipulation of those assets. Tactically, we must do it with fidelity and speed.
In this context, fidelity means transcribing A/V content as accurately as possible. Because NLP is applied to a transcript of the extracted audio, a low-accuracy transcription can, and likely will, degrade the quality (and usefulness) of the inferences derived from NLP processing.
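As a concrete illustration of the format-agnostic requirement above: one common approach is to normalize whatever A/V format a customer uploads into a clean, mono audio track before handing it to an ASR service. A minimal sketch using ffmpeg (the paths and parameters are illustrative, not our actual pipeline):

```python
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract a mono, 16 kHz WAV track from an A/V file for ASR input."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,   # source file (any container ffmpeg supports)
            "-vn",              # drop the video stream
            "-ac", "1",         # downmix to mono
            "-ar", "16000",     # 16 kHz sample rate, a common ASR default
            audio_path,
        ],
        check=True,
    )
```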
Fortunately, Automatic Speech Recognition (ASR) services are steadily improving their ability to transcribe audio with greater fidelity. The key metric for measuring that fidelity is the Word Error Rate (WER), which adds up substitutions (words replaced), insertions (extra words added) and deletions (words omitted), then divides by the number of words in the reference transcript.
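In code, WER is typically computed as a word-level edit distance between a human-verified reference transcript and the ASR output. A minimal sketch, assuming simple whitespace tokenization:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()

    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("out" -> "up") across four reference words: WER = 0.25
print(word_error_rate("stay out all night", "stay up all night"))
```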
WER is useful as a baseline measure of transcription accuracy, but it doesn’t tell the whole story. The duration, the number of speakers (context), and how language is used within the source audio (vocabulary) can all have a significant impact on an ASR service’s WER score. Indeed, if you’re comparing available ASR services, you’ll quickly notice that each vendor’s comparative performance data is ‘tuned’ (vocabulary used, text size, etc.) to highlight the strengths and mask the weaknesses of its solution.
We could dive much deeper into available ASR services (strengths and weaknesses), but we’ll stop and underscore these takeaways:
- Using NLP and ML to extract valuable emotive inference from A/V data will always be dependent on high fidelity transcriptions.
- High fidelity transcriptions will always be dependent on the quality (and quantity) of the source audio.
- Understanding how language is used (vocabulary) can help drive down a transcription’s WER score – see the sketch below.
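On that last point, most major ASR services let you supply a custom vocabulary of domain-specific terms that a generic language model is likely to miss. Here’s a hedged sketch against the AWS Transcribe API – the vocabulary name, job name, phrases and S3 URI are illustrative, not our production configuration:

```python
# pip install boto3
import boto3

transcribe = boto3.client("transcribe")

# Register domain-specific terms (product names, research jargon) the
# generic language model is likely to get wrong. Note: in practice you
# must wait for the vocabulary's status to become READY before using it.
transcribe.create_vocabulary(
    VocabularyName="research-session-terms",   # hypothetical name
    LanguageCode="en-US",
    Phrases=["Tetra", "usability", "wireframe", "onboarding"],
)

# Reference the custom vocabulary when starting a transcription job.
transcribe.start_transcription_job(
    TranscriptionJobName="session-42",                            # hypothetical
    Media={"MediaFileUri": "s3://example-bucket/session-42.wav"}, # hypothetical
    MediaFormat="wav",
    LanguageCode="en-US",
    Settings={"VocabularyName": "research-session-terms"},
)
```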
While we launched with AWS Transcribe as our transcription service, we’ve recently added support for Rev.ai (which touts better accuracy), and we’ve done it in such a way that adding further transcription services should be relatively quick. Ultimately, we like the idea of letting our customers choose which transcription service to use, so they can pick the best fit for their target audience.
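For the curious, the pattern behind “relatively quick” is simply to hide each vendor behind a common interface, so the rest of the product never cares which service produced the transcript. A minimal sketch – the class and method names are illustrative, not our actual code:

```python
from abc import ABC, abstractmethod


class TranscriptionService(ABC):
    """Common contract every vendor integration must satisfy."""

    @abstractmethod
    def transcribe(self, audio_uri: str, language: str = "en-US") -> str:
        """Submit audio and return the finished transcript text."""


class AwsTranscribeService(TranscriptionService):
    def transcribe(self, audio_uri: str, language: str = "en-US") -> str:
        ...  # start an AWS Transcribe job, poll for completion, return text


class RevAiService(TranscriptionService):
    def transcribe(self, audio_uri: str, language: str = "en-US") -> str:
        ...  # submit the job to Rev.ai, fetch the result, return text


def transcribe_recording(service: TranscriptionService, audio_uri: str) -> str:
    # Callers depend only on the interface, so adding a new vendor
    # (or letting a customer pick one) doesn't touch this code path.
    return service.transcribe(audio_uri)
```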
The first phase of Tetra Insights has focused on ensuring that our product has a solid foundation: the ability to efficiently manage A/V data and serve up high-fidelity transcriptions. Now we’re ready for our next phase: leveraging NLP and ML to increase the effectiveness and value of our analytics through automated emotive inference. Stay tuned – it’s going to be exciting!