Consumer-centric technology has made dramatic leaps forward in recent years. Voice recognition in particular has gained tremendous ground and reached a level of sophistication and usability that makes it a soon-to-be ubiquitous presence in our daily interactions with devices and services. This opens a whole new range of options for Pay-TV/OTT/VOD operators to let their customers discover and enjoy content more easily than ever before.
Voice recognition, or interacting through speech, means that natural language can be used to verbalize requests, confirm that they have been fulfilled and provide feedback. Using natural language makes interaction with devices simpler and more familiar.
These developments have prompted 3SS to develop a prototype system which takes advantage of this technology but is also inherently well positioned as part of our service portfolio for media companies. The result is CLBR, pronounced ‘CALIBER’.
What is CLBR?
CLBR stands for “Cognitive Lean Back Recommendation” system. The CLBR system “recommends” video content following a verbal request.
The request is analyzed by the system on two levels:
- Speech is transformed into text, and certain keywords are recognized as instructions for executing search queries.
- Speech is analyzed to determine the user’s current mood, based on the tone of the voice, sentence structure and phrasing.
Imagine the following scenario:
After a hard day, a user arrives home, tired and a bit downbeat. They lean back in their favorite chair and say, “Show me some movies with Robert De Niro”, in a tone suggestive of their state of mind.
Although this is an oversimplified scenario, CLBR acts on the uttered request by transforming the speech into text and using some of the words as parameters for executing a search query. A keyword search alone would return recommendations, but without accounting for the mood the user is in. Mood analysis supplies that missing information, so the recommendations gain an added level of relevance and insight that meets the needs of the user in that particular context and at that moment in time.
So the Robert De Niro film suggested would more likely be the uplifting The King of Comedy rather than The Deer Hunter.
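The mood-aware reranking idea behind this scenario can be sketched in a few lines of Python. The mood tags and catalogue entries below are illustrative assumptions, not part of CLBR’s actual data model:

```python
# Illustrative sketch of mood-aware reranking. The "moods" tags and the
# catalogue entries are hypothetical, not CLBR's real data model.

def rerank_by_mood(results, detected_mood):
    """Order search results so titles tagged with the user's mood come first."""
    # False sorts before True, so mood-matching titles move to the front.
    return sorted(results, key=lambda movie: detected_mood not in movie["moods"])

catalogue = [
    {"title": "The Deer Hunter", "moods": ["somber"]},
    {"title": "The King of Comedy", "moods": ["uplifting"]},
]

print(rerank_by_mood(catalogue, "uplifting")[0]["title"])  # The King of Comedy
```

The same keyword search thus yields a different top recommendation depending on the detected mood.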
CLBR System Architecture and Components
Image: High-level diagram of CLBR’s system architecture.
CLBR employs an array of components, some of which are custom developed and others commercially available.
They are as follows:
Amazon Echo
Amazon Echo is a smart speaker created by Amazon. It consists of a cylindrical speaker equipped with an array of microphones for capturing voice commands. The device connects to Amazon’s intelligent personal assistant service, Alexa, which gives Echo a large number of functionalities such as voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic and other real-time information. It can also control several smart devices, acting as a home automation hub.
In the context of CLBR, Amazon Echo is the device with which the user interacts via voice commands. Once Amazon Echo recognizes the uttered request, it passes it on to Amazon Alexa, as we will see later in this paper.
3SS Voice Enabled Web Client
It is not essential for CLBR to use Amazon Echo to communicate with Alexa. If Amazon Echo is unavailable, the user can communicate with Alexa using a custom 3SS-developed component: the 3SS Voice Enabled Web Client acts similarly to Amazon Echo by recording a voice command in a short audio file and passing it on to Amazon Alexa.
In addition to its recording capability, the 3SS Voice Enabled Web Client functions as the single point of presentation for the recommended content returned after the search query executes. This means that regardless of whether Amazon Echo or the 3SS Voice Enabled Web Client captures the voice commands, the 3SS Voice Enabled Web Client is where the search query results are made available to the user.
Amazon Alexa Voice Service
Amazon Alexa Voice Service is the intelligent personal assistant service created by Amazon that powers Amazon Echo. Alexa accepts the speech recording sent by the 3SS Voice Enabled Web Client and transforms it into text. Together with the Amazon Custom Skill Service, described below, Alexa is the component that receives the uttered request, transforms it and identifies the relevant keywords, or slots, to be used as parameters for assembling the information which, when aggregated, will comprise the response to the search query.
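As a rough illustration, slot extraction operates on an Alexa-style IntentRequest payload such as the one below. The intent name (FindMovies) and slot name (Actor) are hypothetical, not the actual names used in CLBR:

```python
# Extracting the keyword slot from an Alexa-style IntentRequest payload.
# The intent name "FindMovies" and slot name "Actor" are illustrative
# assumptions, not CLBR's actual skill definition.

def extract_actor(alexa_request):
    """Pull the actor slot value out of an Alexa IntentRequest body."""
    intent = alexa_request["request"]["intent"]
    return intent["slots"]["Actor"]["value"]

sample_request = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "FindMovies",
            "slots": {"Actor": {"name": "Actor", "value": "Robert De Niro"}},
        },
    }
}

print(extract_actor(sample_request))  # Robert De Niro
```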
Mediation Layer
The Mediation Layer is a custom-developed component that consists of two sub-components:
- Amazon Custom Skill Service
- Recommendation Service
Mediation Layer: Amazon Custom Skill Service
The Amazon Custom Skill Service is a custom-built sub-component which enables Amazon Alexa to recognize, interpret and execute the voice command.
Mediation Layer: Recommendation Service
The Recommendation Service is also a custom-built sub-component. It compiles the recommendation content to be sent to the 3SS Voice Enabled Web Client upon request.
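A minimal sketch of what such a compilation step might look like, assuming each data source returns a list of titles tagged with moods. The field names and sample data are illustrative, not CLBR’s actual interface:

```python
# Hypothetical sketch of the Recommendation Service's compilation step.
# The "moods" field and the per-source result lists are assumptions.

def compile_recommendations(source_results, detected_mood):
    """Merge result lists from several data sources, drop duplicate titles,
    and rank titles matching the detected mood first."""
    seen, merged = set(), []
    for results in source_results:
        for movie in results:
            if movie["title"] not in seen:
                seen.add(movie["title"])
                merged.append(movie)
    # Mood-matching titles sort to the front (False before True).
    return sorted(merged, key=lambda m: detected_mood not in m.get("moods", []))

imdb_results = [{"title": "Taxi Driver", "moods": ["gritty"]}]
tmdb_results = [{"title": "Taxi Driver", "moods": ["gritty"]},
                {"title": "Midnight Run", "moods": ["uplifting"]}]

print(compile_recommendations([imdb_results, tmdb_results], "uplifting"))
```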
Microsoft Cognitive Services
Microsoft Cognitive Services are a set of APIs, SDKs and services available to developers to make their applications more intelligent, engaging and discoverable. They expand on Microsoft’s evolving portfolio of machine learning APIs and enable developers to easily add intelligent features such as emotion and video detection; facial, speech and vision recognition; and speech and language understanding.
IBM Watson: Tone Analyzer
The IBM Watson™ Tone Analyzer service uses linguistic analysis to detect emotional, social, and language tones in written text.
Together with Microsoft Cognitive Services, IBM Watson: Tone Analyzer is used by CLBR to determine the user’s mood from the text into which the voice command has been transformed. Both receive requests from the Mediation Layer to analyze the text and they send back an analysis which is used for collating the recommendations.
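As a sketch, a Tone Analyzer-style response can be reduced to a single dominant mood label as shown below. The response shape follows the service’s documented JSON (a `document_tone` object containing scored `tones`), but the details here should be treated as an assumption:

```python
# Reducing a Tone Analyzer-style JSON response to one dominant mood label.
# The response shape follows the service's documented format, but this is
# an illustrative sketch, not CLBR's actual integration code.

def dominant_tone(tone_response):
    """Pick the highest-scoring tone from a Tone Analyzer-style response."""
    tones = tone_response["document_tone"]["tones"]
    return max(tones, key=lambda t: t["score"])["tone_id"] if tones else "neutral"

sample = {"document_tone": {"tones": [
    {"tone_id": "sadness", "score": 0.62},
    {"tone_id": "tentative", "score": 0.51},
]}}

print(dominant_tone(sample))  # sadness
```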
3rd party data sources, e.g. IMDb, YouTube, TheMovieDB
Any resource that stores video content and exposes it via an API can serve as a 3rd party data source for CLBR. The current version of CLBR supports the three sources mentioned above.
Step-by-Step working scenario:
- The user says: “Alexa, ask Caliber to show me movies starring Robert De Niro.”
- If the 3SS Voice Enabled Web Client is used (rather than Amazon Echo), the user’s voice is recorded into a short audio file.
- The audio file is sent to Amazon Alexa and transformed from speech to text.
- The text is interpreted as follows:
  - “Alexa, ask Caliber” invokes our Amazon Custom Skill.
  - “show me movies starring Robert De Niro” is the actual intent.
  - “Robert De Niro” is a keyword slot which is used for querying the 3rd party data sources.
- The text is also sent to Microsoft Cognitive Services and IBM Watson: Tone Analyzer, which determine the user’s mood from the written text and return scored results.
- All the data is amassed in the Mediation Layer, which assembles a complex list of parameters so that queries to the 3rd party data sources yield accurate results.
- The Web Client receives the results and displays them to the user.
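The steps above can be sketched end to end with every external service stubbed out as a plain function. Everything here is illustrative, not CLBR’s actual implementation:

```python
# End-to-end sketch of the scenario above. Alexa, the tone services and the
# data sources are stubbed as plain functions; all names and data are
# illustrative assumptions.

def stub_transcribe(audio):      # stands in for Alexa's speech-to-text
    return "show me movies starring Robert De Niro"

def stub_tone(text):             # stands in for Cognitive Services / Tone Analyzer
    return "somber"

def stub_search(actor):          # stands in for the 3rd party data sources
    return [{"title": "The Deer Hunter", "moods": ["somber"]},
            {"title": "Casino", "moods": ["tense"]}]

def handle_request(audio):
    """Mimic the Mediation Layer: transcribe, extract the slot, detect the
    mood, query the sources, and rank mood-matching titles first."""
    text = stub_transcribe(audio)
    actor = text.rsplit("starring ", 1)[-1]
    mood = stub_tone(text)
    results = stub_search(actor)
    return sorted(results, key=lambda m: mood not in m["moods"])

print(handle_request(b"...")[0]["title"])  # The Deer Hunter
```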
The CLBR system, together with complementary elements from Amazon, IBM and others, provides just a glimpse of what will be possible with voice recognition and mood interpretation. CLBR is constructed in a modular, extendable way that can accommodate a wide range of scenarios and integrations with other systems. As such, CLBR is an important and emerging addition to the portfolio of leading-edge products and services available to 3SS customers.
Author: Radu Curteanu, www.3ss.tv