Hands-Free Chatbot and Voice Transcription Using Amazon Sumerian

By Rani Lian | Posted November 21, 2019


In this article, you'll learn how to use Amazon Transcribe, Amazon Lex, and Amazon Polly to build a hands-free, voice-based chatbot experience in Amazon Sumerian. We use the recently released streaming transcription feature of Amazon Transcribe to stream the user's voice over a WebSocket connection and perform real-time voice transcription to display what the user said.

The approach described in this article doesn't require physical input, like pressing or holding a key to record your voice. Automatic Speech Recognition (ASR) keeps the user engaged and speeds up text-based input, especially on platforms where keyboard access is slow, difficult to use, or distracts the user from the experience.

Additionally, we introduce Wake Word support so that certain transcribed sentences are forwarded to Amazon Lex.

Check out a demo of the scene you will be building (there is no audio in this video).

You’ll learn about:

Prerequisites

What Is Included In The Assets You Downloaded?

We provide an AWS CloudFormation template that spins up the required resources to get you up and running quickly. The template does the following:

  1. Create an Amazon Cognito identity pool with IAM (Identity and Access Management) policies for Amazon Lex, Amazon Polly, and Amazon Transcribe streaming transcription.
  2. Output the Cognito identity pool ID.
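
For reference, here is roughly what that identity pool ID is used for when configuring the AWS SDK for JavaScript (v2) by hand; Sumerian performs the equivalent setup automatically once you enter the ID in the scene. The Region and pool ID below are placeholders:

// Sketch: configuring the AWS SDK with the identity pool ID from the stack output.
// Sumerian does this for you once the ID is set in the AWS Configuration component.
AWS.config.region = 'us-west-2'; // placeholder: the Region of your identity pool
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
  IdentityPoolId: 'us-west-2:00000000-0000-0000-0000-000000000000', // placeholder: the CognitoIdentityPoolID output
});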

We also provide the Sumerian scene bundle, which contains all of the resources required to run the scene. Though you can download the scripts separately, the bundle includes:

  1. The following scripts:
    • Script_TranscribeStreaming_GLOBAL.js: This script must be included either as a script dependency or as embedded code (the default). It provides a high-level API that takes care of the interaction with Amazon Transcribe streaming transcription. It is inspired by the Amazon Transcribe WebSocket Sample, and the code was generated using Browserify with the following source.
    • Script_TranscribeStreaming.js: This script interacts with the above high-level API to manage the lifecycle of the WebSocket connection to Amazon Transcribe, and asks the user for audio permissions. It also lets you configure, right from the Script component, the audio sample rate, the language code to be transcribed, and the buffer size used before audio is transmitted to Amazon Transcribe.
    • Script_TranscribeStreamingUI.js: This script manages updating the HTML UI and the audio oscilloscope/waveform visualizer inspired by Mozilla's voice-change-o-matic. It also lets you specify whether to use Amazon Polly to speak the responses of Amazon Lex, whether to use Wake Word support, and the Wake Word regular expression (regex).
  2. A visual state machine behavior that listens for an event and then forwards transcribed input to Amazon Lex.
  3. The HTML UI.

How Was It Built?

In this section we will walk through how we used each service in the architecture diagram below.

High-Level Architecture

Architecture

Amazon Cognito: Authentication and Configuration

When you first start the scene, Sumerian automatically authenticates with Amazon Cognito, provided that you set a Cognito identity pool ID. This is a requirement to run the provided sample properly. To learn more about setting up the Cognito identity pool ID in Amazon Sumerian, see Configuring AWS Credentials for Your Amazon Sumerian Scene.

Additionally, Sumerian automatically initializes the AWS SDK in the scene with the AWS Region of the Cognito identity pool. For example, if you create a Cognito identity pool in the US West (Oregon) us-west-2 AWS Region, the AWS SDK's Region defaults to us-west-2. This is important to know because this sample uses the default AWS SDK configuration (including AWS Region and AWS credentials) throughout the following sections. To learn more about how to change the AWS SDK default configuration, see Configuring the SDK for JavaScript. If you do modify the AWS Region, or use Amazon Sumerian or Amazon Cognito in a Region other than us-west-2, make sure that all of the AWS services used (Amazon Transcribe, Amazon Polly, and Amazon Lex) are available in that Region. See AWS Service Endpoints to learn more.
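
As a quick sanity check, you can inspect the defaults that Sumerian sets, and override the Region for a single service client without touching the global configuration. A minimal sketch, assuming the AWS SDK for JavaScript v2 that Sumerian exposes on window.AWS:

// Inspect the defaults Sumerian configured from the Cognito identity pool.
console.log(window.AWS.config.region);      // e.g. 'us-west-2'
console.log(window.AWS.config.credentials); // the Cognito credentials provider

// Override the Region for a single client; the global default stays untouched.
const polly = new window.AWS.Polly({ region: 'us-west-2' });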

Because the sample relies on communicating with other AWS services, the Cognito identity pool ID that is being set up by AWS CloudFormation has an unauthenticated role with permissions to communicate with Amazon Transcribe, Amazon Lex, and Amazon Polly.

Amazon Transcribe: Streaming and Transcribing a User’s Voice

When you click (on a PC) or tap (on a mobile device) the Start Transcribing button, we initiate a WebSocket connection to Amazon Transcribe streaming transcription using the StartStreamTranscription API call. To initiate the WebSocket connection, we need to presign the connection's URL based on the Cognito user's AWS credentials and the AWS Region that we want to connect to. Luckily, those properties are easy to get, and the provided sample demonstrates how you can use the prepopulated properties of the AWS SDK to retrieve the AWS Region and the AWS credentials. To learn more about how Sumerian automatically configures the AWS SDK, see the previous section: Amazon Cognito: Authentication and Configuration.
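
Because Cognito credentials are fetched lazily, they should be resolved before presigning. A minimal sketch of retrieving them with the AWS SDK for JavaScript v2 (the presigning itself is handled by the high-level API described below):

// Resolve the Cognito credentials before presigning the WebSocket URL.
window.AWS.config.credentials.get((err) => {
  if (err) {
    console.error('Unable to resolve credentials', err);
    return;
  }
  const { accessKeyId, secretAccessKey, sessionToken } = window.AWS.config.credentials;
  const region = window.AWS.config.region;
  // Hand these values to the presigner / high-level API.
});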

Note: At the time of this writing, the lifetime of the Amazon Transcribe streaming transcription WebSocket is limited to a maximum of four hours and allows up to one StartStreamTranscription operation per second. For more information, see Amazon Transcribe FAQ and Amazon Transcribe Limits.

When the WebSocket opens, we start recording the user's microphone in pulse code modulation (PCM). We encode with a default sample rate of 48000Hz, a default buffer size of 4096 samples, and a default language code of en-US. All of those properties are configurable in the Script component.

Script Component
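
For reference, here is a minimal sketch of how microphone audio can be captured as PCM in the browser with the Web Audio API. The handleAudioChunk callback is a placeholder for forwarding the samples to the high-level API shown later:

// Request a 48000Hz context (where the browser supports the sampleRate option).
const audioContext = new AudioContext({ sampleRate: 48000 });

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioContext.createMediaStreamSource(stream);
  // 4096 sample-frames per chunk, mono in, mono out.
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const pcm = event.inputBuffer.getChannelData(0); // Float32Array of samples
    handleAudioChunk(pcm); // placeholder: e.g. transcribeStreamingClient.sendData(pcm)
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
});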

Note: At the time of this writing, Amazon Transcribe streaming transcription supports six languages, each with its own maximum sample rate: en-US (48000Hz), es-US (48000Hz), en-AU (8000Hz), fr-FR (8000Hz), fr-CA (8000Hz), and en-GB (8000Hz). The higher the sample rate, the higher the quality of the audio transcription. If you modify the language or the sample rate in the Script component and transcription isn't working for you, check your browser's developer console to learn more about the error. Visit the Amazon Transcribe Developer Guide to learn more about streaming transcription.
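
Expressed as a simple lookup, the combinations supported at the time of writing are:

// Maximum supported sample rate (Hz) per language code at the time of writing.
const maxSampleRates = {
  'en-US': 48000,
  'es-US': 48000,
  'en-AU': 8000,
  'fr-FR': 8000,
  'fr-CA': 8000,
  'en-GB': 8000,
};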

Once we have a chunk of PCM-encoded audio with a supported sample rate, we forward it to Amazon Transcribe directly through the WebSocket, encoded in binary, and receive real-time transcription events, also encoded in binary. To learn more about the request and response format, see Using WebSocket Streaming. To simplify encoding the request, decoding the response, managing the lifecycle of the WebSocket, and more, we provide a high-level API that encapsulates this complexity. It is based on the Amazon Transcribe WebSocket Static Sample and uses Browserify to make it browser-compatible.

The high-level API provides three main functions. These can be found in the sample under a script called Script_TranscribeStreaming_GLOBAL. Add that script to a Script component and it automatically exposes the TranscribeStreaming constructor on the window object, allowing you to use it however you want.

// The following code was inspired by the sample provided here:
// https://github.com/aws-samples/amazon-transcribe-websocket-static
// Browserify was used to make it work in the browser.
// By executing the following code, it will automatically set a global
// window.TranscribeStreaming constructor available for you to use.

// ******
// Usage:
// ******

// region is a string representing the AWS Region you want to use.
// credentials is an object representing the AWS credentials you want to use to presign the WebSocket request.
// If using in Sumerian, make sure to set the Cognito identity pool ID!
const credentials = window.AWS.config.credentials;
const region = window.AWS.config.region;
// The last three arguments are optional and are shown here with their default values
// (languageCode, inputSampleRate, outputSampleRate).
const transcribeStreamingClient = new window.TranscribeStreaming(region, credentials, 'en-US', 48000, 48000);

// To open the WebSocket:
// onSocketOpen() fires when the WebSocket opens.
// onTranscriptionResponse(isPartial, transcript) fires every time an audio event has been transcribed.
// isPartial is true while the audio is still being transcribed (the user is still speaking); otherwise, false.
// transcript is the transcribed text.
// onTranscriptionError(err) fires when something goes wrong with the WebSocket or Amazon Transcribe.
transcribeStreamingClient.openSocket(onSocketOpen, onTranscriptionResponse, onTranscriptionError);

// To close the WebSocket:
// Note: the socket is automatically closed by Amazon Transcribe once you reach the four-hour limit.
// See https://docs.aws.amazon.com/transcribe/latest/dg/limits-guidelines.html for more information.
transcribeStreamingClient.closeSocket();

// To send audio event data:
// audioData is a Float32Array of audio samples.
transcribeStreamingClient.sendData(audioData);

We use the above high-level API to simplify the work required to transcribe the user's voice. As a result, when we receive a response from the WebSocket, we inspect the isPartial and transcript properties to perform our business logic.


Question: How do we know what the user has said?

onTranscriptionResponse will fire with two arguments: isPartial and transcript. The transcript argument is a string that contains the transcribed text of the user's voice.

Question: How do we know when someone has finished speaking?

isPartial is a Boolean property that is set to true while Amazon Transcribe streaming transcription has not finished transcribing the user's voice. This property is quite useful: when it's set to false, it tells us the user has finished saying a sentence.

Question: How do we keep track of the entire transcription during the lifetime of the WebSocket?

By using a combination of those two values, you can get creative with how to keep track of the entire transcription. In the sample, we maintain another property called fullTranscript and append the current transcript (plus a line ending) to it if and only if isPartial is false, meaning the user has finished speaking. As a result, we can differentiate between the sentence the user just said and the entire transcript accumulated during the lifetime of the WebSocket.
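
A minimal sketch of that bookkeeping (fullTranscript and updateTranscriptUI are illustrative names, not necessarily the exact ones used in the sample scripts):

let fullTranscript = '';

function onTranscriptionResponse(isPartial, transcript) {
  if (!isPartial) {
    // The user finished a sentence; fold it into the running transcript.
    fullTranscript += transcript + '\n';
  }
  // Show the latest (possibly partial) sentence alongside the full transcript.
  updateTranscriptUI(fullTranscript, transcript);
}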

Amazon Lex: ChatBot Integration and Wake Word Support

In addition to transcribing the user's voice, you might want to integrate with a chatbot service like Amazon Lex to perform some action based on what the user is saying, but perhaps not all the time. Amazon Alexa has a concept called a Wake Word: a word that the device listens for to activate itself (for example, "Alexa", "Amazon", or perhaps "computer"). This concept is useful to include in voice-based conversational experiences because it lets the user control when they interact with the chatbot. In this sample, we build an optional Wake Word capability that, when detected, forwards the user's input to Amazon Lex. It works as follows.

In the Script component, there is a property called Wake Word Regex under the Script_TranscribeStreamingUI script. By default, it has the following value: ^(Lex|Amazon|Sumerian). This regex asks: given a string, does it start with Lex, Amazon, or Sumerian? If it does, the sample takes the transcribed input, strips out the matched Wake Word, and forwards the rest to Amazon Lex as text input using the PostText API. Feel free to change the Wake Word or disable it.
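
A minimal sketch of the Wake Word check and the PostText call; the bot name, alias, and user ID are placeholders:

const wakeWordRegex = new RegExp('^(Lex|Amazon|Sumerian)');
const lexRuntime = new window.AWS.LexRuntime();

function handleSentence(sentence) {
  if (!wakeWordRegex.test(sentence)) {
    return; // No Wake Word detected; don't forward to Amazon Lex.
  }
  const inputText = sentence.replace(wakeWordRegex, '').trim();
  lexRuntime.postText({
    botName: 'BookTrip',     // placeholder: your bot's name
    botAlias: '$LATEST',     // placeholder: your bot's alias
    userId: 'sumerian-user', // placeholder: any stable per-user ID
    inputText,
  }, (err, data) => {
    if (err) { console.error(err); return; }
    // data.message holds the reply; data.dialogState drives the Wake Word toggle (see below).
  });
}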

When we receive a response from the PostText API, we display it to the user so that they can continue interacting with the chatbot.

Question: What if the chatbot requires additional information from the user to fulfill the intent? How can we continue to allow the user to provide more information without having to say the Wake Word again?

The answer lies in the response of the PostText API call for Amazon Lex. Specifically, the dialogState property tells us when the chatbot requires additional information and when the intent has been fulfilled. If Wake Word support is enabled, we use the value of that property to enable or disable the Wake Word requirement. As a result, the user can speak naturally to the chatbot without having to repeat the Wake Word every time it asks for additional information.
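
A sketch of that toggle, based on the dialogState values documented for PostText:

let wakeWordRequired = true;

function onLexResponse(data) {
  // ElicitSlot and ConfirmIntent mean the bot still needs input from the user,
  // so the next sentence is forwarded without requiring the Wake Word again.
  wakeWordRequired = !['ElicitSlot', 'ConfirmIntent'].includes(data.dialogState);
}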

In the provided sample, we connect to the BookTrip bot provided by Amazon Lex to demonstrate how easy the integration is. If you are curious to see how it works, follow the steps below to build the sample and say "Lex, make a reservation". Notice that you can continue providing additional information to the chatbot without having to say the Wake Word.

Lex Integration

Amazon Polly: Text to Speech

In the sample, we integrate with Amazon Polly to demonstrate how you can have it speak the output of the Amazon Lex chatbot, bringing the whole project together. This is especially useful if you have a Sumerian Host in your scene. However, it's disabled by default: if you are testing on a speaker system, the microphone might pick up the audio output as input alongside your voice and produce conflicting results. Use the Script component to enable this functionality, but make sure to use a headset or place the microphone far from the speakers so that what you hear isn't repeated back into the microphone.
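
If you do enable it, the integration boils down to something like the following sketch (the voice ID is a placeholder; any available Amazon Polly voice works):

// Speak the Amazon Lex reply using Amazon Polly (AWS SDK for JavaScript v2).
const polly = new window.AWS.Polly();

function speak(text) {
  polly.synthesizeSpeech({
    OutputFormat: 'mp3',
    Text: text,
    VoiceId: 'Joanna', // placeholder: any available Polly voice
  }, (err, data) => {
    if (err) { console.error(err); return; }
    const blob = new Blob([data.AudioStream], { type: 'audio/mpeg' });
    new Audio(URL.createObjectURL(blob)).play();
  });
}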

How to Create Your Own Sample

The remainder of this article focuses on creating this Sumerian scene in your own account.

Step 1: Create an Amazon Cognito Identity Pool

Before we get to editing a scene in Sumerian, we need to set up a Cognito identity pool with the proper permissions.

  1. Sign in to the AWS Console. Be sure to use an AWS Region that supports Amazon Polly, Amazon Lex, and Amazon Transcribe streaming transcription. For more information about availability, see the AWS Region Table.

  2. If you have not done so already, launch the Hands Free Voice Transcription AWS CloudFormation template.

  3. When the AWS CloudFormation creation page opens, enter a Stack name of your choosing and select the acknowledgement box at the bottom of the page. Then choose Create.

  4. Once the stack reaches CREATE_COMPLETE, expand the Outputs section and make a note of the CognitoIdentityPoolID value. You'll enter this value in Sumerian in the next step.

Step 2: Create the Sumerian Scene

In this step, we create a scene and use the provided scene bundle to import any necessary assets. Let’s get started on creating the scene.

  1. If you have not done so already, download the provided scene bundle.

  2. Navigate to the Sumerian Dashboard and choose Create New Scene. Give your scene a name and choose Create.

  3. After your new scene loads, choose Import Assets from the top bar, which opens the Asset Library.

  4. Once the Asset Library loads, look for the option to Import From Disk. Browse to where you downloaded the bundle on your machine, then add it.

    Note: This operation may merge the contents of your scene with the bundle. As such, after importing the bundle, make sure to delete any entities that existed prior to the import, especially the MainCamera entity.

  5. In the Entities panel on the left, choose the root entity at the top. On the right side of the screen, you will see the Inspector panel. Choose the AWS Configuration component and enter the Cognito identity pool ID you copied earlier into the box. Make sure to remove any trailing white space.