Data streaming mode with support for Protocol Buffers

Data streaming mode allows you to use the SpeechKit Cloud API to send data in chunks. In contrast to the POST HTTP API, data streaming lets you get intermediate (partial) results of speech recognition. The pauses in speech that trigger sending the final recognition results are detected automatically. This mode is suitable for recognizing speech transmitted as a stream.

As soon as the user begins talking, small chunks of speech are immediately sent to the server for recognition. The server begins to process the data and sends the client application intermediate results and final results of speech recognition for each chunk. The intermediate results are used for showing the user the progress of speech recognition.

Protocol

Data is exchanged over a special application layer protocol that is similar to HTTP. You need to establish a TCP connection and perform an “HTTP handshake”. After this, data is exchanged within the established session, and the TCP connection remains open.

Data format

If you are using the SpeechKit Cloud API in data streaming mode, binary (serialized) messages are exchanged. The Protocol Buffers technology is used to unify the serialization and deserialization of messages.

Unlike XML or JSON, Protocol Buffers messages are binary and are not meant to be read by the user. To define the structure of serialized data, use a protobuf file (a file with the .proto extension) that contains a description of this structure. For example:

message AddData
{
  optional bytes audioData = 1;
            
  required bool lastChunk = 2;
}
Restriction.

As the developer of Protocol Buffers, Google supports the technology for several programming languages (C++, Java, and Python). You can find a list of supported languages in the official documentation.

You must compile the protobuf file using a compiler for the chosen programming language. The result is a class containing field access methods, along with methods for serializing and deserializing data.

After the field values are set, a built-in method is used for serializing data to a byte array. As a result, you get the serialized protobuf message.
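
For example, a minimal Python sketch, assuming the AddData message above is saved in a file named, say, voiceproxy.proto and compiled with protoc --python_out=., which generates a voiceproxy_pb2 module (the file and module names here are illustrative):

import voiceproxy_pb2

# Fill in the fields of the generated class.
chunk = voiceproxy_pb2.AddData()
chunk.audioData = b'...raw audio bytes...'
chunk.lastChunk = False

# Serialize the message to a byte string (the serialized protobuf message).
payload = chunk.SerializeToString()

# Deserialization is the reverse operation.
parsed = voiceproxy_pb2.AddData()
parsed.ParseFromString(payload)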

For details on the rules for defining data structures and how protobuf data types map to data types in the resulting classes, see the official documentation.

Message format

The SpeechKit Cloud API exchanges messages in the following format: first comes the size of the serialized message in hexadecimal form, then the sequence \r\n, followed by the serialized protobuf message itself.

[hex of the message size]\r\n[serialized protobuf message]

The response message (from the server to the client) is transmitted in the same format as the message from the client to the server. The received protobuf message must be deserialized to a class object.
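
As an illustration, this framing can be implemented in Python with two small helpers over a connected socket. The helper names are illustrative, not part of the API; the later sketches in this section reuse them.

def send_protobuf(sock, payload):
    # Frame: hexadecimal size, then \r\n, then the serialized protobuf message.
    sock.sendall(hex(len(payload))[2:].encode('ascii') + b'\r\n' + payload)

def recv_protobuf(sock):
    # Read the hexadecimal size up to the \r\n separator.
    size_hex = b''
    while not size_hex.endswith(b'\r\n'):
        byte = sock.recv(1)
        if not byte:
            raise ConnectionError('connection closed by the server')
        size_hex += byte
    size = int(size_hex.strip(), 16)
    # Read exactly `size` bytes of the serialized protobuf message.
    body = b''
    while len(body) < size:
        part = sock.recv(size - len(body))
        if not part:
            raise ConnectionError('connection closed by the server')
        body += part
    return body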

For an example of sending data using sockets, see the section Usage example.

Flow

To get the recognized text, the client application using the SpeechKit Cloud API must complete the following steps.

Connecting to the speech recognition server

The client application connects to the SpeechKit Cloud server over the TCP protocol on the server port 80 (for HTTP access) or port 443 (for HTTPS access). We recommend using port 443 so that the HTTP connection is secure.

Next, the client sends a normal HTTP GET request with the Upgrade header:

GET /asr_partial HTTP/1.1
User-Agent: KeepAliveClient
Host: asr.yandex.net
Upgrade: dictation

The HTTP Upgrade header is used to declare that the client is prepared to use a different protocol.

The client receives a 101 response from the server. First line of the server response:

HTTP/1.1 101 Switching Protocols

This means that the handshake was successful and the server changed the data exchange protocol from HTTP to another protocol.

Now the client application and the speech recognition server can exchange binary messages. Messages are exchanged over the established TCP connection, which remains open.
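
A minimal connection and handshake sketch in Python, assuming HTTPS access on port 443; the sock object it creates is reused in the later sketches:

import socket
import ssl

# Open a TCP connection to the server and wrap it in TLS.
context = ssl.create_default_context()
sock = context.wrap_socket(socket.create_connection(('asr.yandex.net', 443)),
                           server_hostname='asr.yandex.net')

# Send the upgrade request and check the 101 status line.
sock.sendall(b'GET /asr_partial HTTP/1.1\r\n'
             b'User-Agent: KeepAliveClient\r\n'
             b'Host: asr.yandex.net\r\n'
             b'Upgrade: dictation\r\n\r\n')

reply = sock.recv(4096)
if b'101' not in reply.split(b'\r\n', 1)[0]:
    raise RuntimeError('handshake failed: %r' % reply)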

Sending request parameters

To send a request, create a protobuf message. The ConnectionRequest structure described below defines the request parameters.

Serialize the ConnectionRequest structure to a byte array and send the serialized protobuf message (see Message format). A sketch of filling in and sending the request follows the field descriptions below.

message ConnectionRequest
{
  optional int32 protocolVersion = 1 [default = 1];

  required string speechkitVersion = 2;

  required string serviceName = 3;

  required string uuid = 4;

  required string apiKey = 5;

  required string applicationName = 6;

  required string device = 7;

  required string coords = 8;

  required string topic = 9;

  required string lang = 10;

  required string format = 11;

  optional bool disableAntimatNormalizer = 18 [default = false];

  optional AdvancedASROptions advancedASROptions = 19;
}

message AdvancedASROptions
{
  optional bool partial_results = 1 [default = true];

  optional string biometry = 24;
}
Field descriptions:

  • protocolVersion: Version of the protocol. Use the default value.
  • speechkitVersion: Version of the server-side software. Leave an empty string.
  • serviceName: Name of the service. Allowed value: asr_dictation.
  • uuid: Universally Unique Identifier, a string of 32 hexadecimal characters (without hyphens).
  • apiKey: The API key.
  • applicationName: Name of the client application.
  • device: The type of device running the client application. For example, iphone.
  • coords: Coordinates of the device running the mobile application. If the coordinates are unknown, pass 0,0.
  • topic: The language model to use for recognition.
  • lang: The language for speech recognition.
  • format: The audio format, such as audio/x-speex. Allowed values are listed in the Content-Type header.
  • advancedASROptions: Advanced recognition options. With these options, the response contains biometric characteristics of the recognized speech (see Speech analytics) and the intermediate speech recognition results.
  • disableAntimatNormalizer: Disables the profanity filter for recognized speech. Acceptable values: true (profanities are not removed from recognized speech), false (profanities are excluded from the recognition results; default value).
  • partial_results: If true, intermediate speech recognition results are returned in the response. If false, the response contains only the final speech recognition results for the phrase or word (see Getting speech recognition results).
  • biometry: A detailed description of this parameter is given in the section Speech analytics.
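
A sketch of filling in and sending ConnectionRequest, assuming the messages above are compiled into the hypothetical voiceproxy_pb2 module and that the sock object and send_protobuf() helper from the earlier sketches are available; all field values below are placeholders:

import voiceproxy_pb2

request = voiceproxy_pb2.ConnectionRequest()
request.speechkitVersion = ''                       # leave an empty string
request.serviceName = 'asr_dictation'
request.uuid = '0123456789abcdef0123456789abcdef'   # 32 hexadecimal characters
request.apiKey = '<your API key>'
request.applicationName = 'sample-client'
request.device = 'desktop'
request.coords = '0,0'                              # coordinates unknown
request.topic = 'notes'                             # example language model
request.lang = 'ru-RU'                              # example recognition language
request.format = 'audio/x-speex'
request.advancedASROptions.partial_results = True

send_protobuf(sock, request.SerializeToString())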

Getting session parameters

In response to ConnectionRequest, the server sends ConnectionResponse with the response code and the session ID.

To deserialize the data, the client application must have a protobuf file describing the following structure:

message ConnectionResponse
{
  required ResponseCode responseCode = 1;

  required string sessionId = 2;

  optional string message = 3;

  enum ResponseCode {
    OK = 200;
    BadMessageFormatting = 400;
    UnknownService = 404;
    NotSupportedVersion = 405;
    Timeout = 408;
    ProtocolError = 410;
    InternalError = 500;
  }
}
Field descriptions:

  • responseCode: The response code. Possible values are listed in the ResponseCode enum:
      • 200 — OK (success).
      • 400 — Missing topic or language.
      • 404 — Missing or invalid service name.
      • 405 — Unsupported version specified.
      • 408 — Automatic logoff due to inactivity in the client application.
      • 410 — Audio data not transmitted (when lastChunk=true).
      • 429 — Invalid API key.
      • 500 — Speech recognition failed on the server due to an internal error.
  • sessionId: The session ID. Specify this ID when contacting tech support.
  • message: Error message text. Included in the response if the response code is something other than 200.
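
A sketch of reading and checking ConnectionResponse, reusing the hypothetical voiceproxy_pb2 module and the recv_protobuf() helper from the earlier sketches:

connection_response = voiceproxy_pb2.ConnectionResponse()
connection_response.ParseFromString(recv_protobuf(sock))

if connection_response.responseCode != 200:
    raise RuntimeError('connection refused: %d %s'
                       % (connection_response.responseCode,
                          connection_response.message))

# Keep the session ID in case you need to contact tech support.
print('session id:', connection_response.sessionId)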

Sending audio data

The server expects two types of messages from the client application: one ConnectionRequest message at the beginning, and then multiple AddData messages with audio data.

message AddData
{
  optional bytes audioData = 1;
            
  required bool lastChunk = 2;
}
Field descriptions:

  • audioData: Audio data.
  • lastChunk: Flag marking the last chunk of audio data. After the client sends an AddData message with lastChunk = true, the server forms the speech recognition results for any audio fragments that have been received but not yet processed and sends the response. After sending the last response, the server closes the connection.
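
A sketch of streaming audio in AddData messages, assuming audio is a bytes object in the format declared in ConnectionRequest and reusing the names from the earlier sketches; the chunk size is arbitrary:

CHUNK_SIZE = 4000   # bytes per AddData message, chosen arbitrarily

for start in range(0, len(audio), CHUNK_SIZE):
    chunk = voiceproxy_pb2.AddData()
    chunk.audioData = audio[start:start + CHUNK_SIZE]
    chunk.lastChunk = (start + CHUNK_SIZE >= len(audio))
    send_protobuf(sock, chunk.SerializeToString())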

Getting speech recognition results

You can get speech recognition results for a single word or several words. An utterance (or phrase) is a fragment of speech consisting of one or more words. The end of an utterance is defined as a period of silence lasting 1 second and 200 milliseconds.

Final speech recognition results are formed when the speech recognition system detects the end of an utterance. A single utterance may be divided across multiple AddData messages. Each AddData message the server receives increments the messagesCount counter, and the response contains the messagesCount value equal to the total number of AddData messages it covers. However, the speech recognition result is formed for each utterance individually, and only for one utterance at a time; the results for multiple utterances are not combined in a single response.

Intermediate speech recognition results are formed when the hypothesis changes. The intermediate result contains just one hypothesis for the entire utterance (without splitting it into words).

Intermediate results are sent when there is a pause in sending AddData. The client application can allow or prohibit the formation of intermediate results (the partial_results option), but whether and when they are sent depends on whether a hypothesis has been formed and when it was formed.

The AddDataResponse structure is shown below. Both intermediate and final results are contained in the normalized field in the Result class object. The final speech recognition result also contains separate hypotheses for each word (in the value field for the Word class).

message Word
{
  required float confidence = 1;
            
  required string value = 2;
}
            
message Result
{
  required float confidence = 1;
  
  repeated Word words = 2;
            
  optional string normalized = 3;
}
            
message AddDataResponse
{
  required BasicProtobuf.ConnectionResponse.ResponseCode responseCode = 1;
            
  repeated Result recognition = 2;
            
  optional bool endOfUtt = 3 [default = false];
            
  optional int32 messagesCount = 4 [default = 1];

  repeated BiometryResult bioResult = 6;
}
Field descriptions:

  • Word.confidence: Confidence in the hypothesis for the word.
  • Word.value: Recognition result for the word.
  • Result.confidence: Confidence in the hypothesis for the entire utterance.
  • Result.words: Words in the utterance.
  • Result.normalized: The normalized text. In normalized text, numbers are written as digits, and punctuation and abbreviations are included. For example, "September sixth nineteen ninety six" is shown as 06.09.1996.
  • responseCode: Code of the server response. Possible values:
      • 200 — OK (success).
      • 400 — Missing topic or language.
      • 404 — Missing or invalid service name.
      • 405 — Unsupported version specified.
      • 408 — Automatic logoff due to inactivity in the client application.
      • 410 — Audio data not transmitted (when lastChunk=true).
      • 429 — Invalid API key.
      • 500 — Speech recognition failed on the server due to an internal error.
  • recognition: A set of hypotheses in order of descending confidence.
  • endOfUtt: End of the utterance (phrase). If true, the recognition result contains the N-best list of speech recognition hypotheses. If false, the server returns intermediate results in the same structure as the final results, but without details for each word and with just one hypothesis. In other words, the response contains a single utterance.
  • messagesCount: The number of AddData messages that were combined. A single AddDataResponse is returned for several AddData messages.
  • bioResult: The result of analyzing the audio signal (see Speech analytics).
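
A sketch of reading AddDataResponse messages until the final result for an utterance arrives, reusing the names from the earlier sketches:

while True:
    result = voiceproxy_pb2.AddDataResponse()
    result.ParseFromString(recv_protobuf(sock))

    if result.responseCode != 200:
        raise RuntimeError('recognition error: %d' % result.responseCode)

    # Print partial hypotheses as they change, and the final N-best list.
    label = 'final' if result.endOfUtt else 'partial'
    for hypothesis in result.recognition:
        print('%s (%.2f): %s' % (label, hypothesis.confidence,
                                 hypothesis.normalized))

    if result.endOfUtt:
        break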

Closing the connection

After the client sends an AddData message with lastChunk = true, the server forms the speech recognition results for any audio fragments that have been received but not yet processed and sends the response. After sending the last response, the server closes the connection.

Note.

In the last message, the server sends only the results for the last utterance that had not yet been recognized, not the results for all the audio transmitted over the entire session.
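
Continuing the earlier sketches, once the response to the AddData message with lastChunk = true has been read, the client only needs to release the socket:

sock.close()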

Usage example

A simplified scheme for sending data using sockets is shown below.

import socket

#1 Create a TCP connection endpoint (socket).

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

#2 Establish a network connection with the server. Parameters: the host name asr.yandex.net and the destination port. Use port 80 for HTTP access or port 443 for HTTPS.

s.connect(('asr.yandex.net', 80))

#3 Send an HTTP handshake. The server sends the 101 response.

s.send(b'GET /asr_partial HTTP/1.1\r\n'
       b'User-Agent: KeepAliveClient\r\n'
       b'Host: asr.yandex.net:80\r\n'
       b'Upgrade: dictation\r\n\r\n')

#4 Pass request parameters via the socket, then send the audio data. All data is transmitted as follows: first the size of the serialized message (a hexadecimal number) is sent, then the \r\n sequence, then the serialized message itself.

s.send(hex(len(message))[2:].encode('ascii'))
s.send(b'\r\n')
s.send(message)

#5 Send the server the last data chunk and close the connection. To do this, pass lastChunk = true in the AddData message. After you receive the server's response, call the socket's close() method.

s.close()

There is a console application on GitHub that makes it possible to perform speech recognition on streamed audio. This is a Python application that transmits data using sockets.