Request format

To send a request for speech recognition, use the POST method over HTTPS.

POST /asr_xml?uuid=<user ID>&key=<API key>&topic=<language model>&lang=<language>&disableAntimat=<profanity filter> HTTP/1.1
Host: asr.yandex.net
Content-Type: <audio format>
Transfer-Encoding: chunked

(binary content of the audio file)
uuid

The unique user ID (Universally Unique Identifier) is a sequence of 32 hexadecimal digits (0–9, A–F) that identifies each user of the app or service. It is generated on the client. Example:

01ae13cb744628b58fb536d496daa1e6
Note.

Don't use the ID that is given as an example. The ID should be randomly generated.
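One way to generate such an ID on the client, sketched in Python (the standard `uuid4` already yields 32 random hexadecimal digits):

```python
# Generate a random user ID: 32 hexadecimal digits, as required above.
import uuid

user_id = uuid.uuid4().hex  # 32 lowercase hex digits, no hyphens
```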

key

API key. To get an API key, send a request to speechkit@support.yandex.ru.

topic

The language model to use for recognition.

Note.

The more precisely the model is selected, the better the recognition results. You can only specify one model per request.

lang (optional)

The language for speech recognition.

Allowed values: ru-RU — Russian, en-US — English, uk-UK — Ukrainian, tr-TR — Turkish.

Default value: ru-RU.

The topic values allowed for various lang settings are listed below.

disableAntimat (optional)

Disables the profanity filter for recognized speech. Acceptable values:

  • true — Profanities are not removed from recognized speech.

  • false — Profanities are excluded from recognition results.

    Default value: false.
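Put together, the query parameters described above form the request URL. A sketch in Python; the uuid and key values here are placeholders, not real credentials:

```python
# Assemble the /asr_xml request URL from the parameters described above.
# The uuid and key values below are placeholders, not real credentials.
from urllib.parse import urlencode

params = {
    "uuid": "9f1c2a0b4d5e6f708192a3b4c5d6e7f8",  # 32 random hex digits
    "key": "your-api-key",                       # obtained by email request
    "topic": "numbers",
    "lang": "ru-RU",
    "disableAntimat": "false",
}
url = "https://asr.yandex.net/asr_xml?" + urlencode(params)
```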

Content-Type header

The HTTP Content-Type header specifies the format of audio data. The table below shows the values supported by SpeechKit Cloud.

| Value of Content-Type | Format of audio data | Media container (file format) | Recognition simultaneously with audio transmission | Comments |
| --- | --- | --- | --- | --- |
| audio/x-wav | The WAV media container can contain audio data in any format, such as PCM. | WAV | No. Recognition begins only after all the data has been transmitted to the server. | Audio formats not listed in the table are converted to PCM on the server side. |
| audio/x-mpeg-3 | MPEG-1 Audio Layer 3 (MP3) | MP3 | No. Recognition begins only after all the data has been transmitted to the server. | The x-mpeg-3 value corresponds to MPEG-1 Audio Layer 3 (MP3). |
| audio/x-speex | Speex | OGG (.ogg, .spx) | Yes | Use the Speex audio codec and an OGG container. Make sure that the stream has valid OGG headers and that a broadband signal (16000 Hz) is used for encoding. |
| audio/ogg;codecs=opus | Opus | OGG (.ogg, .opus) | Yes | |
| audio/webm;codecs=opus | Opus | WebM (.webm) | Yes | |
| audio/x-pcm;bit=16;rate=16000 | Linear PCM with a 16,000 Hz sampling rate and 16-bit quantization. | PCM (.pcm) | Yes | Shows the most accurate recognition results. |
| audio/x-pcm;bit=16;rate=8000 | Linear PCM with an 8000 Hz sampling rate and 16-bit quantization. | PCM (.pcm) | Yes | |
| audio/x-alaw;bit=13;rate=8000 | A-law PCM with an 8000 Hz sampling rate and 13-bit quantization. | PCM (.pcm) | Yes | |

Attention.

The Content-Type header is required. If the header is empty, it is assumed that the message body contains ASCII-encoded plain text. To get recognition results, you must specify not only the type (audio), but also the subtype (x-wav), since there isn't a default subtype.

If you want to transmit audio in a format that isn't shown in the table, use a WAV container.

Note that the server converts the received audio data to mono PCM / 16-bit / 16 kHz.

Transfer-Encoding header

To begin speech recognition simultaneously with transmission of audio data, specify chunked in the HTTP Transfer-Encoding header and transmit data in the message body in chunks. This allows the server to process the audio before it receives all of the data. The recognition result is formed almost simultaneously with the end of audio transmission.

To transmit data in chunks, use transfer encoding. Speech recognition begins when the first section of data (the first chunk) has been sent. To finish sending the message and get the recognition result, the client application sends a zero-length chunk.

Intermediate recognition results aren't sent to the client application. They are necessary for forming the final recognition result on the server side. The recognized text is processed before sending: some punctuation is added (such as hyphens), and numbers are expressed as digits. This converted text is the final speech recognition result that is sent in the response body.

Note.

If you don't include the Transfer-Encoding: chunked header, the request must specify the message size in the Content-Length header. This is necessary so the server can find the end of the message. The value of Content-Length should not be greater than 1 MB (see the FAQ).

Note.

When sending wav and mp3 files, there is no reason to pass Transfer-Encoding: chunked, since recognition will begin only when the data is completely transmitted to the server, in any case. For this reason, we don't recommend using wav or mp3 containers except when debugging.

To recognize streamed audio, use streaming mode.
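A minimal sketch of chunked transmission with Python's standard http.client: when the body is an iterator and no Content-Length header is set, http.client applies chunked transfer encoding automatically (Python 3.6+). The file name, chunk size, and topic value here are illustrative:

```python
# Stream an audio file to the server in chunks so that recognition can
# start before the whole file has been transmitted.
import http.client

def audio_chunks(path, chunk_size=8192):
    """Yield the audio file as a sequence of chunks."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def recognize_stream(path, user_id, api_key):
    """Send audio with chunked transfer encoding and return the raw response."""
    conn = http.client.HTTPSConnection("asr.yandex.net")
    conn.request(
        "POST",
        f"/asr_xml?uuid={user_id}&key={api_key}&topic=numbers&lang=ru-RU",
        body=audio_chunks(path),  # iterator body -> chunked encoding
        headers={"Content-Type": "audio/x-pcm;bit=16;rate=16000"},
    )
    return conn.getresponse().read()
```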

Examples

POST /asr_xml?uuid=<user ID>&key=<API key>&topic=numbers&lang=ru-RU HTTP/1.1
Host: asr.yandex.net
Content-Type: audio/x-pcm;bit=16;rate=16000
Transfer-Encoding: chunked
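A comparable request can be sketched from Python with the standard library. For a WAV file the whole body is read up front, so http.client sets Content-Length itself and chunked transfer brings no benefit (see the note above); the file name and parameter values are placeholders:

```python
# Send a complete WAV file for recognition in a single request.
import http.client

def recognize_wav(path, user_id, api_key):
    """POST a whole WAV file; Content-Length is set automatically."""
    with open(path, "rb") as f:
        audio = f.read()  # entire body known up front
    conn = http.client.HTTPSConnection("asr.yandex.net")
    conn.request(
        "POST",
        f"/asr_xml?uuid={user_id}&key={api_key}&topic=numbers&lang=ru-RU",
        body=audio,
        headers={"Content-Type": "audio/x-wav"},
    )
    return conn.getresponse().read()
```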