Speech recognition

Automatic speech recognition (speech-to-text, or STT) is the process of turning speech into text. You can use SpeechKit Cloud to recognize spontaneous speech in the following languages:


  • Russian

  • English

  • Ukrainian

  • Turkish

Language models

SpeechKit has a two-stage approach to speech recognition. At the first stage, the audio signal is analyzed to detect sequences of sounds that could be interpreted as words. For each sequence of sounds, there are usually several possible words, or several hypotheses.

The second stage applies the language model in order to check each hypothesis in terms of the language's structure and the context. It evaluates how well each word fits in with the words previously uttered. The speech recognition system uses the language model as a dictionary for checking hypotheses. Creating this dictionary is a complex computational task that involves training neural networks.
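The two stages described above can be illustrated with a toy example. This is only a minimal sketch, not how SpeechKit is implemented: a small bigram model with add-one smoothing stands in for the neural language model, and the training text, the acoustic scores, and the function names (train_bigram_model, language_score, pick_best) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training text standing in for a topic-specific corpus.
TRAINING_TEXT = "call me at nine . call me at home . see you at nine"


def train_bigram_model(text):
    """Return a function giving the smoothed log-probability of word given prev."""
    words = text.split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    vocab_size = len(unigrams)

    def log_prob(prev, word):
        # Add-one smoothing so unseen word pairs get a small, non-zero probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    return log_prob


def language_score(words, log_prob):
    """Score how well each word fits with the word uttered before it."""
    return sum(log_prob(prev, word) for prev, word in zip(words, words[1:]))


def pick_best(hypotheses, log_prob, lm_weight=1.0):
    """Combine first-stage (acoustic) and second-stage (language) scores."""
    def combined(hypothesis):
        text, acoustic_score = hypothesis
        return acoustic_score + lm_weight * language_score(text.split(), log_prob)

    return max(hypotheses, key=combined)[0]


# Stage one output (assumed): similar-sounding hypotheses with acoustic log-scores.
hypotheses = [("call me at nein", -1.0), ("call me at nine", -1.2)]
best = pick_best(hypotheses, train_bigram_model(TRAINING_TEXT))
print(best)
```

Here the hypothesis with the slightly worse acoustic score wins, because the language model has seen "at nine" in its training text and has never seen "at nein".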

The neural network is trained on speech that is typical of a particular subject area. This is why each language model is optimized for recognizing speech related to a specific topic. For example, the Numbers model is best suited for recognizing phone numbers, whereas the Names model is the better choice for recognizing customers' first and last names.

The available language models are listed below.

The models are based on large datasets from Yandex services and applications. This is why we are able to continually improve the quality of automatic speech recognition.

Speech recognition quality

The accuracy of speech recognition depends on the quality of the incoming sound, the encoding quality, the rate and clarity of speech, and the complexity and length of phrases. The topic of the speech should also match the selected language model, because this makes the recognition results more accurate.

The speed of speech recognition depends on how the audio data is transmitted. If data is transmitted in chunks, speech recognition is performed while the data is being transmitted. In this case, the delay between the end of data transmission and receiving results is usually less than 1 second.
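As a rough illustration of chunked transmission, the sketch below splits an audio stream into fixed-size chunks that a client could send one by one. The helper name iter_chunks and the 4096-byte chunk size are assumptions made for this example; the actual way chunks are sent is defined by the request format, not by this code.

```python
import io


def iter_chunks(stream, chunk_size=4096):
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# 10,000 bytes of silence standing in for real audio data.
audio = bytes(10000)
chunks = list(iter_chunks(io.BytesIO(audio), chunk_size=4096))
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```

Because each chunk is available as soon as it is read, a client can start sending audio before the recording is complete, which is what allows recognition to run while the data is still being transmitted.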

The format of transmitted data is described in the Request format section. Note that SpeechKit Cloud converts the received audio data to mono 16-bit PCM with a 16 kHz sampling rate.
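To give an idea of what a conversion to mono 16-bit PCM at 16 kHz involves, here is a minimal sketch. It is not SpeechKit's converter: the function name to_mono_16k is hypothetical, the input is assumed to be interleaved signed 16-bit samples, and the resampling uses plain linear interpolation without the anti-aliasing filter that a production converter would normally apply.

```python
import array


def to_mono_16k(samples, channels, source_rate, target_rate=16000):
    """Downmix interleaved 16-bit samples to mono and resample to target_rate."""
    frames = len(samples) // channels
    # Average the channels of each frame to get a single mono sample.
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) // channels
        for i in range(frames)
    ]
    if source_rate == target_rate or not mono:
        return array.array("h", mono)

    out_len = frames * target_rate // source_rate
    out = array.array("h")
    for n in range(out_len):
        # Position of output sample n on the input time axis.
        pos = n * source_rate / target_rate
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, frames - 1)]
        # Linear interpolation between the two neighbouring input samples.
        out.append(int(round(mono[i] * (1 - frac) + nxt * frac)))
    return out


# Four stereo frames recorded at 32 kHz (left, right, left, right, ...).
stereo_32k = [100, 300, 200, 400, 300, 500, 400, 600]
converted = to_mono_16k(stereo_32k, channels=2, source_rate=32000)
print(list(converted))
```

Sending audio that is already close to this format avoids relying on the conversion step, although the formats that are actually accepted are those listed in the request format description.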