Speech recognition

Automatic speech recognition (speech-to-text, or STT) is the process of turning speech into text. SpeechKit makes it possible to recognize spontaneous speech in multiple languages.

Languages

  • Russian

  • English

  • Ukrainian

  • Turkish

Language models

SpeechKit uses a two-stage approach to speech recognition. In the first stage, the audio signal is analyzed to detect sequences of sounds that could be interpreted as words. Each sequence of sounds usually matches several possible words, which are treated as competing hypotheses.

The second stage applies the language model to check each hypothesis against the structure of the language and the context, that is, how consistent the word is with the words uttered before it. The speech recognition system uses the language model as a dictionary for checking hypotheses. Creating this dictionary is a complex computational task that involves training neural networks.
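The two stages can be illustrated with a minimal sketch. It is not SpeechKit's implementation: the candidate words, the probabilities, the toy bigram table, and the function name rescore are all assumptions made up for illustration. The first stage is represented by a list of candidate words per position, and the second stage picks the candidate that best fits the preceding word.

```python
import math

# Assumed stage 1 output: for each position, candidate words with
# made-up acoustic probabilities.
acoustic_hypotheses = [
    [("recognize", 0.6), ("wreck a nice", 0.4)],
    [("speech", 0.5), ("beach", 0.5)],
]

# Toy bigram language model with assumed values: P(word | previous word).
# "<s>" marks the start of the utterance.
bigram_prob = {
    ("<s>", "recognize"): 0.7,
    ("<s>", "wreck a nice"): 0.3,
    ("recognize", "speech"): 0.9,
    ("recognize", "beach"): 0.1,
    ("wreck a nice", "speech"): 0.2,
    ("wreck a nice", "beach"): 0.8,
}


def rescore(hypotheses, lm, floor=1e-6):
    """Greedily pick, at each position, the candidate with the highest
    combined acoustic and language-model log-probability."""
    previous = "<s>"
    result = []
    for candidates in hypotheses:
        best_word, _ = max(
            candidates,
            key=lambda c: math.log(c[1]) + math.log(lm.get((previous, c[0]), floor)),
        )
        result.append(best_word)
        previous = best_word
    return result


print(rescore(acoustic_hypotheses, bigram_prob))  # ['recognize', 'speech']
```

In the second position the acoustic scores are tied, so the choice between the two candidates is made entirely by the context, which is the role the language model plays in the description above.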

The neural network is trained on speech that is typical of a particular subject area. This is why each language model is optimized for recognizing speech on a specific topic. For example, the Numbers model is best suited for recognizing phone numbers, but if you need to recognize customers' first and last names, use the Names model.
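An application that handles several kinds of input might keep the choice of model in one place. The sketch below assumes this design. Only the model names Numbers and Names come from the text above; the task labels, the function choose_model, and the fallback behavior are hypothetical and are not part of the SpeechKit API.

```python
# Model names "Numbers" and "Names" come from the text; the task labels
# used as keys are hypothetical and chosen for this example only.
MODEL_FOR_TASK = {
    "phone_number": "Numbers",
    "customer_name": "Names",
}


def choose_model(task, default=None):
    """Return the language model name for a task, or `default` if the
    task has no dedicated model in the mapping."""
    return MODEL_FOR_TASK.get(task, default)


print(choose_model("phone_number"))   # Numbers
print(choose_model("customer_name"))  # Names
```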

The language models available for the Android and iOS platforms are listed below.

The models are based on large datasets from Yandex services and applications. This is what allows us to continually improve the quality of automatic speech recognition.

Speech recognition quality

The accuracy of speech recognition depends on the quality of the incoming sound, the encoding quality, the rate and clarity of speech, and the complexity and length of phrases. Recognition results are also more accurate when the topic of the speech matches the selected language model.
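To compare accuracy across recordings or language models, a common general-purpose metric is word error rate (WER): the word-level edit distance between a reference transcript and the recognized text, divided by the number of words in the reference. The text does not say which metric SpeechKit uses, so the sketch below is only one possible way to measure accuracy, and the function name and sample phrases are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("call five five five", "call five nine five"))  # 0.25
```

A lower value means fewer word-level mistakes; 0.0 means the recognized text matches the reference exactly.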

Speech recognition runs in real time, in parallel with the transmission of sound data. The delay between the end of data transmission and the delivery of recognition results is less than one second.