Frequently asked questions

General questions

  1. Who can use the features of Yandex SpeechKit?
  2. Can I use Yandex SpeechKit for commercial purposes?
  3. Who can get free access to SpeechKit?
  4. How do I get an API key?
  5. What is a UUID and how do I get one?
  6. Where can I use Yandex SpeechKit?
  7. We have a project that needs to use speech technologies. We are prepared to pay for the technology. Can you help?

Who can use the features of Yandex SpeechKit?

Anyone can use the features of Yandex SpeechKit.

Commercial use of SpeechKit Cloud is available for businesses.

Individuals can also use SpeechKit Cloud free of charge for private and social projects.

Free usage of SpeechKit Cloud is regulated by the user agreement.

Can I use Yandex SpeechKit for commercial purposes?

Commercial use of SpeechKit Cloud is available for businesses.

To switch your key to commercial use, fill out the application for setting up an agreement and send it to us at least 10 days before the end of the 30-day trial period.

For commercial use, the rate is 20 kopecks per request for speech synthesis or speech recognition. One request equals 20 seconds of audio. If a server call processes a shorter fragment, it is automatically rounded up to 20 seconds.

At the end of the billing period, an invoice is issued based on the actual number of requests sent. The minimum charge is 200 rubles.
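
For illustration only, here is a small sketch (in Python, with made-up request durations) of how an invoice could be estimated under these rules: each fragment is rounded up to a whole number of 20-second requests, billed at 20 kopecks each, and the total never falls below the 200-ruble minimum.

    import math

    RATE_RUB = 0.20           # 20 kopecks per 20-second request
    SECONDS_PER_REQUEST = 20
    MINIMUM_RUB = 200         # minimum charge per billing period

    def estimate_invoice(durations_sec):
        # Each audio fragment is rounded up to a whole number of 20-second requests.
        requests = sum(math.ceil(d / SECONDS_PER_REQUEST) for d in durations_sec)
        return requests, max(requests * RATE_RUB, MINIMUM_RUB)

    # Hypothetical month: fragments of 13, 25, and 61 seconds = 1 + 2 + 4 = 7 requests.
    print(estimate_invoice([13, 25, 61]))  # (7, 200) -- billed at the 200-ruble minimum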

Who can get free access to SpeechKit?

The SpeechKit Mobile SDK is free of charge if you make fewer than 10,000 requests per day.

The first month of using the SpeechKit Cloud API is free for everyone. During the free period, usage is limited to 1000 requests per day.

If you want to use automatic speech recognition in research or educational projects or for commercial purposes, write to us. Specify the key you are using and describe your project.

How do I get an API key?

To get a unique API key, send a request to speechkit@support.yandex.ru.

To confirm your request, we will ask you to tell us what tasks you are planning to perform and give us an estimate of the expected load.

What is a UUID and how do I get one?

A UUID (Universally Unique Identifier) is a user ID that must be unique for each user or device.

The developer generates the ID, typically at random. Pass it to our API as a string of 32 hexadecimal digits without hyphens.
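
For example, in Python the standard uuid module produces an identifier in exactly this form:

    import uuid

    # uuid4() is random; .hex renders it as 32 hexadecimal digits with no hyphens.
    device_uuid = uuid.uuid4().hex
    print(device_uuid)   # e.g. 0a1b2c3d4e5f60718293a4b5c6d7e8f9
    assert len(device_uuid) == 32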

Where can I use Yandex SpeechKit?

The SpeechKit library is currently used in Yandex mobile apps and services, as well as in projects of other app developers for iOS and Android.

SpeechKit is also integrated into industrial systems where automatic speech recognition is needed.

We have a project that needs to use speech technologies. We are prepared to pay for the technology. Can you help?

At the moment, SpeechKit Cloud is provided “as is”. We do not customize the technology on request.

SpeechKit Cloud features are described in the documentation.

Speech technologies

  1. What is a request?
  2. Where does speech recognition occur?
  3. What determines the quality of automatic speech recognition?
  4. Is it possible to improve the quality of recognition for a specific user?
  5. Why is the first result in the speech recognition list sometimes not the best one?
  6. What information about the speaker can you extract from a voice query?
  7. Can SpeechKit understand the meaning of recognized text?
  8. Can you create a new voice for text-to-speech?
  9. Is it possible to create new language models for speech recognition?
  10. We want to use SpeechKit to record meetings. Is this possible?
  11. We want to use SpeechKit to convert phone conversations or interviews to text and to flag certain words. Is this possible?

What is a request?

A request is a single call sent to the SpeechKit server using your API key.

For billing and statistics, a request is a unit equal to 20 seconds of recognized or synthesized speech.

To determine the number of requests, the actual length of each request is divided by twenty and rounded up.

For example, if a server call passes an audio clip that is 13 seconds long, it will count as one request.

Where does speech recognition occur?

Speech recognition occurs on Yandex servers.

What determines the quality of automatic speech recognition?

The quality of recognition depends on the quality of the incoming sound, the encoding quality, the rate and clarity of speech, and the complexity and length of phrases. The topic of a voice query is also important, since it should match the chosen language model as well as possible.

Is it possible to improve the quality of recognition for a specific user?

Recognition quality can be improved by improving the quality of the input audio.

The Yandex acoustic models are trained on hundreds of thousands of speech recordings from different people and account for differences in pronunciation and accents. The models are continually being retrained with fresh data, so there is probably no need to adapt the system to a particular user.

Why is the first result in the speech recognition list sometimes not the best one?

This is usually related to poor quality in the source audio or unintelligible speech. Even if there is a good recognition result in the list, sometimes the results aren't ranked appropriately.

What information about the speaker can you extract from a voice query?

Technology is available for detecting the speaker's gender, age, and emotional state. You can use the SpeechKit Cloud API to get information about the speaker's gender and age group, and to detect the language.

Can SpeechKit understand the meaning of recognized text?

Yes. We are developing technologies for understanding natural speech (Natural Language Understanding). These features are currently used in internal Yandex services and are not licensed for external customers.

Can you create a new voice for text-to-speech?

Yes. Our statistical approach means that we can create new voices quickly. To create a voice, we just need to record a few hours of the voice talent's speech.

Is it possible to create new language models for speech recognition?

We are not currently developing additional language models on request.

We want to use SpeechKit to record meetings. Is this possible?

Recognizing long audio files is a difficult task. We do not currently accept projects like this.

We want to use SpeechKit to convert phone conversations or interviews to text and to flag certain words. Is this possible?

No. SpeechKit is designed to recognize short speech fragments that are a maximum of 30 seconds long.

SpeechKit functionality can't be used for speech analysis tasks such as identifying specific words, evaluating the emotional tone of conversations, or matching a conversation to a script.

SpeechKit Cloud API

  1. What is the request volume that SpeechKit Cloud can handle?
  2. Is there a backup SpeechKit Cloud server in case the main one goes down?
  3. Automatic speech recognition takes a long time. Why?
  4. The server returns a 500 error when recognizing certain utterances.
  5. Can I use this system for processing datasets of previously recorded audio?
  6. Can I deploy the system locally, and how do I do this?
  7. Server response: "Content-size limit reached!". What does that mean?
  8. What is the maximum length of an audio fragment for speech recognition?
  9. How many decimal places can I specify for the "speed" parameter?

What is the request volume that SpeechKit Cloud can handle?

SpeechKit Cloud processes millions of voice queries from Yandex users daily, along with requests from external services and applications that use our technology.

Each key has a restriction on the maximum number of requests per day. If you need to process a large number of requests in parallel, write to us at voice@support.yandex.ru. In your message, specify your API key and the volume of requests that you anticipate.

Is there a backup SpeechKit Cloud server in case the main one goes down?

API requests are handled by a load balancer that automatically distributes them to different servers. The SpeechKit infrastructure was designed for high loads from the start, so the system is quite reliable.

Automatic speech recognition takes a long time. Why?

This might happen if the input contains an audio fragment that is too long.

If a 20-second fragment of speech is taking longer than 7-8 seconds to process, write to us.

The server returns a 500 error when recognizing certain utterances.

Send us a message. We'll try to solve this issue.

Can I use this system for processing datasets of previously recorded audio?

SpeechKit supports recognition of pre-recorded speech, but this technology is designed for short fragments of speech. If a recording is more than 30 seconds long, the quality is reduced.

At this time, SpeechKit is not suitable for recognizing recorded phone conversations, interviews, or other long recordings.

Can I deploy the system locally, and how do I do this?

You can. If you have a project like this, write to us.

Server response: "Content-size limit reached!". What does that mean?

You have exceeded the maximum size of an audio fragment that can be transmitted in one POST request.

To transmit audio files larger than 1 MB over HTTP, use chunked transfer encoding.

What is the maximum length of an audio fragment for speech recognition?

If you are sending a POST request that contains an audio fragment in its entirety, the maximum size of the audio file is 1 MB.

If you need to recognize speech from a larger file, send the data in parts using chunked transfer encoding or data streaming mode.

If these restrictions aren't acceptable for your purposes, write to us.
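
As a sketch only, the snippet below sends an audio file in parts with the Python requests library: passing a generator as the request body makes the library use chunked transfer encoding. The endpoint URL, query parameters, and content type shown here are assumptions for illustration; take the actual values from the API reference.

    import requests

    # Placeholder endpoint and parameters -- see the SpeechKit Cloud API reference.
    ASR_URL = "https://asr.yandex.net/asr_xml"
    PARAMS = {
        "key": "YOUR_API_KEY",
        "uuid": "0123456789abcdef0123456789abcdef",
        "topic": "queries",
        "lang": "ru-RU",
    }

    def audio_chunks(path, chunk_size=64 * 1024):
        # Yield the file in small pieces; requests then sends the body
        # with Transfer-Encoding: chunked instead of one large POST body.
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk

    response = requests.post(
        ASR_URL,
        params=PARAMS,
        data=audio_chunks("query.pcm"),
        headers={"Content-Type": "audio/x-pcm;bit=16;rate=16000"},
    )
    print(response.status_code)
    print(response.text)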

How many decimal places can I specify for the "speed" parameter?

You can specify the value to two decimal places (hundredths), but in practice this precision isn't needed. For instance, the difference between a speed of 0.7 and a speed of 0.8 is so small that the human ear can barely distinguish between them.
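
For illustration, a synthesis call that sets the speed to hundredths precision might look like this in Python. The endpoint and parameter names here are assumptions; check the API reference for the exact values.

    import requests

    # Placeholder endpoint and parameters -- consult the SpeechKit Cloud API reference.
    TTS_URL = "https://tts.voicetech.yandex.net/generate"
    PARAMS = {
        "key": "YOUR_API_KEY",
        "text": "Hello, world",
        "lang": "en-US",
        "format": "mp3",
        "speed": "0.75",   # hundredths are accepted, though 0.7 vs 0.8 is already hard to hear
    }

    audio = requests.get(TTS_URL, params=PARAMS)
    with open("result.mp3", "wb") as f:
        f.write(audio.content)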