About the reference guide

Yandex SpeechKit is a multi-platform library for integrating speech functionality in your mobile apps with minimal effort. The ultimate goal of SpeechKit is to provide users with virtually the entire range of speech functionality available to Yandex.

In this release, Yandex SpeechKit offers server-side speech recognition functionality for short voice queries (Russian and Turkish are supported), server-side speech synthesis (only Russian is supported), and client-side voice activation (only Russian is supported). The list of supported languages, mobile platforms, and functionality will be expanded in future releases.


SpeechKit architecture

Multi-platform approach

The SpeechKit library supports several mobile platforms using a single implementation of the core logic. The platforms differ only in the platform abstraction layer (audio recording, networking, etc.), API wrappers, and platform-specific components such as the GUI implementation. This approach simplifies development for multiple platforms and keeps functionality consistent across them.

Mobile platforms differ in their culture and development practices. This affects aspects such as the naming of classes and methods, object instantiation, error handling, and so on. These differences are inevitably reflected in the library's interface on each platform. We try to minimize them while also making sure that SpeechKit fits naturally into the ecosystem of each supported platform.

Components

SpeechKit contains components for each of the technologies provided, as well as a GUI for speech recognition and service components for initializing internal mechanisms.

Regardless of which component you choose, you must first configure SpeechKit using the following class:

  • YSKSpeechKit is a class for configuring the library and managing its operation.

YSKSpeechKit

YSKSpeechKit — A class for configuring and managing SpeechKit.

Before using any of the SpeechKit functionality, you must configure it using configureWithAPIKey: or configureWithAPIKey:andLocationProvider:.

By default, SpeechKit uses geolocation to obtain the user's current coordinates. This can improve speech recognition quality in some cases, such as when using the YSKRecognitionModelMaps language model. To disable it, pass nil as the second argument of the configureWithAPIKey:andLocationProvider: method (instead of an instance of YSKLocationProvider).
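
For example, a typical application configures SpeechKit once at startup. In the sketch below, the configureWithAPIKey:andLocationProvider: selector and the nil argument for disabling geolocation come from this guide, while the shared-instance accessor and the header path are assumptions.

    #import <UIKit/UIKit.h>
    #import <YandexSpeechKit/SpeechKit.h> // header path assumed

    @interface AppDelegate : UIResponder <UIApplicationDelegate>
    @end

    @implementation AppDelegate

    - (BOOL)application:(UIApplication *)application
        didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
        // Configure SpeechKit once, before using any other component.
        // Passing nil instead of a YSKLocationProvider instance disables
        // geolocation; the sharedInstance accessor is an assumption.
        [[YSKSpeechKit sharedInstance] configureWithAPIKey:@"<your API key>"
                                       andLocationProvider:nil];
        return YES;
    }

    @end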

YSKInitializer

Initialization is the process SpeechKit uses to prepare its internal mechanisms. It may require lengthy reads from permanent storage or network access, and generally takes a significant amount of time. This is why the YSKInitializer class was introduced: it lets you perform initialization at a convenient time.

In the current implementation, YSKInitializer sends a request to the server (the “startup request”) and gets a set of parameters and configurations in response (such as the audio format or the parameters of the voice activity detection algorithm), which are then used during speech recognition.

Note. Users do not have to perform initialization explicitly. If it has not yet been done, SpeechKit initializes itself automatically when the first request for speech recognition or synthesis is received. So YSKInitializer is used mainly in order to speed up the execution of the first request.

YSKInitializer uses the YSKInitializerDelegate interface to notify you when it starts and finishes (with or without errors).
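
A minimal warm-up sketch follows. The YSKInitializer class and YSKInitializerDelegate protocol are described above; the creation and start selectors and the delegate method names are hypothetical.

    #import <Foundation/Foundation.h>

    // Warm up SpeechKit so the first recognition or synthesis request
    // runs faster.
    @interface SKWarmup : NSObject <YSKInitializerDelegate>
    - (void)run;
    @end

    @implementation SKWarmup {
        YSKInitializer *_initializer;
    }

    - (void)run {
        _initializer = [[YSKInitializer alloc] init]; // initializer assumed
        _initializer.delegate = self;                 // property name assumed
        [_initializer start];                         // selector assumed
    }

    // Hypothetical callbacks matching "starts and finishes (with or
    // without errors)":
    - (void)initializerDidStart:(YSKInitializer *)initializer {
        NSLog(@"SpeechKit initialization started");
    }

    - (void)initializerDidFinish:(YSKInitializer *)initializer {
        NSLog(@"SpeechKit initialization finished");
    }

    - (void)initializer:(YSKInitializer *)initializer didFailWithError:(YSKError *)error {
        NSLog(@"SpeechKit initialization failed: %@", error);
    }
    @end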

YSKSpeechRecognitionViewController

This class is an iOS view controller designed to simplify integrating SpeechKit speech recognition into an application. YSKSpeechRecognitionViewController returns the string uttered by the user and handles any problems that occur along the way. It manages the entire recognition process, including the user interface for speech recognition, management of the YSKRecognizer and YSKInitializer objects, and so on.
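
A minimal sketch of presenting this controller from your own view controller is shown below; the parameterless initializer and the delegate property for receiving the result are assumptions, only the class name comes from this guide.

    // Inside a UIViewController subclass that shows the recognition UI:
    - (void)startRecognition {
        YSKSpeechRecognitionViewController *recognitionVC =
            [[YSKSpeechRecognitionViewController alloc] init]; // initializer assumed
        recognitionVC.delegate = self; // result-delivery mechanism assumed
        [self presentViewController:recognitionVC animated:YES completion:nil];
    }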

YSKRecognizer

YSKRecognizer is the central speech recognition component in SpeechKit. It is intended for single speech recognition sessions and manages the entire recognition process, including recording audio, detecting speech activity, communicating with the server, and so on. YSKRecognizer uses the YSKRecognizerDelegate interface to report important events in the recognition process, return recognition results, and report errors.

The recognition result is represented by the YSKRecognition class, which is the N-best list of recognition hypotheses, sorted by confidence in descending order. A recognition hypothesis, in turn, is represented by the YSKRecognitionHypothesis class.

Errors that occur during the recognition process are described by the standard YSKError mechanism.
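
Put together, a single recognition session might look like the following sketch. The class and protocol names and the descending confidence order of hypotheses come from this guide; the initializer signature, the selectors, the hypotheses property, and the delegate method names are assumptions.

    @interface SKQueryRecognizer : NSObject <YSKRecognizerDelegate>
    - (void)startSession;
    @end

    @implementation SKQueryRecognizer {
        YSKRecognizer *_recognizer;
    }

    - (void)startSession {
        // A fresh YSKRecognizer is created for each recognition session.
        _recognizer = [[YSKRecognizer alloc]
            initWithLanguage:@"ru-RU"
                       model:YSKRecognitionModelMaps]; // signature assumed
        _recognizer.delegate = self;
        [_recognizer start]; // selector assumed
    }

    // Hypothetical callback delivering the N-best list:
    - (void)recognizer:(YSKRecognizer *)recognizer
          didCompleteWithRecognition:(YSKRecognition *)recognition {
        // Hypotheses are sorted by confidence in descending order, so the
        // first element is the best guess.
        for (YSKRecognitionHypothesis *hypothesis in recognition.hypotheses) {
            NSLog(@"hypothesis: %@", hypothesis);
        }
    }

    - (void)recognizer:(YSKRecognizer *)recognizer didFailWithError:(YSKError *)error {
        NSLog(@"recognition failed: %@", error);
    }
    @end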

YSKVocalizer

YSKVocalizer is the main speech synthesis component in SpeechKit. It is intended for single speech synthesis sessions and manages the entire text-to-speech process, including producing audio, communicating with the server, and so on.

YSKVocalizer uses the YSKVocalizerDelegate interface to report the main events in the speech synthesis process, return synthesis results, and report errors.
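
A single synthesis session might look like the sketch below; apart from the class and protocol names, the initializer, the selectors, and the delegate method names are assumptions.

    @interface SKSpeaker : NSObject <YSKVocalizerDelegate>
    - (void)sayText:(NSString *)text;
    @end

    @implementation SKSpeaker {
        YSKVocalizer *_vocalizer;
    }

    - (void)sayText:(NSString *)text {
        // A fresh YSKVocalizer is created for each synthesis session.
        _vocalizer = [[YSKVocalizer alloc] initWithText:text]; // initializer assumed
        _vocalizer.delegate = self;
        [_vocalizer start]; // selector assumed
    }

    // Hypothetical callbacks:
    - (void)vocalizerDidFinishSynthesis:(YSKVocalizer *)vocalizer {
        NSLog(@"synthesis and playback finished");
    }

    - (void)vocalizer:(YSKVocalizer *)vocalizer didFailWithError:(YSKError *)error {
        NSLog(@"synthesis failed: %@", error);
    }
    @end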

YSKPhraseSpotter

YSKPhraseSpotter continuously analyzes an audio stream to detect specific phrases in it. It does not require an internet connection; all computations are performed on the device. To search for phrases in an audio stream, you need a model that contains the pronunciation of these phrases.

To start working with YSKPhraseSpotter, you must specify the model and the object that will receive notifications. After this, you can stop and start phrase detection without re-initialization.
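
A sketch of this flow is shown below. Only the YSKPhraseSpotter class name and the stop/start-without-re-initialization behavior come from this guide; the static-style API, the selectors, and the delegate protocol and method names are all assumptions.

    @interface SKActivationListener : NSObject <YSKPhraseSpotterDelegate> // protocol name assumed
    - (void)listen;
    @end

    @implementation SKActivationListener

    - (void)listen {
        // Specify the model containing the phrase pronunciations and the
        // object that receives notifications, then start detection.
        [YSKPhraseSpotter setModelPath:@"phrase-spotter/model"]; // selector assumed
        [YSKPhraseSpotter setListener:self];                     // selector assumed
        [YSKPhraseSpotter start];
        // Detection can later be stopped and restarted without
        // re-initialization:
        //   [YSKPhraseSpotter stop];
        //   [YSKPhraseSpotter start];
    }

    // Hypothetical notification for a detected phrase:
    - (void)phraseSpotterDidSpotPhrase:(NSString *)phrase {
        NSLog(@"detected phrase: %@", phrase);
    }
    @end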