About the reference guide

Yandex SpeechKit is a multi-platform library for integrating speech functionality in your mobile apps with minimal effort. The ultimate goal of SpeechKit is to provide users with the entire range of Yandex speech technologies.

SpeechKit architecture

The SpeechKit library supports several mobile platforms using the same implementation of the basic logic. The platforms differ only in the platform abstraction layer (recording audio, networking, etc.), API wrappers, and platform-specific components such as the GUI implementation. This approach simplifies development for multiple platforms and keeps functionality consistent between them.

Mobile platforms differ in their culture and development practices. This affects such aspects as naming of classes and methods, object instantiation, error handling, and so on. We try to minimize these differences while also making sure that SpeechKit fits naturally into the ecosystem of each of the supported platforms.

Languages

SpeechKit lets you run speech recognition (on the server side), speech synthesis (on the server side), and voice activation (on the client side).

Supported languages: Russian, English, Turkish, Ukrainian.

Technologies and components

The library contains components for each of the technologies provided, as well as a GUI (for speech recognition) and service components (for initializing internal mechanisms).

Regardless of which component you choose, you must first configure SpeechKit using the SpeechKit class.

Configuration and initialization

SpeechKit — A class for configuring and managing SpeechKit.

Before using any of the SpeechKit functionality, you must configure it using the configure method.

By default, SpeechKit uses geolocation to get the current user coordinates. This can help improve speech recognition quality in some cases, such as when using the MAPS language model. To disable it, pass null as the third argument of the configure method (instead of an instance of LocationProvider).

Initializer — A class for controlling the initialization process.

Initialization is the process in which SpeechKit sets up its internal mechanisms. It may involve lengthy read operations from persistent storage or network access, and generally takes a noticeable amount of time. The Initializer class lets you run initialization at a moment that is convenient for your app.

In the current implementation, Initializer sends a request to the server (the “startup request”) and gets a set of parameters and configurations in response (such as the audio format or parameters of the voice activity detection algorithm), which are then used during speech recognition.

Note. You do not have to perform initialization explicitly. If it has not yet been done, SpeechKit initializes itself automatically when the first request for speech recognition or synthesis is received. Initializer is therefore mainly useful for speeding up the execution of the first request.

Initializer uses the InitializerListener interface to notify you when it starts and finishes (with or without errors).
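The trade-off between explicit and automatic initialization can be illustrated with a small, self-contained sketch. It does not use the SpeechKit API: InitListenerSketch, LazyInitSketch and InitSketchDemo are hypothetical stand-ins, and the method names are assumptions chosen for illustration only. The sketch shows that warming up in advance moves the start/finish notifications ahead of the first request, whereas without it the first request triggers them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener: notified when initialization starts and finishes.
// Not the real InitializerListener; the method names are assumptions.
interface InitListenerSketch {
    void onStarted();
    void onDone(String errorOrNull); // null means "finished without errors"
}

// Stand-in for a component that needs a slow one-time setup.
final class LazyInitSketch {
    private final InitListenerSketch listener;
    private boolean initialized = false;

    LazyInitSketch(InitListenerSketch listener) {
        this.listener = listener;
    }

    // Explicit warm-up; repeated calls are no-ops.
    void initialize() {
        if (initialized) {
            return;
        }
        listener.onStarted();
        // Placeholder for slow work such as a startup request to a server.
        initialized = true;
        listener.onDone(null); // this sketch never fails
    }

    // Any request initializes automatically if that has not happened yet.
    String handleRequest(String text) {
        initialize();
        return "handled: " + text;
    }
}

final class InitSketchDemo {
    // Returns the order of events with or without an explicit warm-up.
    static String run(boolean warmUp) {
        final List<String> log = new ArrayList<>();
        LazyInitSketch kit = new LazyInitSketch(new InitListenerSketch() {
            @Override public void onStarted() { log.add("started"); }
            @Override public void onDone(String errorOrNull) {
                log.add(errorOrNull == null ? "done" : "failed");
            }
        });
        if (warmUp) {
            kit.initialize();
        }
        log.add("request");
        kit.handleRequest("hello");
        return String.join(",", log);
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints "started,done,request"
        System.out.println(run(false)); // prints "request,started,done"
    }
}
```

In the second run the setup cost lands inside the first request, which is the delay that an explicit warm-up is meant to avoid.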

The Error class describes the errors that occurred during the recognition process.

Speech recognition

RecognizerActivity is an Android Activity for easy integration of speech recognition.

RecognizerActivity returns the string uttered by the user and handles any problems that occur along the way. It manages the entire recognition process, including the user interface for speech recognition, management of the Recognizer and Initializer objects, and so on.

RecognizerActivity is launched with the Android startActivityForResult method, which takes the corresponding Intent, and returns the recognition results in the resulting Intent.

Recognizer — A class for more detailed control of the speech recognition process.

Recognizer is the central component of speech recognition in SpeechKit. Recognizer is intended for single sessions of speech recognition. It manages the entire recognition process, including recording audio, detecting speech activity, communicating with the server, and so on. Recognizer uses the RecognizerListener interface for notification of important events in the recognition process, returning recognition results, and notification of errors.

The recognition result is represented by the Recognition class, which is the N-best list of recognition hypotheses, sorted by confidence in descending order. A recognition hypothesis, in turn, is represented by the RecognitionHypothesis class.
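As an illustration of what "an N-best list sorted by confidence in descending order" means in practice, here is a minimal sketch with stand-in types. HypothesisSketch and NBestSketch are hypothetical and are not the Recognition or RecognitionHypothesis classes; the field names and the sample confidence values are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one hypothesis: recognized text plus a confidence score.
final class HypothesisSketch {
    final String text;
    final double confidence;

    HypothesisSketch(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

final class NBestSketch {
    // Returns a copy of the list ordered by confidence, highest first.
    static List<HypothesisSketch> sortByConfidence(List<HypothesisSketch> hypotheses) {
        List<HypothesisSketch> sorted = new ArrayList<>(hypotheses);
        sorted.sort(Comparator.comparingDouble((HypothesisSketch h) -> h.confidence).reversed());
        return sorted;
    }

    // In a list that is already sorted, the best hypothesis is the first element.
    static String bestText(List<HypothesisSketch> nBest) {
        return nBest.isEmpty() ? "" : nBest.get(0).text;
    }

    public static void main(String[] args) {
        List<HypothesisSketch> raw = Arrays.asList(
                new HypothesisSketch("wreck a nice beach", 0.31),
                new HypothesisSketch("recognize speech", 0.87),
                new HypothesisSketch("recognise peach", 0.12));
        System.out.println(bestText(sortByConfidence(raw))); // prints "recognize speech"
    }
}
```

Because the list described above arrives already sorted, an app that only needs the most likely text can take the first hypothesis and keep the rest for disambiguation or for offering alternatives.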

Speech synthesis (text-to-speech)

Vocalizer — Class for single sessions of speech synthesis.

Vocalizer is the main speech synthesis component in SpeechKit. It manages the entire text-to-speech process, including producing audio, communicating with the server, and so on.

Voice activation

PhraseSpotter — A class for using voice activation.

PhraseSpotter continuously analyzes the audio stream and detects the specified phrases in it. It does not require an internet connection. All computations are performed on the device. To search for phrases in an audio stream, you need a model that contains the pronunciation of these phrases. To replace the model, use the setModel method.

To start using PhraseSpotter, you must set the language model and the object that will receive notifications. After this, you can stop and start phrase detection without re-initialization.

Before using the model, you need to call the load method, which loads the model into memory. Once a model is loaded, you can switch to it without stopping PhraseSpotter.
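The lifecycle rules above (load the model first, set a model and a listener before starting, switch models without stopping, stop and restart without re-initialization) can be sketched as a small state machine. The classes below are hypothetical stand-ins, not the PhraseSpotter API: the names, the instance-based design, and the feed method that simulates already-decoded audio are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a phrase model; not the SpeechKit model class.
final class SpotterModelSketch {
    private final Set<String> phrases;
    private boolean loaded = false;

    SpotterModelSketch(String... phrases) {
        this.phrases = new HashSet<>(Arrays.asList(phrases));
    }

    void load() { loaded = true; } // placeholder for reading the model into memory
    boolean isLoaded() { return loaded; }
    boolean contains(String phrase) { return phrases.contains(phrase); }
}

// Hypothetical listener that receives detected phrases.
interface SpotterListenerSketch {
    void onPhraseSpotted(String phrase);
}

final class SpotterSketch {
    private SpotterModelSketch model;
    private SpotterListenerSketch listener;
    private boolean running = false;

    // May be called while running, but only with a model that has been loaded.
    void setModel(SpotterModelSketch newModel) {
        if (!newModel.isLoaded()) {
            throw new IllegalStateException("call load() before setModel()");
        }
        model = newModel;
    }

    void setListener(SpotterListenerSketch newListener) { listener = newListener; }

    void start() {
        if (model == null || listener == null) {
            throw new IllegalStateException("set a model and a listener first");
        }
        running = true;
    }

    void stop() { running = false; }
    boolean isRunning() { return running; }

    // Simulates one chunk of audio that has already been decoded to a phrase.
    void feed(String phrase) {
        if (running && model.contains(phrase)) {
            listener.onPhraseSpotted(phrase);
        }
    }

    public static void main(String[] args) {
        SpotterModelSketch commands = new SpotterModelSketch("ok device");
        commands.load();
        SpotterSketch spotter = new SpotterSketch();
        spotter.setModel(commands);
        spotter.setListener(phrase -> System.out.println("spotted: " + phrase));
        spotter.start();
        spotter.feed("ok device"); // prints "spotted: ok device"
    }
}
```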

Migrating from version 2.2 to version 2.5

  1. Replace the constant names for the models:

    Recognizer.Model.freeform →  Recognizer.Model.NOTES
    
    Recognizer.Model.general →  Recognizer.Model.QUERIES
    
    Recognizer.Model.maps →  Recognizer.Model.MAPS
    
    Recognizer.Model.music →  Recognizer.Model.MUSIC

    Example:

    // Before:
    Recognizer rec = Recognizer.create(yourLng, Recognizer.Model.general, yourListener);
    // Replace with:
    Recognizer rec = Recognizer.create(yourLng, Recognizer.Model.QUERIES, yourListener);

  2. Replace the constant names for Russian and Turkish:

    Recognizer.Language.russian →  Recognizer.Language.RUSSIAN
    
    Recognizer.Language.turkish →  Recognizer.Language.TURKISH
  3. Add the onSpeechEnds method to the class that implements the RecognizerListener interface.
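
For a larger codebase, the renames in steps 1 and 2 could be applied mechanically. The sketch below is one possible approach, not an official migration tool: MigrationSketch and its migrate method are hypothetical names, and the rename table contains only the constants listed above. It rewrites source text with whole-identifier matching so that longer names that merely start with an old constant are left alone. Step 3 (adding onSpeechEnds) still has to be done by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that applies the 2.2 -> 2.5 constant renames to source text.
final class MigrationSketch {
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("Recognizer.Model.freeform", "Recognizer.Model.NOTES");
        RENAMES.put("Recognizer.Model.general", "Recognizer.Model.QUERIES");
        RENAMES.put("Recognizer.Model.maps", "Recognizer.Model.MAPS");
        RENAMES.put("Recognizer.Model.music", "Recognizer.Model.MUSIC");
        RENAMES.put("Recognizer.Language.russian", "Recognizer.Language.RUSSIAN");
        RENAMES.put("Recognizer.Language.turkish", "Recognizer.Language.TURKISH");
    }

    // Replaces every old constant that appears as a whole identifier.
    static String migrate(String source) {
        String result = source;
        for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
            // \b on both sides prevents matches inside longer identifiers.
            Pattern oldName = Pattern.compile("\\b" + Pattern.quote(rename.getKey()) + "\\b");
            result = oldName.matcher(result)
                    .replaceAll(Matcher.quoteReplacement(rename.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String before =
                "Recognizer rec = Recognizer.create(yourLng, Recognizer.Model.general, yourListener);";
        System.out.println(migrate(before));
    }
}
```

Running a helper like this over a copy of the sources and reviewing the diff before committing is safer than editing in place, since a text-based rewrite also touches comments and string literals.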