Yandex SpeechKit Mobile SDK 3.12.2 for Android reference guide

Yandex SpeechKit is a multi-platform library for integrating speech functionality in your mobile apps with minimal effort. The ultimate goal of SpeechKit is to provide users with the entire range of Yandex speech technologies.

SpeechKit architecture

The SpeechKit library supports several mobile platforms using the same implementation of the basic logic. The differences between platforms are in the platform abstraction layer (audio recording, networking, etc.), API wrappers, and platform-specific components such as the GUI implementation. This approach simplifies development for multiple platforms and keeps functionality closely synchronized across them.

Mobile platforms differ in their culture and development practices. This affects such aspects as naming of classes and methods, object instantiation, error handling, and so on. We try to minimize these differences while also making sure that SpeechKit fits naturally into the ecosystem of each of the supported platforms.

Working with the SDK

  1. Initializing the SDK
  2. Speech recognition
  3. Speech recognition + UI
  4. Speech synthesis (text-to-speech)
  5. Voice activation

Initializing the SDK

Before using any of the SpeechKit functionality, you need to configure SpeechKit using the API key (you can get a key in the Developer Dashboard):
SpeechKit.getInstance().init(getApplicationContext(), "developer_api_key");
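
SpeechKit only needs to be initialized once, so a convenient place for the call is the onCreate() method of your Application subclass. The sketch below is illustrative: the MyApplication class name is hypothetical, and the key string is a placeholder for your own API key.

// A minimal sketch: initializing SpeechKit once at application startup.
// MyApplication is a hypothetical Application subclass; register it in
// AndroidManifest.xml via the android:name attribute.
public class MyApplication extends android.app.Application {
    @Override
    public void onCreate() {
        super.onCreate();
        // Replace the placeholder with the API key from the Developer Dashboard.
        SpeechKit.getInstance().init(getApplicationContext(), "developer_api_key");
    }
}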

Speech recognition

Speech recognition uses an OnlineRecognizer object:
OnlineRecognizer recognizer = new OnlineRecognizer.Builder(Language.RUSSIAN, OnlineModel.QUERIES, this)
                 .setDisableAntimat(false)
                 .setEnablePunctuation(true)
                 .build(); // 1
recognizer.prepare(); // 2
recognizer.startRecording(); // 3
  1. To create an OnlineRecognizer object, specify which settings it will work with. Mandatory settings are: the language of recognized speech, the language model, and the listener that will receive messages about the recognition process. For the full list of settings, see the OnlineRecognizer.Builder class.
  2. OnlineRecognizer requires a network connection. Because of this, it may take slightly longer to start the recognition process the first time. To avoid this, call the prepare() method in advance: it performs all the necessary setup ahead of time.
    Note.

    If the prepare() method wasn't called explicitly, it will run automatically on the first start.

  3. The start of speech recognition. Asynchronous execution.

To get recognition results and monitor changes in the state of the OnlineRecognizer object, implement the RecognizerListener interface. Main methods of the interface:

  1. onRecordingBegin — Notifies when audio recording begins.
  2. onPartialResults — Notifies when intermediate speech recognition results are obtained. The endOfUtterance flag indicates whether the end of the utterance has been reached: if true, recognition is complete.
  3. onRecognitionDone — Notifies when the recognition process is complete.
  4. onRecognizerError — Notifies that an error occurred when the OnlineRecognizer object was working.

The OnlineRecognizer object can be used for repeated speech recognition. If you need to stop the recognition process before it finishes, call cancel().
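
For illustration, a listener implementation might look like the following sketch. The callback signatures and the Recognition accessor are based on the 3.12.x API reference and should be verified against the RecognizerListener class; the interface also declares additional callbacks that are omitted here.

// A minimal sketch of a RecognizerListener implementation (signatures per
// the 3.12.x reference; verify against the RecognizerListener class).
public class MyRecognizerListener implements RecognizerListener {
    @Override
    public void onRecordingBegin(Recognizer recognizer) {
        // Audio recording has started.
    }

    @Override
    public void onPartialResults(Recognizer recognizer, Recognition results, boolean endOfUtterance) {
        // Intermediate hypotheses; when endOfUtterance is true, this is the final result.
        final String bestResult = results.getBestResultText();
    }

    @Override
    public void onRecognitionDone(Recognizer recognizer) {
        // The recognition process has finished.
    }

    @Override
    public void onRecognizerError(Recognizer recognizer, Error error) {
        // Inspect the Error object to handle the failure.
    }

    // The interface declares further callbacks (e.g. speech detection and
    // sound power updates); implement them as no-ops if they are not needed.
}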

Speech recognition + UI

You can also use the RecognizerActivity UI dialog to make it easier to integrate speech recognition into an app. It manages the entire recognition process, including the user interface for recognition and management of the OnlineRecognizer and PhraseSpotter objects. RecognizerActivity starts recognition immediately after opening. The dialog window closes automatically in the following cases:

  • The recognition result was received.
  • An error occurred.
  • The user closed or minimized the app.

The dialog correctly handles screen rotation, app minimization, and any other events that may affect the appearance of the dialog or the behavior of the OnlineRecognizer object.

Intent intent = new Intent(getApplicationContext(), RecognizerActivity.class); // 1
intent.putExtra(RecognizerActivity.EXTRA_LANGUAGE, Language.RUSSIAN.getValue()); // 2
intent.putExtra(RecognizerActivity.EXTRA_MODEL, OnlineModel.QUERIES.getName()); // 2
intent.putExtra(RecognizerActivity.EXTRA_SHOW_PARTIAL_RESULTS, true); // 3
intent.putExtra(RecognizerActivity.EXTRA_SHOW_HYPOTHESES, true); // 3
intent.putExtra(RecognizerActivity.EXTRA_NIGHT_THEME, false); // 3

startActivityForResult(intent, REQUEST_UI_CODE); // 4
  1. Create an Intent for launching RecognizerActivity.
  2. Set the required parameters: recognition language and model.
  3. You can specify additional settings for the dialog:
    • Show partial recognition results or a list of hypotheses if the result is ambiguous.
    • Set the appearance of the window to a light or dark theme.
  4. Run RecognizerActivity and wait for results.

You can get recognition results using the standard Android mechanism: implement the onActivityResult() callback in the activity class. It receives the result code and the data passed back by the finished activity. If the result code is RecognizerActivity.RESULT_OK, the data will contain the recognition result:

@Override
public void onActivityResult(int requestCode, int resultCode, Intent data) {
    super.onActivityResult(requestCode, resultCode, data);
    if (data != null) {
        if (requestCode == REQUEST_UI_CODE) {
            if (resultCode == RecognizerActivity.RESULT_OK) {
                final String result = data.getStringExtra(RecognizerActivity.EXTRA_RESULT); // 1
            } else if (resultCode == RecognizerActivity.RESULT_CANCELED) {
                final String language = data.getStringExtra(RecognizerActivity.EXTRA_LANGUAGE); // 2
            } else if (resultCode == RecognizerActivity.RESULT_ERROR) {
                final Error error = (Error) data.getSerializableExtra(RecognizerActivity.EXTRA_ERROR); // 3
            }
        }
    }
}
  1. The recognition was successful.
  2. The window was closed.
  3. Recognition failed with an error.

Speech synthesis (text-to-speech)

Speech synthesis (vocalization) uses the OnlineVocalizer object:

OnlineVocalizer vocalizer = new OnlineVocalizer.Builder(Language.ENGLISH, this)
                 .setEmotion(Emotion.GOOD)
                 .setVoice(Voice.ERMIL)
                 .build(); // 1
vocalizer.prepare(); // 2
vocalizer.synthesize("Tomorrow's weather", Vocalizer.TextSynthesizingMode.APPEND); // 3
  1. To create an OnlineVocalizer object, specify which settings it will work with. Mandatory settings: the language of the synthesized speech and the listener that will receive messages about the speech synthesis process. For the full list of settings, see the OnlineVocalizer.Builder class.
  2. OnlineVocalizer requires a network connection. Because of this, it may take slightly longer to start the speech synthesis process the first time. To avoid this, call the prepare() method in advance: it performs all the necessary setup ahead of time.
    Note.

    If the prepare() method wasn't called explicitly, it will be executed automatically at the time of the first speech synthesis.

  3. The start of speech synthesis for the passed text. Asynchronous execution.

To get speech synthesis results and monitor changes in the state of the OnlineVocalizer object, implement the VocalizerListener interface. Main methods of the interface:

  1. onPartialSynthesis — Notifies when partial synthesis results are received. Depending on the task, you can save them to a file or play them using the built-in player.
  2. onSynthesisDone — Notifies when the speech synthesis process is completed.
  3. onVocalizerError — Notifies that an error occurred while the OnlineVocalizer object was working.

The OnlineVocalizer object can be used for repeated speech synthesis. If you need to end the speech synthesis or vocalization process before it finishes, call the cancel() method.
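
For illustration, a listener implementation might look like the following sketch. The callback signatures are based on the 3.12.x API reference and should be verified against the VocalizerListener class; the interface also declares playback callbacks that are omitted here.

// A minimal sketch of a VocalizerListener implementation (signatures per
// the 3.12.x reference; verify against the VocalizerListener class).
public class MyVocalizerListener implements VocalizerListener {
    @Override
    public void onPartialSynthesis(Vocalizer vocalizer, SoundBuffer soundBuffer) {
        // A chunk of synthesized audio: save it to a file or hand it to a player.
    }

    @Override
    public void onSynthesisDone(Vocalizer vocalizer) {
        // All audio for the requested text has been synthesized.
    }

    @Override
    public void onVocalizerError(Vocalizer vocalizer, Error error) {
        // Inspect the Error object to handle the failure.
    }

    // The interface also declares playback callbacks (e.g. when playback
    // begins and ends); implement them as no-ops if they are not needed.
}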

Voice activation

Voice activation uses the PhraseSpotter object. It detects a specific word or phrase in the incoming audio stream. The activation phrase is determined by the model specified for the PhraseSpotter object.

PhraseSpotter phraseSpotter = new PhraseSpotter.Builder("phrase-spotter/commands", this).build(); // 1
phraseSpotter.prepare(); // 2
phraseSpotter.start(); // 3
  1. To create the PhraseSpotter object, specify which settings it will work with. Required settings are the path to the model for the PhraseSpotter object and the listener that will receive notifications about the voice activation process. For the full list of settings, see the PhraseSpotter.Builder class.
  2. PhraseSpotter does not require a network connection, but it may take some time to load the model. To avoid this, call the prepare() method in advance.
    Note.

    If the prepare() method wasn't called explicitly, it will run automatically on the first start.

  3. Starting the work of the PhraseSpotter object. Asynchronous execution.

To get voice activation results and monitor changes in the state of the PhraseSpotter object, implement the PhraseSpotterListener interface. Main methods of the interface:

  1. onPhraseSpotterStarted — Notifies when audio recording begins.
  2. onPhraseSpotted — Notifies when the activation phrase is detected in the audio stream.
  3. onPhraseSpotterError — Notifies that an error occurred when the PhraseSpotter object was working.

After the specified phrase is detected, the PhraseSpotter object continues working. To stop it, call stop().
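
For illustration, a listener implementation might look like the following sketch. The onPhraseSpotted signature (with the detected phrase and its index) is based on the 3.12.x API reference and should be verified against the PhraseSpotterListener class. Starting speech recognition from onPhraseSpotted is shown as one common pattern, not a requirement.

// A minimal sketch of a PhraseSpotterListener implementation (signatures
// per the 3.12.x reference; verify against the PhraseSpotterListener class).
public class MyPhraseSpotterListener implements PhraseSpotterListener {
    @Override
    public void onPhraseSpotterStarted(PhraseSpotter phraseSpotter) {
        // Audio recording has started.
    }

    @Override
    public void onPhraseSpotted(PhraseSpotter phraseSpotter, String phrase, int phraseIndex) {
        // The activation phrase was detected. A common pattern is to start
        // speech recognition here, e.g. recognizer.startRecording().
    }

    @Override
    public void onPhraseSpotterError(PhraseSpotter phraseSpotter, Error error) {
        // Inspect the Error object to handle the failure.
    }

    // If your SDK version declares additional callbacks (e.g. a stop
    // notification), implement them as no-ops if they are not needed.
}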

Need help?

If you experience problems with the SpeechKit Mobile SDK, try enabling logging using the setLogLevel method of the BaseSpeechKit class. The log provides additional information about what is happening in the system and may help answer your questions.

SpeechKit.getInstance().setLogLevel(SpeechKit.LogLevel.LOG_DEBUG);

If the logs don't give you enough information, search the FAQ for an answer to your question or a description of a similar problem and solution.