STT Streaming API

Speech Center uses gRPC as the interface for its streaming speech recognition solution. You can find more information about gRPC at https://grpc.io/docs/what-is-grpc/introduction/.


gRPC introduction

From https://grpc.io/docs/what-is-grpc/introduction/:

In gRPC, a client application can directly call a method on a server application on a different machine as if it were a local object, making it easier for you to create distributed applications and services. As in many RPC systems, gRPC is based around the idea of defining a service, specifying the methods that can be called remotely with their parameters and return types. On the server side, the server implements this interface and runs a gRPC server to handle client calls. On the client side, the client has a stub (referred to as just a client in some languages) that provides the same methods as the server.

(Concept diagram omitted; see the gRPC introduction page linked above.)

gRPC clients and servers can run and talk to each other in a variety of environments – from servers inside Google to your own desktop – and can be written in any of gRPC’s supported languages. So, for example, you can easily create a gRPC server in Java with clients in Go, Python, or Ruby. In addition, the latest Google APIs will have gRPC versions of their interfaces, letting you easily build Google functionality into your applications.
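
To make the quoted description concrete for this API, below is a minimal sketch of generating Python client stubs from the proto files shown in the following sections with grpcio-tools. The file name recognizer.proto is an assumption for the file that defines the Recognizer service; the other two names come from its import statements.

# Sketch: generate Python stubs from the Speech Center proto files.
# Assumes the three .proto files shown below are saved in the current directory
# and that grpcio-tools is installed (pip install grpcio grpcio-tools).
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",
    "-I.",
    "--python_out=.",
    "--grpc_python_out=.",
    "recognizer.proto",                        # assumed file name for the service definition
    "recognition_streaming_request.proto",
    "recognition_streaming_response.proto",
])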


Recognition proto

syntax = "proto3";

package speechcenter.recognizer.v1;

import "recognition_streaming_request.proto";
import "recognition_streaming_response.proto";

// Service that implements Recognition API.
service Recognizer {
  // Performs bidirectional streaming speech recognition: receive results while sending audio.
  rpc StreamingRecognize(stream RecognitionStreamingRequest) returns (stream RecognitionStreamingResponse);
}
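
As an illustration only (not an official client), the sketch below opens a secure channel and invokes the bidirectional StreamingRecognize RPC from Python. The module name recognizer_pb2_grpc, the endpoint and the bearer-token metadata are assumptions; use the values provided with your Speech Center credentials.

import grpc
import recognizer_pb2_grpc  # assumed name of the generated service stub module


def stream_recognition(request_iterator, token):
    """Run the bidirectional StreamingRecognize call and return the response iterator."""
    # Placeholder endpoint; replace with the host provided for your account.
    channel = grpc.secure_channel("speechcenter.example.com:443",
                                  grpc.ssl_channel_credentials())
    stub = recognizer_pb2_grpc.RecognizerStub(channel)

    # StreamingRecognize takes an iterator of RecognitionStreamingRequest messages
    # and returns an iterator of RecognitionStreamingResponse messages.
    return stub.StreamingRecognize(
        request_iterator,
        metadata=[("authorization", "Bearer " + token)],
    )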

Streaming request

syntax = "proto3";

package speechcenter.recognizer.v1;

/*
The stream of recognition requests is composed of a first RecognitionConfig message followed by one or more audio
messages containing raw audio. It can optionally include EventMessages at any point after the first config message.
An EventMessage of type END_OF_STREAM must be sent as the final message.
 */
message RecognitionStreamingRequest {
  oneof recognition_request {
    // Header-like first message carrying the streaming configuration.
    RecognitionConfig config = 1;

    // Raw audio in the selected format.
    bytes audio = 2;

    // Message to signal an event during the stream.
    EventMessage event_message = 3;
  }
}

// An init message with the recognition data.
message RecognitionConfig {
  enum AsrVersion {
    V1 = 0;
    V2 = 1;
  }

  // General parameters for the recognition, such as language.
  RecognitionParameters parameters = 1;

  // The resource (topic or grammar) to use for the recognition.
  RecognitionResource resource = 2;

  // The version of the speech recognition software to be used. Each version may support a different set of languages, topics and features.
  AsrVersion version = 3;

  // Timer configurations for MRCP.
  optional TimerConfiguration configuration = 4;

  /* Labels to apply to this recognition, for billing purposes. There can be from zero up to 64 labels
  in a request, and each label can have up to 256 characters. Billing information can later be
  grouped by label. */
  repeated string label = 5;
}

message RecognitionParameters {

  /* The language locale of your audio, as an IETF BCP 47 tag.
  Supported languages will differ with each AsrVersion. */
  string language = 1;

  oneof AudioEncoding {
    PCM pcm = 2; // Linear Pulse-Code Modulation with signed 16 bit samples, little endian byte order.
  }

  // Set to the number of channels if per-channel speaker separation is desired.
  optional uint32 audio_channels_number = 3;

  // Enable output formatting; only available in certain languages. Premium feature on V1.
  bool enable_formatting = 4;

  // Enable output diarization. Premium feature on V1.
  bool enable_diarization = 5;

}

message PCM {
  // Audio sample rate in Hertz.
  uint32 sample_rate_hz = 1;
}

// The resource to use for the recognition: a topic or an ABNF grammar.
message RecognitionResource {
  enum Topic {
    GENERIC = 0;    // Suitable for any generic speech
    BANKING = 1;    // Transcription will be optimized for banking recordings
    TELCO = 2;      // Transcription will be optimized for telecommunications companies
    INSURANCE = 3;  // Transcription will be optimized for insurance companies
  }

  oneof Resource {
    // The topic to use for the recognition.
    Topic topic = 1;

    // ABNF grammar resource.
    GrammarResource grammar = 2;
  }
}

message GrammarResource {
  oneof Grammar {
    // The text of the ABNF grammar as an inline string.
    string inline_grammar = 1;

    // A URI to a grammar provided by online Verbio services.
    string grammar_uri = 2;

    // A binary grammar precompiled by Verbio services.
    bytes compiled_grammar = 3;
  }
}

message TimerConfiguration {

  /* How long to wait, once voice has been detected, for results from the ASR engine.
  If this timeout elapses, the recognition will be finished with a "No Match" completion cause.
   */
  optional uint32 recognition_timeout = 1;

  /* If set to false, do not start timers until a RecognitionStreamingRequest::EventMessage
   with the event type "START_INPUT_TIMERS" is received.
   */
  optional bool start_input_timers = 2;


  /* How long the recognizer should wait for more speech when already having a complete result.
   */
  optional uint32 speech_complete_timeout = 3;

  /* How long the recognizer should wait for more speech when having only a partial result.
  This differs from the Recognition-Timeout because in this case a partial match is returned instead of a No-Match.
  */
  optional uint32 speech_incomplete_timeout = 4;
}

message EventMessage {
  enum Event {
    // Event to signal the server to start the input timers.
    START_INPUT_TIMERS = 0;
    // Event to signal the server that there will be no more audio to transcribe.
    END_OF_STREAM = 1;
  }
  Event event = 1;
}
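
Putting the request messages together, below is a hedged Python sketch of a request generator that follows the ordering described above: one RecognitionConfig first, then raw audio chunks, then an END_OF_STREAM event. The generated module name recognition_streaming_request_pb2, the language tag and the file handling are illustrative assumptions.

import recognition_streaming_request_pb2 as req_pb2  # assumed generated module name


def request_generator(audio_path, chunk_size=8192):
    """Yield RecognitionStreamingRequest messages in the required order."""
    # 1. The configuration message must be the first message in the stream.
    config = req_pb2.RecognitionConfig(
        parameters=req_pb2.RecognitionParameters(
            language="en-US",                      # example BCP 47 tag
            pcm=req_pb2.PCM(sample_rate_hz=8000),  # 8000 or 16000 Hz
        ),
        resource=req_pb2.RecognitionResource(
            topic=req_pb2.RecognitionResource.GENERIC,
        ),
        version=req_pb2.RecognitionConfig.V1,
        label=["my-project"],                      # optional billing label
    )
    yield req_pb2.RecognitionStreamingRequest(config=config)

    # 2. One or more audio messages with raw signed 16-bit little-endian PCM
    #    (audio_path is assumed to contain headerless raw PCM).
    with open(audio_path, "rb") as audio_file:
        chunk = audio_file.read(chunk_size)
        while chunk:
            yield req_pb2.RecognitionStreamingRequest(audio=chunk)
            chunk = audio_file.read(chunk_size)

    # 3. Signal that no more audio will be sent.
    yield req_pb2.RecognitionStreamingRequest(
        event_message=req_pb2.EventMessage(event=req_pb2.EventMessage.END_OF_STREAM)
    )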

Streaming response

syntax = "proto3";

package speechcenter.recognizer.v1;

message RecognitionStreamingResponse {
  oneof recognition_response {
    // If set, specifies the error for the operation.
    RecognitionError error = 1;

    // Result corresponding to the portion of the audio currently being processed.
    RecognitionResult result = 2;
  }
  uint32 completion_cause = 3; // Numerical code corresponding to the MRCP standard completion causes.
}

// A streaming recognition result corresponding to a portion of the audio that is currently being processed.
message RecognitionResult {
  // List of one or more recognition hypotheses ordered in terms of accuracy.
  repeated RecognitionAlternative alternatives = 1;

  // Time offset relative to the beginning of the audio.
  float duration = 2;

  // Indicates whether this is a final result or an interim result that may still change.
  bool is_final = 3;
}

// Hypothesis-specific information.
message RecognitionAlternative {
  // Transcript text representing the words that the user spoke.
  string transcript = 1;

  // The confidence estimate between 0.0 and 1.0.
  float confidence = 2;

  // A list of word-specific information for each recognized word.
  repeated WordInfo words = 3;
}

// Word-specific information for recognized words.
message WordInfo {
  // Time offset in seconds relative to the beginning of the audio corresponding to the start of the spoken word.
  float start_time = 1;
  // Time offset in seconds relative to the beginning of the audio corresponding to the end of the spoken word.
  float end_time = 2;
  // The spoken word.
  string word = 3;
  // The confidence estimate between 0.0 and 1.0.
  float confidence = 4;
  // Natural number identifying the speaker.
  uint32 speaker_id = 5;
}

message RecognitionError {
  // The reason of the error. This is a constant value that identifies the
  // proximate cause of the error.
  string reason = 1;

  // The logical grouping to which the "reason" belongs. The error domain
  // is typically the registered service name of the tool or product that
  // generates the error.
  string domain = 2;

  // Additional structured details about this error, usually represented as JSON.
  map<string, string> metadata = 3;
}
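
As an illustrative sketch of consuming the response stream (again with assumed generated messages), the snippet below distinguishes errors from results and prints the best hypothesis of each final result:

def print_results(responses):
    """Iterate over RecognitionStreamingResponse messages returned by StreamingRecognize."""
    for response in responses:
        # The oneof carries either an error or a result.
        if response.HasField("error"):
            print("Error [%s]: %s" % (response.error.domain, response.error.reason))
            break

        result = response.result
        if result.is_final and result.alternatives:
            best = result.alternatives[0]  # alternatives are ordered by accuracy
            print("%.2f  %s" % (best.confidence, best.transcript))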

Feature summary and accepted fields

Below is a plain-text summary of the fields accepted by the proto:

ASR version

Only V1 and V2 are accepted. Please check the Speech Streaming product documentation page for a full comparison between them.

Topic

The only topic currently accepted by the system is GENERIC. Specific topics such as TELCO, BANKING and INSURANCE will be available soon.

Language

Please check the Speech Streaming product documentation page for a full and updated list of accepted languages.

Sample rate

8000 Hz and 16000 Hz sample rates are accepted.

Formatting

Boolean value that activates formatted transcriptions. Please check the Speech Streaming product documentation page for a full and updated list of languages and versions that support formatting.

Diarization

Boolean value that activates diarization. Please check the Speech Streaming product documentation page for a full and updated list of languages and versions that support diarization.

Encoding

Linear Pulse-Code Modulation with signed 16-bit samples in little-endian byte order is the only accepted encoding.
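
As a quick local sanity check (a sketch, assuming your source audio is stored in a WAV container), Python's standard wave module can confirm that a file holds 16-bit PCM at a supported sample rate and extract the raw little-endian frames to stream:

import wave


def load_pcm(path):
    """Verify 16-bit PCM at 8000 or 16000 Hz and return the raw little-endian frames."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "samples must be signed 16-bit"
        assert wav.getframerate() in (8000, 16000), "sample rate must be 8000 or 16000 Hz"
        return wav.readframes(wav.getnframes())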

Label

A one-word project or label name. Labels allow requests to be grouped so that usage and billing can be tracked separately for multiple projects under one user.