Speech Recognition (Microsoft Bing Speech API vs. Google Cloud Speech API)

I wanted to find out where I can get the better speech recognition service: on the Microsoft side with the Bing Speech API, or on the Google side with the Google Cloud Speech API.

The first and, for my region, most important point is that the Bing Speech API does not support the Slovenian language, while the Google Cloud Speech API does. So for Slovenian we can only go with Google.

You can find both examples, the Bing way and the Google way, in my GitHub repository.

Bing Speech API Example

So, let's start with the Bing Speech API in the SpeechRecognition WPF project. First we have to install the NuGet package Microsoft.ProjectOxford.SpeechRecognition-x64.
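
If you use the Package Manager Console in Visual Studio, the install command would look like this:

Install-Package Microsoft.ProjectOxford.SpeechRecognition-x64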

We create some variables below: micClient represents our microphone client, SubscriptionKey is the key for our Bing Speech API subscription, Mode is the recognition mode (LongDictation – an utterance up to 10 minutes long; ShortPhrase – an utterance up to 15 seconds long), and DefaultLocale holds the locale settings.

private MicrophoneRecognitionClient micClient;

private string SubscriptionKey = "{your-subscription-key}";
private SpeechRecognitionMode Mode = SpeechRecognitionMode.LongDictation;
private string DefaultLocale = "en-US";

You can get a Bing Speech API subscription key on this page: https://azure.microsoft.com/en-us/try/cognitive-services/

The next thing we need to do is create the UI for our application. We have one multi-line TextBox and two Button controls inside a Grid, as in the image below.

[Screenshot: the application UI – a multi-line TextBox with Start and End buttons]

<Grid>
    <Button x:Name="btnEnd" Content="End" HorizontalAlignment="Left" Margin="625,478,0,0" VerticalAlignment="Top" Width="75" Click="btnEnd_Click"/>
    <Button x:Name="btnStart" Content="Start" HorizontalAlignment="Left" Margin="545,478,0,0" VerticalAlignment="Top" Width="75" Click="btnStart_Click"/>
    <TextBox x:Name="tbLogs" Height="463" Margin="10,10,10,35" HorizontalAlignment="Stretch" VerticalAlignment="Stretch" TextWrapping="Wrap" Text="" VerticalScrollBarVisibility="Visible" Width="690"/>
</Grid>

We add a Click event handler to each button: btnStart_Click starts speech recognition via the microphone, and btnEnd_Click stops it.

private void btnStart_Click(object sender, RoutedEventArgs e)
{
    if (micClient == null)
        CreateMicrophoneRecoClient();

    micClient.StartMicAndRecognition();
}

private void btnEnd_Click(object sender, RoutedEventArgs e)
{
    // Guard against clicking End before Start, when micClient is still null.
    if (micClient != null)
        micClient.EndMicAndRecognition();
}

Let's take a deeper look at the CreateMicrophoneRecoClient() function. First we initialize micClient with the CreateMicrophoneClient() factory method, passing the recognition mode, locale and subscription key.

micClient = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
    Mode,
    DefaultLocale,
    SubscriptionKey);
micClient.AuthenticationUri = "";

We attach four event handlers: the first for microphone status changes, the next two for received responses (partial and full), and the last one for error handling.

micClient.OnMicrophoneStatus += OnMicrophoneStatus;
micClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
micClient.OnResponseReceived += OnMicDictationResponseReceivedHandler;
micClient.OnConversationError += OnConversationErrorHandler;
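
Putting these fragments together, the whole CreateMicrophoneRecoClient() helper could look like this (a sketch assembled from the snippets above):

private void CreateMicrophoneRecoClient()
{
    micClient = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
        Mode,
        DefaultLocale,
        SubscriptionKey);
    micClient.AuthenticationUri = ""; // empty string: keep the default authentication endpoint

    micClient.OnMicrophoneStatus += OnMicrophoneStatus;
    micClient.OnPartialResponseReceived += OnPartialResponseReceivedHandler;
    micClient.OnResponseReceived += OnMicDictationResponseReceivedHandler;
    micClient.OnConversationError += OnConversationErrorHandler;
}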

When the microphone status changes, we check whether the microphone is recording. If it is, we write a suitable notification into the log.

private void OnMicrophoneStatus(object sender, MicrophoneEventArgs e)
{
    Dispatcher.Invoke(() =>
    {
        if (e.Recording)
        {
            WriteLine("Please start speaking ... :)");
        }

        WriteLine("");
    });
}

When we receive a partial response, we just write it down.

private void OnPartialResponseReceivedHandler(object sender, PartialSpeechResponseEventArgs e)
{
    WriteLine("{0}", e.PartialResult);
    WriteLine("");
}

On the other side – when we receive a full response, we check whether it marks the end of dictation or a silence timeout has fired. In that case we end the recognition process.

If not, we write down the best recognition result (using the Confidence property as the criterion).

private void OnMicDictationResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    if (e.PhraseResponse.RecognitionStatus == RecognitionStatus.EndOfDictation ||
        e.PhraseResponse.RecognitionStatus == RecognitionStatus.DictationEndSilenceTimeout)
    {
        Dispatcher.Invoke(
            (Action)(() =>
            {
                micClient.EndMicAndRecognition();
            }));
    }

    if (e.PhraseResponse.Results.Length > 0)
    {
        // Pass the recognized text as an argument so braces in it cannot break string.Format.
        WriteLine("{0}", e.PhraseResponse.Results.OrderByDescending(x => x.Confidence).First().DisplayText);
    }
}

The last handler is for errors, where we just write down the error text:

private void OnConversationErrorHandler(object sender, SpeechErrorEventArgs e)
{
    WriteLine("Error: {0}", e.SpeechErrorText);
    WriteLine("");
}

There are two additional things we need to do. The first is the WriteLine() function for writing text to the log TextBox named tbLogs.

private void WriteLine(string format, params object[] args)
{
    var formattedStr = string.Format(format, args);
    Trace.WriteLine(formattedStr);
    Dispatcher.Invoke(() =>
    {
        tbLogs.Text += formattedStr + "\n";
        tbLogs.ScrollToEnd();
    });
}

The second and last thing we need to do is override the OnClosed event handler, where we dispose of our microphone client.

protected override void OnClosed(EventArgs e)
{
    if (null != micClient)
    {
        micClient.Dispose();
    }

    base.OnClosed(e);
}

Let's play with our app.

[Screenshot: the running Bing Speech API example]

Google Cloud Speech API Example

Let's switch to Google. You need to be registered on the Google Cloud Platform. Then you can enable the Speech API. After that you will get a JSON file with your credentials.

For this example we have the SpeechRecognitionGoogle WPF project. We have to install the NuGet package Google.Cloud.Speech.V1.
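
Again via the Package Manager Console:

Install-Package Google.Cloud.Speech.V1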

In this WPF project we have exactly the same UI, with one multi-line TextBox and two Button controls. Both buttons have an event handler (btnStart_Click, btnEnd_Click).

In the Window Loaded event we save the path to the JSON credential file into an environment variable named GOOGLE_APPLICATION_CREDENTIALS and start the Speech API on microphone input.

private async void Window_Loaded(object sender, RoutedEventArgs e)
{
    Environment.SetEnvironmentVariable("GOOGLE_APPLICATION_CREDENTIALS", credentialPath);

    await FromMicrophone();
}
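
The credentialPath field is not shown above; it is just a class-level string pointing at the downloaded JSON key file. A sketch (the path below is a hypothetical example):

// Hypothetical path – replace with the location of your own JSON credential file.
private readonly string credentialPath = @"C:\path\to\credentials.json";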

Inside the FromMicrophone() function we first check whether the computer has any microphone plugged in.

if (NAudio.Wave.WaveIn.DeviceCount < 1)
{
    WriteLine("No microphone!");
    return -1;
}
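
The following snippets also use a few class-level fields whose declarations are not shown in the post. Based on the types used, they could look like this (a sketch, assuming the Google.Cloud.Speech.V1 and NAudio packages):

private SpeechClient speech;
private SpeechClient.StreamingRecognizeStream streamingCall;
private Task printResponses;
private object writeLock;
private bool writeMore;
private NAudio.Wave.WaveInEvent waveIn;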

After that we create the SpeechClient object and write the initial request with the configuration data.

speech = SpeechClient.Create();

streamingCall = speech.StreamingRecognize();
// Write the initial request with the config.
await streamingCall.WriteAsync(
    new StreamingRecognizeRequest()
    {
        StreamingConfig = new StreamingRecognitionConfig()
        {
            Config = new RecognitionConfig()
            {
                Encoding =
                RecognitionConfig.Types.AudioEncoding.Linear16,
                SampleRateHertz = 16000,
                LanguageCode = "sl",
            },
            InterimResults = true,
        }
    });

Then we have to define the task that handles new responses from the speech recognition service. It writes down the best recognition result (using the Confidence property as the criterion).

// Print responses as they arrive.
printResponses = Task.Run(async () =>
{
    while (await streamingCall.ResponseStream.MoveNext(
        default(CancellationToken)))
    {
        foreach (var result in streamingCall.ResponseStream
            .Current.Results.Where(x => x.IsFinal)) // keep only final results
        {
            WriteLine(result.Alternatives.OrderByDescending(x => x.Confidence).First().Transcript);
        }
    }
});

And the last thing is the setup for reading from the microphone and streaming the audio to the API.

writeLock = new object();
writeMore = true;
waveIn = new NAudio.Wave.WaveInEvent();
waveIn.DeviceNumber = 0;
waveIn.WaveFormat = new NAudio.Wave.WaveFormat(16000, 1);
waveIn.DataAvailable +=
    (object sender, NAudio.Wave.WaveInEventArgs args) =>
    {
        try
        {
            lock (writeLock)
            {
                if (!writeMore) return;
                streamingCall.WriteAsync(
                    new StreamingRecognizeRequest()
                    {
                        AudioContent = Google.Protobuf.ByteString
                            .CopyFrom(args.Buffer, 0, args.BytesRecorded)
                    }).Wait();
            }
        }
        catch { } // ignore write errors that can occur after the stream has been completed
    };
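
For orientation, here is a possible skeleton of the whole FromMicrophone() function, assembled from the pieces above (the Task<int> return type is an assumption that follows from the return -1 in the device check):

private async Task<int> FromMicrophone()
{
    // Bail out early if no microphone is present.
    if (NAudio.Wave.WaveIn.DeviceCount < 1)
    {
        WriteLine("No microphone!");
        return -1;
    }

    // Create the client and send the initial configuration request (snippet above).
    // Start the printResponses task that reads results from the response stream.
    // Wire up waveIn and its DataAvailable handler to stream audio chunks to the API.
    // Recording itself is started later, in btnStart_Click.

    return 0;
}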

In the btnStart_Click event handler we start recording, and in btnEnd_Click we stop it, complete the stream and wait for the remaining responses.

private void btnStart_Click(object sender, RoutedEventArgs e)
{
    waveIn.StartRecording();
    WriteLine("Please start speaking ... :)");
    WriteLine("");
}

private async void btnEnd_Click(object sender, RoutedEventArgs e)
{
    waveIn.StopRecording();
    lock (writeLock) writeMore = false;
    await streamingCall.WriteCompleteAsync();
    await printResponses;
}

Let's play with our app, this time in the Slovenian language 🙂

[Screenshot: the running Google Cloud Speech API example, recognizing Slovenian]

[ Complete code on GitHub ]

Cheers!
Gašper Rupnik

{End.}
