Voice to Text with Chrome Web Speech API

Google Chrome added support for the Web Speech API in version 25, released in 2013. That support opened up a whole new world of opportunities for web apps to convert voice to text.

With the demo below, you can use Google Chrome as a voice recognition app and type long documents, emails and school essays without touching the keyboard.


Voice to Text demo


GitHub repository

You can download the complete code of the above demo in the link below:


Implementation

You might be thinking “functionality like Speech to Text is pretty complex to implement.” You’d be right if you had to train a speech recognition model from scratch. Thanks to Google, the hard work is already done: by using Chrome’s built-in Web Speech API, you can turn your Chrome browser into a Voice to Text app. Let’s explore this in more detail below.

Folder structure

chrome voice to text project folder structure
  • images – contains the mic images
  • js – contains the JavaScript files
    • languages.js – list of supported languages
    • web-speech-api.js – main application JavaScript
  • style – contains the CSS style file
  • index.html – main HTML file

# Step 1 : Check browser support

Chrome is the major browser that supports the speech-to-text part of the Web Speech API, and it relies on Google’s speech recognition engines.

You can tell whether the browser supports the Web Speech API by checking if the webkitSpeechRecognition object exists.

if ('webkitSpeechRecognition' in window) {
  // speech recognition API supported
} else {
  // speech recognition API not supported
}
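
The check above covers Chrome’s prefixed constructor. As a slightly more defensive sketch, the helper below also looks for the unprefixed SpeechRecognition name defined in the specification. The function name getSpeechRecognition and its globalObject parameter are illustrative and not part of the demo’s code.

```javascript
// Hypothetical helper (not from the demo): return the speech recognition
// constructor if the environment provides one, or null otherwise.
function getSpeechRecognition(globalObject) {
  return globalObject.SpeechRecognition ||
         globalObject.webkitSpeechRecognition ||
         null;
}

// In a browser, pass `window`; the guard keeps the snippet harmless elsewhere.
if (typeof window !== 'undefined') {
  var SpeechRecognitionCtor = getSpeechRecognition(window);
  if (SpeechRecognitionCtor) {
    // speech recognition API supported
  } else {
    // speech recognition API not supported
  }
}
```

Passing the global object in as a parameter keeps the check easy to exercise with a stand-in object.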

# Step 2 : Create speech recognition object

The next step is to create a new speech recognition object.

var recognition = new webkitSpeechRecognition();

# Step 3 : Register event handlers

The speech recognition object has many properties, methods and event handlers. For the full list, please refer to https://w3c.github.io/speech-api/#speechreco-section

interface SpeechRecognition : EventTarget {
    // recognition parameters
    attribute SpeechGrammarList grammars;
    attribute DOMString lang;
    attribute boolean continuous;
    attribute boolean interimResults;
    attribute unsigned long maxAlternatives;

    // methods to drive the speech interaction
    void start();
    void stop();
    void abort();

    // event methods
    attribute EventHandler onaudiostart;
    attribute EventHandler onsoundstart;
    attribute EventHandler onspeechstart;
    attribute EventHandler onspeechend;
    attribute EventHandler onsoundend;
    attribute EventHandler onaudioend;
    attribute EventHandler onresult;
    attribute EventHandler onnomatch;
    attribute EventHandler onerror;
    attribute EventHandler onstart;
    attribute EventHandler onend;
};

Below I highlight the parts that are important for this application.

recognition.continuous = true;
recognition.interimResults = true;

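When recognition.continuous is set to true, the recognizer keeps listening and returning results after the user pauses, instead of ending at the first pause (the behavior with the default value, false). When recognition.interimResults is set to true, the recognizer also returns interim results, which may still change, before each result becomes final.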
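
As a rough sketch, the two flags can be wrapped in a small helper so that dictation-style settings are the default and a one-shot mode is opt-in. The name configureRecognition and its options parameter are hypothetical and do not appear in the demo.

```javascript
// Hypothetical helper: apply dictation-style settings to a recognition object.
// `recognition` is expected to be the object created in Step 2.
function configureRecognition(recognition, options) {
  options = options || {};
  // Keep listening across pauses unless the caller opts out.
  recognition.continuous = options.continuous !== false;
  // Deliver provisional text while the user is still speaking.
  recognition.interimResults = options.interimResults !== false;
  return recognition;
}

// Browser usage (assumes `recognition` exists):
// configureRecognition(recognition);                        // dictation
// configureRecognition(recognition, { continuous: false }); // single phrase
```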

recognition.onresult = function(event) {
  // Interim text is rebuilt on every event; final text accumulates in
  // final_transcript, which is expected to be declared outside this handler.
  var interim_transcript = '';
  for (var i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      final_transcript += event.results[i][0].transcript;
    } else {
      interim_transcript += event.results[i][0].transcript;
    }
  }
  final_transcript = capitalize(final_transcript);
  final_span.innerHTML = linebreak(final_transcript);
  interim_span.innerHTML = linebreak(interim_transcript);
};
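
The handler relies on two helpers, capitalize and linebreak, that are not shown in this article. The versions below are an assumption about what they do, inferred from their names and from how their output is assigned to innerHTML; the real implementations in the repository may differ.

```javascript
// Assumed implementation: upper-case the first non-whitespace character.
function capitalize(text) {
  return text.replace(/\S/, function (ch) { return ch.toUpperCase(); });
}

// Assumed implementation: turn newlines into HTML breaks so the transcript
// keeps its line structure when assigned to innerHTML.
function linebreak(text) {
  return text.replace(/\n\n/g, '<p></p>').replace(/\n/g, '<br>');
}

console.log(capitalize('hello world')); // "Hello world"
console.log(linebreak('one\ntwo'));     // "one<br>two"
```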

Let’s look more closely at the recognition.onresult event to understand what it returns.

web speech api result

The recognition.onresult event handler receives a SpeechRecognitionEvent, which contains the following fields:

  • event.results[i] – the i-th recognition result. Each result is an array-like object holding one or more alternatives for a recognized stretch of speech.
  • event.resultIndex – the lowest index in event.results that has changed since the previous event.
  • event.results[i][j] – the j-th alternative for that result. The first element is the most probable one.
  • event.results[i].isFinal – a Boolean that shows whether this result is final or interim.
  • event.results[i][j].transcript – the recognized text.
  • event.results[i][j].confidence – the recognizer’s estimate of how likely the transcript is correct (a value from 0 to 1).
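
To make these fields concrete, the sketch below walks a mock event shaped like a SpeechRecognitionEvent. The helper names collectTranscripts and makeResult, and all transcript and confidence values, are made up for illustration; a real event comes from the browser.

```javascript
// Hypothetical helper: split an onresult event into final and interim text.
function collectTranscripts(event) {
  var finalText = '';
  var interimText = '';
  for (var i = event.resultIndex; i < event.results.length; ++i) {
    var result = event.results[i];
    var best = result[0]; // the first alternative is the most probable one
    if (result.isFinal) {
      finalText += best.transcript;
    } else {
      interimText += best.transcript;
    }
  }
  return { finalText: finalText, interimText: interimText };
}

// Build a mock result: an array of alternatives with an isFinal flag.
function makeResult(isFinal, alternatives) {
  var result = alternatives.slice();
  result.isFinal = isFinal;
  return result;
}

var mockEvent = {
  resultIndex: 0,
  results: [
    makeResult(true, [{ transcript: 'hello world', confidence: 0.92 }]),
    makeResult(false, [{ transcript: ' how are', confidence: 0.4 }])
  ]
};

var texts = collectTranscripts(mockEvent);
console.log(texts.finalText);   // "hello world"
console.log(texts.interimText); // " how are"
```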

# Step 4 : Language selection

Chrome speech recognition supports numerous languages. If your users speak a language other than English, you can improve their results by setting the language parameter, recognition.lang.

recognition.lang = select_dialect.value;
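
The contents of languages.js are not shown in this article. As a sketch under that caveat, the list could pair each display name with one or more BCP 47 language tags, and a small lookup could feed the dialect dropdown. The langs layout and the dialectCodes helper below are assumptions, not the repository’s actual code.

```javascript
// Assumed shape for the language list: each entry is
// [display name, [BCP 47 tag, optional region label], ...].
var langs = [
  ['English', ['en-AU', 'Australia'], ['en-GB', 'United Kingdom'], ['en-US', 'United States']],
  ['Français', ['fr-FR']],
  ['日本語', ['ja-JP']]
];

// Hypothetical helper: return the language tags available for one language.
function dialectCodes(list, name) {
  for (var i = 0; i < list.length; i++) {
    if (list[i][0] === name) {
      return list[i].slice(1).map(function (dialect) { return dialect[0]; });
    }
  }
  return [];
}

console.log(dialectCodes(langs, 'English')); // [ 'en-AU', 'en-GB', 'en-US' ]
```

Any one of the returned tags, such as en-GB, is the kind of value that would be assigned to recognition.lang.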

# Step 5 : Start recognition

Calling recognition.start() activates the speech recognizer. Once it begins capturing audio, it calls the onstart event handler, and then, for each new set of results, it calls the onresult event handler.

$("#start_button").click(function () {
  recognition.lang = select_dialect.value;
  recognition.start();
});
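
Calling start() on a recognizer that is already running throws an InvalidStateError, so a button that both starts and stops recognition needs to track the current state. The sketch below is one way to do that; createRecognitionToggle is a hypothetical name, and because it assigns onstart and onend directly, it would need to be merged with any handlers the application already sets.

```javascript
// Hypothetical helper: returns a function that starts recognition when idle
// and stops it when running. The state follows the recognizer's own events.
function createRecognitionToggle(recognition) {
  var recognizing = false;
  recognition.onstart = function () { recognizing = true; };
  recognition.onend = function () { recognizing = false; };

  return function toggle(lang) {
    if (recognizing) {
      recognition.stop();
      return false; // recognition was asked to stop
    }
    recognition.lang = lang;
    recognition.start();
    return true; // recognition was asked to start
  };
}

// Browser usage (assumes `recognition` and `select_dialect` from the steps above):
// var toggle = createRecognitionToggle(recognition);
// $("#start_button").click(function () { toggle(select_dialect.value); });
```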

That’s it! The rest of the code just enhances the user experience: it shows the user some informative messages and swaps the GIF image on the microphone button.
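
As an illustration of such messages, the sketch below maps some of the error codes defined by the Web Speech API specification to user-facing text. The message wording, the ERROR_MESSAGES table, and the describeError and showInfo names are hypothetical, not taken from the demo.

```javascript
// Hypothetical mapping from SpeechRecognitionErrorEvent.error codes to
// user-facing messages; the wording is illustrative only.
var ERROR_MESSAGES = {
  'no-speech': 'No speech was detected. Check your microphone volume.',
  'audio-capture': 'No microphone was found.',
  'not-allowed': 'Permission to use the microphone was denied.',
  'network': 'A network error interrupted recognition.'
};

function describeError(code) {
  // hasOwnProperty avoids matching inherited keys such as "constructor".
  if (Object.prototype.hasOwnProperty.call(ERROR_MESSAGES, code)) {
    return ERROR_MESSAGES[code];
  }
  return 'Speech recognition error: ' + code;
}

// Browser usage (assumes `recognition` from Step 2 and a showInfo function
// that displays text to the user):
// recognition.onerror = function (event) { showInfo(describeError(event.error)); };
```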


Conclusion

The Web Speech API is very useful for voice control, dialog scripting, and data entry. At the moment, however, among the major browsers it is only supported by Chrome on desktop and on Android phones. It would be good to see other modern browsers support this great feature in the future.


5 Comments

  • TC on Mar 17, 2020, Reply

    Thanks for your article, I wonder to know whether it supports to read a local audio file and convert into text? Or should I only to use Google Cloud Speech API to do this? Thanks!

    • benson_ruan on Dec 6, 2020, Reply

      I would suggest using the Google Cloud Speech API.

  • moe on Dec 24, 2020, Reply

    chrome cannot activated the microphone?

    • benson_ruan on Jan 4, 2021, Reply

      Chrome should prompt you to grant microphone access on laptops and Android phones. If you are using an iPhone, only the Safari browser can access the microphone.
