Voice to Text with Chrome Web Speech API
Since Google Chrome shipped support for the Web Speech API in version 25 (released in 2013), web apps have been able to convert voice to text, opening up a whole new world of opportunities.
With the demo below, you can use Google Chrome as a voice recognition app and type long documents, emails and school essays without touching the keyboard.
Voice to Text demo
GitHub repository
You can download the complete code of the above demo in the link below:
Implementation
You might be thinking “functionality like Speech to Text is pretty complex to implement.” Well, you’d be right if you had to train a speech recognition model from scratch. But thanks to Google, the hard work has already been done: by utilizing Chrome’s built-in Web Speech API, you can turn your Chrome browser into a Voice to Text app. Let’s explore the details below.
Folder structure
- images – contains the mic images
- js – contains the JavaScript files
- languages.js – list of supported languages
- web-speech-api.js – main application JavaScript
- style – contains the CSS style file
- index.html – main HTML file
# Step 1 : Check browser support
At the time of writing, Chrome is the major browser that supports the Speech to Text API, using Google’s speech recognition engines.
You can tell whether the browser supports the Web Speech API by checking if the webkitSpeechRecognition object exists.
if ('webkitSpeechRecognition' in window) {
  // speech recognition API supported
} else {
  // speech recognition API not supported
}
# Step 2 : Create speech recognition object
The next step is to create a new speech recognition object.
recognition = new webkitSpeechRecognition();
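Note that newer Chrome builds also expose the constructor without the webkit prefix. A small, hedged sketch of a detection helper (the helper name is my own, not from the demo):

```javascript
// Hypothetical helper (not in the demo's source): pick whichever
// SpeechRecognition constructor the environment exposes, preferring
// the unprefixed name over the webkit-prefixed one.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In the browser:
if (typeof window !== 'undefined') {
  var SpeechRecognitionCtor = getSpeechRecognition(window);
  var recognition = SpeechRecognitionCtor ? new SpeechRecognitionCtor() : null;
}
```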
# Step 3 : Register event handlers
The speech recognition object has many properties, methods and event handlers. For the full list, please refer to https://w3c.github.io/speech-api/#speechreco-section
interface SpeechRecognition : EventTarget {
  // recognition parameters
  attribute SpeechGrammarList grammars;
  attribute DOMString lang;
  attribute boolean continuous;
  attribute boolean interimResults;
  attribute unsigned long maxAlternatives;

  // methods to drive the speech interaction
  void start();
  void stop();
  void abort();

  // event methods
  attribute EventHandler onaudiostart;
  attribute EventHandler onsoundstart;
  attribute EventHandler onspeechstart;
  attribute EventHandler onspeechend;
  attribute EventHandler onsoundend;
  attribute EventHandler onaudioend;
  attribute EventHandler onresult;
  attribute EventHandler onnomatch;
  attribute EventHandler onerror;
  attribute EventHandler onstart;
  attribute EventHandler onend;
};
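Of these handlers, onerror is worth wiring up explicitly: it receives an event whose error field is a short code defined by the spec (e.g. no-speech, audio-capture, not-allowed). A sketch that turns those codes into user-facing text (the helper name and message wording are illustrative, not from the demo):

```javascript
// Map common Web Speech API error codes to readable messages.
// The codes are defined by the spec; the wording is my own.
function describeRecognitionError(code) {
  var messages = {
    'no-speech': 'No speech was detected.',
    'audio-capture': 'No microphone was found.',
    'not-allowed': 'Microphone permission was denied.',
    'network': 'A network error interrupted recognition.'
  };
  return messages[code] || 'Recognition error: ' + code;
}

// In the browser:
// recognition.onerror = function (event) {
//   console.warn(describeRecognitionError(event.error));
// };
```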
Below I highlight the important parts for this application.
recognition.continuous = true;
recognition.interimResults = true;
When recognition.continuous is set to true, the recognizer keeps listening and keeps returning results as you speak, instead of stopping after the first utterance. When recognition.interimResults is set to true, interim (not-yet-final) results are returned as well.
recognition.onresult = function(event) {
  var interim_transcript = '';
  // final_transcript is declared outside the handler so final results accumulate
  for (var i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      final_transcript += event.results[i][0].transcript;
    } else {
      interim_transcript += event.results[i][0].transcript;
    }
  }
  final_transcript = capitalize(final_transcript);
  final_span.innerHTML = linebreak(final_transcript);
  interim_span.innerHTML = linebreak(interim_transcript);
};
Let’s explore the recognition.onresult event to better understand what is returned. The handler receives a SpeechRecognitionEvent, which contains the following fields:
- event.results – the list of recognition results; event.results[i] is the result object produced at the i-th recognition stage, holding one or more alternatives.
- event.resultIndex – the index of the first result that changed in this event.
- event.results[i][j] – the j-th alternative for result i. The first element is the most probable one.
- event.results[i].isFinal – a Boolean showing whether this result is final or interim.
- event.results[i][j].transcript – the text representation of the alternative.
- event.results[i][j].confidence – the probability that the alternative was decoded correctly (a value from 0 to 1).
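The onresult handler shown earlier also calls two helpers, capitalize and linebreak. Minimal sketches of what they might look like (the repository’s actual implementations may differ):

```javascript
// Uppercase the first non-whitespace character of the transcript.
function capitalize(s) {
  return s.replace(/\S/, function (m) { return m.toUpperCase(); });
}

// Convert newlines in the transcript into HTML paragraph/line breaks.
function linebreak(s) {
  return s.replace(/\n\n/g, '<p></p>').replace(/\n/g, '<br>');
}
```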
# Step 4 : Language selection
Chrome speech recognition supports numerous languages. If your users speak a language other than English, you can improve their results by setting the language parameter recognition.lang:
recognition.lang = select_dialect.value;
# Step 5 : Start recognition
Calling recognition.start() activates the speech recognizer. Once it begins capturing audio, the onstart event handler fires, and then for each new set of results the onresult event handler fires.
$("#start_button").click(function () {
  recognition.lang = select_dialect.value;
  recognition.start();
});
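The snippet above only wires a start button, but the API also provides stop() and abort(). A hedged sketch of a start/stop toggle (the helper name and the recognizing flag are my own; such a flag would be flipped to true in onstart and back to false in onend):

```javascript
// Hypothetical toggle: start recognition when idle, stop when active.
// `recognizing` tracks whether a capture session is in progress.
function toggleRecognition(recognition, recognizing) {
  if (recognizing) {
    recognition.stop();   // finish the current capture; onend will fire
  } else {
    recognition.start();  // begin capturing audio; onstart will fire
  }
  return !recognizing;    // the caller stores this as the new flag value
}
```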
That’s it! The rest of the code just enhances the user experience: it shows the user some informative messages and swaps the GIF image on the microphone button.
Conclusion
The Web Speech API is very useful for voice control, dialog scripting, and data entry. But at the moment, among the major browsers it is only supported by Chrome on desktop and Android phones. It would be good to see this great feature supported by other modern browsers in the future.
Thank you for reading. If you like this article, please share on Facebook or Twitter. Let me know in the comment if you have any questions. Follow me on Medium, GitHub and Linkedin. Support me on Ko-fi.
5 Comments
Thanks for your article! I wonder whether it supports reading a local audio file and converting it into text, or should I use the Google Cloud Speech API for that? Thanks!
I would suggest using the Google Cloud Speech API for that.
Chrome cannot activate the microphone?
Chrome should prompt you to grant microphone access on laptops and Android phones. If you are using an iPhone, only the Safari browser can access the microphone.