JavaScript Speech Recognition

Who is this guy?


macdonst on GitHub

works at Adobe

Apache Cordova core contributor

nutty about speech recognition

The future won't be like Star Trek.

Scott Adams, creator of Dilbert

Why do I care about speech rec?

Cape Breton Island


= Cape Bretoner

Here's a conversation between two Cape Bretoners

P1: jeet?

P2: naw, jew?

P1: naw, t'rly t'eet bye.

And here's the translation

P1: jeet?

P1: Did you eat?

P2: naw, jew?

P2: No, did you?

P1: naw, t'rly t'eet bye.

P1: No, it's too early to eat buddy.

Regular Alphabet

26 letters

Cape Breton Alphabet

12 letters!

Alright, enough about me

What is speech recognition?

Speech recognition is the process of translating the spoken word into text.

The process of speech rec includes...

Record and digitize the audio data

Perform end pointing (trimming)

Split data into phonemes

What is a phoneme?

A phoneme is a perceptually distinct unit of sound in a specified language that distinguishes one word from another.

The English language has 44 distinct sounds

Source: English language phoneme chart

By comparison, the Rotokas speakers in Papua New Guinea have 11 phonemes.

But the !Xóõ speakers who mostly live in Botswana have 112 phonemes.

Apply the phonemes to the recognition model. This is a massive lexicon which takes into account all of the different ways words can be pronounced.

Analyze the results against the grammar

Return a confidence weighted result

    [
      { "confidence": 0.97335243225098, "transcript": "hello" },
      { "confidence": 0.19940405040800, "transcript": "hell low" },
      { "confidence": 0.19910827091000, "transcript": "how low" }
    ]
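As a rough sketch of how an app might consume a confidence-weighted result like this one, the helper below picks the highest-confidence alternative and rejects it if it falls under a threshold. The name `pickTranscript` and the 0.5 cut-off are assumptions made for illustration; they are not part of any spec.

```javascript
// Hypothetical helper: return the transcript of the most confident
// alternative, or null if nothing reaches minConfidence.
function pickTranscript(alternatives, minConfidence) {
  var best = null;
  for (var i = 0; i < alternatives.length; i++) {
    if (best === null || alternatives[i].confidence > best.confidence) {
      best = alternatives[i];
    }
  }
  if (best !== null && best.confidence >= minConfidence) {
    return best.transcript;
  }
  return null;
}

var alternatives = [
  { confidence: 0.97335243225098, transcript: "hello" },
  { confidence: 0.19940405040800, transcript: "hell low" },
  { confidence: 0.19910827091000, transcript: "how low" }
];

console.log(pickTranscript(alternatives, 0.5)); // prints "hello"
```

Returning null instead of a weak guess lets the UI ask the speaker to repeat rather than act on "hell low".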


We want it to be like this

but more often than not...

Why is that?

When two people talk, comprehension rates are better than 97%

A really good English language speech recognition system is correct 92% of the time

Where does that extra 5% in error rate come from?

  • Vocabulary size and confusability
  • Speaker dependence vs independence
  • Isolated or continuous speech
  • Initiated vs spontaneous speech
  • Adverse conditions
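The slides don't say how the 92% figure was measured. One common yardstick for recognizers is word error rate: the number of word substitutions, insertions and deletions needed to turn the recognized text into the reference text, divided by the number of reference words. A minimal sketch, with `wordErrorRate` as a made-up name:

```javascript
// Word error rate via word-level edit distance (Levenshtein).
// Assumes a non-empty reference; words are split on whitespace.
function wordErrorRate(reference, hypothesis) {
  var ref = reference.trim().split(/\s+/);
  var hyp = hypothesis.trim().split(/\s+/);
  var j;

  // prev[j] = edits needed to turn the first j hypothesis words
  // into the reference words processed so far.
  var prev = [];
  for (j = 0; j <= hyp.length; j++) {
    prev[j] = j;
  }
  for (var i = 1; i <= ref.length; i++) {
    var curr = [i];
    for (j = 1; j <= hyp.length; j++) {
      var cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, recognizing "no jew" when the speaker said "no did you" costs one substitution and one deletion over three reference words, a word error rate of 2/3.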

Mobile Speech Recognition

OS  Application  SDK
Android  Google Now  Java API
iOS  Siri  Many 3rd-party Obj-C SDKs, SFSpeechRecognizer (iOS 10+)

So how do we add speech rec to our app?

You may look at the W3C Speech API Specification

but only Chrome and Firefox have implemented that spec

The spec looks like this:

interface SpeechRecognition : EventTarget {
    // recognition parameters
    attribute SpeechGrammarList grammars;
    attribute DOMString lang;
    attribute boolean continuous;
    attribute boolean interimResults;
    attribute unsigned long maxAlternatives;
    attribute DOMString serviceURI;

    // methods to drive the speech interaction
    void start();
    void stop();
    void abort();

With additional event handler attributes to control behaviour:

attribute EventHandler onstart;
attribute EventHandler onaudiostart;
attribute EventHandler onsoundstart;
attribute EventHandler onspeechstart;
attribute EventHandler onspeechend;
attribute EventHandler onsoundend;
attribute EventHandler onaudioend;
attribute EventHandler onend;

attribute EventHandler onresult;
attribute EventHandler onnomatch;
attribute EventHandler onerror;
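To make the interface above a bit more concrete, here is a hedged sketch of configuring a recognizer and wiring up a few of its handlers. The helper name `createRecognizer`, the `"en-US"` language tag and the `maxAlternatives` value of 3 are assumptions for illustration. The `webkitSpeechRecognition` fallback covers browsers that only ship the prefixed constructor.

```javascript
// Hypothetical helper: build a configured recognizer from a window-like
// object, or return null when no constructor is available.
function createRecognizer(root, onTranscript, onError) {
  var Recognition = root.SpeechRecognition || root.webkitSpeechRecognition;
  if (!Recognition) {
    return null;
  }

  var recognition = new Recognition();
  recognition.lang = "en-US";          // assumed language tag
  recognition.continuous = false;      // stop after one utterance
  recognition.interimResults = false;  // deliver final results only
  recognition.maxAlternatives = 3;     // assumed: ask for up to three guesses

  recognition.onresult = function (event) {
    // results[0] is the first recognized phrase, [0] its top alternative.
    var best = event.results[0][0];
    onTranscript(best.transcript, best.confidence);
  };
  recognition.onnomatch = function () {
    onError("no-match");
  };
  recognition.onerror = function (event) {
    onError(event.error);
  };
  return recognition;
}
```

In a page you would pass `window` as `root` and call `start()` on the returned object, typically from a click handler so the browser can prompt for microphone access.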

Let's recognize some speech

var recognition = new SpeechRecognition();
recognition.onresult = function(event) {
  if (event.results.length > 0) {
    var test1 = document.getElementById("test1");
    test1.innerHTML = event.results[0][0].transcript;
  }
};
recognition.start();

So that's pretty cool...

But I want to do something more exciting with the result

Let's ask the web a question

Works pretty well...

...but ugly!

Let's style our button with some CSS

<a class="speechinput">
    <img src="images/mic.png">
</a>

#speechinput input {
	-webkit-transform: scale(3.0, 3.0);
}

And we'll add some color using



Pure-CSS-Speech-Bubbles by Nicolas Gallagher

Then pull it all together!

But wait, why am I using my eyes like a sucker?

We'll output the answer using SpeechSynthesis

Pretty much all browsers have implemented this spec

The SpeechSynthesis spec looks like this:

interface SpeechSynthesis {
      readonly attribute boolean pending;
      readonly attribute boolean speaking;
      readonly attribute boolean paused;

      void speak(SpeechSynthesisUtterance utterance);
      void cancel();
      void pause();
      void resume();
      SpeechSynthesisVoiceList getVoices();

The SpeechSynthesisUtterance spec looks like this:

interface SpeechSynthesisUtterance : EventTarget {
      attribute DOMString text;
      attribute DOMString lang;
      attribute DOMString voiceURI;
      attribute float volume;
      attribute float rate;
      attribute float pitch;

With additional event handler attributes to control behaviour:

      attribute EventHandler onstart;
      attribute EventHandler onend;
      attribute EventHandler onerror;
      attribute EventHandler onpause;
      attribute EventHandler onresume;
      attribute EventHandler onmark;
      attribute EventHandler onboundary;
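Putting the two interfaces together, speaking an answer might look roughly like the sketch below. The helper name `say` and the `"en-US"` language tag are assumptions; the volume, rate and pitch values shown are simply the defaults written out.

```javascript
// Hypothetical helper: speak `text` through a window-like object.
// Returns the utterance, or null when speech synthesis is unavailable.
function say(root, text, onDone) {
  if (!root.speechSynthesis || !root.SpeechSynthesisUtterance) {
    return null;
  }

  var utterance = new root.SpeechSynthesisUtterance(text);
  utterance.lang = "en-US"; // assumed language tag
  utterance.volume = 1.0;   // 0 to 1
  utterance.rate = 1.0;     // 0.1 to 10, where 1 is normal speed
  utterance.pitch = 1.0;    // 0 to 2

  utterance.onend = function () {
    if (onDone) {
      onDone();
    }
  };

  // Utterances are queued; speak() returns immediately.
  root.speechSynthesis.speak(utterance);
  return utterance;
}
```

In a page, `say(window, "It is too early to eat")` would queue the sentence, and the optional callback fires once it has been spoken.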

But wait, one more thing...

What if I want continuous speech rec?

Use Annyang!

An incredible library by Tal Ater

2kb and no dependencies

Setup looks like this:

<script src="//"></script>
<script>
if (annyang) {
  // Let's define a command.
  var commands = {
    'hello': function() { alert('Hello world!'); }
  };

  // Add our commands to annyang
  annyang.addCommands(commands);

  // Start listening.
  annyang.start();
}
</script>

The real genius is in the command grammar

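var commands = {
    'show tps report': function() { alert('TPS Report!'); },
    // Assumption: the assignment target was lost from the original slide; document.body.style.cssText is a guess.
    'turn the background color *color': function(color) { document.body.style.cssText = 'background-color: ' + color; },
    'say hello (to the attendees) computer': sayHello
};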
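To see why this grammar is handy, here is a simplified sketch of how a phrase with a `*splat` and `(optional words)` could be turned into a regular expression and matched against what was said. This is an illustration of the idea only, not Annyang's actual code; `commandToRegExp` and `runCommand` are names chosen for this example.

```javascript
// Illustrative only: turn a command phrase into a RegExp.
//   *name    captures the rest of what was said (a "splat")
//   (words)  marks words the speaker may leave out
function commandToRegExp(command) {
  // Escape regex metacharacters other than the ones the grammar uses.
  var escaped = command.replace(/[\-{}\[\]+?.,\\\^$|#]/g, "\\$&");

  // Handle optional groups and splats in one pass, so the regex syntax
  // emitted for one is never re-read as the other.
  var pattern = escaped.replace(/\s*\((.*?)\)\s*|\*\w+/g, function (match, optionalWords) {
    if (optionalWords !== undefined) {
      return "\\s*(?:" + optionalWords + ")?\\s*";
    }
    return "(.+)";
  });

  return new RegExp("^" + pattern + "$", "i");
}

// Run the first command whose phrase matches; returns true if one ran.
function runCommand(commands, spoken) {
  for (var phrase in commands) {
    if (Object.prototype.hasOwnProperty.call(commands, phrase)) {
      var match = commandToRegExp(phrase).exec(spoken.trim());
      if (match) {
        // Captured splats become the callback's arguments.
        commands[phrase].apply(null, match.slice(1));
        return true;
      }
    }
  }
  return false;
}
```

With this sketch, "turn the background color dark blue" calls the handler with `"dark blue"`, and both "say hello computer" and "say hello to the attendees computer" match the third phrase.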

And if I want to develop hybrid apps using Apache Cordova/PhoneGap/Ionic?

Plugin repos


OS  Recognition  Synthesis
Android  ✓  ✓
iOS*  ✓  Native to iOS 7.0+
* Thanks to Julio César (@jcesarmobile) for his work on iOS

Getting started

phonegap create speech com.example.speech speech
cd speech
phonegap platform add android
phonegap plugin add
phonegap plugin add
phonegap run android
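Once the app is running, plugin APIs are only safe to use after Cordova fires `deviceready`. The sketch below assumes the recognition plugin exposes a W3C-style `SpeechRecognition` constructor on the global object; that detail is an assumption, not something shown in the slides, and `startWhenReady` is a made-up helper name.

```javascript
// Hypothetical helper: wait for Cordova's deviceready event, then start
// recognizing and pass the first transcript to onTranscript.
function startWhenReady(doc, root, onTranscript) {
  doc.addEventListener("deviceready", function () {
    if (!root.SpeechRecognition) {
      return; // plugin missing or not installed for this platform
    }
    var recognition = new root.SpeechRecognition();
    recognition.onresult = function (event) {
      if (event.results.length > 0) {
        onTranscript(event.results[0][0].transcript);
      }
    };
    recognition.start();
  }, false);
}
```

In the app's `index.js` this would be called as `startWhenReady(document, window, callback)`.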

For more information on hybrid applications, seek me out during the conference; I can talk your ear off.

Types of Speech Recognition Applications

  • Voice Web Search
  • Speech Command Interface
  • Continuous Recognition of Open Dialog
  • Domain Specific Grammars Filling Multiple Input Fields
  • Speech UI present when no visible UI need be present
  • Voice Activity Detection
  • Speech Translation
  • Multimodal Interaction
  • Speech Driving Directions