We Need More Science: How Crowdsourcing Will Help Startups Build Their Own Versions of Siri

Wit.ai currently offers a demo of their product online. Wit.ai

Speech recognition is hard, even for the world’s largest tech companies. Apple and Google draw on massive collections of recordings of real speech patterns to help tune the voice recognition algorithms that power Siri and Google Now. And even though those tools are impressive, they still spend an awful lot of time mangling your voice commands.

Building speech-powered applications is even harder for smaller companies that just don’t have access to the sort of resources that Apple and Google do. In short, they can’t draw on the massive set of real voice commands that the big guys can. “When there’s a single developer, you never have enough examples to get good,” says Alexandre Lebrun.

That’s why he started Wit.ai, a service that helps developers pool their voice samples together to power a speech and natural language recognition system that Lebrun hopes will soon rival the depth and breadth of the tools available to the likes of Apple and Google. In the years to come, this could become increasingly important, as developers build the next wave of technologies that require speech interfaces, such as internet-connected appliances and wearable devices that don’t have screens.

Wit.ai is still new, but it has already attracted thousands of developers to its beta service, and on Wednesday, the company announced that it had just raised $3 million in seed funding from venture capital firm Andreessen Horowitz.

The Elephant in the Speech Rec Room

The company was born out of Lebrun’s frustrations with his experience at his previous company, VirtuOz, which developed speech recognition systems for companies like AT&T. The problem was that for each new system it built, the VirtuOz team had to start over—practically from scratch.

For each one, they had to gather a new set of voice samples to train the system. In many cases, there was overlap between the sets of commands different customers wanted to be able to recognize, but VirtuOz couldn’t reuse voice examples from one customer’s project to another.

“No matter how hard we tried, the elephant in the room was there – speech was never going to be perfect,” he wrote in a blog post today. “In fact, the end-user experience was sometimes catastrophic. Worse, because of the very high setup price to integrate voice into a system, no single vendor could truly address the needs of smaller companies or developers.”

Last year, Lebrun sold VirtuOz to Nuance, the speech recognition company that helps power Siri, and then he launched Wit.ai.

The Wit.ai team. Wit.ai

How It Works

Typically, a speech recognition developer begins by creating what’s called a “grammar,” a collection of words and phrases that the computer should be able to recognize. Then developers “train” the computer to recognize that grammar by feeding it as many different examples of people saying those words and phrases as they can. Since different users may phrase their commands differently, a grammar needs to be robust, covering as many different ways of expressing the same desire as possible.
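To make the idea concrete, here is a minimal sketch in Python of a grammar as a mapping from one intent to several phrasings, paired with a naive word-overlap matcher. The intent names, the example phrases, and the `match_intent` function are all hypothetical illustrations, not anything taken from Wit.ai. Real recognizers rely on trained statistical models rather than this kind of overlap scoring, and they start from audio, not text.

```python
import re

# Hypothetical grammar: each intent lists several ways a user might phrase it.
GRAMMAR = {
    "lights_on": ["turn on the lights", "switch the lights on", "lights on please"],
    "set_alarm": ["set an alarm", "wake me up at seven", "set my alarm for the morning"],
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def match_intent(utterance, grammar=GRAMMAR, threshold=0.5):
    """Return the intent whose example phrase best overlaps the utterance.

    Overlap is the Jaccard similarity between word sets. Returns None when
    no phrase reaches the threshold.
    """
    words = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, phrases in grammar.items():
        for phrase in phrases:
            phrase_words = tokenize(phrase)
            union = words | phrase_words
            if not union:
                continue
            score = len(words & phrase_words) / len(union)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None
```

In this toy version, adding more phrasings under an intent is what makes the grammar more robust: a command worded in an unfamiliar way only matches if some stored example is close enough to it.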

What Wit.ai is essentially doing is making it possible for companies to share grammars and training data in the same way that software developers share code on sites like GitHub. And just as developers can create their own copies of code hosted on GitHub to modify as they like, they can copy grammars to modify for their own applications.
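The copy-and-modify workflow can be sketched in a few lines of Python. This is only an illustration under assumed names: the dictionary layout and the `fork_grammar` function are hypothetical, and the article does not describe how Wit.ai actually stores or copies grammars.

```python
import copy


def fork_grammar(shared, additions):
    """Return a private copy of a shared grammar, extended with extra phrases.

    The shared grammar is left untouched, much like forking a repository.
    """
    forked = copy.deepcopy(shared)
    for intent, phrases in additions.items():
        existing = forked.setdefault(intent, [])
        for phrase in phrases:
            if phrase not in existing:
                existing.append(phrase)
    return forked


# Hypothetical shared grammar published by another developer.
shared_grammar = {"lights_on": ["turn on the lights", "lights on please"]}

# A private fork that adds one phrasing and one entirely new intent.
my_grammar = fork_grammar(shared_grammar, {
    "lights_on": ["illuminate the room"],
    "lights_off": ["turn off the lights"],
})
```

The deep copy matters here: changes to the fork never leak back into the shared original, so each developer can tailor a grammar without affecting anyone else's.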

The business model is similar to GitHub’s as well. Just as GitHub is free for anyone who shares their code publicly, Wit.ai is free to anyone who shares their data. The actual voice recordings used to train the system won’t be shared, for reasons of privacy and practicality. Companies that, for whatever reason, don’t want to share their grammars or data can pay a fee to use the service.

The Free Proposition

Wit.ai joins a growing number of companies and projects aimed at helping developers bring speech recognition to their applications. There are also open source projects such as Julius and CMU Sphinx, and other hosted services such as Google’s voice-to-text offering. Wit.ai also interprets the speech it recognizes, trying to determine what exactly the user wants to do.

By offering a free service, Lebrun hopes to attract a huge number of different grammars and a large volume of training data, allowing Wit.ai to offer speech recognition capabilities on par with those of Apple and Google.

One big downside is that all the audio has to travel across the internet to the company’s servers. That means there could be issues with latency, availability and privacy. But Lebrun says that a “hybrid” version that works mostly on the client side and then exchanges information with the server is on the way.
