Mashing Up Human Smarts and AI to Unlock Computer Vision

Computer vision is quickly advancing, but it tends to trickle into the world in scattered, specific applications. We encounter it when Facebook automatically tags a friend in a photo, or when Google suggests images similar to one we’re searching for. But the real promise is much more exciting. A camera, properly trained, could answer simple, human questions like: “Are my kids home from school?” or “Is there a parking spot open at work?” or “How many people are in line at Shake Shack?” In other words, computer vision could make our homes and our cities smart.

Today, our machines don’t understand these sorts of queries. The researchers behind Zensors want to change that. The project, developed at Carnegie Mellon University, aims to make computer vision more accessible through a clever combination of human smarts and artificial intelligence. Though it’s only a proof of concept for now, it takes a compelling approach to the problem.

Say you’re a sandwich shop owner who wants to track how many people are in line throughout the day. Here’s the Zensors vision: You mount an old smartphone on the wall, point it at your register, and ask the Zensors app how many people are waiting. The novelty is what happens behind the scenes. First, Zensors conveys your question to humans—the Carnegie Mellon researchers used crowdsourced workers while developing the concept. These workers receive images from the smartphone and, for a small fee, count the people in each one and tag the results. The processed images are simultaneously used to train a machine learning algorithm that also tries to count the waiting patrons. When the AI is as good as the humans, it takes over. The handoff happens seamlessly; all the business owner knows is that, within minutes of setting up the camera, Zensors provided the answer to his question for a reasonable sum.
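
The loop described above—crowdsourced labels answering the question right away while doubling as training data, until the model matches human accuracy—can be sketched roughly as follows. This is a minimal illustration in Python; the names (CrowdLabeler, CounterModel, answer_question) are hypothetical stand-ins, not anything from the actual Zensors system.

    class CrowdLabeler:
        """Stand-in for the crowdsourced workers who count people in each frame."""

        def label(self, frame):
            # In a real system this would send the image out and wait for an answer.
            return frame["true_count"]


    class CounterModel:
        """Stand-in for the machine learning model being trained on labeled frames."""

        def __init__(self):
            self.examples = []

        def train(self, frame, label):
            self.examples.append((frame, label))

        def predict(self, frame):
            # Placeholder: a real model would learn visual features of this scene.
            return self.examples[-1][1] if self.examples else 0

        def agreement_with_humans(self, held_out):
            if not held_out:
                return 0.0
            return sum(self.predict(f) == lbl for f, lbl in held_out) / len(held_out)


    def answer_question(frames, threshold=0.95):
        """Send frames to humans until the model matches them, then hand off."""
        humans, model, validation = CrowdLabeler(), CounterModel(), []
        for frame in frames:
            if model.agreement_with_humans(validation) >= threshold:
                yield model.predict(frame)           # the AI has taken over
            else:
                label = humans.label(frame)          # a worker is paid a small fee
                model.train(frame, label)
                validation.append((frame, label))
                yield label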

The approach solves one of the big problems with computer vision: its inflexibility. “Computer vision has made fantastic strides, and yet a lot of it is pretty specific to a situation,” says Jason Wiese, one of the researchers who worked on the project. In technical parlance, AI-trained computer vision systems are “brittle”—they often don’t adapt well to unfamiliar environments or unexpected behavior. Because every sandwich shop has a different layout, and because every camera will have a different vantage on the action, it’s hard to create a universal “line counting” algorithm. Zensors would get around this by using just the amount of human power necessary to familiarize a computer with a specific scene. “We see this as a good way of bringing computer vision to the masses,” Wiese says.

It would almost certainly be cheaper than building a solution from scratch. The Carnegie Mellon group broke down the economics in a paper presented at a human-computer interaction conference last week in Seoul. The researchers asked a number of programmers how much it would cost to develop a custom computer vision system for determining whether a bus had arrived at a bus stop. The average quote: $3,000. Zensors used its own approach to develop working sensors for a number of similarly complex questions: “How many cars are in this parking lot?” “How messy is the sink?” “Is the dishwasher door open?” On average, the algorithms could be trained in the span of a week, with humans processing a handful of images each day. Pegged to minimum wage, the cheapest sensor was trained for $5. The most expensive cost $40.
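
To get a feel for how those numbers pencil out, here is a back-of-the-envelope calculation. The wage, per-image labeling time, and sampling rate below are assumptions for illustration only; the paper reports just the resulting totals.

    # Rough labeling-cost estimate. The figures below are assumed, not from the paper.
    MIN_WAGE_PER_HOUR = 7.25   # US federal minimum wage
    SECONDS_PER_IMAGE = 10     # assumed time for a worker to count and tag one frame
    IMAGES_PER_DAY = 100       # assumed sampling rate from the camera
    TRAINING_DAYS = 7          # roughly a week, per the article

    labeled_images = IMAGES_PER_DAY * TRAINING_DAYS
    labor_hours = labeled_images * SECONDS_PER_IMAGE / 3600
    cost = labor_hours * MIN_WAGE_PER_HOUR
    print(f"{labeled_images} images, {labor_hours:.1f} labor hours, ${cost:.2f}")
    # 700 images, 1.9 labor hours, $14.10 -- within the reported $5 to $40 range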

The Zensors team is still working on the platform. But the real ambition for Zensors extends beyond answering questions. The model could also bring API-like structure to video feeds, which could then be used by other applications. While the motion sensors in your iPhone make their data available to third-party apps like Nike and MyFitnessPal, there’s no comparable API for easily pulling data from video feeds. With Zensors, the sandwich maker could not only track how his line fluctuated throughout the day, but also use that data to inform other actions, pinging someone to open a second register, say, when more than six people were waiting. Think IFTTT with a video feed as a trigger.
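
A trigger built on top of such a feed might look something like the sketch below. It assumes a hypothetical queue_length() call standing in for a Zensors-style endpoint; none of these names come from the actual platform.

    import time

    def queue_length():
        """Hypothetical call returning the latest answer to 'How many people are in line?'"""
        # e.g., an HTTP GET against a Zensors-style endpoint for this camera and question
        return None

    def notify_staff(message):
        """Hypothetical hook that would text or page an employee."""
        print(message)

    def watch_line(threshold=6, poll_seconds=60):
        """Ask someone to open a second register whenever the line gets too long."""
        while True:
            count = queue_length()
            if count is not None and count > threshold:
                notify_staff(f"{count} people in line; please open a second register.")
            time.sleep(poll_seconds)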

“Today we think of camera images as more or less an analog signal, and one without a lot of computational meaning. But the information is clearly there,” Wiese says. Algorithms may not be able to extract it on their own just yet—but they can with some time and a little human help.
