Ex-Googler Shares His Big-Data Secrets With the Masses


Theo Vassilakis and Toli Lerios (left to right). Theo Bryna Cain



Google’s search engine makes it wonderfully easy to locate stuff on the web, whether it’s in a news article, a corporate website, or a video on YouTube. But that only begins to describe Google’s ability to find information. Inside the company, engineers use several uniquely powerful tools for searching and analyzing its own massive trove of data.


One of those is Dremel, a tool that helps Google’s employees analyze data stored across thousands of machines, at unusually fast speeds. What’s more, Dremel lets the Google team manipulate all of this data using a language very similar to SQL, short for Structured Query Language, the standard way of grabbing information from databases.


Like most of its custom-built tools, Dremel is only available inside Google. But now, the rest of the world can hack data a little more like Google does, thanks to Quest, a Dremel-like query engine created by Theo Vassilakis, one of the lead developers of Dremel at Google, and Toli Lerios, a former engineer at Facebook. The tool is one of a growing number of tools that seek to mimic the way web giants like Google and Facebook rapidly analyze enormous amounts of online information stored across hundreds or even thousands of machines. These range from a project called Drill, from a company called MapR, to a sweeping open source platform called Spark.


Vassilakis and Lerios cooked up the idea for Quest in 2012. “We were looking inside of Google and Facebook at how hard it is to get data and combine data and produce useful results,” Vassilakis says. “And we thought about what’s going on at all these companies without 15,000 engineers.” So they quit their jobs and started their own company, Metanautix, and set about building Quest. Today, after two years of development, the product is available to any company that would like to use it.


The idea behind Quest is to make it simple for analysts to query data from anywhere in a company with a single tool, regardless of where that data is stored, without the need to learn new programming languages. Using Quest, analysts can query traditional sources such as Oracle’s flagship database, “big data” storage systems like Hadoop, log files, Word documents, images and media files, and more. But it isn’t just a search engine.


Just like Dremel, Quest lets you query data using a SQL-like language. “Our view is that if you can show people the traditional metaphors that they’re used to, such as tables and SQL queries, that’s the easiest way for them to get started,” he says. “We’re trying to support all the traditional metaphors without teaching people new things.”


Quest isn’t a database. It doesn’t store data. And although Quest can be used to move data around from system to system, it can also analyze data where it sits: rather than relocating the data, it makes copies and shuttles these copies through its own memory system. To accomplish all of this, Metanautix built connectors for several major storage systems, including Oracle, Hadoop and Amazon S3. And thanks to its use of the Java Virtual Machine, it can interface with just about any data source you can think of.


You could use it to correlate data from purchase orders stored in a data warehousing system in your own data center with product photos stored in the cloud, for example, or analyze web analytics data stored in Hadoop with customer profiles stored in an Oracle database, and throw in some information lying around in Word documents on the company shared drive for good measure.
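To make the idea concrete, here is a rough sketch of what "one SQL query across two separate stores" can look like. It is not Quest's syntax or API, which the article does not show. It uses Python's built-in sqlite3 module, with two independent in-memory databases standing in for the analytics store and the customer database. The table names, column names, and sample rows are all invented for illustration.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Set up two separate in-memory databases behind one connection.

    Stand-ins only (assumptions, not Quest): 'main' plays the web analytics
    store and 'profiles' plays the customer database.
    """
    conn = sqlite3.connect(":memory:")
    # ATTACH creates a second, independent in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS profiles")
    conn.execute("CREATE TABLE main.page_views (customer_id INTEGER, url TEXT)")
    conn.execute(
        "CREATE TABLE profiles.customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO main.page_views VALUES (?, ?)",
        [(1, "/pricing"), (1, "/signup"), (2, "/pricing")],
    )
    conn.executemany(
        "INSERT INTO profiles.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    return conn


def views_per_customer(conn: sqlite3.Connection) -> list:
    """Join rows from both databases with a single ordinary SQL query."""
    query = """
        SELECT c.name, COUNT(*) AS views
        FROM main.page_views AS v
        JOIN profiles.customers AS c ON c.customer_id = v.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(views_per_customer(build_demo_connection()))
```

The point of the sketch is only that the analyst writes a familiar join and does not care which store each table lives in. A real federated engine would do that across entirely different systems rather than two SQLite databases.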


It can also keep track of the changes you make to your data. That’s a big part of what sets Quest apart from many other big data tools, says Mark Madsen, founder of the analyst firm Third Nature. Companies in regulated industries—from health care to finance to pharmaceuticals—need to be able to provide an audit trail to prove their compliance with the law. That’s not something that many new age data analytics tools account for, Madsen says.


There are a few other Dremel clones out there already, such as Cloudera’s Impala and MapR’s Drill. But these other projects are more concerned with collecting data, says Madsen, while Quest is focused on manipulating data. “Data in its raw form isn’t that useful,” he says. “You have to do things to it. You have to shape, and discard the stuff you don’t need.”


