Startup Crunches 100 Terabytes of Data in a Record 23 Minutes





There’s a new record holder in the world of “big data.”


On Friday, Databricks—a startup spun out of the University of California, Berkeley—announced that it has sorted 100 terabytes of data in a record 23 minutes using a number-crunching tool called Spark, eclipsing the previous record held by Yahoo and the popular big-data tool Hadoop.


The feat is impressive in and of itself, but it’s also a sign that the world of big data—where dozens, hundreds, or even thousands of machines can be used to sort and analyze massive amounts of online information—continues to evolve at a rather rapid pace. Hadoop has long served as the poster child for the big-data movement, but in recent years, the state of the art has moved well beyond the original ideas that spawned it.


Based on research papers that Google published about its own big-data systems in 2003 and 2004, Hadoop sprang up at Yahoo, and it’s now used by many of the web’s biggest names, from Facebook to Twitter to eBay. In the beginning, it wasn’t something that operated in “real time”—when crunching large amounts of data, you had to wait a while—but now, Spark and other tools, many based on Hadoop, are analyzing massive datasets at much greater speeds.


One of the main problems with Hadoop MapReduce—the original platform—is that it’s a “batch system.” That means it crunches data in batches. It takes a while to crunch each set of information, and if you want to add more data to the process, you have to start over with a new batch. But the state of the art has improved dramatically since Google released those papers in 2003 and 2004. These days, Google uses newer tools like Dremel to analyze enormous amounts of data in near real-time, and the open source world is scrambling to keep up.
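To make the “batch” idea concrete, here’s a minimal word-count sketch, written in plain Python rather than Hadoop’s actual Java API, that mimics the map, shuffle, and reduce phases. The job runs over a fixed batch of input; if new data arrives, the whole thing has to run again from the start.

```python
# A minimal sketch of the batch map/shuffle/reduce pattern, in plain Python
# (not Hadoop's real Java API). The key point: the job runs over a fixed
# batch, so adding new data means rerunning the entire job.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs for every word in the input batch."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

batch = ["to be or not to be", "that is the question"]
print(reduce_phase(shuffle(map_phase(batch))))
```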


Developed by researchers at the University of California, Berkeley who are now commercializing the tech through Databricks, Spark is just one part of this movement. The Silicon Valley startup Cloudera offers a system called Impala, while competitor MapR is developing a Dremel-style tool called Drill. Meanwhile, the open source Hadoop project now offers a new interface called YARN.


Part of Spark’s appeal is that it can process data in computer memory, as opposed to just pulling it from hard disks, which operate at much slower speeds. But because the amount of data that can fit in memory is limited, the tool can process data on disks as well, and that’s what Databricks was trying to highlight as it sought to break Yahoo’s record on the Gray Sort, which measures the time it takes to sort 100 terabytes of data, aka 100,000 gigabytes.
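For a rough sense of what a Spark-based sort looks like at the API level, here’s a short PySpark sketch of a distributed sort-by-key. The input path, record format, and partition count are hypothetical, and the actual benchmark run relied on far more carefully tuned code than this.

```python
# A rough PySpark sketch of a distributed sort-by-key, the kind of operation
# the Gray Sort measures. Paths, record format, and partition count are
# hypothetical; the real benchmark code was far more heavily tuned.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("sort-sketch")
sc = SparkContext(conf=conf)

# Assume each line is "key<TAB>value". Spark keeps partitions in memory when
# it can and spills to disk when it can't, which is why sorts much larger
# than RAM still complete.
records = sc.textFile("hdfs:///data/input")               # hypothetical input path
pairs = records.map(lambda line: tuple(line.split("\t", 1)))
sorted_pairs = pairs.sortByKey(numPartitions=206)          # partition count for illustration

sorted_pairs.saveAsTextFile("hdfs:///data/sorted")         # hypothetical output path
```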


Last year, Yahoo did the sort in 72 minutes with a cluster of 2,100 machines running Hadoop MapReduce. Databricks processed the same amount of data in 23 minutes with Spark, using only 206 virtual machines running on Amazon’s cloud service. It also sorted a petabyte of data (about 1,000 terabytes) in less than four hours using 190 machines.
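Some quick back-of-the-envelope arithmetic puts those figures in perspective. The Python snippet below computes total and per-machine sorting throughput for the two 100-terabyte runs; it’s purely illustrative, since Yahoo’s physical cluster and Databricks’ Amazon virtual machines aren’t directly comparable hardware.

```python
# Back-of-the-envelope throughput for the two 100-terabyte runs described
# above. Illustrative only: the clusters ran on very different hardware.
runs = {
    "Yahoo / Hadoop MapReduce": {"terabytes": 100, "minutes": 72, "machines": 2100},
    "Databricks / Spark":       {"terabytes": 100, "minutes": 23, "machines": 206},
}

for name, r in runs.items():
    tb_per_minute = r["terabytes"] / r["minutes"]
    gb_per_minute_per_machine = tb_per_minute / r["machines"] * 1000
    print(f"{name}: {tb_per_minute:.2f} TB/min overall, "
          f"{gb_per_minute_per_machine:.2f} GB/min per machine")
```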


Although this appears to be a record for this type of sorting using open source software, there are ways to sort data faster. Back in 2011, Google conducted a petabyte sort in only 33 minutes, as a commenter pointed out on the popular programmer hangout Hacker News. But it took Google 8,000 machines to do what Databricks did with 206. And, as Databricks engineer Reynold Xin tells us, Google didn’t share its process with the world, so we don’t know whether it followed the rules specified as part of the Gray Sort.


But most importantly, Databricks did its test using software that anyone can use. “We were comparing with open source project Hadoop MapReduce,” Xin says. “Google’s results are with regard to their own MapReduce implementation that is not accessible to the rest of the world.”


