Twitter engineer Mazdak Hashemi says the Japanese tweet like no one else on earth.
When the New Year arrives or even as they watch certain moments in shows and movies broadcast on national television, tens of thousands of Japanese will tweet at practically the same instant. “Everyone tweets at the New Year, but the Japanese are more in-sync,” says Hashemi, who, as Twitter’s director of site reliability engineering, works to make sure its mini-messaging service stays in good working order. “They do it at exactly midnight.”
This provides a small window into the unique culture of the Japanese, known for exhibiting a certain type of conformity, but there was a time when it was also an enormous problem for Twitter. As the year 2012 arrived in Japan, the country’s synchronized tweets crashed the entire site, worldwide. It was 3pm in Britain when the site went belly-up.
So, as the next New Year approached, Raffi Krikorian, one of Twitter’s lead engineers, urged Hashemi to find a better way of ensuring the site could handle the next wave of synchronized Japanese tweets. “I think he had some post-traumatic stress,” Hashemi says of Krikorian in the wake of the 2012 New Year. As a result, Hashemi and his team built a new system (known as a software “framework,” in engineering speak) that would let them mimic events like a Japanese New Year tweet storm and actually run these synthetic creations on the thousands of computers that run the live site.
Internet engineers call it “stress testing,” and though this sort of thing is very common, Twitter’s situation was a bit different, and its methods could serve as a model for other online operations as they reach Twitter-like sizes. Because of the real-time nature of the site, where people expect to send and receive instantly, at all times, Hashemi and his team needed tools that could very carefully shape and reshape these massive tests. And because the service is used in this real-time way across the globe (it spans 240 million users who generate about 5,700 tweets a second), there weren’t “off hours” when they could run live tests without having to worry about massive amounts of “real” traffic.
“We can’t test outside of business hours,” says Ali Alzabarah, who works alongside Hashemi. “We don’t have business hours.”
The tests Hashemi wanted to run were so large—larger than the real traffic storm that brought down the site during the last New Year—some engineers at Twitter didn’t even want him to try them. “They thought I was smoking something,” says Hashemi, who describes the company’s wider testing efforts in a blog post published today. “You’re pretty much putting your job on the line. It’s like: ‘Am I going to be here or not?’”
But the stress testing framework he and his team built also included new monitoring tools that would let them closely track the results of the tests, on a second-by-second basis, and scale them back as need be. In the end, these tests proved very successful, and the site stayed up for the next New Year, and the one after that. Last August, it also held firm when the Japanese helped set a new tweets-per-second record as they all tweeted at the arrival of a particular moment in the television airing of an animated movie called Castle in the Sky.
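To make the idea concrete, here is a minimal Python sketch of that kind of feedback loop: raise synthetic load in steps, check a health metric after each step, and fall back to the last healthy level when the metric degrades. It is an illustration only, not Twitter’s framework. The function names, the error-rate threshold, and the stand-in service with a fixed capacity are all assumptions made for this example.

```python
from typing import Callable, List


def run_stepped_load_test(
    target_rps: int,
    step_rps: int,
    observe_error_rate: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> List[int]:
    """Return the sequence of load levels (requests/second) that were applied.

    Each loop iteration stands for one monitoring interval (for example, one
    second). All names and defaults here are hypothetical.
    """
    applied: List[int] = []
    safe_level = 0  # last level at which the service looked healthy
    level = 0
    while level < target_rps:
        level = min(level + step_rps, target_rps)
        applied.append(level)
        if observe_error_rate(level) > max_error_rate:
            # Too many errors: scale back to the last healthy level and stop.
            applied.append(safe_level)
            break
        safe_level = level
    return applied


def fake_error_rate(rps: int) -> float:
    """Stand-in for real monitoring: pretend the service copes up to 2,500 rps."""
    return 0.05 if rps > 2500 else 0.0


if __name__ == "__main__":
    print(run_stepped_load_test(4000, 1000, fake_error_rate))
```

In a real test the `observe_error_rate` callback would read live metrics rather than a hard-coded capacity, and the step size would decide how far past the breaking point a single interval could push the site.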
Much of this is thanks to a sweeping effort to rebuild the site using a software programming technology called Scala. And the company may be expanding into data centers in other parts of the world, so that it can serve foreign countries like Japan with dedicated local machines, though Hashemi declines to comment on this. But the company’s new stress testing framework plays its own important role. According to Adrian Cockcroft, a technology fellow with venture capital firm Battery Ventures who previously served as a chief architect at Netflix, another company that deals with rather unusual types and amounts of online traffic, this sort of thing isn’t easy.
“As soon as you get to enormous scale, the off-the-shelf testing products fail,” he says. “You have to synthesize so much traffic with a pattern that actually matters. You have to put a lot of thought into what the traffic pattern is, and it’s quite hard, then, to actually build it. There are certain subtleties to this.”
As other services across the net continue to grow, they too will face similar testing problems, and the good news is that companies like Netflix and Twitter are showing the way. Netflix has open sourced many of the tools it has built to test its site, and Twitter is a company that works in similar ways, sharing many of its software creations with the world at large in an effort to boost the larger community of sites and services.
Twitter has already open sourced a tool called Iago that generates the “fake traffic” for its stress tests, and though it has not released its stress testing framework for carefully building and monitoring these tests (the thing doesn’t even have a name), the company could do so in the future. That could come in handy. After all, the Japanese aren’t going anywhere. Nor is the rest of the net.