Twitter Now Lets You Search For Any Tweet Ever Sent



Yi Zhuang, Paul Burstein, Gilad Mishne, three of the engineers behind Twitter’s new search engine. Josh Valcarcel/WIRED



Paul Burstein was trying to fix a software bug, and Twitter was helping him out.


The year was 2011. Burstein worked as an engineer at the massive internet company Salesforce.com, and the bug—a rather annoying flaw in the popular Java programming tools—was causing problems with the company’s online services. He’d learned of the bug when someone tweeted out a webpage describing the thing; each time he needed to re-check the details, he would search Twitter, find that tweet, and return to the webpage.


It’s the sort of thing people so often do as they look for stuff they’ve previously visited online. But then, after about a week, that tweet disappeared. When Burstein searched Twitter, it no longer turned up.


This was the way things were supposed to work. Originally, Twitter built its search engine to provide quick access to what people are tweeting right now—not to what they’ve tweeted in the past—and that meant removing every tweet from its search index after a week or so. But Burstein also knew this wasn’t ideal. It’s one of the reasons he soon left Salesforce for a job at Twitter. “I felt like there were interesting search problems to be solved,” he says.


Indeed there were. Shortly after he arrived at Twitter, Burstein and a small team of other engineers started work on a new search engine that could quickly comb through not only the millions of tweets sent over the past several days, but also the hundreds of billions of tweets sent since the service first launched in 2006. Along the way, they rolled out preliminary versions of this tool that could search parts of its massive archive (the first in 2012, another last year), and now the project is complete.


This morning, Twitter began rolling out a search service that lets you search for any tweet in its archive.


Outside services have long offered ways of searching old tweets, including tools like Topsy (now owned by Apple) and Tweet Machine, and such services are still the best way to find tweets that have been deleted from Twitter proper. But Twitter’s new search engine fills a conspicuous hole in its own micro-messaging service, and shows how internet search services continue to evolve, providing ever faster access to an ever growing corpus of online information.


Though the new Twitter search engine is limited to rather rudimentary keyword searches today, the company plans to expand into more complex queries in the months and years to come. And the foundational search infrastructure laid down by the company will help drive other Twitter tools as well. “It lets us power a lot more things down the road—not just search,” says Gilad Mishne, the Twitter engineering director who helped oversee the project.


From the First Tweet to the Last


Mishne recently demonstrated the new search engine during a gathering of Twitter employees at the company’s headquarters in San Francisco. The money moment was when he showed that Twitter search now lets you find the first ever tweet: founder Jack Dorsey telling the world he’s “just setting up my twttr.”


That tweet isn’t that hard to find through Google and other web search engines, simply because it’s been cited so often. But the new Twitter search can just as readily find Dorsey’s second tweet and his third and so on, all the way up to tweets sent in the last few minutes.


It may seem perplexing that Twitter didn’t offer such a search engine long ago. But Twitter did not even have a search engine for recent tweets until 2011, five years after the company was founded. Though it handles enormous amounts of online traffic—the microblogging service now boasts 284 million users—the company’s engineering team is still relatively small, and it tends to expand its online tools at a rather gradual pace.


Building an all-encompassing search is rather difficult—and quite different from fashioning a tool that searches recent tweets. As Mishne puts it, the company’s first order of business was to provide a window into what’s happening now. “We’re a realtime platform. This is what Twitter is,” he says. “So we focused first on solving the realtime search problem.”


Beyond Memory


Twitter’s original realtime search engine was based on what’s called an “in-memory” system. Basically, in order to provide quick access to tweets, the company stored them in the main memory of a vast network of computers rather than on hard disks, which read and write data at much slower rates.


But it was too expensive and, at least in the short-term, too difficult to set up enough machines to store all tweets in memory. So, after several days, the company would drop tweets out of its index and store them elsewhere. “We had to make a tradeoff—do things as soon as possible while trading off the depth of the index,” Burstein says.
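To make that tradeoff concrete, here is a minimal Python sketch of an in-memory index that forgets tweets once they age out of a fixed window. It is an illustration only, not Twitter's code: the `RollingIndex` class, its method names, and the one-week window are all assumptions made for the example, and it assumes tweets are added in time order.

```python
from collections import defaultdict, deque


class RollingIndex:
    """Toy in-memory inverted index with a retention window (hypothetical)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.postings = defaultdict(set)  # term -> ids of tweets containing it
        self.by_age = deque()             # (timestamp, tweet_id, terms), oldest first

    def add(self, tweet_id, timestamp, text):
        # Assumes tweets arrive in non-decreasing timestamp order.
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term].add(tweet_id)
        self.by_age.append((timestamp, tweet_id, terms))

    def expire(self, now):
        # Drop every tweet that has fallen out of the retention window.
        while self.by_age and now - self.by_age[0][0] > self.retention_seconds:
            _, tweet_id, terms = self.by_age.popleft()
            for term in terms:
                self.postings[term].discard(tweet_id)
                if not self.postings[term]:
                    del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


DAY = 24 * 3600
index = RollingIndex(retention_seconds=7 * DAY)
index.add(1, 0 * DAY, "java bug writeup")
index.add(2, 5 * DAY, "another java post")
index.expire(now=8 * DAY)
print(index.search("java"))  # tweet 1 is now eight days old and has been dropped
```

The shorter the window, the less memory the index needs, which is why a tweet could turn up in search one week and vanish the next.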


This worked well enough, as the system could store a few billion tweets in memory, but Burstein and company knew the search engine needed to do more. As has so often been the case with other Twitter tools, the company had spent years standing back as third parties built search engines that could search for older tweets.


Some of these worked pretty well, with Twitter providing them with direct access to its “firehose” of tweets. But they didn’t necessarily provide instant access to brand new tweets. They didn’t tightly integrate with Twitter itself. And they didn’t always last. So, in late 2011, Burstein and a few others, including engineer Yi Zhuang, went to work on a search engine that would directly tap the Twitter archive.


‘Can We Really Do This?’


To hear Burstein tell it, this wasn’t an easy thing. “When we started,” he remembers, “I would frequently come into the office and say: ‘Can we really do this?’”


It wasn’t just that they needed to index every tweet in existence. They needed to find a way of constantly merging this index with the millions of new tweets that go out with each passing second. This, says Mike Miller, the CTO of online database outfit Cloudant, which has worked with outside companies on Twitter search engines, is the really difficult part.


When Twitter and other realtime services rose to prominence several years ago, Google refashioned its search engine so that it could handle the most recent of internet posts alongside much older data, and this required a massive overhaul of the sweeping software systems that drive its search engine. Now, Twitter has done much the same.


Basically, Burstein and crew use hundreds of machines running Hadoop MapReduce—the popular open source data-crunching tool—to collect and arrange all the data needed for its master search index, and then they use separate custom-built software to actually build the index. The trick is that a relatively small number of machines builds each part of the index. “We can massively parallelize the process,” says Burstein.


In short, one group of machines can build a portion of the index for older tweets while another is building a portion for newer tweets. The same basic software that handles the archive can also handle the realtime stuff.
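The general idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Twitter's pipeline: the function names are made up for the example, threads stand in for separate groups of machines, and the real system uses Hadoop MapReduce plus custom index-building software. Each batch of tweets is indexed on its own, and a query simply consults every finished segment.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_segment(tweets):
    """Build an inverted index (term -> sorted tweet ids) for one batch of tweets."""
    postings = defaultdict(list)
    for tweet_id, text in sorted(tweets):
        for term in set(text.lower().split()):
            postings[term].append(tweet_id)
    return dict(postings)


def build_all(partitions):
    """Build one segment per partition; no build depends on any other."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(build_segment, partitions))


def search(segments, term):
    """Query every segment and merge the hits, newest tweet id first."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term.lower(), []))
    return sorted(hits, reverse=True)


# Hypothetical partitions, split by age; the later tweet texts are invented.
partitions = [
    [(1, "just setting up my twttr"), (2, "inviting coworkers")],  # oldest
    [(3, "setting up the new office")],
    [(4, "search is live"), (5, "setting up archive search")],     # newest
]
segments = build_all(partitions)
print(search(segments, "setting"))  # → [5, 3, 1]
```

Because no segment depends on another, adding tweets from the last few minutes means building one more small segment rather than rebuilding the whole archive.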


Flash to the Future


The system can still do all this at speed, but it doesn’t try to stuff everything in memory. Instead, it uses machines equipped with solid-state disks, or SSDs. Basically, these are modern replacements for hard disks, built from flash memory, the same stuff that stores data and applications on your smartphone.


Reading and writing data on SSDs is significantly faster than juggling information on hard disks, and SSDs aren’t quite as expensive as storing data in main memory. This is part of a larger shift in the world of computing, with so many large operations aiming to provide quicker access to more online data. In Twitter, you can see a reflection of the internet as a whole.


