How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct.

Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic approach of text files stored on each machine, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop-based solution that works wisely for them and has virtually unlimited scalability potential.

Where do you store all that data? How do you do anything useful with it? In the first version of their system, logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single-machine MySQL version. Inserts quickly became the bottleneck, as the huge torrents of incoming data caused a lot of index churn. Periodic bulk loading was the remedy to this problem, but the sheer size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data arrived, this solution broke down under a combination of load and operational problems.

Facing exponential growth, they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time and that a more scalable solution would need to be created in the future anyway. The future came a little early this year.

The advantage of their new system is that they can now look at their data in any way they want:

* Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
* When they wanted to find out which part of the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system.

This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce."

Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report. In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system.

This post is based on a document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). It is a little different than normal because almost all the content past this point is by Bill; I've just organized it a little differently.

* Rackspace has more than 50K devices and 7 data centers.
* The mail system and logging servers are currently in 3 of the Rackspace data centers.
* The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.
* Several hundred gigabytes of email log data are generated each day.
* Mailtrust: founded in 1999, merged with Rackspace in 2007, previous name:
* 80K business customers, 700K mailboxes.
* 2 hosted mail products: Noteworthy, MS Exchange.
* Noteworthy
  * Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync.
  * ~600 servers, commodity hardware, designed to work around frequent failures.
* MS Exchange
  * MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
  * ~100 servers, higher-end hardware, SAN & DAS storage.
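To make the idea of a statistics job over mail logs more concrete, here is a minimal single-process sketch of the MapReduce pattern in Python. It is not Mailtrust's code and does not use Hadoop: the log line format, the sample data, and the names `map_event`, `reduce_events` and `run_job` are all assumptions made for illustration. It counts events and sums bytes per domain, in the spirit of the spam counts, bytes transferred and login counts mentioned in the post.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log format (an assumption, not Mailtrust's real format):
#   <timestamp> <event> <user@domain> <bytes>
SAMPLE_LOG = [
    "2008-01-30T00:00:01 spam alice@example.com 2048",
    "2008-01-30T00:00:02 login bob@example.com 0",
    "2008-01-30T00:00:03 spam carol@example.org 512",
    "2008-01-30T00:00:04 spam dave@example.com 1024",
    "2008-01-30T00:00:05 login bob@example.com 0",
]


def map_event(line):
    """Map step: emit ((domain, event), (1, bytes)) for one log line."""
    _timestamp, event, address, size = line.split()
    domain = address.rsplit("@", 1)[1]
    yield (domain, event), (1, int(size))


def reduce_events(key, values):
    """Reduce step: sum the counts and bytes emitted for one key."""
    count = 0
    total_bytes = 0
    for one, size in values:
        count += one
        total_bytes += size
    return key, (count, total_bytes)


def run_job(lines):
    """Run map, then a sort-based shuffle, then reduce, all in memory."""
    pairs = sorted(pair for line in lines for pair in map_event(line))
    results = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        reduced_key, stats = reduce_events(key, (value for _, value in group))
        results[reduced_key] = stats
    return results


if __name__ == "__main__":
    for (domain, event), (count, total_bytes) in sorted(run_job(SAMPLE_LOG).items()):
        print(domain, event, count, total_bytes)
```

In a real Hadoop deployment the map and reduce functions would run in parallel across many machines and the framework would handle the shuffle; the sort-and-group step here only imitates that on one machine.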