Friday, March 4, 2011

Building a Faster Ruby Garbage Collector

Since late 2009, much of www.twitter.com has run on Ruby Enterprise Edition (REE), a modified version of the standard MRI 1.8.7 Ruby interpreter. At the time, we worked with the REE team to integrate some third-party patches that allowed us to tune the garbage collector for long-lived workloads. We knew this was not a perfect choice, but a switch to a new runtime (even MRI 1.9x) would introduce compatibility problems, and testing indicated that alternative runtimes are not necessarily faster for our workload. Nevertheless, the CPU cost of REE remained too high.

To address this problem, we decided to explore options for optimizing the REE runtime. We called this effort Project Kiji, after the Japanese bird.

Inefficient garbage collection

Our performance measurements revealed that even after our patches, the Ruby runtime uses a significant fraction of the CPU for running the garbage collector on twitter.com. This is largely because MRI's garbage collector uses a single heap:

  • The garbage collector’s naive stop-the-world mark-and-sweep process accesses the entire memory set several times. It first marks every object reachable from the “root set” as “in use” and then reexamines all objects to release the memory of those left unmarked. Additionally, the collector suspends the system during every collection, thereby periodically “freezing” the program. A minimal sketch of one such cycle follows this list.
  • The collection process is not generational. That is, the collector does not move objects between heaps; they all stay at the same address for their lifetime. The resulting fragmented memory exacts a penalty in bookkeeping cost because it can neither be consolidated nor discarded.
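For readers less familiar with the algorithm, here is a minimal conceptual sketch of one stop-the-world mark-and-sweep cycle. It is written in TypeScript purely for illustration; MRI's collector is implemented in C and walks raw heap slots, and the object shapes below are invented.

```typescript
// Conceptual model only: not REE's C implementation.
interface HeapObject {
  marked: boolean;
  refs: HeapObject[]; // outgoing pointers to other objects
}

function markAndSweep(roots: HeapObject[], heap: HeapObject[]): HeapObject[] {
  // Mark phase: flag everything reachable from the root set as "in use".
  const stack = [...roots];
  while (stack.length > 0) {
    const obj = stack.pop()!;
    if (obj.marked) continue;
    obj.marked = true;
    stack.push(...obj.refs);
  }

  // Sweep phase: walk the whole heap again, keeping marked objects
  // (and clearing their marks for the next cycle) and dropping the rest.
  // The program is paused for the duration of both phases.
  const survivors = heap.filter(obj => obj.marked);
  for (const obj of survivors) obj.marked = false;
  return survivors;
}
```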

We needed to make the garbage collector more efficient but had limited options. We couldn't easily change the runtime’s stop-the-world process because internally it relies on being single-threaded. Neither could we implement a real generational collector because our interpreter relies on objects staying at the same address in memory.

Two heaps are better than one

While we could not change the location of an already-allocated object, we realized we could allocate objects in different locations depending on their expected lifetime. So, first, we separated live objects into transient objects and long-lived objects. Next, we added another heap to Ruby and called it the "longlife" heap.

In its current implementation, Kiji has two heaps (a rough sketch of the allocation split follows the list):

  • An ordinary heap that has a variety of objects, including the transient objects
  • A longlife heap for objects that we expect to be long-lived.
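Here is a rough sketch of that allocation split, with invented names; in the actual patch the decision is made inside REE's C allocator, which redirects AST node allocations to the longlife heap.

```typescript
// Hypothetical model of Kiji's two-heap allocation; names are illustrative.
enum Lifetime { Transient, LongLived }

class Heap {
  readonly objects: object[] = [];
  allocate<T extends object>(obj: T): T {
    this.objects.push(obj);
    return obj;
  }
}

const ordinaryHeap = new Heap(); // transient and general-purpose objects
const longlifeHeap = new Heap(); // objects expected to live for the process lifetime

function allocate<T extends object>(obj: T, expected: Lifetime): T {
  return expected === Lifetime.LongLived
    ? longlifeHeap.allocate(obj)
    : ordinaryHeap.allocate(obj);
}

// AST nodes produced by the parser would be tagged LongLived; ordinary
// request-scoped objects stay in the ordinary heap.
const astNode = allocate({ type: "NODE_CALL" }, Lifetime.LongLived);
```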

How the longlife heap works

In the new configuration, AST (Abstract Syntax Tree) nodes are allocated in the longlife heap. These nodes are the parsed representation of the Ruby program's source code constructs (such as names, types, expressions, statements, and declarations). Once loaded, they tend to stay in memory and are largely immutable. They also occupy a large amount of memory: in twitter.com's runtime, they account for about 60% of live objects at any given time. By placing the AST nodes in a separate longlife heap and running an infrequent garbage collection only on this heap, we saw a significant performance gain: the share of CPU time spent in garbage collection dropped from 18.5% to 14%.

With infrequent collection, the system retains some garbage in memory, which increases overall memory use. We experimented with various scheduling strategies for longlife garbage collection to balance the tradeoff between CPU usage and memory usage. We finally selected a strategy that, when an allocation occurs in the longlife heap, triggers a longlife collection synchronized with the 8th collection cycle of the ordinary heap. If the longlife heap receives no allocations, subsequent longlife collections occur at the 16th, 32nd, 64th ordinary collection cycle and so on, with the interval growing exponentially.
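The following is a sketch of that scheduling heuristic as described above; the counter and hook names are invented, and the real logic lives inside REE's C garbage collector.

```typescript
// Hypothetical model of the longlife collection schedule described above.
let ordinaryCycles = 0;        // ordinary collections since the last longlife GC
let longlifeInterval = 8;      // collect the longlife heap every N ordinary cycles
let allocatedSinceLastGC = false;

function onLonglifeAllocation(): void {
  allocatedSinceLastGC = true;
}

// Called once per ordinary-heap collection cycle.
function onOrdinaryCollection(): void {
  ordinaryCycles++;
  if (ordinaryCycles < longlifeInterval) return;

  collectLonglifeHeap();
  ordinaryCycles = 0;

  if (allocatedSinceLastGC) {
    longlifeInterval = 8;      // new long-lived objects: back to the base schedule
  } else {
    longlifeInterval *= 2;     // nothing new: back off exponentially (16, 32, 64, ...)
  }
  allocatedSinceLastGC = false;
}

function collectLonglifeHeap(): void {
  /* mark-and-sweep restricted to the longlife heap */
}
```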

Improved mark phase

A second heap improved garbage collection, but we needed to ensure that objects in the longlife heap continued to keep alive the objects they referenced in the ordinary heap. Because the heaps were now separate, the majority of our ordinary-heap collections ran without a longlife collection. Therefore, ordinary objects reachable only through longlife objects would not be marked as live and could, mistakenly, be swept as garbage. We needed to maintain a set of "remembered objects", or boundary objects, that live in the ordinary heap but are directly referenced by objects living in the longlife heap.

This proved to be a far greater challenge than originally expected. At first we added objects to the remembered set whenever we constructed an AST node. However, the AST nodes are not uniformly immutable: immediately after a parse, the Ruby interpreter tends to rewrite them to apply small optimizations, which frequently rewrites the pointers between objects. We overcame this problem by implementing an algorithm that is similar to the mark phase, except that it is not recursive and only discovers direct references from the longlife heap to the ordinary heap. We run the algorithm at ordinary collection time whenever we detect that changes have occurred in the longlife heap since the last run. The runs decrease in frequency over time: the longer the process runs, the more its set of loaded code stabilizes. In other words, we are consciously optimizing for long-running processes.
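Here is a conceptual sketch of that boundary-set scan; the object and heap shapes are invented, and the real code operates on REE's C-level heap slots.

```typescript
// Invented shapes for illustration: each object records which heap it lives
// in and its direct outgoing references.
interface Obj {
  heap: "ordinary" | "longlife";
  refs: Obj[];
  remembered: boolean;
}

// Non-recursive, single-pass scan over the longlife heap. Any ordinary-heap
// object directly referenced by a longlife object joins the remembered
// (boundary) set, so ordinary-only collections treat it as a root and keep
// it alive. Run only when the longlife heap has changed since the last scan.
function rebuildRememberedSet(longlifeHeap: Obj[]): Obj[] {
  const remembered: Obj[] = [];
  for (const node of longlifeHeap) {
    for (const ref of node.refs) {
      if (ref.heap === "ordinary" && !ref.remembered) {
        ref.remembered = true;
        remembered.push(ref);
      }
    }
  }
  return remembered;
}
```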

An additional optimization ensures that if an ordinary object points to a longlife object, the marking never leaves the ordinary heap during the mark phase. This is because all outgoing pointers from the longlife heap to the ordinary heap are already marked as remembered objects. The mark algorithm running through the longlife heap reference chains does not mark any new objects in the ordinary heap.

Results

The graph below shows the performance curves of the twitter.com webapp on various Ruby interpreter variants on a test machine with a synthetic load (not indicative of our actual throughput). We compared out-of-the-box Ruby MRI 1.8.7p248, REE 2010.02 (on which Kiji is based), and Kiji. In addition, we tested REE and Kiji in two modes: one with the default settings, the other with GC_MALLOC_LIMIT tuned to scale back speculative collection. We used httperf to stress the runtimes with increasing numbers of requests per second, and measured the rate of successful responses per second. As you can see, the biggest benefit comes from the GC tuning, but Kiji's two-heap structure also gives a noticeable edge over standard REE.

We have also measured the CPU percentage spent in the GC for these variants, this time with actual production traffic. We warmed up the runtimes first, then explicitly shut off allocations in the longlife heap once it was unlikely that any genuinely long-lived AST nodes would be generated. The results are below.

Lessons and future trends

With Kiji, the garbage collector now consumes a smaller portion of the CPU for Ruby processes, and more importantly, enables additional improvements. As we identify additional objects to move to the longlife heap, we can further decrease the overall CPU usage. The theoretical floor is about 5% of total CPU time spent in the GC. We will be posting updates with new results.

References

A major source of inspiration was the patch by Narihiro Nakamura (available here). Narihiro’s patch is a proof of concept; it handles only a few AST nodes. It is also written against MRI 1.9, and we needed to cross-port it to REE, which is a derivative of MRI 1.8.7. We substantially extended Narihiro's work to include algorithmic boundary-set calculation and to stop the ordinary mark from leaving the ordinary heap. We also ensured our solution integrated well with REE's strategy for avoiding copy-on-write of objects in forked processes during the mark phase. These changes delivered significant gains.

Try it!

We have released the Kiji REE branch on GitHub, and hope that a future version will be suitable for merging into REE itself. In our case, switching to Kiji brought a 13% increase in the number of requests twitter.com can serve per second.

Acknowledgements

The following engineers at Twitter contributed to the REE improvements: Rob Benson, Brandon Mitchell, Attila Szegedi, and Evan Weaver.

— Attila (@asz)

Friday, October 22, 2010

Hack Week

Here at Twitter, we make things. Over the last five weeks, we’ve launched the new Twitter and made significant changes to the technology behind Twitter.com, deployed a new backend for search, and refined the algorithm for trending topics to make them more real-time. In keeping with the spirit of driving innovation in engineering, we’ll be holding our first Hack Week starting today (Oct 22) and running through next Friday (Oct 29). During the week, we’ll all be building things that are separate from our normal work and not part of our day-to-day jobs. Of course, we’ll keep an eye out for whales.

There aren’t many rules – basically we’ll work in small teams and share our projects with the company at the end of the week. What will happen with each project will be determined once it’s complete. Some may ship immediately, others may be added to the roadmap and built out in the future, and the remainder may serve as creative inspiration. If you have an idea for one of our teams, send a tweet to @hackweek. We’re always looking for feedback.

Wednesday, October 6, 2010

Twitter's New Search Architecture

If we have done a good job then most of you shouldn’t have noticed that we launched a new backend for search on twitter.com during the last few weeks! One of our main goals, but also biggest challenges, was a smooth switch from the old architecture to the new one, without any downtime or inconsistencies in search results. Read on to find out what we changed and why.

Twitter’s real-time search engine was, until very recently, based on the technology that Summize originally developed. This is quite amazing, considering the explosive growth that Twitter has experienced since the Summize acquisition. However, scaling the old MySQL-based system had become increasingly challenging.

The new technology

About 6 months ago, we decided to develop a new, modern search architecture based on a highly efficient inverted index instead of a relational database. Since we love open source here at Twitter, we chose Lucene, a search engine library written in Java, as a starting point.

Our demands on the new system are immense: with over 1,000 TPS (Tweets/sec) and 12,000 QPS (queries/sec), which works out to over 1 billion queries per day (!), we already put a very high load on our machines. As we want the new system to last for several years, the goal was to support at least an order of magnitude more load.

Twitter is real-time, so our search engine must be too. In addition to these scalability requirements, we also need to support extremely low indexing latencies (the time it takes between when a Tweet is tweeted and when it becomes searchable) of less than 10 seconds. Since the indexer is only one part of the pipeline a Tweet has to make it through, we needed the indexer itself to have a sub-second latency. Yes, we do like challenges here at Twitter! (btw, if you do too: @JoinTheFlock!)

Modified Lucene

Lucene is great, but in its current form it has several shortcomings for real-time search. That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:

  • significantly improved garbage collection performance
  • lock-free data structures and algorithms
  • posting lists that are traversable in reverse order (see the sketch after this list)
  • efficient early query termination
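To illustrate why reverse traversal pairs naturally with early query termination in a real-time index, here is a minimal sketch; it is not the actual implementation and omits positions, compression, and the lock-free machinery.

```typescript
// Minimal sketch: an append-only posting list traversed newest-first.
class PostingList {
  private readonly docIds: number[] = [];

  // The indexer appends Tweets in arrival order (oldest ... newest).
  add(docId: number): void {
    this.docIds.push(docId);
  }

  // Queries walk the list in reverse so the newest Tweets come first and
  // the search can terminate early after collecting `limit` hits.
  *newestFirst(limit: number): Generator<number> {
    let returned = 0;
    for (let i = this.docIds.length - 1; i >= 0 && returned < limit; i--) {
      yield this.docIds[i];
      returned++;
    }
  }
}

// Usage: fetch the 20 most recent matching Tweets for a term.
const list = new PostingList();
for (let id = 1; id <= 1000; id++) list.add(id);
const results = [...list.newestFirst(20)]; // [1000, 999, ..., 981]
```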

We believe that the architecture behind these changes involves several interesting topics that pertain to software engineering in general (not only search). We hope to continue to share more on these improvements.

And, before you ask, we’re planning on contributing all these changes back to Lucene; some have already made it into Lucene’s trunk and its new realtime branch.

Benefits

Now that the system is up and running, we are very excited about the results. We estimate that we’re only using about 5% of the available backend resources, which means we have a lot of headroom. Our new indexer could also index roughly 50 times more Tweets per second than we currently get! And the new system runs extremely smoothly, without any major problems or instabilities (knock on wood).

But you might wonder: Fine, it’s faster, and you guys can scale it longer, but will there be any benefits for the users? The answer is definitely yes! The first difference you might notice is the bigger index, which is now twice as long -- without making searches any slower. And, maybe most importantly, the new system is extremely versatile and extensible, which will allow us to build cool new features faster and better. Stay tuned!

The engineers who implemented the search engine are: Michael Busch, Krishna Gade, Mike Hayes, Abhi Khune, Brian Larson, Patrick Lok, Samuel Luckenbill, Jake Mannix, Jonathan Reichhold.

Thursday, September 30, 2010

Tool Legit

Hi, I'm @stirman, and I'm a tool. Well, I build tools, along with @jacobthornton, @gbuyitjames and @sm, the Internal Tools team here at Twitter. To build or not to build internal tools is usually a debated topic, especially amongst startups. Investing in internal projects has to be weighed against investing in external-facing features for your product, although at some point the former investment shows greater external returns than the latter. Twitter has made it a priority to invest in internal tools since the early days, and with the growth of the product and the company, our tools have become a necessity.
I often hear from friends in the industry about internal tools being a night and weekend additional project for engineers that are already backlogged with "real" work. We have decided to make building tools our "real" work. This decision means we have time to build solid applications, spend the necessary time to make them look great and ensure that they work well.

Our team's mission is to increase productivity and transparency throughout the company. We increase productivity by streamlining processes and automating tasks. We increase transparency by building tools and frameworks that allow employees to discover, and be notified of, relevant information in real time. Many companies use the term "transparency" when discussing their company culture, but very few put the right pieces in place to ensure that a transparent environment can be established without exposing too much information. Twitter invests heavily in my team so that we can build the infrastructure to ensure a healthy balance.

We have built tools that track and manage milestones for individual teams, manage code change requests, provide an easy A/B testing framework for twitter.com, create internal short links, get approval for offer letters for new candidates, automate git repository creation, help conduct fun performance reviews and many more. We release a new tool about once every other week. We release a first version as early as possible, and then iterate quickly after observing usage and gathering feedback.

Also, with the help of @mdo, we have put together an internal blueprint site that not only contains a style guide for new apps, but also hosts shared stylesheets, javascript libraries and code samples, like our internal user authentication system, to make spinning up a new tool as simple as possible. We put a lot of effort into ensuring our tools are easy to use and making them look great. We have fun with it. Here's a screenshot of a recent app that tracks who's on call for various response roles at any given time.
We also have fun learning new technologies. Here's a screenshot of a real-time Space Invaders Twitter sentiment analysis visualization that is part of a status board displayed on flat screens around the office. @jacobthornton wanted to learn more about node.js for some upcoming projects and he built "Space Tweets" to do just that! If you're interested in the code, get it on github.

While we're talking about open source, we would like to mention how much our team values frameworks like Ruby on Rails, MooTools and their respective communities, all of which are very important to our internal development efforts and in which you'll find us actively participating by submitting patches, debating issues, etc. We are proactively working towards open sourcing some of our own tools in the near future, so keep an eye on this blog.

Does this stuff interest you? Are you a tool? Hello? Is this thing on? Is anyone listening? (If you are still here, you passed the test! Apply here to join our team or hit me up at @stirman!)

Monday, September 20, 2010

The Tech Behind the New Twitter.com

The Twitter.com redesign presented an opportunity to make bold changes to the underlying technology of the website. With this in mind, we began implementing a new architecture almost entirely in JavaScript. We put special emphasis on ease of development, extensibility, and performance. Building the application on the client forced us to come up with unique solutions to bring our product to life, a few of which we’d like to highlight in this overview.

API Client

One of the most important architectural changes is that Twitter.com is now a client of our own API. It fetches data from the same endpoints that the mobile site, our apps for iPhone, iPad and Android, and every third-party application use. This shift allowed us to allocate more resources to the API team, which has generated over 40 patches. For the initial page load and for every call from the client, all data is now fetched from a highly optimized JSON fragment cache.

The JavaScript API

We built a JavaScript library to access Twitter's REST API for @anywhere, which provided a good starting point for development on this project. The JavaScript API provides API fetching and smart client-side caching, both in memory and using localStorage, allowing us to minimize the number of network requests made while using Twitter.com. For instance, timeline fetches include associated user data for each Tweet. The resulting user objects are proactively cached, so viewing a profile does not require unnecessary fetches of user data.
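A simplified sketch of such a two-tier cache is shown below; the class and method names are invented and this is not the actual library code.

```typescript
// Illustrative two-tier cache: an in-memory map backed by localStorage.
class ClientCache<T> {
  private memory = new Map<string, T>();

  get(key: string): T | undefined {
    if (this.memory.has(key)) return this.memory.get(key);
    const stored = window.localStorage.getItem(key);
    if (stored !== null) {
      const value = JSON.parse(stored) as T;
      this.memory.set(key, value); // promote to the in-memory tier
      return value;
    }
    return undefined;
  }

  set(key: string, value: T): void {
    this.memory.set(key, value);
    window.localStorage.setItem(key, JSON.stringify(value));
  }
}

// Example: timeline responses embed user objects, so each user is cached
// proactively and a later profile view may need no network request at all.
const userCache = new ClientCache<{ id: number; screenName: string }>();
userCache.set("user:12", { id: 12, screenName: "jack" });
```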

Another feature of the JavaScript API is that it provides event notifications before and after each API call. This allows components to register interest and respond immediately with appropriate changes to the UI, while letting independent components remain decoupled, even when relying on access to the same data.
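Below is a minimal sketch of that publish/subscribe pattern around an API call; the event names, endpoints, and helper class are invented, and the syntax is modern TypeScript rather than the library code itself.

```typescript
// Hypothetical before/after notifications around an API call.
type Listener = (payload: unknown) => void;

class ApiEvents {
  private listeners = new Map<string, Listener[]>();

  on(event: string, fn: Listener): void {
    const list = this.listeners.get(event) ?? [];
    list.push(fn);
    this.listeners.set(event, list);
  }

  emit(event: string, payload: unknown): void {
    for (const fn of this.listeners.get(event) ?? []) fn(payload);
  }
}

const events = new ApiEvents();

async function apiCall(endpoint: string): Promise<unknown> {
  events.emit(`before:${endpoint}`, null);   // e.g. show a loading indicator
  const data = await fetch(endpoint).then(r => r.json());
  events.emit(`after:${endpoint}`, data);    // e.g. components update the UI
  return data;
}

// Two decoupled components can react to the same call independently.
events.on("after:/timeline", tweets => console.log("render timeline", tweets));
events.on("after:/timeline", tweets => console.log("update unread count", tweets));
```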

Page Management

One of the goals with this project was to make page navigation easier and faster. Building on the web’s traditional analogy of interlinked documents, our application uses a page routing system that maintains a strong relationship between a URL and its content. This allows us to provide a rich web application that behaves like a traditional web site. Doing so demanded a rich routing model on the client, so we developed a routing system that switches between stateful pages, driven by the URL hash. As the user navigates, the application caches the visited pages in memory. Although the information on those pages can quickly become stale, we’ve alleviated much of this complexity by making pages subscribe to events from the JavaScript API and keep themselves in sync with the overall application state.
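A stripped-down sketch of hash-driven routing is shown below; the routes, renderers, and page cache are invented, and the production router handles much more (stateful pages, transitions, history).

```typescript
// Minimal hash router: maps the URL fragment to a page renderer.
type Renderer = (args: string[]) => void;

const routes: Array<{ pattern: RegExp; render: Renderer }> = [
  { pattern: /^#!\/home$/, render: () => console.log("render home timeline") },
  { pattern: /^#!\/([A-Za-z0-9_]+)$/, render: ([name]) => console.log(`render profile @${name}`) },
];

const visitedPages = new Set<string>(); // visited pages kept in memory

function route(): void {
  const hash = window.location.hash || "#!/home";
  for (const { pattern, render } of routes) {
    const match = hash.match(pattern);
    if (match) {
      render(match.slice(1));
      visitedPages.add(hash);
      return;
    }
  }
}

// Re-route whenever the fragment changes; no full page reload is needed.
window.addEventListener("hashchange", route);
route();
```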

The Rendering Stack

In order to support crawlers and users without JavaScript, we needed a rendering system that runs on both server and client. To meet this need, we've built our rendering stack around Mustache, and developed a view object system that generates HTML fragments from API objects. We’ve also extended Mustache to support internationalized string substitution.
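To give a flavor of a view object in such a stack, here is a small sketch using the open-source mustache.js package; the template, types, and class are invented and this is not the actual view system.

```typescript
import Mustache from "mustache";

// Invented view object: it pairs an API object with a template so the same
// rendering path can run under Node (for crawlers and no-JS users) and in
// the browser.
interface TweetJson {
  screenName: string;
  text: string;
}

const tweetTemplate = "<p><b>@{{screenName}}</b> {{text}}</p>";

class TweetView {
  constructor(private readonly tweet: TweetJson) {}

  toHtml(): string {
    return Mustache.render(tweetTemplate, this.tweet);
  }
}

const html = new TweetView({ screenName: "twitter", text: "hello world" }).toHtml();
console.log(html); // <p><b>@twitter</b> hello world</p>
```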

Much attention was given to optimizing performance in the DOM. For example, we’ve implemented event delegation across the board, which has enabled a low memory profile without worrying about event attachment. Most of our UI is made out of reusable components, so we've centralized event handling to a few key root nodes. We also minimize repaints by building full HTML structures before they are inserted and attach relevant data in the HTML rendering step, rather than through DOM manipulation.
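Here is a small sketch of that delegation pattern; the selectors and data attributes are invented, and Element.closest is used for brevity even though code of that era would have walked the DOM by hand.

```typescript
// One listener on a root node handles clicks for every tweet rendered
// inside it, so per-element handlers (and their memory cost) are avoided.
const timelineRoot = document.getElementById("timeline");

timelineRoot?.addEventListener("click", event => {
  const target = event.target as HTMLElement;

  const retweetButton = target.closest(".retweet");
  if (retweetButton) {
    console.log("retweet", retweetButton.getAttribute("data-tweet-id"));
    return;
  }

  const tweet = target.closest(".tweet");
  if (tweet) {
    console.log("open details for", tweet.getAttribute("data-tweet-id"));
  }
});
```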

Inline Media

One important product feature was embedding third-party content directly on the website whenever a Tweet links to one of our content partners. For many of these partners, such as Kiva and Vimeo, we rely on the oEmbed standard, making a simple JSON-P request to the content provider's domain and embedding the content found in the response. For other media partners, like TwitPic and YouTube, we rely on known embed resources that can be predicted from the URL, which reduces network requests and results in a speedier experience.
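A minimal sketch of the JSON-P flow for an oEmbed partner is shown below; the provider URL, callback plumbing, and element IDs are placeholders rather than the production embed code.

```typescript
// JSON-P: inject a <script> tag whose response calls a named global callback,
// sidestepping the same-origin restriction on XMLHttpRequest.
function jsonp(url: string, onData: (data: unknown) => void): void {
  const callbackName = `oembed_cb_${Date.now()}`;
  const script = document.createElement("script");
  (window as any)[callbackName] = (data: unknown) => {
    onData(data);
    delete (window as any)[callbackName];
    script.remove();
  };
  script.src = `${url}&callback=${callbackName}`;
  document.head.appendChild(script);
}

// Hypothetical oEmbed lookup for a partner link found in a Tweet.
const partnerUrl = "https://provider.example/oembed?format=json&url=" +
  encodeURIComponent("https://provider.example/videos/123");

jsonp(partnerUrl, data => {
  const embedHtml = (data as { html?: string }).html ?? "";
  document.getElementById("tweet-media")!.innerHTML = embedHtml;
});
```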

Open Source

Twitter has always embraced open-source technology, and the new web client continues in this tradition. We used jQuery, Mustache, LABjs, Modernizr, and numerous other open-source scripts and jQuery plugins. We owe a debt of gratitude to the authors of these libraries and many others in the JavaScript community for their awesome efforts in writing open-source JavaScript. We hope that, through continuing innovations in front-end development here at Twitter, we'll be able to give back to the open-source community with some of our own technology.

Conclusions

With #NewTwitter, we’ve officially adopted JavaScript as a core technology in our organization. This project prompted our first internal JavaScript summit, which represents an ongoing effort to exchange knowledge, refine our craft and discover new ways of developing for the web. We're very excited about the doors this architectural shift will open for us as we continue to invest more deeply in rich client experiences. If you're passionate about JavaScript, application architecture, and Twitter, now is a very exciting time to @JoinTheFlock!

This application was engineered in four months by seven core engineers: Marcus Phillips, Britt Selvitelle, Patrick Ewing, Ben Cherry, Dustin Diaz, Russ d’Sa, and Sarah Brown, with numerous contributions from around the company.