Summary: Many high-traffic web apps use a dedicated http://status.* site to communicate platform availability notices. We found ourselves needing this, so we built it. We think it’s a general need, so we’ve open-sourced ours at http://github.com/thinkthroughmath/status_site/.
There’s a lot that goes into running a high-traffic, production web app, especially one that grows quickly. Over the past several months we’ve found ourselves needing to schedule small amounts of downtime to run migrations and scripts and to reorganize or denormalize data. We generally schedule these off-hours, since our traffic slows significantly outside of school hours, but we wanted a mechanism to notify interested customers about downtime or issues without spamming our entire population over email or relying on in-app messaging.
We did a survey of big-name companies and how they were handling this issue. Most commonly we saw some combination of a custom web app and a Twitter feed dedicated specifically to platform status. We searched for an existing repo to use as a base and didn’t find anything, so we built a version that tries to follow the best practices of other status sites. Because contributing to OSS is important to us, we’ve open-sourced our baseline at http://github.com/thinkthroughmath/status_site/. The initial version allows an admin login to CRUD an issue (meaning there is a problem with system performance or an outage), CRUD a maintenance event (for planned downtime), and display system metrics from New Relic if applicable. status_site supports email and RSS subscriptions to Issue and Maintenance updates. Lastly, status_site provides a JavaScript snippet that you can embed in your main app to automatically display upcoming planned maintenance events.
Other things we’ve thought would be cool to add:
* Twitter integration
* timeline visualization
* automatic calculation of uptime based on outages
* tie issue creation to New Relic alerts
What do you think should be added? Feel free to drop an issue in the repo or fork it and issue a pull request. Get involved, and let us know how you decide to use it!
by Jim Wrubel, Think Through Math, & Scott Shea/John Lewis, Unicon
Summary: Heroku’s random routing algorithm causes significant problems with high-traffic, single-threaded Rails apps. This post describes how we’ve configured our Unicorn dynos to reduce the impact of the router. We set the Unicorn backlog to extremely low values to force the router to retry requests until it finds a dyno that’s not loaded. The post also presents a wishlist of things Heroku could do to further address the problem.
At Think Through Math, we run an e-learning platform designed to help students become successful at math. We’re running a Rails stack and hosted on Heroku. Over the past six months, our usage skyrocketed thanks to a string of big customer wins and a compelling solution. As our usage grew, performance started to suffer for reasons we couldn’t explain. We monitor our application with New Relic. What it showed us was a fairly flat app server response time, but as usage grew during the day, the end user response time grew dramatically. All of the growth was in the graph section marked network. We also saw waves of dyno restarts along with H12 timeouts.
At the end of every browser using our application is a math student. Many of them have struggled with math for years. They’re conditioned to think they are bad at it; that math is too hard for them. When our app is slow or times out that’s just another reason for the student to lose confidence. So we took these performance issues very seriously, and we felt powerless to stop them since we couldn’t figure out the cause of the elevated response times.
This post from Rap Genius illustrated the problem we were seeing. This post on Timelike covers the same topic, but goes in depth on the math behind routing algorithm efficiencies. Math always gets to us. Heroku has published a blog post recommending Unicorn as a way to minimize the inefficiency of random routing. For high-traffic apps, though, the default configuration they provide won’t have much impact, since dynos are severely memory-limited. Each dyno can only run a small number of worker processes, so dynos are still at risk of taking on more requests than they can handle in a reasonable fashion, and the default Unicorn settings are not optimal as a solution to request queuing.
Over the course of a couple of months of research, tweaking, and performance monitoring, we’ve finally gotten a handle on our request queuing in Heroku. It’s not eliminated, and it probably won’t be as long as we are using a single-threaded Ruby implementation. However, we have been able to minimize our end user response time and virtually eliminate H12s, even at peak throughput, with some specific Unicorn configuration settings. We’re writing this post not to attack or defend Heroku’s routing implementation. Our goal is to describe our strategy to minimize request queuing, developed with invaluable guidance and research from Unicon:
Get visibility into your request queue depth
If you use New Relic (and you definitely should if you are running a high-traffic app), grab their latest gem (since the issue in this post primarily affects Rails, we’ll link directly to the relevant information). Once you have this gem fielded you will see the impact of request queuing on your app. For us, the picture was pretty grim.
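For a Rails app this usually just means adding the agent gem to your Gemfile and redeploying (a minimal sketch - use whatever version constraint your team prefers):

# Gemfile
gem 'newrelic_rpm'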
Do the dirty work of query optimization.
Start by optimizing long-running queries, per Heroku recommendation. This topic is worth a blog post itself (so we wrote one).
Dramatically reduce the Unicorn backlog.
The default backlog for Unicorn is 1024. With a queue depth that high, there’s no penalty for a random routing algorithm in sending traffic to an overloaded dyno. We’re currently set at 16 and can’t usefully go lower, since Unicorn uses Linux sockets, which round any backlog value below 16 up to 16. According to Heroku documentation, if the router sends a request to a dyno with a full backlog, it will retry the request. The retry destination is also random, but with an extremely short backlog setting it’s more likely that the request will end up in a short queue. There will be some overhead in the retry (and that’s not shown anywhere, even in New Relic, at the moment), but our experience has been that request queuing ends up being 2-3x more time on the server than processing time, so a little retrying shouldn’t make things worse overall. One warning, though: after 10 attempts, if the router hasn’t been able to find a free dyno, it will give up and throw an H21 error. You can see if this is happening using whatever log drain add-on you have set up in Heroku. It will take some experimentation, but the goal is to set the backlog low enough that you minimize queuing time, but not so low that you throw H21s.
Changing the Unicorn backlog setting for Heroku is similar to what is needed for setting Unicorn up with Nginx. In ./config/unicorn.rb you will want to create a listen command with the port and then specify the backlog number:
listen ENV['PORT'], :backlog => 200
The above example changes the backlog to 200, which causes the router to retry a request if the queue is full. Note that instead of a socket or a hard-coded port, you pull the port from Heroku’s PORT environment variable. You will likely want different backlog settings in staging, sandbox and production environments, so we highly recommend adding a second ENV parameter for the backlog amount, with a default:
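A minimal sketch of what that might look like (UNICORN_BACKLOG matches the heroku config command shown below; the fallback of 200 is just an example default):

# ./config/unicorn.rb
# Read the backlog from a Heroku config var, falling back to a default.
listen ENV['PORT'], :backlog => Integer(ENV['UNICORN_BACKLOG'] || 200)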
That way you can alter the backlog via a heroku config command rather than doing a deploy.
heroku config:set UNICORN_BACKLOG=16 -a <appname>
This has the added benefit of not relying on the Rails environment name for configuration should you be running more than the standard “development”, “test” and “production” environments.
What else we would do
Despite all of our optimizations we still see a fair bit of request queuing. Part of this definitely stems from the single-threaded nature of the stack we have chosen, and we’ve started researching a switch to a multi-threaded Ruby stack. In the shorter term, though, there are a few things we can think of that Heroku could do to reduce the problem.
From following the conversations on the web, we understand the challenge of implementing and managing an ‘intelligent’ routing mesh at the scale Heroku is working with. One option would be to segment high-traffic apps onto a separate mesh, one that could reasonably use a routing algorithm such as least-connections. That would carry real cost and significant engineering effort, no doubt; we imagine it would need to be part of a package specifically aimed at Heroku’s higher-end tier of customers.
An alternative that would seem to be a lower-effort solution would be to offer a dyno with increased memory: 1024MB, 2048MB, or even 4096MB. With a 4GB dyno we could run 16 workers per dyno. Since Unicorn’s master process manages queuing once the request is on the dyno, this would likely be dramatically more efficient overall. We would gladly pay 8x per dyno-hour for a 4GB dyno vs. eight 512MB dynos, since we would need far fewer of them overall, and our performance would improve at the same overall infrastructure footprint. Everything else about the dyno model could stay the same.
Summary: For single-threaded Rails web apps, a good strategy to improve concurrency is to ensure a low and consistent transaction time across all of your controller actions. New Relic provides excellent tools for identifying and prioritizing the worst offenders.
At Think Through Math, we’re always happy when we can use math to solve a real-world problem. When it comes to modeling operational efficiency in Rails, math once again provides a great framework. Many Rails apps are single-threaded; as of the time of this writing ours is no different. Let’s assume that the average controller method in our app takes 250ms. That means in a perfectly ordered environment we can process four requests per second, per web server queue. We’re hosted on Heroku, using the Unicorn Web server with four worker processes per dyno. Since each worker is its own queue, we can process on average 16 requests per-second-per-dyno, or 960 requests per minute. To generalize this we can write an equation: (1000 / avg_response_time) * queues_per_dyno * 60 = dyno_throughput.
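Plugging the example numbers back into that equation makes for a quick sanity check (nothing here is app-specific):

# 250ms average response time, 4 Unicorn workers (queues) per dyno
avg_response_time = 250.0                                 # milliseconds
queues_per_dyno   = 4
requests_per_second_per_queue = 1000 / avg_response_time  # => 4.0
dyno_throughput = requests_per_second_per_queue * queues_per_dyno * 60
# => 960.0 requests per minute, per dyno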
This is where New Relic comes in. New Relic will always show you your average app server response time, both real-time and as a graph over the time range you have selected.
To start using New Relic to optimize queries, click the Web transactions tab under Monitoring. Click App server if it isn’t already selected, then sort by Slowest average response time. This gives you a list of controller methods, sorted by highest average transaction time. Heroku’s recommendation is that all transactions have an average response time below 500ms, so that’s a good place to start. What we do is set the time window to reflect at least three hours of peak usage, then export the list as CSV (there’s an option for this in New Relic). We then look at the combination of average response time, max response time, standard deviation, and call count to prioritize our efforts.
Implementing the refactor to reduce the average is going to be specific to your application, but we generally end up using one of a few different strategies:
* Use the transaction tracing features in New Relic to identify the slowest parts of the transaction, and optimize that code
* Switch elements of the page to use an ajax callback strategy
* If it’s not time-sensitive, move the transaction to a background job (we use and love the awesome sidekiq for this; a sketch follows this list)
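As a rough illustration of that last strategy, here’s a minimal Sidekiq worker sketch (the class and method names are hypothetical, not from our codebase):

# app/workers/report_generation_worker.rb
class ReportGenerationWorker
  include Sidekiq::Worker

  def perform(student_id)
    # The slow work that used to run inline in the controller action.
    Student.find(student_id).generate_progress_report
  end
end

In the controller action you enqueue the job instead of doing the work inline - ReportGenerationWorker.perform_async(@student.id) - and the request returns immediately while the slow work runs outside the web transaction.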
We typically dedicate a portion of development effort each iteration to this activity. Rather than assign it to a specific developer, we like to hold optimization parties - divvy the list up and have people pair on solutions. It gives people a chance to work on parts of the code they don’t normally interact with, and it’s an excellent way to add technical breadth to the team.
My name is Jim Wrubel - I’m the Chief Technical Officer at Think Through Math, one of Big Nerd Ranch’s clients. In my spare time (such as it is) I’m a long-course triathlete and marathoner. Recently we partnered with Big Nerd Ranch to rebuild our core e-learning platform. During the process I realized that there are a lot of similarities between training for an endurance event like triathlon and launching a major web initiative.
The work you put in during the build determines your success in the event.
If you haven’t put in the time, no amount of last-minute workouts or coding will make up for it. This is something developers know instinctively - unfortunately it’s not common knowledge among managers. In both product launches and major races, you usually have a specific date that is set in stone, and if you haven’t done the work in the months leading up to the event, be prepared for a rough day. Coincidentally, training for an Ironman and developing a significant web app take about the same amount of effort and calendar time. If you can do one, you can do the other.
Repetition is the key to success.
In training and development, having a routine is critical. Plot your development iteration schedule - one week, two weeks, a month - and stick to it. Your workout schedule should also be iterative. I tend to follow a three-week build followed by one week of recovery. Even inside of an iteration, build routines and stick to them. Run the test suite (you do have a test suite, don’t you?) and make sure it passes before you push code. Try to schedule the same workouts at the same time each week. It makes it easier to plan around your life.
It’s easier with a partner or a group.
An Ironman bike leg is 112 miles. The only way to be ready to ride that distance in a race is to ride that distance in training. I am fortunate to have a great friend who is right around my cycling ability level. Without someone like that to keep you honest it’s too easy to look at a six-hour planned bike ride and decide that it’s too windy, or too hot, or it might rain. The same is true in application development. Having another developer on the project who can help internalize the business requirements, offer suggestions and code reviews, and serve as a sounding board is invaluable. If you haven’t already discovered the benefits of pair programming, it’s worth reading up on the subject. It also helps to be part of a group. At several times during our project, our lead developers at Big Nerd Ranch were able to solicit advice from the group that turned up an elegant solution that we hadn’t thought of previously. I get the same kind of support in endurance athletics from my local club, the Pittsburgh Triathlon Club.
Take the long view
Not every day will be as productive as you want it to be - in workouts or in code. It takes months to build an exceptional application and it takes months to train for an Ironman. Chances are you will get sick during this time. Or have a family emergency. Or any of a thousand things that could distract you from your goal. Make sure you plan for them and don’t get discouraged when they hit - because they will.
Climbs are never good for your heart rate.
Whether it’s average page response time for a web app or elevation gain for a cycling course, it’s the spikes that get you. Graphs of each metric are helpful in showing where the problem spots are. Rely on data you have gathered to help you make decisions. Both developers and endurance athletes tend to be data junkies, for good reason.
Summary
Great developers - architects and big thinkers - are hard to find, but if you hire only the best you will be far better off than if you settle. If you are interviewing a candidate and find out they’re a triathlete, give ‘em a deeper look. A lot of the things they do for fun might make them a good fit.