Think Through Math launched a completely rebuilt version of our core math instruction platform in time for the 2012-2013 school year. We launched on Heroku because it is the platform with the least deployment friction and the most developer-friendly interface for managing production environments. Almost from the beginning we were plagued by performance issues. TTM added some very large customers at the same time we launched the application, which exacerbated the problem, but as it turns out the greatest source of our production issues was the one the Rap Genius incident exposed: request queuing at the router level. Once we found the issue we were able to mitigate it by switching to double dynos and dramatically increasing our dyno count, but even our best-case scenario had 500-700ms of request queuing at peak load.
After considering our options we decided to transition away from Heroku. As a company we believe the public cloud is the right place to host our application at this stage in our growth, and for a host of reasons (some of which we will cover in future posts) we chose AWS over other cloud providers. Heroku does many things really well with its platform, and we wanted to keep those features in our new platform. We made the following list of features we needed to have on AWS:
- Deploy with git
- Environment variables handled the same way Heroku handles them
- API ‘parity’ - ideally mimic Heroku’s CLI commands to reduce the learning curve
- Templatized server roles
- Cron support
- Rails Console and psql console through CLI
- Centralized logging through Papertrail
- Scaling through the CLI
- Database backups
- Follower databases, including fast db cutovers
After some research we settled on Scalr as a replacement for much of the functionality we got from Heroku. Scalr provides AWS templates for our Rails stack, background workers (we use Sidekiq Pro), and Postgres. We left all of our other production services as hosted cloud services: Redis, logging, and Solr.
Scalr allows us to pick instance sizes based on our needs; after some experimentation we settled on c1.xlarge for our web and background tiers. We use Unicorn for our web tier, so based on the available memory and our app size we set the worker processes per server to 20. We used Scalr’s Postgres 9.2 template running on m3.2xlarge with the EBS-optimized IO feature, and we tuned the Postgres settings based on our app parameters. One thing we always suspected about running our app on Heroku was that our inability to tune work_mem and other Postgres settings was responsible for some of our performance limitations, and forced us onto larger databases overall to support our traffic. Scalr supports slave databases that are similar to Heroku followers, although a bit more limited in that it’s a one-to-one ratio. If the master database runs into problems, Scalr automatically promotes the slave to master and builds a new replacement slave.
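As a rough illustration of how a worker count like that can be derived from memory, here is a minimal sketch. The memory figures are assumptions chosen for illustration (about 7 GiB of RAM on the instance, 1 GiB held back for the OS and other processes, roughly 300 MB per Unicorn worker); they are not measurements from our app, and `unicorn_workers_for` is a hypothetical helper name.

```ruby
# Estimate how many Unicorn workers fit in a server's memory.
# All numbers passed in below are illustrative assumptions, not measurements.
def unicorn_workers_for(total_mb:, reserved_mb:, per_worker_mb:)
  raise ArgumentError, "per_worker_mb must be positive" unless per_worker_mb.positive?

  usable_mb = total_mb - reserved_mb
  return 0 if usable_mb <= 0

  # Round down so the workers never exceed the usable memory.
  (usable_mb / per_worker_mb).floor
end

workers = unicorn_workers_for(total_mb: 7 * 1024, reserved_mb: 1024, per_worker_mb: 300)

# The result would feed the worker_processes directive in a Unicorn config file.
puts "worker_processes #{workers}"
```

In practice the per-worker figure should come from watching real worker memory under load, since Rails processes tend to grow after boot.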
At the routing layer we settled on ELB for load balancing, with random routing to the Rails servers. If we had more time we might have gone with HAProxy, but as an education platform we have a narrow window in the summer, when school’s out, to launch and test significant changes like a platform switch.
Scalr uses ‘farms’, which are conceptually similar to Heroku apps, and inside each farm we were able to add roles based on the templates we defined. We added two special roles: one for cron (to mimic Heroku Scheduler) and one for ad-hoc remote access (to mimic the Heroku console). Each of these roles runs a full stack of our app, and the servers are updated along with the rest of the app whenever we deploy.
Speaking of deployments, we again wanted to follow the model that Heroku uses: deploy via CLI and git. To implement this we set up separate private tracking repos on GitHub for all of our farms. Scalr provides event hooks into its application workflow, so we configured each farm to pull from its corresponding repo when it receives a deploy command. One thing we don’t get with this workflow is the ability to watch the deploy process do its thing, but in our experience 90% of that process on Heroku was asset compilation.
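To make the pull-on-deploy idea concrete, here is a hypothetical sketch of the steps a server might run when it receives a deploy event. This is not the actual Scalr hook: the application directory, remote name, branch, and the helper names `deploy_steps` and `run_deploy` are all assumptions, and the app restart is left out because it depends on how the processes are supervised.

```ruby
require "shellwords"

# Hypothetical list of shell steps for a pull-based deploy.
# app_dir, remote, and branch are illustrative assumptions.
def deploy_steps(app_dir:, remote: "origin", branch: "master")
  [
    "cd #{Shellwords.escape(app_dir)}",
    "git fetch #{Shellwords.escape(remote)}",
    "git reset --hard #{Shellwords.escape("#{remote}/#{branch}")}",
    "bundle install --deployment",
    "bundle exec rake assets:precompile RAILS_ENV=production",
  ]
end

# Join the steps so a failure stops the chain. With dry_run the script is
# only returned, which makes the hook easy to inspect before running it.
def run_deploy(steps, dry_run: true)
  script = steps.join(" && ")
  return script if dry_run

  system(script) or raise "deploy failed"
end

puts run_deploy(deploy_steps(app_dir: "/srv/app"))
```

Returning the script in dry-run mode partly makes up for not being able to watch the deploy: the exact commands can be logged alongside their exit status.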
As a team we’ve gotten used to managing everything through the command line, so an additional goal was to maintain ‘parity’ in syntax for managing our production deployments. Scalr has a gem wrapper for its API, but it wasn’t well maintained, so we forked it and added syntactic sugar to match the Heroku CLI. Where we previously listed our environment variables with heroku config -a APP_NAME, we now use ttmscalr config -f FARM_NAME. To build out the CLI we implemented deploy, rails c, psql, restart, and maintenance, along with some commands we couldn’t use under Heroku, like ssh and scp.
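The shape of such a command can be sketched with Ruby’s standard OptionParser. This is an illustration only, not the code in our fork: `parse_cli`, the `COMMANDS` list, the `farmctl` name in the banner, and the use of `console` as a stand-in for rails c are assumptions.

```ruby
require "optparse"

# Hypothetical subcommand list; "console" stands in for "rails c".
COMMANDS = %w[config deploy console psql restart maintenance ssh scp].freeze

# Parse arguments shaped like: COMMAND -f FARM_NAME [extra args]
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: farmctl COMMAND -f FARM_NAME"
    opts.on("-f", "--farm FARM", "Farm to operate on") { |name| options[:farm] = name }
  end

  rest = parser.parse(argv) # non-option arguments, in order
  command = rest.shift
  raise ArgumentError, "unknown command: #{command.inspect}" unless COMMANDS.include?(command)
  raise ArgumentError, "missing -f FARM_NAME" unless options[:farm]

  { command: command, farm: options[:farm], args: rest }
end

p parse_cli(%w[config -f my-farm])
```

A real wrapper would dispatch on the parsed command to the relevant API call; keeping parsing separate from dispatch makes the Heroku-style syntax easy to test on its own.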
The results we’ve seen from the cutover are pretty dramatic. According to New Relic, our average app server response time is now below 200ms, even at usage levels that are 40% over last year’s peak. On Heroku, even on our best days, we were seeing 500-700ms of request queuing. We can’t say with certainty why things are so much better. We are using larger Rails servers with more workers per server, so the impact of random routing is minimized. We also suspect that Postgres tuning makes a big difference. Beyond that we don’t have good answers, but with performance now stable we’re able to refocus our attention on functionality, so we haven’t spent a lot of time investigating.
One thing we did not see was a lot of cost savings from the switch from Heroku to Amazon. Costs are much lower on a server-to-server basis, but once we factor in support costs, the dedicated slave databases we now pay for, and things like backups, storage, and data transfer, our overall hosting cost is roughly the same. The big win for us was stability and response time. Those benefits alone made the cutover worthwhile.
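The effect of workers per server under random routing can be illustrated with a toy queueing simulation. This is a sketch under stated assumptions, not a model of our production traffic: arrivals and service times are exponentially distributed, the 100ms mean service time and 70% utilization are made-up inputs, and `mean_queue_wait` is a hypothetical helper. Both runs use the same total number of workers; only the grouping per server changes.

```ruby
# Toy model: requests are routed to a random server and wait until one of
# that server's workers is free. Returns the mean time spent queued, in seconds.
def mean_queue_wait(servers:, workers_per_server:, requests: 20_000, utilization: 0.7, seed: 42)
  rng = Random.new(seed)
  mean_service = 0.1 # seconds per request (assumed)
  arrival_rate = utilization * servers * workers_per_server / mean_service

  # free_at[s][w] is the time at which worker w on server s becomes idle.
  free_at = Array.new(servers) { Array.new(workers_per_server, 0.0) }
  now = 0.0
  total_wait = 0.0

  requests.times do
    now += -Math.log(1.0 - rng.rand) / arrival_rate # next arrival
    server = free_at[rng.rand(servers)]             # random routing
    idx = server.index(server.min)                  # earliest-free worker
    start = [now, server[idx]].max
    total_wait += start - now
    server[idx] = start + (-Math.log(1.0 - rng.rand) * mean_service)
  end

  total_wait / requests
end

small = mean_queue_wait(servers: 20, workers_per_server: 2)
large = mean_queue_wait(servers: 2, workers_per_server: 20)
puts format("2 workers/server:  %.1f ms mean queue wait", small * 1000)
puts format("20 workers/server: %.1f ms mean queue wait", large * 1000)
```

In this model the many-small-servers layout queues noticeably longer, because a randomly chosen server with only a couple of workers is much more likely to have all of them busy while workers elsewhere sit idle.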
