Pennsylvania dev team uses this one weird trick to save 67% on their New Relic bill

We love New Relic. It’s saved us countless times in production, it’s the tool we reach for when we turn developer attention toward optimization, and it’s the gold standard for application monitoring. Honestly, our only complaint is that as we’ve grown, the monthly spend on New Relic has gotten steep.

Our primary platform is a Rails app. We serve nearly 10% of U.S. math students in grades 3 through Algebra, so on any given school day we’re pushing a lot of traffic. At peak load we run a pool of 20 web servers on AWS. We typically need max capacity for only 6 hours a day, since our night and weekend traffic is much lower, but New Relic charges by the server for its Standard and Pro accounts (we use Pro and would recommend it to anyone). Although the pricing model makes some allowance for ‘burst’ usage, at the list price of $149/mo per server on an annual contract we would end up spending nearly $3k per month (20 × $149 = $2,980).

At Think Through Math we’re always looking to spend money wisely. Anything we save can help us reach more students at lower cost or add staff so we can ship features more quickly, so any infrastructure spend is a candidate for trimming. Our app server pool uses random request routing (yeah, we know). But an interesting side effect of random routing across a large pool of servers is that any subset of servers sees a statistically representative sample of overall traffic. In other words, if only 7 of our 20 servers reported data to New Relic, the data would be just as actionable.
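As a back-of-the-envelope check on those numbers, here is a minimal sketch. It assumes every Pro server is billed at the $149 list price for the whole month (ignoring any burst allowance) and that servers on the free tier cost nothing; the constant and method names are ours for illustration.

```ruby
LIST_PRICE_PER_SERVER = 149 # USD per month, Pro list price on an annual contract
POOL_SIZE = 20              # web servers at peak

# Monthly spend if `pro_servers` servers report to the Pro account.
def monthly_cost(pro_servers, price = LIST_PRICE_PER_SERVER)
  pro_servers * price
end

full    = monthly_cost(POOL_SIZE)  # 20 * 149 = 2980
sampled = monthly_cost(7)          # 7 * 149 = 1043
savings = 1 - sampled.to_f / full  # fraction of the bill avoided

puts "all Pro: $#{full}/mo, 7 of 20 on Pro: $#{sampled}/mo, savings: #{(savings * 100).round}%"
```

With 7 of 20 servers on Pro the savings work out to about 65%. A steady one-in-three ratio approaches two-thirds (roughly 67%) as the pool grows.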

New Relic provides average, median, 95th percentile, and 99th percentile values for overall browser page load time and app server response time, as well as breakdowns for frequently accessed controller methods, on the default dashboard home page. Error rate is also reported on the dashboard. On the Web Transactions page there are more detailed breakdowns for controller methods, and on the Database tab there are similar breakdowns for individual database calls. All of these are calculated over the total population of calls. As long as throughput is high enough on the servers that report to New Relic, these statistics will closely mirror those of the population as a whole. Likewise, New Relic’s Apdex value is calculated from the same overall statistics, so the value for the subset should closely match the value for the whole.
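If you want to convince yourself of the sampling argument before trying it, a quick simulation helps. The sketch below is not New Relic’s math; it routes simulated requests uniformly at random across 20 servers, invents a skewed response-time distribution (a 30 ms floor plus an exponential tail with a 120 ms mean, purely an assumption), and compares percentiles for all requests against those seen by 7 of the servers.

```ruby
POOL_SIZE = 20      # servers behind the random router
REPORTING = 7       # servers whose requests are "reported"
REQUESTS  = 200_000 # simulated requests

rng = Random.new(42) # fixed seed so runs are repeatable

# Nearest-rank style percentile on an already sorted array.
def percentile(sorted, pct)
  sorted[((sorted.length - 1) * pct).round]
end

all = []
sampled = []
REQUESTS.times do
  server = rng.rand(POOL_SIZE) # random routing: uniform over 0...POOL_SIZE
  # Hypothetical latency in ms: 30 ms floor plus an exponential tail (mean 120 ms).
  ms = 30 - 120 * Math.log(1 - rng.rand)
  all << ms
  sampled << ms if server < REPORTING
end

all_sorted = all.sort
sampled_sorted = sampled.sort
stats = { "median" => 0.50, "p95" => 0.95, "p99" => 0.99 }

stats.each do |name, pct|
  puts format("%-6s all=%.1f ms  sampled=%.1f ms",
              name, percentile(all_sorted, pct), percentile(sampled_sorted, pct))
end
```

The two columns should land within a few milliseconds of each other. The agreement gets worse as the number of sampled requests shrinks, which is why throughput on the reporting servers matters.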

So what’s the downside? If you use New Relic ppm or rpm values as indicators of capacity or throughput, you’ll need to scale the values for the subset up to the population as a whole. We post metrics to an internal dashboard, so to make our statistics match we use two New Relic accounts, one Pro and one on the free Lite tier, and combine the values from both API calls. This has the added advantage of giving us New Relic Lite features on the rest of the servers, which are very robust even with their limitations. Also, if you use any of the New Relic reports (SLA, Capacity, Scalability, etc.), the absolute values reported by your subset will not reflect your true throughput.
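The translation itself is simple arithmetic. Here is a minimal sketch of both approaches: summing the throughput reported by the two accounts, or scaling the subset’s number up when only the subset reports. The method names and the rpm figures are made up for illustration, and fetching the values from New Relic is left out entirely.

```ruby
# Each account only sees its own servers, so pool-wide throughput is the sum.
def combined_rpm(pro_rpm, lite_rpm)
  pro_rpm + lite_rpm
end

# With only the subset reporting, estimate pool-wide throughput by assuming
# random routing spreads requests evenly across every server in the pool.
def estimated_total_rpm(subset_rpm, reporting_servers, total_servers)
  raise ArgumentError, "reporting_servers must be positive" unless reporting_servers.positive?

  subset_rpm * total_servers.fdiv(reporting_servers)
end

pro_rpm  = 3_500 # hypothetical value read from the Pro account
lite_rpm = 6_500 # hypothetical value read from the Lite account

puts "combined:  #{combined_rpm(pro_rpm, lite_rpm)} rpm"
puts "estimated: #{estimated_total_rpm(pro_rpm, 7, 20).round} rpm"
```

Summing is exact as long as every server reports to one account or the other. Scaling is only an estimate and drifts if the pool size changes between samples.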

Implementation

New Relic uses a license key to match data from a reporting server to an account. In our environment the key is set in a YAML file that’s loaded during initialization. To set this up, each server needs a way to determine its position within the pool of servers of its type during initialization. We use Scalr to manage our production environment. It provides SCALR_INSTANCE_INDEX, which we read from the environment, so we can take that value modulo 3 to determine whether the server should register with the Pro New Relic account. We added a second app on the Lite plan and use the modulo to split servers between the two, so every third server gets the Pro key:

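license_key: <%= ENV["SCALR_INSTANCE_INDEX"].to_i % 3 == 0 ? ENV["NEW_RELIC_LICENSE_KEY"] : ENV["NEW_RELIC_LITE_KEY"] %> (from config/newrelic.yml in our Rails app; note the .to_i, since ENV values are always strings and "3" % 3 is string formatting, not arithmetic)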
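The same selection logic can be pulled into a small helper, which makes it easy to see which servers end up on Pro. This is a sketch rather than the exact code we run: pro_server? and license_key_for are hypothetical names, and whether the instance index starts at 0 or 1 on your platform is an assumption you should check, because it changes the count.

```ruby
PRO_EVERY = 3 # every third server reports to the Pro account

# True when the server at this index should use the Pro key.
# Accepts the raw string from ENV; raises if it is missing or not a number.
def pro_server?(instance_index, every = PRO_EVERY)
  Integer(instance_index.to_s, 10) % every == 0
end

# Pick the license key for a server; `env` is injectable so it can be tested.
def license_key_for(instance_index, env = ENV)
  pro_server?(instance_index) ? env["NEW_RELIC_LICENSE_KEY"] : env["NEW_RELIC_LITE_KEY"]
end

# Placeholder keys, for illustration only.
fake_env = { "NEW_RELIC_LICENSE_KEY" => "pro-key", "NEW_RELIC_LITE_KEY" => "lite-key" }

pro_indices = (0...20).select { |i| pro_server?(i.to_s) } # assumes 0-based indices
puts "Pro servers: #{pro_indices.inspect} (#{pro_indices.length} of 20)"
puts "index 4 gets: #{license_key_for('4', fake_env)}"
```

With 0-based indices, 7 of 20 servers pick up the Pro key; with 1-based indices it would be 6. Failing loudly on a missing index is a design choice: the alternative is every server silently falling back to one account.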

On boot, each server picks up the Pro or Lite key depending on its index. We’ve set up gists for implementing this on Scalr and Heroku for reference. These are Ruby/Rails-specific, but we welcome feedback on implementations for other platforms and languages. If we get any, we’ll update this post. In any event, this should be enough to get you started. Hit up the comments if you need more guidance.

We realize this is gaming the New Relic system to an extent by signing up for multiple accounts at different tiers. At TTM we’re big fans of paying for the software and services we use (either through cash or contributions to OSS), and if we don’t need Pro monitoring on our entire farm, it should be easier to specify only the servers we want on Pro. Right now it’s not, but this workaround made it possible for us to control costs and get the same value out of New Relic regardless of how big we get.

Tech Blog

The technology team behind Think Through Math
