Everyone knows you need data to make decisions. If you’re developing software of any sort, quick access to metrics lets you answer important questions like these:
- Is your app working?
- If not, which piece of your infrastructure specifically broke?
- Which areas need performance optimization?
- Do you need to scale?
- What was the impact of your recent code deployment?
- Do your users care?
At development time we can get all of the data we need, and we use it for debugging, performance optimization and so on. But what good is data if it's not real-world?
Things that should be true in 2013
- Pushing your app to production shouldn’t mean a loss of insight.
- We should be able to measure everything, at all layers of the stack.
- We should get infrastructure and application monitoring by default.
- We should be able to visualise that data quickly and easily, so we can use it to solve problems and make informed decisions.
The current state of metrics tracking
Collecting metrics used to be painful. That is definitely no longer true: a massive number of analytics services have cropped up in recent years. Here are just a few:
Application Monitoring – New Relic, AppDynamics
Infrastructure Monitoring – Nagios, Pingdom
Business Insights – Google Analytics, Mixpanel, KISSmetrics, etc.
Logs/Exceptions – Airbrake, Logstash
These services take you 80% of the way. They provide a good base set of metrics you can start using immediately, and they are excellent solutions. But only you know the specific metrics you care about, so you can do better.
The whole idea of tracking custom metrics is to get back the insight you lose when you push your application to production. You want to avoid scrabbling in your bag for UNIX tools like top, free, df and netstat while you hunt down the problem that has broken your site. This data should be in your hands already, not in log files or system utilities.
Enter CollectD, StatsD, and Graphite
Together they form a suite of tools with almost unlimited flexibility: they can track anything and everything. They are platform agnostic. All you need is the ability to send a UDP packet, and with library bindings for all the major languages, such as Node, Ruby, Python and Go, you won't even need to do that yourself.
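To show how little "the ability to send a UDP packet" involves, here is a minimal Python sketch that sends StatsD-style datagrams by hand. The metric names are made up for illustration, and it assumes a StatsD daemon listening on its usual default of 127.0.0.1:8125. In practice you would probably reach for a client library instead.

```python
import socket


def format_metric(name: str, value: float, kind: str) -> bytes:
    """Build one StatsD datagram of the form b"<name>:<value>|<type>"."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire and forget: send one UDP datagram and never wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, purely for illustration.
    send_metric(format_metric("app.signups", 1, "c"))             # counter
    send_metric(format_metric("app.checkout.render", 320, "ms"))  # timer
```

Because UDP is connectionless, the sender doesn't block or fail if nothing is listening, which is what makes liberal instrumentation cheap.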
How do they fit together?
Application & Business Metrics – Tracked by StatsD
Infrastructure Metrics – Tracked by CollectD
The simplicity and speed of StatsD
- UDP based – fire and forget. Instrument your app liberally, because it's so fast it's effectively free.
- Very simple to use
- Lightweight Node.js daemon
- Aggregates stats and sends them to pluggable backends over TCP (Graphite, Librato)
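To make the aggregation step a little more concrete, here is a toy Python model of it. This is not StatsD's actual implementation: it only handles counters, ignores sample rates, and the "stats" prefix is just an illustrative choice. It sums the datagrams received during one flush interval and renders them in Graphite's plaintext format of path, value and timestamp.

```python
from collections import defaultdict


def aggregate(datagrams):
    """Sum counter datagrams ("name:value|c") by metric name."""
    totals = defaultdict(float)
    for datagram in datagrams:
        name, rest = datagram.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":  # this sketch only understands plain counters
            totals[name] += float(value)
    return dict(totals)


def to_graphite_lines(totals, timestamp, prefix="stats"):
    """Render totals as Graphite plaintext lines: "<path> <value> <timestamp>"."""
    return [
        f"{prefix}.{name} {value:g} {timestamp}"
        for name, value in sorted(totals.items())
    ]


if __name__ == "__main__":
    received = ["app.signups:1|c", "app.signups:2|c", "app.errors:1|c"]
    for line in to_graphite_lines(aggregate(received), 1700000000):
        print(line)
```

The point is that your app sends many tiny packets, while the backend only sees one line per metric per flush.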
There are also a number of libraries that tap into common frameworks, providing liberal instrumentation without cluttering your codebase. Take a look at the Nunes gem for Ruby on Rails or ActiveSupport::Notifications for more details.
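The same idea can be hand-rolled in a few lines. The sketch below is not Nunes or ActiveSupport::Notifications, just a rough Python analogue: a decorator that times a function and hands a StatsD timer line to whatever `send` callable you give it. The names `timed`, `build_report` and the metric name are all hypothetical.

```python
import functools
import time


def timed(name, send):
    """Decorator: time each call and pass a StatsD timer line to `send`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Report the duration even if the function raised.
                elapsed_ms = (time.perf_counter() - start) * 1000
                send(f"{name}:{elapsed_ms:.0f}|ms")
        return wrapper
    return decorator


sent = []  # stand-in transport; a real one would send each line over UDP


@timed("app.report.build", sent.append)
def build_report():
    return sum(range(1000))
```

The business logic in `build_report` stays untouched; the instrumentation lives entirely in the decorator line above it.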
The convenience of CollectD
- Collects system metrics such as CPU, Memory, Disk, Network
- An array of plugins which extend metrics gathering all the way through your stack to the likes of Varnish, Nginx, Memcached, Redis, MySQL, Postgres, RabbitMQ.
- Custom extensibility with the ‘exec’ plugin if that’s not enough.
- Pluggable backends just like StatsD (Graphite, Librato)
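As a rough idea of what the ‘exec’ route looks like, here is a Python sketch of a script CollectD could run. The exec plugin reads PUTVAL lines from the script's standard output. The plugin instance `myapp`, the `queue_depth` metric and the `read_queue_depth` probe are all hypothetical stand-ins for whatever you actually want to measure.

```python
import os


def putval_line(host, plugin, type_instance, value, interval):
    """Format one reading in the plain-text protocol the exec plugin reads.

    "N" tells CollectD to use the current time as the timestamp.
    """
    return f'PUTVAL "{host}/{plugin}/gauge-{type_instance}" interval={interval} N:{value}'


def read_queue_depth():
    """Hypothetical probe; replace with a real measurement."""
    return 42


def emit_once():
    # CollectD exports these variables to scripts started by the exec plugin;
    # the fallbacks are only there so the sketch runs standalone.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    print(putval_line(host, "exec-myapp", "queue_depth", read_queue_depth(), interval),
          flush=True)


if __name__ == "__main__":
    # A real exec script would repeat this in a loop, sleeping `interval` seconds.
    emit_once()
```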
The (ahem) joys of Graphite
Graphite is a very powerful backend to the previous two services. It provides real-time visualisation of all of this data. However, I didn’t get on with it too well.
Problems: It’s ugly, it’s time-consuming to generate nice reports, and it’s annoying to deploy.
Librato is a paid service seemingly pitched as a hosted Graphite alternative. That makes it the odd one out, as the rest of the stack is tried and tested open source software. Still, Librato was compelling for me for the following reasons:
- No time spent setting up and deploying Graphite.
- Extremely malleable: create custom dashboards from metrics in seconds.
- Great interface: you could give this directly to anyone non-technical.
- Email, Campfire and PagerDuty alerts based on thresholds.
Librato gets all the more powerful when you plug in features like annotations, which provide a great way of signalling ‘events’ such as code deployments and infrastructure changes.
I added a simple Capistrano task to push the current commit info to Librato upon deployment:
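The Capistrano task itself isn't reproduced here. As a stand-in, this is a rough Python sketch of the same idea, not the original task: it builds the HTTP request that would post a deployment annotation. The endpoint path, the JSON field names and the email-plus-token basic auth are my assumptions about Librato's annotations API, so check them against the current documentation before relying on them. The stream name `deployments` and the credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_annotation_request(stream, title, description, user, token,
                             base_url="https://metrics-api.librato.com/v1"):
    """Build (but do not send) a POST creating an annotation on `stream`.

    Assumed, not verified here: the /annotations/<stream> path, the
    title/description fields, and HTTP basic auth with user and API token.
    """
    body = json.dumps({"title": title, "description": description}).encode("utf-8")
    credentials = base64.b64encode(f"{user}:{token}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/annotations/{stream}",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_annotation_request(
        "deployments", "Deployed abc1234", "Placeholder commit message",
        "user@example.com", "placeholder-token",
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

A deploy hook would fill in the title and description from the current commit and call this once the release is live.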
The result is fast, flexible insight into your application. You can fully instrument the effect of your changes, which enables continuous delivery. Low-risk changes with immediate feedback mean you can deploy with confidence every time. Give it a try.
I think the key to success in metrics tracking is to automate it. Make this trio a part of your default stack: write a Chef cookbook or an Ansible playbook. In fact, use mine!
Do you have a better way to monitor your app and business metrics? Let me know in the comments section below!