Earlier this year, I gave more than 200 eager participants a look inside the DevOps toolkit at Zumba during an informative webinar hosted by Loggly, New Relic, and PagerDuty. You can watch the webinar here:
During the session, I mentioned that Zumba uses 30 different software solutions and tools to keep all of our web properties operating at their best. Several people asked me if I could share the complete list, so I have assembled it below (in alphabetical order). You’ll note there are now 29, but that’s because things move fast on our team.
Our 30-Piece DevOps Toolkit
- Amazon AWS: Powers all of our user-facing applications with a variety of AWS tools, including EC2, RDS, Route53, S3, SQS, ElasticTranscoder, SNS, CloudFront, ElastiCache, Lambda, Glacier, Redshift, CloudTrail, and WorkSpaces.
- Atlassian HipChat: Helps my team stay in touch about issues and projects, whether we are right next to each other or worlds away. We like the fact that it integrates with Loggly alerts.
- Atlassian Jira: Tracks all our business-facing work. My standard answer to any request is “I’ll look into it, but while I do, can you create a Jira and assign it to me?”
- BlazeMeter: Supports our performance and load-testing initiatives. Our favorite feature is its support for running standard Apache JMeter files.
- BrowserStack: Makes it easier for us to test our applications with different desktop and mobile browsers. We currently test 11 different browser/OS/device combinations.
- CloudFlare: Delivers the content for all of our 300+ domains. We have a mix of Free, Business, and Enterprise plans. Our favorite feature is the SSL Certificate Management, which means we never again have to worry about a production certificate expiring on us.
- Code School: Gives us an efficient way to keep our team up to speed on all of the tools in our arsenal. Our favorite paths are Ruby and Git.
- Compose (formerly MongoHQ): Handles our MongoDB administration, which is something we currently don’t want to bring in-house because we use it for a small workload.
- Datadog: Aggregates, organizes, and beautifies all our metrics. We really like the event stream, which is like a log of everything that is going on across our infrastructure.
- Dead Man’s Snitch: Gives us a simple way to know that our cron jobs are actually firing.
- DigitalOcean: Hosts applications and services that should be run from outside of our production network on AWS, such as uptime checks and offsite backups.
- Fastly: Speeds up our site, plain and simple. Our favorite feature is the Varnish configuration file, which allows us to deploy complex cache rules.
- FullStory: Records the customer experiences on our site. There are just some things in your Apache log files that can only be explained by playing back what a user is actually doing to your app.
- GitHub: Gives us the code review and code management capabilities we need to work together as a team.
- Heroku: Hosts our internal tools. A bonus: because Heroku is really just AWS on the backend, traffic to S3 within the same region still costs nothing.
- Kraken: Provides image optimization services over an API. In addition to speeding up load times, it saves bandwidth and storage space.
- HuBoard: Turns GitHub issues into the best Kanban board ever. We track internal issues with GitHub, alongside Pull Requests.
- Loggly: Collects and organizes all of our log data, and I mean “all.” We send about 11 GB of pre-filtered data per day to Loggly via rsyslog, and every engineer on the team gets Loggly account access from day one.
- New Relic: Collects performance data for all of our applications and services with a variety of products, including APM, Browser, Synthetics, Mobile, Servers, and Plugins.
- PagerDuty: Helps us manage our incidents smoothly. We have schedules for the Quality, Operations, and Engineering teams.
- PathDefender: Certifies our sites for PCI Compliance by running weekly scans for common vulnerabilities.
- Quay: Hosts our private Docker images. We provide a Docker image for the team to run that has production parity.
- Rackspace: Serves as a great partner for site hosting.
- Segment: Integrates our tracking tools so we only need to track data once. You could argue that this isn’t an Operations tool, but you’d be wrong. Just wait until you have to correlate the number of times a user clicks the “Add to Cart” button with successful POST requests to your shop API. It’s brutal without front-end metrics.
- SendGrid: Helps us simplify our email infrastructure and keep email deliverability high.
- StatusPage: Powers our user-facing status pages and hosts our post-mortems.
- Travis CI: Runs checks against our code changes to help ensure we aren’t breaking things too often.
- Wistia: Streams videos for our content-delivery platform. After HBO GO, we didn’t want to risk it.
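The check-in idea behind the Dead Man’s Snitch entry above is simple enough to sketch. What follows is a minimal illustration, not Dead Man’s Snitch’s actual API: the URL is a made-up placeholder, and the function names are hypothetical. The job checks in only when it succeeds, so a job that fails or never fires shows up as a missing check-in.

```python
import logging
import urllib.request
from typing import Callable

# Hypothetical placeholder; a real monitoring service issues its own check-in URL.
CHECK_IN_URL = "https://example.invalid/check-in/your-token"


def http_ping(url: str = CHECK_IN_URL, timeout: float = 10.0) -> None:
    """Send a plain GET request to the check-in URL."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_check_in(
    job: Callable[[], None],
    ping: Callable[[], None] = http_ping,
) -> bool:
    """Run `job` and check in only if it finishes without raising.

    A failed job stays silent, so the monitor notices the missing check-in.
    """
    try:
        job()
    except Exception:
        logging.exception("Job failed; skipping check-in")
        return False
    ping()
    return True
```

Passing `ping` as a parameter keeps the wrapper easy to test without touching the network.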
This is a pretty big list, and we have seen all of these solutions pay for themselves. But when you’re moving down the DevOps path, it’s always important to remember that software by itself isn’t enough.
Three Best Practices for Successfully Deploying Software to Support DevOps
Best Practice #1: Take care of your “Type 1” issues first.
By this, I mean the basics of good DevOps. It’s an analogy from the famous CrossFit coach Dr. Kelly Starrett: If you want to improve your deadlift, you had better start out by making sure that you’re getting enough water and sleep. In DevOps, some Type 1 issues are:
- Version control everything: If you aren’t doing this yet, then stop reading and get started. At Zumba we even use Gists to share log output, specifically because they are version controlled.
- Have runbooks for all alerts: Document what an on-call engineer should do to act on an alert. Don’t make me think more than I have to at 3 a.m. We use Datadog monitors, which let us inline our runbooks. The runbooks themselves are just Markdown-formatted technical documents.
- Automate everything you can: Automation does have a point of negative return (Google “The Hazards of Going On Autopilot” for examples), but it also has its place.
- Keep the lines of communication open with your business units: Add them to your email threads and invite them into your chat rooms. Keep business units “in the know,” and they’ll treat you the same way. Ever have your site crash because marketing did something without you knowing?
- Remove “your” and “mine” from your team’s vocabulary and replace with “our.” This seemingly simple exercise works wonders. It will feel weird at first, but stick with it and watch your teamwork improve drastically.
Best Practice #2: Give your team access to everything.
Invite the entire team to your tools and services. Unless we are billed per user account or access is restricted for some other reason, every new team member gets account access on day one. By doing this, you not only empower them to get their jobs done better but also encourage everyone to speak a common language. For example, the Zumba team consistently thinks about the user experience of our sites in terms of the Apdex metric, which is a prominent part of New Relic’s interface.
One of my favorite realizations of this is using Loggly deep links to share interesting finds from our log data.
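If the Apdex metric mentioned above is new to you, the standard formula is short: pick a target response time T, count requests at or under T as satisfied and requests between T and 4T as tolerating, then give tolerating requests half credit. The sketch below follows that general definition. The function name and sample numbers are made up for illustration, and it ignores details such as how a monitoring product treats errors.

```python
from typing import Iterable


def apdex(response_times: Iterable[float], threshold: float) -> float:
    """Return the Apdex score (0.0 to 1.0) for response times in seconds.

    satisfied:  t <= T
    tolerating: T < t <= 4T (counts as half)
    frustrated: t > 4T (counts as zero)
    """
    times = list(response_times)
    if not times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in times if t <= threshold)
    tolerating = sum(1 for t in times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(times)


# Illustrative samples: 2 satisfied, 2 tolerating, 1 frustrated with T = 0.5 s.
score = apdex([0.2, 0.4, 0.9, 1.5, 3.0], threshold=0.5)
print(score)  # 0.6
```

A single number like this is what lets the whole team talk about user experience in the same terms.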
Best Practice #3: Eliminate false positives.
A false positive is the worst thing that can happen to an on-call engineer. Resentment, frustration, and distrust all simmer until eventually the engineer goes off call. If you want to keep everyone focused on solving problems as quickly as possible, you need to create an environment where the problems are real. Ensure that engineers have the ability to edit alerts, and continually prune your alerts to align them with reality.
Our team puts fixing false positives at the top of our backlog every week. Usually, all we have to do is edit our alert thresholds to reflect the current reality, but in situations where more work is involved, we take the time to do that work. For example, we recently released a new product that increased our average memcache get counts. The on-call engineer noticed that the new trend started at the same time as the product launch, adjusted our alert thresholds, and went back to bed.
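One way to take the guesswork out of a threshold adjustment like this is to derive the new value from recent data. The sketch below is an assumption about how you might do it, not a description of our monitors: the function name, the “mean plus k standard deviations” rule, and the sample numbers are all illustrative.

```python
import statistics
from typing import Sequence


def suggest_threshold(samples: Sequence[float], k: float = 3.0) -> float:
    """Suggest an alert threshold of mean + k standard deviations.

    `samples` should cover only the period after the change in behavior
    (for example, since a product launch), so the old baseline does not
    drag the threshold down.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return statistics.mean(samples) + k * statistics.pstdev(samples)


# Illustrative numbers: a metric that now hovers around 100 per interval.
print(suggest_threshold([90, 110]))  # 130.0
```

Treat the result as a starting point for a human to review rather than a value to apply blindly. A larger `k` makes for a quieter pager but slower detection.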
If you want more insight into how to make DevOps work at your organization, be sure to check out the webinar above. I know you’ll walk away with a lot of good ideas!
Douglas Jarquin is a technologist, father, husband, and avid reader living in Miami, FL. He serves as Director of DevOps for Zumba Fitness, a global lifestyle brand that fuses fitness, entertainment, and culture into an exhilarating dance-party workout.