30 best practices for logging at scale
Almost all software services start life small, with logging handled on the developer console or perhaps written to a file. Life is easy … until the service starts to grow.
Professionals running large-scale software applications are probably familiar with the words of the Notorious B.I.G.: “Mo money, mo problems.” As an application grows to multiple servers, the people running it need a centralized place to store and search logs. When the volume gets too large for command-line tools like grep, they might set up an open-source search engine to search the logs. As the service grows even larger, they’ll face new challenges including scalability, reliable delivery, security, performance, ease of use, and more. Even top companies can struggle for months working out the kinks, and that’s when they call in logging experts.
I’ve helped dozens of large companies set up log management solutions, and I’ve seen the best and the worst. Based on what I’ve learned, I created a checklist of 30 best practices for creating, transmitting, and analyzing logs. Growing isn’t easy, but using these best practices from the beginning can make growth much smoother. How many does your company meet?
Best practices for creating logs
- Use a standard and easily configurable logging framework.
Frameworks such as log4j and log4net allow faster configuration changes than hard-coded or proprietary ones.
- Use a logging framework with flexible output options.
View console logs in development and centralize prod logs without extra plugins or agents.
- Use a standard structured format like JSON.
Custom formats and raw text logs need custom parsing rules to extract data for analysis.
- Create a standard schema for your fields.
Adding fields ad hoc can create a rat’s nest. A standard lets everyone know where to look.
- Don’t let logging block your application.
Write logs asynchronously with a buffer or queue so the application can keep running.
- Avoid vendor lock-in.
Don’t hardcode vendor libraries. Use a standard library or wrapper for portability.
- Beware of restrictions in Platform as a Service (PaaS) or container-based environments.
Environments like Heroku and Docker set restrictions on host access, syslog daemons, and more.
- Offer a standard logging configuration for all teams.
Avoid chaos as the company grows. Start with a best practice and let teams deviate as needed.
- Don’t forget legacy application logs.
Find a way to send logs from legacy apps, which are frequently culprits in operational issues.
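Several of the practices above (a standard framework, JSON output, a fixed schema, and non-blocking writes) can be sketched together with Python's standard logging module. This is a minimal illustration, not a recommended production setup: the field names in the schema, the `build_logger` helper, and the queue size are assumptions made for the example.

```python
import io
import json
import logging
import logging.handlers
import queue


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of field names."""

    def format(self, record):
        # Hypothetical schema: every team would agree on these keys up front.
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured context passed via extra={"context": {...}}.
        context = getattr(record, "context", None)
        if context:
            entry["context"] = context
        return json.dumps(entry)


def build_logger(name, stream, max_queued=10000):
    """Return (logger, listener). The caller should stop the listener on shutdown."""
    # Bounded queue so a slow destination cannot consume unlimited memory.
    # If it fills up, QueueHandler does not block: the record is dropped and
    # reported through logging's error handler.
    log_queue = queue.Queue(maxsize=max_queued)

    # The real output lives behind the queue. A StreamHandler is used here;
    # a file or network handler could be swapped in without touching app code.
    target = logging.StreamHandler(stream)
    target.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, target)
    listener.start()

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    return logger, listener


def demo():
    out = io.StringIO()
    logger, listener = build_logger("checkout", out)
    logger.info("order placed", extra={"context": {"order_id": 42}})
    listener.stop()  # processes anything still queued, then joins the worker thread
    return json.loads(out.getvalue())
```

The application thread only pays the cost of putting a record on the queue, while a background thread does the formatting and writing. Because the destination is just a handler, the same code can print to the console in development and ship to a central service in production.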
Best practices for transmitting logs
- Use fault-tolerant protocols.
Use TCP or RELP to transmit logs instead of UDP, which can lose packets.
- Automatically retry if sending fails.
Unreliable protocols and poorly written libraries can drop logs unexpectedly.
- Configure local storage for network outages and server back pressure.
Good agents and libraries offer disk-assisted queues, but you must set them up.
- Don’t let local storage use all your memory or disk space.
If you don’t configure a limit or rotation policy, logs can crash your server.
- Monitor for backlogs and outages.
Backlogs could indicate network or server problems and lead to data loss.
- Avoid sending a burst of old logs when you start shipping files.
Rotate logs before configuring new servers, or set daemons to send only new entries, so existing files are not transmitted all at once.
- Filter sensitive data before transmitting it.
Lower exposure by not logging sensitive data or by scrubbing it before it leaves your network.
- Encrypt data in transit.
Use HTTPS or set up TLS certificates to keep data secure.
- Configure your proxy and firewall.
Check your firewall ports for syslog, and route traffic to servers with Internet access.
- Use your configuration management system to set up logging.
Automate large deployments of logging configuration using Ansible, Puppet, etc.
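The retry, bounded local storage, backlog monitoring, and scrubbing practices above can be illustrated with a small shipper class. This is a sketch under stated assumptions rather than a replacement for a real agent: `BufferedShipper`, `scrub`, and the card-number pattern are hypothetical names invented for the example, and the `send` callable stands in for whatever reliable transport you use (for instance, a TCP socket wrapped in TLS).

```python
import collections
import re
import time

# Hypothetical rule: mask anything that looks like a 13 to 16 digit card number.
CARD_RE = re.compile(r"\b\d{13,16}\b")


def scrub(line):
    """Remove sensitive values before the line leaves the host."""
    return CARD_RE.sub("[REDACTED]", line)


class BufferedShipper:
    """Buffer log lines locally and send them with retries and backoff."""

    def __init__(self, send, max_buffered=1000, max_attempts=3,
                 base_delay=0.5, sleep=time.sleep):
        self._send = send
        # Bounded buffer: when full, the oldest line is discarded so the
        # buffer can never exhaust memory.
        self._buffer = collections.deque(maxlen=max_buffered)
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep

    def submit(self, line):
        self._buffer.append(scrub(line))

    def flush(self):
        """Send buffered lines in order. Return how many are still waiting."""
        while self._buffer:
            if not self._try_send(self._buffer[0]):
                break  # keep the backlog for the next flush
            self._buffer.popleft()
        return len(self._buffer)

    def _try_send(self, line):
        for attempt in range(self._max_attempts):
            try:
                self._send(line)
                return True
            except OSError:  # includes ConnectionError and socket timeouts
                if attempt + 1 < self._max_attempts:
                    # Exponential backoff between attempts.
                    self._sleep(self._base_delay * 2 ** attempt)
        return False
```

The value returned by `flush` is the current backlog, so it can be exported as a metric and alerted on. A disk-assisted queue in a real agent follows the same idea, with a size limit on the spool directory in place of the in-memory bound.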
Best practices for managing logs
- Check your IT department’s security requirements up front.
IT often gets veto power on deals, so clarify requirements before evaluating options.
- Store logs outside your data center.
You’ll need logs during outages and fires, so store them in a different availability zone or region.
- Compare the TCO of self-hosting, cloud-hosting, and SaaS.
Open source is not free. Consider storage, compute, bandwidth, operational, and hidden costs.
- Get input from users of the system before making a decision.
Let end users try out the options instead of making a top-down purchase decision.
- Remember that user experience problems are deal breakers.
If end users find a tool hard to use, they will avoid it, offload to experts, or force a switch.
- Test whether ingestion time is less than a few seconds.
Low latency is important for live monitoring and troubleshooting.
- Test search performance at full volume and query complexity.
Small-scale tests are not meaningful, so send realistic volumes and test real-world queries.
- When self- or cloud-hosting, optimize for query performance.
No one wants slow performance. Add hot/cold nodes and optimize shard size and indexes for speed.
- Automatically parse your logs at ingestion.
Parsing logs at search time is slower, and automatic rules save time over custom ones.
- Onboard users and integrate into workflow.
Availability != effective use. Users need to understand the tool and fit it into their workflow.
- Set up common searches, dashboards, and alerts for your team.
It’ll be easier to get buy-in from your team if they see a valuable pattern and can build on it.
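The ingestion-time check in the list above can be automated with a simple probe: send a uniquely tagged event, then poll the search interface until it shows up. The sketch below assumes two callables that you would supply yourself, `send` (ships one log line) and `is_searchable` (returns True once a query finds it). Neither corresponds to any particular vendor's API.

```python
import time
import uuid


def measure_ingestion_latency(send, is_searchable, timeout=30.0,
                              poll_interval=0.5, clock=time.monotonic,
                              sleep=time.sleep):
    """Return seconds until a marker event becomes searchable, or None on timeout."""
    # A unique marker so the probe never matches an older event.
    marker = f"latency-probe-{uuid.uuid4()}"
    start = clock()
    send(marker)
    while clock() - start < timeout:
        if is_searchable(marker):
            return clock() - start
        sleep(poll_interval)
    return None
```

The measured value is only as precise as `poll_interval`, so use a short interval when you are checking against a target of a few seconds. Run the probe while the system is under realistic load, since latency on an idle system says little about production behavior.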
A good log management solution is not just a place to store your logs; it provides an out-of-the-box framework for implementing many of these best practices. For example, Loggly offers set-up instructions that will help you create and transmit logs using best practices. You don’t need to reinvent the wheel. Just follow the instructions and ask for help if needed.
In addition, SaaS solutions like Loggly already include the costs of operations, scaling, performance, and fault tolerance. Keep in mind that some hosted solutions take away the effort of managing servers but still leave you responsible for performance, scaling, fault tolerance, and more.
More logs, more problems
With log management, complexity can increase at a faster pace than scale, not only for operations but also for end users like development teams, customer support, etc. Following these best practices will eliminate complexity where possible and make it easier to grow your operation.