Log files usually serve two purposes. First, they act as indicators of problems: certain log messages signal trouble, and there are tools that monitor log files and can notify system administrators when such messages appear. Second, once a problem is known to exist, whether from log file monitoring itself or from an external source, log files can help identify its cause, greatly speeding up the fix.
Analyzing log files and troubleshooting problems based on log messages is highly application- and situation-specific and an art in itself, but there are some general approaches that can apply in most situations.
Problems can appear in log files in many forms, and there is no way to identify them in all cases, but Python log messages often follow a common format that makes it easier to tell actual problems from harmless noise.
Log messages, even when sent through syslog, usually include the log level. This is a string like ERROR that indicates the severity of the event. For Python programs, actual problems carry the levels WARNING, ERROR, or CRITICAL; everything else usually does not indicate a problem. Looking for these strings is a good first check for problems.
For plain text log files, this is easily possible with egrep:

egrep '(WARNING|ERROR|CRITICAL)' /var/log/the-logfile

For entries in the systemd journal, you can filter the output the same way:

journalctl --since '-2 days' | egrep '(WARNING|ERROR|CRITICAL)'

Alternatively, journalctl can filter by priority directly:

journalctl --priority "warning..crit"
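If you want to post-process logs in a script, the same filter is easy to express in Python. This is a minimal sketch; the `problem_lines` helper and the sample data are illustrative, not part of any standard tooling:

```python
import re

# Levels that indicate actual problems in Python log output
PROBLEM_LEVELS = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")

def problem_lines(lines):
    """Return only the log lines whose level suggests a real problem."""
    return [line for line in lines if PROBLEM_LEVELS.search(line)]

sample = [
    "INFO:frobnicator.main:Starting up",
    "ERROR:frobnicator.main:Could not frobnicate",
    "DEBUG:frobnicator.main:Retrying",
]
print(problem_lines(sample))  # only the ERROR line remains
```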
One small caveat regarding the priority (also called the log level): a standard practice for new-style daemons is to simply write log messages to standard output and rely on systemd to add these lines to the journal. By default, this method of sending log messages does not preserve the log level, and the same is true for log messages emitted by Docker services. If you do not find the messages you are looking for when filtering by log level, also look in other log levels.
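If you control the daemon, the sd-daemon convention offers one way to keep the level: prefix each line written to standard output with `<N>`, where N is the syslog priority, which journald parses when level prefixes are enabled (the default for journal-connected streams). A minimal sketch of such a formatter; the `format_journal_line` name is an illustrative assumption:

```python
import sys

# syslog priorities used by the sd-daemon <N> prefix convention (subset)
PRIORITY = {"CRITICAL": 2, "ERROR": 3, "WARNING": 4, "INFO": 6, "DEBUG": 7}

def format_journal_line(level, message):
    """Prefix a line with <N> so journald can recover the priority."""
    return f"<{PRIORITY[level]}>{message}"

sys.stdout.write(format_journal_line("ERROR", "Could not frobnicate") + "\n")
```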
A typical setup for smaller hosts utilizes the
logcheck utility to regularly parse log files and send a digest email of any possible errors. The tool relies on a manually curated whitelist of log messages that can be ignored, which usually requires regular work to update. This approach works for a small number of systems, but due to the configuration work involved and using email as a communication channel, it does not scale well at all.
A more structured approach, used by error tracking services such as Sentry, does not use log files for problem notification at all; instead, programs notify an external service of problems that require human action. This has the advantage of strictly separating the two purposes of problem notification and regular status event emission, leaving log files focused on the latter and making both easier to handle.
The disadvantage here is that it requires application programs to support these services directly. If the program only writes log messages, this solution is not applicable.
Centralized log management solutions such as SolarWinds® Loggly® include features that can alert users of problems that appear in log messages. Alerting on typical errors in Python logs can easily be configured here.
For example, your application might be emitting log messages when it cannot access some back-end system, like an authentication system. You can configure most log management solutions to react to a pattern that matches such log messages, and send an alert about the situation.
Also, when not using an error tracker, these alerts can double as a limited replacement for one by reacting to typical contents of an exception and sending an alert. For Python, the string “Traceback (most recent call last)” is the most common signal of a traceback. Having the log management solution issue an alert when it sees this string makes it a very limited, but usable, replacement for an error tracker.
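The check such a rule performs amounts to a simple substring match; this sketch shows the idea (the `contains_traceback` helper is illustrative):

```python
TRACEBACK_MARKER = "Traceback (most recent call last)"

def contains_traceback(log_text):
    """Cheap check for a Python traceback in a chunk of log text."""
    return TRACEBACK_MARKER in log_text

print(contains_traceback("ERROR:app:boom\nTraceback (most recent call last):"))  # True
print(contains_traceback("INFO:app:all fine"))                                   # False
```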
When errors occur, it’s important to identify the exact root cause of the problem. Log messages ideally should provide all the information needed to do so, but sadly, they are frequently a bit tricky to read.
Locating the source of a problem starts with understanding the log messages. While there are usually a few common fields in Python log messages, the largest part of the log message is free form, which means it looks different from program to program.
A good log message should start with a free-form string that explains why the message occurred, but everything that can vary between occurrences of the same type of message should be separated out, making the individual message easier to understand. For example, when a user tries to access a resource they are not allowed to, the message should look like this:
ERROR:myapp.auth:Access denied; user="cookiemonster" resource="biscuitjar"
When scrolling through logs, this makes it easier to identify the type of log message (the “access denied” part) and then find the information relevant to this specific occurrence. It also translates easily into structured fields for a log management solution.
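With the standard logging module, such a message falls out of the default `%(levelname)s:%(name)s:%(message)s` layout when the variable parts are passed as formatting arguments:

```python
import logging

logging.basicConfig()  # default layout: %(levelname)s:%(name)s:%(message)s
log = logging.getLogger("myapp.auth")

user, resource = "cookiemonster", "biscuitjar"
log.error('Access denied; user="%s" resource="%s"', user, resource)
# stderr: ERROR:myapp.auth:Access denied; user="cookiemonster" resource="biscuitjar"
```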
Sadly, few programs follow such a convention.
Log messages usually contain at least three fields, as shown in the following default log format example:
ERROR:frobnicator.main:Could not frobnicate
This message consists of three fields. The first, ERROR, is the log level. The second, frobnicator.main, is the logger name, which by convention should be the name of the module in which the error occurred. This is often the first point of reference for where an error might come from: if the module is some PostgreSQL adapter, it's likely a database problem; if it's something inside the application, it's more likely a bug.
The third field, “Could not frobnicate”, is the free-form log message. How useful it is for locating the problem is entirely up to the programmer of the software at hand, and it's difficult to give general guidelines here.
Most problems in production are not caused by known and expected exceptional situations, which would hopefully result in log messages describing the problem well, but by unexpected, unhandled ones. In Python, such problems generate tracebacks in which the interpreter includes all the important information it can gather, which can sometimes make the traceback hard to read.
This traceback, from a small toy Django application, shows all the common aspects of a Python traceback. The error itself, usually at the end of a traceback but highlighted at the beginning here, is quite telling: “AnonymousUser object has no attribute is_stuff”. Knowing Django, this sounds like a typo: there is an is_staff attribute on the user model, but no is_stuff attribute by default.
The traceback itself includes the lines of all stack frames that were active when the error occurred. In general, many of these frames come from the framework involved. A good rule of thumb when reading a traceback: ignore all lines from files that do not belong to your own project. Django does this for us here, graying out all but one entry, since the rest are from Django and not our own application. That line tells us the error likely occurred in logexample/broken/views.py, at line 6, and indeed, this excerpt shows an is_stuff field reference. Changing it to is_staff will likely fix the problem!
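The failure mode is easy to reproduce outside Django with a stand-in class; this is a simplified assumption for illustration, not Django's real user model:

```python
class AnonymousUser:
    # Simplified stand-in: like Django's user model, it has is_staff...
    is_staff = False

user = AnonymousUser()
try:
    user.is_stuff  # ...but no is_stuff, so the typo raises
except AttributeError as exc:
    print(exc)  # e.g. 'AnonymousUser' object has no attribute 'is_stuff'

print(user.is_staff)  # the corrected attribute works
```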