Ultimate Guide to Logging

Your open-source resource for understanding, analyzing, and troubleshooting system logs

Troubleshooting With Java Logs

Extracting useful information from your log files is critical to the success of your Java application. Log data will give you valuable insight into your application’s performance, stability, and usability.

Analyzing log data might seem tedious, but it doesn’t have to be. There are a variety of tools for reading, parsing, and consolidating log data. Basic command-line tools such as grep, uniq, and sort can combine and extract useful information from log files. More advanced log parsers such as Logstash or Fluentd can extract key data from your logs into easily searchable tokens. Cloud-based logging services such as SolarWinds® Loggly® store your log data for you and offer sophisticated analysis capabilities, eliminating the need to maintain logs yourself.

This section explores some methods and tools for analyzing logs with the goal of improving your applications.

Finding the Most Common Exceptions

Finding the most common exceptions can help you pinpoint areas of poor performance in your Java application. Most logging frameworks record the type of exception, the exception message, and the method in which the exception occurred. Using Log4j, a log of exceptions will look similar to one of the following.

09:54:44.565 [main] ERROR Log4jTest.MathClass - java.lang.ArithmeticException: / by zero
10:00:10.157 [main] ERROR Log4jTest.Log4jTest - java.io.FileNotFoundException: myFile (No such file or directory)
10:00:10.157 [main] ERROR Log4jTest.Log4jTest - java.io.FileNotFoundException: newFile (No such file or directory)

If your output includes stack traces, see the Parsing Multiline Stack Traces section of this guide.

Let’s start with a simple example using the popular GNU tool grep. Then, we’ll show how a log management tool can make it even easier.

Finding Exceptions by Type Using Grep

The following Unix command will find exceptions, extract the exception type, and count the number of occurrences. grep is a popular command-line tool that performs pattern matching, while uniq and sort group and sort the results, respectively:

$ grep -o "\w*Exception" myLog.log | sort -r | uniq -c

In this example, the regular expression \w*Exception matches any word that ends with “Exception.” The -o flag tells grep to print only the parts of the output that match the search string. sort -r sorts the result in reverse order, while uniq -c groups the results and counts the number of occurrences. As a result, we end up with a count of exceptions by type.

2 FileNotFoundException
1 ArithmeticException

We can also use grep to search the log for each specific instance of these exceptions.

$ grep ArithmeticException myLog.log

09:54:44.565 [main] ERROR Log4jTest.Log4jTest - java.lang.ArithmeticException: / by zero

This returns each line containing an instance of the search string ArithmeticException, with the search string itself highlighted in the output.

Finding Exceptions by Class Using Grep

You can also search for exceptions by the class in which they occurred. For single-line log events, the following command groups the number of exceptions by class and by type. grep extracts the class name by searching for a word followed by a particular character (this example uses a hyphen, although any character can be used as long as it’s unique to the entry). While this character isn’t essential, it helps us locate the class name in the log event. With the OR operator (denoted by |), grep will also extract the name of the exception by searching for a word containing the string “Exception.”

sed is another command-line utility that can be used to format output from grep. In this example, sed removes the special character from our output as well as any new-line characters. The output is then piped to the uniq and sort commands to respectively group and sort the results.

$ grep -o "w* -|\w*Exception" myLog.log | sed 'N; s/ -n/ /' | sort -r | uniq -c

The result brings us closer to the key problem areas:

2 Log4jTest FileNotFoundException
1 MathClass ArithmeticException

With these results, we can use grep to find more details about problems occurring in specific classes. For example, the following command retrieves exceptions that occurred in the MathClass class.

$ grep -e "MathClass.*Exception" myLog.log

09:54:44.565 [main] ERROR Log4jtest.MathClass - java.lang.ArithmeticException: / by zero

Using a Log Management Solution

Most log management solutions offer ways to group and search log entries based on log type, message content, class, method, and thread. If your logs are already being parsed and stored, many solutions can graph and sort exceptions by the number of occurrences. This is a point and click operation in Loggly so you don’t need to memorize complicated grep commands. Loggly also indexes the log records, making searches and counts much faster than grep. For example, we can use the field explorer to see a list of exceptions and their frequency, then click on an exception to view all relevant logs.

Using the Loggly Field Explorer to quickly find logs by exception type.

Loggly also offers visualization tools that can be more informative than the text-based output from grep. Graphs like this can help you prioritize resources for development sprints or for patches following a new release.

Charting exceptions by frequency in Loggly.

Debugging Production Problems

When it comes to production problems, time is of the essence. A critical error in your application won’t just leave your users unhappy; it’ll also lower sales and reduce trust in your application or service.

In most cases, resolving a problem can be broken down into three key steps:

  1. Gather information about the problem
  2. Identify the cause of the problem
  3. Find a solution and prevent the problem from recurring

Gather Information About the Problem

The first step is to collect information about the problem. Gather as much information as possible — screenshots, crash reports, logs, links (for web services), etc.—to help narrow down potential causes. You’ll want the user who experienced the problem to provide detailed information about the event, including: when and where in the program the problem occurred, his or her actions leading up to the problem, the operating environment, and any strange or unusual behavior from the program before and after the problem occurred.

From there, you can start gathering log information. If your application is a hosted service, start retrieving logs from the web and application servers. If your application is a distributed software package, ask the user to include log data in his or her bug report. Alternatively, if your application sends log events to a centralized log server, then your logs will be immediately available.

Once you have a reasonable amount of data regarding the issue, you can begin tracking it down in code.

Identify the Cause of the Problem

After collecting information about the problem, the next step is to identify its cause. Reproducing a bug in a development environment is one of the easiest ways to validate its existence, but it can be time consuming and may not work in all cases. Having a thorough set of logs will bring you straight to the source of the problem, saving you time and frustration.

A bug report will give you a general idea of what the problem is and where it occurred. Using your log management tool of choice, you can narrow down your search to a smaller range of log entries by searching for a unique data token such as a username, session ID, or message text.

Let’s run through an example scenario to demonstrate how to debug a system. Imagine we have a web-based interface for remotely logging in to a website. The webpage has a login screen that performs basic authentication using a username and password. Users have been reporting they can’t log in to the website. The webpage accepts their input but then fails with a generic error.

A sample website reporting an error after a login attempt.

This message doesn’t give us much information other than a generic indication of the log’s severity. Searching through a log manager for entries with “Severe” in the level or message could result in hundreds of results with no guarantee any of them are related to the issue at hand. Fortunately, our Logger also recorded the username of the user who experienced the error, so we can filter on the “admin” username.

Viewing a Java exception in Loggly.

If we drill down, we see the cause of the error is a missing or invalid table. This is a serious problem, as it could indicate an accidental deletion or database corruption.

ERROR: Exception for user admin
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'test_schema.users' doesn't exist
...
com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1651)
  at TestApplication.Test.doPost(Test.java:33)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:646)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
  at...

Looking through the stack trace, we see the service failed on line 33 of Test.java. This line consists of a SQL statement that pulls data about a user from the test_schema.users table.

rs = stmt.executeQuery("SELECT * FROM test_schema.users WHERE username = " + username + “ AND Password = “ + password);

When we try to run this query in a SQL front end, we find the “test_schema.users” table doesn’t exist. However, the database does have a “test_schema.user” table. At some point, a developer might have mistakenly typed the wrong table name and pushed the revised form to production. Now we know what the issue is and where it appears in our code.

Using SolarWinds Loggly, you can debug and find the root cause of problems by tracing errors through your application stack, across multiple log events, and even pinpointing the relevant line of code in GitHub.

Resolve the Problem and Prevent It From Recurring

Now that we’ve identified the problem and its cause, the final step is to fix it. Our login example was an exaggerated case with a relatively easy solution, but you may come across more complicated bugs rooted in different areas of your application. Before jumping for the quick and dirty fix, carefully consider how your change will impact the application. Is this really the best solution for the problem? Is it possible your change will interfere with another component? Will this fix make it difficult to introduce new fixes or features further down the road? Will this fix also prevent similar issues from cropping up later?

For instance, what if two different users managed to create two separate accounts with the same username and password? You could enforce a unique username for all users, but how would that affect your database structure, your authentication code, and your existing user base? You could add a new unique field such as an email address and make it required, but what if some of your current users haven’t supplied email addresses? What if some email addresses are shared across multiple accounts? How will these new rules affect your registration, search, and administration pages?

You’ll also want to consider the delay between creating a fix and deploying it. In an enterprise environment, your code will need to be reviewed by other developers, integrated into the code base, built, tested by QA, staged, and maybe go through several more steps before it hits production. Your company may have specific protocols for critical or time-sensitive bugs that you need to follow. Once you’ve found a solution, include your resolution in your documentation, as well as any reasonable workarounds to the problem. This way, your customers can still use your product, and your support team can reduce the number of duplicate bug reports while the fix makes its way to production.

After deploying the fix, continue monitoring your production application to verify the issue was resolved. For example, after deploying the fix for the database issue in the above example, we should no longer see events with “Table ‘test_schema.users’ doesn’t exist”. If the errors continue or if we start seeing new errors, we know the fix didn’t work. Ideally, we’ll see a pattern like that shown in the following screenshot, where the number of errors drops to zero immediately after we deploy the patch.

Charting the number of logs containing an error in Loggly.

More Production Debugging Tools

While logging is the tried-and-true method of troubleshooting and debugging applications, these tools can help you gain more insight into how your application operates.

jdb

jdb, the Java Debugger, is a command-line utility for debugging Java classes. jdb is easy to use and comes bundled with the Java Development Kit. Using jdb creates a new Java Virtual Machine (JVM), allowing you to debug a class without affecting any running programs. You can also use jdb to debug running applications by adding the following parameters to your java command:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n

When the JVM starts, it assigns a port number for incoming jdb connections. You can then attach to the running JVM instance using jdb -attach:

$ jdb -attach

You can use this, for instance, to connect your debugger to a running production instance. Be careful with using breakpoints in this scenario, since it could pause an active thread. This can have consequences if, for example, a customer is using the application while you’re debugging it. For more information, see the Java documentation on jdb.

OverOps

OverOps is a suite of tools for monitoring applications, analyzing code, and detecting issues. Unlike logging software, which relies on output generated by a running application, OverOps connects directly to the Java Virtual Machine to map out the application’s code base, read variables, and record errors. This allows it to catch more errors and record more data than even the application’s logging framework. OverOps uses a software as a service (SaaS) model, where metrics are collected and stored on OverOps’ cloud servers. However, you can also deploy it on-premises. In either case, you can view your data using a web-based interface.

BTrace

BTrace is a tracing tool that lets you monitor all aspects of your application, from class names to errors. BTrace uses an aspect-oriented programming approach involving the use of annotations, which specify where and how BTrace monitors your application. For example, the following BTrace script monitors and logs each and every call to the javax.swing package.

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;
 
@BTrace public class AllMethods {
    @OnMethod(
        clazz="/javax\\.swing\\..*/",
        method="/.*/"
    )
    public static void m(@ProbeClassName String probeClass, @ProbeMethodName String probeMethod) {
        print(Strings.strcat("entered ", probeClass));
        println(Strings.strcat(".", probeMethod));
    }
}

For more information about BTrace, see the BTrace GitHub repository and Wiki.

Chronon

Chronon allows you to rewind and play back the entire execution flow of an application. It records each individual change made during an application’s lifetime, letting you reproduce the state of the application at any given time. Recordings are saved to file, making it easy to transfer the execution history from a production machine to a development machine for testing.

Chronon consists of the Chronon Recording Server, which lets you record Java applications remotely; Embedded Chronon, which embeds the recorder inside of an application; and the Time Travelling Debugger, which lets you replay recordings.

jhsdb

jhsdb (Java HotSpot Debugger) is a suite of tools for debugging, analyzing, and profiling the default JVM provided with both OpenJDK and the Oracle JDK. jhsdb lets you attach to running Java processes, take snapshots of stack traces, and even analyze crashed JVMs. You can use it to access the heap, code cache, garbage collection statistics, and more. To learn more, see the jhsdb documentation page.

Tracing Transactions

When a problem occurs, it’s important to know where the issue started and how it affected the rest of your application. This is difficult enough in a monolithic application, but becomes even harder in a distributed service-oriented architecture, where a single request may hit dozens of services. It’s not always obvious which service contains the root cause of the error, or how it affected other services. Tracing provides the data needed to follow the execution path of your application and drill down to the exact cause of a problem.

In the Debugging Production Problems section of the guide, we walked through an example where a user was having difficulty logging into a web application due to an invalid database table. In this section, we’ll show how tracing transactions played a central role in resolving the problem.

Tracing Unique IDs

In order to trace a sequence of log events, you need some way of uniquely identifying related logs. A multi-user environment could generate hundreds of identical logs, making it difficult to search based on timestamp or Logger. An easier solution is to include a unique identifier with related log entries. This identifier could be a username, a session ID, an API key, or a Universally Unique Identifier (UUID). Fortunately, ThreadContext is perfectly suited for this job.

In the Debugging Production Problems example, we had a web-based user interface that was hosted in a Tomcat servlet, which connected to a MySQL database. Users entered their credentials on the webpage, and after pressing submit, the servlet ran a query comparing their credentials against those stored in the database. If the user was successfully authenticated, then they were redirected to the main page. If an error occurred, details about the error were logged and the users were presented with a generic message.

We were able to debug this problem by including the users’ usernames in the log message. This allowed us to quickly search for log events related to the admin user. Building off this same example, we can use ThreadContext.put() to map a username to a Logger. When the users submit their credentials, the servlet enters the doPost() method, which adds the users’ usernames to the ThreadContext:

public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
ThreadContext.put(“username”, request.getParameter(“username”));
logger.info(“Entering doPost().”);
...
}

Using Log4j with the pattern %p [%X{username}]: %m%n results in the following entry:

INFO [admin]: Entering doPost().

We can add similar events for accessing the MySQL database, leaving the doPost method, and performing other actions. This way, if the user triggers an exception, we know exactly what the user was doing when the exception happened.

Tracing Method Calls

Many logging frameworks provide native methods for tracing an application’s execution path. These methods vary slightly between frameworks, but follow the same general format.

  • traceEntry() marks the beginning of a method.
  • traceExit() marks the end of a method. For methods that return an object, you can simultaneously return the object and log the event with return logger.exit(object).
  • throwing() marks an exception that is unlikely to be handled, such as a RuntimeException.
  • catching() marks an exception that is not going to be re-thrown.

You can find framework-specific information on these methods in the Logger documentation for Log4jLogback, and java.util.logging.

Tracing Methods in Log4j

As an example using Log4j, we’ll replace the Logger.info() methods in our servlet with tracing methods.

public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
  logger.entry();
  ...
  logger.exit();
}

We’ll also change the Appender’s PatternLayout to show the class name, method, and line number (the %class, %M, and %line conversion patterns).

<PatternLayout pattern="%p %class %M %line: %m%n" />

Tracing methods log events at the TRACE level, which means we’ll have to change the Logger’s level from DEBUG to TRACE. Otherwise, the log messages will be suppressed.

<Loggers>
  <Root level="trace">
    <AppenderRef ref="Console"/>
  </Root>
</Loggers>
 
TRACE DatabaseApplication.Login doPost 26: entry
...
TRACE DatabaseApplication.Login doPost 59: exit

Tracing methods also provide their own Markers. For example, the Logger.entry() and Logger.exit() methods display ENTER[ FLOW ] and EXIT[ FLOW ] respectively.

ENTER[ FLOW ] 14:47:41.074 [main] TRACE DatabaseApplication.Login - Enter
EXIT[ FLOW ] 14:47:41.251 [main] TRACE DatabaseApplication.Login - Exit

Tracing Methods in SLF4J

SLF4J users can take advantage of MDC to trace log events. Similar to Log4j, MDC values can be used with an Appender using the %X conversion pattern.

SLF4J also provides entry(), exit(), throwing(), and catching() methods through the XLogger (Extended Logger) class. You can create an XLogger instance using XLoggerFactory.getXLogger().

package DatabaseApplication;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
 
import org.slf4j.ext.XLogger;
import org.slf4j.ext.XLoggerFactory;
 
public class Login extends HTTPServlet {
  final static XLogger xlogger = XLoggerFactory.getXLogger(Login.class.getName());
  final static Logger logger = LoggerFactory.getLogger(Login.class.getName());
 
  @Override
  public void doPost(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
    xlogger.entry();
    ..
    xlogger.exit();
  }
}

Add the %class and %line conversion patterns to logback.xml.

<configuration>
  <appender name="Console" class="ch.qos.Logback.core.ConsoleAppender">
    <encoder>
      <pattern>%-5level %class{36} %M %L: %msg%xEx%n</pattern>
    </encoder>
  </appender>
  <root level="trace">
    ...
  </root>
</configuration>

The resulting log entries are identical to those created by Log4j.

Managing Memory Usage

Memory management is often overlooked in higher-level languages such as Java. While the average amount of memory in modern devices is increasing, high memory use can have a big impact on your application’s stability and performance. Once the Java Virtual Machine can no longer allocate memory from the operating system, your program could terminate and crash. Knowing how to manage memory will prevent problems as your application grows in size.

For example, imagine we wanted to store a set of numbers starting from one. A user supplies an upper limit and the program stores each integer from “1” to that limit. We’ll use a while loop with a counter to add each number to an array.

import java.util.Scanner;
...
System.out.print("Please enter the maximum size of the array: ");
Scanner scanner = new Scanner(System.in);
int limit = scanner.nextInt();
 
ArrayList intArray = new ArrayList();
 
int count = 1;
while (count <= limit) {
  intArray.add(count);
}

You might have noticed count doesn’t increment in the loop. This is a big problem; if the user enters any number greater than zero, the array will continue to grow until the JVM uses all of its available memory and the program crashes.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2245)
at java.util.Arrays.copyOf(Arrays.java:2219)
at java.util.ArrayList.grow(ArrayList.java:242)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
at java.util.ArrayList.add(ArrayList.java:440)
at MemoryTest.main(MemoryTest.java:10)

Garbage Collection

To reduce memory usage, the Java Virtual Machine performs a periodic cleanup process known as garbage collection. Garbage collection seeks out blocks of memory no longer in use and makes them available for reuse. The following resources explain Java’s garbage collection process in greater detail.

[delete]

The garbage collector generates diagnostic information that can be useful for performance profiling applications. You can log this information by passing -Xlog:gc to the JVM when starting your application. After each run, the garbage collector prints statistics about the garbage collection process in the Unified JVM Logging format. Here is an example.

[0.009s][info][gc] Using G1
[0.024s][info][gc] Periodic GC disabled
[0.313s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 7M->1M(64M) 8.450ms

Using G1 says which garbage collection method is being used. The Garbage-First (G1) collector is often enabled by default on multiprocessor computers with large amounts of RAM. Periodic GC disabled indicates that the garbage collection process won’t repeat. The third line tells us that this is an Evacuation Pause, where objects are copied between memory regions based on whether they are still in use. Pause Young tells us that the process cleaned the young generation, which is where new objects are allocated. As a result, total memory usage by objects in the young generation dropped from 7M to 1M out of 64M allocated, and the process took 8.450ms.

With the garbage collector printing diagnostic information to the console, we can see how the amount of allocated memory (the heap size) continues to grow over time. We’ll remove the CPU times for simplicity.

$ java -Xlog:gc MemoryLeakDemo

[0.006s][info][gc] Using G1
[0.027s][info][gc] Periodic GC disabled
[0.146s][info][gc] GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation) 25M->23M(64M) 4.372ms
[0.148s][info][gc] GC(1) Concurrent Cycle
[0.202s][info][gc] GC(1) Pause Remark 50M->34M(64M) 0.250ms
[0.272s][info][gc] GC(1) Pause Cleanup 94M->94M(124M) 0.057ms
[0.318s][info][gc] GC(1) Concurrent Cycle 169.695ms
[0.328s][info][gc] GC(2) Pause Young (Concurrent Start) (G1 Humongous Allocation) 94M->94M(124M) 3.738ms
[0.328s][info][gc] GC(3) Concurrent Cycle
...

Chart showing a gradual increase in JVM heap usage over time.

You can add time stamps to each entry with -XX:+PrintGCDateStamps and -XX:+PrintGCTimeStamps. You can also log the garbage collector’s output to a file by passing—Xlog:gc:file: to the JVM when starting your application.

If you monitor your log files with rsyslog, you can then forward the logs to a centralized logging system where they’ll be parsed and ready for near-real-time analysis. Here is what a timeline graph looks like in Loggly. It shows total heap size, as well as the heap size before and after the garbage collector runs.

Memory Leaks

A memory leak is a situation where a program allocates memory faster than it can release it. The easiest way to detect a memory leak is when your program becomes unresponsive, becomes unstable, or causes OutOfMemoryErrors. In Java, you’ll find more instances of Full GC collections as memory use increases.

You can learn more about the garbage collector’s output and pinpointing memory leaks in the Troubleshoot Memory Leaks chapter of the Java documentation.

Additional Resources

Cut Debugging Time Without Writing a Single Line of Code (Loggly)—Guide to using Takipi

The Ultimate Guide: 5 Methods for Debugging Production Servers at Scale (High Scalability)—Tools and techniques for debugging production problems

Memory Management

Garbage Collection Tuning Guide (Oracle)—Guide to understanding and fine-tuning Java’s garbage collection

Understanding Java Garbage Collection (CUBRID Blog)—Guide to garb