Logging: The Ultimate Guide

Your open-source resource for understanding, analyzing, and troubleshooting system logs

Curated by Loggly


Troubleshooting with Java Logs

Extracting useful information from your log files is critical to the success of your Java application. Log data will give you valuable insight into your application’s performance, stability, and usability.

Analyzing log data might seem tedious, but it doesn’t have to be. There are a variety of tools for reading, parsing, and consolidating log data. Basic command-line tools such as grep, uniq, and sort can combine and extract useful information from log files. More advanced software-based log analyzers such as Logstash or Fluentd can automatically extract key data from your logs and store it in a searchable database. Cloud-based logging services such as Loggly can be easier to use because they offer more sophisticated analysis capabilities and eliminate the need to host or maintain the service yourself.

This section explores some methods and tools for analyzing logs with the goal of improving your applications.

Finding the Most Common Exceptions

Finding the most common exceptions can help you pinpoint areas of poor performance in your Java application. Most logging frameworks record the type of exception, the exception message, and the method in which the exception occurred. Using Log4j, a log of exceptions will look similar to one of the following:
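
The class names and messages below are illustrative rather than output from a real application; the first form is a single-line event, the second includes a stack trace.

    ERROR Failed to divide the numbers: java.lang.ArithmeticException: / by zero

    ERROR Failed to divide the numbers.
    java.lang.ArithmeticException: / by zero
        at com.example.MathClass.divideNumbers(MathClass.java:33)
        ...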

If your output includes stack traces, see the Parsing Multi-line Stack Traces section of the guide.

Finding Exceptions by Type

The following Unix command will find exceptions, extract the exception type, and count the number of occurrences. grep is a popular command-line tool that performs pattern matching, while uniq and sort group and sort the results respectively:
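
Here myLog.log stands in for your own log file, and the initial sort simply groups identical matches so uniq -c can count them.

    $ grep -o "\w*Exception" myLog.log | sort | uniq -c | sort -r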

In this example, the regular expression \w*Exception matches any word that ends with “Exception.” The -o flag tells grep to print only the parts of the output that match the search string. sort -r sorts the result in reverse order, while uniq -c groups the results and counts the number of occurrences. As a result, we end up with a count of exceptions by type:
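
The counts and exception types shown here are illustrative.

          8 ArithmeticException
          3 NullPointerException
          1 IllegalStateException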

We can then use grep to search the log for instances of the most common exceptions:
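
Again, myLog.log stands in for your log file.

    $ grep "ArithmeticException" myLog.log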

This will return each line containing an instance of the search string ArithmeticException, with the search string itself highlighted in the output.

Finding Exceptions by Class

You can also search for exceptions by the class in which they occurred. For single-line log events, the following command groups the number of exceptions by class and by type. Grep extracts the class name by searching for a word followed by a particular character. (This example uses a hyphen, although any character can be used as long as it’s unique to the entry.) While this character is not essential, it helps us locate the class name in the log event. With the OR operator (denoted by |), grep will also extract the name of the exception by searching for a word that contains the string “Exception.”

sed is another command-line utility that can be used to format output from grep. In this example, sed removes the special character from our output as well as any new-line characters. The output is then piped to the uniq and sort commands to respectively group and sort the results.
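
Putting it together, a sketch of the full pipeline might look like this, assuming single-line events in which the class name is immediately followed by a hyphen (for example, “MathClass- Failed to divide the numbers java.lang.ArithmeticException: / by zero”):

    $ grep -o "\w*-\|\w*Exception" myLog.log | sed 'N;s/-\n/ /' | sort | uniq -c | sort -r

Because grep -o prints each match on its own line, sed’s N;s/-\n/ / joins each “ClassName-” line with the exception name that follows it before the results are grouped and counted.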

The result brings us closer to the key problem areas:
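
The counts are illustrative, and the class names other than MathClass are placeholders.

          8 MathClass ArithmeticException
          2 AuthController NullPointerException
          1 UserDao IllegalStateException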

With these results, we can use grep to find more details about problems occurring in specific classes. For example, the following command retrieves exceptions that occurred in the MathClass class:
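
As before, myLog.log stands in for your log file.

    $ grep "MathClass.*Exception" myLog.log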

Using a Log Management Solution

Most log management solutions offer ways to group and search log entries based on log type, message content, class, method, and thread. If your logs are already being parsed and stored, many solutions can graph and sort exceptions by the number of occurrences. Graphs like this can help you prioritize resources for development sprints or for patches following a new release.

[Image: exceptions graphed and sorted by number of occurrences in a log management tool]

Debugging Production Problems

When it comes to production problems, time is of the essence. A critical error in your application won’t just leave your users unhappy — it will also lower sales and reduce trust in your application or service.

In most cases, resolving a problem can be broken down into three key steps:

  1. Gather information about the problem.
  2. Identify the cause of the problem.
  3. Find a solution and prevent the problem from reoccurring.

Gather Information About the Problem

The first step is to collect information about the problem. Gather as much information as possible—screenshots, crash reports, logs, links (for web services), etc.—to help narrow down potential causes. You’ll want the user who experienced the problem to provide detailed information about the event, including: when and where in the program the problem occurred, his or her actions leading up to the problem, the operating environment, and any strange or unusual behavior from the program before and after the problem occurred.

From there, you can start gathering log information. If your application is a hosted service, start retrieving logs from the web and application servers. If your application is a distributed software package, ask the user to include log data in his or her bug report. Alternatively, if your application sends log events to a centralized log server, then your logs will be immediately available.

Once you have a reasonable amount of data regarding the issue, you can begin tracking it down in code.

Identify the Cause of the Problem

After collecting information about the problem, the next step is to identify its cause. Reproducing a bug in a development environment is one of the easiest ways to validate its existence, but it can be time consuming and may not work in all cases. Having a thorough set of logs will bring you straight to the source of the problem, saving you time and frustration.

The bug report will give you a general idea of what the problem is and where it occurred. Using your log management (or command-line) tool of choice, you can narrow down your search to a smaller range of log entries by searching for a unique data token such as a username, session ID, or message text.

Let’s run through an example scenario to demonstrate how to debug a system. Imagine we have a web-based interface for remotely logging in to a website. The web page has a login screen which performs basic authentication using a username and password. Users have been reporting that they can’t log in to the website. The web page accepts their input but then fails with a generic error.

[Image: login screen]

[Image: generic error message shown to the user]

This message doesn’t give us much information other than a generic indication of the log’s severity. Searching through a log manager for entries with “Severe” in the level or message could result in hundreds of results with no guarantee that any of them are related to the issue at hand. Fortunately, our Logger also recorded the username of the user who experienced the error, so we can filter on the “admin” username.

[Image: log events filtered on the “admin” username]

If we drill down, we see that the cause of the error is a missing or invalid table. This is a serious problem, as it could indicate an accidental deletion or database corruption:
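
The underlying event might look something like this; the exception class and stack frames are illustrative.

    ERROR [admin]: Table 'test_schema.users' doesn't exist
    java.sql.SQLSyntaxErrorException: Table 'test_schema.users' doesn't exist
        ...
        at Test.doPost(Test.java:33)
        ...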

Looking through the stack trace, we see that the service failed on line 33 of Test.java. This line consists of a SQL statement that pulls data about a user from the test_schema.users table:
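
A sketch of the relevant code (java.sql imports omitted; the variable names and surrounding method are assumptions):

    // Test.java, where line 33 executes the query below
    PreparedStatement statement = connection.prepareStatement(
            "SELECT * FROM test_schema.users WHERE username = ? AND password = ?");
    statement.setString(1, username);
    statement.setString(2, password);
    ResultSet results = statement.executeQuery();   // fails: the table is actually named test_schema.user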

When we try to run this query in a SQL front end, we find that the “test_schema.users” table doesn’t exist. However, the database does have a “test_schema.user” table. At some point, a developer mistyped the table name and pushed the change to production. Now we know what the issue is and where it appears in our code.

Here is a video showing how to debug and find the root cause of errors using Loggly. It shows how to trace a request through an application stack, find the line of the bug in GitHub, and report a bug using JIRA.

Resolve the Problem and Prevent It from Reoccurring

Now that we’ve identified the problem and its cause, the final step is to fix it. Our login example was an exaggerated case with a relatively easy solution, but you may come across more complicated bugs that are rooted in different areas of your application. Before jumping to a quick-and-dirty fix, carefully consider how your change will impact the application. Is this really the best solution to the problem? Is it possible that your change will interfere with another component? Will this fix make it difficult to introduce new fixes or features further down the road? Will this fix also prevent similar issues from cropping up later?

For instance, what if two different users managed to create two separate accounts with the same username and password? You could enforce a unique username for all users, but how would that affect your database structure, your authentication code, and your existing user base? You could add a new unique field such as an email address, but what if some of your users haven’t supplied email addresses? What if some email addresses are shared across multiple accounts? How will these new rules affect your registration, search, and administration pages?

You’ll also want to consider the delay between creating a fix and deploying it. In an enterprise environment, your code will need to be vetted by other developers and tested by QA before it hits production. Your company may have specific guidelines for critical or time-sensitive bugs. Once you’ve reached a solution, include your resolution in your documentation, as well as any reasonable workarounds to the problem. This way, your customers can still use your product, and your support team can reduce the number of duplicate bug reports while the fix makes its way to production.

It’s a good idea to monitor your patches or updates in production to verify that the issue is resolved. For example, after deploying the fix for the database name in the example above, we should see the events with “Table ‘test_schema.users’ doesn’t exist” stop occurring. If the errors continue or a new one pops up, we can see it and try again. Below is an example showing that the count went to zero when we deployed the patch about a minute ago.

[Image: count of “Table ‘test_schema.users’ doesn’t exist” events dropping to zero after the patch was deployed]

More Production Debugging Tools

While logging is the tried-and-true method of reviewing an application’s performance, these tools can help you gain more insight into your program.

jdb

jdb, the Java Debugger, is a command-line utility for debugging Java classes. jdb is easy to use and comes bundled with the Java Development Kit. Using jdb creates a new Java Virtual Machine, allowing you to debug a class without affecting any running programs. You can also allow the JVM to accept incoming jdb connections by starting your program with the following parameters:
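
A sketch using the standard JDWP agent; MyApp stands in for your main class, and the port in the output is whatever the JVM picks.

    $ java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n MyApp
    Listening for transport dt_socket at address: 59198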

The JVM will assign a port number for incoming jdb connections. You can then attach to the running JVM instance using the -attach option:
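
Use the port reported by the JVM in the previous step.

    $ jdb -attach 59198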

You can connect your debugger to a running production instance. Be careful using breakpoints, though, because they could halt a customer’s request. For more information, see the Java documentation on jdb.

Takipi

Takipi is a software suite that monitors your application for issues. Unlike logging software, which relies on output generated by the application, Takipi collects data directly from the Java Virtual Machine. This lets you read the values of variables, generate stack traces, and monitor errors that occur outside of the application’s logging framework. Takipi also provides a web-based interface for accessing Takipi instances on multiple machines.

BTrace

BTrace provides a scripting language that monitors your program’s classes. BTrace takes an aspect-oriented programming approach, using “probe points” to determine when to perform an action. For example, you could have probe points that increment when an object is created, or that record the time it takes to execute a method. For a comprehensive tutorial on BTrace, see Piotr Nowicki’s BTrace guide.
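
As a small illustration, the sketch below (using the classic com.sun.btrace package names; newer BTrace releases relocate these under org.openjdk.btrace) prints how long a hypothetical com.example.MathClass.divideNumbers() call takes each time it returns:

    import com.sun.btrace.annotations.*;
    import static com.sun.btrace.BTraceUtils.*;

    @BTrace
    public class MethodTimer {
        // Probe point: fires when the target method returns; @Duration supplies
        // the elapsed time in nanoseconds.
        @OnMethod(clazz = "com.example.MathClass", method = "divideNumbers",
                  location = @Location(Kind.RETURN))
        public static void onReturn(@Duration long nanos) {
            println(strcat("divideNumbers took (ns): ", str(nanos)));
        }
    }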

Chronon

Chronon records individual changes made during the execution of an application, allowing you to play back and rewind the path taken by the application. Recordings are stored in a single file, making it easy to transfer execution history from a production machine to a development machine for testing. Chronon consists of the Chronon Recording Server, which runs in a Java application; the Time Travelling Debugger, which replays Chronon recordings; and Embedded Chronon, which embeds the recorder inside of a Java application.

jstack

jstack is another JDK tool which prints Java stack traces for a specified process, file, or remote debug server. Rather than monitor or attach to an application, jstack takes a snapshot of the process and returns its current state. This is helpful for tracking problems as they occur, such as slow response times or high resource usage. Note that jstack is experimental and unsupported by Oracle.

Tracing Transactions

When a problem occurs, it’s helpful to know where the issue lies in your program’s execution path. This is even more difficult in a distributed service-oriented architecture, where a single request may hit dozens of services. It’s not always obvious which service contains the root cause of the error, or how it affected other services. Tracing provides enough data to follow the execution path of your application and drill down to the exact cause of a problem.

In the Debugging Production Problems section of the guide, we walked through an example where a user was having difficulty logging into a web application due to an invalid database table. In this section, we’ll show how tracing transactions played a central role in resolving the problem.

Tracing Unique IDs

In order to trace a sequence of log events, you need some way of uniquely identifying related logs. A multi-user environment could generate hundreds of identical logs, making it difficult to search based on time stamp or Logger. An easier solution is to include a unique identifier with related log entries. This identifier could be a username, a session ID, an API key, or a Universally Unique Identifier (UUID). Fortunately, NDC, MDC, and ThreadContext are perfectly suited for this job.

In the Debugging Production Problems example, we had a web-based database front end that was failing to authenticate users. The users’ credentials were passed to a Tomcat servlet, which pulled their information from a MySQL database. If the users were successfully authenticated, then they were redirected to the main page. If an error occurred, details about the error were logged and the users were presented with a generic message.

We were able to debug this problem by including the users’ usernames in the log message. This allowed us to quickly search for log events related to the admin user. Building off this same example, we can use ThreadContext.put() to attach the username to every log event generated on the current thread. When the users submit their credentials, the servlet enters the doPost() method, which adds the users’ usernames to the ThreadContext:
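
In the sketch below, the servlet class name and request parameter names are assumptions.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.apache.logging.log4j.ThreadContext;

    public class LoginServlet extends HttpServlet {
        private static final Logger logger = LogManager.getLogger(LoginServlet.class);

        @Override
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Attach the username to every log event generated on this thread
            ThreadContext.put("username", request.getParameter("username"));
            logger.info("Entering doPost().");
            // ... authenticate the user and build the response ...
        }
    }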

A Log4j pattern of %p [%X{username}]: %m%n results in the following entry:
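
Continuing the sketch above, the message text is simply whatever the application logged.

    INFO [admin]: Entering doPost().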

We can add similar events for opening the MySQL database and leaving the doPost method. When the user triggers an exception, we have a better idea of what the user was doing when the exception happened:
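
The messages below are illustrative.

    INFO [admin]: Entering doPost().
    INFO [admin]: Opening database connection.
    ERROR [admin]: Table 'test_schema.users' doesn't exist
    INFO [admin]: Leaving doPost().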

By tracing the admin user’s path through the application, we were able to determine that the database wasn’t properly configured.

Tracing Method Calls

Many logging frameworks provide native methods for tracing an application’s execution path. These methods vary slightly between frameworks, but follow the same general format:

  • entry() marks the beginning of a method.
  • exit() marks the end of a method. For methods that return an object, you can simultaneously return the object and log the event with return Logger.exit(object).
  • throwing() marks an exception that’s unlikely to be handled, such as a RuntimeException.
  • catching() marks a caught exception that’s not going to be re-thrown.

You can find framework-specific information on these methods in the Logger documentation for Log4j, Logback, and java.util.logging.

Tracing Methods in Log4j

As an example for Log4j, we’ll replace the Logger.info methods in our servlet with tracing methods:
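
A sketch revising the earlier servlet; the authenticate() helper is hypothetical, and java.sql.SQLException is assumed to be imported.

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        logger.entry();                      // replaces logger.info("Entering doPost().")
        ThreadContext.put("username", request.getParameter("username"));
        try {
            authenticate(request.getParameter("username"), request.getParameter("password"));
        } catch (SQLException ex) {
            logger.catching(ex);             // a caught exception that won't be re-thrown
        }
        logger.exit();                       // replaces logger.info("Leaving doPost().")
    }

    // Hypothetical helper that looks the user up in the database.
    private void authenticate(String username, String password) throws SQLException {
        // ... run the SELECT against test_schema.users ...
    }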

Using Log4j, we’ll change the Appender’s PatternLayout to show the class name, method, and line number (the %class, %M, and %line conversion patterns).
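
For instance, a pattern along these lines (the exact layout is a matter of preference) combines the level, the caller’s location, and the username from the ThreadContext:

    %p %class.%M:%line [%X{username}]: %m%n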

Tracing methods log events at the TRACE level, which means we’ll have to change the Logger’s level from DEBUG to TRACE. Otherwise, the log messages are suppressed.

Tracing methods also provide their own Markers. For example, the Logger.entry() and Logger.exit() methods display ENTRY[ FLOW ] and EXIT[ FLOW ] respectively.

Tracing Methods in SLF4J

SLF4J users can take advantage of MDC to trace log events. Similar to Log4j, MDC values can be used with an Appender through the %X conversion pattern.
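
A minimal sketch, reusing the request parameter from the earlier servlet example:

    import org.slf4j.MDC;

    // Attach the username to the current thread's diagnostic context
    MDC.put("username", request.getParameter("username"));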

SLF4J also provides entry(), exit(), throwing(), and catching() methods through the XLogger (Extended Logger) class. You can create an XLogger instance using XLoggerFactory.getXLogger():
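
The XLogger and XLoggerFactory classes live in the slf4j-ext library; the servlet class name is again an assumption.

    import org.slf4j.ext.XLogger;
    import org.slf4j.ext.XLoggerFactory;

    private static final XLogger logger = XLoggerFactory.getXLogger(LoginServlet.class);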

The resulting log entries are identical to those created by Log4j.

Memory Management

Memory management is often overlooked in higher-level languages such as Java. While the average amount of memory in modern devices is increasing, high memory use can have a big impact on your application’s stability and performance. Once the Java Virtual Machine can no longer allocate memory from the operating system, your program could terminate and crash. Knowing how to manage memory will prevent problems as your application grows in size.

For example, imagine we wanted to store a set of numbers starting from 1. A user supplies an upper limit and the program stores each integer from 1 to that limit. We could use a while loop with a counter to add each number to an array:
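
A sketch of such a program; a List stands in for the array, and the class and variable names are ours.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class NumberStore {
        public static void main(String[] args) {
            Scanner scanner = new Scanner(System.in);
            int limit = scanner.nextInt();      // upper limit supplied by the user

            List<Integer> numbers = new ArrayList<>();
            int count = 1;
            while (count <= limit) {
                numbers.add(count);
                // Note: count is never incremented here; this is the mistake
                // discussed below, and it makes the list grow without bound.
            }
        }
    }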

You might have noticed that count doesn’t increment in the loop. This is a big problem: If the user enters any number greater than 0, the array will continue to grow until the JVM uses all of its available memory and the program crashes.

Garbage Collection

To reduce memory usage, the Java Virtual Machine performs a periodic cleanup process known as garbage collection. Garbage collection seeks out blocks of memory that are no longer in use and makes them available for reuse. You can learn more about Java’s garbage collection process through Oracle’s Java Garbage Collection Basics guide, as well as CUBRID’s guide, Understanding Java Garbage Collection.

You can log diagnostic information from the garbage collector by passing -XX:+PrintGCDetails to the JVM. After each run, the garbage collector prints statistics about the garbage collection process. A log entry from the garbage collector looks similar to the following output:
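
With the throughput (parallel) collector, for example, entries look roughly like this; the sizes and timings are illustrative.

    [GC [PSYoungGen: 9216K->1024K(9216K)] 14096K->10820K(31744K), 0.0110720 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
    [Full GC [PSYoungGen: 1024K->0K(9216K)] [ParOldGen: 20911K->20130K(22528K)] 21935K->20130K(31744K) [PSPermGen: 2600K->2599K(21504K)], 0.0834890 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]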

The initial words (GC and Full GC) indicate the type of collection that was performed. Following the collection type is the effect of the garbage collection process on the size of the young generation (and the old generation if a full collection was performed). Following the generation statistics is the combined size of objects in memory before and after the garbage collection process. The number in parentheses is the total amount of memory allocated (referred to as the heap size) by the Java Virtual Machine, followed by the permanent generation and the time taken to perform the collection. Last is the total CPU time spent on the garbage collection process.

With the garbage collector printing diagnostic information to the console, we can see how the amount of allocated memory (the heap size) continues to grow over time. We’ll remove the CPU times for simplicity:
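
The numbers below are illustrative; note how the heap size in parentheses keeps growing.

    [GC [PSYoungGen: 8192K->1016K(9216K)] 8192K->5904K(31744K), 0.0093820 secs]
    [GC [PSYoungGen: 12280K->1024K(12288K)] 17168K->10912K(40960K), 0.0110720 secs]
    [GC [PSYoungGen: 15344K->1024K(15360K)] 25232K->16112K(52224K), 0.0128460 secs]
    [Full GC [PSYoungGen: 1024K->0K(15360K)] [ParOldGen: 15088K->13056K(57344K)] 16112K->13056K(72704K) [PSPermGen: 2600K->2599K(21504K)], 0.0834890 secs]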


You can add time stamps to each entry with -XX:+PrintGCDateStamps and -XX:+PrintGCTimeStamps. You can also log the garbage collector’s output to a file by passing -Xloggc:<filename> to the JVM when starting your application.
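
Putting the flags together (MyApp and gc.log are placeholders):

    $ java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log MyApp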

If you monitor your log files with rsyslog, you can then forward the logs to a centralized logging system where they’ll be parsed and ready for near-real-time analysis. Here is what a timeline graph looks like in Loggly. It shows total heap size, as well as the heap size before and after the garbage collector runs.

[Image: timeline graph in Loggly showing total heap size, plus heap size before and after garbage collection]

Memory Leaks

A memory leak is a situation where a program allocates memory faster than it can release it. The most obvious signs of a memory leak are a program that becomes unresponsive or unstable, or that throws OutOfMemoryErrors. In Java, you’ll also see Full GC collections occur more frequently as memory use increases.

You can learn more about the garbage collector’s output and pinpointing memory leaks in the article Java troubleshooting: out of memory errors by Teaspoon Consulting.

Additional Resources

Debugging Production Problems

Cut Debugging Time Without Writing a Single Line of Code (Loggly)—Guide to using Takipi

The Ultimate Guide: 5 Methods for Debugging Production Servers at Scale (High Scalability)—Tools and techniques for debugging production problems

Memory Management

Garbage Collection Tuning Guide (Oracle)—Guide to understanding and fine-tuning Java’s garbage collection

Understanding Java Garbage Collection (CUBRID Blog)—Guide to garbage collection in Java

Written & Contributed by

Andre

Tony

This guide will help software developers and system administrators become experts at using logs to better run their systems. This is a vendor-neutral, community effort featuring examples from a variety of solutions.
