Ultimate Guide to Logging

Your open-source resource for understanding, analyzing, and troubleshooting system logs

Troubleshooting with Node Logs

Now that your logs are being stored and you have the tools to analyze production-scale log volume, you’ll want to use them to troubleshoot production problems. With a number of tools and techniques at your disposal, it can be daunting to choose the best approach. We’ve chosen to focus on production logging, either to a file or streamed to a log management tool. Logs that float past on your console are critical during development, but we’ll assume that QA is catching the kinds of errors you’ll find there. In your production logs, you’ll want to keep an eye out for things that QA missed, or unexpected errors that arise in production. We’ll focus on Morgan and Winston as the frameworks used to generate and query your logs.

What Are the Most Common Errors?

Finding Exceptions by Type

Once your application is set up to log request information (see the “Request Logging with Morgan” section), you can query and analyze the data using Winston. The code sample below will find exceptions, extract the type, and count how many exceptions exist. All duplicate exceptions are grouped together. The information is output to the console using the built-in method console.log(). Start by modifying the foundation code within the “Winston” section of the “Analyzing Node Logs” chapter. Within options, set the limit parameter to 100, and set the fields parameter to [‘level’, ‘message’]Below the “Work with returned logs” comment, place the following code.

const  allExceptions = new Array();

// Loop through each log entry //
results.file.forEach(function(logMessage){
const  message = logMessage.message;
  // If log entry is an exception //
  if(logMessage.level === "error"
    && message.substr(0,17) === "uncaughtException"){
    const  exceptionMessage = message.substr(18).trim();
    // Add to array if not already added //
    if(allExceptions.indexOf(exceptionMessage) === -1){
    allExceptions.push(exceptionMessage);
    }
  }
});
 
// Log to the console the exception count and exception types //
const  exceptionCount = allExceptions.length;
allExceptions.forEach(function(currException){
  console.log(exceptionCount + " " + currException);
  exceptionCount--;
});

The query needs to return level and message information from the logs (set within the fields parameter of options). In this case, level is returned since we want to filter the returned logs to only those with a level of errormessage contains the error message. For exceptions, this contains the exception type. The queried log entries are filtered down to those that are exception logs, and each exception type is added to an array, which is finally outputted to the system console. As a result, we end up with a count of exceptions by type.

3 Transport already attached: file                            
2 first argument must be a string or Buffer      
1 stringToFloat is not defined

You can also search for exceptions of a specific type. The following returns all instances of the exception type “stringToFloat is not defined”, limited to 100 results within the last 24 hours. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Place the following code at the very beginning. Set exceptionSearch to the specific exception you are searching for.

const  exceptionSearch = "stringToFloat is not defined"; // Search for this exception

Within options, set the limit parameter to 100, and set the fields parameter to [‘stack’, ‘level’, ‘message’, ‘timestamp’]Below the “Work with returned logs” comment, place the following code.

results.file.forEach(function(logMessage){
  const  message = logMessage.message;
  // If log entry is an exception //
  if(logMessage.level === "error"
    && message.substr(0,17) === "uncaughtException"){
    const  exceptionMessage = message.substr(18).trim();
    // Log to the console if the exception matches the exception being searched for //
    if(exceptionMessage === exceptionSearch){
      console.log("[" + logMessage.timestamp + "] "" + logMessage.level + "" "" + logMessage.stack + """);
    }
  }
});

The query needs to return stack and timestamp information from the logs (set within the fields parameter of options). Stack contains useful debugging information regarding how the exception occurred. Timestamp contains the timestamp of when the exception occurred. The queried log entries are filtered down to those that are exception logs. Then, each exception type is compared with the type being searched for. If they match, the timestamp, error level, and error stack of the exception are outputted to the system console. In this example, information is outputted for each instance of the stringToFloat is not defined exception.

[2015-06-22T17:59:05.917Z] "error" "ReferenceError: stringToFloat is not defined,
    at Server.<anonymous> (/home/ubuntu/workspace/server.js:17:5),
    at Server.emit (events.js:98:17),
    at HTTPParser.parser.onIncoming (http.js:2113:12),
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:122:23),
    at Socket.socket.ondata (http.js:1971:22),
    at TCP.onread (net.js:528:27)"
[2015-06-22T17:58:40.178Z] "error" "ReferenceError: stringToFloat is not defined,
    at Server.<anonymous> (/home/ubuntu/workspace/server.js:17:5),
    at Server.emit (events.js:98:17),
    at HTTPParser.parser.onIncoming (http.js:2113:12),
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:122:23),
    at Socket.socket.ondata (http.js:1971:22),
    at TCP.onread (net.js:528:27)"

Finding Exceptions by Module

You can also search for exceptions by the module in which they occurred. The following will find the exceptions in each specific module of your application, extract the exception type, and count how many exceptions exist. All duplicate exceptions within a specific module are grouped together. The information is outputted to the console using the built-in method console.log(). Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 100, and set the fields parameter to [‘level’, ‘message’, ‘process’]Below the “Work with returned logs” comment, place the following code.

const  allExceptions = new Array();
 
// Loop through each log entry //
results.file.forEach(function(logMessage){
  const  message = logMessage.message;
 
  // If log entry is an exception //
  if(logMessage.level === "error"
    && message.substr(0,17) === "uncaughtException"){
    const  exceptionModule = logMessage.process.argv[1];
    const  exceptionMessage = message.substr(18).trim();
 
    // Check if this specific exception for this specific module has already been added //
    const  exceptionIndex = allExceptions.length;
    while(exceptionIndex--) {
      if(allExceptions[exceptionIndex].module === exceptionModule
        && allExceptions[exceptionIndex].exception === exceptionMessage) break;
    }
 
    // If not added yet, add it now //
    if(exceptionIndex === -1){
      allExceptions.push({'module' : exceptionModule, 'exception' : exceptionMessage});
    }
  }
});
 
// Log to the console the exception count and exception types //
const  exceptionCount = allExceptions.length;
allExceptions.forEach(function(currException){
  console.log(exceptionCount + " " + currException.module + " " + currException.exception);
  exceptionCount--;
});

As in the previous examples, Winston will query up to 100 log entries from the last 24 hours. The query needs to return process information from the logs (set within the fields parameter of options) since this contains the specific module in which the exception occurred. The queried log entries are filtered down to those that are exception logs, and each exception type for each module is added to an array, which is finally outputted to the system console. As a result, we end up with a count of exceptions by type, grouped together by module.

3 /home/ubuntu/workspace/calculate.js stringToFloat is not defined<
2 /home/ubuntu/workspace/server.js Object #<Object> has no method 'start'
1 /home/ubuntu/workspace/server.js first argument must be a string or Buffer

You can also list the exceptions within a specific module. The following returns all exceptions within the module being searched, limited to 100 results within the last 24 hours. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Place the following code at the very beginning. Set exceptionModuleSearch to the specific exception you are searching for.

const  exceptionModuleSearch = "/home/ubuntu/workspace/server.js"; // Search for exceptions in this module

Within options, set the limit parameter to 100, and set the fields parameter to [‘stack’, ‘level’, ‘message’, ‘timestamp’, ‘process’]Below the “Work with returned logs” comment, place the following code.

results.file.forEach(function(logMessage){
  const  message = logMessage.message;
  // If log entry is an exception //
  if(logMessage.level === "error"
    && message.substr(0,17) === "uncaughtException"){
    const  exceptionModule = logMessage.process.argv[1];
    const  exceptionMessage = message.substr(18).trim();
    // Log to the console if the exception is inside the module being searched //
    if(exceptionModule === exceptionModuleSearch){
      console.log("[" + logMessage.timestamp + "] "" + logMessage.level + "" "" + logMessage.stack + """);
    }
  }
});

The queried log entries are filtered down to those that are exception logs. Then, the module in which each exception occurred is compared with the module being searched. If they match, the timestamp, error level, and error stack of the exception are outputted to the system console. In this example, information is output for each exception within the /home/ubuntu/workspace/server.js module.

[2015-06-22T20:55:11.838Z] "error" "Error: Transport already attached: file,    
    at Logger.add (/home/ubuntu/workspace/node_modules/winston/lib/winston/logger.js:454:11),
    at Object.winston.(anonymous function) [as add] (/home/ubuntu/workspace/node_modules/winston/lib/winston.js:87:34),
    at Server.<anonymous> (/home/ubuntu/workspace/server.js:18:9),
    at Server.emit (events.js:98:17),
    at HTTPParser.parser.onIncoming (http.js:2113:12),
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:122:23),
    at Socket.socket.ondata (http.js:1971:22),
    at TCP.onread (net.js:528:27)"
[2015-06-22T20:54:46.731Z] "error" "Error: Transport already attached: file,
    at Logger.add (/home/ubuntu/workspace/node_modules/winston/lib/winston/logger.js:454:11),
    at Object.winston.(anonymous function) [as add] (/home/ubuntu/workspace/node_modules/winston/lib/winston.js:87:34),
    at Server.<anonymous> (/home/ubuntu/workspace/server.js:18:9),
    at Server.emit (events.js:98:17),
    at HTTPParser.parser.onIncoming (http.js:2113:12),
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:122:23),
    at Socket.socket.ondata (http.js:1971:22),
    at TCP.onread (net.js:528:27)"

Using a Log Management System

A log management system like SolarWinds® Loggly® allows you to visually sort your uncaught exceptions by type and module. Exceptions are sorted by the number of occurrences. Pie charts, bar graphs, and other tools can be used to sort the filtered uncaught exceptions. Below, you can see an example of a log management tool that has automatically summarized the most common exception messages. You can see that “uncaughtException:stringToFloat is not defined” has occurred 1402 times. It can generate this summary automatically without requiring you to write any special code. This helps you understand key issues faster when there is a critical issue.

What’s Causing 404s?

Let’s examine the various 404 errors in the logs in your project, looking at which files/routes are not found and grouping them by source.

By examining your project’s structure, we can search and analyze based on the following sources:

  • Back-end routes (js/express)
  • Front-end requests (Angularjs)
  • File system (Apache/nginx)
  • External links and hits (bad inbound links)
  • False negatives (status code bugs)

We’ll use some basic examples to establish the concepts involved. Let’s say you have a back-end framework like Express.js that routes calls to /api/* (where * represents any number of API endpoints). AngularJS, on the front end, shall make its own calls, which may be logged by a custom AngularJS module along with a package such as stacktrace.js (see this article for more information and examples).

We may also assume inbound requests, whether by malformed external links or by URL hackers poking around at the project, as well as a file system managed by Apache and/or Nginx.

Back-end Routes

Let’s say you have Node serving routes under the /api/ URI; we know that 404
errors with the /api/ path included have been inaccurately linked in from other areas of our project (back-end or front-end) or, under rare circumstances, some external link or hit is inaccurately deep-linking into the project.

Let’s examine an error, such as a manual request for http://localhost:3000/error (where /error is a non-existing route). With Winston logging the errors, the request will produce a log output on your console similar to this.

{
  "req": {
    "url": "/error",
    "headers": {
      "host": "localhost:3000",
      "pragma": "no-cache",
      "cache-control": "no-cache",
      "accept": 
"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) 
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
    "accept-encoding": "gzip, deflate, sdch",
    "accept-language": "en-US,en;q=0.8",
        "cookie": "intended=%2Fadmin; _ga=GA1.1.2062547573.1418854206; 
connect.sid=s%3ALjCu3hnT7pFkD1ovdvi1cNJtGg9_Sb7T.1uDruiPULM%2BaiSZ4pVqfVm4X6FyCdPdoWRnvdA7tX1k"
    },
    "method": "GET",
    "httpVersion": "1.1",
    "originalUrl": "/error",
    "query": {}
  },
  "res": {
    "statusCode": 404
  },
  "responseTime": 107,
  "level": "info",    
      "message": "u001b[90mGET /erroru001b[39m u001b[33m404u001b[39m u001b[90m107msu001b[39m"
}

You can see the req.res.statusCode is 404, as expected. Let’s focus on req.url, req.headers.host, req.method, req.originalUrl, and req.res.statusCode. For example, if you’re using /api/ for your back-end routes, you can see and group requests where req.originalUrl contains /api/ to examine those failures. Using this kind of information from your log files, you can count, group, and sort 404 errors to gain insights.

Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘res’, ‘url’]Below the “Work with returned logs” comment, place the following code, which returns a simple count of 404 errors in the logs.

const  totalCount = 0;
results.file.forEach( function(logMessage){ 
 
  if( typeof logMessage.res.statusCode === "undefined" ){
    return;
  }
 
  if( logMessage.res.statusCode == "404" ){
    totalCount++;
  }
});
 
// Log to the console the count of 404 errors found:
if( totalCount > 0 ){
  console.log("404 Count: " + totalCount);
}else{
  console.log("No 404 errors found.");
}

Front-end Requests

AngularJS may request a URI which does not exist if there is a typo, outdated link, or erroneous reference.

If your project logs these requests, you can group the resulting data and gain insight into problem areas that need attention. Was Angular asking for an API route from Node that doesn’t exist or was moved? Node will log that. Is there a bad file request Angular asked for? Nginx (or whatever you’re using as your server) will have a log entry for that. The status codes returned by these requests can be logged with a custom AngularJS module that sends off the stacktrace to the back end for writing to disk. Because these kinds of errors are mainly problems in your codebase, a good QA/testing regimen will catch these and will not be the most common types of 404 in your logs.

Web Server Logs

If you’re serving your Node.js application behind a web server such as Apache or Nginx, those logs will contain their own 404 entries for static files. To get a complete picture of what is and is not being found properly on your server, be sure to examine these logs, as well as the ones generated by other components in your infrastructure.

External Links and Hits

External links and/or hits are not something that can be controlled, but we can choose what to do when they are found to be hitting the project. Perhaps an external website has a malformed or simply bogus link pointing to our site. Perhaps a URL hacker is just poking around at the server.

Or perhaps the server’s experiencing a systematic crawl by a script kiddie, not even necessarily targeted at your application. These are all cases that can be tracked with intelligent planning.

False Negatives

If your Express routes are reporting a 404 status erroneously, you may have log entries that incorrectly report failure to find a route. Doing a code review/audit can uncover such bugs and restore proper project behavior.

Site Loading Too Slowly?

Many users won’t tolerate even a minor slowdown; they’ll get frustrated and complain, or stop using your service. You should continuously monitor response times to make sure your server is running fast and that its performance is consistent over time. You can do this by tracking the response time captured in your logs stored by Morgan (see the “Request Logging with Morgan” section). If you have a REST API, the same approach can be used to track the performance of API calls, in terms of both speed and consistency. This can help you track and meet SLAs for internal teams or apps that depend on your service.

Analyzing Site Loading Speed With Winston

The following will list the number of requests per IP address and output the average response time to the console. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘remote_addr’, ‘response_time’, ‘url’]Below the “Work with returned logs” comment, place the following code.

results.file.forEach(function(logMessage){
  if(typeof logMessage.remote_addr === "undefined" || typeof logMessage.response_time === "undefined"){
    return;
  }
  // Add the response time to the running total //
  totalTime += (+parseFloat(logMessage.response_time).toFixed(3));
 
  // Check if this specific IP address has already been added. If so, increase the count by one //
  const  exceptionIndex = allExceptions.length;
  while(exceptionIndex--) {
    if(allExceptions[exceptionIndex].remote_addr === logMessage.remote_addr){
      // Increase the request count for this IP address
      allExceptions[exceptionIndex].count++;
      break;
    }
  }
 
  // If not added yet, add it now //
  if(exceptionIndex === -1){
    allExceptions.push({remote_addr: logMessage.remote_addr, count: 1});
  }
});
 
// Sort by count //
allExceptions.sort(function (a, b) {
  if (a.count < b.count) {
    return 1;
  }
  if (a.count > b.count) {
    return -1;
  }
  return 0;
});
 
// Log to the console the access count for each IP address //
const  totalCount = 0;
allExceptions.forEach(function(currException){
  totalCount += currException.count;
  console.log(currException.count + " " + currException.remote_addr);
});
 
// Log to the console the average response time //
if(totalCount > 0){
  const  avgTime = parseFloat(totalTime / totalCount).toFixed(3);
  console.log("Average: " + avgTime + " ms");
}else{
  console.log("No requests found.");
}

In this example, Winston will query up to 10,000 log entries from the last 24 hours. If you expect to have more than 10,000 log entries per day, increase the limit parameter in options. The query returns the following information from the logs: remote_addr (used to list IP addresses), response_time (used to find average response time), and url (used in the next example). These fields are set within the fields’ parameter of options. (See “Logging Source Information” for more details.) The request count for each IP address is added to an array, which is finally outputted to the system console. The average response time is then calculated and outputted. In this example, the average time is 0.390ms.

4202 10.185.248.71                                           
848 8.35.0.0                                                  
290 107.20.0.0
53 206.221.0.0
24 24.19.0.0
Average: 0.390 ms

You can also find the average response time for a specific URL. This requires two simple modifications to the preceding code. First, place the following code at the very beginning. Set searchPageURL to the URL to find the average page response time.

const  searchPageURL = "/about"; // The page to find average response time for

Then, change the line that reads:

if(typeof logMessage.remote_addr === "undefined" || typeof logMessage.response_time === "undefined"){

to the following:

if(typeof logMessage.remote_addr === "undefined" || typeof logMessage.response_time === "undefined" || typeof logMessage.url === "undefined" || logMessage.url !== searchPageURL){

The number of requests per IP address and average response time for the URL /about are outputted to the console, in the same format as in the preceding example.

Using a Log Management Solution

It’s helpful to visualize performance as a time series. With many log management systems, you can set this up as a time series chart, which can display statistics like average and maximum over any time window. This helps you see if the issue was a temporary spike. In the example below using SolarWinds Loggly, you can see the maximum time in dark blue and the average time in green. Other options for splitting the chart include by host to see if one host is trending slower than others, or by URI to show your slowest pages or API calls. To send request logs to Loggly, see “Express.js” in the “Node Logging Basics” subsection within the “Basics and Centralizing” section of this documentation.

If you have an internal SLA or a maximum response time you want your responses to be served in, you can use a numeric range search to find responses that are over that threshold. You can build a regular expression to find these in grep, or, if you are using a log management system, you can search on it directly. For example, here is what the search would look like in Loggly:

json.responseTime:>500

Too Much Load From One Source?

When your site is under a heavy load, you should know whether the load is from real users or something else, such as:

  • A configuration or system problem
  • A client app or bot hitting your site too fast
  • A denial of service attack

If a specific IP address sends an unusually large number of requests, it could be an attacker or a problematic client. If you don’t want to allow this type of use, you might want to consider blocking or rate limiting this particular IP address.

Analyzing Source Information With Winston

Once your application is set up to log request information (see the “Request Logging with Morgan” section), you can query and analyze the data using Winston. The following will list the number of requests per IP address. The information is outputted to the console. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘remote_addr’]Below the “Work with returned logs” comment, place the following code.

const  allExceptions = new Array();
 
// Loop through each log entry //
results.file.forEach(function(logMessage){
  if(typeof logMessage.remote_addr === "undefined"){
    return;
  }
 
  // Check if this specific IP address has already been added. If so, increase the count by one //
  const  exceptionIndex = allExceptions.length;
  while(exceptionIndex--) {
    if(allExceptions[exceptionIndex].remote_addr === logMessage.remote_addr){
       // Increase the request count for this IP address
      allExceptions[exceptionIndex].count++;
      break;
    }
  }
 
  // If not added yet, add it now //
  if(exceptionIndex === -1){
    allExceptions.push({remote_addr: logMessage.remote_addr, count: 1});
  }
});
 
// Sort by count //
allExceptions.sort(function (a, b) {
  if (a.count < b.count) {
    return 1;
  }
  if (a.count > b.count) {
    return -1;
  }
  return 0;
});
 
// Log to the console the access count for each IP address //
allExceptions.forEach(function(currException){
  console.log(currException.count + " " + currException.remote_addr);
});

In this example, Winston will query up to 10,000 log entries from the last 24 hours. If you expect to have more than 10,000 log entries per day, increase the limit parameter in options. The query returns remote_addr information from the logs (set within the fields parameter of options). See “Logging Source Information” for more details. The request count for each IP address is added to an array, which is finally outputted to the system console. Here, we see that the client at 10.185.248.71 accounts for over ¾ of all requests.

4202 10.185.248.71
848 8.35.0.0
290 107.20.0.0
53 206.221.0.0
24 24.19.0.0

Analyzing Sources With Loggly

Loggly visualizes your application’s traffic sources with a pie chart, simplifying the process of discovering sources causing too much load. Here you can see one IP is generating more than half of the site’s traffic, which is unusual.

Pie chart of requests according to client IP address as shown in the Loggly dynamic field explorer and as a pie chart. © 2019 SolarWinds, Inc. All rights reserved.

Unusual Traffic Patterns?

What are the various kinds of unusual traffic patterns we might expect to find?

  • Spikes (why did traffic increase suddenly?)
  • Dips (why did traffic drop off?)
  • Gaps (are there gaps in the log files? If so, why?)
  • Shifts in traffic origin (suddenly everyone in Arizona is hitting the site?)
  • Focused traffic (one page, one file?)
  • Changes in user interaction (are users spending more/less time on the site?)
  • Response code shifts? (sudden blast of 404s? 500s?)

Traffic Spikes

A spike in traffic may indicate a successful media campaign or any number of other reasons. Why did traffic increase suddenly? Spikes in traffic may be from a single source or multiple sources. This may indicate an attempt at gaining illicit access to your server, a DDOS attack, or healthy traffic from one or more sources indicative of an increase in positive attention to your content. The source and nature of the traffic can tell you a lot about what’s happening. Things to consider when traffic spikes include bandwidth usage, server load, and the need to scale up. If you’re on multiple servers, it may be time to spin up another node. See the “Too Much Load from One Source” section for an example.

Traffic Dips

Why did traffic drop off? Is there a deeper cause, such as a broken area of the site driving active traffic away? Be sure to check whether there are clues such as traffic drops from specific users or regions. Use critical thinking to determine whether a new UX/UI rollout is frustrating users and driving them away, or perhaps it is simply a browser rendering problem; users may not be able to load a page. Did IE users just drop off? Did the CDN go down?"

Traffic Gaps

Are there gaps in the log files? If so, why? Perhaps the servers went down for a while. Perhaps there’s a problem with your logging system, causing gaps when certain potentially buggy areas of your logging code do not write their otherwise-viable logs to disk.

Shifts in Traffic Origin

Have your traffic patterns changed recently? Are you suddenly seeing traffic from Arizona? Are all of your users being referred from one specific site like Reddit or Slashdot? A shift in traffic origin can tell you a lot about your audience and what’s going on around you. See the “Too Much Load from One Source?” section for examples.

Focused Traffic

Is one area, one page, or one file being repeatedly requested all of a sudden? Perhaps an article just went viral. Perhaps another site has externally linked to an image on your site.

Response Code Shifts

Is there a sudden blast of 404s? 500s? The impact on traffic can be immense, both from lost interest from users and because of reduced attention from inbound links due to problems in the site. Keeping an eye on spikes in your error codes can help you catch problems before they get out of hand. See the “What’s Causing 404s?” section for an example of counting your 404s.