LoggingThe Ultimate Guide

your open-source resource for understanding, analyzing, and troubleshooting system logs

curated byloggly

0

Troubleshooting with Node Logs

Now that your logs are being stored and you have the tools to analyze production-scale log volume, you’ll want to use them to troubleshoot production problems. With a number of tools and techniques at your disposal, it can be daunting to choose the best approach. We’ve chosen to focus on production logging, either to a file or streamed to a log management tool. Logs that float past on your console are critical during development, but we’ll assume that QA is catching the kinds of errors you’ll find there. In your production logs, you’ll want to keep an eye out for things that QA missed, or unexpected errors that arise in production. We’ll focus on Morgan and Winston as the packages used to generate and query your logs.

What Are the Most Common Errors?

Finding Exceptions by Type

Once your application is set up to log request information (see the “Request Logging with Morgan” section), you can query and analyze the data using Winston. The code sample below will find exceptions, extract the type, and count how many exceptions exist. All duplicate exceptions are grouped together. The information is output to the console using the built-in method console.log(). Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 100, and set the fields parameter to [‘level’, ‘message’]. Below the “Work with returned logs” comment, place the following code.

The query needs to return level and message information from the logs (set within the fields parameter of options). In this case, level is returned since we want to filter the returned logs to only those with a level of error. message contains the error message. For exceptions, this contains the exception type. The queried log entries are filtered down to those that are exception logs, and each exception type is added to an array, which is finally outputted to the system console. As a result, we end up with a count of exceptions by type.

You can also search for exceptions of a specific type. The following returns all instances of a specific exception type, limited to 100 results within the last 24 hours. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Place the following code at the very beginning. Set exceptionSearch to the specific exception you are searching for.

Within options, set the limit parameter to 100, and set the fields parameter to [‘stack’, ‘level’, ‘message’, ‘timestamp’]. Below the “Work with returned logs” comment, place the following code.

The query needs to return stack and timestamp information from the logs (set within the fields parameter of options). stack contains useful debugging information regarding how the exception occurred. timestamp contains the timestamp of when the exception occurred. The queried log entries are filtered down to those that are exception logs. Then, each exception type is compared with the type being searched for. If they match, the timestamp, error level, and error stack of the exception are outputted to the system console. In this example, information is outputted for each instance of the stringToFloat is not defined exception.

Finding Exceptions by Module

You can also search for exceptions by the module in which they occurred. The following will find the exceptions in each specific module of your application, extract the exception type, and count how many exceptions exist. All duplicate exceptions within a specific module are grouped together. The information is outputted to the console using the built-in method console.log(). Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 100, and set the fields parameter to [‘level’, ‘message’, ‘process’]. Below the “Work with returned logs” comment, place the following code.

As in the previous examples, Winston will query up to 100 log entries from the last 24 hours. The query needs to return process information from the logs (set within the fields parameter of options) since this contains the specific module in which the exception occurred. The queried log entries are filtered down to those that are exception logs, and each exception type for each module is added to an array, which is finally outputted to the system console. As a result, we end up with a count of exceptions by type, grouped together by module.

You can also list the exceptions within a specific module. The following returns all exceptions within the module being searched, limited to 100 results within the last 24 hours. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Place the following code at the very beginning. Set exceptionModuleSearch to the specific exception you are searching for.

Within options, set the limit parameter to 100, and set the fields parameter to [‘stack’, ‘level’, ‘message’, ‘timestamp’, ‘process’]. Below the “Work with returned logs” comment, place the following code.

The queried log entries are filtered down to those that are exception logs. Then, the module in which each exception occurred is compared with the module being searched. If they match, the timestamp, error level, and error stack of the exception are outputted to the system console. In this example, information is output for each exception within the /home/ubuntu/workspace/server.js module.

Using a Log Management System

A log management system allows you to visually sort your uncaught exceptions by type and module. Exceptions are sorted by the number of occurrences. Pie charts, bar graphs, and other tools can be used to sort the filtered uncaught exceptions. Below, you can see an example of a log management tool that has automatically summarized the most common exception messages. You can see that “uncaughtException:stringToFloat is not defined” has occurred 1402 times. It can generate this summary automatically without requiring you to write any special code. This helps you understand key issues faster when there is a critical issue.

sortByMessageCropped2

What’s Causing 404s?

Let’s examine the various 404 errors in the logs in your project, looking at which files/routes are not found and grouping them by source.

image10

By examining your project’s structure, we can search and analyze based on the following sources:

  • Back-end routes (Node.js/Express)
  • Front-end requests (AngularJS)
  • File system (Apache/Nginx)
  • External links and hits (Bad inbound links)
  • False negatives (Status code bugs)

We’ll use some basic examples to establish the concepts involved. Given a project in which Node.js/Express/Whatever-you-use-for-Node-routing is handling back-end API calls, let’s say those calls are being sent to /api/* (where * represents any number of API endpoints). AngularJS, on the front end, shall make its own calls, which may be logged by a custom AngularJS module along with a package such as stacktrace.js (see this article for more information and examples).

We may also assume inbound requests, whether by malformed external links or by URL hackers poking around at the project, as well as a file system managed by Apache and/or Nginx.

Back-End Routes

Let’s say you have Node serving routes under the /api/ URI; we know that 404
errors with the /api/ path included have been inaccurately linked in from other areas of our project (back end or front) or, under rare circumstances, some external link or hit is inaccurately deep-linking into the project.

Let’s examine an error, such as a manual request for http://localhost:3000/error (where /error is a non-existing route). With Winston logging the errors, the request will produce a log output on your console similar to this:

You can see that the req.res.statusCode is 404, as expected. Let’s focus on req.url, req.headers.host, req.method, req.originalUrl, and req.res.statusCode. For example, if you’re using /api/ for your back-end routes, you can see and group requests where req.originalUrl contains /api/ to examine those failures. Using this kind of information from your log files, you can count, group, and sort 404 errors to gain insights.

Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘res’, ‘url’]. Below the “Work with returned logs” comment, place the following code, which returns a simple count of 404 errors in the logs.

Front-End Requests

AngularJS may request a URI which does not exist if there is a typo, outdated link, or erroneous reference.

If your project logs these requests, you can group the resulting data and gain insight into problem areas that need attention. Was Angular asking for an API route from Node that does not exist or was moved? Node will log that. Is there a bad file request that Angular asked for? Nginx (or whatever you’re using as your server) will have a log entry for that. The status codes returned by these requests can be logged with a custom AngularJS module that sends off the stacktrace to the back end for writing to disk. Because these kinds of errors are mainly problems in your codebase, a good QA/testing regimen will catch these and will not be the most common types of 404 in your logs.

File System

If you’re serving your MEAN stack behind another server such as Apache or Nginx,
those logs will contain their own 404 entries for static files. To get a complete picture of what is and is not being found properly on your server, be sure to examine these logs, as well as the ones generated by the other areas of your project.

External Links and Hits

image07External links and/or hits are not something that can be controlled, but we can choose what to do when they are found to be hitting the project. Perhaps an external website has a malformed or simply bogus link pointing to our site. Perhaps a URL hacker is just poking around at the server.

image01

Or perhaps the server is experiencing a systematic crawl by a script kiddie, not even necessarily targeted at the MEAN stack. These are all cases that can be tracked with intelligent planning.

False Negatives

If your Express routes are reporting a 404 status erroneously, you may have log entries that incorrectly report failure to find a route. Doing a code review/audit can uncover such bugs and restore proper project behavior.

image06

Site Loading Too Slowly?

Many users won’t tolerate even a minor slowdown; they’ll get frustrated and complain or stop using your service. You should continuously monitor response times to make sure your server is running fast and that its performance is consistent over time. You can do this by tracking the response time captured in your logs stored by Morgan (see the “Request Logging with Morgan” section). If you have a REST API, the same approach can be used to track the performance of API calls, in terms of both speed and consistency. This can help you track and meet SLAs for internal teams or apps that depend on your service.

Analyzing Site Loading Speed with Winston

The following will list the number of requests per IP address and output the average response time to the console. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘remote_addr’, ‘response_time’, ‘url’]. Below the “Work with returned logs” comment, place the following code.

In this example, Winston will query up to 10,000 log entries from the last 24 hours. If you expect to have more than 10,000 log entries per day, increase the limit parameter in options. The query returns the following information from the logs: remote_addr (used to list IP addresses), response_time (used to find average response time), and url (used in the next example). These fields are set within the fields parameter of options. (See “Logging Source Information” for more details.) The request count for each IP address is added to an array, which is finally outputted to the system console. The average response time is then calculated and outputted. In this example, the average time is 0.390ms.

You can also find the average response time for a specific URL. This requires two simple modifications to the preceding code. First, place the following code at the very beginning. Set searchPageURL to the URL to find the average page response time.

Then, change the line that reads:

To the following:

The number of requests per IP address and average response time for the URL /about are outputted to the console, in the same format as in the preceding example.

Using a Log Management Solution

It’s helpful to visualize performance as a time series. With many log management systems, you can set this up as a time series chart, which can display statistics like average and maximum over any time window. This helps you see if the issue was a temporary spike. In the example below using Loggly, you can see the maximum time in dark blue and the average time in green. Other options for splitting the chart include by host so you can see if one host is trending slower than others, or by URI to show your slowest pages or API calls. To send request logs to Loggly, see “Express.js” in the “Node Logging Basics” subsection within the “Basics and Centralizing” section of this documentation.

graph

If you have an internal SLA or a maximum response time you want your responses to be served in, you can use a numeric range search to find responses that are over that threshold. You can build a regular expression to find these in grep, or, if you are using a log management system, you can search on it directly. For example, here is what the search would look like in Loggly:

Too Much Load From One Source?

When your site is under a heavy load, you should know whether the load is from real users or something else, such as:

  • A configuration or system problem
  • A client app or bot hitting your site too fast
  • A denial of service attack

An IP address you don’t recognize might be a client with a problem or an attacker. If you don’t want to allow this type of use, you might want to consider blocking or rate limiting this particular IP address.

Analyzing Source Information with Winston

Once your application is set up to log request information (see the “Request Logging with Morgan” section), you can query and analyze the data using Winston. The following will list the number of requests per IP address. The information is outputted to the console. Start by modifying the foundation code within “Winston” in the “Analyzing Node Logs” subsection. Within options, set the limit parameter to 10000, and set the fields parameter to [‘remote_addr’]. Below the “Work with returned logs” comment, place the following code.

In this example, Winston will query up to 10,000 log entries from the last 24 hours. If you expect to have more than 10,000 log entries per day, increase the limit parameter in options. The query returns remote_addr information from the logs (set within the fields parameter of options). See “Logging Source Information” for more details. The request count for each IP address is added to an array, which is finally outputted to the system console.

Analyzing Sources with Loggly

Loggly visualizes your application’s traffic sources with a pie chart, which simplifies the process of discovering sources causing too much load. Here you can see that one IP is generating more than half of the site’s traffic, which is unusual.Screen-Shot-2015-02-11-at-1.58.26-PM

Screen-Shot-2015-02-11-at-1.58.41-PM1

 

Unusual Traffic Patterns?

What are the various kinds of unusual traffic patterns we might expect to find?

image13

  • Spikes (Why did traffic increase suddenly?)
  • Dips (Why did traffic drop off?)
  • Gaps (Are there gaps in the log files? If so, why?)
  • Shifts in traffic origin (Suddenly everyone in Arizona is hitting the site?)
  • Focused traffic (One page, one file?)
  • Changes in user interaction (Are users spending more/less time on the site?)
  • Response code shifts? (Sudden blast of 404s? 500s?)

Traffic Spikes

A spike in traffic may indicate a successful media campaign or any number of other reasons. Why did traffic increase suddenly? Spikes in traffic may be from a single source or multiple sources. This may indicate an attempt at gaining illicit access to your server, a DDOS attack, or healthy traffic from one or more sources indicative of an increase in positive attention to your content. The source and nature of the traffic can tell you a lot about what is happening. Things to consider when traffic spikes include bandwidth usage, server load, and the need to scale up. If you’re on multiple servers, it may be time to spin up another node. See the “Too Much Load from One Source” section for an example.

Traffic Dips

Why did traffic drop off? Is there a deeper cause, such as a broken area of the site that is driving active traffic away from experiencing the full measure of content users would otherwise consume? Be sure to check whether there are clues such as specific traffic origins dropping off. Use critical thinking to determine whether a new UX/UI rollout is frustrating users and driving them away, or perhaps it is simply a browser rendering problem; users may not be able to load a page. Did IE users just drop off? Did the CDN go down?

Traffic Gaps

Are there gaps in the log files? If so, why? Perhaps the servers went down for a while. Perhaps there is a problem with your logging system, causing gaps when certain potentially buggy areas of your logging code do not write their otherwise-viable logs to disk.

Shifts in Traffic Origin

All of a sudden everyone in Arizona is hitting the site? Is one particular site pouring traffic in? Did you just get Slashdotted? A shift in traffic origin can tell you a lot about your audience and what’s going on around you. See the “Too Much Load from One Source?” section for examples.

Focused Traffic

Is one area, one page, or one file being repeatedly requested all of a sudden? Perhaps an article just went viral. Perhaps another site has externally linked to an image on your site.

Response Code Shifts

Is there a sudden blast of 404s? 500s? The impact on traffic can be immense, both from lost interest from users and because of reduced attention from inbound links due to problems in the site. Keeping an eye on spikes in your error codes can help you catch problems before they get out of hand. See the “What’s Causing 404s?” section for an example of counting your 404s.

Written & Contributed by

Matthew

Lukas

David

This guide will help software developers and system administrators become experts at using logs to better run their systems. This is a vendor-neutral, community effort featuring examples from a variety of solutions

Meet Our Contributors Become a contributor