.NET Troubleshooting and Analyzing
The whole purpose of logging is to support future root cause analysis and monitoring. If you support small applications, you might not need extensive logging and troubleshooting, but enterprise applications span several departments, layers, and network resources. Troubleshooting and analyzing each application component separately can drag out the root cause analysis process unnecessarily. With good logs, you can pinpoint an error much more easily, often right down to the exact line of code.
Where you find your logs depends on how you’re logging errors. Many .NET developers on Windows write logs to the Windows Event Log and read them in Event Viewer. This makes it much easier to search any type of log entry (errors, information, or warnings) by date and time. Usually, when you receive a report of an application error, a user reports that something is “broken” without many details. This can be frustrating as a developer since “something is broken” doesn’t give you many clues to the problem.
Some developers display an error code on the screen and ask the user to include this specific code when reporting the error. You write the same code to your logs during the exception handling and logging process, so you can search for it when analyzing your logs. This approach can add friction: if users forget to include the code, your support team has to ask them for it.
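The error code pattern can be sketched in C# as follows. This is a minimal illustration, not a prescribed implementation: the names ErrorReference, NewCode, FormatLogEntry, and ErrorHandling.Run are hypothetical, the “ERR-” prefix and eight-character length are arbitrary choices, and the writeLog delegate stands in for whatever logger your application already uses.

```csharp
using System;

public static class ErrorReference
{
    // Builds a short code the user can read back to support, e.g. "ERR-3F9A1C2B".
    // The prefix and length are arbitrary choices for this sketch.
    public static string NewCode()
    {
        return "ERR-" + Guid.NewGuid().ToString("N").Substring(0, 8).ToUpperInvariant();
    }

    // Puts the code at the start of the log entry so it is easy to search for later.
    public static string FormatLogEntry(string code, Exception ex)
    {
        return $"[{code}] {ex.GetType().Name}: {ex.Message}{Environment.NewLine}{ex.StackTrace}";
    }
}

public static class ErrorHandling
{
    // Runs an action. On failure, logs the exception together with a new code
    // and returns a message that shows the same code to the user.
    public static string Run(Func<string> action, Action<string> writeLog)
    {
        try
        {
            return action();
        }
        catch (Exception ex)
        {
            string code = ErrorReference.NewCode();
            writeLog(ErrorReference.FormatLogEntry(code, ex));
            return "Something went wrong. Please include this code when you report the problem: " + code;
        }
    }
}
```

Because the same code appears in both the user-facing message and the log entry, a search for that code should lead straight to the matching exception.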
Before you start troubleshooting, you should know how root cause analysis works. Every organization has its own procedure. Some companies allow you to quickly fix a problem and then perform root cause analysis. For example, suppose you have a critical issue that blocks users from accessing your site. You don’t know the exact cause of the problem, but a quick reboot of the server fixes the issue. You’ve fixed the problem, but now you need to perform root cause analysis to figure out what caused the problem in the first place. This type of process is common when you have a critical issue impacting revenue and you don’t have time to perform thorough analysis.
When the error isn’t critical, some companies require you to troubleshoot and find the root issue before you deploy a fix. Ultimately, it’s better to perform root cause analysis before deploying a fix so you fully understand the problem before you deploy something that could cause additional bugs. However, critical issues that limit the application and revenue for the company are often considered too damaging for a long analysis process. Instead, you’re required to quickly find the problem, correct the code, and ask for an emergency deployment outside the normal production deployment window.
Before you start digging into code, here are some basic steps to remember.
- Gather as much information from the customer as you can.
In some environments, you don’t speak with the employee or the customer. You work through a project manager (PM). The PM should know that you need as much information as possible, but ask questions and get as much detail about the problem as you can.
- Try to reproduce the error.
You can sometimes reproduce the error (sometimes referred to as a “repro”). If you can reproduce the error, you sometimes don’t even need logs. You can identify the problem just from your application input and knowledge of the application. The PM will also attempt to reproduce the issue.
- Read your logs.
We’ll go into this step further in the next section. It’s easier to troubleshoot an issue if you can reproduce it, but you don’t always have that luxury as a developer. Even if you can reproduce it, you might still be confused as to why the error is happening. The next step is to read logs. As we stated in previous sections, it’s important to keep very verbose logs that help you identify where the problem is occurring and why it’s happening. Logs are a main component in troubleshooting and analysis.
- Debug your code.
Once you find the error, you can then step through your code, find the logic error, and fix it. Before you do, you should understand the application enough to know if a change in this one section would affect other areas of your code. Of course, QA will test your code and likely regression test for any new bugs introduced, but sometimes errors do slip through QA.
Let’s start by going into more detail about reviewing your logs. We’ll look at Event Viewer logs since they are the most commonly used, but you could also use third-party tools. .NET developers have several options including tools that will automatically parse errors and display them as a visual representation to help you better understand them. In many ways, visual tools help you aggregate errors and view statistics for reporting purposes. For instance, they can help you identify if a specific part of your program fails too frequently, so you know that you need to spend resources evaluating the issue further.
Analyzing Logs Using Event Viewer
The most common way for .NET developers to analyze errors in any application is Event Viewer. Even if you don’t specifically log errors to Event Viewer, IIS writes application errors to Event Viewer. It’s much better to control the way your application logs errors, but it’s common for new developers to skip logging and exception handling only to find themselves digging through Event Viewer for clues to why the application failed. Whether you specifically log to Event Viewer or use it to review errors on the server, this section will show you how to find errors in Event Viewer.
First, let’s take a look at the interface.
Event Viewer logs almost everything for the server, so you’ll see several log sections that you won’t need as the developer. What you’re concerned with is the Application log under Windows Logs. If you recall, we used the Application log to write any errors to Event Viewer in the Basics chapter. All events will be in the Application log unless you specified a customized name for your log. If you have a custom log name, you’ll see it under Applications and Services Logs instead.
Now, take a look at the Application log entries.
In the above image, there is a list of Information entries. We don’t see any errors, which is good. When you click one of the log entries, its details are shown in the General and Details tabs. In this example, an MSSQLSERVER event is logged. It shows that recovery was run on the SQL Server and no critical errors were found.
On the right side of the window, click “Filter Current Log” to open a configuration that will help you find specific errors.
Using the filter, you can search by event level (Critical, Error, Warning, Information, or Verbose) and several other options. You would also set the Event Sources to match your application name. To find errors, check the “Error” checkbox and click OK. Now, we can see the errors logged by our application.
We’ve highlighted an error from “LogglyTester,” an application we made to throw an error in the system. Note that Loggly itself is a service, not an application. As you can see, we attempted to divide by zero and the error was logged.
If we click the Details tab, we even get the code file name and the line number.
In the image above, the HomeController file is where the error occurred and the error happened on line 30.
Using the dropdown “Logged” option, you can also choose a time frame for the logged error. You can pull logs from any server, and use keywords for specific error types.
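The same filter can be approximated in code. The sketch below uses the System.Diagnostics.EventLog class to list Error entries from the Application log for one source over a time window. It assumes a Windows machine and an account allowed to read the log. On .NET Core and later it also assumes the System.Diagnostics.EventLog package is referenced. The source name comes from the example above, and the 24-hour window is an arbitrary choice.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class ApplicationLogQuery
{
    public static void Main()
    {
        // Source name taken from the example above; replace it with your own.
        const string source = "LogglyTester";

        // Arbitrary time window for this sketch: the last 24 hours.
        DateTime since = DateTime.Now.AddHours(-24);

        using (var log = new EventLog("Application"))
        {
            // Entries is a non-generic collection, so cast before using LINQ.
            var errors = log.Entries
                .Cast<EventLogEntry>()
                .Where(e => e.EntryType == EventLogEntryType.Error
                         && e.Source == source
                         && e.TimeGenerated >= since)
                .OrderByDescending(e => e.TimeGenerated);

            foreach (EventLogEntry entry in errors)
            {
                Console.WriteLine($"{entry.TimeGenerated:u} {entry.MachineName} {entry.Source}");
                Console.WriteLine(entry.Message);
                Console.WriteLine();
            }
        }
    }
}
```

This walks every entry in the log, so it can be slow on a large Application log. It is meant to show the same criteria as the filter dialog, not to replace it.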
With the information in the error, we know that on line 30 in the HomeController file, our application has a bug. Since we know the exception was a “divide by zero” error, we look for division calculations in the HomeController. If the division uses a variable, then somewhere in the code, the denominator in the calculation is set to zero. This can be from input from the user or a miscalculation.
In .NET, Convert.ToInt32 returns 0 when you pass it a null string, while int.Parse throws an exception instead. Suppose we take input from the user, and the user decides not to enter a number in our text box, so our code receives a null string. We then convert the input to an integer with Convert.ToInt32 and perform a division calculation. The denominator is 0, so the division throws a divide-by-zero exception. Our analysis would show that we need to place validation checks on the user input to ensure that a number greater than 0 is entered. This is a common type of root cause analysis that you’ll need to perform when you find bugs in your application.
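A validation check of this kind might look like the sketch below. The SafeDivision class, the TryDivide method, and the error messages are hypothetical names chosen for illustration. The check uses int.TryParse, which returns false for null, empty, or non-numeric input rather than silently producing 0.

```csharp
using System;

public static class SafeDivision
{
    // Divides numerator by the user's input only when the input is a whole
    // number greater than zero. Returns false and sets an error message otherwise.
    public static bool TryDivide(int numerator, string denominatorInput,
                                 out int result, out string error)
    {
        result = 0;
        error = string.Empty;

        // int.TryParse rejects null, empty, and non-numeric text.
        if (!int.TryParse(denominatorInput, out int denominator))
        {
            error = "Please enter a whole number.";
            return false;
        }

        // Reject zero and negative values before dividing.
        if (denominator <= 0)
        {
            error = "The number must be greater than zero.";
            return false;
        }

        result = numerator / denominator;
        return true;
    }
}
```

With this in place, a missing value produces a validation message for the user instead of an unhandled exception in the logs.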
Using Log Management Solutions
For small applications with only one web server, you can RDP into the Windows server, take a look at the logs, download them, and use them to analyze your code. What happens, though, when you have an application running on 10 web servers and the error is intermittent? In a web farm environment running a .NET application, it’s imperative that you always keep the servers exactly the same or run the risk of one server having compatibility issues while the others work fine. However, when it does happen, how do you know which server is causing the issue? The error is intermittent, so all you know is that it’s one of 10 servers causing the issue.
This is where log management solutions are useful. Without them, you’d have to log into each server, download the logs, and start analyzing them one by one. Windows Vista and later versions can collect logs from other machines using Event Subscriptions, but you still need to log into the machine that collects them to troubleshoot. Log management solutions aggregate your logs into one location. They then apply their own algorithms for analysis and reporting. In addition, you can use them as a cloud backup, pushing your logs to their servers and storing them for a long period of time. Make sure you check their retention times.
Let’s take a look at a log management tool screenshot.
In the image above, we have five events logged. We can see all of them in one place. In this example, all errors are from the same server, but this could be five events from three different servers. You would then be able to pinpoint the server causing issues instead of remotely logging into each of your web farm servers one by one.
Log management solutions have several other advantages. You can get alerts for certain errors, search for specific errors each morning, and see a visual representation of errors accumulated over a specific amount of time.
Here is an example screenshot from Loggly’s interface.
Reviewing IIS Logs
IIS logs are a good fallback if you don’t have the right logging system in place. The information you can gather from these logs is limited, but they can help you find pages that are causing errors in a web application. These logs don’t give you specifics, but they can tell you which pages are returning errors such as 404s or 500s.
The following image is an example of a log file for IIS.
The file shows you the date, port, the user’s browser, target page, and the server response code (among other information). By default, you find IIS logs in the C:\inetpub\logs\LogFiles directory.
If you know that the application is failing but don’t have any more information, you can use these logs to find the time frame in which the errors occur. Again, you don’t have specifics, but you can then sync up the time from the IIS logs with the error events in Event Viewer. Keep in mind that IIS records W3C log times in UTC by default, while Event Viewer displays local time. IIS logs tell you when the error occurred, and then Event Viewer gives you the error specifics.
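To get those timestamps out of an IIS log without reading it by hand, a small parser can help. The sketch below assumes the default W3C log format, where a “#Fields:” line names the space-separated columns, and assumes the date, time, cs-uri-stem, and sc-status fields are enabled. IisError and IisLogParser are hypothetical names, and treating any status of 400 or above as an error is a choice made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class IisError
{
    public DateTime TimestampUtc { get; set; }
    public string UriStem { get; set; }
    public int Status { get; set; }
}

public static class IisLogParser
{
    // Returns one IisError per log line whose status code is 400 or higher.
    public static List<IisError> FindErrors(IEnumerable<string> lines)
    {
        var errors = new List<IisError>();
        string[] fields = null;
        int dateIdx = -1, timeIdx = -1, uriIdx = -1, statusIdx = -1;

        foreach (string line in lines)
        {
            // The #Fields directive lists the column names for the lines that follow.
            if (line.StartsWith("#Fields:", StringComparison.Ordinal))
            {
                fields = line.Substring("#Fields:".Length)
                             .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                dateIdx = Array.IndexOf(fields, "date");
                timeIdx = Array.IndexOf(fields, "time");
                uriIdx = Array.IndexOf(fields, "cs-uri-stem");
                statusIdx = Array.IndexOf(fields, "sc-status");
                continue;
            }

            // Skip other directives, blank lines, and anything before the first #Fields line.
            if (fields == null || line.Length == 0 || line[0] == '#') continue;
            if (dateIdx < 0 || timeIdx < 0 || uriIdx < 0 || statusIdx < 0) continue;

            string[] values = line.Split(' ');
            if (values.Length != fields.Length) continue;

            if (!int.TryParse(values[statusIdx], out int status) || status < 400) continue;

            // W3C log timestamps are written in UTC.
            if (!DateTime.TryParseExact(values[dateIdx] + " " + values[timeIdx],
                    "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
                    out DateTime timestamp)) continue;

            errors.Add(new IisError { TimestampUtc = timestamp, UriStem = values[uriIdx], Status = status });
        }

        return errors;
    }
}
```

You could call it with something like IisLogParser.FindErrors(File.ReadLines(path)), where path points to a copy of a log file, and then look up each TimestampUtc (converted to local time) in Event Viewer.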
You’ll notice that IIS logs are not very user friendly when it comes to readability. For this reason, you should use a log management solution to work with IIS logs from several different servers. For instance, the following graph shows common server errors from Loggly’s service.
If you combine application logging, log management solutions, and IIS logs, you can often pinpoint errors before users send reports. If you don’t catch the error before the user does, you can still properly analyze errors from a report with minimal information. Having users give you specific details is preferable, but you can’t rely on it.