Infrastructure and application monitoring is becoming increasingly important, mainly because both infrastructure and applications are growing more complex. Additionally, modern software development techniques like canary releases rely on monitoring to do their job. And when we’re talking about cloud computing and microservices architecture, a good monitoring solution is a must-have.
Cloud-based applications require a slightly different approach to monitoring. Though the general idea stays the same as it is for on-premises environments, there are some cloud-specific things you need to be aware of to build an effective monitoring solution. In this post, you’ll learn what these specifics are and what the best practices are for monitoring cloud-based applications.
The first difference between monitoring traditional, on-prem applications and monitoring cloud-based applications is that, with the latter, you can’t forget about the cloud itself. What does this mean? As an application team in an on-prem environment, you’ll probably focus on monitoring just your application. Monitoring the actual servers and networks is traditionally handled by dedicated platform or operations teams.
It doesn’t work the same way when you’re in the cloud. Sure, some companies try to replicate their existing monitoring approach in the cloud. But to succeed, you need to get rid of the split between infrastructure monitoring and application monitoring. There are two main reasons for this.
First, moving to the cloud typically gives application teams more freedom to manage their infrastructure. Therefore, it should also be up to them to monitor it. If the application team manages the infrastructure but the platform team monitors it, that creates a lot of confusion. The “you build it, you run it” approach makes the team accountable for the entire stack.
The second reason is that cloud resources often behave differently than physical devices. For example, load balancers sometimes require a “warm-up” period or have limits on network address translation functionality. This usually doesn’t happen with physical devices. Therefore, application teams need to be aware of such behaviors, and the best way to achieve that is by monitoring their cloud resources.
You also need to keep in mind that monitoring cloud resources means looking at different sets of metrics. Besides typical CPU or RAM usage, you must keep an eye on cloud vendor- and resource-specific metrics. For example, a load balancer on Azure will give you different metrics than a load balancer on AWS or DigitalOcean.
Also, something common in cloud environments but rarely seen on-prem is quotas and limits. As an application team, you should monitor your quota usage. Otherwise, you may end up in a situation where, during peak traffic, your infrastructure can’t scale up because a quota has been reached.
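The idea above can be sketched in a few lines. This is a minimal, illustrative example: the resource names and quota numbers are invented, and in practice you’d pull both values from your cloud provider’s quota APIs rather than hardcoding them.

```python
# Minimal sketch of a quota-usage check. Resource names and limits here
# are made up for illustration; real values come from your provider's
# quota APIs (e.g., compute quotas per region).

def quota_alerts(usage, limits, threshold=0.8):
    """Return resources whose usage is at or above `threshold` of quota."""
    alerts = []
    for resource, used in usage.items():
        limit = limits.get(resource)
        if limit and used / limit >= threshold:
            alerts.append((resource, used, limit))
    return alerts

current_usage = {"vcpus": 92, "public_ips": 4, "load_balancers": 2}
quota_limits = {"vcpus": 100, "public_ips": 20, "load_balancers": 10}

for resource, used, limit in quota_alerts(current_usage, quota_limits):
    print(f"WARNING: {resource} at {used}/{limit} ({used / limit:.0%} of quota)")
```

Alerting at 80% rather than 100% leaves the team time to request a quota increase before peak traffic hits the ceiling.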
Something specific to cloud environments is cost monitoring. Most cloud resources follow a pay-per-use model. Your organization may opt for long-term resource reservations, but that won’t change the fact that you’ll have to keep an eye on your spending. This is especially true when application teams can use self-service solutions to manage infrastructure on their own.
On-prem, leaving a test server running after a short proof of concept doesn’t hurt much. In the cloud, however, a forgotten resource can contribute significantly to your infrastructure costs. But forgotten resources aren’t the only reason for cost monitoring.
Since we’re talking about a pay-per-use model, badly designed database calls can produce an unexpected price tag overnight. Therefore, it’s not just about checking your monthly cloud bill; it’s about catching spikes in costs on an hourly basis.
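A simple way to catch such spikes is to compare each hour’s spend against a rolling baseline of the previous hours. The sketch below is illustrative; the window size, threshold factor, and cost figures are assumptions you’d tune for your own billing data.

```python
# Hedged sketch of hourly cost-spike detection: flag any hour whose cost
# exceeds `factor` times the average of the previous `window` hours.
# Window and factor are arbitrary starting points, not recommendations.

def cost_spikes(hourly_costs, window=24, factor=3.0):
    """Return the indices of hours whose cost looks anomalous."""
    spikes = []
    for i in range(window, len(hourly_costs)):
        baseline = sum(hourly_costs[i - window:i]) / window
        if baseline > 0 and hourly_costs[i] > factor * baseline:
            spikes.append(i)
    return spikes

# 24 quiet hours, then a runaway query at hour index 26
costs = [2.0] * 24 + [2.1, 2.0, 9.5, 2.2]
print(cost_spikes(costs))  # prints [26]
```

A monthly bill would average that spike away; an hourly check surfaces it the same day.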
Finally, in the cloud, there’s usually more than one way to achieve the same result. For example, if you need a database, you could simply create a virtual machine and install the database yourself. You could also use software as a service (SaaS) offerings from your cloud provider or even some prebuilt solutions from the cloud marketplace. And though all three solutions may fulfill your requirements, there may be significant cost differences between them. Therefore, monitoring costs for all of them can help you decide which one to pick.
It’s common to adopt a microservices architecture when moving to the cloud, so your monitoring system needs to be adjusted accordingly. Microservices require a slightly different approach to monitoring. You no longer have one application to monitor; instead, you have dozens of small applications. Your monitoring system needs to gather metrics from all the individual microservices and still give you a general overview of the application as a whole.
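The “general overview” part can be as simple as rolling per-service health into one application-level status. The service names and status values below are invented for illustration; a real system would feed this from health checks or metrics.

```python
# Illustrative rollup of per-microservice health into one overview.
# Service names and statuses are made up; real input would come from
# health-check endpoints or your metrics pipeline.

def overall_status(service_health):
    """The application is only as healthy as its worst service."""
    statuses = service_health.values()
    if any(s == "down" for s in statuses):
        return "down"
    if any(s == "degraded" for s in statuses):
        return "degraded"
    return "healthy"

health = {"orders": "healthy", "payments": "degraded", "catalog": "healthy"}
print(overall_status(health))  # prints "degraded"
```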
Another important aspect of monitoring cloud-based (and especially microservices-based) applications is network monitoring. But we’re not talking about traditional network monitoring. When it comes to cloud networks, traffic usually shifts from north-south to east-west. What this means is that instead of looking only at load balancers and other typical network equipment, you need to focus on the network flow between your microservices.
Most microservices talk to each other using HTTP APIs; therefore, a simple “too many HTTP 500 codes” alert won’t tell you much. You’ll need context for such alerts (for example, which microservice is failing?). Network monitoring will also help you scale your microservices appropriately. For example, lots of 500 codes coming from a single microservice can mean there’s a bug in that microservice, or simply that it can’t handle the number of connections and needs to be scaled up.
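Adding that context is mostly a matter of tagging each request with its service before counting. The request log below is a hypothetical example of (service, status code) pairs; in practice these would come from access logs or a service mesh.

```python
from collections import Counter

# Sketch of grouping error responses per microservice so an alert can
# say *which* service is failing, not just that 500s are occurring.
# The request log here is hypothetical (service name, HTTP status).

def error_counts_by_service(requests, status=500):
    """Count responses with the given status code per microservice."""
    counts = Counter()
    for service, code in requests:
        if code == status:
            counts[service] += 1
    return counts

log = [("orders", 200), ("payments", 500), ("orders", 500),
       ("payments", 500), ("catalog", 200), ("payments", 500)]

print(error_counts_by_service(log))  # payments clearly stands out
```

An alert built on these grouped counts points you straight at the failing service instead of leaving you to grep through the whole fleet.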
Earlier, we mentioned that cloud resources expose specific metrics you should keep an eye on. These metrics are specific not only to the resource itself but also to the cloud provider.
This is why you can’t simply port your on-prem monitoring dashboard and adjust it to the cloud. You need to add those extra metrics. Of course, you probably won’t need to monitor every single one of them, but to avoid unexpected downtime, you need to know about the limitations of the cloud devices and monitor them accordingly. For example, on Azure, there’s a limit on SNAT (source network address translation) ports on the load balancer. This limit depends on the number of front-end IP addresses assigned to the load balancer and on the back-end pool size (the number of virtual machines).
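To make the SNAT limit concrete, here’s a back-of-the-envelope sketch of per-VM port preallocation. The tier table mirrors Azure’s documented defaults at the time of writing (per frontend IP), but treat it as illustrative and verify against the current Azure Load Balancer documentation before relying on it.

```python
# Hedged sketch of Azure's default SNAT port preallocation per VM.
# The tier numbers reflect Azure's documented defaults at the time of
# writing; check current docs, as these values can change.

def snat_ports_per_vm(pool_size):
    """Default SNAT ports preallocated per VM for a given backend pool size."""
    tiers = [(50, 1024), (100, 512), (200, 256),
             (400, 128), (800, 64), (1000, 32)]
    for max_vms, ports in tiers:
        if pool_size <= max_vms:
            return ports
    raise ValueError("backend pools above 1,000 VMs aren't covered here")

for vms in (10, 75, 300):
    print(f"{vms} VMs in the pool -> {snat_ports_per_vm(vms)} SNAT ports per VM")
```

The takeaway: growing the backend pool silently shrinks each VM’s SNAT allocation, which is exactly the kind of cloud-specific limit worth a dashboard panel.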
Another important aspect of cloud-based application monitoring is service maps. Service maps are a type of dashboard designed to illustrate how your resources are connected to each other. This is important because you’ll usually use more services in the cloud than on-prem. You’ll likely use more small components like API gateways, proxies, and different types of storage. It’s easy to lose track of what connects to what, and service maps show you this.
When it comes to logging strategy, there are a few things to consider. The old-school approach is to treat logging separately from monitoring. In cloud environments, and especially when using microservices, logging should be part of your monitoring solution.
This doesn’t mean you have to use one tool for everything. It means you should treat logging and application performance monitoring (APM) as one discipline. You can, for example, create metrics based on log entries. You can read more about this approach in this blog post.
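Deriving a metric from logs can be as simple as counting lines that match a pattern. The sketch below shows the idea; tools like Loggly do this at scale, but the log lines and pattern here are invented for illustration.

```python
import re

# Hedged sketch: turn raw log lines into a counter metric by counting
# entries that match a pattern. Log lines below are invented examples.

ERROR_PATTERN = re.compile(r"\bERROR\b")

def log_based_metric(lines, pattern=ERROR_PATTERN):
    """Count log lines matching `pattern` (a simple log-derived metric)."""
    return sum(1 for line in lines if pattern.search(line))

logs = [
    "2024-05-01T10:00:01 INFO order 1201 created",
    "2024-05-01T10:00:02 ERROR payment declined for order 1201",
    "2024-05-01T10:00:05 ERROR timeout calling catalog service",
]
print(f"error_count = {log_based_metric(logs)}")  # error_count = 2
```

Once log lines become a number, they can be graphed and alerted on just like any other APM metric.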
When we talk specifically about microservices, we have another problem to solve: log centralization. If you run 100+ microservices, you’ll have 100 different places to check logs. And remember: these 100 microservices work together as one application. Reading all these logs separately may help you spot specific code issues, but in general it won’t give you the bigger picture. You won’t be able to tell how an error in the log of one microservice relates to the entire application.
For example, an “order failed” message can be the result of a failed payment, an unavailable product, wrong email or login details provided by the user, or many other issues. And each of these issues can be produced by a separate microservice. Therefore, you need to aggregate the logs from all microservices and send them to a centralized log analysis tool.
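At its core, centralization means merging per-service logs into one timeline and correlating entries that belong to the same request. The sketch below assumes a shared request ID; the field layout and service names are invented for illustration.

```python
# Illustrative sketch of log centralization: merge per-service logs into
# one chronological stream and correlate entries by a shared request ID.
# Field layout (timestamp, request_id, message) is an assumption.

def centralize(service_logs):
    """Flatten {service: [(timestamp, request_id, message), ...]} into
    one chronologically ordered stream tagged with the service name."""
    merged = []
    for service, entries in service_logs.items():
        for ts, req_id, msg in entries:
            merged.append((ts, service, req_id, msg))
    return sorted(merged)

def trace(merged, req_id):
    """All entries for one request, across every microservice."""
    return [entry for entry in merged if entry[2] == req_id]

logs = {
    "frontend": [(3, "req-42", "order failed")],
    "payments": [(2, "req-42", "card declined"), (1, "req-7", "charge ok")],
}
for ts, service, rid, msg in trace(centralize(logs), "req-42"):
    print(ts, service, msg)
```

With the merged view, the “order failed” line in the frontend log sits right next to its root cause, the declined card in the payments service.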
If you’re looking for cloud-friendly tools capable of fulfilling these requirements, try SolarWinds® AppOptics™ and SolarWinds Loggly®. AppOptics is a powerful APM tool offering both infrastructure and application monitoring, and Loggly can help you aggregate and make sense of logs. Feel free to create a free account here and test them for yourself.
The importance of monitoring should be clear by now. It’s no longer a question of whether you need monitoring; it’s a question of how comprehensive your monitoring is. Do you have the visibility you need to understand the health of your applications no matter how they’re deployed? Monitoring isn’t a goal on its own. The goal is to avoid downtime and find bugs quickly so you can fix them. Monitoring is an essential tool for achieving this. Therefore, you should make sure your monitoring systems suit your needs, fulfill your company-specific requirements, and work well with your infrastructure.
Keep in mind that infrastructure and applications change rapidly, and as important as stability is, the ability to adjust to new technologies is also something you should look for. Tools like AppOptics and Loggly offer this. They work well in both on-prem and cloud environments, and they’re constantly improving. Moreover, unlike traditional monitoring tools, they understand the importance of integrating the whole infrastructure stack into one monitoring solution. If you want to learn more about this, check out this post.
This post was written by Dawid Ziolkowski. Dawid has 10 years of experience as a network/system engineer, has worked in DevOps, and has recently worked as a cloud-native engineer. He’s worked for an IT outsourcing company, a research institute, a telco, a hosting company, and a consultancy company, so he’s gathered a lot of knowledge from different perspectives. Nowadays, he’s helping companies move to the cloud and/or redesign their infrastructure for a more cloud-native approach.