Five Best Practices for Proactive Database Performance Monitoring
In my last post, I made the case that scalability and reliability take on new meaning in the SaaS world, and I shared two important building blocks for SaaS success:
In this post, I’ll delve into four more SaaS architecture best practices, which build on the core principles we discussed in the first part of this series.
It’s inevitable that customers will send data or use your service in unpredictable or not-so-well-defined ways. With Loggly, we have seen customers unintentionally send huge amounts of data to our service because of an issue of one kind or another in their systems. (We call them “noisy neighbors,” and they exist for every kind of SaaS application.) The best SaaS applications are designed to handle this kind of unpredictable behavior, and service governors are the key. Governors protect the rest of your customers by quickly identifying noisy neighbors and processing their requests or data through separate paths.
Your service governor is the component that watches your whole service and takes actions based on policies defined for it. It is one of the major users of your metrics and action APIs. It:
Let’s say that a customer suddenly starts sending an enormous bps load to your service. The governor would use metrics from the data collection pod to detect the increased activity. It would consult your defined policy and then invoke the action API on the data collection pod, the action API on the splitter, or both. If the customer needs to be notified, it would invoke the notification service to send an appropriate notification. Below is the architecture diagram of our sample application with service governors and the notification service in place.
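To make this concrete, below is a minimal sketch of such a governor loop in Python. The metrics, action, and notification clients, the policy threshold, and every method name are hypothetical stand-ins for whatever your own APIs expose; they are not taken from Loggly’s implementation.

```python
import time

BPS_LIMIT = 50_000_000  # example per-customer policy: max bytes per second


def govern(metrics_api, collector_actions, splitter_actions, notifier):
    """Poll metrics and move noisy neighbors onto a separate path."""
    while True:
        for customer, bps in metrics_api.bytes_per_second_by_customer().items():
            if bps > BPS_LIMIT:
                # Throttle at the data collection pod and reroute at the
                # splitter so other customers are unaffected.
                collector_actions.throttle(customer, limit_bps=BPS_LIMIT)
                splitter_actions.route_to_slow_lane(customer)
                notifier.send(
                    customer,
                    f"Ingestion rate of {bps} B/s exceeded policy; traffic rerouted.",
                )
        time.sleep(10)  # polling interval chosen arbitrarily for the sketch
```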
The next factor to consider is how your SaaS service will handle data corruption. Data corruption will happen, whether through human fault, machine fault, or software bugs. Building your SaaS service in such a way that it can recover from these faults is a key aspect of its reliability.
Building for data corruption requires a separate data store that holds all customer data unmodified. The moment data enters the service, it goes to this immutable store, where it can never be modified. Most storage systems provide full CRUD (create, read, update, delete) functionality; to keep the data truly immutable, use only the create and read operations. In our example, the immutable store can hang off the messaging system, because the data collection pod collects but doesn’t modify customer data. Now, if your primary data store gets corrupted for any reason whatsoever, you can recover the data from the immutable store.
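As a rough illustration, here is what an append-only wrapper might look like in Python; the backend object and its put/get methods stand in for whatever storage system you use and are assumptions, not any specific product’s API.

```python
import time
import uuid


class ImmutableStore:
    """Append-only wrapper: exposes only create and read, never update or delete."""

    def __init__(self, backend):
        self._backend = backend  # e.g., an object store client (assumed interface)

    def create(self, customer_id, payload):
        # Key by customer and arrival time so records can be replayed later.
        key = f"{customer_id}/{int(time.time())}/{uuid.uuid4()}"
        self._backend.put(key, payload)  # assumed backend method
        return key

    def read(self, key):
        return self._backend.get(key)    # assumed backend method
```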
Make sure you pick an immutable store that can be plugged into your processing pipeline easily, without any manual work. You should also be able to send data from this store back into your processing pipeline based on time ranges, customer sets, and so on. Finally, I recommend a separate lane for rebuilds instead of the normal lane, so the real-time data coming into your application doesn’t pay a performance penalty for rebuilds.
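Continuing the sketch above, a rebuild might look something like this; list_keys (a range scan by customer and time) and rebuild_queue are assumed capabilities of your storage and messaging layers, not specific APIs.

```python
def replay(immutable_store, rebuild_queue, customer_ids, start_ts, end_ts):
    """Re-feed raw records for selected customers and a time range.

    Records go onto rebuild_queue, a lane separate from the real-time
    pipeline, so rebuilds don't slow down live ingestion.
    """
    for customer_id in customer_ids:
        for key in immutable_store.list_keys(customer_id, start_ts, end_ts):
            rebuild_queue.publish(immutable_store.read(key))
```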
As your business grows, customer data grows. You need the ability to increase the capacity of servers or components in your service quickly and painlessly. Adding servers manually is painful and error prone. Worse yet, the process can cause outages or result in your service running in a degraded mode.
So how can your application recognize that it needs more capacity and add this capacity on demand? By adding a provisioning service to your stack. Here’s what that service would do (assuming that your service is running on AWS):
You can build your provisioning service with the framework of your choice, such as Ansible, Puppet, or Chef, or with a combination of AWS services like CodeDeploy, CloudFormation, OpsWorks, or the AWS boto API. Like your governor, your provisioning service should get metrics from all the components so it can spin up any component in your stack. (This is another reason to keep all components stateless.) With your provisioning service in place, your service doesn’t require any manual intervention when it needs more capacity.
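For illustration only, here is a bare-bones version of that check-and-scale loop using boto3, with a hypothetical custom CloudWatch metric standing in for your real pipeline metrics; the namespace, metric name, threshold, and AMI are placeholders.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")


def indexer_backlog():
    """Read an assumed custom metric that tracks pipeline backlog."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="SampleApp",            # hypothetical namespace
        MetricName="IndexerQueueDepth",   # hypothetical metric
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0


def add_indexer_if_needed(threshold=10_000):
    if indexer_backlog() > threshold:
        # Launch one more stateless indexer node from a pre-baked image.
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",       # placeholder AMI
            InstanceType="m5.xlarge",
            MinCount=1,
            MaxCount=1,
        )
```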
More and more SaaS companies have seen the need for continuous integration (CI) and continuous delivery (CD), which are a natural evolution of the agile development methodology. Agile simply can’t be successful without CI and CD. The caveat: it requires a lot of infrastructure and resources to build CI and CD into production environments. If you can put such infrastructure in place, I recommend having CI and CD for production. If not, you should have CI and CD for your development and test environments and one-click deploy for production. You can actually think of your provisioning services in two parts:
The very nature of log management made it imperative for Loggly to build our solution to handle an unknown and unpredictable scale. But virtually every SaaS business faces a similar set of challenges:
Loggly’s segmented pipeline has made it simpler for us to:
We’re positioned to take on these potential headaches for tens of thousands of net-centric companies that need insights from their log data, so they can focus on what really matters to their end customers. If you haven’t done so already, start a Loggly free trial and stop thinking about the scalability issues around log management forever.
And stay tuned to the Loggly blog because I’ll be digging into more SaaS scalability best practices over the next several months.
Manoj Chaudhary