Why Loggly Chose Amazon Route 53 over Elastic Load Balancing

The key function of our log management service is to process massive amounts of log data from a wide variety of sources, parsing and indexing it to make it ready for analysis and monitoring in near real-time. The first entry point to Loggly’s big data ingestion pipeline is what we call our collectors. (You can learn more about the whole pipeline in this recent post.)

Loggly Collectors Perform One Key Task: Log Ingestion

The Loggly collectors ingest all of the incoming streams of log data from our customers and move that data down our log management pipeline for processing. They are written in C++ and are designed to support the TCP, UDP, HTTP, HTTPS, and TCP over SSL communication protocols. The collectors have two purposes:

  • To work at network line speed
  • To not drop any data, ever

The collectors need to be able to collect logs no matter how fast our customers throw them at us. They are designed to run completely segregated from the rest of the Loggly stack. This architecture allows us to leverage multiple regions and multiple Availability Zones (AZs) from AWS.

Log Management Scalability Requires a Way to Balance Traffic Across Collectors

Throughout our architecture, we have made scalability a top priority. Since the collectors are the first step in our pipeline, we needed to find something to sit in front of them that would:

  • Balance the load across collectors
  • Offer fault tolerance, allowing us to continue ingesting data if one of the collectors dies or is experiencing issues
  • Scale horizontally with the growth in our log volumes

Given that our infrastructure runs on Amazon EC2, we looked first to AWS Elastic Load Balancing (ELB). In this post, I will discuss some of the limitations we found with ELB and explain why Amazon Route 53 DNS works better for our log management use cases.

ELB Operates on a Simple Concept

As the name suggests, ELB is elastic; it can balance the load without exposing what is running behind it. The way I like to think about ELB is that it’s like a parking attendant: As a car arrives at the parking gate, the attendant routes it to the floor with the fewest cars. Similarly, ELB routes traffic to the nodes sitting behind it. ELB shares another characteristic with a parking attendant: It is relatively expensive, because it has to stop each and every car coming into the parking lot and tell it where to go.

Increasing Log Volumes Led to Latency Issues with ELB

We started our journey with ELB deployed in front of our collectors as shown in the diagram below. However, we started hitting latency issues in streaming traffic to the collectors when volumes got into the tens of thousands of events per second. We built our collector to be as efficient as possible, so these latency issues were essentially defeating its purpose.

AWS Elastic Load Balancing with Loggly Collectors

Issue #1: ELB doesn’t allow forwarding on port 514.

One of Loggly’s key competitive advantages is that we don’t require our customers to install agents on their local machines. We ingest logs via syslog, which has shipped with Linux- and Unix-based systems for ages, and we always want to make sure we are compliant with open standards. According to the syslog RFC 5426, syslog receivers should listen on port 514, so our load balancer should be able to listen on and forward port 514. However, ELB doesn’t have the ability to forward on 514. Its acceptable protocols are HTTP, HTTPS, TCP, and TCP over SSL, and its acceptable ports are 25, 80, 443, 465, 587, and 1024-65535.

Issue #2: ELB doesn’t support forwarding via UDP.

According to the RFC, a syslog receiver should support the UDP protocol. Loggly does not recommend that our customers send data over UDP because it is not a guaranteed delivery protocol, but we need to be able to support it in order to be fully compliant with syslog. Furthermore, some customers do have legitimate use cases for sending data over UDP. Thus our second hiccup with ELB: UDP is not supported. (ELB does support load balancing only on HTTP, HTTPS, TCP, and TCP over SSL.)
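For context, here is a minimal sketch of the kind of traffic the collectors must accept: an RFC 5424-style syslog message sent over UDP to port 514. The hostname below is a placeholder, not Loggly’s actual endpoint.

```python
import socket
from datetime import datetime, timezone

# Placeholder endpoint; not Loggly's actual collector address.
COLLECTOR_HOST = "collectors.example.com"
SYSLOG_UDP_PORT = 514  # standard syslog port (RFC 5426)

def send_syslog_udp(message: str) -> None:
    """Send one RFC 5424-style syslog message over UDP."""
    timestamp = datetime.now(timezone.utc).isoformat()
    # PRI 134 = facility local0 (16) * 8 + severity informational (6)
    frame = f"<134>1 {timestamp} myhost myapp - - - {message}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Fire-and-forget: UDP gives no acknowledgment that the datagram arrived.
        sock.sendto(frame.encode("utf-8"), (COLLECTOR_HOST, SYSLOG_UDP_PORT))

if __name__ == "__main__":
    send_syslog_udp("user login succeeded")
```

A load balancer fronting the collectors has to accept exactly this kind of datagram on port 514, which is where ELB fell short on both counts.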

Issue #3: Bursts in log event traffic can hit ELB performance limits.

One of the biggest challenges with a SaaS service—especially a log management service—is that customer traffic patterns are never entirely predictable. (Read more about this in my post on six critical SaaS engineering mistakes you want to avoid.) Our incoming traffic varies a lot, and we can have sustained bursts lasting hours when customers are experiencing operational problems that generate lots of logs (or when they misconfigure their logging, but that’s another story). Of course, these high-traffic periods are the times when we need load balancing the most.

So here’s where our critical issue with ELB came into play: When we ran simulations of log event bursts, we started to reach its performance limits. And we simply can’t afford to have latency on our traffic hitting the collector.

Here are some ELB performance numbers from testing by RightScale.

RightScale Amazon Elastic Load Balancing Testing

It’s important to note, however, that Amazon provides limited visibility into exactly how ELB is performing. Under the covers, AWS scales an ELB instance up or down based on traffic patterns, and AWS proprietary algorithms determine how large the ELB instance should be. If the ELB instance doesn’t fit your traffic patterns, you will see increased latency. AWS CloudWatch does make some metrics available, notably HTTPCode_ELB_5XX, but this metric doesn’t clearly indicate whether latency is due to ELB or to your server.
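For what it’s worth, here is a rough sketch of pulling that metric with boto3; the load balancer name and region are placeholders. It can tell you that front-end 5XX responses occurred, but not whether ELB itself or the backend was the source of the latency.

```python
import boto3
from datetime import datetime, timedelta, timezone

# "my-elb" and the region are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="HTTPCode_ELB_5XX",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,            # 5-minute buckets
    Statistics=["Sum"],
)

# Counts of 5XX responses generated by the ELB itself; a spike here does not
# say whether the load balancer or the backend was slow.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```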

Amazon Route 53 Aligned Better with Log Management Use Cases and Requirements for Performance, Fault Tolerance, and Scalability

Once we discovered ELB’s limitations, we looked for alternatives and settled on Amazon Route 53 Domain Name System (DNS).

Amazon Route 53 DNS Round Robin Was a Win

If you’ve ever used the Internet, you’ve used the Domain Name System, or DNS, whether you realize it or not. DNS is a service whose main job is to convert a user-friendly domain name like “loggly.com” into an Internet Protocol (IP) address like 198.192.168.29 that computers use to identify each other on the network. Within AWS, Amazon Route 53 is an “authoritative DNS” system, meaning that it provides an update mechanism developers use to manage their public DNS names. It then answers DNS queries, translating domain names into IP addresses so computers can communicate with each other.

Amazon Route 53

There are a bunch of routing mechanisms available with Route 53 DNS, as pointed out in its Developer Guide. Round robin is pretty basic load balancing, but it works well for us from an efficiency standpoint. We use weighted routing with equal weighting, which turns it into simple round robin. When a request is resolved through Route 53 DNS, it gets to the socket of the server sitting behind it right away, so there is no performance impact in collecting the log events.
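As a rough illustration (not our production code), this is what an equal-weight record set might look like when created through boto3; the hosted zone ID, record name, and collector IPs are all placeholders.

```python
import boto3

route53 = boto3.client("route53")

# All values below are placeholders.
HOSTED_ZONE_ID = "Z1EXAMPLE"
RECORD_NAME = "collectors.example.com."
COLLECTOR_IPS = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]

# One weighted A record per collector; equal weights make Route 53 spread
# answers evenly, which behaves like simple round robin.
changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD_NAME,
            "Type": "A",
            "SetIdentifier": f"collector-{i}",
            "Weight": 1,
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    }
    for i, ip in enumerate(COLLECTOR_IPS)
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "Equal-weight collector records", "Changes": changes},
)
```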

Failover Checks Provide Fault Tolerance

Since the key goal for the collectors is to implement our “no logs left behind” policy, we take advantage of Route 53 failover health checks. If there is an issue with a collector, Route 53 automatically takes it out of service and out of the DNS rotation; our customers won’t see any impact. When we have to add more collectors, we can do so seamlessly. We have collectors provisioned behind the DNS in different AZs. Route 53 routing works across AZs, so whenever there is an outage in one AZ, data will go to the collectors running in other AZs. Thus we gain the ability to scale across zones, and we’re protected from faults if we hit issues with an AZ.
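A hedged sketch of the idea with boto3, using placeholder values: create a health check per collector and attach its ID to that collector’s weighted record, so Route 53 drops the record from rotation when the check fails.

```python
import uuid
import boto3

route53 = boto3.client("route53")

# Placeholder values for one collector.
COLLECTOR_IP = "10.0.1.10"
HOSTED_ZONE_ID = "Z1EXAMPLE"

# A TCP health check against the collector (UDP ports cannot be health-checked directly).
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),  # must be unique per request
    HealthCheckConfig={
        "IPAddress": COLLECTOR_IP,
        "Port": 514,
        "Type": "TCP",
        "RequestInterval": 30,   # seconds between checks
        "FailureThreshold": 3,   # consecutive failures before "unhealthy"
    },
)
health_check_id = health_check["HealthCheck"]["Id"]

# Attach the check to the collector's weighted record; when the check fails,
# Route 53 stops returning this record, pulling the collector out of rotation.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "collectors.example.com.",
            "Type": "A",
            "SetIdentifier": "collector-0",
            "Weight": 1,
            "TTL": 60,
            "HealthCheckId": health_check_id,
            "ResourceRecords": [{"Value": COLLECTOR_IP}],
        },
    }]},
)
```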

Route 53 DNS Optimizes Inbound Traffic

Route 53 resolves our collector IPs directly, so data goes straight to the collectors without any overhead. Traffic lands on the high-performance collector software we have built and never has to pass through a load balancer that adds latency.
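To illustrate the point from the client side (placeholder record name), resolving the DNS name returns collector IPs directly, and the sender connects straight to one of them:

```python
import socket

# Placeholder record name standing in for the Route 53 name in front of the collectors.
records = socket.getaddrinfo("collectors.example.com", 514, proto=socket.IPPROTO_TCP)

# Each answer is a collector IP; the client opens a socket straight to one of
# them, with no intermediate load-balancer hop in the data path.
for family, socktype, proto, canonname, sockaddr in records:
    print(sockaddr[0])
```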

No Pre-Warmup Required with Route 53

Finally, Route 53 is more graceful about handling the unpredictable log event bursts that arrive with no notice. ELB typically needs to be pre-warmed (via a request to AWS support) ahead of an anticipated traffic spike; Route 53 simply answers DNS queries, so there is no warm-up step to plan for.

Summary: Route 53 Makes Good Sense for Applications with High, Unpredictable Loads

Route 53 turned out to be the best way for Loggly to take advantage of our high-performance collectors given our huge log volumes, unpredictable variations, and the constant growth in our business. It aligns with the collectors’ core purpose: to collect data at network line speed with zero loss. And it allows us to benefit from the elasticity of all of the AWS services we use at Loggly.


24 comments

  • Glenn Poston

    2 months ago

We’re considering using ELB for routing DB traffic and are weighing a few approaches (ELB, HAProxy managed by consul-template, and Route53). A couple of people have asked how Loggly handles DNS failover… It would be great for the failover to be instantaneous, but the lowest TTL you can set in Route53 is 1m. Have you found a way around this?

    • Manoj Chaudhary

      2 months ago

Hey Glenn,
If you are just routing DB traffic, I am assuming the requests per second are not as high as what we have here at Loggly (this is just an assumption). If my assumption is true, I think using ELB or ALB (Application Load Balancer) from AWS is the best way to go. Have a record set in Route53 which points to your ELB or ALB, which then points to your RDS instances.

  • Ben

    2 months ago

    Hey there, I work for AWS and would like to clarify that the port limitation is only for EC2-Classic ELBs. By now, the majority of our customers are in VPC, and a Classic ELB in VPC allows you to choose any port. Also, we have Application Load Balancers which also allow you to choose any port.

The UDP limitation would be a valid reason not to use ELB.

  • timo

    2 months ago

I suppose it depends on the protocol you use. If you are agentless – like mentioned with rsyslog and UDP – then there is no way to have your logs ACKed by the collector anyway. But if you have another protocol supporting ACK, then your agent/protocol will miss the ACK from the collector and switch over to another instance (another DNS query). Sure, in that case you will have a delay until your messages are handled, but it’s not missing any of the messages.

  • Jonathan

    3 months ago

    AWS has solved Issue #1, right?

    • Manoj Chaudhary

      3 months ago

No. In classic mode [EC2-Classic], ELB only allows listening on the following ports: 25, 80, 443, 465, 587, 1024-65535. It doesn’t allow listening on 514.

  • Atman Dhol

    3 months ago

    How do you scale your collectors?

  • Dawn K

    5 months ago

    Great article.

  • George Stein

    7 months ago

    Grammar error in your sentence above – It says “ELB does supports load balancing only on …”
    s/b
    “ELB does support load balancing only on …”

    • Karen Sowa

      5 months ago

      Hi George,

      Thanks for the feedback, we appreciate it! It should show correctly now.

      – Karen

  • Soliman

    8 months ago

Thanks for sharing your experience, great job! However, I’d like to ask how you implemented and configured auto scaling, especially since you have to add the new IP address of each added server as a record set in Route 53, or remove it from the record sets when you scale down.

  • Tarun

    10 months ago

Great article explaining the details in a methodical way. Thanks for sharing this.

  • Feras

    10 months ago

I’m a bit confused, but you say you can’t afford to lose logs. Well, with DNS failover, you are bound to lose at least a TTL’s worth of logs in case a backend goes down. So even if the TTL is something extremely low like seconds, for you that can be a huge amount of loss. I’m actually considering the same thing you are describing in the post, but for my use case a downtime of <= TTL is not reasonable. How do you handle that?

  • Matt Billenstein

    10 months ago

How do you handle TTLs? AFAICT, you need something like a 60s TTL, but in the event of a failure, your service seems to be unavailable to clients having cached those records for up to 60s.

And with a 60s TTL, DNS queries get expensive: Route53 is $400 for the first billion…

  • Abhishek

    11 months ago

Have a query… Does Route 53 health check failover pricing apply to each health rule or to each health check interval?

  • Manoj Chaudhary

    1 year ago

John, very good question. We use health checks for non-UDP traffic. We have built a home-grown monitor which checks whether UDP is working normally or not.

  • John

    1 year ago

Manoj, great article. I had a question. We simply send UDP traffic, nothing else, to 2 EC2s. I am using Route53 simple round robin load balancing. My question: I cannot create a health check for a UDP port, so what is the best way to monitor the health of my data collector? You mentioned “we take advantage of Route 53 failover health checks”. How would I achieve health checks on UDP ports? I’ve got one server with bare Apache and I created a health check for that, but it isn’t monitoring that UDP port in case my Java app dies for some reason. Thanks again for the GREAT article… helped me a ton.

  • Bobby

    1 year ago

Sorry for adding a comment a year after the fact, but how do you handle SSL server certificates in a way that is secure across those servers? i.e. not passing .key files around between instances. I would rather use Route 53, but ELB’s shared server cert mechanism is a huge advantage.

    • Manoj Chaudhary

      1 year ago

      Hey Bobby,
We do the SSL termination on our servers. Yes, unfortunately, if you want to use Route53 as it stands, you need to have the keys on the servers.

  • Malcolm Turnbull

    1 year ago

Excellent blog post. I am glad to see that some people actually understand what high availability and performance are about! When your application layer is configured for true scalability, then it definitely won’t require sticky or persistent connections. Therefore Route 53 is definitely the way to go. We had a customer using our load balancer on AWS for a UDP application and we told them the same thing: If you don’t need persistence then use Route 53. If you need persistence then use our Loadbalancer.org appliance or any other decent solution such as F5 or Kemp Technologies.

  • Steve

    1 year ago

Not bad, but I’d have liked a bit more technical detail on how exactly you set up Route53 and the servers. If I read correctly, you have used weighted routing and failover, which is exactly what I’m sort of looking at, but I’m not sure how this will impact SSL configs, and how to set it all up (example).

Still, a good article that reinforces my view that it can work!
