Amazon Web Services Elastic Load Balancer (AWS ELB) enables websites and web services to serve more user requests by adding servers as needed. An unhealthy ELB can take your website offline or slow it down dramatically.
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances. It enables you to achieve fault tolerance in your applications, seamlessly providing the required amount of load balancing capacity needed to route application traffic.
Elastic Load Balancing publishes data points to Amazon CloudWatch for your load balancers and your back-end instances. CloudWatch enables you to retrieve statistics about those data points as an ordered set of time-series data, known as metrics.
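As a rough illustration of retrieving such a time series, here is a minimal sketch using the boto3 CloudWatch client. It assumes a Classic Load Balancer (namespace `AWS/ELB`, dimension `LoadBalancerName`) and configured AWS credentials; the helper names `build_metric_query` and `fetch_datapoints` are hypothetical and only serve this example.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(load_balancer_name, metric_name, minutes=60, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Assumes a Classic Load Balancer, which publishes to the AWS/ELB namespace.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ELB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per data point
        "Statistics": ["Average"],
    }


def fetch_datapoints(load_balancer_name, metric_name):
    """Return the metric's data points ordered by timestamp."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    query = build_metric_query(load_balancer_name, metric_name)
    response = cloudwatch.get_metric_statistics(**query)
    # CloudWatch does not guarantee ordering, so sort by timestamp.
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

Calling `fetch_datapoints("my-elb", "UnHealthyHostCount")` (with a placeholder load balancer name) would return the last hour of one-minute averages, provided credentials and a region are configured.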
Let’s look at UnHealthyHostCount, the number of target EC2 instances that are considered unhealthy. By itself, it does not say much. Consider this question: is a value of 2 for the UnHealthyHostCount metric good or bad? And how good (or bad) is it?
Usually, when I ask this question, the answer is, “It depends on how many EC2 instances the ELB has.”
It’s important to note that alerting on a raw metric such as the unhealthy host count is not very useful. Even if we choose a number that works now, we may later change the size of the cluster that the ELB is serving (or simply use an Auto Scaling group), which renders the alert useless.
LogicMonitor introduced Complex Datapoints, which let us define the metric as a ratio instead:
UnHealthyHostRate= UnHealthyHostCount / (HealthyHostCount + UnHealthyHostCount)
A complex datapoint that calculates the percentage of unhealthy hosts. More than 50% unhealthy hosts is considered critical.
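To make the ratio concrete, here is a minimal Python sketch of the same calculation. It is not LogicMonitor's datapoint syntax; the function names are hypothetical, and the 50% threshold is assumed to apply to the unhealthy percentage.

```python
def unhealthy_host_rate(healthy, unhealthy):
    """Fraction of registered hosts that are unhealthy, or None if no hosts."""
    total = healthy + unhealthy
    if total == 0:
        return None  # no registered hosts, so the rate is undefined
    return unhealthy / total


def is_critical(healthy, unhealthy, threshold=0.5):
    """True when more than `threshold` of the hosts are unhealthy."""
    rate = unhealthy_host_rate(healthy, unhealthy)
    return rate is not None and rate > threshold
```

This answers the earlier question: 2 unhealthy hosts out of 20 gives a rate of 0.1, while 2 unhealthy hosts out of 3 gives roughly 0.67 and would be flagged as critical.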
A complex datapoint that calculates the percentage of 5xx errors out of the total requests.
HTTPCode_Backend_5XXRate_Rate = HTTPCode_Backend_5XX / RequestCount
A complex datapoint that calculates the percentage of 4xx errors out of the total requests.
HTTPCode_Backend_4XXRate_Rate = HTTPCode_Backend_4XX / RequestCount
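The error-rate datapoints follow the same pattern. The sketch below is a hypothetical Python helper, not LogicMonitor syntax; it assumes the inputs are the CloudWatch sums of HTTPCode_Backend_4XX, HTTPCode_Backend_5XX, and RequestCount over the same period, and it treats a period with no requests as a rate of zero.

```python
def error_rate(error_count, request_count):
    """Share of requests that ended in the given error class."""
    if request_count <= 0:
        return 0.0  # assumption: no traffic means no error rate to report
    return error_count / request_count


def backend_error_rates(count_4xx, count_5xx, request_count):
    """Return both backend error rates for one collection period."""
    return {
        "4xx_rate": error_rate(count_4xx, request_count),
        "5xx_rate": error_rate(count_5xx, request_count),
    }
```

For example, 5 backend 5xx responses out of 200 requests yields a 5xx rate of 0.025, which can be compared against a fixed percentage threshold regardless of traffic volume.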
A complex datapoint that calculates how full the surge queue is, based on the SurgeQueueLength metric. SurgeQueueLength is the total number of requests (HTTP listener) or connections (TCP listener) that are pending routing to a healthy instance. The maximum size of the queue is 1,024. Additional requests or connections are rejected when the queue is full. For more information, see SpilloverCount.
SurgeQueueRate= SurgeQueueLength / 1024
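As a small sketch of this calculation in Python (the function and constant names are hypothetical), the queue length is divided by the 1,024 maximum stated above:

```python
MAX_SURGE_QUEUE = 1024  # maximum surge queue size stated in the text


def surge_queue_rate(surge_queue_length):
    """Fraction of the surge queue that is currently in use (0.0 to 1.0)."""
    return surge_queue_length / MAX_SURGE_QUEUE


def queue_is_full(surge_queue_length):
    """True once the queue is at capacity and new requests get rejected."""
    return surge_queue_length >= MAX_SURGE_QUEUE
```

A queue holding 256 pending requests is 25% full; once the rate reaches 1.0, further requests are rejected and show up in SpilloverCount.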
SpilloverCount is the total number of requests that were rejected because the surge queue is full.
[HTTP listener] The load balancer returns an HTTP 503 error code.
[TCP listener] The load balancer closes the connection.
BackendConnectionErrors is the number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate. Note that this count also includes any connection errors related to health checks.
While using Anomaly-Detection and static thresholds for key metrics is expected, there are other use cases.
Enabling Anomaly-Detection for requests per second can identify unexpected load, which can be the result of a denial-of-service (DDoS) attack on the AWS ELB.
Conversely, requests dropping to zero can indicate a remote error (e.g., an IoT device that has stopped collecting signals).
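To illustrate both cases, here is a deliberately simple sketch that flags a request rate far from its recent history using a z-score, plus a separate check for traffic dropping to zero. This is a stand-in technique for illustration only, not LogicMonitor's Anomaly-Detection algorithm; the function names and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it is more than z_threshold std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # flat history: any change is unusual
    return abs(current - mu) / sigma > z_threshold


def dropped_to_zero(history, current):
    """True when traffic stops after there was recent traffic."""
    return current == 0 and any(value > 0 for value in history)
```

With a recent history of about 100 requests per second, a sudden reading of 500 is flagged as unexpected load, and a reading of 0 is caught by both checks.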
Ran Gilboa is an employee at LogicMonitor.