LogicMonitor has always been committed to providing the most comprehensive insight into your datacenter. Our array of executive dashboards tracks performance metrics with up to two years of history and provides at-a-glance views of your infrastructure’s present status. Now, with alert forecasting, LogicMonitor can also predict the state of your infrastructure up to three months out.
Several years back, the hashtag #monitoringsucks took the IT world by storm. Among monitoring users’ chief complaints was the lack of actionable data that could be drawn from typical monitoring tools. IT blogger James Turnbull notes:
“Nagios can trigger [alerts] on a percentage threshold, for example the disk is 90% full… [but it] doesn’t tell you the most critical piece of information you need to make decisions: how fast is the disk growing.”
This is a problem many IT professionals face. Monitoring has historically been relegated to being the bearer of bad news: it warns you about performance issues, but does not help strategize a resolution. Alert forecasting tackles this challenge in a new and exciting way by predicting metrics’ trends and letting you know when they will reach a designated alert threshold. With this information, you can prioritize the urgency of performance issues, refine budget projections, and strategize resource allocation.
Let’s take a look at some of our most popular alert forecasting uses.
Continuing with Mr. Turnbull’s example of disk usage, let’s say you receive a warning indicating a disk is 80% full. With 20% of your disk space still available, this alert may not induce any sense of urgency. But it does raise some questions: how quickly is disk space filling up? When will you receive a critical alert that the disk is, say, at 95% capacity? To answer these questions, you need alert forecasting. Using the historical data collected on this disk, alert forecasting applies predictive analytics algorithms to chart a future trajectory of disk usage. This trajectory shows how much time will pass before the alert becomes critical and requires immediate attention. In this case, knowing how quickly your free disk space is diminishing allows you to create a strategic timeline for how and when you will need to allocate resources to increase available disk space.
Forecasting predicts that disk space usage will surpass 95% within 28 days.
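To make the arithmetic concrete, here is a minimal sketch of the simplest possible version of this kind of projection: fit a straight line to recent disk-usage samples and solve for the day the trend crosses a threshold. The sample values are invented, and this is not LogicMonitor’s actual implementation (which is described further below).

from datetime import datetime, timedelta

# Hypothetical daily disk-usage samples (percent full); illustrative data only.
samples = [74.0, 74.6, 75.1, 75.9, 76.4, 77.2, 77.8, 78.5, 79.1, 80.0]

def days_until_threshold(values, threshold):
    """Fit a least-squares line to the samples (one per day) and return how
    many days until the trend reaches the threshold (None if it never will)."""
    n = len(values)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / \
            sum((x - mean_x) ** 2 for x in xs)
    if slope <= 0:
        return None  # usage is flat or shrinking; no crossing predicted
    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (n - 1))  # days past the most recent sample

days = days_until_threshold(samples, 95.0)
if days is not None:
    eta = (datetime.now() + timedelta(days=days)).date()
    print(f"Trend crosses 95% in about {days:.0f} days (around {eta})")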
Another common use of LogicMonitor’s forecasting is financial budgeting. Take, for example, one of LogicMonitor’s most popular datasources, AWS Billing. As the IT industry continues its shift to cloud computing, AWS has become a major line item in many IT budgets. Using alert forecasting on our AWS Billing datasource lets you project what your monthly or quarterly spend will be. This is an invaluable tool for everyone involved in drafting your IT budget, including CFOs, CIOs, and Directors.
The benefits of alert forecasting are enormous, but there is always one question that hounds any sort of predictive analytics tool: how accurate are the forecasts? The short answer for LogicMonitor’s alert forecasting is “extremely accurate.” For a longer and probably more convincing answer, let’s delve into how alert forecasting works.
Before we start forecasting, we ensure your historical data is robust by applying a seasonal hybrid ESD model. In layman’s terms, this algorithm corrects data gaps and removes isolated outliers that don’t represent your metric’s overall trend. Once the data is primed, we use a series of ARIMA and Holt-Winters statistical models, similar to those used in businesses’ fiscal projections and economic forecasting, to graph a trajectory of your metric’s future performance. One particularly cool feature of our alert forecasting is the “minimum confidence to include” field, which lets you designate the minimum level of confidence that LogicMonitor’s algorithms must have in order to predict an alert. For instance, a 90% minimum confidence indicates there is at least a 90% chance that an alert will be triggered by a certain date. Just as LogicMonitor is committed to never sending you a false alert, we are also committed to making the most accurate forecasting projections!
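As a rough illustration of how a “minimum confidence to include” check might work, the sketch below fits a simple Holt’s linear trend (a basic relative of Holt-Winters), estimates forecast uncertainty from in-sample residuals, and reports the approximate probability that the metric crosses a threshold by a given horizon. It is a simplified stand-in, not LogicMonitor’s production models; the data, smoothing parameters, and normal-error assumption are all illustrative.

import math

def holt_forecast(values, horizon, alpha=0.5, beta=0.3):
    """Double exponential smoothing (Holt's linear trend). Returns the point
    forecast `horizon` steps ahead and the in-sample residual standard
    deviation; a rough stand-in for production ARIMA/Holt-Winters models."""
    level, trend = values[0], values[1] - values[0]
    residuals = []
    for y in values[1:]:
        residuals.append(y - (level + trend))   # one-step-ahead forecast error
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    point = level + horizon * trend
    std = math.sqrt(sum(r * r for r in residuals) / max(len(residuals) - 1, 1))
    return point, std

def probability_above(values, threshold, horizon):
    """Approximate probability the metric exceeds `threshold` within `horizon`
    steps, assuming normally distributed forecast error (a simplification)."""
    point, std = holt_forecast(values, horizon)
    if std == 0:
        return 1.0 if point >= threshold else 0.0
    z = (point - threshold) / std
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

usage = [74.0, 74.6, 75.1, 75.9, 76.4, 77.2, 77.8, 78.5, 79.1, 80.0]
confidence = probability_above(usage, threshold=95.0, horizon=28)
print(f"Confidence of crossing 95% within 28 days: {confidence:.0%}")
print("Forecast an alert" if confidence >= 0.90 else "Below the 90% minimum confidence")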
At LogicMonitor, we go beyond alerting on your environment’s performance issues. We want to be a resource for optimizing your entire IT strategy. To learn more about the alert forecasting tool, including how to set one up for yourself, visit our Alert Forecasting support page.
This table uses the past year of data to project the number of days until media wearout, CPU, 5-min load average, and disk usage datapoints surpass their designated thresholds.
Recently a customer contacted our support team for help setting up alert thresholds on a custom datasource they had written. Upon inspecting the datasource in the customer’s LogicMonitor portal, I realized they were monitoring an HL7 feed, which I found intriguing.
HL7 stands for Health Level 7, and refers to the standards and methods of moving clinical data between independent medical applications (commonly used within hospitals). The feed consists of human-readable text messages broken up into segments, much like an email. One of the most common types of HL7 transmissions is the ADT message, which is used to update admissions, discharges, and transfers within a patient’s clinical data record.
Below is an example of an HL7 message:
MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20150529090131-0500||ADT^A01|01052901|P|2.5
EVN||201505290901||||201505290900
PID|||56782445^^^UAReg^PI||DOE^JOHN||19620910|M||2028-9^^HL70005^RA99113^^XYZ|200 E 6TH ST^^AUSTIN^TX^30
OBX|1|NM|^Body Height||6|f^Feet^ISO+|||||F
OBX|2|NM|^Body Weight||180|lb^Pounds^ISO+|||||F
AL1|1||^ASPIRIN
In the message above you’ll notice that each segment begins with an identifier, followed by that segment’s relevant information. For example, PID identifies the patient and contains their date of birth and home address. Observations (vitals) and allergies are noted within the record as well to prevent any potentially dangerous reactions to prescriptions.
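For readers curious about the mechanics, the snippet below splits the sample message above into segments and pipe-delimited fields. It is only an illustration of the structure; production integrations would normally use a dedicated HL7 library.

# Split the sample HL7 message into segments and pipe-delimited fields.
raw = """MSH|^~\\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20150529090131-0500||ADT^A01|01052901|P|2.5
EVN||201505290901||||201505290900
PID|||56782445^^^UAReg^PI||DOE^JOHN||19620910|M||2028-9^^HL70005^RA99113^^XYZ|200 E 6TH ST^^AUSTIN^TX^30
OBX|1|NM|^Body Height||6|f^Feet^ISO+|||||F
OBX|2|NM|^Body Weight||180|lb^Pounds^ISO+|||||F
AL1|1||^ASPIRIN"""

segments = {}
for line in raw.splitlines():
    fields = line.split("|")
    segments.setdefault(fields[0], []).append(fields[1:])

print("Message type:", segments["MSH"][0][7])         # ADT^A01 (patient admit)
print("Patient name:", segments["PID"][0][4])         # DOE^JOHN
print("Allergies:", [a[2] for a in segments["AL1"]])  # ['^ASPIRIN']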
With the constant updating of a patient’s data, the ADT feed allows information to be transmitted from a clinic or hospital information system to an external application or provider in near real-time. Once updated, the clinical data is usually accessed or needed in many different places, such as outpatient clinics or labs. Fortunately, our customer was not interested in following or recording personal data – which might have moved the monitoring system into scope for HIPAA compliance. Their goal was only to monitor the feed’s status. To do this, their datasource uses a webpage collector to query data from the feed’s API using a GET request. The feed’s status is determined using a datapoint that measures the amount of time since the last message was posted. Depending on how busy the hospital location is, there are times when it is entirely acceptable to have zero new messages posted in any particular feed. Once a considerable amount of time has passed with no new messages, however, the customer wanted to be alerted so they could investigate.
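A rough sketch of that datapoint logic follows. The endpoint URL, response field names, and staleness threshold are hypothetical stand-ins; the customer’s actual datasource used LogicMonitor’s webpage collector against the feed’s own API rather than a standalone script.

import json
import time
import urllib.request

# Hypothetical endpoint and response shape, for illustration only.
FEED_STATUS_URL = "https://hl7-gateway.example.com/api/feeds/adt/status"
STALE_AFTER_SECONDS = 4 * 60 * 60  # how long "no new messages" is tolerable

def seconds_since_last_message(url):
    """GET the feed's status and return the age of the most recent message."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = json.load(resp)
    last_posted = float(status["last_message_epoch"])  # assumed field name
    return time.time() - last_posted

age = seconds_since_last_message(FEED_STATUS_URL)
print(f"SecondsSinceLastMessage={age:.0f}")
if age > STALE_AFTER_SECONDS:
    print("Feed looks stale; this is the condition the customer alerts on.")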
After working with the customer on this particular issue, I came to realize just how flexible LogicMonitor is. In addition to the hundreds of devices, services, and apps LogicMonitor supports out of the box, it can be customized to meet virtually any monitoring need – even use cases we hadn’t thought of ourselves!
[Written by Chris Morgan, Senior Solutions Engineer at LogicMonitor]
At LogicMonitor, our monitoring philosophy is to provide customers with actionable intelligence. Great examples of actionable intelligence are the alerts we send you about performance issues in your IT infrastructure. Providing meaningful performance and health metrics is our bread and butter, but we want to avoid overwhelming you with alerts as overload often results in apathy, defeating the original purpose of monitoring.
Consider what happens when a Windows Server running a SQL database receives a credential change. Any new client request to that server will then fail, and every failure triggers a Windows Event. When your server has an issue and 100 different clients are trying to access it unsuccessfully, you’ll see an event, and an alert, for each and every failure. This quickly becomes overwhelming, and you’ll probably turn off EventSource alerting to avoid the alert storm. Your frustration in this case would be understandable – a single Windows Server can be responsible for thousands of event alerts in a very short time period. But turning off event alerting has potentially dire consequences: you can miss crucial events you actually need alerting on, so you’re throwing the baby out with the bath water.
To help you deal with this, LogicMonitor has implemented a new feature to suppress duplicate alerts. Windows (and LM, historically) treats each event as a separate alert. The new feature allows you to alert for a particular event, but suppress duplicates for a given time period. Where previously you’d receive an alert storm, now you’ll simply get one alert for the time period you specify (default 1 hour). You’ll be able to give the alert the attention it needs because you’re not snowed in by the storm.
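Conceptually, the suppression works like the sketch below: remember when each distinct event last produced an alert and swallow repeats inside the window. This is an illustration of the idea, not LogicMonitor’s internal implementation, and the event details are invented for the example.

import time

# One alert per distinct event per window; repeats inside the window are
# swallowed. Illustrative only, not LogicMonitor's internal implementation.
SUPPRESSION_WINDOW = 60 * 60   # one hour, matching the default described above
last_alerted = {}              # event signature -> time the last alert was sent

def should_alert(event_source, event_id, message):
    """Return True for the first occurrence of an event within the window."""
    signature = (event_source, event_id, message)
    now = time.time()
    last = last_alerted.get(signature)
    if last is not None and now - last < SUPPRESSION_WINDOW:
        return False           # duplicate inside the window: suppress it
    last_alerted[signature] = now
    return True

# 100 clients all failing to log in generate only a single alert:
alerts = sum(
    should_alert("MSSQLSERVER", 18456, "Login failed for user 'app_user'.")
    for _ in range(100)
)
print(f"alerts sent: {alerts}")  # -> 1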
The result is that Windows Event alerts will no longer overwhelm the alert tab or your inbox. You can keep event alerting in place, without risk of alert overload. Configure duplicate alert suppression as an option in EventSources. It’s a new feature for all Windows and LogWatcher EventSources.
This capability is turned on by default for new EventSources in LogicMonitor’s latest release (v.49). We are not overwriting customers’ existing EventSources, but we *highly* recommend that you turn it on for pre-existing EventSources and LogWatchers. Turn it on and watch as that event alert noise gets turned down and lets you focus on the more important actionable information in your infrastructure.
Here at LogicMonitor we serve hundreds of managed service providers with our cloud-based performance monitoring platform. “MSPs” range from true-blue service providers, to Cisco VARs, to cloud providers and system integrators. These MSPs typically use LogicMonitor to monitor their own equipment and hosted apps. Many will also drop a LogicMonitor collector at a customer site or remote sites to monitor critical customer-side infrastructure. Our flexible deployment model avoids the hassle of setting up the VPNs that legacy premises-based monitoring tools require.
Lately I’ve spent a lot of time in airports and on freeways to better understand our MSP customers, and I’m amazed by their expert use of LogicMonitor and their willingness to share best practices to help other MSPs monitor better (and make more $!). I’d like to thank a couple of MSPs in particular — Sagiss (in Big D) and CIO Solutions (in our hometown of Santa Barbara) — for contributing to help us build a best practices guide for using LM within MSPs. Here’s the first in a series to help MSPs get the most out of LogicMonitor, and hopefully contribute to your success.
Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.
This past weekend we had a service-impacting issue lasting about 90 minutes that affected a subset of customers on the East Coast. This happened despite the fact that, as you may imagine, we have very thorough monitoring of our servers; error-level alerts (which are routed to people’s pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we ensure we always have on-call staff responsible for reacting to alerts, who are always reachable.
All these conditions were true this weekend, and yet we still had an issue in which no one was alerted for over an hour after the first alerts were triggered. How was this possible?
A company started a trial yesterday, added a number of Windows hosts, and immediately got warnings triggered that their hosts were “receiving 42 datagrams per second destined to non-listening ports…Check if all services are up and running.”
This occurred across many of their hosts, and was an issue they had been unaware of and whose cause they didn’t immediately know.
However, this morning we received an email:
“I need to share my excitement with discovering the cause of the UDP ‘storm.’ It was the Drobo Dashboard Service we had running on a Citrix XenApp server. Every 5 seconds, it was broadcasting to port 5002 searching for our appliance.
It was further amplified as we have Virtual IPs enabled on the Citrix server, resulting in what appeared to be a broadcast coming from each IP every 5 seconds.
We disabled that service and the UDP alarms have cleared. Thanks again.”
Their UDP error graph now looked much better:
While having 40 extra packets per second discarded by servers is not really going to affect them much (unlike the old days, when a few hundred broadcasts per second could freeze a computer entirely), the more things are controlled and understood, the better your datacenter will perform. Sources of hidden complexity can hinder troubleshooting, slow resolution, and lead to failures later on.
This is just an example of the ways LogicMonitor has you covered. There are many alerts that most people will never see – but it’s nice to know there are thresholds set that will help you get your infrastructure conforming to best practices – if you happen to slip.
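For the curious, a datapoint like this can be derived from the operating system’s own UDP counters. The sketch below samples the Linux kernel’s cumulative “NoPorts” counter twice and turns the delta into a per-second rate; the customer’s hosts were Windows, where an analogous perfmon counter exists, but the idea is the same.

import time

def udp_noports_total(path="/proc/net/snmp"):
    """Return the kernel's cumulative count of UDP datagrams received for
    ports with no listener (the 'NoPorts' column of the Udp line)."""
    with open(path) as f:
        rows = [line.split() for line in f if line.startswith("Udp:")]
    header, values = rows[0], rows[1]      # header row, then the counter row
    return int(values[header.index("NoPorts")])

before = udp_noports_total()
time.sleep(10)
after = udp_noports_total()
print(f"UDP datagrams to non-listening ports: {(after - before) / 10:.1f}/sec")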
We got a question internally about why one of our demo servers was slow, and how to use LogicMonitor to help identify the issue. The person asking comes from a VoIP, networking, and Windows background, not Linux, so his questions reflect those of a less-experienced sysadmin (in this case). I thought it interesting that he documented his thought processes, and I’ll intersperse my interpretation of the same data, and some thoughts on why LogicMonitor alerts as it does…