Unlike physical stores and organizations that operate during set hours, the IT world never sleeps.
In today’s highly connected digital environment, many believe that when an investment is made in technology, it should be accessible at all times — which is virtually impossible to guarantee. Since disruptions occur, organizations should evaluate the services needed to run operations smoothly. For example, what services are required during an IT service outage to ensure minimal disruptions?
This type of evaluation requires organizations to look at several metrics, including a system’s uptime (or reliability) and availability. Although these two metrics are often used interchangeably, they are not the same.
These two metrics lead us to the uptime myth. Uptime does not mean availability.
Understanding what these two metrics mean and what they don’t can help managed service providers (MSPs) create accurate and transparent agreements.
Uptime refers to the percentage of time a system is ready for operation under normal circumstances. This metric measures system, solution, or infrastructure reliability and most commonly refers to a computer.
So, uptime is how long a machine has been working and available, expressed as a percentage of time. However, uptime does not necessarily mean all services and applications are available and ready for use.
When looking at a service-level agreement (SLA), guaranteed uptime is determined by past performance. However, it is not an indicator of what will happen in the future.
So yes, uptime can be an indicator of availability, but it is by no means a guarantee.
The Great Google Outage of 2014 is an excellent example of why 100-percent uptime is impossible. During this outage, service to Google applications, such as Google+ and Gmail, was down for 25 to 55 minutes, affecting approximately 10 percent of users. This example shows the conflict that exists between IT reality and consumer expectations. Google, Facebook, Instagram, Azure, Salesforce, and others experienced further outages in the years that followed. Regardless, consumer expectations remain high.
IT professionals know that 100-percent uptime is a myth, which is why technology is so essential when aiming to deliver a level of service availability that ensures positive customer experiences.
In contrast, availability is the probability that a system will work as required when required. This metric is critical when a team is working remotely. Data shows that 16 percent of companies across the globe are now 100 percent remote, and since 99 percent of people would choose this option for the rest of their lives, even if it was just part-time, this percentage will likely rise in the coming years.
When comparing these two metrics, think of the difference between uptime and availability as similar to the difference between OEE (overall equipment effectiveness) and TEEP (total effective equipment performance). OEE measures performance against scheduled operating time, while TEEP measures it against all calendar time.
Understanding both of these metrics is important because incorrect assumptions can be costly. Misreading these metrics will often lead to a poor experience: service providers meet the thresholds in their agreement, but the level of service is lower than the customer expected.
This phenomenon is what’s referred to as the watermelon effect.
Outputs can meet defined targets, but the outcome is not achieved, leading to unhappy customers.
The watermelon effect is the byproduct of assuming your IT metrics meet all requirements while the customer is having a poor experience. Metrics look green on the outside, but on the inside, they are red.
SLA reports can look good, leaving the MSP happy. Meanwhile, dissatisfied customers take their business elsewhere. Customer experience is essential, so an MSP should never underestimate the importance of support metrics.
The greater the level of transparency around the actual end-user experience, the easier it is for IT teams to focus on helping the end-users. Diving deeper into “red” metrics helps IT teams focus their attention on what matters most. The best thing to do is lean in fast and hard, maximizing the time available to fix problematic metrics before the end of a project or quarter.
So, even if uptime metrics are met, it’s critical to consider customer experience, because clients may still feel the value of a service is missing. If engagement dips, the driving forces that encourage change drop, and businesses cannot accurately improve the issues that matter most to customers today.
The key is to identify issues before they become a problem. This can be achieved via an observability platform, such as LogicMonitor.
The “Five Nines” is a common term taken to mean 99.999 percent concerning uptime or availability.
For example, the availability level “1 Nine” signals 90 percent uptime, which equates to 36.5 days of downtime per year. As availability levels increase, so does the associated uptime. When companies advertise “5 Nines” availability, this refers to an uptime measurement of 99.999 percent or approximately 5.26 minutes of downtime a year.
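The relationship between the number of nines and yearly downtime can be checked with a few lines of Python. This is a minimal sketch that assumes a 365-day year; the function names are illustrative and not taken from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_for_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Return the downtime allowed per year, in minutes, at a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


if __name__ == "__main__":
    for n in range(1, 6):
        pct = availability_for_nines(n) * 100
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nine(s): {pct:.3f}% uptime, "
              f"{minutes:,.2f} minutes ({minutes / 1440:.2f} days) of downtime per year")
```

Running it reproduces the figures above: one nine allows 36.5 days of downtime per year, and five nines allow roughly 5.26 minutes.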
The Five Nines of uptime and availability is a significant selling point, which is why suppliers market a 99.999 percent uptime SLA. The issue is that, in some cases, each additional nine added to an uptime or availability score does not necessarily guarantee greater reliability. It’s more important for customers to evaluate a supplier or service provider based on its capabilities.
As a managed service provider, this is where you can shine.
For example, working with customers to develop a business continuity plan can make a difference when disruptions occur. To achieve the Five Nines, you must consider both equipment and personnel. Uptime and availability are not determined solely by equipment staying up; these metrics are also affected by how quickly the team responds when components fail. A business continuity plan is imperative.
This requires a proactive approach, such as using automated tools that allow you to better respond to an unexpected event.
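To illustrate why response time matters as much as equipment reliability, the sketch below uses the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula and the sample figures are assumptions added here for illustration; they do not come from this article or from a specific vendor.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR).

    mtbf_hours: mean time between failures (how often equipment fails)
    mttr_hours: mean time to repair (how quickly people and tools respond)
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: the same hardware, with a slow and a fast response.
    slow = steady_state_availability(mtbf_hours=1000, mttr_hours=4)
    fast = steady_state_availability(mtbf_hours=1000, mttr_hours=1)
    print(f"4-hour repair: {slow * 100:.3f}% available")
    print(f"1-hour repair: {fast * 100:.3f}% available")
```

With identical equipment, shortening the repair time is what raises availability, which is the case for pairing automation with a continuity plan.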
Although it’s important to be aware of critical metrics and company stats, SLAs can be fairly meaningless when customers seek an accurate measurement tool. To truly gauge the value of an agreement, companies must look at the bigger picture. SLAs require a certain level of commitment. Boasting a 99.99 percent SLA is excellent, but if there isn’t enough staff to assist when an issue occurs, this commitment is tough to meet. So, the higher the number of nines, the more resources are required.
This type of agreement often leads to a grey area and when issues occur, compensation is usually minimal or non-existent. For example, cloud providers will often provide their customers with service credits if there is an outage. However, these credits do not generally cover the costs. For example, an outage can negatively affect a company’s ability to trade or sell, resulting in lost revenue or a damaged reputation.
The “four-hour SLA response window” is another variable that businesses need to stay aware of when creating a disaster recovery plan. Suppliers will often include this four-hour window in their terms and conditions and while it sounds ideal, it doesn’t mean issues will be fixed within that window. Instead, it means the supplier will begin troubleshooting in that time. As a result, systems can be offline longer and often are.
To ensure outstanding customer service, some MSPs no longer offer a guarantee on the SLA they provide. Instead, they set service level objectives (SLOs). To measure compliance, service level indicators (SLIs) need to be considered, bringing us back to uptime. If an MSP offers 99.96 percent uptime on hosted servers, this value represents the objectives you strive to adhere to. The goal here is to measure compliance and avoid disputes.
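As a rough illustration of how an SLI can be compared against an SLO, the sketch below measures uptime over a 30-day window and checks it against the 99.96 percent objective mentioned above. The 30-day window, the "downtime budget" framing, and all names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli_percent: float     # measured uptime over the window (the indicator)
    slo_percent: float     # target uptime (the objective)
    budget_minutes: float  # downtime the objective allows in the window
    used_minutes: float    # downtime actually observed

    @property
    def compliant(self) -> bool:
        # Compliant when observed downtime stays within the allowed budget.
        return self.used_minutes <= self.budget_minutes


def evaluate_slo(slo_percent: float, downtime_minutes: float,
                 period_days: int = 30) -> SloReport:
    """Compare measured uptime (SLI) with the target (SLO) for one window."""
    total_minutes = period_days * 24 * 60
    sli = (total_minutes - downtime_minutes) / total_minutes * 100
    budget = total_minutes * (1 - slo_percent / 100)
    return SloReport(sli, slo_percent, budget, downtime_minutes)


if __name__ == "__main__":
    for downtime in (10, 30):  # hypothetical minutes of downtime in the window
        report = evaluate_slo(slo_percent=99.96, downtime_minutes=downtime)
        status = "met" if report.compliant else "missed"
        print(f"{downtime} min down: SLI {report.sli_percent:.3f}% vs "
              f"SLO {report.slo_percent}% ({status}; "
              f"budget {report.budget_minutes:.2f} min)")
```

At 99.96 percent over 30 days, the allowance works out to about 17 minutes of downtime, so small differences in outage length decide whether the objective is met.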
Additionally, it’s beneficial to create different SLAs for different workloads. For example, a cloud infrastructure service may require 5+ Nines to ensure greater functionality, whereas low-priority workloads can tolerate lower service availability.
There are calculations to measure uptime and availability. However, it can be challenging to measure these metrics.
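Network uptime, for example, refers to the time a network is up and running. This is tracked and measured by calculating the ratio of uptime to total time across 365 days, which is then expressed as a percentage.
Here is an example of how to calculate network uptime:
Network uptime (%) = (hours the network was up ÷ total hours in the period) × 100
A year has 24 × 365 = 8,760 hours. So, if your network is down for 7 hours during the year, network uptime would be:
(8,760 − 7) ÷ 8,760 × 100 = 8,753 ÷ 8,760 × 100 ≈ 99.92 percent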
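The yearly uptime calculation described here can also be expressed as a short Python function. This is a minimal sketch assuming a 365-day year (8,760 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a non-leap year


def uptime_percent(downtime_hours: float,
                   period_hours: float = HOURS_PER_YEAR) -> float:
    """Return uptime as a percentage of the measurement period."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return (period_hours - downtime_hours) / period_hours * 100


if __name__ == "__main__":
    # 7 hours of downtime in a year, as in the example above.
    print(f"{uptime_percent(7):.2f}%")
```

For 7 hours of downtime in a year, this gives approximately 99.92 percent.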
Again, uptime is directly related to past performance. That is why challenges arise. For example, a cloud solution may be available with a 99.999 percent SLA commitment. However, vulnerabilities and even cyberattacks can cause outages beyond the vendor’s control. If the service is affected for days, its availability drops.
For businesses to consistently track server uptime, there are also monitoring services.
When customers evaluate MSPs, they do so through the use of metrics. Relevant metrics are also monitored by those in management roles at MSP organizations. The goal here is to ensure they maintain suitable performance levels and actively improve key performance indicators (KPIs).
Uptime is usually covered under service improvement metrics. However, several other metrics are worth paying attention to as an MSP. These metrics leverage data to highlight deciding factors between IT providers.
In that sense, uptime and availability are essential to consider, but they are not the end-all for MSP metrics. There is a bigger picture to consider when monitoring managed services.
In today’s business environment, availability is becoming an increasingly important metric because of the transition toward remote work.
While both metrics matter, especially when creating SLAs, they are only part of the overall picture. It is not just the metrics that are important, but more so what you do with them.
To improve service uptime and availability, it is essential that customers understand we do not live in a perfect world. Communication is crucial, especially concerning the needs and requirements of customers.
In addition to running tests, implementing fail-safes, and working towards eliminating failure points, continuous monitoring is imperative. Monitoring availability provides clarity. That is how you build a highly available system.
Achieving 100 percent uptime is an unattainable objective. As discussed, uptime reflects past performance. In that sense, it is a valuable indicator of future availability, but it is not a guarantee, and 100 percent uptime remains virtually impossible.
That is why businesses must focus on project maintenance requirements and potential logistical delays. In doing so, they can more accurately forecast downtime to improve availability. To address consumer expectations, IT teams must anticipate problems and improve visibility to recover faster. Complete coverage, flexible integrations, and deep reporting will remain critical areas of focus to achieve this.