So, you are considering moving to the cloud. Or, more likely, you have already transitioned your services. If you are wondering what the next steps are, you have many options for moving toward cloud infrastructure. This blog will cover what to consider when using public clouds. Specifically, we will focus on monitoring a hybrid cloud setup that uses the big three cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Using multiple cloud platforms offers a variety of benefits, including greater fault tolerance and the ability to leverage new offerings from each of the cloud providers. A big downside, however, is having to monitor all of those cloud providers when you’re probably used to a single-pane view of your on-premise servers. That is not something that is easy to do across AWS, Azure, and GCP.

What to Monitor in a Hybrid Cloud Environment

Some of the things you will want to monitor are your usage of services, storage, network statistics, service health, and costs. While most cloud providers make this information readily available in their own portals, finding a way to simplify that view and show high-level information across multiple platforms is much more difficult.

This becomes more difficult if you are looking at multiple accounts, and especially if you are looking at multiple cloud environments. There are many things to consider when trying to aggregate all of this data into a single pane. Will my portal solution scale with my services? How can a portal deal with the ephemeral nature of many cloud services? How much customization do I have with this tool? Can I receive alerting on all the cloud services, and is that alerting customizable? How do I set up the monitoring? How much will it cost me in time or money?

Different Types of Hybrid Cloud Monitoring

Each cloud offers an in-cloud solution to monitor its own services. AWS has CloudWatch, which is an acceptable solution if all of your infrastructure is on AWS; however, it falls short on multi-cloud monitoring, and users often complain about the complexity of creating dashboards and the lack of options for exporting alarm and alerting data. Azure offers Azure Monitor, which comes with a similar limitation: it cannot fully support infrastructure outside of Azure, and the complaints largely focus on the inability to create customized metrics and dashboards. GCP also offers an in-cloud option, Stackdriver, which again only monitors GCP resources, and it requires users to pay extra for third-party integrations if they want to export alerts. Assuming you have a multi-cloud or hybrid on-prem and cloud infrastructure, what are you to do? Consider external monitoring tools.
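To see why aggregating this yourself is non-trivial, here is a minimal sketch of what pulling a single metric from just one provider looks like, using AWS CloudWatch via the boto3 SDK (the region and instance ID are hypothetical, and AWS credentials are assumed to already be configured):

```python
# Pull one hour of average CPU utilization for a single EC2 instance from CloudWatch.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical instance
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))
```

Multiply this by every metric, every account, and every provider (each with its own API, authentication, and metric naming), and the difficulty of a do-it-yourself single pane becomes clear.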

Free Monitoring Tools

When considering what tools to use, you’ll need to prioritize your needs and see what options are available to you. There are many free tools that can provide some or all of the features you need; the cost of those tools is just development and maintenance. Some can cover the bare basics, but it takes a lot of time and devotion to make them useful in a simple way. There is also the problem of keeping them updated and fully functional, along with the added cost of hosting the services yourself. Often the tool of choice at this tier is Nagios. Most Nagios deployments rely on an agent installed on every server to collect data, which adds extra load on the CPU and extra API calls. Because it is an agent-based Linux monitor, it can be used with most clouds, including AWS, Azure, and GCP. The UI options on many free tools are only as robust as you are willing to create and maintain.

Basic Monitoring Tools

Now, let’s take a look at what you get when you start paying for monitoring. Some solutions offer a simple view of what is happening on your individual cloud platforms, but without a way to dive deep, get alerting, or customize. These solutions are low cost, and they save you the time of logging in to multiple sites to get the information you need. Again, Nagios has an enterprise-focused tier that provides better support than the free do-it-yourself version. Most of these tools also require an installed agent on each server, costing you extra CPU and API calls. They may have better UI options out of the box but often fall short on customization.

Advanced Monitoring Tools

While the basic options work well for some organizations, there are also more fully integrated, scalable solutions available that allow you to customize your dashboards and views. Solutions in this range start to have more complex cost structures but also allow for more complex dashboards and more useful insights. Some solutions in this area will also allow alerting to be raised from the cloud platforms through to these solutions. While these solutions come with a heftier price tag, they come with a lot of added features. 

If you are looking for even more customization and alert handling, there are even more complex monitoring solutions available. With these, you can expect to get integration and scalability right out of the box. The real differences that you start to see are fully modifiable alerting and alert routing, often with the ability to control alert escalation for either severity or alert time. This alerting can include mail lists and the ability to acknowledge that alerts are happening and that someone is looking at them. 

Solutions in this category also should include a way to quickly add or remove services, even better if the monitoring solution does it for you. Often there is support for adding your own functionality on top of the already existing platform, everything from customizing how the data is collected and organized to creating custom collection methods to add specific data to your dashboard. 

Less Maintenance and More Flexibility 

With each of the tiers, installation and maintenance will vary. A more advanced cloud monitoring solution will offer simple ways to not have to constantly change the setup or manually add and remove data collection and monitoring. Solutions that have many moving parts can provide coverage, but a solution with fewer moving parts that provides a similar level of coverage will have less opportunity to fail. 

These cross-platform monitoring solutions grow in complexity and cost, but often provide a larger return on investment. The solution does more of the work for you, allowing you to focus on the parts of your business that require more nurturing attention. A complex system that monitors your cloud platforms across Amazon Web Services, Microsoft Azure, Google Cloud Platform, and your own on-premise solutions allows for the flexibility to gain insights and have a unified view of your systems. 
If you are looking for a platform that provides all of the advanced monitoring features listed above, check out LogicMonitor. Try it free, or book a free demo today.

When discussing monitoring with IT and technical operations teams, it comes as no surprise that every team has its own parameters and requirements for its particular environment. Some teams just need an up/down status of specific devices or interfaces, while others need more granular metrics like JVM or database performance. At the end of the day, though, everyone is responsible for a service. Your service might be a public-facing website, it could be internet access at a remote location, or it might be an internal tool, like a support ticketing system.

A Unified Approach to Monitoring

There has been a trend recently for practitioners and executives alike to want a unified view of the “health” of their service. What is commonly found, though, is that reporting on the overall status, or health, of a service is not straightforward. This is especially true for environments that are supported by diverse or ephemeral infrastructure. While most teams can define the key metrics that must be in good standing for their service to be healthy, they are stopped short of this goal by their disparate and/or inflexible tooling.

A majority of the technical teams that we talk to typically use a mix of monitoring solutions that were purpose-built for one layer of today’s infrastructure stack. They are using a tool (or module) for monitoring the network, another tool for monitoring the servers, and yet another tool for monitoring the applications and databases.  While some of these tools and modules can be integrated to give you a dashboard with metrics from each layer, they are not truly integrated and therefore can’t aid in correlating the performance of individual components with the health of the overall service.

A unified monitoring and reporting system recovers countless hours of time, but it is just as imperative that it connect the dots between the infrastructure and the team-level or business-level KPIs.

How We Monitor Business Intelligence 

Teams are frequently asked to diagnose or report on the overall health of their service, and for most, this is a daunting task.  Let’s take a true-to-life example. At LogicMonitor, the Business Systems team maintains a business intel portal that is consumed by numerous cross-functional teams throughout the company. For us, this portal is our service. To accurately report on the successful delivery of our service, we must have visibility into metrics that range from infrastructure performance to consumer-grade KPIs. 

A business intel service map dashboard in LogicMonitor

For this portal to operate effectively, a number of components must be performing optimally: the database-extraction application must send the data to Amazon’s S3, a third-party ETL-service pushes the data from S3 to our data warehouse, and an internally-built and hosted application then moves the data into our portal.
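As an illustration of what watching one link in that chain might look like, below is a rough sketch that reports the age of the newest object in an S3 bucket, so the extraction step can be alerted on if it stalls. The bucket name and prefix are hypothetical, and the boto3 SDK with AWS credentials is assumed:

```python
# Report how stale the most recent extract file in S3 is, in minutes.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix for the database-extraction output.
response = s3.list_objects_v2(Bucket="bi-extracts", Prefix="daily/")
objects = response.get("Contents", [])

if objects:
    newest = max(obj["LastModified"] for obj in objects)
    age_minutes = (datetime.now(timezone.utc) - newest).total_seconds() / 60
    print(f"newest_extract_age_minutes={age_minutes:.0f}")
else:
    print("newest_extract_age_minutes=-1")  # nothing found at all: worth alerting on
```

A threshold on that single number is often enough to catch a silently stalled pipeline before the downstream dashboards go stale.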

To nobody’s surprise, we use LogicMonitor to keep an eye on our ETL pipeline, as well as our portal’s successful delivery to our (internal) “customers”: everything from the throughput on our S3 buckets all the way to the number of weekly active users accessing our business intel portal.

When all is said and done, this is the joint responsibility of the team, from the engineers to management. We know that if our portal is not working properly, trust in the service deteriorates, and our active user count dwindles.

LogicMonitor’s customers are using the platform in exactly this way: to bridge the gap between the infrastructure that supports the mission-critical applications, and the metrics that are used to grade the success of these services.  This is only possible with a tool like LogicMonitor because it has been built from the ground up to provide comprehensive visibility into all layers of the infrastructure, as well as limitless flexibility to monitor any metrics deemed relevant to the health of the service.

LogicMonitor is what allowed Ted Baker to extract ERP insights relevant to both executives and IT teams. As Stuart Carrison, Ted Baker’s head of IT, stated, “What’s unique about using LogicMonitor is that we can provide information to the business about how the business is performing.”

In order to provide a less abstract example of the service-centric approach to monitoring, let us look at a common service that IT teams are responsible for: an internal ITSM tool like Jira.

A JIRA Service Health dashboard in LogicMonitor.

In this case, our dashboard is made up of underlying infrastructure components, along with counters of tickets pertaining to our Jira portal. As mentioned before, LogicMonitor provides a holistic picture of our service: from hardware health all the way through to team-level goals.

Key Takeaways

For those looking to build out their service-based approach to tooling and monitoring, here are a few recommendations and some food for thought as takeaways:

  1. Start from the KPIs of your service, and work backward. This not only fortifies your monitoring safety net, but also starts the conversation that bridges the business with the tech. With a tool like LogicMonitor in place, you have a unified and consistent method for attaining coverage of all of your relevant metrics.
  2. Especially when it comes to dashboards and reports, find the right balance between including all of the components that make up your service and not losing sight of the service itself. As is common in IT, one can “lose sight of the forest for the trees.”
  3. Finally, it’s about the people: use the platform as a way to introduce ease into your technical operations and clarity to all stakeholders across the business.

If you are interested in learning more about LogicMonitor’s business intelligence capabilities, connect with your customer success manager or attend a weekly demo to see LogicMonitor in action.

Background

Synthetic transactions facilitate proactive monitoring of online business services such as e-banking and e-commerce. They not only monitor existing services but can also be used to simulate and test the quality of future user interactions with services. LogicMonitor includes Website Checks out-of-the-box, but did you know that Scripted DataSources can be used to extend website monitoring into synthetic transactions? This is the first in a series of technical articles on extending LogicMonitor’s capabilities via two popular technologies: Python and Selenium.

It is important to note that the solutions presented in this article involve slight modifications to host environments. Your Collector hosts are yours to control, and with great power comes great responsibility. As you add more scripting and programming language support to your Collector hosts, be sure to refer to their documentation and adhere to recommended security practices. Also, leverage your IDE and code linters to point out aberrant coding practices. Finally, do your best to document changes to your environment and leave room to roll back modifications.

Performing Synthetic Transactions in Vanilla LogicMonitor

Out-of-the-box, LogicMonitor offers two forms of website monitoring:

  1. Standard Web Check, which uses web requests in order to confirm website availability from predefined LogicMonitor locations
  2. Internal Web Check, which uses a Collector host and user-defined Groovy script in order to confirm website availability from inside the customer’s network
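Conceptually, both checks boil down to making a request and confirming the response looks healthy. Here is a minimal sketch of that idea in Python (the product’s Internal Web Checks are actually written in Groovy, and the URL and expected text below are assumptions):

```python
# Minimal availability check: fetch the page, then report the status code,
# whether expected content was present, and the response time.
import requests

url = "https://www.example.com/"   # assumed target
expected_text = "Welcome"          # assumed string a healthy page contains

response = requests.get(url, timeout=10)
healthy = response.status_code == 200 and expected_text in response.text

print(f"status={response.status_code}")
print(f"healthy={int(healthy)}")
print(f"response_ms={response.elapsed.total_seconds() * 1000:.0f}")
```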

During my Professional Services engagements, some customers required deeper support using our Website Checks:

  1. Identifying HTML elements in a Document Object Model (DOM), especially elements and information related to login forms
  2. Accounting for JavaScript, particularly JavaScript redirects
  3. Scripting in Groovy
  4. Performing synthetic transactions such as completing a multi-factor authentication (MFA) process

The first concern is addressed in our support doc Web Checks with Form-Based Authentication. We also offer online support accessible via our website or the product, as well as Professional Services to help customers with Web Check configurations of any complexity. Here, I will address the remaining concerns.

Spotlight on Selenium

Front and center for our synthetic transaction solution is Selenium, a test automation technology particularly focused on web browser automation. If you have experience in front-end quality assurance, then you are probably well acquainted with it. Using the Selenium WebDriver, we can programmatically navigate to web pages, find specific HTML elements on a page, such as a login form, and interact with those elements.

Selenium easily addresses the concern of JavaScript redirects. Navigating simple to complex web pages is no challenge for Selenium because it executes the automation in an actual browser. You could grab popcorn and watch the magic if you wanted. Alternatively–and I do recommend this–you can execute your automation in headless mode, which hides the browser in exchange for increased execution speed.
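To make that concrete, here is a minimal sketch of a headless, scripted login using Selenium’s Python bindings (Selenium 4+ and a local Chrome install are assumed; the URL, element names, and credentials are hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # hide the browser in exchange for speed
driver = webdriver.Chrome(options=options)  # Selenium 4 can locate the driver itself

try:
    driver.get("https://app.example.com/login")  # hypothetical login page
    driver.find_element(By.NAME, "username").send_keys("synthetic-user")
    driver.find_element(By.NAME, "password").send_keys("not-a-real-password")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

    # One simple success criterion: the post-login page title mentions the dashboard.
    assert "Dashboard" in driver.title
finally:
    driver.quit()
```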

Selenium boasts cross-platform support for the following operating systems:

  1. Linux
  2. Microsoft Windows
  3. Apple Mac OS X

Scripting and programming language support is even more impressive:

  1. Python
  2. Java
  3. Ruby
  4. C#
  5. JavaScript
  6. R
  7. Perl
  8. PHP
  9. Objective-C
  10. Haskell

While the LogicMonitor product makes embedded Groovy and PowerShell scripts easy, you technically can configure your LogicMonitor Collector hosts to support any of the aforementioned languages. This would effectively expand your monitoring capabilities through LogicMonitor DataSources and Internal Web Checks (remember these execute from Collector hosts).
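As a sketch of how such a scripted check might report its results, the snippet below wraps a placeholder synthetic transaction in a timer and prints simple key=value pairs to stdout. Exactly how that output maps to datapoints depends on how you configure your DataSource or Internal Web Check, so treat the output format as an assumption:

```python
# Time a synthetic transaction and emit key=value pairs on stdout.
# run_synthetic_login() is a placeholder for a real Selenium routine like the one
# shown earlier; the key=value output format is an assumption to be matched to
# however your DataSource datapoints are configured to interpret script output.
import time


def run_synthetic_login() -> bool:
    """Placeholder for the actual browser automation."""
    return True


start = time.time()
try:
    success = run_synthetic_login()
except Exception:
    success = False
elapsed_ms = (time.time() - start) * 1000

print(f"login_success={int(success)}")
print(f"login_time_ms={elapsed_ms:.0f}")
```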

Partnering with Python

Whether you love it or hate it–Python often gets the job done. Stack Overflow, a common developer resource well within the top 100 most-visited websites in the world, has found Python to be one of the fastest-growing scripting languages, particularly in “high-income countries”.


This year, over a third of developers responding to a Stack Overflow survey said that they use Python, and a recent Dice article cited Python’s ease of use and flexibility as a significant factor in its popularity within the data science realm. Python is worth learning due to its sheer popularity alone. Thanks to that popularity and its compatibility with Selenium, finding code examples and getting started with web automation is easy.

Getting Started in 5 Steps

The following steps will help technical readers start performing synthetic transactions with LogicMonitor:

Potential Business Impacts

Synthetic transactions with LogicMonitor can positively impact your business, providing a competitive edge in MSP services portfolios and enabling internal IT teams to track the reliability of business services.

Conclusion

In this article, we introduced the execution of synthetic transactions in LogicMonitor using Selenium, a tried and true automation tool, and Python, one of the most popular scripting languages. In just 5 steps, the script-savvy can get started. Need help? Ask about a free LogicMonitor Professional Services scoping call. 


At LogicMonitor, we store almost 100 billion metrics a day – more than a million per second.  Some of our storage engines consequently have to deal with a lot of disk IO, so we have a heavy reliance on SSDs. Traditionally these have been Intel’s enterprise SSD SATA drives, but recently we’ve been moving onto PCIe SSDs. Which raises the question – why have we been upgrading systems to PCIe SSDs? How did we know it was needed?

As a SaaS-based performance monitoring platform, we can collect a lot of data about our drives and our systems.  But we’re only as good as the data we can get from the SSDs and that data isn’t always as comprehensive as we’d like.

With a traditional hard disk, the device utilization (percentage of time the device is not idle – i.e. busy) is a good indication of capacity.  If the graph shows the disk is busy 90% of the time, request latency will be increasing, so it’s time to upgrade drives, add more spindles, or move workloads.  However, with SSDs, as the iostat man page says:

"... for devices serving requests in parallel, such as RAID
 arrays and modern SSDs, this number does not reflect their
 performance limits."

This is because SSDs can deal with multiple requests at the same time – they are often optimized to deal with up to 64 (or more) requests at once, with no appreciable change in latency.  The fact that the device is utilized doesn’t mean it can’t do more work at the same time – utilized does not mean at maximum capacity. At least in theory…

For our workload, however, we found that device utilization is a good predictor of SSD performance capacity.  For example, looking at the following graphs:

A LogicMonitor graph of SATA SSD device utilization and write request latency

You can see that as the amount of time the drive is busy (device utilization) increased, so did the latency of write requests.

We can make that easier to see by changing the IO Completion time graph (which shows the time for operations to complete including queueing time) to not scale from zero, and not include reads:

A LogicMonitor graph of write IO completion time, rescaled and with reads excluded

Now it is a bit more apparent that when the device is busy, write latency went from an average of around 1.7ms to a sustained value of almost 1.8ms. This seems like a small change,  but it’s around a 6% increase in latency.

The absolute change in the number of disk operations is small, as seen below:

LogicMonitor graphs of SSD IOPS for the device, Little’s Law for writes, and the device queue size

Yet it is apparent that even a small change in the number of file operations, when combined with a small change in latency, creates a significant change in the number of operations queued or active in the SSD system (this follows from Little’s law – the number of objects in a queue or being processed is equal to the arrival rate times the processing time). For our application, that is enough to increase queueing internally.
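As a back-of-the-envelope sketch of that effect (the numbers below are rounded from the graphs described above and are assumptions, not exact measurements):

```python
# Little's Law: requests in flight L = arrival rate (lambda) x time in system (W).
writes_per_sec_quiet = 10_000   # approximate write rate before the busy period
writes_per_sec_busy = 10_500    # assumed slightly higher rate during the busy period
latency_quiet_s = 0.0017        # ~1.7 ms average write completion time
latency_busy_s = 0.0018         # ~1.8 ms sustained write completion time when busy

print(f"in flight, quiet: {writes_per_sec_quiet * latency_quiet_s:.1f}")  # ~17 requests
print(f"in flight, busy:  {writes_per_sec_busy * latency_busy_s:.1f}")    # ~19 requests
```

Small shifts in both rate and latency compound into a noticeably deeper queue.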

This kind of erratic queueing is an indication of overload in our system and means that the IO system is close to not being able to keep up, resulting in growing queues. If left uncorrected, the system would then lag behind the incoming data. That would get our operations team paged in the middle of the night, so it’s something we try to address in advance.

In this case, we could have spread the load to more systems, but instead we elected to put in a PCIe SSD.

This dropped the IO completion time by a factor of 100, to 0.02ms for writes, instead of around 2ms as it was previously:

A LogicMonitor graph of PCIe SSD IO completion time

Another interesting note was that the number of writes went from around 10,000 per second, with about 12,000 merged, to 22,000 writes per second and zero merged, with the same workload. No doubt this is due to the fact that the PCIe NVMe disks use the ‘none’ kernel IO scheduler, instead of the default CFQ scheduler that the SATA SSDs used. (As I’ve noted before, this is easy to test, and we did test different schedulers for the prior SSDs – but found no detectable difference, so we left them at the default.)
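If you want to check which scheduler your own devices are using, here is a quick sketch (Linux only; these are the standard sysfs paths, and the active scheduler is the bracketed entry):

```python
# Print the IO scheduler options for each block device; the active one is in brackets.
import glob

for path in glob.glob("/sys/block/*/queue/scheduler"):
    device = path.split("/")[3]
    with open(path) as f:
        print(f"{device}: {f.read().strip()}")
```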

A more problematic issue we noted is that the NVMe disks reported 100% busy time, and a queue size in the billions, both in LogicMonitor and in iostat. However, this turned out to be a kernel issue, due to older kernels not using atomic in-flight counters. With the kernel updated, the statistics are being reported correctly.

So will device utilization turn out to be a good predictor of drive performance capacity for PCIe SSDs, or will their ability to deal with many requests at the same time mean that we can’t rely on that metric for planning? Well, we are nowhere near 100% for these devices, and we haven’t yet run accurate workload simulations at this scale – but using FIO, we did find that driving these cards to near 100% required many workload threads to be running, and that doing so also increased latency.

But as always – the best indication of how your systems scale is to monitor the metrics that matter for your systems, as you scale them.  Drive utilization is an indicator, but not as meaningful as monitoring custom metrics for your applications (e.g. their own internal queues, or latency, or whatever is appropriate).  Make sure your applications expose metrics that matter and your monitoring system is easily extended to capture them.  But also keep capturing metrics like device utilization – more data is always better.

Last night, our server monitoring sent me a text alert about the CPU load of a server in our infrastructure I had never seen before. (I’m not normally on the NOC escalation – but of the usual NOC team, one guy is taking advantage of our unlimited vacation policy to recharge in Europe, and two others were travelling en route to Boston to speak at AnsibleFest.)  So I got woken up; saw the errant CPU; checked the server briefly via LogicMonitor on my phone; replied to the text with “SDT 6” to put this alert into Scheduled Downtime for 6 hours, and went back to sleep with the CPU still running over 90%.

How, you may ask, did I know it was safe to just SDT this alert, when I had never come across this server before? What if it was a critical piece of our infrastructure, and its high CPU was causing Bad Things? Its name told me.

While most may not see Microsoft as a ‘disruptive innovator’ anymore, they seem to be claiming exactly that role in the enterprise hypervisor space, just as they did in gaming with the Xbox. As noted in “VMware, the bell tolls for thee, and Microsoft is ringing it“, Hyper-V appears to be becoming a legitimate competitor to VMware’s dominant ESXi product.  As described in the article, people reportedly now widely believe that “Microsoft functionality is now ‘good enough'” in the hypervisor – and it’s clearly cheaper (in license terms, at least.)  So is this change in perception really turning into more enterprises choosing Hyper-V?

From LogicMonitor’s view of the situation, we can say that in our customer base, virtualization platforms have been almost entirely VMware in the enterprise and among most private cloud providers, with some Xen and XenServer in the cloud provider space. But we have also been seeing more Hyper-V deployments being monitored in the last 6 months. The absolute numbers are still a lot lower than the number of ESXi infrastructures being added to LogicMonitor, but the rate of growth in Hyper-V is certainly higher.

This sounds like a classic ‘low-end disruption’ case study from The Innovator’s Dilemma (Clayton M. Christensen), except for the fact that the innovator is a $250 billion company!

Right now, Microsoft seems to offer the ‘good enough’ feature set and enterprise features, and ‘good enough’ support, reliability and credibility, leading to some adoption in the enterprise datacenter. (From our biased point of view – the metrics exposed by VMware’s ESXi for monitoring are much better than those exposed by Hyper-V. But perhaps Hyper-V is ‘good enough’ here, too…)  There are lots of ways this could play out – VMware has already dropped the vRam pricing; Microsoft being cheaper in license terms may not make it cheaper in total cost of ownership in the enterprise; VMware is starting to push VMware Go, which could turn into a significant disruptor itself.

So can the $250 billion Microsoft really prove to be more nimble than the $37 billion VMware? History would suggest Microsoft will deliver a solid product (eventually). Hypervisors themselves are becoming commodities, so the high dollar value will shift upward to management. VMware may chase the upward value (like the integrated steel mills that were largely disrupted out of existence); they may go after the commodity space (reducing their profit margins, but possibly protecting their revenue). Or they may push VMware Go, Cloud Foundry, and other cloud offerings, disrupting things entirely in another direction.

Of course, there are many other possibilities that could play the role of disruptor in the enterprise hypervisor space: Citrix (Xenserver) and KVM spring to mind, but these (currently) tend to play better in the large data center cloud space, rather than the enterprise.

Still, VMware is very much in a position of strength and is well suited to lead the next round of innovation, which I see as the release of a product that allows VMs to move seamlessly from my own infrastructure to a cloud provider’s and back, while maintaining the control, security, performance (and monitoring) that IT is accustomed to. Let’s see if I am right. Fun times ahead!

– This article was contributed by Steve Francis, founder and Chief Product Officer at LogicMonitor

As the new hire here at LogicMonitor brought in to support the operations of the organization, I had two immediate tasks: learn how LogicMonitor’s SaaS-based monitoring works to monitor our customers’ servers, and, at the same time, learn our own infrastructure.

I’ve been a SysA for longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period of time while you spin up before you can provide some value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and making them your own before you can claim to be an asset to the firm.

But this time was different. With LogicMonitor’s founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health. A portal, by the way, is an individualized website where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources, from servers, storage, and switches to applications, power, and load balancers, just to name a few. And YES, we use remote instances of LogicMonitor to watch our own infrastructure. In SysA speak, we call this ‘eating our own dog food’.

As soon as I was given a login, I figured I’d kill two birds with one stone: familiarize myself with our infrastructure and see how our software worked. Starting at the top, I looked at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device with an IP or hostname, tell it what it is (Linux or Windows host, Cisco or HP switch, etc.), provide some credentials, and ‘Voila!’ out come stats and pretty graphs. Before me on our portal was all the monitoring information one could wish for from a Cisco switch.

On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for the months leading up to that time. For a sanity check, I looked at the same sensor in the switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.

Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to switch the graph around to look at just the date range I wanted. I even added the temperature sensor’s output to a new dashboard view. With my new-found data, I shared a graph with Jeff and Steve and asked, “Hey, guys, I’m seeing these erratic temperatures on our switches in Virginia. Is this normal?”

Jeff took a 3-second glance, scowled, and said, “No, that’s not right! Open a ticket with our datacenter provider and have them look at that!”

That task was a little harder. Convincing a datacenter operator they have a problem with their HVAC when all their systems are showing normal takes a little persistence. Armed with my graphs, I worked my way up the food chain of our DC provider’s support staff. He checked the input and output air temperature of our cabinet and verified there were no foreign objects disturbing airflow. All good there. On our end, we double-checked that we hadn’t made any changes that would affect load on our systems and cause the temperature fluctuation. No changes there. But on a hunch, he changed a floor tile for one that allowed more air through to our cabinet. And behold, the result:

Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13. I was poking around in LogicMonitor for the first time on Sept. 18 (literally, the FIRST TIME) and created the ticket, which got resolved on Friday, Sept. 21. You can see the moment when the temps drop and go stable again after the new ventilation tile was fitted. (In case you’re wondering, you can click on the data sources at the bottom of the graph to toggle their appearance on the graph. I ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center.)

Steve’s response to all this was, “Excellent! You’re providing value-add!  Maybe we’ll keep you. Now write a blog post about it!”

And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.

Or blogging.

No matter the kind of database – Oracle, SQL Server, MySQL, PostgreSQL, etc. – there are distinct kinds of monitoring for the DBA. There is the monitoring done to make sure everything is healthy and performing well, which allows you to plan for growth, allocate resources, and be assured things are working as they should.

Then there is the kind of in depth activity that DBA’s undertake when they are investigating an issue.  This takes far more time, and uses a different set of tools – the query analyzer, profilers, etc – but can have a large impact, and is where a good SQL jockey can really make a difference.  But given the amount of time that can be required to analyze and improve a query, when is it worth it? (more…)