“Hi. My name is Ismat Mohideen. I’m on the product marketing team here at LogicMonitor, and I’m joined by my colleague Chris Klein, our CIO and CTO strategy advisor, for this session. This is about the modern data center. It’s not a tech presentation. We’re not gonna demo anything for you. But we are going to describe what the modern data center revolution is and its impact from an observability perspective.
So we’re not trying to throw new terms at you to wow and surprise you. This is a presentation where we’d like you to be thinking about current trends, how you can be a change agent in your organization, and how you can seize the moment and really make an impact in your business.
And so I’d first like to start by talking about hybrid coverage.
According to a Gartner study, 90% of organizations will adopt hybrid cloud by 2027. Hybrid’s here to stay. It always was. We’re in that business, and we understand that our customers are under pressure to do more with less: to do more with their existing infrastructure and their budgets, without necessarily getting a new tool to monitor all the new things. And if a tool only applies to some of your use cases, you don’t get to retire the older tools, and your spend only goes up, not down. This isn’t something that’s new to us. We have been describing and talking about hybrid cloud for quite some time now.
And there’s also growth in data center capacity. According to a study from McKinsey, there’s going to be 22% growth in demand for data center capacity. This is really driven by the rapid adoption of AI and new AI workloads, which are leading to this surging demand for data center capacity.
And, quite honestly, data centers aren’t going to be able to keep up with the growth in demand generated by these new AI workloads.
And so hybrid is really the reality for how enterprises are going to be managing their infrastructure in this new age. There’s going to be significant investment in the data center, driven by AI workloads and the requirements that come with new GPUs, but it also includes the repatriation of workloads from the cloud to on-prem data centers to grow these new AI workloads.
And the reality here is that the modern data center is all of these things. It’s cloud, it’s on-prem, it’s hybrid, it includes AI. And from an observability perspective, what LogicMonitor is here to help you do is support this journey and consolidate your observability tools so that you can do more with less and be more efficient.
But what does that really mean for your observability and the way that you monitor your resources on a day to day basis as you’re thinking about growing and expanding your AI workloads?
So AI workloads need data center infrastructure. Let’s first talk about the data center issues. There is increasing demand for the data center infrastructure and power required to drive AI workloads. There are also supply constraints.
The chip manufacturers aren’t producing GPUs fast enough to get them into the data centers for people to even start using them and building their new models. Speaking of models, model training is driving increased use of the infrastructure, and as usage increases, compute demand is driven up as well. And the cost of AI compute is greater than the cost of general compute, so there are cost management implications at play.
AI workloads are also considered very brittle, and if something breaks, applications have a tendency to break down. They’re less fault tolerant, meaning you can’t get them back up and running quickly the way you might with an existing workload or application you already have in your business.
So there’s a whole gamut of issues around the data center being driven by the rise of new AI workloads and the demand around them. But let me pivot a little and describe why observability really matters in this era.
We’re not talking about data centers as a silo. These workloads and applications are connected to resources that are also running in cloud and containerized environments.
And you really need full visibility into the modern data center infrastructure, the hybrid environments, your existing systems, and your new systems to get the full picture of the total health of the applications you’re running.
You wanna ensure workload and service availability as well, to make sure your AI applications are ready and available when your customers need them. You also wanna maximize the value of your investments, so that you’re getting the most out of the investments you’re making in the data center while building out your new workloads and applications.
So let’s talk a little bit more about the problem with lots of different tools to manage all of the workloads and applications that you’re running.
And to do that, I’ll turn it over to Chris.
Chris, take it away.
Thanks, Ismat. And you’re absolutely right. There’s this problem in our business of so many tools because there are so many technologies.
And in the modern data center, the technologies are all over the place, and they’ll continue to be so. We don’t have the luxury in our world of getting rid of something every time something new shows up, so we wind up having more and more things to monitor, and that winds up being more and more tools. I had a customer years ago explain this problem to me in a very succinct statement that was galvanizing for me. He said it felt like he had fourteen watches on and still had no idea what time it was. Meaning, there were so many tools, he didn’t know which one to look at. And so I think the problem here is that you’re spending most of your time focused on coverage, which means you don’t get to move forward with becoming more proactive in how you use the tools.
If you think about an average week, you’re probably onboarding more devices, making sure that you have the right troubleshooting put into it. It’s all about care and feeding and making sure that you have things covered. And so you have this equivalent of so many watches, and you’re not actually becoming more mature.
I think there needs to be some more accountability for all of us in this space that we need to really make an attempt to carve some time out to not just do the basics of getting things onboarded, but really to use the tools to maximum effect.
So to do that, I wanna take you through an observability journey model that has four different stages in it. And it’s not terribly complex.
It just describes moving from a reactive to a proactive stance. And the way you can figure out where you are in this model is by answering a simple question: how do you know when there are performance problems? If you’re on an IT ops or cloud ops team and your users call to tell you there’s a problem, then you’re probably on the left side of this chart.
Maybe you find out there’s a problem before the users call, and you work on it dutifully. It’s still reactive. It’s not a bad thing. It just means when there’s a problem, you find it and you solve it. Or maybe you get over to the third stage, and you do such a good job at optimizing that you’re able to squash most of the problems before they actually pop up, before they actually affect end users.
Over on the far right side is what most of us are aspiring to, a term we’re referring to here as agentic AIOps. And you’ll hear more about that at the conference.
Ultimately, what we’re trying to do is get more value out of the tools we have, value that allows us to be stronger and solve a problem faster with less skill. The benefit to the business, then, is that a lower-skilled engineer, say a level one operator, should be able to solve a problem faster without having to escalate it to a higher-skilled resource, as they may have had to in the past. That’s how we move from the left side to the right side in a chart like this.
So as you get to be more proactive, the benefits are clear. You get to reduce your costs. You’ve probably already taken out a lot of the costs in your business that you can, and you’re looking for new ways to become more efficient and effective.
And as you become more proactive, that means not just more uptime, but also more work done in a shift per FTE.
I wanna take you through a model of the LogicMonitor architecture now and explain how using certain capabilities in the platform will help you move from whatever stage you’re in today to a higher stage. If you’re a stage two kind of organization today, largely reactive, maybe once in a while still focusing on getting coverage to devices that aren’t adequately covered, and frankly, that’s where a lot of enterprises are, then I wanna help you figure out how to get, in a year’s time, up to the third stage, where you’re proactive and spending much more of your time optimizing systems because you’ve solved a lot of the major problems.
This is a kind of talk track that you’ve heard about for years, but I feel like we haven’t executed well enough on the use cases that help us to actually move forward in a pragmatic way. So the architecture that you see here has four pieces on it, and I’m gonna take you through each one of those four. But I’m gonna start over on the right side first in a little bit of a backwards model to make better sense out of it.
First, we think about the different solutions that you’re seeking. What are the problems that you’re trying to have solved? Well, as Ismat said, you may have AI workloads that you’re trying to monitor. Maybe you’ve got devices in the cloud, and certainly things that are spread all over the place.
It might be multiple clouds. It could be on-prem. Maybe you’re trying to control the cost of those devices in the cloud as well. Maybe sustainability is a requirement for you because you’re trying to offset a carbon footprint, perhaps due to ESG requirements.
These are the kinds of things that should start our pattern rather than be the result of our pattern.
To get to these outcomes, then we have a variety of capabilities that will help us to achieve those things. I’m gonna come back and talk about those more in just a moment. To get to the capability outcomes, well, I gotta have some telemetry. This is the data that comes in.
And that telemetry is sourced in the modern data center that you learned about a few minutes ago. That modern data center lives on prem. It lives in the cloud. It lives in IoT.
It lives in AI workloads.
So hybrid, in the truest sense of the word, is here to stay. And these devices all emit a large variety of telemetry.
That telemetry takes all the shapes that you see here, metrics and events, logs, traces, and other things as well. And in the past, what you find is that vendors, when they’re kind of giving you a new watch, tend to focus a lot on which one of these telemetry types is their favorite, which signal do they like the most. If I’m an APM company, then I would come to you and suggest that a trace is the answer to your problem. If I’m a logging company, well, then I never met a problem that I couldn’t solve with logs.
You get the idea. And as we move forward, one of the things that needs to change from where we’ve been in the past is that we need to stop arguing about the kind of telemetry or where it comes from, because it’s largely been commoditized now. We need to get the data in as quickly and effectively as possible, and we need to ensure coverage of all the required things on the left side of that chart.
Of course, coverage is important.
But once I get that data in, I need to stop worrying about what kind of data it is and spend a whole lot more time on the platform itself, on the capabilities of what I’m using that data for.
Now, one thing most of you probably know, but if you don’t: when we collect telemetry in LogicMonitor, we do so with a collector architecture. This is an agentless architecture, which means you don’t have to deploy anything on the devices themselves. This is one of the first ways that you can get data in faster.
In the old days, you might have had to install an agent on a variety of different endpoints or resources.
And to deploy those agents in production, you would have to go through pre-prod. You’d have to test it. You’d have to have automation scripts. You’d have to go to a change review board.
You’d have to wait until off hours to do it. And collectively, that time was very, very long. So onboarding a tool took way too much time, and that’s why coverage became such a heavy focus of our efforts. By using a collector architecture, we don’t have to install anything on these devices.
All we have to do is discover them and input credentials so that we can poll those devices and basically ask them how they’re feeling. They’ll respond with a variety of telemetry when we poll them, the kinds you see here. And with that telemetry, we’ll ingest it into the platform and start to work toward the use cases of our capabilities.
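If it helps to picture that, here’s a minimal sketch in Python of an agentless, poll-based collection loop. The device records, credential names, and poll_device helper are hypothetical stand-ins, not LogicMonitor’s actual collector internals; the point is simply that nothing gets installed on the monitored device.

```python
# Minimal sketch of agentless, collector-style polling. The device list,
# credential names, and poll_device helper are hypothetical stand-ins,
# not LogicMonitor's actual collector internals.

DISCOVERED_DEVICES = [
    {"host": "10.0.0.11", "type": "linux-server", "credential": "snmp-ro"},
    {"host": "10.0.0.12", "type": "switch", "credential": "snmp-ro"},
]

def poll_device(device: dict) -> dict:
    """Ask the device how it's feeling over its management protocol.
    Stubbed here; a real collector would speak SNMP, WMI, or an API."""
    return {"host": device["host"], "cpu_pct": 42.0, "disk_used_pct": 71.5}

def run_poll_cycle(devices: list[dict]) -> list[dict]:
    # No agent deployment, no change board: just discover and poll.
    return [poll_device(d) for d in devices]

telemetry = run_poll_cycle(DISCOVERED_DEVICES)
print(telemetry)  # a real collector ships each batch to the platform
# In practice this repeats on a fixed poll interval.
```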
So that takes us to the capabilities themselves.
There are nine different capabilities here on the tiles. I’m gonna just talk through a few of them. Rather than thinking about these as features, I’d like to think about them in terms of capabilities that help you to achieve the outcomes that get you to be more mature in your observability journey. That’s how we move forward and how we monitor and manage the modern data center. So let’s run through that now.
The first one I wanna talk about is anomaly detection.
Anomaly detection is the idea of not necessarily knowing where a problem is at. Let me give you an example.
I know I have a problem, but I don’t know where it is. In the old way of doing things, I would manually set an alert for a high CPU, a disk that’s full, a network that’s having problems, or some such thing.
And I’d have to set that alert, first of all. And then when that alert occurred, I wouldn’t really know its root cause. For example, if a CPU is at a hundred percent, is that a root cause?
Of course it’s not. There’s no way that a high CPU alert can tell me how to fix the problem. I could maybe go restart something, but that’s just kicking the can down the road. It doesn’t fix anything.
So the alerts themselves don’t contain the root cause. They basically have the what but not the why. When we think about anomaly detection, it’s two things. Number one, it’s adjusting the alert against norms based upon patterns that we’ve seen.
So don’t just alert when something hits 90% CPU. Tell me if the CPU usage, and that might not be a great example, but you’ll get what I mean, is normal for this device at this time, or whether it’s anomalous.
We use something called dynamic thresholds to do that.
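To make the idea concrete, here’s a minimal sketch of a dynamic threshold in Python. It’s an illustration only, not LogicMonitor’s actual algorithm: it alerts against a per-device, per-time-of-day norm instead of a fixed line.

```python
# Minimal sketch of a dynamic threshold: alert against the device's own
# learned norm for this time of day, not a fixed 90% line.
# Illustration only; not LogicMonitor's actual algorithm.
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, k: float = 3.0) -> bool:
    """Flag the current sample if it falls outside mean +/- k standard
    deviations of what this device normally does at this hour."""
    if len(history) < 2:
        return False  # not enough history to establish a norm
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > k * max(sigma, 1e-9)

# CPU at 9 a.m. has historically hovered in the 70s: 85% is still normal
# here, even though a static 80% threshold would have paged someone.
nine_am_history = [70.0, 82.0, 75.0, 80.0, 72.0, 77.0]
print(is_anomalous(nine_am_history, 85.0))  # False: within the learned band
print(is_anomalous(nine_am_history, 15.0))  # True: unusually low is anomalous too
```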
But once I have the alert, now I wanna get to the root cause of it. And the way that I do that is with log events.
Log messages tell you the why. They tell you the root cause of something. A CPU got high because someone deployed new code and it didn’t work well or because something timed out. And so the log messages give you a view into what’s happening behind the scenes.
The problem with log messages is usually you have to know what question to ask of them, how to form a query.
And many times, a level one operator doesn’t know what question to ask or how to form that query. That’s what has required escalations in the past. So to make a level one engineer stronger, we need to do two things. First, I need to put the event with the alert.
I need to say, you have a problem. Your disk is full. Your CPU is high. And here are the logs that matter specifically for this problem from these machines at this time.
From those logs, we can pull anomalies: things that have never happened before, or things that have happened too many times. We do that using heuristics, basically AI-type algorithms, to pull from the logs in what we call a queryless fashion. Meaning, a level one engineer doesn’t have to know what question to ask or how to form that question, but just sees it naturally in the context of the workflow they’re using to solve the alert.
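Here’s a minimal sketch of that queryless idea in Python, assuming plain-text log lines. It illustrates the two signals just described, never-seen-before messages and messages firing far more than usual; it is not the actual LM Logs implementation.

```python
# Minimal sketch of queryless log anomaly surfacing. Illustration only:
# it shows the two signals described above, log templates never seen
# before and templates suddenly firing far more often than usual.
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse the variable parts (numbers) so similar lines cluster."""
    return re.sub(r"\d+", "<*>", line.lower())

def surface_anomalies(baseline: list[str], window: list[str], spike: int = 10):
    base = Counter(template(l) for l in baseline)
    win = Counter(template(l) for l in window)
    novel = [t for t in win if t not in base]
    spiking = [t for t, c in win.items() if t in base and c > spike * base[t]]
    return novel, spiking

baseline = ["GET /health 200"] * 500 + ["db timeout after 30s"]
window = ["db timeout after 30s"] * 40 + ["OOMKilled pod 123"]
novel, spiking = surface_anomalies(baseline, window)
print(novel)    # ['oomkilled pod <*>']: never happened before
print(spiking)  # ['db timeout after <*>s']: happening far too often
```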
When we do that, when we put them together, you can see more of this in another session on logs; I think it’s labeled as peanut butter and chocolate. You’ll see how we can very much streamline that triage and resolution process, and that’s how you get to a faster time to resolution. That’s how you move from stage one or two to stage three, and ultimately to four.
The next vignette I wanna talk about is called service insights. This is also a session that you can get more detail on here inside the conference.
Service insights is the concept of drawing a circle around all of the components that relate to a specific service or application and reflecting them as one overall unit of measure.
If a CPU is high or a disk is full, I’ll keep using those as simple examples, and they affect my ability to buy some shoes or tickets to the movies, then what I really wanna know is: is Chris unable to complete that transaction?
Is it too slow? Did it time out? Is the user experience not good?
And I’d like to be able to do that by measuring from the bottom up. The idea of measuring application performance has been around forever. That’s not what we’re talking about specifically.
It’s being able to do it easily and simply.
Normally, app performance is measured from the top down. What that means is that you have to put some kind of a trace on the application, using a product called APM. APM traces are very powerful, and there’s definitely a place where you wanna use them. But they’re also very expensive. They require a high level of expertise, and they’re very technically dependent on the kind of tech being used behind the scenes. You have to have certain versions of frameworks, and there are just a lot of dependencies.
And so it means that usually only a small number of applications wind up getting monitored that way. For the rest of the apps, you just have basic CPU, disk, and memory kinds of measurements. And with those, if a CPU is high or a disk is full, it’s really hard to know if it’s a problem. Maybe a CPU is high, but it didn’t hurt anything.
So don’t make that alert something I have to respond to in the middle of the night. That’s what a service insight is all about. Now, we’ve had this in our product for some time, but there’s a major step forward that we’re happy to share at this conference right now. With dynamic service insights, we can now get that bottom-up view, drawing a circle around infrastructure measurements to get an application level of visibility, and do so automatically.
In the past, doing this work took a lot of effort.
You’d have to manually add the different systems into a service and maintain the integrity of that service. And every time new code was deployed or new systems came online, someone would have to remember to go maintain the membership of that service. Now you don’t have to do that anymore. Using metadata, and we’ll get into that in a separate session, you can make these services dynamically. As new stuff comes online in your modern data center, it will automatically be formed into services, and then you can reflect the quality of those services to your users in a way that is not at all technical. That’s a way you can now take technical teams and business teams and line them up to speak the same language much more efficiently.
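As a rough sketch of that mechanic, assuming a hypothetical data model where each resource carries metadata tags (this is not the LM Envision implementation), service membership can be derived rather than maintained by hand:

```python
# Minimal sketch of dynamic service membership derived from metadata.
# Hypothetical data model, not the LM Envision implementation: the point
# is that nobody maintains the member list by hand.
from collections import defaultdict

resources = [
    {"name": "web-01", "tags": {"service": "checkout"}, "healthy": True},
    {"name": "web-02", "tags": {"service": "checkout"}, "healthy": False},
    {"name": "db-01",  "tags": {"service": "checkout"}, "healthy": True},
    {"name": "cms-01", "tags": {"service": "storefront"}, "healthy": True},
]

def build_services(resources: list[dict]) -> dict[str, list[dict]]:
    """Form services dynamically from a 'service' tag; new resources
    join the right circle automatically as they come online."""
    services = defaultdict(list)
    for r in resources:
        services[r["tags"].get("service", "untagged")].append(r)
    return services

for name, members in build_services(resources).items():
    up = sum(m["healthy"] for m in members)
    print(f"{name}: {up}/{len(members)} members healthy")
# checkout: 2/3 members healthy
# storefront: 1/1 members healthy
```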
The next vignette I wanna talk to you about is Resource Explorer.
Hopefully, by now, you’ve heard a little bit about Resource Explorer. It was introduced well over a year ago into the product, and it is one of the biggest steps forward in the LM Envision platform in the last couple of years.
Resource Explorer’s job is to flatten the environment for you and take out some of the tribal knowledge requirements. Let me explain.
Most tools, including LogicMonitor, require a level of tribal knowledge or expertise in the tool to know where to click, because you have to click, click, click, click many times to get to the kind of data you would use to diagnose a problem.
Usually, on the left side of your screen, you’d have some kind of a navigation bar, and on the right side, you’d have a detail page. It’s called a master-detail view, kind of like the file folder structure you’d have, say, on your laptop. And you’d have to drill down four or five or ten different folders to get to the level of nesting where the thing you need to know lives. That’s part of what makes it difficult for a level one engineer to know exactly what to do on their own without help. You’d have to have a bunch of runbooks. You’d have to have instructions or training, something that teaches that engineer exactly what to do.
What if we could flip that on its head and give you a view that had the equivalent of what I like to call three clicks to awesome?
Meaning, the kind of thing you see in a demo from a vendor. You know how vendors show up and do a demo and it looks beautiful and clean, but maybe when you go use that product, it’s not quite the same experience? That’s the problem we’re solving for here. And so there’s another session where you can get more detail on this.
But the idea of Resource Explorer is to give you a flat view of the world that requires no previous knowledge. It’s kind of like putting a pivot table on top of your data. And with that pivot table, you can group and filter using metadata any way you want. I can group by owners, by locations, by versions, and by any of a hundred other things you might think of as metadata, in the form of properties or tags, that then help you to see patterns.
And pattern matching is how I make a level one engineer stronger.
I don’t want to have to work through eight separate alerts or tickets individually before I see that they’re related to one another. I wanna visually see that all the things that are red are located in this corner of my data center, or running on this cloud service, or on the second floor of this branch office. Because when you know where things are or who owns them, that context helps you solve problems tremendously faster. So I encourage you to check out the Resource Explorer session and learn more. It’s a tremendously faster way of solving problems and will make your level one engineers much, much stronger.
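Here’s a minimal sketch of that pivot-table idea, with made-up resource records and properties; it isn’t the Resource Explorer implementation, just the grouping mechanic that makes a pattern jump out:

```python
# Minimal sketch of a metadata pivot over resources: group by any
# property (location, owner, version, ...) and count what's alerting.
# Hypothetical records, not the Resource Explorer implementation.
resources = [
    {"name": "sw-101", "location": "branch-2, floor 2", "owner": "netops", "alerting": True},
    {"name": "sw-102", "location": "branch-2, floor 2", "owner": "netops", "alerting": True},
    {"name": "sw-103", "location": "branch-2, floor 2", "owner": "netops", "alerting": True},
    {"name": "web-07", "location": "us-east-1", "owner": "app-team", "alerting": False},
]

def pivot(resources: list[dict], key: str) -> dict[str, tuple[int, int]]:
    """Count alerting vs. total resources per value of a metadata key."""
    out: dict[str, tuple[int, int]] = {}
    for r in resources:
        total, red = out.get(r[key], (0, 0))
        out[r[key]] = (total + 1, red + int(r["alerting"]))
    return out

for value, (total, red) in pivot(resources, "location").items():
    print(f"{value}: {red}/{total} alerting")
# branch-2, floor 2: 3/3 alerting  <- the pattern jumps out immediately
# us-east-1: 0/1 alerting
```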
The next one I wanna talk about is Edwin AI. This is perhaps something you’ve heard of before. We’re in the midst of a giant revolution in what’s happening with artificial intelligence right now, and I don’t think there’s any space better tuned and aligned with what AI can do than what we see in observability.
One of the biggest problems that we have in observability is that there’s way too much data for anybody to understand.
And if you’re someone like me, who’s been in the space for decades and understands all of the kinds of telemetry coming in, it’s still too much. But imagine if you were new to this space, you’d been in your job for a couple of months, and you were a level one operator trying to do the very best that you can, and you were getting a tremendous volume of alerts with language that is almost impossible to understand.
Of course you’re gonna either just throw those things away and hope they don’t happen again, or escalate for help. That’s the problem we wanna help you solve.
Edwin AI is a platform that works together with LM Envision as well as third parties.
We recognize that all of your telemetry data will never be in one platform. We know it’s a multi-domain, multi-tool world. So Edwin AI will receive events from LogicMonitor as well as from other tools.
Its job is to make level one engineers stronger by deduplicating and correlating the stream of alerts that come in from all of these tools and make sense out of them.
Instead of having a hundred alerts, what if you had two or five that have all of the underlying data pulled together and correlated into a single insight?
What if that insight had plain English on top of it as a summary that says you have a problem in three branch offices with these two applications?
What if the tool could describe to you a potential root cause or even what to do about it?
What if that tool could describe to you how a similar problem in your environment was solved two weeks ago to help that level one engineer know more about how they should solve this problem now?
That’s the kind of help Edwin can give you, and that’s why the tool was named like someone who could be on your team: an expert who can look over the shoulder of a level one operator, tell them what to do, and make them stronger and faster in how they solve their problems.
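For a feel of the core mechanic, here’s a minimal sketch of deduplication and correlation over a stream of events. The event shape and grouping rule are hypothetical, and Edwin AI’s actual pipeline is far richer than this:

```python
# Minimal sketch of alert deduplication and correlation. Hypothetical
# event shape and grouping rule, not Edwin AI's actual pipeline:
# dedupe on a fingerprint, then correlate by service and time window.
from collections import defaultdict

events = [
    {"source": "LM", "service": "checkout", "msg": "db timeout", "ts": 100},
    {"source": "LM", "service": "checkout", "msg": "db timeout", "ts": 101},  # duplicate
    {"source": "other-tool", "service": "checkout", "msg": "5xx spike", "ts": 103},
    {"source": "LM", "service": "payroll", "msg": "disk full", "ts": 500},
]

def dedupe(events: list[dict], window: int = 60) -> list[dict]:
    """Drop repeats of the same (service, message) within the window."""
    seen, out = {}, []
    for e in sorted(events, key=lambda e: e["ts"]):
        fp = (e["service"], e["msg"])
        if fp not in seen or e["ts"] - seen[fp] > window:
            out.append(e)
        seen[fp] = e["ts"]
    return out

def correlate(events: list[dict], window: int = 60) -> list[list[dict]]:
    """Group deduped events on the same service and time bucket
    into a single insight."""
    insights = defaultdict(list)
    for e in dedupe(events, window):
        insights[(e["service"], e["ts"] // window)].append(e)
    return list(insights.values())

for insight in correlate(events):
    print([e["msg"] for e in insight])
# ['db timeout', '5xx spike']  <- one insight instead of three checkout alerts
# ['disk full']
```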
So you’re already experiencing the modern data center, and you’re on a journey that’s similar to, but not quite the same as, what I’ve just described.
If you think back over the past five or ten years, many of you have been focused on a cloud journey, one we’ve talked with you about before.
That cloud journey has focused on moving from the data center to the cloud, and now we see that coming back again. It felt like the goal was to get out of the data center. I still hear that sometimes: I have to close my data center by 2027. I heard that from an executive just this week.
Or: I need to move more of my applications. I need to refactor these applications into container and PaaS workloads.
These kinds of things are technology shifts. But if you’re not careful and you focus just on the technology shift, it doesn’t actually help you to become more mature in your observability journey.
So the observability journey that we’ve been talking about sits kind of adjacent to or on top of the technology journey that moves you either out of the data center into the cloud or now back to the data center as a part of repatriation that seems to be happening much more aggressively with AI workloads.
I’m gonna close with an example of a LogicMonitor customer and partner who has been on this very journey: Tapestry.
Tapestry has brands you might be familiar with, such as Coach. They’ve been on this journey for years; they joined LogicMonitor some years ago. And if you listened to the keynote, you saw a little action figure with Pedro.
Pedro’s one of our friends, and he was talking to me recently about their journey. It started several years ago with a move out of the data center. But now they recognize that they will always have an on-prem presence in manufacturing locations. They have colocation sites as well, and they’re in multiple clouds and will remain so for the long term.
That means the systems that they monitor and manage are all over the place, and they represent not only cloud native applications, but many legacy apps as well.
Over the time that they’ve journeyed with LogicMonitor, they started very reactive, much like many of you. And over the last couple of years, they’ve become far more proactive, moving into that third stage by employing many of the kinds of capabilities we’ve talked about here today, the ones I want you to go see detailed sessions on later.
Ultimately, here’s where they’re focused now. They’ve already consolidated tools down from four to one, as you can see here, and reduced the number of alerts they’re getting. Next, they have a CEO-led initiative to get to zero major incidents, P1s or MIs, over the course of the next year by being proactive, optimizing, and identifying issues before they boil into problems.
When you get rid of some of the worst issues, that frees up slack in the system and lets your team focus on tuning and fixing things before they boil into difficult problems. That’s usually where most of you suffer: all of your time is taken up by firefighting, and there’s never any time left to do anything proactive at the end of the day. That’s what makes the difference overall.
So hybrid observability: it’s here to stay. LogicMonitor has been talking about this for quite some time now. It’s not new to us at all, and that’s why we focus our platform on all of the different technologies.
Whether they’re legacy and on-prem (there’s nothing wrong with them; they still work just fine), or cloud native technologies, or now AI workloads as well, LogicMonitor’s got your back, not only in monitoring all of those things, but also in helping you become more proactive in how you analyze and use this data, so your engineers are stronger and can solve problems faster.
Ultimately, at the end of the day, what you need is not another watch. You need to consolidate the watches.
It’s less about coverage and less about onboarding data, or at least it should be. It needs to be more about what you do with the information you have and how you use that info to make your engineers stronger so they can create better outcomes for your business.
I hope you get time to visit some of the other sessions today where you can learn more about some of these little feature vignettes that I’ve described to you. We appreciate you joining us here at our conference. Thanks. Have a great day.”