Speaker

Chris Kline

Chris leads a global team of senior AI, cloud and observability experts at LogicMonitor. With 20+ years of experience, he works with customer executives to create real and lasting value for their initiatives. Previously, Chris worked in various observability, DevOps and APM specialist roles at Splunk and CA Technologies. He has also worked as a product manager and as a solutions engineer. Chris holds 4 patents for work related to IT Ops automation earlier in his career.

Chris lives in Colorado where he enjoys hiking and mountain biking, and spending time with his sassy daughters.

Video Transcript

“Hi, and welcome to the session on service insights. I’m Chris Kline. I’m a CIO adviser with LogicMonitor.
Today we’re gonna talk about what services are and why they’re important. We’re gonna talk about LogicMonitor’s implementation of it in a feature called service insights, and some exciting new updates that we’ve been working on for quite a long time.
To start the topic, I want you to have a look at these two pictures. And I want you to think for a moment, what is the difference between the two images that you see here?
What do you see on the left side, and how does that compare to what you see on the right side?
On the left side, you’re seeing something that measures water. And on the right side, you’re also seeing something that measures water.
The difference is that the one on the left is measuring the pressure of the water, and the one on the right is measuring the quality of that water.
This is a great analogy for the way that we sometimes talk past one another. Many of you probably associate yourselves and your profession more with the left side of this picture, with plumbing kinds of situations where your job is to ensure that data gets from point a to point b. In the world of infrastructure monitoring, that’s a lot of what we do. If you’re an application person, though, then you don’t know a lot about the infrastructure. You don’t know a lot about the plumbing, but you do know whether the water is gonna kill you. Or in the application sense, you know if the app is working correctly or not.
When we speak in terms of pressure or quality, we’re both talking about the same thing, but we also talk past one another sometimes.
We say things like, it looks okay from here, or the network’s just fine, or I can’t see that you’re having any problem. These are the kinds of things that happen over and over when we’re trying to triage, particularly on bridge calls, because we’re looking at the same problem from different angles, through different lenses that cause us to skip past what the other person is seeing.
This is precisely why a service-level context is necessary and can help us to solve that issue.
So if you think about services versus resources, if you’ve been around LogicMonitor at all, you know that we typically measure performance in terms of resources.
Resources, on the right side of this chart, will, for example, give you an alert on any error condition for any resource, any device.
It’s measured by single devices.
It creates lots of silos, because you have to think about dynamic groups and things that you might have where you organize your devices and you put all of one kind of server or one kind of network device into a group.
If you’re a pressure person, if you’re in the plumbing side of the business, then that’s exactly what you care about looking at. But when you’re asked about how that represents against somebody else’s side of the business, sometimes that translation level is hard.
Now translate that to the service column in the middle of the picture here. A service represents the business context of what’s going on.
All the members of a service contribute to the overall indicator or quality of the service. So instead of having a red light for each server, each load balancer, each VM, each network device, or anything else, you would have one light that represents the overall quality of service.
This helps to facilitate collaboration, because now if a CPU is at a hundred percent or a disk is full or the network’s dropping packets, I can quickly see whether or not it’s causing a degradation to an overall service being delivered.
That’s the way that business wants to understand things. That’s the language that they hear and understand. That’s more of a water quality measurement, but we can do it actually with water pressure KPIs, which means we can actually go back and forth between both ways.
So why, why are we digging into this, and why is an insight around service level so important? Well, speaking in terms of business impact is something that those of us who are deep in technology don’t always do very well.
Frequently, I’ll ask people, what does this application really do? Or, when this piece of infrastructure has a problem, how do you know what the business impact is? And it’s not uncommon that I hear, well, I don’t really know what the impact is. My job is just to keep it running. And I think back to when I was in operations years ago. I had the same kind of thing, where I’d get paged in the middle of the night to go fix a problem with middleware on an application called XYZ, and I was supposed to fix it as though I knew what it did. Well, I didn’t know if it was the junk mail machine or if it helped people pay their credit card bills or anything in between. Having some context would really have helped me.
When you have multiple things that turn red at the same time, this level of context can be really powerful and help you to know which one to work on first. There are always gonna be some apps or services in your business that are more important than others. But sometimes, especially with shared infrastructure, it can be really difficult to understand the implication when something’s breaking to know exactly which business services are affected.
So how does this problem typically get solved? This is an image that we’ve used over the last few years; you may have seen it before. Typically, the way we solve this kind of problem is by looking from the top down. This is usually referred to as an APM problem, for application performance monitoring.
Meaning, you would do application-level tracing. You would use a product such as Dynatrace or AppDynamics or other kinds of tools like that, and you would trace an application via an agent to find out where the blockage is located. It’s kind of like an angiogram, as I sometimes describe it. It’s like putting a dye into your veins, and then you wanna see where the blockage is in the blood flow. That’s what a trace will do. And that’s typically how a service-level context gets created.
The challenge with doing it that way is multifold. Number one, traces are expensive. They’re expensive in lots of ways. They’re very expensive in terms of the license cost that they take.
They can cost upwards of ten thousand dollars an agent. They’re expensive in terms of the expertise that’s required; there are very few people that know how to do that work well. And they’re very expensive in terms of the technical dependencies that they have and the time drag they create.
Because it’s an agent-based technology, and because it has the ability to potentially ruin the application that’s being monitored, you have to be very careful. You have to work through preproduction environments and change windows, and you have to test it and learn how to back it out, and a lot of things that can take a lot of time. Ultimately, at the end of the day, it means that the top-down view is a really great idea, but you’ve only ever seen it on the topmost important applications in an enterprise.
What that means is you might have a hundred or five hundred apps or services in your business, and that top-down view can only ever get deployed on maybe the top five, the top ten if you’re really lucky. And you might only have five or ten experts, if you’re lucky, who know how to do that work.
So back to the problem, I want to have a service level context, but I don’t have enough time and experts to go and do that work to get that outcome. And that’s been the status quo for a decade or more now.
Now let’s flip this on its head. If I go from the bottom up, what do I get if I go that way? Well, if I come from the infrastructure represented on the bottom of this picture, then I don’t have to deploy any agents. I can just use LogicMonitor’s agentless collector architecture to remotely pull infrastructure metrics.
But infra metrics by themselves don’t have service level context in them. So if I can add that context and effectively draw a circle around the server and the network (the CPU, the disk, the memory, whatever the metrics are) and say, this is the infrastructure that represents this application’s performance, I’m not getting trace-level visibility. I’m not suggesting that I am. But if I combine that together with, say, a synthetic, a web check, or a metric scraped from a log that tells me how the app is performing from a user’s perspective, I get the end user experience.
Now I have an agentless ability from the bottom up to get the same kind of view that I was getting top down, but I do it without that expensive cost of license, cost of deployment, cost of expertise.
And this means now if I have five hundred apps in my enterprise, I can do this for all five hundred of them much, much more aggressively than I could with the top-down model. So the bottom-up view is gonna be a better way to get you more ubiquitous coverage. And though it won’t go as deep as the top-down, it will be good enough for most of your applications.
And it won’t require nearly the expertise, which means that you can get this out there much more quickly for the entire surface area of your estate.
So LogicMonitor has long had a capability called service insights. It’s been in the platform for quite some time. And service insights, as I mentioned, basically draw a lasso around selected infrastructure resources.
And from those resources, you’ll collect certain KPIs and various alert thresholds from different data sources. And you would say, if my disk is too full or my CPU is too high or my network is too latent or my user experience is too poor, then give me a red light. We’ll put those together into one thing by grouping them together.
Then I’ll show the health of the overall service as one cohesive unit that gives much more context about how an application is working.
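The roll-up described above can be pictured as a worst-status-wins aggregation. Here is a minimal, hypothetical sketch in Python (not LogicMonitor’s actual implementation or API) of how the statuses of a service’s member resources might combine into one overall light:

```python
# Hypothetical sketch: each member resource reports a status, and the
# service as a whole shows the worst status among its members.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def service_status(member_statuses):
    """Return the worst status among a service's member resources."""
    if not member_statuses:
        return "green"  # no members, nothing to alert on
    return max(member_statuses, key=lambda s: SEVERITY[s])

# A single degraded web server turns the whole service yellow;
# any red member makes the service red.
print(service_status(["green", "yellow", "green"]))  # yellow
print(service_status(["green", "red", "yellow"]))    # red
```

In practice the platform evaluates real KPI thresholds per data source; this sketch only illustrates the grouping idea.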
We have reinvented this capability over the last year, and we’re just ready to roll it out now. It’s gonna be rolled out in a couple of phases. In the image here, you can see an example: just red, yellow, and green indicators with the name of a service.
And those indicators tell me whether it’s working or not. And at the service level, I don’t really need to understand whether, underneath, it’s the network or the database or a CPU that’s too high or anything else. This just helps me communicate whether or not the service is functioning properly.
Now in the past, the way that this was done was effective but time consuming.
It required you to know all of the devices that had to be put into the group.
It also required that you keep an eye on it. Anytime there was a change in your environment, you’d wanna propagate that change into your services to keep them current, maintaining the currency and integrity of those services.
And so the services worked just fine when you did the work, but most folks didn’t actually have time to do the work.
So we have spent time over the last year plus reinvesting in this capability to make it so that services can be built, configured, and maintained automatically using a template capability. So as the environment shifts, the template can be reapplied to what’s seen out there in the wild, and the service can maintain its integrity without any input on the administrator’s part.
That takes away the giant burden of having to configure these things, because you may have thousands of them over time, and ensures that you can trust what you’re seeing, because it automatically updates as the environment shifts. It’s a huge improvement over the way that it’s been done in the past, and it takes the bottom-up approach that I showed a moment ago and makes it much, much simpler to implement.
This page here shows you Resource Explorer. And Resource Explorer, if you’re not familiar with it already, is a very simple, flat way to look at all of the resources in your environment. The image that you’re looking at here shows Resource Explorer organized by service. So once you create these services, you can use them in a variety of different ways. You can use them with the little chiclets you were seeing a moment ago, but you can also use them in other parts of the product, just like you’re seeing here in Resource Explorer, to group and filter and see how performance is being maintained across a variety of different services. You don’t necessarily have to have a new view. It’s just more context in the views that you have already.
Another thing that’s helpful with services is that they can be linked together into a topology.
The image that you’re seeing here shows that you can see dependencies among services. The arrows between them represent dependencies, via something called edges and vertices.
And the edges and vertices allow you to show what thing depends on another thing. So you can create services at multiple levels. Imagine this, then: you would create services that represent both technical services and also business services. A technical service might be my load balancers.
It might be a pool of web servers, for example.
It might be a Kubernetes cluster.
It could even be a shared network.
Anything that you may have had to create, say, dynamic groups for in the past, you can create as a technical service. That service isn’t necessarily linked to a business outcome, but it pulls together a variety of technologies because you wanna group them, see how they’re performing collectively, and show their overall status.
Above those technical services, then, you might have a business service that has multiple technical services as its dependencies, which says, I have these three or four different pieces of technology I’m depending upon. And when those things turn orange or red, then the parent business service also would turn a color.
The business service, in turn, might have end user measurements in it in addition to technical measurements. When we stitch them all together into a topology, you can start to see how the whole body of an architecture is established, in a way that lets me see how what’s happening in one piece is affecting performance in another piece. Visually, this gives you a really quick understanding to say, oh, I see how my piece is affecting other things, and I know where we need to go work first.
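The status propagation just described can be sketched as a recursive roll-up over the dependency edges. This is a hypothetical illustration (the service names and the worst-of rule are assumptions for the example, not LogicMonitor’s actual behavior):

```python
# Hypothetical sketch: a parent business service inherits the worst
# status among itself and the technical services it depends on.
SEVERITY = {"green": 0, "orange": 1, "red": 2}

def rolled_up_status(service, own_status, depends_on):
    """Walk the dependency edges and return the worst status found."""
    statuses = [own_status[service]]
    for child in depends_on.get(service, []):
        statuses.append(rolled_up_status(child, own_status, depends_on))
    return max(statuses, key=lambda s: SEVERITY[s])

# A business service "checkout" depending on two technical services.
own_status = {"checkout": "green", "load-balancers": "green", "web-pool": "orange"}
depends_on = {"checkout": ["load-balancers", "web-pool"]}

# The degraded web pool bubbles up to the parent business service.
print(rolled_up_status("checkout", own_status, depends_on))  # orange
```

The point of the sketch is the direction of the walk: a red or orange child anywhere in the topology is visible at the business service above it.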
So making services is actually a lot easier now than it ever has been in the past. And I wanna walk you through, for a moment, how you would go about doing such a thing.
The headline on this page is this: if you take one thing away from this session, I need you to consider making tags a priority in your environment.
Now, tags are known as LogicMonitor properties. They’re probably something you’re already familiar with if you’ve spent any time with the platform.
LogicMonitor properties, or tags, are one of the most differentiating capabilities in the platform, but they’re historically underused.
It’s not at all unlikely if you’re monitoring a device in, say, Amazon or Azure that you might already have seventy or more properties for a single resource.
LogicMonitor is very good at automatically adding and decorating these tags in your environment.
But the tags that LM puts in your environment are usually gonna be technical tags: a tag that says the version of a thing, the version of the firmware, or the model number of something, for example.
All of these can be self-diagnosed very easily, and LM does a great job of that.
Those tags can be extended by a variety of mechanisms. You may be familiar with something called property sources in the platform. This is a mechanism where you can use a simple scripting framework to go and get additional tags and input them into the system. For example, if the second character in a host name is a p, then I would make an environment tag called production. Or if it’s a d, I would make the environment tag be development, as an example. So you could use automation to create tags based upon other things that you’re observing in the environment, or if you wanna go offline and gather additional information.
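The hostname convention described above can be sketched in a few lines. This is plain Python for illustration, not an actual property source script, and the naming convention itself is just the example given in the session:

```python
# Hypothetical sketch: derive an "environment" tag from the second
# character of a host name (p = production, d = development).
ENV_BY_CODE = {"p": "production", "d": "development"}

def environment_tag(hostname):
    """Map the second character of a hostname to an environment name."""
    if len(hostname) < 2:
        return "unknown"
    return ENV_BY_CODE.get(hostname[1].lower(), "unknown")

print(environment_tag("xpweb01"))  # production
print(environment_tag("xdweb01"))  # development
```

The same pattern works for any convention you already encode in names: location codes, team prefixes, and so on.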
You can go get tags. Well, heck, you could go get a spreadsheet and upload it if you needed to. You can also integrate with a CMDB, a config management database.
And though most of you probably don’t have CMDBs that you love, if you do have one, it’s a great source of truth for the tags that give business context. For example, what is the team that owns a particular CI or resource? Who’s the owner? What is its location? What is its change window?
And so on. What environment is it in? What is the name of the application?
In order to use services, you’re going to need to come up with probably your five favorite tags to get started.
And I would recommend that for most of you, it’s probably gonna be things like location, owner, application name, version, and so on. And you’re gonna tag these into all of the resources in your environment that match those components.
You can do so manually, but doing it automatically is gonna be a much better way. It’s not trivial, but nor is it terribly difficult. And when you add this metadata to those devices, it’s useful for a variety of different purposes. It’s useful for the service insights that we’re talking about right now.
Tags are incredibly useful for Resource Explorer. They’re useful for Edwin AI when you wanna move into event intelligence and the generative AI on that side of the house. They’re useful in dashboards and in so many parts of the product. So if you don’t do tagging right now, you’re really leaving a lot of the value of the platform on the table.
If you do take the tags and get them in the system, then building up of services becomes pretty straightforward.
So the taxonomy is, number one, you would identify the services that you’re looking for. Am I gonna build some technical services or some business ones?
Then you figure out the dependencies between them. Draw a little whiteboard map that shows you which things depend on which other ones. Then you’re gonna create some tags.
You have to create those separately in the system. But once the tags are in there, then you’re gonna create the service based upon the tags.
And Michael Rodriguez from product management is gonna walk you through an example of this in a demo momentarily.
But what happens is you create a template. The template looks for certain tags and possibly certain tag values. And when it finds them, it will pull the associated resources into that service automatically, and it will maintain that integrity as the tags change and the resources change in the environment.
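The template behavior just described (find resources carrying a tag, pull them into the matching service, stay current on re-evaluation) can be sketched as a simple grouping rule. This is a hypothetical illustration keyed on the `application.name` tag used later in the demo, not the product’s actual template engine:

```python
# Hypothetical sketch: group resources into services by a tag key.
# Re-running this over the current inventory keeps services current
# as resources and tags change.
from collections import defaultdict

def build_services(resources, tag_key="application.name"):
    """Return {service_name: [resource names]} for every resource
    that carries the given tag; untagged resources join no service."""
    services = defaultdict(list)
    for res in resources:
        value = res["tags"].get(tag_key)
        if value:
            services[value].append(res["name"])
    return dict(services)

resources = [
    {"name": "web1", "tags": {"application.name": "storefront"}},
    {"name": "db1",  "tags": {"application.name": "storefront"}},
    {"name": "util1", "tags": {}},  # untagged, left out
]
print(build_services(resources))  # {'storefront': ['web1', 'db1']}
```

The maintenance story falls out of the design: because membership is computed from tags rather than hand-picked, adding a tagged resource to the environment adds it to the service on the next evaluation.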
Let’s have a look at how this works.
Hey there. Mike Rodriguez here. Chris gave us a good rundown of why you should care about services, but let’s take a look at one in the wild.
So let’s pretend I work for a large company, and I’m the principal IT engineer.
This large company has one web app that they use to make all their money. It’s their primary business service. This web app is down. They don’t make money. As a principal IT engineer, it’s basically my job to keep this thing running.
When I started, we had a lot of issues with this app. I’d get critical alerts all the time, usually late at night when I was busy with other life stuff, and I’d have to go in and check things and SSH in and load web pages and make sure that customers weren’t actually impacted. Right? Because that context wasn’t coming through with the alerts that I was getting.
Eventually, I decided, hey. I should roll all this stuff into a service and make my life easier because a critical alert at the device level doesn’t necessarily indicate a business impacting event. Right? And every time I respond to those is time wasted, time I could have better spent sleeping or working on other things for the business.
It also made me look like kind of a fool in front of my boss when he’d be like, hey, Mike. How’s the web app doing? And I’d be like, give me a minute to test it so I don’t look like an idiot when I say it’s up and it’s actually down.
So how did I do all this? Right? And as you can see, we’re actually in a state right now where we have the service in warning.
So let’s take a quick look at the members. Well, actually, let’s take a look at the alert first.
So in this case, we can see that there is a cluster members percent down alert on our web cluster module. This module monitors just the web components of the service and lets me know if more than fifty percent are down.
If the number of web servers changes, the percent basis adapts automatically, so I don’t have to hard-code it. But in this case, one of the servers is down, and that’s leading to the warning alert.
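The percent-based check Mike describes can be sketched like this. The exact thresholds here are assumptions for illustration (the demo mentions a fifty percent threshold); the point is that a ratio adapts as the cluster grows or shrinks, where a hard-coded count would not:

```python
# Hypothetical sketch: alert on the percentage of cluster members
# down, rather than on a fixed count of servers.
def cluster_percent_down(statuses):
    """Return the percentage of cluster members that are down."""
    down = sum(1 for s in statuses if s == "down")
    return 100.0 * down / len(statuses)

def cluster_alert(statuses, threshold=50.0):
    """True when more than `threshold` percent of members are down."""
    return cluster_percent_down(statuses) > threshold

# One of five servers down: 20% down, below the 50% threshold.
print(cluster_alert(["down", "up", "up", "up", "up"]))        # False
# Three of five down crosses the threshold.
print(cluster_alert(["down", "down", "down", "up", "up"]))    # True
```

A lower warning threshold could sit alongside the critical one in the same way, which would match the warning state shown in the demo.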
So let’s take a look at the members. These are all the members that make up this service, and we can quickly see that web one is in critical status. It’s probably totally down.
So, right there. Right? Imagine, in the old world, I would have gotten that web one critical alert. I would have had to go test a bunch of things, and I would have found out that this thing was still working.
The really cool thing is now I can go fix this, and I can go back to my boss and say, hey, boss.
Not only did the web app not go down, but we saw this pending possible issue, and we proactively worked to mitigate it. Right? So that wasted time now becomes time where I can be more proactive and get ahead of things, without wasting time just kinda seeing where things stand. So I’m gonna go fix this in a few hours. I’ve got some other pressing stuff ahead of it. We are kind of flapping in the breeze here by having, you know, just one web server still up, but we’ll probably be okay, at least for a couple of hours.
You might be asking, what happens if I have a hundred of these web apps? Right? Do I have to go in and manually pull out all five members, or n number of members, for each service and set them up manually?
If you have tried to build them before, you may think that. However, we do have some really cool, awesome new stuff coming that will let you essentially define rules to create services based on tags.
Right? So if I just go in and tag all of my devices with even something as simple as application.name, I can create this rule. It will look for anything with that tag, and it will put it into a service with the name of the application.
Now, you can get a lot more advanced. You can do filtering, you can edit the names, things that we won’t necessarily get into in this demo. But there’s a lot you can do with this to make it really seamless. Once I have this policy defined, again, services are automatically created and maintained, and I don’t have to do anything except put the application tag on everything that I deploy.
Additionally, the KPIs are all defined in data sources, and they work just like data sources at the device level, except they apply at the service level. So if you’re familiar with data sources, you should be right at home. We can easily map this web cluster performance module onto as many other web-based services as we want.
Super, super easy.
So, again, this was just a super quick example showing you that you can save time troubleshooting if you put things into a service and build KPIs out for them, because it spares you from having to sort everything out from the device level.
We do have a lot planned for this year. We have plans for out-of-the-box modules, which will do some of this for you, as well as some other features to make this easier and more repeatable.
If you’re interested in looking at this today or getting your hands on it, please reach out to your account team, and we’ll see if we can get you into the beta.
Other than that, thanks for watching.
Thanks, Michael. That’s a helpful way to understand how these services get created in a real life situation.
I wanna end by thinking through an example of how these services might be applied in an actual business scenario.
Topgolf is a LogicMonitor customer, and I chose to use this brand because many of you probably have a Topgolf somewhere near where you live, and you drive by them and you understand how they work. Many of you have probably played at a Topgolf venue.
At Topgolf, you’re not a customer. You’re a player.
And Topgolf employees are not employees. They’re playmakers.
When Topgolf came to us originally, they were seeking to reduce the complexity and number of tools that they had. They have tech that’s on prem and in the cloud. They have old tech and new tech. They have IT and IoT.
If you’ve never been to a venue, the way it works is that it’s a driving range, but with a much more enhanced experience. You get to select a game and play a virtual, augmented reality game in which you’re actually swinging your club. And as part of that, every ball that you hit has a chip in it that gets tracked by cameras.
And then there’s a bunch of technology that hooks everything together to help figure out what’s going on in that player’s experience.
Well, the swinging of clubs is only one part of the player experience.
You also are usually there as a group. That group usually wants to eat and drink. That group has menu service and table service happening. And so there’s a variety of different technologies that support all of that. Let’s think about how that turns into a service.
So imagine in an example here that a ball dispenser might not be working properly in an individual bay.
If we didn’t have services, then that would probably come up as some error message in, say, a log file somewhere that you’re hoping somebody looks at. And the players would probably have to reach out and call for help somewhere.
In a service context, we might look at a service as each bay. Bay one, bay two, and bay three would be three different services that could be created.
Each of those services would have within them a variety of different technologies: the cameras, the ball dispensers, the game screen that’s in there, and a variety of other things as well, including, say, point of sale for beverage and food purchases.
When we pull all of that together into an overall service for that bay, if a ball dispenser was not working properly, then a service alert would trigger, indicating that that bay was not having a good experience.
That alert could then trigger an escalation, which would get the right folks involved because now you’re looking at the player’s experience rather than one individual piece of technology and whether or not it’s making a difference.
You get that resolved, something gets restarted or fixed up, and then the player experience can resume. So in this example now, you’re talking more about how it’s affecting the business outcomes, and you’re thinking much more holistically rather than looking at all the bits and bytes underneath it individually for their own components.
Topgolf has had a great experience with LogicMonitor, and there’s a case study out on the website if you wanna hear more about it. But they’ve consolidated from ten tools down to one and dramatically lowered their time to resolution.
Now they’re excited to get into the next generation of dynamic service insights exactly like we’ve described today to further take that player experience and consolidate it into something that’s holistic so that we’re looking at everything from the top down even though we’re using bottom up metrics.
So to recap, businesses want to talk in their language, not in the language of technology. And the way that we do that is through service insights.
Service insights have been dramatically revamped over the last year to give you a bottom-up view that is largely equivalent to the top-down view, but much less expensive and much less time consuming. And now the dynamic service insights that are becoming available very shortly will maintain the integrity of a service using a template based upon the metadata of your LM properties, so that you don’t have to maintain or manage these services on a daily basis. So you can get all of this benefit with very little effort and communicate the value of what you’re doing in the deep technology, as water pressure people using water quality metrics that the business understands.
I hope you’re excited about this and wanna get involved. We have a beta program going actively right now. If you’d like to get involved with that, then reach out to your account team, and they’ll get you in touch with our product team.
And we’ll be making this available generally very shortly. Thanks for listening, and enjoy the rest of the sessions.”