Logs for Ops


Machine data and logging in general have shifted multiple times over the last couple of decades. The log began with Unix and was rooted in command-line actions like tail or grep. It evolved from system-based logs to application-based logs and eventually became more UI-friendly and readable. Not only has the log itself evolved, but its purpose and audience have morphed over time as well. The initial users were developers, but the audience eventually shifted to operations teams as a means to triage and diagnose issues. With the rise of cyber threats and hacking, the log then landed in the middle of the security center, where it has dominated the logging landscape over the past decade, spawning a billion-dollar industry and the concept of Security Information and Event Management (SIEM). Now a new transition is gaining momentum: the ever-transitioning log is coming full circle with its re-emergence in the day-to-day operations center workflow.

A History of Logs

In the world before Splunk, the old adage of “a reboot fixes all” was king amongst front-line troubleshooting efforts. Another tried-and-true method was stopping and starting services and bouncing resources with fingers crossed that this would resolve the issue during a firefight. Only when the problem recurred or the reboot did not resolve it did admins dive into the logs.

At the time, logs were not for the faint of heart. They were not verbose, were often neglected in content, and weren’t considered a crucial part of the development lifecycle. Most log troubleshooting required an actual developer to interpret the logs, which added complexity and reduced the willingness of operations teams to go into the logs at all.

Enrichment of the Log

The next shift happened as awareness of machine data spread beyond the dev team and it was recognized as a key element in investigating outages and finding root causes. Developers began standardizing log formats. RFC compliance also emerged, and accepted timestamps along with basic information such as session IDs and host details began to be included in every type of log.
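As an illustration only (the field names below are hypothetical assumptions, not tied to any particular RFC or vendor format), an enriched log entry of this era might carry a parseable timestamp, the originating host, and a session ID on every line, sketched here in Python:

# A minimal sketch of an enriched, self-describing log entry.
# Field names (host, session_id) are illustrative, not a standard.
import json
import socket
from datetime import datetime, timezone

entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # consistent, parseable timestamp
    "host": socket.gethostname(),                          # where the event happened
    "session_id": "a1b2c3d4",                              # ties related events together
    "message": "payment request failed: upstream timeout",
}
print(json.dumps(entry))  # one structured line per event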

This is also the timeframe in which the concept of log levels came forth. Logs began to be classified as debug, info, warning, error, or critical, which helped troubleshooters narrow their focus. As developers put more focus on logging, this process eventually became an important checkbox in the development lifecycle and a key ingredient of the DevOps methodology. With all of these enrichments, the industry saw an explosion in the amount of log data, which quickly grew beyond human consumption for anyone trying to really understand and use logs to their potential. While these enhancements to the logging framework gave operations teams the ability to seek out root cause on their own without always needing to engage the developers, they also pushed the need for ingest and analysis tools, search capabilities, parsing, and centralized storage of these logs.
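To make the log-level idea concrete, here is a minimal sketch using Python’s standard logging module; the logger name and messages are made up for illustration:

import logging

# With the threshold set to WARNING, debug and info chatter is filtered out
# before anyone has to read it.
logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("app")

log.debug("cache miss for key user:42")            # suppressed
log.info("request completed in 120 ms")            # suppressed
log.warning("retrying upstream call (attempt 2)")  # emitted
log.error("upstream call failed after 3 retries")  # emitted
log.critical("service entering degraded mode")     # emitted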

Shifting From Ops to Security

As cybersecurity incidents, data breaches, and attacks became the norm, the next pivot moved logs out of the NOC and drove the emergence of the SOC and the SIEM. Companies raced to spend billions on logging tools as a means to cover themselves, investigate, and hopefully maintain good graces with the industry and their customer base.

We have seen enterprises pour billions into the collection, analysis, and storage of these logs for audits, just-in-case scenarios, and threat hunting. Control of this data, and of the tools it resided in, moved from ops to security. In all of this, NOC operators and sysadmins were left behind, and the entire reason for logging in the first place, resolving outages and issues, was lost in the mix.

The security teams now own these tools. They lock them down and restrict access, and even when admins are given access, the learning curve to understand the query language well enough to go from 10 million events to 10 is excruciating. The front-line teams are right back where they started: avoiding logs, rebooting or bouncing services with their fingers crossed, hoping that resolves the issue, and if it doesn’t, hoping it doesn’t happen again.

Worse yet, it does happen again, and now you are reluctantly forced into the logs. The next steps are tedious and typically manual. Manual mode entails RDP, terminal services, or another remote connection to the box, so credentials become essential. Next comes deciding which of the dozens of logfiles with different timestamps is the one you need for your investigation. Then you download the logfile, zip and/or unzip it, and the real fun begins: open the file and run some ctrl+F, grep, or tail command to find cryptic error codes that probably won’t mean much to someone without expertise in the tech behind the log. If that’s the case, here comes the escalation to tier 2, 3, or beyond, a copy and paste, and shipping it off to the app or dev team for more analysis. Even once you find the so-called needle in a stack of needles, it comes without context: no metrics, no monitor threshold that was triggered, and no relation to upstream or downstream connections, devices, or containers. This is non-value-added engineering time, and it is costly no matter how you slice it. All the while, MTTR is slip-sliding away.
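To picture that manual hunt, here is a minimal sketch of scanning a downloaded logfile for a cryptic error code; the filename (app.log) and the ERR-NNNN pattern are hypothetical stand-ins, and even a match still arrives with none of the metric or topology context described above:

import re

# Hypothetical error-code pattern and filename; adjust for the log at hand.
pattern = re.compile(r"ERR-\d{4}")

with open("app.log", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"{lineno}: {line.rstrip()}")  # a hit, but with no surrounding context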

Putting the Operational Focus Back on Logs

LogicMonitor’s LM Logs and its anomaly detection capabilities put the operational focus back on logs. In the alert and notification pane of the interface, you can see correlated logs and metrics together in a single pane of glass when troubleshooting outages in real time, driving up to an 80% improvement in MTTR.

With LM Logs, advanced queries are not required. Logs are simple to ingest and are mapped to a device already monitored in LogicMonitor. Ease of use is key, and there is no query-language learning curve since the AI and anomaly detection do the initial work for you. Additionally, no non-value-added engineering time is spent manually accessing the logs.

Beyond the basic log use cases, early adopters have quickly leaned on pipeline alerts, which trigger notifications based on any log condition, to get a real-time notification if a problem returns. Teams have also been quick to leverage anomaly views, using them in weekly reviews to discuss whether those anomalies should be turned into pipeline alerts that provide real-time visibility the next time that event happens.

In just a short time, we’ve seen customers small and large leverage LM Logs as an operations-specific log tool, even though they already have a SIEM or security-focused tool. Logs for Ops.

The SIEM is here to stay and logs remain king in security, but the evolution of logs is coming full circle, back to the operations center. LogicMonitor and LM Logs are smack dab in the middle of that shift.