Windows Event Log Best Practices for Operations Teams

Windows Event Log Best Practices for Operations Teams

The Windows Event log is an essential tool for administrators to investigate and diagnose potential system issues, but it can also be a daunting task to gain real value and separate useful log entries from noisy non-essential activity. Depending on the level of logging that can be useful, Windows events can span system issues, application-specific issues, and also dive into security type issues around unauthorized access, login failures, and unusual behavior. Administrators must understand the concept of channels, logging levels, and the lens through which to view the data to make this data useful in day-to-day troubleshooting and root cause analysis. 

For today’s discussion, we are going to apply the lens of a Windows system administrator using the Windows Event logs to identify the root cause for system failures, performance issues,  configuration issues that lead to outages, or service disruptions. 

Noise vs. Value

The comment I hear most often around the subject of noise vs. value is that 99% of logs are never searched or used. I suppose this is mostly true but also reinforced because of how people typically use logs, which is avoiding them until there is a problem or outage and then the logs become a necessary evil. As most logging vendors bill based on ingest, log consumers go to great lengths to filter out the repetitive, audit success events, or even drop all information level logs and only keep warning, critical, and error logs. All of which reduce ingest and cost, but also reduce visibility into the dark spot of configuration changes. Configuration changes have been linked to 40% of all outages in IT today and are mostly written as information level in their various log sources. Windows is no exception, with a great deal of these “noisy” events being written to the information channel (more on channels to come).

One could argue that if you could leverage AI, machine learning, and anomaly detection you could increase the value of 99% of the logs that are never searched until a problem arises. Using anomaly detection to scan configuration change logs or information level logs could show real insight around new behaviors that can be addressed or known about. This could help avoid outages in the first place and also resolve things more quickly when there are service disruptions. If you are only ingesting and holding error, warning, and critical level logs you are missing out on the proactive value that can be gained from the 99%.

Centralize and Aggregate

The first step in any Windows Event management strategy is to centralize and aggregate these logs from across the environment into a single location. While retention times can vary based on use cases, having them central and searchable is key to success.  

Windows Events are located locally at each windows server which means there can be a fairly large degree of non-value-added engineering time involved just to get the right server at the right time. Non-value added engineering time is the amount of time a tech or admin will spend connecting to systems, downloading log files, searching, or parsing any tasks required before analysis and triage can be done. There are a variety of means to consolidate but for purposes of our discussion, we will assume that log ingestion means are in place and we can move the conversation into noise vs. value and what kind of logs to consume. 

What Are Event Channels? 

Channels are essentially buckets that classify event types. Putting them into categories makes them easier to collect, search, and view. Many Windows-based applications are written to be configured to log into one of Microsoft’s pre-defined channels, but we also see applications creating their own channel as well. For purposes of monitoring Windows Event log through the eyes of a system administrator and troubleshooting, the best practice approach is to ingest logs from the application, system, and security channels. However, there could be many additional channels based on the functions and other applications installed on the system. A quick PowerShell command from the administrator PowerShell prompt can show you a quick list of available channels on the system in question.

Enter the following command into PowerShell:

# to see channels listed in the standard order
Get-WinEvent -ListLog *

# to sort more active channels to the top of the list
Get-WinEvent -ListLog * | sort RecordCount -Descending

# to see channels present on a remote computer
Get-WinEvent -ListLog * -ComputerName <hostname>

The output will include a list of channels, along with the number of event records currently in those channels:

LogMode   MaximumSizeInBytes RecordCount LogName
-------   ------------------ ----------- -------
Circular            20971520       59847 Application
Circular            20000000       29339 Microsoft-Windows-Store/Operational
Circular            20971520       21903 Security
Circular             4194304       10098 Microsoft-Windows-GroupPolicy/Operational
Circular             5242880        9568 Microsoft-Windows-StateRepository/Operational
Circular            15728640        7066 Windows PowerShell
Circular             5242880        4644 Microsoft-Windows-AppXDeploymentServer/Operational
Circular             8388608        4114 Microsoft-Windows-SmbClient/Connectivity
Circular             1052672        2843 Microsoft-Windows-EapHost/Operational
Circular             1052672        2496 Microsoft-Client-Licensing-Platform/Admin

Event Types

Now that we’ve defined channels, it’s time to discuss the Event Types that are pre-set by Microsoft and apply them to each channel. Microsoft has categorized the Event Types as Information, Error, Warning, Critical, Security Audit Success, and Security Audit Failure. The first thing of note here is that event types Security Audit Success and Security Audit Failure only apply to the Security Channel and are only present in the Security Channel. Another important item to point out is that not all security events are written into the Security Channel. Security-related events could also be written to the System or Application channel or any of the other function-specific channels that could reside on a system. 

Windows Events for Ops vs. SIEM

Traditionally, the SIEM requests any and all logs and log types with the exception of debug level logging. This could work in a perfect world where cost is not a prohibitive factor, but most companies don’t have unlimited funds and they make exclusions. As mentioned above, security-type events can be written across channels. The SIEM typically needs to grab logs from all of these channels, but often times will exclude information level alerts as a cost-savings method.  

The typical stance from an operational perspective is that the security channel is often excluded from ingestion. As we’ve mentioned above, information level alerts can be highly valuable to operations staff, especially if the idea of anomaly detection is applied.  

Best Practices

While the usual approach outlined above will offer value, LogicMonitor’s best practice approach goes a little further in capturing additional value from these logs. We recommend grabbing any channel, but certainly starting with the application, system, and security channels, even for operations teams. The key channel here is the security channel. It is essential to define the event types we want to ingest and our recommended approach is to accept the information level log events and higher for any channel. This means we want to bring in the following event types: Information, Warning, Error, and Critical events for all channels, including the security channel. Let’s go a little deeper into why we might grab any security events for the operations team and how we can tune out some of the audit success noise.

The security channel is hands down the largest event volume and highest talker. When we investigate further, we know it is overwhelmingly the security audit success event type. Security audit successes are extremely chatty and essentially give us a record of who logged in where/when, what they accessed, and expected behavior. This level of logs may come in handy for a security analytics investigation but is rarely helpful or used in a break/fix type of triage. This leads us to the security audit failure and the other side of the coin. Security audit failures show us failed access, or login failures, and can indicate suspicious activity which could be the first indicators of an attack or hacker activity. Any or all of the above can lead to larger issues affecting both operations and security teams.  

One additional thought here to wrap up event types and best practices is debug level logs. Debug level logging could be useful if a specific use case comes up and could be turned on temporarily to assist in the deep dive triage. It should only be used during these ad hoc cases.  

Become More Proactive Than Reactive With LM Logs

LogicMonitor’s LM Logs capabilities combined with our best practices approach give customers a winning and sensible approach to aggregate Windows event logs quickly and efficiently into a single, searchable pane of glass. Collect the right level of log to help focus Windows administrators for better MTTR, use root cause analysis during triage of an issue, and use anomaly detection to drive proactive use of the logs to catch unusual behavior or changes that could lead to larger issues down the road. These are just a few features of LM Logs that address some of the Windows Event Log challenges:

  • Easy to ingest with LM Logs via WMI using existing LogicMonitor monitoring infrastructure and collectors already in place. 
  • Combine logs with server metrics for more in-depth views in a single pane of glass.
  • Daily, weekly, or monthly anomaly review sessions help gain valuable insight into unusual behavior.
  • Intuitive and simple means to define channels using the LM Logs DataSource and applying properties at either the individual server or server group levels.
  • Trigger Alerts on specific conditions with keywords, eventIDs, or Regex. 
  • Use data source properties to quickly filter out repetitive events by EventID, Event Type, or Message String.

Check out our on-demand webinar to learn more about how LM Logs can help your teams decrease troubleshooting, streamline IT workflows, and increase control to reduce risk.