Apache Hadoop is a collection of software for the distributed processing of large data sets across clusters of commodity hardware. The LogicMonitor Hadoop package monitors metrics for the following components:
- HDFS NameNode
- HDFS DataNode
As of February 2020, we have confirmed that our Hadoop package is compatible with Hadoop version 3.2.1. You may be able to monitor older versions of Hadoop, but not all datapoints will return data.
As Apache releases newer versions of Hadoop, LogicMonitor will test and extend coverage as necessary.
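If you are unsure which release a NameNode is running, you can check it against the tested version. The sketch below assumes you have already read the `Version` attribute of the `Hadoop:service=NameNode,name=NameNodeInfo` bean from the NameNode's `/jmx` endpoint, which reports a string such as `3.2.1, r<commit>`. The helper names are hypothetical and are not part of the LogicMonitor package.

```python
import re
from typing import Tuple

# Release this package was confirmed against (February 2020).
TESTED_VERSION: Tuple[int, int, int] = (3, 2, 1)


def parse_hadoop_version(raw: str) -> Tuple[int, ...]:
    """Parse a string like '3.2.1, r<commit>' into (3, 2, 1).

    Assumes the dotted release number comes first, before any comma.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)*)", raw)
    if match is None:
        raise ValueError(f"unrecognized version string: {raw!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def older_than_tested(raw: str) -> bool:
    """True if the reported release predates the tested release."""
    # Tuples compare element by element, so (2, 7, 3) < (3, 2, 1).
    return parse_hadoop_version(raw) < TESTED_VERSION


if __name__ == "__main__":
    for reported in ("3.2.1, r<commit>", "2.7.3, r<commit>"):
        status = "older than tested" if older_than_tested(reported) else "tested or newer"
        print(f"{reported}: {status}")
```

A result of "older than tested" does not mean monitoring will fail, only that some datapoints may return no data.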
Enable JMX on Hadoop Host
LogicMonitor collects Hadoop metrics through the REST API rather than directly through JMX. However, the REST API serves metrics that Hadoop collects and stores using JMX, so JMX must still be enabled on the Hadoop host. For more information on enabling JMX, see the “Enabling JMX” section of the Java Applications (via JMX) Monitoring support article.
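To see the kind of JSON the REST API exposes, you can query the `/jmx` servlet that Hadoop daemons serve over HTTP. Its optional `qry` parameter restricts the response to beans whose MBean name matches the given pattern. The following is a minimal standard-library sketch; the helper names are hypothetical, and it illustrates the endpoint rather than how the LogicModules themselves are implemented.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Optional


def jmx_url(host: str, port: int, qry: Optional[str] = None) -> str:
    """Build the URL of a Hadoop daemon's /jmx servlet.

    If qry (an MBean name pattern) is given, the servlet returns only the
    matching beans instead of every registered bean.
    """
    url = f"http://{host}:{port}/jmx"
    if qry:
        url += "?" + urllib.parse.urlencode({"qry": qry})
    return url


def fetch_beans(url: str, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """GET the /jmx URL and return the list under the top-level 'beans' key."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("beans", [])
```

For example, `fetch_beans(jmx_url("namenode.example.com", 9870, "Hadoop:service=NameNode,name=FSNamesystem"))` would return a single-element list for that bean, assuming the host name (a placeholder here) resolves and the NameNode listens on its Hadoop 3.x default HTTP port.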
Add Hosts Into Monitoring
Add your Hadoop host(s) into monitoring. For more information on adding resources into monitoring, see Adding Devices.
Assign Properties to Hadoop Resources
The following custom properties must be set on the Hadoop resource(s) within LogicMonitor. For more information on setting properties, see Resource and Instance Properties.
Note: These ports must be open to the Collector.
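One quick way to confirm that a port is reachable is to attempt a TCP connection from the Collector host. The sketch below is a hypothetical helper, not a LogicMonitor tool. If your cluster uses Hadoop 3.x defaults, the NameNode HTTP port is 9870 and the DataNode HTTP port is 9864; your property values may differ.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run it from the Collector itself, for example `port_is_open("namenode.example.com", 9870)`, since a firewall may allow the port from some hosts and not others.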
Note: To verify that the correct port is being used, browse to http://<HOST>:<HTTP_PORT>/jmx; you should see metrics for each of the components.
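The same check can be scripted. Hadoop registers its MBeans under names of the form `Hadoop:service=<Service>,name=<Name>`, so listing the `service=` values in the `/jmx` response shows which daemon is answering on a port. A minimal sketch with hypothetical helper names:

```python
import json
import re
import urllib.request
from typing import Any, Dict, Set


def services_in_payload(payload: Dict[str, Any]) -> Set[str]:
    """Collect the service= values from Hadoop bean names in a /jmx payload."""
    services: Set[str] = set()
    for bean in payload.get("beans", []):
        # JVM beans such as 'java.lang:type=Memory' do not match and are skipped.
        match = re.match(r"Hadoop:service=([^,]+)", bean.get("name", ""))
        if match:
            services.add(match.group(1))
    return services


def check_jmx_endpoint(host: str, port: int, timeout: float = 10.0) -> Set[str]:
    """Fetch http://host:port/jmx and return the Hadoop services it reports."""
    with urllib.request.urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return services_in_payload(json.load(resp))
```

A NameNode port should report `NameNode` and a DataNode port should report `DataNode`; an empty set suggests the port belongs to something else.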
From the LogicMonitor repository, import all Hadoop LogicModules, which are listed in the LogicModules in Package section of this support article. Upon import, these LogicModules will be automatically associated with your Hadoop resources, assuming the properties listed in the previous section are assigned.
LogicModules in Package
LogicMonitor’s package for Apache Hadoop consists of the following LogicModules. For full coverage, please ensure that all of these LogicModules are imported into your LogicMonitor platform.
Configuring Datapoint Thresholds
The Hadoop package does not include predefined datapoint thresholds (that is, no alerts will trigger based on collected data) because the technology owner has not provided KPIs that apply reliably to most users. To receive alerts for collected data, you’ll need to create custom thresholds manually, as discussed in Tuning Static Thresholds for Datapoints.
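As an illustration of how a static threshold turns a collected value into an alert severity, the sketch below evaluates an expression made of a comparison operator followed by up to three levels in warning, error, critical order, for example `> 0 1 5` for a count of failed volumes. This expression format and the function name are assumptions for illustration; see Tuning Static Thresholds for Datapoints for the syntax LogicMonitor actually accepts. This is not LogicMonitor's evaluation code.

```python
import operator
from typing import Optional

# Comparison operators accepted by this hypothetical threshold format.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}
SEVERITIES = ("warning", "error", "critical")


def evaluate_threshold(expression: str, value: float) -> Optional[str]:
    """Return the most severe level that value breaches, or None.

    expression is an operator followed by up to three levels in
    warning/error/critical order, e.g. '> 0 5 10' or '< 100 50 10'.
    """
    op_token, *level_tokens = expression.split()
    compare = OPERATORS[op_token]
    breached = None
    # Later levels are more severe, so the last one breached wins.
    for severity, level in zip(SEVERITIES, level_tokens):
        if compare(value, float(level)):
            breached = severity
    return breached
```

The same function handles "too high" counters (`> 0 5 10`) and "too low" capacity values (`< 100 50 10`), because the levels are listed in order of increasing severity either way.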
The following are some datapoints for which you may want to consider setting thresholds:
- DataSource: Hadoop HDFS DataNode FS State
  - NumFailedVolumes. Reports the total number of failed volumes on the DataNode.
  - Remaining. Reports the remaining capacity on the DataNode.
- DataSource: Hadoop HDFS NameNode Info
  - NumberOfMissingBlocksWithReplicationFactorOne. Reports the number of missing blocks that have a replication factor of 1 (that is, blocks that had only one copy across the cluster).
  - PercentUsed. Reports the percentage of used space across the cluster (DFS and non-DFS).
- DataSource: Hadoop HDFS NameNode Status
  - ServiceRestart. Returns a value greater than 0 when the service state changes.
  - State. Returns a status code indicating the status of the Hadoop NameNode service.
- DataSource: Hadoop HDFS NameNode FSNamesystem
  - CorruptBlocks. Reports the current number of blocks with corrupt replicas.
  - CorruptReplicatedBlocks. Reports the number of corrupt blocks that have been replicated.
  - FSState. Returns a status code indicating whether the file system is operational or in safe mode.
  - MissingBlocks. Reports the current number of missing blocks.
  - MissingReplicationOneBlocks. Reports the number of missing blocks with a replication factor of 1.
  - NumDeadDataNodes. Reports the number of DataNodes currently marked as dead.
  - UnderReplicatedBlocks. Reports the current number of under-replicated blocks.
  - VolumeFailuresTotal. Reports the total number of volume failures across all DataNodes.
- DataSource: Hadoop Yarn Queue Metrics
  - AppsFailed. Reports the number of applications that failed to complete.
- DataSource: Hadoop Yarn Cluster Status
  - NumLostNMs. Reports the current number of NodeManagers marked as lost because they stopped sending heartbeats.
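Before settling on thresholds, it can help to look at the current values of the NameNode FSNamesystem datapoints listed above. The sketch below reads them from an already-fetched NameNode `/jmx` payload and flags any counter above zero. The mapping from datapoint names to JMX attribute names is an assumption (notably `MissingReplicationOneBlocks` to `MissingReplOneBlocks`), as is the `"Operational"` value of `FSState`; the function names are hypothetical, and "above zero" is a simplistic rule for illustration, not a recommended threshold. The lookup searches every bean, so it does not depend on which bean reports each attribute.

```python
from typing import Any, Dict

# Assumed mapping: datapoint name (above) -> Hadoop JMX attribute name.
ATTRIBUTES = {
    "CorruptBlocks": "CorruptBlocks",
    "MissingBlocks": "MissingBlocks",
    "MissingReplicationOneBlocks": "MissingReplOneBlocks",
    "UnderReplicatedBlocks": "UnderReplicatedBlocks",
    "NumDeadDataNodes": "NumDeadDataNodes",
    "VolumeFailuresTotal": "VolumeFailuresTotal",
}


def find_attribute(payload: Dict[str, Any], attribute: str) -> Any:
    """Return the first value of attribute found in any bean, or None."""
    for bean in payload.get("beans", []):
        if attribute in bean:
            return bean[attribute]
    return None


def counters_needing_attention(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the datapoints whose counter value is above zero."""
    flagged = {}
    for datapoint, attribute in ATTRIBUTES.items():
        value = find_attribute(payload, attribute)
        if isinstance(value, (int, float)) and value > 0:
            flagged[datapoint] = value
    return flagged


def in_safe_mode(payload: Dict[str, Any]) -> bool:
    """True if FSState is present and is not 'Operational' (assumed value)."""
    return find_attribute(payload, "FSState") not in (None, "Operational")
```

Values that stay above zero for long periods, such as a persistent `MissingBlocks` count, are natural candidates for the custom thresholds described above.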