The MapR Metrics service collects and displays analytics information about the Hadoop jobs, tasks, and task attempts that run on the nodes in your cluster. You can use this information to examine specific aspects of your cluster's performance at a very granular level, enabling you to monitor how your cluster responds to changing workloads and optimize your Hadoop jobs or cluster configuration. The analytics information collected by the MapR Metrics service is stored in a MySQL database. The server running MySQL does not have to be a node in the cluster, but the nodes in your cluster must have access to the server.
The Job Metrics Database
Metrics information is kept in a MySQL database that you configure when you install MapR. The Metrics database provides the following standard tables:
- The JOB and JOB_ATTRIBUTES tables hold job metadata while a job is running. Information from the JOBSEL and JOB_ATTRIBUTES tables is written to the
/var/mapr/<cluster name>/mapred/jobTracker/jobs/history/directory. If a request is made at the MCS for a job that has already been purged from the Metrics database, that data is reloaded from the relevant directory.
- The METRIC_TRANSACTION_* tables hold job transaction data such as counters. The transactional data is written to the
/var/mapr/<cluster name>/mapred/jobTracker/jobs/history/metrics/directory on each host. This directory depends on the base path of the JobTracker directory. These transactional data files are named
- The NODE table holds information about the node ID, hostname, host ID, cluster ID, and creation time.
The TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables hold information related to a job's tasks and task attempts. These tables update while the job is running.
If a job's task data has not been accessed within a configurable time limit, the data from the TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables is purged. The
db.joblastaccessed.limit.hoursparameter in the db.conf file sets the number of hours that define this time limit. The default value for this parameter is 48.
The job metrics cover the following categories:
- Cluster resource use (CPU and memory)
- Duration (epoch)
- Task count (map, reduce, failed map, failed reduce)
- Map rates (record input and output, byte input and output)
- Reduce rates (record input and output, shuffle bytes)
- Task attempt counts (map, reduce, failed map, failed attempt)
- Task attempt durations (average map, average reduce, maximum map, maximum reduce)
The task attempt metrics cover the following categories:
- Times (task attempt duration, garbage collection time, CPU time)
- Local byte rate (read and written)
- Mapr-FS byte rate (read and written)
- Memory usage (bytes of physical and virtual memory)
- Records rates (map input, map output, reduce input, reduce output, skipped, spilled, combined input, combined output)
- Reduce task attempt input groups
- Reduce task attempt shuffle bytes
By default, the MapR cluster software applies limits to the maximum number of records used for job and task histograms. You can configure these records from the MCS through System Settings > Metrics or by editing the
db.conf file directly.
Metrics Protocol Buffers
The protocol buffer definition for cluster metrics data is
clustermetrics.proto, which is located in the
Example: Using MapR Metrics To Diagnose a Faulty Network Interface Card (NIC)
In this example, a node in your cluster has a NIC that is intermittently failing. This condition is leading to abnormally long task completion times due to that node being occasionally unreachable. In the Metrics interface, you can display a job's average and maximum task attempt durations for both map and reduce attempts. A high variance between the average and maximum attempt durations suggests that some task attempts are taking an unusually long time. You can sort the list of jobs by maximum map task attempt duration to find jobs with such an unusually high variance.
Click the name of a job name to display information about the job's tasks, then sort the task attempt list by duration to find the outliers. Because the list of tasks includes information about the node the task is running on, you can see that several of these unusually long-running task attempts are assigned to the same node. This information suggests that there may be an issue with that specific node that is causing task attempts to take longer than usual.
When you display summary information for that node, you can see that the Network I/O speeds are lower than the speeds for other similarly configured nodes in the cluster. You can use that information to examine the node's network I/O configuration and hardware and diagnose the specific cause.