- Mastering Hadoop
- Sandeep Karanth
MapReduce job counters
Counters are entities that collect statistics at the job level. They help with quality control, performance monitoring, and problem identification in Hadoop MapReduce jobs. Because they are global in nature, unlike logs, they do not need to be aggregated before analysis. Counters are organized into logical groups using the CounterGroup class, and every MapReduce job comes with a set of built-in counters.
The following example illustrates the creation of simple custom counters that categorize lines into lines with zero words, lines with five or fewer words, and lines with more than five words. When run on the grant proposal subset files, the program produces the following output:
14/04/13 23:27:00 INFO mapreduce.Job: Counters: 23
	File System Counters
		FILE: Number of bytes read=446021466
		FILE: Number of bytes written=114627807
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=535015319
		HDFS: Number of bytes written=52267476
		HDFS: Number of read operations=391608
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=195363
	Map-Reduce Framework
		Map input records=27862
		Map output records=27862
		Input split bytes=56007
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=66
		Total committed heap usage (bytes)=62037426176
	MasteringHadoop.MasteringHadoopCounters$WORDS_IN_LINE_COUNTER
		LESS_THAN_FIVE_WORDS=8449
		MORE_THAN_FIVE_WORDS=19413
		ZERO_WORDS=6766
	File Input Format Counters
		Bytes Read=1817707
	File Output Format Counters
		Bytes Written=189102
The first step in creating a counter is to define a Java enum with the names of the counters. The enum type name becomes the counter group, as shown in the following snippet:
public static enum WORDS_IN_LINE_COUNTER {
    ZERO_WORDS,
    LESS_THAN_FIVE_WORDS,
    MORE_THAN_FIVE_WORDS
};
When a condition that warrants incrementing a counter is encountered, the counter can be retrieved by passing its name to the getCounter() call on the task's context object. Counters expose an increment() method that globally increments the value of the counter.
The getCounter() method on the context has a couple of other overloads. One of them can be used to create a dynamic counter by specifying a group name and a counter name at runtime.
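To make the runtime-created grouping concrete, the following standalone sketch (plain Java, not the Hadoop API; all names here are illustrative) models a dynamic counter as a nested map keyed first by group name and then by counter name, created on first use, which is conceptually what context.getCounter(groupName, counterName) does:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's dynamic counters: a two-level map
// keyed by group name, then counter name, where a counter springs into
// existence the first time it is incremented.
public class DynamicCounterSketch {
    private static final Map<String, Map<String, Long>> COUNTERS = new HashMap<>();

    // Mirrors the String-based getCounter(group, name).increment(by) usage.
    public static void increment(String group, String name, long by) {
        COUNTERS.computeIfAbsent(group, g -> new HashMap<>())
                .merge(name, by, Long::sum);
    }

    public static long value(String group, String name) {
        return COUNTERS.getOrDefault(group, new HashMap<>())
                       .getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        // Both the group and counter names are invented at runtime:
        increment("LineStatistics", "BLANK_LINES", 1);
        increment("LineStatistics", "BLANK_LINES", 1);
        System.out.println(value("LineStatistics", "BLANK_LINES")); // prints 2
    }
}
```

In a real job, the equivalent call inside a mapper or reducer would be context.getCounter("LineStatistics", "BLANK_LINES").increment(1), with both strings chosen at runtime rather than declared in an enum.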
The Mapper class, given in the following code snippet, illustrates incrementing the WORDS_IN_LINE_COUNTER group counters based on the number of words in each sentence of a grant proposal:
public static class MasteringHadoopCountersMap extends Mapper<LongWritable, Text, LongWritable, IntWritable> {

    private IntWritable countOfWords = new IntWritable(0);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        int words = tokenizer.countTokens();

        // An else-if chain keeps the branches mutually exclusive, so a
        // zero-word line does not also fall through to the final else.
        if (words == 0) {
            context.getCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS).increment(1);
        } else if (words <= 5) {
            context.getCounter(WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).increment(1);
        } else {
            context.getCounter(WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).increment(1);
        }

        countOfWords.set(words);
        context.write(key, countOfWords);
    }
}
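The branching can be exercised outside Hadoop. The following minimal sketch applies the intended StringTokenizer-based categorization to sample lines; the class and method names are made up for illustration and are not part of the job code:

```java
import java.util.StringTokenizer;

// Standalone check of the mapper's categorization logic (no Hadoop needed).
// The categorize() helper is illustrative only.
public class WordsInLineCheck {
    public static String categorize(String line) {
        int words = new StringTokenizer(line).countTokens();
        if (words == 0) {
            return "ZERO_WORDS";
        } else if (words <= 5) {
            return "LESS_THAN_FIVE_WORDS";
        } else {
            return "MORE_THAN_FIVE_WORDS";
        }
    }

    public static void main(String[] args) {
        System.out.println(categorize(""));                  // ZERO_WORDS
        System.out.println(categorize("three short words")); // LESS_THAN_FIVE_WORDS
        System.out.println(categorize("a line containing more than five words"));
        // the last line prints MORE_THAN_FIVE_WORDS
    }
}
```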
Counters are global variables in a distributed setting and must be used prudently. The more counters a job defines, the greater the overhead on the framework that tracks them. Counters should therefore not be used to aggregate very fine-grained application statistics.