MapReduce job counters

Counters collect statistics at the job level. They help with quality control, performance monitoring, and problem identification in Hadoop MapReduce jobs. Unlike logs, counters are global: the framework aggregates them across all tasks, so they need no further aggregation before analysis. Counters are organized into logical groups using the CounterGroup class, and every MapReduce job comes with a set of built-in counters.

The following example illustrates the creation of simple custom counters that categorize lines into three buckets: lines with zero words, lines with five or fewer words, and lines with more than five words. When run on the grant proposal subset files, the program produces the following output:

14/04/13 23:27:00 INFO mapreduce.Job: Counters: 23
    File System Counters
        FILE: Number of bytes read=446021466
        FILE: Number of bytes written=114627807
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=535015319
        HDFS: Number of bytes written=52267476
        HDFS: Number of read operations=391608
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=195363
    Map-Reduce Framework
        Map input records=27862
        Map output records=27862
        Input split bytes=56007
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=66
        Total committed heap usage (bytes)=62037426176
    MasteringHadoop.MasteringHadoopCounters$WORDS_IN_LINE_COUNTER
        LESS_THAN_FIVE_WORDS=8449
        MORE_THAN_FIVE_WORDS=19413
        ZERO_WORDS=6766
    File Input Format Counters 
        Bytes Read=1817707
    File Output Format Counters 
        Bytes Written=189102

The first step in creating a counter is to define a Java enum whose constants name the individual counters. The enum type name becomes the counter group name, as shown in the following snippet:

    public static enum WORDS_IN_LINE_COUNTER {
        ZERO_WORDS,
        LESS_THAN_FIVE_WORDS,
        MORE_THAN_FIVE_WORDS
    };

When a condition that warrants incrementing a counter is encountered, the counter can be retrieved by passing its enum constant to the getCounter() call on the task's context object. The returned counter supports an increment() method that adds to its globally aggregated value.
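The increment semantics can be sketched without a cluster. The following stand-alone program mimics the counter group with an EnumMap; the EnumMap and the categorize() helper are illustrative stand-ins, not part of Hadoop's counter framework:

```java
import java.util.EnumMap;
import java.util.StringTokenizer;

public class CounterSketch {
    enum WORDS_IN_LINE_COUNTER { ZERO_WORDS, LESS_THAN_FIVE_WORDS, MORE_THAN_FIVE_WORDS }

    // Same branch structure the Mapper uses to pick a counter for a line.
    static WORDS_IN_LINE_COUNTER categorize(String line) {
        int words = new StringTokenizer(line).countTokens();
        if (words == 0) return WORDS_IN_LINE_COUNTER.ZERO_WORDS;
        if (words <= 5) return WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS;
        return WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS;
    }

    public static void main(String[] args) {
        // The EnumMap stands in for the per-job counter group.
        EnumMap<WORDS_IN_LINE_COUNTER, Long> counters =
                new EnumMap<>(WORDS_IN_LINE_COUNTER.class);
        String[] lines = { "", "a b c", "one two three four five six" };
        for (String line : lines) {
            counters.merge(categorize(line), 1L, Long::sum); // analogous to increment(1)
        }
        System.out.println(counters);
    }
}
```

In the real framework, each task increments its local copy and the framework sums the copies into the job-level total.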

Tip

An application should not use more than 15 to 20 custom counters.

The getCounter() method in the context has a couple of other overloads. One takes a group name and a counter name as strings, which can be used to create a dynamic counter at runtime without declaring an enum.
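A dynamic counter might be incremented as follows inside a Mapper or Reducer; the group and counter names here are hypothetical, chosen only for illustration:

```java
// The counter is created on first use; no enum declaration is needed.
// "LINE_QUALITY" and "BLANK_LINES" are illustrative names.
context.getCounter("LINE_QUALITY", "BLANK_LINES").increment(1);
```

Dynamic counters are useful when the set of counter names is not known until runtime, for example when counting occurrences per category discovered in the data.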

The Mapper class, given in the following code snippet, illustrates incrementing the WORDS_IN_LINE_COUNTER group counters based on the number of words in each line of a grant proposal:

    public static class MasteringHadoopCountersMap extends Mapper<LongWritable, Text, LongWritable, IntWritable> {

        private IntWritable countOfWords = new IntWritable(0);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            int words = tokenizer.countTokens();

            if (words == 0) {
                context.getCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS).increment(1);
            } else if (words <= 5) {
                context.getCounter(WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).increment(1);
            } else {
                context.getCounter(WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).increment(1);
            }

            countOfWords.set(words);
            context.write(key, countOfWords);
        }
    }
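Once the job finishes, the driver can read the aggregated counter values back. The following sketch assumes a configured Job object named job and that the WORDS_IN_LINE_COUNTER enum is in scope:

```java
// Retrieve the job-wide total of a custom counter from the driver.
job.waitForCompletion(true);
long zeroWordLines = job.getCounters()
        .findCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS)
        .getValue();
System.out.println("Lines with zero words: " + zeroWordLines);
```

This is the same mechanism the framework uses to print the counter summary shown at the start of this section.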

Counters are global variables in a distributed setting and must be used prudently. The more counters a job defines, the greater the bookkeeping overhead on the framework that tracks them. Counters should not be used to aggregate very fine-grained application statistics.