MapReduceIntroLab

By Todd Frederick, 10/08

Introduction
Google needed a way to efficiently write distributed programs that process large amounts of data, so it developed the MapReduce framework. For those of you familiar with how Google interviews everyone, including presidential candidates, it should come as no surprise that the core operation of MapReduce is a large sort:


 * 1) You write a mapper that reads raw data and produces results.
 * 2) MapReduce sorts these results and passes them to...
 * 3) Your reducer, which outputs other results.

Unfortunately, Google doesn't let anyone else use MapReduce, including Yahoo. So the Apache Project, in this case strongly supported by Yahoo, developed Hadoop, a clone of MapReduce and other Google projects. We will focus on Hadoop Map-Reduce and the Hadoop Distributed File System, a clone of the Google File System. Later, you may also want to read about Hadoop HBase, a less-mature clone of Google's BigTable. From now on we will use Hadoop's terminology for internal Map-Reduce and DFS concepts.

Map-Reduce makes data processing faster, but any worthwhile operation will still take lots of wall clock time. Unlike interactive programs used at workstations, Map-Reduce is typically invoked from batch jobs. Your program and its parameters constitute a job that you submit to a cluster of computers for processing.

A simple example of a Map-Reduce job is a word frequency counter: how many times does each word in a body of text appear? Now we will try compiling and running a pre-written word count program in C++ and Java.

Using Helios
Helios is our production Beowulf cluster. Check it out sometime in our temporary server room.

Connecting
We'll use SSH to work with Helios.

ssh helios.public.stolaf.edu

You now have a shell on the Helios administrative node. From here, Hadoop clients can connect to the cluster nodes that actually run Map-Reduce jobs and store data in the DFS.

DFS
The DFS exists to make disk access efficient for Map-Reduce jobs. We'll create a short text file called blindmice.txt, containing the poem below, to use as input for our job.

Three blind mice, three blind mice! See how they run, see how they run! They all ran after the farmer's wife, She cut off their tails with a carving knife. Did you ever see such a thing in your life As three blind mice?

Now we'll put the file on the DFS, which has a structure somewhat like a UNIX filesystem. My home directory on the DFS is /user/frederit, for example. In the next step, you will specify input and output paths for your job on the DFS. Each of these paths must be directories, not single files. So we create a new directory and put the poem into it.

hadoop fs -mkdir poems_about_mice
hadoop fs -put blindmice.txt poems_about_mice

Note that my file is now located at /user/frederit/poems_about_mice/blindmice.txt on the DFS.

Word Count in C++
We will now run the standard word count example from the Hadoop documentation in both C++ and Java. Hadoop is written in Java, but an interface called Hadoop Pipes allows mappers and reducers to be written in C++.

cp /home/courses/hadoop/* .

Take a quick peek at the source file and see if you can tell what's going on. The Makefile is easier to understand, or at least to modify.

cat wordcount.cpp
cat Makefile

Now make the wordcount binary.

make

We copy the executable to the DFS so Pipes can find it.

hadoop fs -mkdir bin
hadoop fs -put wordcount bin

Pipes uses a special configuration file to find out what binary to run for a given job. Modify the hadoop.pipes.executable property in wordcount.xml, replacing the value with the DFS path to your binary.

vi wordcount.xml

<property>
  <name>hadoop.pipes.executable</name>
  <value>/user/frederit/bin/wordcount</value>
</property>

Finally it's time to run the job. If you run this more than once, remember to specify a different output directory each time or delete the old one.

hadoop pipes -conf wordcount.xml -input /user/frederit/poems_about_mice -output /user/frederit/pipes1

When your job completes, you can see the output.

hadoop fs -cat /user/frederit/pipes1/part-00000

Word Count in Java
We can also run an equivalent job developed in Java. Using Java gives you the most control over Map-Reduce, though you may find C++ more efficient than Java.

Look at the Java source to see what looks familiar. Notice that this approach puts most configuration settings in the source, rather than a separate XML file.

cat WordCount.java

Now we compile the source. Hadoop likes working with JARs, so we package our compiled classes into one.

mkdir classes
javac -classpath /home/apps/hadoop/hadoop-0.18.1-core.jar -d classes WordCount.java
    (use /opt/hadoop/hadoop-0.18.1-core.jar on mist)
jar cvf WordCount.jar -C classes edu

To run the job, we specify the JAR, the class, and any arguments required by the class's entry point. In this case, WordCount accepts an input path and an output path. Once again, remember that you can't use an existing output path.

hadoop jar WordCount.jar edu.stolaf.cs.WordCount /user/frederit/poems_about_mice /user/frederit/java1

You can view the output with hadoop fs -cat just as you did for the C++ example.

Parts of a Map-Reduce Job
Let's walk through the different components of Map-Reduce and see how they apply to the WordCount example. We will refer to the Java interface to avoid confusion.

Data Input
Data usually starts as files in DFS that are read into Map-Reduce. You can customize exactly how Map-Reduce reads your data, but typically you will choose a built-in format. The default is TextInputFormat, which produces an input key-value pair for each line of the input file, where the key is the line's byte offset in the file and the value is the line itself.

You should also be aware of how, at a lower level, Map-Reduce handles large input files. The framework will divide each input file into blocks of 64MB by default. A separate mapper task is started for each block of input data. Each node in the cluster may only be able to run so many tasks at a time, 2 in our case, so not all tasks may actually start at once.
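
To make this concrete, here is roughly how the input side is configured with the old org.apache.hadoop.mapred API in Hadoop 0.18. This is a sketch of driver code (the body of a main() or run() method, with org.apache.hadoop.mapred.* and org.apache.hadoop.fs.Path imported), not necessarily what the provided WordCount.java does line for line.

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

// TextInputFormat is the default; setting it explicitly just makes the choice visible.
conf.setInputFormat(TextInputFormat.class);

// The input path is a DFS directory; every file in it is split into blocks,
// and one mapper task is started per split.
FileInputFormat.setInputPaths(conf, new Path("/user/frederit/poems_about_mice"));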

Mapper
You probably guessed that this is one of the two most important parts of a Map-Reduce job. Naturally, it is also one of the more flexible.


 * Input: 1 input key-value pair
 * Output: 0 or more intermediate key-value pairs

The type of the input key-value pair is determined by the input format. For TextInputFormat, the key is a LongWritable and the value is a Text. By default, the types for the intermediate key-value pairs are the same as those for the final output key-value pairs. In WordCount, these are Text and IntWritable, respectively. If you are familiar with generics, notice how the Mapper implementation in WordCount declares the input and output key and value types it works with.

The map method in WordCount processes a line of text and outputs a key-value pair for each instance of a word it encounters. The value of 1 means that we found that word once.
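
For reference, a word-count mapper in the 0.18 API looks roughly like the one in the standard Hadoop tutorial. The class and field names below follow that tutorial and may not match the provided WordCount.java exactly; the usual imports (java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*) are assumed.

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input pair: key is the line's byte offset, value is the line itself.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);   // emit (word, 1) for every word on the line
    }
  }
}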

Combiner
Try reading the Reducer section first, then come back here.

In WordCount, we will likely find the same word more than once on a given line, or in all the lines processed in a single mapper task. We really don't need to send all of these key-value pairs into the distributed sort; instead, we can "pre-reduce" the key-value pairs before they leave a mapper task.

For example, say we run WordCount on all of Wikipedia, and we find the word "walrus" 325 times in the first 64MB worth of articles. Rather than stress out the sort with 325 pairs of the form "walrus, 1", we can combine these pairs into a single "walrus, 325". Note that we may also find "walrus" 243 times in the next 64MB of articles, processed on a different machine, so we still need the reducer.

In most cases, but not all, the combiner and reducer are identical. Just remember that the combiner runs in the mapper task after the mapper itself but before the sort, while the reducer runs in a reducer task after the sort.
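
In the 0.18 API, the mapper, combiner, and reducer are all registered on the JobConf from the Data Input sketch above. WordCount can simply reuse its reducer class as the combiner (class names again follow the standard tutorial):

conf.setMapperClass(Map.class);       // the mapper from the previous section
conf.setCombinerClass(Reduce.class);  // "pre-reduces" within each mapper task, before the sort
conf.setReducerClass(Reduce.class);   // the real reduce, after the sort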

Partitioner and Comparators
Again, reading the Reducer section first might make this clearer. Just as a mapper task processes many input key-value pairs, a reducer task processes many intermediate key-value pairs. Map-Reduce has default behaviors for deciding what reducer task an intermediate key-value pair should go to (partitioner), how to sort intermediate keys (output key comparator), and when two intermediate keys should be considered the same (output value grouping comparator). For now, just be aware that all of these can be customized in more advanced jobs.
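
All three hooks are also JobConf settings. The lines below spell out roughly the defaults for a Text key (HashPartitioner lives in org.apache.hadoop.mapred.lib), just to show where a custom class would plug in:

conf.setPartitionerClass(HashPartitioner.class);               // which reducer task gets each key
conf.setOutputKeyComparatorClass(Text.Comparator.class);       // how intermediate keys are sorted
conf.setOutputValueGroupingComparator(Text.Comparator.class);  // which keys count as "the same" in reduce()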

Reducer
Reducers distill the sorted intermediate key-value pairs and produce final output.


 * Input: 1 intermediate key, 1 or more intermediate values
 * Output: 0 or more output key-value pairs

All the intermediate key-value pairs that share the same key are run through the reducer together. In WordCount, all the counts for the same word are collected together, and the reduce method simply adds them up and outputs the total count.
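
The matching reducer, again roughly as in the standard tutorial (same imports as the mapper sketch, plus java.util.Iterator):

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per distinct key, with an iterator over all of that key's values.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // add up all the counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}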

Recall that the number of mapper tasks is determined by the number of input splits. The number of reducer tasks, however, must be explicitly set by your program. For this introductory example, we just used one reducer task, but clearly more sizeable jobs will require more.
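
The reducer count is one more JobConf setting, for example:

conf.setNumReduceTasks(1);   // a single reducer produces one part-00000 file; bigger jobs will want more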

Data Output
WordCount uses TextOutputFormat, which writes each output key-value pair to a line in a text file.
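
The output side of the configuration sketch mirrors the input side. TextOutputFormat is also the default, so WordCount may not set it explicitly:

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setOutputFormat(TextOutputFormat.class);

// The output directory must not already exist.
FileOutputFormat.setOutputPath(conf, new Path("/user/frederit/java1"));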

Sequence files for cascading jobs
WordCount outputs text files, which are great for human readers, but what if you want to read the output from one Map-Reduce job as the input for a second Map-Reduce job? Use SequenceFileOutputFormat and SequenceFileInputFormat to write and read key-value pairs without worrying about serialization. These work much as you might expect: whatever key and value classes you use to write a sequence file are the same classes you must use to read from that sequence file.
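
Here is a rough sketch of how two chained jobs could be wired together; the /user/frederit/counts path and the FilterJob class are made up for illustration, and each JobConf would still need its mapper, reducer, and remaining settings:

// Job 1: word count, but writing binary sequence files instead of text.
JobConf countConf = new JobConf(WordCount.class);
countConf.setOutputKeyClass(Text.class);
countConf.setOutputValueClass(IntWritable.class);
countConf.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(countConf, new Path("/user/frederit/counts"));
JobClient.runJob(countConf);

// Job 2: reads the sequence files back; keys and values arrive as Text and IntWritable,
// exactly the classes job 1 used to write them.
JobConf filterConf = new JobConf(FilterJob.class);
filterConf.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(filterConf, new Path("/user/frederit/counts"));
JobClient.runJob(filterConf);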

Use sequence files to create two jobs, a word count and some filter or sorter that operates on the word count data.