INDIA +91 882 616 7094 | USA +1 949 299 0280 | GERMANY +49 176 3465 1507 info@navyuginfo.com

Wikipedia defines it like:

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.

Refer: https://en.wikipedia.org/wiki/MapReduce

This article can be broken up into two parts. The first part covers the basic meaning and understanding of the terminologies and the flow involved. The second part, we will explore an extremely basic example of a MapReduce on MongoDB.

MapReduce: What is it?

A MapReduce operation involves two phases.

Map Phase: MongoDB applies this phase to each document. A document in MongoDB is analogous to a row in a SQL database. The purpose of map phase is to emit key-value pairs. This means for every document a map phase can emit numerous key-value pairs. In case of multiple key-value pairs with same keys, the values with the same key get merged. For eg. A => 9 and A => 10 will get merged to A => [9, 10]

If the above text doesn’t make much of sense to you, no worries, we will see all of this in action in the next part of the article.

Reduce Phase: The map phase is followed by the reduce phase. The reduce phase is called for every key-value pair. The reduce phase processes the values and outputs a result per key.

Following diagram illustrates a very simple example of MapReduce.

The collection consists of multiple test results of students. The map phase emits name and scores as key-value pairs. The reduce phase sums up all the scores.

This MapReduce output the total number of marks a student has scored in all the test. Batman scores the most, obviously. (Why? Because he is the BATMAN !)

MapReduce: Time for action

We will consider an extremely simple scenario, which arguably may not be the best scenario for a map reduce, but sufficient and reasonable for understanding the paradigm.

Problem Statement:
We are given a dataset that consists of names of students in a school. We need to count the number of students according to the first letter of their name.
For eg,

DataSet = {Batman, Bane, Joker}
Output:
2 students have names starting with B.
1 student has name starting with J.

I have prepared a sample dataset of names.

MongoDB allows us to run javascript, so the map and reduce functions will be written in javascript.

The Map phase:
This function will emit key-value pairs with “first letter of the name” as the key and “1” as the value.

function(){ 
        emit(this.name.charAt(0), 1);

}

The Reduce phase
The reduce function will sum all the 1’s at each key.

function(key, values) {

           return Array.sum(values);

}

MongoDB provides the MapReduce command to run MapReduce operations.
Refer to the syntax of MapReduce command: https://docs.mongodb.com/manual/reference/command/mapReduce/#dbcmd.mapReduce

This screenshot illustrates the output of the above MapReduce.

git clone -b article/mapreduce-in-mongodb git@192.168.3.4:root/Discourse_Snippets.git to get db dump and code snippets.
Refer : http://gitlab.navyuginfo.com/root/Discourse_Snippets/tree/article/mapreduce-in-mongodb/mapreduce-in-mongodb

Mentioned earlier, As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.

The real power of MapReduce can be exploited in parallel computing. Imagine above dataset to be in millions, We can share above database on multiple machines and run all map and reduce phase parallel on those multiple machines. The result of the map-reduce operations then will be further processed by the same reduce function, giving us the required result, thus increasing our computation speed multifold.

Certain things need to be kept in mind while writing a reduce function. Will cover those in a future article.