MapReduce in Mongodb

Wikipedia defines it like:

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.


This article can be broken up into two parts. The first part covers the basic meaning and understanding of the terminologies and the flow involved. The second part, we will explore an extremely basic example of a mapreduce on mongodb.

MapReduce: What is it?

A mapReduce operation involves two phases.

Map Phase : Mongodb applies this phase to each document. A document in mongodb is analogous to a row in a sql database. The purpose of map phase is to emit key-value pairs. This means for every document a map phase can emit numerous key-value pairs. In case of multiple key-value pairs with same keys, the values with same key get merged. For eg. A => 9 and A => 10 will get merged to A => [9, 10]

If the above text doesn’t make much of sense to you, no worries, we will see all of this in action in the next part of the article.

Reduce Phase: The map phase is followed by the reduce phase. The reduce phase is called for every key value pair. The reduce phase processes the values and outputs a result per key.

Following diagram illustrates a very simple example of mapReduce.

The collection consists of multiple test results of students. The map phase emits name and scores as key value pairs. The reduce phase sums up all the scores.

This mapreduce output the total number of marks a student has scored in all the test. Batman scores the most, obviously. (Why ? Because he is the BATMAN !)

MapReduce: Time for action

We will consider an extremely simple scenario, which arguably may not be the best scenario for a map reduce, but sufficient and reasonable for understanding the paradigm.

Problem Statement:
We are given a dataset that consists of names of students in a school. We need to count the number of students according to the first letter of their name.
For eg,

DataSet = {Batman, Bane, Joker}
2 students have names starting with B.
1 student has name starting with J.

I have prepared a sample dataset of names.

Mongodb allows us to run javascript, so the map and reduce functions will be written in javascript.

The Map phase:
This function will emit key-value pairs with “first letter of name” as the key and “1” as the value.

        emit(, 1);


The Reduce phase
The reduce function will sum all the 1’s at each key.

function(key, values) {

           return Array.sum(values);


Mongodb provides the mapReduce command to run mapreduce operations.
Refer for the syntax of mapReduce command:

This screenshot illustrates the output of the above mapreduce.

git clone -b article/mapreduce-in-mongodb git@ to get db dump and code snippets.
Refer :

Mentioned earlier, As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.

The real power of mapreduce can be exploited in parallel computing. Imagine above dataset to be in millions, We can share above database on multiple machines and run all map and reduce phase parallel on those multiple machines. The result of the map-reduce operations then will be further processed by the same reduce function, giving us the required result, thus increasing our computation speed multi fold.

Certain things need to be kept in mind while writing a reduce function. Will cover those in a future article.

By – Saurabh Mehta, Senior Software Engineer

adminMapReduce in Mongodb
Share this post

Join the conversation

Related Posts