Wednesday, July 27, 2011

Min, max and MongoDB

What else can we say about MongoDB? Sure, we could talk about built-in map-reduce or the expressive query language or
journaling and sharding improvements in 1.8. But all of this is by now well-travelled territory, and that's a good thing. In addition to the features in the database itself, 10gen and the community have done yeoman work in producing a rich collection of documentation describing MongoDB. This collection includes the MongoDB cookbook, and within that cookbook is a recipe illustrating how to find the maximum and minimum values of a field within a MongoDB collection using map-reduce. And that's where the problem started.

Solving this problem with map-reduce seems like entirely the wrong approach to me, although admittedly I'm new here. A solution using map-reduce must (almost by definition) access every document within the collection. This kind of approach makes sense if we're interested in max and min values for every key in the collection. But if we're only interested in one key (or some small subset of keys), this approach has us doing more work than we need to. MongoDB has strong support for indexes within a collection; we should be able to leverage these indexes and look only at the documents for the keys we're interested in.

In order to make this work we need to be able to retrieve a max or min using only MongoDB's query language. We begin by finding all documents that match the target key or keys. We then sort these documents on the field of interest, in descending or ascending order depending on whether we're after the max or min (respectively). Finally, we limit the result set to a single document: that document holds the max or min value of the specified field.
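
In shell terms the pattern looks like this (the collection, key and field names here are placeholders, not anything from the data set below):


// max of some_field among documents matching some_key; flip the sort direction for the min
db.some_collection.find({some_key: some_value}).sort({some_field: -1}).limit(1)
db.some_collection.find({some_key: some_value}).sort({some_field: 1}).limit(1)
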

To try this out, we expand upon the example (used in the cookbook) of blog posts, each of which has an author field, some content and a page_views field for tracking hits on each article. To make things interesting we create a large sample data set: 10,000 distinct authors, each of whom has written 100 posts. Each post has been viewed fewer than 2,000 times. We want to compute the max and min page views for a given author.
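
A document in the collection thus ends up looking something like this (the values are taken from the query output further down):


{
	"_id" : ObjectId("..."),
	"author" : "Zzxpnsnqjskpvpfwtpyeexpr",
	"content" : "The quick brown fox jumped over the lazy dog",
	"page_views" : 1974
}
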

The following Ruby script generates a set of random data for us, storing that data in a set of YAML files to enable repeated load operations:
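
A minimal sketch of such a generator (script and helper names here are illustrative; it assumes random author strings, a fixed content sentence, page_views under 2,000, and output split across 20 files named random_data_partNN.yaml to match the load output further down) might look like this:


# generate_data.rb (sketch): build the sample blog-post data set and write it
# out as YAML so the load can be repeated.
require 'yaml'

AUTHOR_COUNT     = 10_000
POSTS_PER_AUTHOR = 100
FILE_COUNT       = 20
CONTENT          = 'The quick brown fox jumped over the lazy dog'
LETTERS          = ('a'..'z').to_a

# Random author names along the lines of "Zzxpnsnqjskpvpfwtpyeexpr"
def random_author
  Array.new(24) { LETTERS.sample }.join.capitalize
end

# Build every post in memory; fine for a one-off data set of this size
posts = []
AUTHOR_COUNT.times do
  author = random_author
  POSTS_PER_AUTHOR.times do
    posts << { 'author' => author, 'content' => CONTENT, 'page_views' => rand(2000) }
  end
end

# Split the documents evenly across FILE_COUNT YAML files
posts.each_slice(posts.size / FILE_COUNT).each_with_index do |slice, i|
  File.open(format('random_data_part%02d.yaml', i + 1), 'w') { |f| YAML.dump(slice, f) }
end
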



We use the following Ruby script to load the data in these YAML files into our MongoDB instance:
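
Again a sketch: it assumes the mongo gem's Mongo::Connection API of the era, the indexing database and collection used in the shell sessions below, and batched inserts for speed.


# load_data.rb (sketch): push the generated YAML files into MongoDB.
require 'yaml'
require 'mongo'

collection = Mongo::Connection.new('localhost').db('indexing').collection('indexing')

Dir.glob('random_data_part*.yaml').sort.each do |file|
  puts "Loading data from file #{file}"
  docs = YAML.load_file(file)
  # Insert in batches rather than one document at a time
  docs.each_slice(1_000) { |batch| collection.insert(batch) }
end
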



Now that the data's loaded, we move on to our queries. Let's start with the map-reduce approach, just to verify that it behaves as we expect. We'll start off with the mongo shell and re-use the map and reduce functions defined in the cookbook entry:
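
The pair below is a paraphrase of the cookbook functions rather than a verbatim copy, keyed on author and carrying along the _id of the current best document so that the output matches the shape shown in the results:


map = function() {
    // each post starts out as both the min and the max for its author
    var entry = { page_views: this.page_views, _id: this._id };
    emit(this.author, { min: entry, max: entry });
};

reduce = function(key, values) {
    // keep the smallest min and the largest max seen so far
    var result = values[0];
    for (var i = 1; i < values.length; i++) {
        if (values[i].min.page_views < result.min.page_views) result.min = values[i].min;
        if (values[i].max.page_views > result.max.page_views) result.max = values[i].max;
    }
    return result;
};
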



We expect this operation to access every document in the collection. To provide a reference point, we count the documents before running the map-reduce:


[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000
> db.indexing.mapReduce(map,reduce,{out: {inline:true}})
...
		{
			"_id" : "Zzxpnsnqjskpvpfwtpyeexpr",
			"value" : {
				"min" : {
					"page_views" : 16,
					"_id" : ObjectId("4e2fa865e8f06a08850b5735")
				},
				"max" : {
					"page_views" : 1974,
					"_id" : ObjectId("4e2fa865e8f06a08850b5709")
				}
			}
		}
	],
	"timeMillis" : 177968,
	"counts" : {
		"input" : 1010000,
		"emit" : 1010000,
		"output" : 10000
	},
	"ok" : 1
}


As expected, we get the max and min values for all authors; also as expected, we have to look at every document to do it. Can we do any better with queries? Not right off the bat:


> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1)
{ "_id" : ObjectId("4e2fa865e8f06a08850b5709"), "author" : "Zzxpnsnqjskpvpfwtpyeexpr", "page_views" : 1974, "content" : "The quick brown fox jumped over the lazy dog" }
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':1}).limit(1)
{ "_id" : ObjectId("4e2fa865e8f06a08850b5735"), "author" : "Zzxpnsnqjskpvpfwtpyeexpr", "page_views" : 16, "content" : "The quick brown fox jumped over the lazy dog" }
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
	"cursor" : "BasicCursor",
	"nscanned" : 1010000,
	"nscannedObjects" : 1010000,
	...
	"millis" : 1774,
	...
}


A simple index on authors does help us out:


> db.indexing.ensureIndex({'author':1})
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
	"cursor" : "BtreeCursor author_1",
	"nscanned" : 101,
	"nscannedObjects" : 101,
	...
	"millis" : 64,
	...
}


We're now evaluating only the documents representing blog posts by the named author. This approach is clearly faster for gathering per-author max and min values. In fact, at 64 ms per author we'd be only about four times slower than the map-reduce approach even if we gathered values for all 10,000 authors this way. But we can do better. MongoDB's support for compound key indexes offers an additional speedup:


> db.indexing.ensureIndex({'author':1,'page_views':-1})
> db.indexing.ensureIndex({'author':1,'page_views':1})
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
	"cursor" : "BtreeCursor author_1_page_views_-1",
	"nscanned" : 1,
	"nscannedObjects" : 1,
	...
	"millis" : 6,
	...
}


Clearly, for any single author (or a reasonably small subset of authors) this is the approach to use. For that matter, even repeating the query for all 10,000 authors at 6 ms apiece would be roughly three times faster than the map-reduce operation over the entire collection. That said, the comparison isn't as clear-cut as it appears: any attempt to replace the map-reduce approach would first have to find the set of all distinct authors, and that query still has to walk every document (or at least every index entry) in the collection.
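
A shell sketch of what that replacement might look like (hypothetical code; it leans on distinct() and the indexes added above):


// hypothetical replacement for the map-reduce job: one pair of indexed queries per author
db.indexing.distinct('author').forEach(function(author) {
    var max = db.indexing.find({author: author}).sort({page_views: -1}).limit(1).next();
    var min = db.indexing.find({author: author}).sort({page_views: 1}).limit(1).next();
    // ... do something with max.page_views and min.page_views ...
});
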

Of course, we don't get this speedup for free. Index maintenance does impose some overhead, although it's not as bad as you might think. Loading the generated data into a MongoDB instance with no indexes looks something like the following:


[@varese mongo_ruby]$ time ruby load_data.rb
Loading data from file random_data_part01.yaml
...
Loading data from file random_data_part20.yaml

real 7m11.191s
user 0m6.410s
sys 4m25.909s
[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000


If we add the indexes in advance and then load the data, we see this instead:


[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
0
> db.indexing.ensureIndex({'author':1})
> db.indexing.ensureIndex({'author':1,'page_views':1})
> db.indexing.ensureIndex({'author':1,'page_views':-1})
> bye
[@varese mongo_ruby]$ time ruby load_data.rb
Loading data from file random_data_part01.yaml
...
Loading data from file random_data_part20.yaml

real 8m39.601s
user 0m4.208s
sys 4m18.249s
[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
	"cursor" : "BtreeCursor author_1_page_views_1 reverse",
	"nscanned" : 1,
	"nscannedObjects" : 1,
	...
	"millis" : 3,
	...
}


Samples above use MongoDB 1.8.2 and MRI 1.9.2p180 (via RVM) on FC 13.