Thursday, September 30, 2010

Joins in CouchDB: Making it work

In a previous post we considered the problem of how to use CouchDB to model multiple types of documents and the relationships between them. Storing each type of document in a separate database didn't appear to resolve the problem, so my next approach was to add a resource "type" to every document and store everything in a single database. A general overview of this approach can be found in this blog post on the subject.

Let's dig in a bit by considering the three sample queries discussed in the previous post. Our friends at Shot of Jaq have since pulled the plug on their "shotcast" so we need a new subject. If only there were a prominent open-source project that were related to our investigation of CouchDB.... hmmm, perhaps we're on to something here. Let's consider tweets containing "#couchdb", the authors of those tweets and the followers of those authors. The three questions we were attempting to answer thus become the following:

1. For each author: what language do they write their tweets in and how many followers do they have?

We can answer this question in the multi-database case with a map function and we can do the same here as long as we account for the "type" property added to each document:


function(doc) {

if (doc.resource == "author") {

emit(doc.id,
{name: doc.name,
language: doc.lang,
followers: doc.followers_count});
}
}


2. How many tweets did each author write?

We were also able to answer this question in the multi-database case using a map and reduce function applied to the tweets database. Here again we can accomplish the same thing by accounting for the resource type:


// Map function
function(doc) {

if (doc.resource == "tweet") {

emit(doc.from_user,1);
}
}

// Reduce function
function(keys, values) {

return sum(values);
}


3. How many tweet authors are also followers of @damienkatz

The only question we couldn't answer using our previous model. After a number of false starts a solution to this problem did emerge. That solution is based on two principles:


  1. Keys returned by a map function can be lists of values rather than simple values

  2. By defining a grouping level we can bundle lists with a common group of leading elements into a single reduce call



Our map function must export information about all tweet authors as well as all followers of Damien into the intermediate key-value space. Each document of type author contains a distinct ID value while each document of type followers contains a list of follower IDs in the ids property. If our map function defines it's keys as a list containing the author/follower ID as the first element we can then use a grouping function to correlate author information to information about a follower. Something like the code below should do the trick:


function(doc) {

// Add keys for each author, keyed by their distinct
// Twitter ID
if (doc.resource == "author") {

emit([doc.id,"name"],doc.screen_name);
}
else if (doc.resource == "followers" &&
doc.screen_name == "damienkatz") {

// Also add keys for each follower of Damien's,
// again keyed by Twitter ID. This will allow
// us to use CouchDB's grouping functionality
// to correlate author IDs to followers.
for (var i = 0; i < doc.ids.length; ++i) {

emit([doc.ids[i],'is_following'],1);
}
}
}


Assuming the grouping operation described above is applied our reduce function will have one of three possible types of inputs:


  • An is_following key only. This corresponds to a follower who did not write a tweet with the tag of #couchdb

  • A name key only; this identifies a tweet author that is not a follower of Damien

  • A name and is_following tag point to followers who have also been tweeting.



The code below should accomplish what we're after:


function(keys, values, rereduce) {

// Make our reduce function safe for rereduce
if (rereduce) {
return values
}
else {

// Strip keys out of (key,id) tuples. Even
// though we're using a grouping function
// keys are of the form [Twitter ID, keyname]
// so we need to take the second item from
// whatever we find.
var truekeys = keys.map(function(arr) { return arr[0][1]; });

// Build an object containing key-value pairs.
// It's a bit like a simplified version of zip
// without all the error checking and special
// case handling.
var foo = {};
for (var i = 0; i < truekeys.length; ++i) {
foo[truekeys[i]] = values[i];
}

// If we have a "name" and an "is_following"
// value of 1 we've found an author who is
// also following Damien.
if (("name" in foo) &&
("is_following" in foo) &&
foo["is_following"] == 1) {
return foo["name"];
}
}
}


Testing this script on a CouchDB instance built from the PopulateSingleDatabase.py script give output like the following:


python EvalTemporaryView.py -d twitter -m js/map_3a.js -r js/reduce_3a.js -g 1
{"rows":[
...
{"key":[3049901],"value":null},
{"key":[3091161],"value":"SoreGums"},
{"key":[3123671],"value":null},
...


CouchDB will display a value for every key grouping, and most of these results will be null (since the majority of Damien's followers aren't tweeting about CouchDB on a regular basis). Non-null values indicate followers who are also tweet authors; filtering out the null values will identify exactly the population we're after.

Updated map and reduce functions for the single database case (as well as some minor updates to work with CouchDB 1.0.x) can be found on github.