Saturday, July 3, 2010

Joins in CouchDB: How not to do it

My original motivation for experimenting with the dispatch library was an attempt to leverage my familiarity with a general technology (RESTful Web services) to gather experience with a pair of tools (CouchDB, Scala) which were less familiar to me. This turned out to be a spectacularly dumb idea, largely due to a problem that should have been clear at the outset: it's difficult to leverage general experience when you don't have significant experience with any of the tools involved. A better approach might be to use a more familiar tool (Python perhaps) to monkey around with CouchDB and do a bit of woodshedding with Scala on the side. Sounds sensible enough... so what now?

One of the facets of CouchDB that has always been unclear to me is how one might join two or more databases together. A bit of context first: my general understanding of a CouchDB database was that it was an unordered collection of documents of some common type. Within a database views could be used to return documents in a certain order them, select certain subsets of the documents or transform some set of documents into something else entirely. If we were to attempt to relate this model to a SQL database [1] then a CouchDB database is equivalent to a table while a view is equivalent to a select statement on that table.

This model leads to a simple question: if I have database A for documents of type A and database B for documents of type B how can I relate documents in A to documents in B? The natural answer would be some kind of view but views in CouchDB are defined as documents within a given database and there is no notion of a cross-database or global view.

Let's try and make this a bit more concrete. We'll store a set of data gathered from Twitter (using the brilliant Python Twitter Tools package) in a CouchDB instance and interrogate that data in an attempt to answer a few questions. We're interested in the following set of resources:


  1. The first 100 tweets corresponding to a search term

  2. Information about the set of authors for these tweets

  3. The set of followers for each of these authors



Information about these resources will be stored in databases named (respectively) tweets,authors and followers. Once we've obtained this data we'd like to get answers to the following questions:


  1. For each author: what language do they write their tweets in and how many followers do they have?

  2. How many tweets did each author write?

  3. For a given tweet author foo how many tweet authors are also followers of foo?



We'll use the nice folks at Shot of Jaq as our reference point by searching for tweets containing #shotofjaq.

With our multi-database model we can easily answer the first two questions; the first can be answered with a simple map function applied to the authors database and the second with a map and reduce function applied to the tweets database. The third question poses more of a problem. A correct answer requires data from both the authors and followers database and there is simply no way to access both databases using the standard map-reduce functionality. Our model appears to be broken.

Complete code for this example (including a simple Python script to submit temporary views to a CouchDB database) can be found at Github. Note that this repository also includes code for the correct single-database solution to this problem, so no peeking until the next post goes up!

[1] Yes, the CouchDB folks go to great lengths to point out that this is a silly thing to do. I employ this comparison only to illuminate the question at hand.

2 comments:

  1. This is probably clear by now, but I'd suggest putting all your docs in on database, and using views to pull out the ones you are interested in. An additional advantage to keeping the docs together in a database is that they will replicate together.

    ReplyDelete