Tuesday, December 28, 2010

base64 in Clojure: Once more, with feeling

In a previous post we sketched out a simple base64 implementation in Clojure and measured its performance against clojure-contrib and commons-codec. Our initial implementation used the partition function to split the input into blocks and then applied a sequence of interdependent functions to translate each block into base64-encoded data. This approach worked well enough, but it's not terribly efficient; it traverses the input data once to do the partitioning and then traverses that same data again (in block form) when the encoding function is applied. There's no reason to make multiple passes if we attack the problem recursively rather than iteratively. Time for some further investigation.

Clojure optimizes recursive calls when the loop/recur constructs are used, so we deploy them here. In addition we make significant use of destructuring. Each input chunk of data is routed to the appropriate function via a sequence of cond expressions. These functions are implemented as let expressions which break the chunk into bytes (via destructuring) and then create bindings for the base64 characters representing those bytes by applying the relevant bit manipulations. The body of the let doesn't need to do much beyond combining these character bindings into an appropriate return value.

This refactoring offers the following benefits:


  • The code is a lot cleaner and much easier to read and understand

  • Using the same techniques we should be able to easily implement a decoding function

  • We should see some increase in performance



So, how'd we do? The following is fairly representative:


bartok ~/git/base64-clojure $ clojure fencepost/test/compare_base64.clj
Encoding
Commons-codec
"Elapsed time: 85.219348 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
clojure-contrib
"Elapsed time: 896.538841 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
fencepost/base64_letfn
"Elapsed time: 315.582591 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==
fencepost-recur
"Elapsed time: 215.516695 msecs"
Xm91U1RMbHRAK1ZCcWFqeEsmJEJrfkhIdSsxfStJdGBNc2txUTBkUyQrdCM7Wk9rInN2XFBtKGNTX002aw==


The recursive encoding operation is roughly a third faster than the letfn version. We still take about two and a half times as long as commons-codec (an improvement on our previous performance) but we're now over four times faster than clojure-contrib.

As noted above, a decoding function was also implemented using the same techniques described for the recursive encoding function. The performance improvement for that operation was just as stark:


Decoding
Commons-codec
"Elapsed time: 33.148648 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k
clojure-contrib
"Elapsed time: 1494.2707 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k
fencepost-recur
"Elapsed time: 268.112751 msecs"
^ouSTLlt@+VBqajxK&$Bk~HHu+1}+It`MskqQ0dS$+t#;ZOk"sv\Pm(cS_M6k


Still nowhere near as fast as commons-codec, but we're more than five times faster than clojure-contrib.

As always code can be found on github.

Monday, December 20, 2010

A Funny Thing Happened While Writing Some Haskell....

Suppose that you've just finished Graham Hutton's solid introduction to Haskell and now want to take the language for a quick test drive. You decide to stick with something familiar; after noting that the Prelude doesn't have an equivalent to Python's range function you decide to whip one up yourself. Haskell's syntax is still a bit unfamiliar, however, so you decide to implement this function in a different "mostly functional" language that you have some experience with. You've been looking for a reason to dip your toes into Scheme again (and it seems to be a good fit here [1]) so you install Gambit Scheme and Chicken Scheme and dig in.

Perhaps you begin with a straightforward recursive definition in Scheme:


(define (range start stop step)
  (if (< start stop)
      (cons start (range (+ start step) stop step))
      '()))


This works well enough when start < stop but fails completely when start > stop and step is a negative value (a use case supported by Python's range()). To fix this problem we define two procedures via letrec and use whichever is appropriate based on the passed arguments:


(define (newrange start stop step)
  (letrec
      ((upto (lambda (art op ep)
               (if (< art op)
                   (cons art (upto (+ art ep) op ep))
                   '())))
       (downto (lambda (art op ep)
                 (if (> art op)
                     (cons art (downto (+ art ep) op ep))
                     '()))))
    (if (< start stop)
        (upto start stop step)
        (downto start stop step))))


This function now works for both start < stop (with positive step) and start > stop (with negative step). There's still a problem, though; if we change the sign of step in either of these cases we find ourselves in an infinite loop. Python's range() returns an empty list in this case so we probably should too:


(define (newerrange start stop step)
  (letrec
      ((upto (lambda (art op ep)
               (if (< art op)
                   (cons art (upto (+ art ep) op ep))
                   '())))
       (downto (lambda (art op ep)
                 (if (> art op)
                     (cons art (downto (+ art ep) op ep))
                     '()))))
    (cond ((and (< start stop) (< step 0)) '())
          ((and (> start stop) (> step 0)) '())
          ((< start stop) (upto start stop step))
          (else (downto start stop step)))))


While working on these implementations you discover that SRFI-1 includes a function iota which looks similar to Python's range(), although this function takes the number of elements in the return list rather than the start/stop/step values we're looking for. Still, both Chicken Scheme and Gambit Scheme support SRFI-1 as extensions or modules and we should be able to whip up a thin wrapper around iota to get what we need. Of course we first need to figure out how to load up these modules. And after some initial experimentation we see some curious behaviour in the iota implementation on both platforms...

But wait... weren't we supposed to be writing some Haskell code?

Oh, yeah.

Haskell's guarded equations allow for an easy expression of the same logic we were getting from cond in Scheme. We have no need to translate the recursive logic found in the Scheme version, however, since the list-related functions in the Prelude combined with a few sections give us everything we need:


range art op ep | art < op && ep < 0 = []
                | art > op && ep > 0 = []
                | art < op           = takeWhile (<op) (iterate (+ep) art)
                | art > op           = takeWhile (>op) (iterate (+ep) art)
                | otherwise          = []   -- art == op: empty range, as in Python


In all honesty this is a fairly elegant expression of the algorithm. And while there's clearly a lot to like in Haskell this exercise has rekindled a long-dormant romance with Scheme rather than encouraging exploration of something new.

[1] Yes, I realize Scheme isn't "purely functional". I'm not interested in taking a stand in this particular holy war, but there's little doubt that Scheme's bias against set! and its kin justifies the "mostly functional" qualifier.

Sunday, November 21, 2010

Making Hadoop more "functional" with Scala

Introduction


After spending more time working with Scala I found myself attempting to apply the language to a wider range of domains. One candidate was a long-standing issue with the mechanism used by Hadoop to express map, reduce and/or combine functions. These functions are expressed within the framework via a pair of interfaces, Mapper and Reducer. Jobs are then configured with class references to concrete implementations of these interfaces.

The problem


So what's the problem? These interfaces largely exist to define a single function to perform either a map or reduce operation. Both Mapper and Reducer include a setup() and cleanup() method (as well as a run() method to change runtime behaviour if needed) but the heavy lifting for both is performed within a single method (map() and reduce() respectively). Strip away the overhead and it's clear that in a fair number of cases we really only need an anonymous block of code to implement our map or reduce functions. At least some of the setup work could be removed if this anonymous block of code could somehow "contain" the environment that was present when the function was defined.

It should be obvious that what's being described here is a closure. Scala supports closures, but so do Groovy, JRuby and several other JVM languages. Why prefer Scala to these options? It's all a matter of typing; my belief was that the typing of closures in Scala would provide easier integration with Hadoop. This assumption may prove to be incorrect in the long run (Clojure's strong Java integration might make it a better fit, for example) but it does provide a starting point.
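
To make that concrete, here's a rough sketch of what the regex-driven map step from the grep example discussed below might look like when expressed as a typed Scala closure. The function type and the emit callback are stand-ins of my own rather than the actual Hadoop API (which passes a Mapper.Context), and the Hadoop io classes are assumed to be on the classpath; the point is simply that the closure captures the compiled pattern from its environment and carries an explicit type.


import java.util.regex.Pattern
import org.apache.hadoop.io.{LongWritable, Text}

// Compiled once and captured by the closure; no setup() method needed.
val pattern = Pattern.compile("dfs[a-z.]+")

// A map "function" as a value: (input value, emit callback) => Unit.
// emit stands in for whatever the framework uses to collect output
// (context.write() in the real API).
val grepMap: (Text, (Text, LongWritable) => Unit) => Unit =
  (value, emit) => {
    val matcher = pattern.matcher(value.toString)
    while (matcher.find())
      emit(new Text(matcher.group(0)), new LongWritable(1L))
  }
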

Note that defining these functions inline would prevent the kind of re-use provided by defining this functionality in distinct classes. This is only a concern if the map and/or reduce functions are sufficiently complex; if they reduce down to something in the neighbourhood of a one-liner then re-use isn't a large concern.

Finally, it's worth noting that the problem here does not stem from Hadoop itself. Implementing this functionality in discrete classes is a reasonable approach when the language doesn't provide a clear mechanism for expressing anonymous functions.

A sample


To make this more concrete let's analyze the "grep" sample job referenced in the document describing a single-node Hadoop setup. The main work is performed by a single job using a regex-based map implementation and a reducer which does little more than sum up long values. The meat of the mapping class is as follows:

org.apache.hadoop.mapreduce.lib.map.RegexMapper

public class RegexMapper<K> extends Mapper<K, Text, Text, LongWritable> {

  public static String PATTERN = "mapreduce.mapper.regex";
  public static String GROUP = "mapreduce.mapper.regexmapper..group";
  private Pattern pattern;
  private int group;

  public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    pattern = Pattern.compile(conf.get(PATTERN));
    group = conf.getInt(GROUP, 0);
  }

  public void map(K key, Text value,
                  Context context)
      throws IOException, InterruptedException {
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      context.write(new Text(matcher.group(group)), new LongWritable(1));
    }
  }
}


The reducer implementation doesn't fare much better:


org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer

public class LongSumReducer<KEY> extends Reducer<KEY, LongWritable, KEY, LongWritable> {

  private LongWritable result = new LongWritable();

  public void reduce(KEY key, Iterable<LongWritable> values,
                     Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }

}


That's a lot of boilerplate code in order to implement a pair of very simple functions. Surely we can do better with a language that offers native support for many elements of functional programming.
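
As a hint of where we'd like to end up, here is the whole of the LongSumReducer above collapsed into a single anonymous function. This is a sketch rather than working Hadoop code: the emit callback stands in for context.write() and a Scala Iterable is used in place of the Java one.


import org.apache.hadoop.io.{LongWritable, Text}

// LongSumReducer as one closure: sum the values and emit the total.
val sumReduce: (Text, Iterable[LongWritable], (Text, LongWritable) => Unit) => Unit =
  (key, values, emit) =>
    emit(key, new LongWritable(values.map(_.get).sum))
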

An initial implementation


As a first step we convert the "grep" example into Scala and run it using Hadoop 0.21. The conversion itself was fairly straightforward and the results can be found on github. At present the port isn't much more than a translation of the existing Java code into more idiomatic Scala; the hope is that this code will provide a basis for future work. The code has been verified to run using the following steps:


  1. Add the Scala runtime to the Hadoop classpath by editing hadoop-env.sh:
    export HADOOP_CLASSPATH=$HOME/local/scala/lib/scala-library.jar

  2. Replace org.apache.hadoop.examples.Grep in mapred-examples with our Scala implementation:
    jar uvf hadoop-mapred-examples-0.21.0.jar org/apache/hadoop/examples/Grep*
  3. bin/hdfs namenode -format

  4. bin/start-dfs.sh

  5. bin/start-mapred.sh

  6. bin/hadoop fs -put conf input

  7. bin/hadoop jar hadoop-mapred-examples-0.21.0.jar grep input output-scala 'dfs[a-z.]+'

  8. bin/hadoop fs -cat output-scala/*



Future work


In order to get the most out of writing Hadoop jobs in Scala we need to be able to define these functions directly within the job class. This is complicated somewhat by the current Hadoop architecture; jobs are configured to reference mapper and/or reducer classes rather than instances of those classes. Something akin to Groovy's support for using a closure as an implementation of a class with a Single Abstract Method (SAM) might be useful here, but as near as I can tell Scala does not offer such support.
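
In the meantime the conversion has to be spelled out by hand. The sketch below is built around a hypothetical single-method trait rather than Hadoop's real Mapper (which, as noted, must be supplied as a class rather than an instance), and simply illustrates the sort of adapter boilerplate that Groovy-style SAM coercion would let us skip.


// SimpleMapper and asMapper are illustrative only; they are not Hadoop types.
trait SimpleMapper[K, V] {
  def map(key: K, value: V): Unit
}

// Wrap an anonymous function in the single-method interface by hand.
def asMapper[K, V](f: (K, V) => Unit): SimpleMapper[K, V] =
  new SimpleMapper[K, V] {
    def map(key: K, value: V): Unit = f(key, value)
  }

// Usage: the closure becomes the body of map().
val printLengths = asMapper[String, String]((k, v) => println(k + " -> " + v.length))
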

Sunday, October 31, 2010

base64 in Clojure: An initial attempt

After working through most of Stuart Halloway's excellent Programming Clojure I found myself in search of a meaty example to exercise my Clojure chops. Re-implementing the Project Euler examples in another language didn't seem terribly appealing. Maybe a base64 implementation; I hadn't implemented the encoding in a while and it is a reasonable exercise when learning a new language. So what kind of support does Clojure already have?

As expected there's no base64 support in the core. clojure-contrib does include some support, although the documentation doesn't inspire a great deal of confidence:


This is mainly here as an example. It is much slower
than the Apache Commons Codec implementation or
sun.misc.BASE64Encoder.


Intriguing; looks like we have a winner!

The only complexity in a base64 implementation is handling the one and two byte cases. Any code must support returning the last two bits of the first byte (or the last four bits of the second byte) as distinct values in these cases, as well as combining these bits with the leading bits of the second and third bytes (respectively) in other cases. My initial implementation was a fairly straightforward adaptation of this principle. The core consists of a letfn definition with functions for handling the one, two and three byte cases. Each function returns two values: a list of all 6-bit elements discovered so far as well as a "remainder" value representing the trailing bits in the one and two byte cases. This structure leads to better function re-use; as an example the two-byte function can build on the one-byte function. A wrapper function dispatches to the correct function for a given set of bytes. Now we need only partition our byte array into groups of three or fewer and apply the wrapper function to each group. A combination of the partition and map functions does exactly that.

So how'd we do? We implement a simple test program to generate a set of random data, map a base64 implementation onto each sample and then realize the lazy sequence by converting it into a vector. We time this conversion (for a common set of sample data) for our implementation, the clojure-contrib implementation and commons-codec. The following run is fairly representative:


bartok ~/git/base64-clojure $ source set_classpath.sh
bartok ~/git/base64-clojure $ clojure fencepost/test/compare_base64.clj
Commons-codec
"Elapsed time: 69.074926 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
clojure-contrib
"Elapsed time: 1221.124746 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
fencepost/base64_letfn
"Elapsed time: 375.829153 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=


These results confirm that the clojure-contrib implementation is very, very slow in relation to commons-codec. We've cut that runtime to roughly a third, but we're still five or six times slower than commons-codec. Surely we can do better; in the next installment we'll see how to do so.

As usual all code can be found on github.

Friday, October 22, 2010

Concurrent prime factorization in Go

For a little while now I've been looking for a problem that would allow me to sink my teeth into Google's Go language and see if it's as tasty as it appears to be. After coming across Alex Tkachman's request for concurrency benchmarks it was clear that I had what I needed. Alex is the driving force behind Groovy++ and his proposal was to implement a common benchmark (a concurrent version of prime factorization using the Sieve of Eratosthenes) in multiple JVM languages in order to compare and contrast what works and what doesn't. An implementation in Go wouldn't help much in benchmarking JVM languages but the highly concurrent nature of the problem does map naturally onto Go's feature set. The language distribution also includes an elegant implementation of the Sieve of Eratosthenes in its tests so we already have a solid foundation to build from.

Let's begin with sieve.go. The implementation consists of chained goroutines. Each time a candidate makes it through the sieve and is returned as a new prime number a new goroutine is created to check future candidates and reject them if they divide evenly by the new prime. The implementation cleanly demonstrates the concept but isn't useful as a standalone application since it includes no termination condition. Let's begin by adding that in:


[varese factorization_benchmark]$ cd go
[varese go]$ ls
factor.go sieve.go
[varese go]$ gobuild sieve.go
Parsing go file(s)...
Compiling main (sieve)...
Linking sieve...
[varese go]$ ./sieve -max 20
2
3
5
7
11
13
17
19


That seemed to work nicely. Adding in the factorization code should give us everything we need:


import "fmt"
import "flag"
import vector "container/vector"
...
var target *int = flag.Int("target",1,"Number we wish to factor")
flag.Parse()

t := *target
var rv vector.IntVector

// Retrieve a prime value and see if we can divide the target
// evenly by that prime. If so perform the multiplication
// and update the current value.
ch := make(chan int) // Create a new channel.
go Generate(t,ch) // Start Generate() as a subprocess.
for prime := <-ch; (prime != -1) && (t > 1); prime = <-ch {

for ;t % prime == 0; {
t = t / prime
rv.Push(prime)
}

// Create a goroutine for each prime number whether we use it or
// not. This performs the daisy chaining setup that was being
// done by the Sieve() function in sieve.go.
ch1 := make(chan int)
go Filter(ch, ch1, prime)
ch = ch1
}

fmt.Printf("Results: %s\n",fmt.Sprint(rv))


Simple enough. This leaves us with something that looks fairly robust:


[varese go]$ gobuild factor.go
Parsing go file(s)...
Compiling main (factor)...
Linking factor...
[varese go]$ ./factor -target 60
Results: [2 2 3 5]


Examples here employ gobuild. This tool has proven to be very useful when building and testing Go applications. Complete code for both examples can be found on github. There is some interest in comparing a solution based on goroutines and channels to one based on actors and messages so in the future some Scala code might join the existing Go code.

Sunday, October 10, 2010

Default arguments and implicit conversions in Scala

In a previous post we went in search of an implicit conversion from Int to List[Int] such that each member of the list corresponds to the value at an equivalent position in the input Int (i.e. 987 = List(9,8,7)). At the time we mentioned that a properly tail recursive implementation proved to be a bit more complicated than one might expect. In this post we'll examine these problems in some detail.

A properly tail recursive implementation of this conversion function makes use of an accumulator list to store state as we recurse.


package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest1 extends Suite {

  def int2list(arg:Int, acc:List[Int]):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10, List(newmember) ::: acc)
  }

  implicit def toList(arg:Int) = int2list(arg, List())

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}


The test above passes so everything looks good so far. On a second look, however, we note that the wrapper function toList() is less than ideal. The accumulator needs to be initialized to the empty list in order for the function to work correctly but defining a second function just to pass in an extra arg looks like unnecessary cruft. Scala 2.8 introduced default arguments to address situations such as this; perhaps we can deploy default arguments here to clean up our test:


package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest2 extends Suite {

  implicit def int2list(arg:Int, acc:List[Int] = List()):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10, List(newmember) ::: acc)
  }

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}


When we attempt to execute the test above we're greeted rather rudely by sbt:


[info] Compiling test sources...
[error] .../src/test/scala/org/fencepost/defaults/ImplicitDefaultTest2.scala:21:
value length is not a member of Int
[error] assert(0.length == 1)
[error] ^
[error] .../src/test/scala/org/fencepost/defaults/ImplicitDefaultTest2.scala:22:
0 of type Int(0) does not take parameters
[error] assert(0(0) == 0)
...


Clearly the implicit conversion of Int to List[Int] wasn't in play when this test was executed. But why not? Logically int2list(arg:Int, acc:List[Int] = List()) will convert Ints to List[Int] everywhere int2list(arg:Int, acc:List[Int]) does. We can demonstrate the validity of this claim by fooling the compiler using a variation on the front-end function we used before:


package org.fencepost.defaults

import org.scalatest.Suite

class ImplicitDefaultTest3 extends Suite {

  def int2list(arg:Int, acc:List[Int] = List()):List[Int] = {

    val newmember = arg % 10
    if (arg <= 9)
      List(newmember) ::: acc
    else
      int2list(arg / 10, List(newmember) ::: acc)
  }

  implicit def toList(arg:Int) = int2list(arg)

  def testImplicit() = {

    assert(0.length == 1)
    assert(0(0) == 0)
    assert(5.length == 1)
    assert(5(0) == 5)
    assert(12345.length == 5)
    assert(12345(0) == 1)
    assert(12345(2) == 3)
    assert(12345(4) == 5)
    assert(98765432.length == 8)
    assert(98765432(4) == 5)
  }
}


As expected this test passes without issue.

My suspicion is that this issue is a side effect of the fact that default arguments apparently aren't represented in the type system. It's not surprising that int2list(arg:Int, acc:List[Int]) isn't available as an implicit conversion; there's no way for the runtime to supply the required "acc" argument for an input Int instance. This is not true for int2list(arg:Int, acc:List[Int] = List()); in that case the default value of "acc" could be used to perform the translation. Note, however, that these two functions are represented by the same type in the Scala runtime:


$ scala
Welcome to Scala version 2.8.0.final (OpenJDK Client VM, Java 1.6.0_18).
Type in expressions to have them evaluated.
Type :help for more information.

scala> def int2list(arg:Int, acc:List[Int]):List[Int] = {
...
int2list: (arg: Int,acc: List[Int])List[Int]

scala> def int2list2(arg:Int, acc:List[Int] = List()):List[Int] = {
...
int2list2: (arg: Int,acc: List[Int])List[Int]


If the type system is unaware that default arguments are available for all arguments other than the type to convert from then it's not at all surprising that a function would be excluded from the set of valid implicit conversion functions.

Results tested and verified on both Scala 2.8.0 and 2.8.1 RC2.

Thursday, September 30, 2010

Joins in CouchDB: Making it work

In a previous post we considered the problem of how to use CouchDB to model multiple types of documents and the relationships between them. Storing each type of document in a separate database didn't appear to resolve the problem, so my next approach was to add a resource "type" to every document and store everything in a single database. A general overview of this approach can be found in this blog post on the subject.

Let's dig in a bit by considering the three sample queries discussed in the previous post. Our friends at Shot of Jaq have since pulled the plug on their "shotcast" so we need a new subject. If only there were a prominent open-source project that were related to our investigation of CouchDB.... hmmm, perhaps we're on to something here. Let's consider tweets containing "#couchdb", the authors of those tweets and the followers of those authors. The three questions we were attempting to answer thus become the following:

1. For each author: what language do they write their tweets in and how many followers do they have?

We can answer this question in the multi-database case with a map function and we can do the same here as long as we account for the "type" property added to each document:


function(doc) {

  if (doc.resource == "author") {

    emit(doc.id,
         {name: doc.name,
          language: doc.lang,
          followers: doc.followers_count});
  }
}


2. How many tweets did each author write?

We were also able to answer this question in the multi-database case using a map and reduce function applied to the tweets database. Here again we can accomplish the same thing by accounting for the resource type:


// Map function
function(doc) {

  if (doc.resource == "tweet") {

    emit(doc.from_user, 1);
  }
}

// Reduce function
function(keys, values) {

  return sum(values);
}


3. How many tweet authors are also followers of @damienkatz?

The only question we couldn't answer using our previous model. After a number of false starts a solution to this problem did emerge. That solution is based on two principles:


  1. Keys returned by a map function can be lists of values rather than simple values

  2. By defining a grouping level we can bundle lists with a common group of leading elements into a single reduce call



Our map function must export information about all tweet authors as well as all followers of Damien into the intermediate key-value space. Each document of type author contains a distinct ID value while each document of type followers contains a list of follower IDs in the ids property. If our map function defines its keys as a list containing the author/follower ID as the first element we can then use a grouping function to correlate author information to information about a follower. Something like the code below should do the trick:


function(doc) {

  // Add keys for each author, keyed by their distinct
  // Twitter ID
  if (doc.resource == "author") {

    emit([doc.id, "name"], doc.screen_name);
  }
  else if (doc.resource == "followers" &&
           doc.screen_name == "damienkatz") {

    // Also add keys for each follower of Damien's,
    // again keyed by Twitter ID. This will allow
    // us to use CouchDB's grouping functionality
    // to correlate author IDs to followers.
    for (var i = 0; i < doc.ids.length; ++i) {

      emit([doc.ids[i], 'is_following'], 1);
    }
  }
}


Assuming the grouping operation described above is applied our reduce function will have one of three possible types of inputs:


  • An is_following key only. This corresponds to a follower who did not write a tweet with the tag of #couchdb

  • A name key only; this identifies a tweet author that is not a follower of Damien

  • Both a name and an is_following key; this identifies a follower who has also been tweeting



The code below should accomplish what we're after:


function(keys, values, rereduce) {

  // Make our reduce function safe for rereduce
  if (rereduce) {
    return values
  }
  else {

    // Strip keys out of (key,id) tuples. Even
    // though we're using a grouping function
    // keys are of the form [Twitter ID, keyname]
    // so we need to take the second item from
    // whatever we find.
    var truekeys = keys.map(function(arr) { return arr[0][1]; });

    // Build an object containing key-value pairs.
    // It's a bit like a simplified version of zip
    // without all the error checking and special
    // case handling.
    var foo = {};
    for (var i = 0; i < truekeys.length; ++i) {
      foo[truekeys[i]] = values[i];
    }

    // If we have a "name" and an "is_following"
    // value of 1 we've found an author who is
    // also following Damien.
    if (("name" in foo) &&
        ("is_following" in foo) &&
        foo["is_following"] == 1) {
      return foo["name"];
    }
  }
}


Testing this script on a CouchDB instance built from the PopulateSingleDatabase.py script gives output like the following:


python EvalTemporaryView.py -d twitter -m js/map_3a.js -r js/reduce_3a.js -g 1
{"rows":[
...
{"key":[3049901],"value":null},
{"key":[3091161],"value":"SoreGums"},
{"key":[3123671],"value":null},
...


CouchDB will display a value for every key grouping, and most of these results will be null (since the majority of Damien's followers aren't tweeting about CouchDB on a regular basis). Non-null values indicate followers who are also tweet authors; filtering out the null values will identify exactly the population we're after.

Updated map and reduce functions for the single database case (as well as some minor updates to work with CouchDB 1.0.x) can be found on github.

Monday, August 23, 2010

Learning Scala from Dead Swiss Mathematicians: Return of Palindromes

In a recent post we implemented a predicate to identify integers which were also palindromes. Our original implementation converted the input integer to a String in order to more easily access the leading digit, something that can't easily be done with an input integer. But is this really the case?

If we could easily convert an Int into a List[Int] [1] we could then access the "head" and "tail" of this list for purposes of comparison. Ideally this conversion could be automated so that we don't have to invoke it explicitly. Fortunately Scala's implicit conversions provide exactly these features. A simple implementation looks something like the following:


implicit def int2list(arg:Int):List[Int] =
  if (arg <= 9)
    List(arg)
  else
    int2list(arg / 10) ::: List(arg % 10)


It's worth taking a moment to point out a few things about this function:


  1. We're assuming base 10 integers here. We could make the function more flexible by adding a parameter to specify the integer's base if necessary (a quick sketch follows this list)

  2. A better implementation would be purely tail recursive, but this turns out to be a bit trickier than expected; more on that in a future post.
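
Regarding the first point, here's a quick sketch of what a more flexible version might look like. The base parameter and its default value are my own addition for illustration; they're not part of the implementation on github.


// Sketch only: the same conversion with the base made explicit.
// A default of 10 preserves the behaviour of the original.
def int2listInBase(arg:Int, base:Int = 10):List[Int] =
  if (arg < base)
    List(arg)
  else
    int2listInBase(arg / base, base) ::: List(arg % base)
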



With this conversion in place we can now define a properly tail recursive predicate to check for palindromes:


def byInt(arg:List[Int]):Boolean = {

  if (arg.length == 0 || arg.length == 1)
    return true
  if (arg.head != arg.last)
    false
  else
    byInt(arg.slice(1, arg.length - 1))
}


Very nice, but Scala's pattern matching allows for an even better (by which we mean "simpler") implementation that makes use of pattern guards:


def byIntMatch(arg:List[Int]):Boolean = {

  // Note that we don't need to check for lists of length
  // 0 or length 1 as we do in byInt above. The first
  // two cases of our match operation below handle these
  // cases.
  arg match {
    case List() => true
    case List(_) => true
    case arghead :: rest if arg.last == arghead => byIntMatch(rest.slice(0, rest.length - 1))
    case _ => false
  }
}


The full implementation can be found at github. Most of the code discussed here can be found in the org.fencepost.palindrome package; a full solution to Problem 4 using this code is also included.

[1] We use integers here only as a matter of convenience; of course we really only need a Byte as long as we're talking about base 256 or less. We can always optimize for space later if needed.

UPDATE - Shortly after completing this post I began another chapter in the "stairway" book. That chapter (covering details of the implementation of the List class in Scala) promptly pointed out that the list concatenation operator executes in time proportional to the size of the operand on the left. In order to avoid the performance hit we shift to an approach based on iteration and mutation.


import scala.collection.mutable.ListBuffer

implicit def int2list(arg:Int):List[Int] = {

  // Note: digits accumulate least-significant first here (987 becomes
  // List(7,8,9)); for palindrome checks the ordering doesn't matter.
  val buff = new ListBuffer[Int]
  var counter = arg
  while (counter > 9) {
    buff += (counter % 10)
    counter = (counter / 10)
  }
  buff += counter
  buff toList
}


The process for choosing when to use an iterative approach over a functional approach is still somewhat opaque. Scala has an explicit bias in favor of functional methods yet in many cases (including this one) an iterative implementation is the correct one to use. Presumably identifying when to use which approach is largely a function of experience with the language.

Monday, July 26, 2010

Learning Scala from Dead Swiss Mathematicians: Palindromes

As part of my effort to get up to speed with Scala I returned to an old favorite for exploring new languages; working through some of the problems at Project Euler. Posting complete answers to specific problems isn't terribly interesting and I have no plans to do so here. That said, some facets of these problems do lead to useful digressions which allow for a deeper exploration of the language. We'll dig into one such example here.

The problem under consideration is Problem 4 which asks for "the largest palindrome made from the product of two 3-digit numbers". It doesn't take much imagination to determine that part of the solution will involve a method, most likely a function of some kind, for identifying whether a number is in fact a palindrome. Okay, but what would such a method look like? The problem lends itself to a simple recursive implementation: divide the integer into a leading digit, a middle integer and a trailing digit and return false if the leading and trailing digits differ, otherwise return the result of a recursive call with the middle integer.

Easy enough, but a moment's reflection tells us we have a problem; we can access the trailing digit in an integer using the mod function and the base of the input integer but there's no easy way to access the leading digit. It's not as if Int (or even RichInt) extends Seq[Int] or even Seq[Byte]. Fortunately we can shift the problem domain by observing that an input integer is equivalent to an input String in which each digit of the integer maps to a Char in the String. Even better, Scala offers built-in support for regular expressions, and a fairly simple regex should give us access to both the leading and trailing characters and everything in the middle. So, how does Scala's regex support stack up?


// Pattern to be used by the regex-based predicate.
// Absolutely must use conservative matching and
// the backref here to make this work.
val palindromePattern = """(\d)(\d*?)\1""".r

// Recursive helper function to check for a
// palindrome using a regex
def byRegexHelper(n:String):Boolean = {

  // Base case; empty string and single characters
  // are by definition palindromes. Place this
  // test up front so that we can handle input values
  // of a single character.
  if (n.length == 0 || n.length == 1)
    return true
  palindromePattern.unapplySeq(n) match {

    case Some(matches) => byRegexHelper(matches(1))
    case None => false
  }
}

// Regex-based predicate; convert to a string and call
// the recursive function
def byRegex(arg:Int):Boolean = byRegexHelper(arg.toString)


Not bad, actually. It's not Perl or Ruby but the deep support for pattern matching in the language combined with relatively easy generation of a Regex from a String makes for a fairly clean syntax. Regex literals would be a small improvement but this is still cleaner than what one finds in Java or even Python.

So we've solved the problem, but can we do better? Do we really need the regex? Strings (or the StringOps they're implicitly converted to in 2.8) make use of the SeqLike[Char] trait allowing easy access to the leading and trailing characters using simple method calls. This leads to the following implementation which avoids the overhead of evaluating the regular expression:


// Recursive helper function to perform the check
// based on comparisons of the head and last
// characters in a string
def byStringHelper(n:String):Boolean = {

  // Base case; empty string and single characters
  // are by definition palindromes. Place this test
  // up front so that we can handle input values of
  // a single character.
  if (n.length == 0 || n.length == 1)
    return true
  if (n.head != n.last)
    false
  else
    byStringHelper(n.substring(1, n.length - 1))
}

// String-based predicate; convert to string and call
// the recursive function
def byString(arg:Int):Boolean = byStringHelper(arg.toString)


Also not bad, but still not completely satisfying. Moving away from the use of regular expressions minimizes the benefit of mapping the input integer onto a String in order to solve the problem. Recall that the primary argument for doing so was the inability to access leading digits in an input integer. How significant is this constraint? Is it something we can work around? More on this next time.

Saturday, July 3, 2010

Joins in CouchDB: How not to do it

My original motivation for experimenting with the dispatch library was an attempt to leverage my familiarity with a general technology (RESTful Web services) to gather experience with a pair of tools (CouchDB, Scala) which were less familiar to me. This turned out to be a spectacularly dumb idea, largely due to a problem that should have been clear at the outset: it's difficult to leverage general experience when you don't have significant experience with any of the tools involved. A better approach might be to use a more familiar tool (Python perhaps) to monkey around with CouchDB and do a bit of woodshedding with Scala on the side. Sounds sensible enough... so what now?

One of the facets of CouchDB that has always been unclear to me is how one might join two or more databases together. A bit of context first: my general understanding of a CouchDB database was that it was an unordered collection of documents of some common type. Within a database, views could be used to return documents in a certain order, select certain subsets of the documents or transform some set of documents into something else entirely. If we were to attempt to relate this model to a SQL database [1] then a CouchDB database is equivalent to a table while a view is equivalent to a select statement on that table.

This model leads to a simple question: if I have database A for documents of type A and database B for documents of type B how can I relate documents in A to documents in B? The natural answer would be some kind of view but views in CouchDB are defined as documents within a given database and there is no notion of a cross-database or global view.

Let's try and make this a bit more concrete. We'll store a set of data gathered from Twitter (using the brilliant Python Twitter Tools package) in a CouchDB instance and interrogate that data in an attempt to answer a few questions. We're interested in the following set of resources:


  1. The first 100 tweets corresponding to a search term

  2. Information about the set of authors for these tweets

  3. The set of followers for each of these authors



Information about these resources will be stored in databases named (respectively) tweets, authors and followers. Once we've obtained this data we'd like to get answers to the following questions:


  1. For each author: what language do they write their tweets in and how many followers do they have?

  2. How many tweets did each author write?

  3. For a given tweet author foo how many tweet authors are also followers of foo?



We'll use the nice folks at Shot of Jaq as our reference point by searching for tweets containing #shotofjaq.

With our multi-database model we can easily answer the first two questions; the first can be answered with a simple map function applied to the authors database and the second with a map and reduce function applied to the tweets database. The third question poses more of a problem. A correct answer requires data from both the authors and followers database and there is simply no way to access both databases using the standard map-reduce functionality. Our model appears to be broken.

Complete code for this example (including a simple Python script to submit temporary views to a CouchDB database) can be found at Github. Note that this repository also includes code for the correct single-database solution to this problem, so no peeking until the next post goes up!

[1] Yes, the CouchDB folks go to great lengths to point out that this is a silly thing to do. I employ this comparison only to illuminate the question at hand.