Friday, November 25, 2011

Musings on List Comprehensions, Functional Programming and Python

For simplicity and elegance in your programming constructs the list comprehension is hard to beat. [1] A list comprehension can filter an input list, transform it or do both, all in a single expression and with very readable syntax. At its best a list comprehension is "beautiful code" distilled: very clear and expressive with no unnecessary noise. You can, of course, make list comprehensions "ugly", but at least you have to try a bit to do so.

List comprehensions have their roots in functional programming, and several modern functional languages include support for them. We'll see comprehensions in Haskell, Clojure and Scala. [2] Python also includes support for comprehensions; in fact the basis for Guido's argument to remove the map and filter functions from py3k was that expressions using these functions could easily be rewritten as comprehensions. We'll also consider comprehensions in Python, including differences between the Python implementation and those in other languages and what effect those differences may have.

Let's consider a fairly straightforward problem: given a list of (x,y) coordinates in some two-dimensional space, provide a list of the coordinates that are within some fixed distance of the origin (0,0). Our application also requires that results be displayed in polar coordinates, so for extra bonus points we should return our results in that notation. We can solve the problem quite easily in Haskell and Clojure using list comprehensions:
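Sketches of both follow; the function name near, the coords argument and the cutoff distance of 5 are assumptions made for illustration. First the Haskell version:

near :: [(Double, Double)] -> [(Double, Double)]
near coords = [ (sqrt (x * x + y * y), atan2 y x)
              | (x, y) <- coords,
                sqrt (x * x + y * y) <= 5 ]

And much the same thing in Clojure:

(defn near [coords]
  (for [[x y] coords
        ;; keep only the points within 5 of the origin...
        :when (<= (Math/sqrt (+ (* x x) (* y y))) 5)]
    ;; ...and emit them as [r theta] polar pairs
    [(Math/sqrt (+ (* x x) (* y y))) (Math/atan2 y x)]))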





In Scala we use for expressions to accomplish something very similar:
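A sketch along the same lines as the examples above:

def near(coords: List[(Double, Double)]) =
  for ((x, y) <- coords if math.sqrt(x * x + y * y) <= 5)
    yield (math.sqrt(x * x + y * y), math.atan2(y, x))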



And finally, a straightforward solution in Python:
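Again a sketch, reusing the names assumed above:

import math

def near(coords):
    return [(math.sqrt(x * x + y * y), math.atan2(y, x))
            for (x, y) in coords
            if math.sqrt(x * x + y * y) <= 5]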



Note that in each of these solutions we're doing a bit more work than we need to. Our implementation computes the distance from the origin twice, once when filtering values and again when generating the final output of the transformation process. This seems unnecessary; a better option would be to somehow define this value as intermediate state. This state could then be available to both filter and transform expressions. Haskell and Clojure support the introduction of intermediate bindings in their comprehension syntax:
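Sketches again; note the let clause in the Haskell version and the :let binding in the Clojure version, each of which computes the distance exactly once:

near :: [(Double, Double)] -> [(Double, Double)]
near coords = [ (r, atan2 y x)
              | (x, y) <- coords,
                let r = sqrt (x * x + y * y),
                r <= 5 ]

(defn near [coords]
  (for [[x y] coords
        :let [r (Math/sqrt (+ (* x x) (* y y)))]
        :when (<= r 5)]
    [r (Math/atan2 y x)]))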





Scala also allows for intermediate bindings within a for expression:
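Here the binding appears as a plain value definition inside the for expression (a sketch, as above):

def near(coords: List[(Double, Double)]) =
  for { (x, y) <- coords
        r = math.sqrt(x * x + y * y)
        if r <= 5 }
  yield (r, math.atan2(y, x))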



What about Python? It turns out we cannot solve this problem in Python using a single comprehension; the syntax doesn't allow for the introduction of intermediate state which can be used in either the predicate or transform expression. On the face of it this seems a bit odd; the language encourages the use of comprehensions for filtering and/or transformation while providing a less robust version of that very construct. To some degree this discrepancy reflects differing language goals. Guido's post on the history of list comprehensions seems to indicate that the motivation for adding these features was pragmatic; the syntax is an elegant way to express most filter and transform operations. Functional languages use list comprehensions as "syntactic sugar" for monadic effects [3] that don't really have an equivalent in standard Python usage. The syntax may look the same, but if you're coming from a functional perspective they can feel just a bit off. The same is true for a few other common functional idioms:


  • Lazy evaluation - List comprehensions in Python are not lazily evaluated. Generator expressions, which look very similar to list comprehensions, are lazily evaluated.

  • Higher-order functions - Anonymous functions are supported in Python but these functions are famously limited to a single expression. Functions can return functions but for non-trivial functions a named function must be declared and returned.



A couple of things should be noted here. First, let us clearly state that Python is not and does not claim to be a functional programming language. While absolutely true, this fact doesn't change the underlying point. Moving from functional concepts back into Python can be a bit jarring; some things look similar but don't behave quite like you'd expect.

It's also worth noting that the inability to solve this problem with a single list comprehension in Python doesn't mean that it cannot be solved in idiomatic Python. We wish to return our intermediate state as well as filter results based on its value; this dual use allows us to solve the problem with nested comprehensions. The inner comprehension will generate the final representation (including the intermediate state) and the outer comprehension will filter results based on that representation. In Python this looks something like:
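Something along these lines, reusing the sketch from above:

import math

def near(coords):
    return [(r, theta)
            for (r, theta) in [(math.sqrt(x * x + y * y), math.atan2(y, x))
                               for (x, y) in coords]
            if r <= 5]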



This works only because the intermediate state is also returned in the final result. If that state were not explicitly returned (i.e. if its values were used as input to a conditional expression which returned, say, a string value describing the distance) this solution would not apply.

We can also solve this problem using generators. Using the state maintained by the generator we can iterate through the list, compute the intermediate state and yield a value only when we've satisfied our predicate. A generator-based solution would look something like:
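A sketch, once again with the same assumed names:

import math

def near(coords):
    for (x, y) in coords:
        r = math.sqrt(x * x + y * y)   # intermediate state, computed once
        if r <= 5:
            yield (r, math.atan2(y, x))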



Finally, none of these comments should be construed as a criticism of Python, the design choices that went into the language or the inclusion of list comprehensions generally. The pragmatic case for inclusion of this feature seems very strong. This post is interested only in the interplay between these features and similar features in other languages.

[1] Some languages (perhaps most notably Scala) use the term "for comprehension" or "for expression", and in some of these languages (Scala again) these constructs are more flexible than a list comprehension. That said, it's fairly straightforward to make Scala's for expressions behave like conventional list comprehensions.

[2] A purist might object that Scala is designed to mix features of object-oriented and functional languages, but the bias in favor of functional constructs justifies Scala's inclusion here.

[3] As an example, note that in Haskell list comprehensions can be replaced with do-notation. See the Haskell wiki for details.

Monday, September 12, 2011

Ruby and Concurrency: The Mechanics of Akka and JRuby

I'm interested in concurrency. I'm also interested in Ruby. There doesn't seem to be much reason to keep these two guys apart anymore.

This post marks the beginning of an occasional series on the topic of using Ruby to write concurrent code. Ruby doesn't yet have a big reputation in the world of concurrent and/or parallel applications, but there is some interesting work being done in this space. And since problems in concurrency are notoriously difficult to reason about we could probably do a lot worse than attempt to address those problems in a language designed to make life easier for the developer.

We begin with Akka, an excellent library designed to bring actors, actor coordination and STM to Scala and, to a slightly lesser degree, the JVM generally. Our initial task is simple: we wish to be able to define an Akka actor by providing a message handler as a code block. We'll start by attempting to implement the actor as a standalone class and then abstract that solution into code which requires the input block and nothing else.

A simple initial implementation looks something like this:
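A sketch of such an implementation, assuming the Akka 1.x Java API with the Akka jars already on the classpath (class and method names here are reconstructions, not the post's exact code):

require 'java'

java_import 'akka.actor.UntypedActor'
java_import 'akka.actor.Actors'

# Subclass Akka's UntypedActor directly in Ruby
class SampleActor < UntypedActor
  def onReceive(msg)
    puts "Received message: #{msg}"
  end
end

# Naive approach: hand our Ruby-defined class to Akka and let it
# instantiate the actor itself -- this is the line that blows up below
actor = Actors.actorOf(SampleActor).start
actor.sendOneWay("foo")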



[@varese ruby]$ ruby --version
jruby 1.6.2 (ruby-1.8.7-p330) (2011-05-23 e2ea975) (OpenJDK Client VM 1.6.0_22) [linux-i386-java]
[@varese ruby]$ ruby akka_sample1.rb
ArgumentError: Constructor invocation failed: ActorRef for instance of actor [org.jruby.proxy.akka.actor.UntypedActor$Proxy0] is not in scope.
You can not create an instance of an actor explicitly using 'new MyActor'.
You have to use one of the factory methods in the 'Actor' object to create a new actor.
Either use:
'val actor = Actor.actorOf[MyActor]', or
'val actor = Actor.actorOf(new MyActor(..))'
(root) at akka_sample1.rb:20


Well, that didn't work very well. Apparently the class we defined in JRuby is being exposed to the Java lib as a proxy object and that proxy's class is unknown to the Akka runtime. No problem; Akka supports a factory model for actor creation, and by using that approach the underlying class of our actor should become a non-issue. With a few simple changes we're ready to try again:
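Sketch of the relevant change: rather than handing Akka a class, we hand it a factory block that builds the instance itself (JRuby converts the block into the interface Akka's factory method expects).

actor = Actors.actorOf { SampleActor.new }.start
actor.sendOneWay("foo")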



[@varese ruby]$ ruby akka_sample2.rb
Received message: foo


We now have a working actor, but we still have some work to do; remember, we want to be able to define arbitrary actors by supplying just a code block. We need a few additional pieces to make this work:


  • A generic actor implementation whose state includes a block or Proc instance. The onReceive method of this actor could then simply call the contained block/Proc, passing the input message as an arg.

  • An ActorFactory implementation which takes a code block as an arg, stores it in internal state and then uses that block to build an instance of the generic actor described above on demand.



A first cut at this concept might look something like this:
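A sketch of that first cut; BlockActor and the handler lambda are assumed names:

class BlockActor < UntypedActor
  # Stash the block so onReceive can delegate to it
  def initialize(block)
    super()
    @block = block
  end

  def onReceive(msg)
    @block.call(msg)
  end
end

handler = lambda { |msg| puts "Message received: #{msg}" }
actor = Actors.actorOf { BlockActor.new(handler) }.start
actor.sendOneWay("foo")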



[@varese ruby]$ ruby akka_sample3.rb
ArgumentError: wrong number of arguments for constructor
create at akka_sample3.rb:29
(root) at akka_sample3.rb:37


What went wrong here? UntypedActor is a concrete class with a defined no-arg constructor. That constructor is being called instead of the one we've provided, and as a consequence our block never gets into the mix. There's almost certainly a cleaner way to solve this using JRuby, but for the moment we can get around the problem (in an admittedly ugly way) by providing a setter on our generic actor class:
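Sketch of the setter-based workaround:

class BlockActor < UntypedActor
  # No constructor args; the block is injected after construction
  attr_writer :block

  def onReceive(msg)
    @block.call(msg)
  end
end

actor = Actors.actorOf do
  ba = BlockActor.new
  ba.block = lambda { |msg| puts "Message recieved: #{msg}" }  # spelling matches the transcript below
  ba
end.start
actor.sendOneWay("foo")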



[@varese ruby]$ ruby akka_sample4.rb
Message recieved: foo


We now have what we wanted.

If you're interested in this topic note that Nick Sieger has covered similar ground (including the interaction between JRuby and Akka) here. Nick's article draws on some very good work done by Daniel Ribeiro late last year. The code referenced in Daniel's article is available on Github. I didn't come across Daniel's post until my own was nearly done but there is quite a bit of overlap between his code and mine. That said, I recommend taking a look at both articles, if for no other reason than the fact that both authors are much better at writing Ruby code than I am.

Friday, September 2, 2011

Cassandra and Clojure: Things To Bytes And Back Again

In a previous post we briefly discussed the idea of using idiomatic Clojure to access data contained in a Cassandra instance, including transparent conversion to and from Clojure data. We'll explore an implementation of this idea, although we won't address the question of laziness, in part because there are sizable trade-offs to consider. For example, any solution that provides lazy evaluation must do so while also attempting to minimize the number of trips to the underlying data store. This question may be picked up in yet more future work, but for now we'll continue on. We've upgraded to Cassandra 0.8.2 and Clojure 1.2, and we're using a new data model (see below), but for the most part we'll try to pick up where we left off.

At its core the Cassandra data model is built around columns. Each column contains both a name and a value, both of which are represented as binary data. This model is very simple, and while it may appear limiting it is in reality quite flexible. The lack of any pre-defined data types avoids any "impedance mismatch" resulting from structured data being shoehorned into data types that don't really fit. We're free to represent column values in any way we see fit; if we can convert it into bytes it's fair game. Our problem thus reduces to a question of serialization, and suddenly there are many suitors vying for our attention. Among the set of well-known serialization packages we find Google's Protocol Buffers, Thrift and Avro. And since we're working in a language with good Java interop we can always fall back on Java's built-in serialization (or something similar like Kryo). Finally, we're always free to roll our own.

Let's begin by ruling out that last idea. There are already well-tested third-party serialization libraries, so unless we believe that all of them suffer from some fatal flaw it's difficult to justify the risk and expense of creating something new. We'd also like our approach to have some level of cross-platform support, so native Java serialization is excluded (along with Kryo). We also need to be able to encode and decode arbitrary data without defining message types or their payload(s) in advance, a limitation that rules out Protocol Buffers and Thrift. The last man standing is Avro, and fortunately for us he's a good candidate. Avro is schema-based, but the library includes facilities for generating schemas on the fly by inspecting objects via Java reflection. Also included is a notion of storing schemas with data, allowing the original object to be easily reconstructed as needed. The Avro data model includes a rich set of basic types as well as support for "complex" types such as arrays and maps.

We'll need to implement a Clojure interface into the Avro functionality; this can be as simple as methods to encode and decode data and schemas. At least some implementations of Avro data files use a pre-defined "meta-schema" (consisting of the schema for the embedded data and that data itself) for storing both items. Consumers of these files first decode the meta-schema then use the discovered schema to decode the underlying data. We'll follow a similar path for our library. We'll also extend our Cassandra support a bit in order to support the insertion of new columns for a given key and column family.
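As a rough illustration, the encode half of such an interface might look something like the following. This is a sketch against the Avro 1.5-era reflection API rather than the post's actual library; the encode name and the map it returns are my own.

(import '(org.apache.avro.reflect ReflectData ReflectDatumWriter)
        '(org.apache.avro.io EncoderFactory)
        '(java.io ByteArrayOutputStream))

;; Generate a schema for obj via reflection, then write obj as Avro binary
(defn encode [obj]
  (let [schema (.getSchema (ReflectData/get) (class obj))
        out    (ByteArrayOutputStream.)
        enc    (.binaryEncoder (EncoderFactory/get) out nil)
        writer (ReflectDatumWriter. schema)]
    (.write writer obj enc)
    (.flush enc)
    {:schema (str schema) :bytes (.toByteArray out)}))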

Our core example will be built around a collection of employee records. We want to create a report on these employees using attributes defined in various well-known columns. We'd like to access the values in these columns as simple data types (booleans, integers, perhaps even an array or a map) but we'd like to do so through a uniform interface. We don't want to access certain columns in certain ways, as if we "knew" that a specific column contained data of a specific type. Any such approach is by definition brittle if the underlying data model should shift in flight (as data models are known to do).

After finishing up with our Clojure libraries we implement a simple app for populating our Cassandra instance with randomly-generated employee information:
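A sketch of the general shape of that app. The insert-column helper is a hypothetical stand-in for the Cassandra support described above, and encode is the Avro wrapper sketched earlier.

(defn rand-string [n]
  (apply str (repeatedly n #(rand-nth "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"))))

(dotimes [_ 10]
  (let [user   (rand-string 16)
        id     (rand-int 1000)
        active (zero? (rand-int 2))
        grades (vec (repeatedly 10 #(rand-int 100)))]
    ;; one Avro-encoded column per attribute; insert-column is hypothetical
    (doseq [[col val] [["userid" id] ["active" active] ["grades" grades]]]
      (insert-column "employees" "employees" user col (encode val)))
    (println "Added user" (str user ":") [id active grades])))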



[varese clojure]$ ~/local/bin/clojure add_employee_data.clj
Added user gFc0pVnLKPnjrrLx: [158 true [77 73 99 58 31 64 1 37 70 69]]
Added user 5gGh9anHwFINpr5t: [459 true [34 71 28 1 2 84 11 33 37 28]]
Added user pGRMeXBTFoBIWhud: [945 true [45 83 51 45 11 4 80 68 73 27]]
...


We're now ready to create our reporting application. Using Avro and a consistent meta-schema this comes off fairly easily:
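Another sketch over assumed helpers: get-keys returns the row keys in a column family, get-column the raw bytes for a single column, and decode reverses the Avro encoding above.

(let [users (get-keys "employees" "employees")]
  (println "Found" (count users) "users")
  (doseq [user users]
    (println "Username:" user)
    (let [userid (decode (get-column "employees" "employees" user "userid"))
          active (decode (get-column "employees" "employees" user "active"))]
      (println "Userid:" userid)
      (when (> userid 0)
        (println "Userid is greater than zero"))
      (println "Active:" active)
      (when active
        (println "User is active")))))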



[varese clojure]$ ~/local/bin/clojure get_employee_data.clj
Found 10 users
Username: 5gGh9anHwFINpr5t
Userid: 459
Userid is greater than zero
Active: true
User is active
...


And in order to verify our assumptions about simple cross-platform support we create a Ruby version of something very much like our reporting application:
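A sketch using the cassandra and avro gems; the client calls, and the assumption that each column value is a self-describing Avro stream readable via Avro's data file support, are mine.

require 'rubygems'
require 'cassandra'
require 'avro'
require 'stringio'

def decode(bytes)
  # The embedded schema is read first, then the datum itself
  Avro::DataFile::Reader.new(StringIO.new(bytes), Avro::IO::DatumReader.new).first
end

client = Cassandra.new('employees', '127.0.0.1:9160')
client.get_range(:employees).each do |key, columns|
  userid = decode(columns['userid'])
  active = decode(columns['active'])
  puts "Username: #{key}, user ID: #{userid}, active: #{active}"
  puts 'User ID is greater than zero' if userid > 0
  puts(active ? 'User is active' : 'User is not active')
end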



[varese 1.8]$ ruby get_employee_data.rb
Username: 5gGh9anHwFINpr5t, user ID: 459, active: true
User ID is greater than zero
User is active
Username: 76v8iEJcc79Huj9L, user ID: 469, active: false
User ID is greater than zero
User is not active
...


This code meets our basic requirement, but as always there were a few stumbling blocks along the way. Avro includes strings as a primitive type, but unfortunately the Java API (which we leverage for our Clojure code) returns string instances as a Utf8 type. We can get a java.lang.String from these objects, but doing so requires an extra toString() call that (logically speaking) should be unnecessary. We also don't fully support complex types. The Avro code maps the Clojure classes representing arrays and maps onto a "record" type that includes the various fields exposed via getters. Supporting these types requires the ability to reconstruct the underlying object based on these fields, and doing so reliably is beyond the scope of this work. Finally, we were forced to use Ruby 1.8.x for the Ruby examples since the Avro gem apparently doesn't yet support 1.9.


Full code can be found on github.

Tuesday, August 2, 2011

A Short Folding Note

Or maybe that's "a short note on folding"

While reading through Real World Haskell a short code snippet got me thinking about folds. Chapter 4, devoted to functional programming in general, uses a brief example built around the tails function to illustrate Haskell's as-patterns. This function makes its home in the Data.List module and, given an input string s, returns a list of every suffix of s (i.e. the results of applying tail zero or more times). It's a bit convoluted to describe in prose, but the example offered by O'Sullivan et al. makes it clear:


[@varese ~]$ ghci
...
*Main> :m +Data.List
*Main Data.List> tails "foobar"
["foobar","oobar","obar","bar","ar","r",""]


Maybe it was the fact that this function was introduced just after a lengthy discussion on left and right folds in the same chapter. Maybe it was due to an uptick in my use of folds when working with Scala. Whatever the cause, a thought grabbed hold of me: this function should be easy to implement by folding, shouldn't it? Unfortunately my previous attempt to sit down with Haskell code didn't quite work out as expected. Throw in a new FC 15 install (and all the goodies it contains) and there was no end of temptation for the unfocused. Fortunately good sense prevailed: GHC was installed, Chicken Scheme was not and it was time to start.

My intuition turned out to be correct; a version using foldl came together fairly quickly:
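A sketch consistent with the algorithm described below: each step appends the new character to every partial suffix in the accumulator and starts a new one, with the trailing empty string tacked on at the end.

mytailsl :: [a] -> [[a]]
mytailsl xs = foldl step [] xs ++ [[]]
  where step acc c = map (++ [c]) acc ++ [[c]]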



The overall algorithm used here is fairly straightforward. Beginning with an empty list we move left to right and do two things with every character c we find:


  • Create a new string in our accumulator list containing c

  • Add c to every string already in the accumulator



Visualizing the "left to right" evaluation of foldl gives us something like the following:


  1. r1 = append 'f' to every string in [] and add string 'f'

  2. r2 = append 'o' to every string in r1 and add string 'o'

  3. r3 = append 'o' to every string in r2 and add string 'o'



The result is exactly what we expect:


*Main> mytailsl "foobar"
["foobar","oobar","obar","bar","ar","r",""]


So we now have foldl... what about foldr? At this point I must offer a confession; left folds have always seemed intuitive to me while right folds most definitely have not. A left fold appears to be equivalent to a simple traversal of a list from "start" to "finish", gathering data as you move along. A right fold seems artificial somehow, like some kind of inside-out list traversal. And while a simple mental model for right folds remains somewhat elusive, working through this sample did ease the pain somewhat. It took a bit of work, but after getting a handle on the order of evaluation a working implementation emerged:
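A reconstruction consistent with the description that follows: each new character is consed onto the longest string seen so far, and the first clause handles the initially empty accumulator.

mytailsr :: [a] -> [[a]]
mytailsr xs = foldr step [] xs
  where step c []                = [[c], []]
        step c acc@(longest : _) = (c : longest) : acc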



This algorithm is a mirror image of the one used with foldl; since we're moving right to left we have to build member strings from back to front. For every new character c we build a new string by appending c to the largest string in the accumulator. We then return a new list with the new string for a head and the old accumulator for a tail. Note that we have to account for the empty list case when evaluating the first expression. Using this version on our test string we get:


*Main> mytailsr "foobar"
["foobar","oobar","obar","bar","ar","r",""]


Interestingly, while struggling to come to terms with right folds a related function presented itself. Taking the algorithm used in the left fold case and applying it directly to a right fold yields a function that might be called antitails. More precisely, for an input string s this function returns a list of strings (s - t) for every t in tails s.
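A guess at the definition: it's the left-fold algorithm applied via foldr, with cons in place of append.

antitails :: [a] -> [[a]]
antitails xs = foldr step [] xs ++ [[]]
  where step c acc = map (c :) acc ++ [[c]]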


*Main> reverse (antitails "foobar")
["","f","fo","foo","foob","fooba","foobar"]
*Main> mytailsl "foobar"
["foobar","oobar","obar","bar","ar","r",""]


The logic behind this operation looks something like the following:
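In the style of the foldl walkthrough above (remembering that foldr consumes "foobar" from the right):

  1. r1 = prepend 'r' to every string in [] and add string 'r'

  2. r2 = prepend 'a' to every string in r1 and add string 'a'

  3. r3 = prepend 'b' to every string in r2 and add string 'b'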



Finally, it should be noted that the actual implementation of tails in GHC appears to be a recursive definition built around... as-patterns.

Wednesday, July 27, 2011

Min, max and MongoDB

What else can we say about MongoDB? Sure, we could talk about built-in map-reduce or the expressive query language or journaling and sharding improvements in 1.8. But all of this is by now well-travelled territory, and that's a good thing. In addition to the features in the database itself, 10gen and the community have done yeoman work in producing a rich collection of documentation describing MongoDB. This collection includes the MongoDB cookbook, and within that cookbook is a recipe illustrating how to find the maximum and minimum values of a field within a MongoDB collection using map-reduce. And that's where the problem started.

Solving this problem with map-reduce seems like entirely the wrong approach to me, although admittedly I'm new here. A solution using map-reduce must (almost by definition) access every document within the collection. This kind of approach makes sense if we're interested in max and min values for every key in the collection. But if we're only interested in one key (or some small subset of keys) this approach has us doing more work than we need to. MongoDB has strong support for indexes within a collection; we should be able to leverage these indexes to increase efficiency by reviewing only the keys we're interested in.

In order to make this work we need to be able to retrieve a max or min using only MongoDB's query language. We begin by finding all documents that match the target key or keys. We then sort these documents in descending or ascending order depending on whether we're after the max or min (respectively). We then limit our search results to contain no more than one element: this element must be the max or min value for the specified field.

To try this out, we expand upon the example (used in the cookbook) of blog posts, each of which has an author field, some content and a page_views field for tracking hits on each article. To make things interesting we create a large sample data set: 10,000 distinct authors, each of whom has written 100 posts. Each post has been viewed less than 2000 times. We want to compute the max and min page views for a given author.

The following Ruby script generates a set of random data for us, storing that data in a set of YAML files to enable repeated load operations:
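A sketch of such a generator; the chunking scheme and file names are assumptions based on the load output shown further below.

require 'yaml'

AUTHORS          = 10_000
POSTS_PER_AUTHOR = 100
CHUNKS           = 20

def random_author
  (1..24).map { ('a'..'z').to_a[rand(26)] }.join.capitalize
end

posts = []
AUTHORS.times do
  author = random_author
  POSTS_PER_AUTHOR.times do
    posts << { 'author'     => author,
               'page_views' => rand(2000),
               'content'    => 'The quick brown fox jumped over the lazy dog' }
  end
end

posts.each_slice(posts.size / CHUNKS).each_with_index do |chunk, idx|
  File.open('random_data_part%02d.yaml' % (idx + 1), 'w') do |f|
    YAML.dump(chunk, f)
  end
end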



We use the following Ruby script to load the data in these YAML files into our MongoDB instance:
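A sketch using the 1.x Ruby driver:

require 'yaml'
require 'mongo'

collection = Mongo::Connection.new.db('indexing').collection('indexing')

Dir.glob('random_data_part*.yaml').sort.each do |file|
  puts "Loading data from file #{file}"
  YAML.load_file(file).each { |doc| collection.insert(doc) }
end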



Now that the data's loaded we move on to our queries. Let's start with the map-reduce approach, just to verify that it behaves as we expect. We'll start off with the mongo shell and re-use the map and reduce functions defined in the cookbook entry:
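Something along the lines of the cookbook's functions (a sketch rather than an exact copy); the important point is that each emitted value carries both a min and a max candidate, consistent with the output shown below.

map = function () {
  var x = { page_views: this.page_views, _id: this._id };
  emit(this.author, { min: x, max: x });
};

reduce = function (key, values) {
  var res = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i].min.page_views < res.min.page_views) res.min = values[i].min;
    if (values[i].max.page_views > res.max.page_views) res.max = values[i].max;
  }
  return res;
};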



We expect this operation to access every document in the database. To provide a reference point we execute a query to count all documents before performing the map-reduce:


[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000
> db.indexing.mapReduce(map,reduce,{out: {inline:true}})
...
    {
        "_id" : "Zzxpnsnqjskpvpfwtpyeexpr",
        "value" : {
            "min" : {
                "page_views" : 16,
                "_id" : ObjectId("4e2fa865e8f06a08850b5735")
            },
            "max" : {
                "page_views" : 1974,
                "_id" : ObjectId("4e2fa865e8f06a08850b5709")
            }
        }
    }
],
"timeMillis" : 177968,
"counts" : {
    "input" : 1010000,
    "emit" : 1010000,
    "output" : 10000
},
"ok" : 1
}


As expected we get the max and min values for all authors. And also as expected we have to look at every document to do it. Can we do any better with queries? Not right off the bat:


> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1)
{ "_id" : ObjectId("4e2fa865e8f06a08850b5709"), "author" : "Zzxpnsnqjskpvpfwtpyeexpr", "page_views" : 1974, "content" : "The quick brown fox jumped over the lazy dog" }
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':1}).limit(1)
{ "_id" : ObjectId("4e2fa865e8f06a08850b5735"), "author" : "Zzxpnsnqjskpvpfwtpyeexpr", "page_views" : 16, "content" : "The quick brown fox jumped over the lazy dog" }
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 1010000,
    "nscannedObjects" : 1010000,
    ...
    "millis" : 1774,
    ...
}


A simple index on authors does help us out:


> db.indexing.ensureIndex({'author':1})
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
    "cursor" : "BtreeCursor author_1",
    "nscanned" : 101,
    "nscannedObjects" : 101,
    ...
    "millis" : 64,
    ...
}


We're now evaluating only the documents representing blog posts by the named author. This approach is clearly more performant for gathering per-author max and min values. In fact, at 64 ms per author we're only about four times slower than the map-reduce approach even if we wished to gather values for all 10,000 authors. But we can do better. MongoDB's support for compound key indexes offers an additional speedup:


> db.indexing.ensureIndex({'author':1,'page_views':-1})
> db.indexing.ensureIndex({'author':1,'page_views':1})
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
    "cursor" : "BtreeCursor author_1_page_views_-1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    ...
    "millis" : 6,
    ...
}


Clearly for any single author (or reasonably small subset of authors) this is the approach to use. For that matter 6 ms/author is approximately three times faster than the map-reduce operation over the entire collection. That said, the comparison isn't as clear cut as it appears. Any attempt to replace the map-reduce approach would have to find the set of all distinct authors, and any such query would also have to evaluate all documents in the collection.

Of course we don't get this speedup for free. Index maintenance does impose some overhead, although it's not as bad as you might think. Loading the generated data to a MongoDB instance with no indexes looks something like the following:


[@varese mongo_ruby]$ time ruby load_data.rb
Loading data from file random_data_part01.yaml
...
Loading data from file random_data_part20.yaml

real 7m11.191s
user 0m6.410s
sys 4m25.909s
[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000


If we add the indexes in advance and then load the data we see this instead:


[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
0
> db.indexing.ensureIndex({'author':1})
> db.indexing.ensureIndex({'author':1,'page_views':1})
> db.indexing.ensureIndex({'author':1,'page_views':-1})
> bye
[@varese mongo_ruby]$ time ruby load_data.rb
Loading data from file random_data_part01.yaml
...
Loading data from file random_data_part20.yaml

real 8m39.601s
user 0m4.208s
sys 4m18.249s
[@varese mongo_ruby]$ ~/local/mongodb/bin/mongo
MongoDB shell version: 1.8.2
connecting to: test
> use indexing
switched to db indexing
> db.indexing.count()
1010000
> db.indexing.find({'author':'Zzxpnsnqjskpvpfwtpyeexpr'}).sort({'page_views':-1}).limit(1).explain()
{
    "cursor" : "BtreeCursor author_1_page_views_1 reverse",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    ...
    "millis" : 3,
    ...
}


Samples above use MongoDB 1.8.2 and MRI 1.9.2p180 (via RVM) on FC 13.

Sunday, June 19, 2011

First steps from node.js to concurrent JavaScript

I've always been somewhat ambivalent about node.js.

The project has received praise for advancing the idea of event-based server development. This argument never seemed terribly persuasive; Twisted and EventMachine (among others) have been plowing that soil for some time now and it's not clear that node.js adds anything new in this space. This does not mean that the project has made no contribution. The most impressive accomplishment of node.js has been to show what JavaScript can do when let out of its browser-shaped cage. Crockford and others have put in a considerable amount of work to demonstrate that there's a fully-formed programming language hiding behind the event handlers and DOM manipulation. node.js provides further evidence that they were right all along.

The functional side of JavaScript (anonymous functions and/or closures specifically) lends itself to asynchronous network programming. But what about other paradigms? Suppose we're writing a concurrent application with multiple synchronous "threads" [1] of control. Let's further suppose that these "threads" implement a share-nothing architecture and communicate via messaging. Does JavaScript have anything to offer our application? We would expect the answer to be "yes"; in addition to anonymous functions/closures serving as message handlers we might also lean on the easy creation of arbitrary objects to achieve something resembling Erlang's tuples. But that's just speculation. In order to know for sure we should try our model on an actual concurrent problem or two.

But let's not get too far ahead of ourselves. Before we dig into any samples we should probably define our framework. We'll start with the Go language. Go's support for lightweight processes (i.e. goroutines) maps naturally onto the "threads" we refer to above so it's a natural match for our problem space. We can also easily define our goal in terms of goroutines: we wish to create the ability to implement a goroutine in JavaScript. Finally, Go's built-in support for native code should facilitate interaction with our JavaScript engine. While we're on that subject, we'll follow the lead of node.js and use v8 as our JavaScript implementation.

The ability to implement goroutines in JavaScript requires at least three components:


  • Obviously we require the ability to execute arbitrary JavaScript from within a Go program

  • We'll also need to add bindings to the v8 environment from within Go. This allows us to pass channels and other objects into the JavaScript code

  • JavaScript code running in v8 should be able to interact with Go code. Our JavaScript goroutines must be able to send and receive messages from Go channels



We begin with the execution of JavaScript from within Go. The language supports good integration with C code, but v8 is written in C++. A few previous efforts work around this mismatch by way of a C shim into the C++ code. This works but seems somewhat artificial. Go now supports the use of SWIG to interact with C++; in fact it's the recommended method for interacting with C++ code. We'll start there and see where it takes us.

Initially we try to expose the various C++ structures referenced in the v8 sample documentation to Go via SWIG, but this approach rapidly runs aground for a couple of reasons. First, the v8 API includes a few stack-allocated data structures which won't persist across SWIG calls. Second, the API also makes some use of nested C++ classes, a feature not yet supported by SWIG [2]. So we shift gears a bit; we should be able to define a set of classes and/or static functions which can do the heavy lifting for us. This approach enables us to execute JavaScript, but unfortunately we can't easily bind Go objects or expose Go code in the JS environment. v8 does allow us to define C++ functions that can be made available to that environment, but unfortunately SWIG can only expose C/C++ code to higher level languages; it can't do the inverse. As a consequence there's no way to make any C++ function we inject aware of the Go code that's interacting with it. JavaScript code thus cannot interact with channels, so at the moment the best we can do is to return a value from our JavaScript function which can then be accessed from within the Go code.
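For a flavor of what that "heavy lifting" might look like, here's a sketch of the kind of static helper the wrapper could define (the name run_script is hypothetical; the real wrapper is in the github repo linked below). It compiles and runs a script against the 2011-era v8 API, returning the result as a std::string for SWIG to hand back to Go.

#include <v8.h>
#include <string>

std::string run_script(const std::string &src) {
  v8::HandleScope scope;
  v8::Persistent<v8::Context> context = v8::Context::New();
  v8::Context::Scope context_scope(context);

  // Compile and run the script, then stringify whatever it evaluates to
  v8::Handle<v8::Script> script =
      v8::Script::Compile(v8::String::New(src.c_str()));
  v8::Handle<v8::Value> result = script->Run();
  v8::String::Utf8Value utf8(result);
  std::string out(*utf8);

  context.Dispose();
  return out;
}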


[@varese v8-golang]$ make buildshell
swig -c++ -go v8wrapper.i
g++ -c -I/home/fencepost/svn/v8/include -fpic v8wrapper.cc
g++ -c -I/home/fencepost/svn/v8/include -fpic v8wrapper_wrap.cxx
g++ -shared -L/home/fencepost/svn/v8 v8wrapper_wrap.o v8wrapper.o -o v8.so -lv8
8g v8.go
8c -I/home/fencepost/hg/go/src/pkg/runtime v8_gc.c
gopack grc v8.a v8.8 v8_gc.8
8g -I. v8shell.go
8l -L . -r . -o v8shell v8shell.8
[@varese v8-golang]$ more fib.js
// Simple accumulator-based Fibonacci implementation
function fib(a,b,count) {
  function _fib(_a,_b,_count,_accum) {
    if (_count <= 0) {
      return _accum;
    }
    var n = _a + _b;
    return _fib(_b,n,(_count - 1),_accum + "," + n);
  }
  return _fib(a,b,(count - 2),a + "," + b);
}
fib(0,1,10);

[@varese v8-golang]$ ./v8shell -script=fib.js
Using JavaScript file: fib.js
Result: 0,1,1,2,3,5,8,13,21,34


The above code was built with SWIG 2.0.4. Full code can be found on github.

This technique works but isn't very satisfying. We don't have communication via channels, which in turn prevents us from having long-running JavaScript as our goroutine. We're also constrained to returning strings from our JavaScript functions, hardly a general purpose solution. We'll try to move past these limitations in future work.


[1] Think "thread of execution" rather than something your operating system might recognize
[2] There are some workarounds but nothing that seemed viable for our purposes

Tuesday, May 24, 2011

Cassandra and Clojure: The Beginning of a Beautiful Friendship

Soon after I began working with Cassandra it became clear to me that if you were in the market for a platform for creating applications that interact with this database you could do a lot worse than Clojure. The lack of a query language [1] suggests that filtering and slicing lists of keys and columns might be a fairly common activity for apps powered by Cassandra. And while many languages support the map/filter/reduce paradigm, Clojure's use of sequences throughout its core suggests a natural means to integrate this data into the rest of your application.

Cassandra itself provides an API that uses the Thrift protocol for manipulating data. We'll use this interface to implement a simple proof-of-concept application that might serve as a testbed for manipulating data managed by Cassandra in idiomatic Clojure. Note that the Clojure ecosystem already includes several open-source projects that connect Clojure to Cassandra, including clj-cassandra and clj-hector, the latter leveraging the Hector Java client to do its work. In order to keep things simple we choose to avoid any of these third-party libraries; it's not as if the Thrift interface imposes a heavy burden on us. Let's see how far we can get with what's already in the box.

That sounds great... so what exactly are we trying to do? Beginning with the database generated during our previous work with Cassandra we should be able to access sets of keys within a keyspace and a set of columns for any specific key. These structures should be available for manipulation in idiomatic Clojure as sequences. Ideally these sequences would be at least somewhat lazy and transparently support multiple datatypes. [2]

Using the Thrift interface requires working with a fair number of Java objects representing return types and/or arguments to the various exposed functions. My Clojure isn't yet solid enough to hash out Java interop code without flailing a bit so we kick things off with a Scala implementation. This approach allows us to simplify the interop problem without sacrificing the functional approach, all within a language that is by now fairly familiar.

The Scala code includes a fair number of intermediate objects but is otherwise fairly clean:
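A sketch reconstructed against the Cassandra 0.7/0.8-era Thrift API (keyspace and column family names follow the previous post; the object name is my own):

import java.nio.ByteBuffer
import org.apache.cassandra.thrift.{Cassandra, ColumnParent, ConsistencyLevel,
  KeyRange, SlicePredicate, SliceRange}
import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.{TFramedTransport, TSocket}
import scala.collection.JavaConversions._

object AuthorDump {
  def main(args: Array[String]) {
    val transport = new TFramedTransport(new TSocket("localhost", 9160))
    transport.open()
    val client = new Cassandra.Client(new TBinaryProtocol(transport))
    client.set_keyspace("twitter")

    // Empty start/end keys (and an empty column range) mean "everything"
    val empty = ByteBuffer.allocate(0)
    val predicate = new SlicePredicate()
    predicate.setSlice_range(new SliceRange(empty, empty, false, 100))
    val range = new KeyRange().setStart_key(empty).setEnd_key(empty)

    val slices = client.get_range_slices(
      new ColumnParent("authors"), predicate, range, ConsistencyLevel.ONE)
    for (slice <- slices) {
      val columns = slice.getColumns.map { csc =>
        val col = csc.getColumn
        new String(col.getName) -> new String(col.getValue)
      }.toMap
      println(new String(slice.getKey) + " => " + columns)
    }
    transport.close()
  }
}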



Translating this code into Clojure is more straightforward than expected:
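Again a sketch rather than the post's exact code; note the conversion of column names to keywords and of each row into a Clojure map.

(import '(java.nio ByteBuffer)
        '(org.apache.cassandra.thrift Cassandra$Client ColumnParent
                                      ConsistencyLevel KeyRange
                                      SlicePredicate SliceRange)
        '(org.apache.thrift.protocol TBinaryProtocol)
        '(org.apache.thrift.transport TFramedTransport TSocket))

(defn get-authors []
  (let [transport (TFramedTransport. (TSocket. "localhost" 9160))
        client    (Cassandra$Client. (TBinaryProtocol. transport))
        empty-buf #(ByteBuffer/allocate 0)]
    (.open transport)
    (.set_keyspace client "twitter")
    (try
      (let [predicate (doto (SlicePredicate.)
                        (.setSlice_range (SliceRange. (empty-buf) (empty-buf) false 100)))
            key-range (doto (KeyRange.)
                        (.setStart_key (empty-buf))
                        (.setEnd_key (empty-buf)))
            slices    (.get_range_slices client (ColumnParent. "authors")
                                         predicate key-range ConsistencyLevel/ONE)]
        (doall
         (for [slice slices]
           [(String. (.getKey slice))
            (into {} (for [csc (.getColumns slice)
                           :let [col (.getColumn csc)]]
                       [(keyword (String. (.getName col)))
                        (String. (.getValue col))]))])))
      (finally (.close transport)))))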



These results seem fairly promising, although we're nowhere near done. This code assumes that all column names and values are strings, a perfectly ridiculous assumption. We also don't offer any support for nested data types, although in fairness this was a failing of our earlier work as well. Finally we haven't built in much support for lazy evaluation; we lazily convert column names to Clojure keywords but that's about it. But fear not, gentle reader; we'll revisit some or all of these points in future work.

[1] At least until CQL arrives in Cassandra 0.8
[2] We won't be able to meet these last two goals in our initial implementation, but with any luck we'll be able to revisit them in future work

Sunday, May 1, 2011

Scala, Bloom Filters and Shakespeare: Introduction

The Problem


Maybe it was a recent New York Times review of a new production of "Macbeth" starring John Douglas Thompson. Maybe it was a reflection on Kurosawa's use of Shakespearean plots and devices within some of his best work. Maybe it was just one of those random thoughts that pop into your head while walking out the door in the morning. Its origin may be hidden from me now, but recently a question entered my mind and refused to leave:

What percentage of the words in an arbitrary play by Shakespeare cannot be found in a common modern dictionary?

There's no doubt that some of Shakespeare's language would now be considered archaic. Quite a bit of it is still in common use today, however, and some usages have made their way into the lexicon via the modest success enjoyed by the Bard through the years. So how big is that set of archaic words? Is it a majority, a small minority or something else entirely?

After pondering a few approaches to this question it became clear that there was room for a fair amount of further work. A standalone program could help us answer this question but there was no reason to stop there. The solution to this problem lends itself naturally to decomposition, and that fact in turn suggests that this problem might be an interesting vehicle for exploring more complex solutions. With any luck we'll get to this exploration in later posts, but for now our goal is to describe the problem space and provide the general framework of a solution.

A Few Solutions


Broadly speaking we will answer this question by constructing a filter containing all the terms in our dictionary. We will then extract all individual words from the script of an arbitrary play and apply our filter to each of these words. Simple, right?

We might consider using a Bloom filter to implement our filter. We would expect a Bloom filter to be a very compact yet very efficient representation of the contents of our input dictionary. The issue is that any Bloom filter produces some number of false positives: values for which our filter should return false (i.e. "not present") but returns true instead. This problem is offset somewhat by the fact that a negative result from a Bloom filter is always accurate, but in the end this approach will give us at best a close approximation of the actual value. For our purposes this should be adequate.

Note: this post is not intended to be a primer on Bloom filters or their implementation. If you're looking for such a thing there are already excellent sources available.

An alternate approach might be to implement a filter using a search index such as Lucene. The result would not be as compact as a Bloom filter but it would certainly offer a more accurate result.

To make things interesting we choose to create both versions. We can live with a good approximation so we focus on the Bloom filter implementation. The Lucene implementation will be used primarily as a baseline for comparing performance and accuracy of results. In order to justify this approach we impose the following additional constraints on the Bloom filter implementation:


  • It must outperform the Lucene version

  • It should use a minimal amount of memory



The resulting code can be found on github.

Observations Along the Way



Extracting words from plays


Shakespeare's plays are available in XML format from ibiblio. We are still tasked with extracting individual words from these documents, however, and our concern for resource usage prevents us from simply reading the entire document into memory and querying it via XPath or XQuery. In the end we resolved this issue by implementing what is at heart a finite state machine (FSM) driven by inputs obtained from the XML pull parser provided by the Scala library.

Minimizing false positives


There are known approaches to constraining the number of false positives produced by a Bloom filter to some reasonable maximum. Unfortunately the size of our data set and our resource constraints prevent us from leveraging these techniques. The number of false positives is also inflated by some of our implementation choices.

The dictionary used in our implementation was provided by the "words" package in Fedora Core 13. The dictionary in this package contains approximately 480,000 words. Assuming distinct internal hash values for every word our Bloom filter should support all 19-bit input values (since the base-2 log of 480,000 is 18.87). To avoid congestion we strive for a filter that is ~50% full, which in turn requires support of all 20-bit inputs. Note that our space constraints make the use of a second hash function unwise; we'd need a much larger capacity for a population size as large as ours. Also note that Bloom filters use a bit set for a backing store and the memory usage of a bit set is driven by the largest value contained in that set. It follows that our filter will use no more than (2 ^ 20)/8 = 131KB of memory.

Unfortunately a few of our implementation choices help to artificially inflate our false positive rate. These include:


  • Scala's bit set data structure accepts only non-negative values while the MurmurHash function we use internally can generate arbitrary 32-bit output (including the full range of negative integers on the JVM). We resolve this issue by using the absolute value of the MurmurHash output, with the result that pairs of distinct MurmurHash outputs now map to the same output from our internal function(s).

  • Bit set size is constrained by masking some number of bits from the MurmurHash output. This technique also causes multiple MurmurHash outputs to result in the same output from our internal function(s). A sketch of both adjustments appears below.
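(Names here are hypothetical; the real code is on github.)

// Fold the sign bit away with abs, then mask down to the filter's
// 20-bit capacity; both steps collapse distinct hashes onto one index
private val mask = (1 << 20) - 1
private def toIndex(hash: Int): Int = math.abs(hash) & mask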



The results below suggest a false positive rate on the order of 4%, an exceedingly high value.

Immutable and mutable data structures


The end of a previous post described my struggles with deciding when algorithms based on mutable data were acceptable within Scala's ethos. The language has an obvious bias against mutable data, but it's a bias that seems quite reasonable; the arguments in favor of immutability are quite convincing. Yet mutability is explicitly supported by the language and is understood to be useful in some circumstances.

An example presented itself while completing the Bloom filter implementation. The original construction used a foldLeft operation to reduce the set of internal hash values into a single bit set:


private val bloom = (BitSet() /: (vals map evalAllFns))(_ ++ _)


Unfortunately this approach required the creation of a new BitSet instance for each input value, a potentially costly process. The performance was quite poor:


[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/lucene-core-3.1.0.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words lucene
Filter constructed in 12295 milliseconds
Total words in script: 16347
Matches: 14328
Searching completed in 51 milliseconds

real 0m17.116s
user 0m9.666s
sys 0m4.980s
[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 133928 milliseconds
Total words in script: 16347
Matches: 15046
Searching completed in 10 milliseconds

real 2m19.174s
user 0m27.765s
sys 1m40.000s


Clearly searching wasn't the problem. The performance was vastly improved by shifting to something based on mutable data structures:


import scala.collection.mutable.BitSet

private val bloom = BitSet()
for (aval <- (vals flatMap evalAllFns)) { bloom += aval }



[mersault@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 7525 milliseconds
Total words in script: 16347
Matches: 14992
Searching completed in 9 milliseconds

real 0m11.872s
user 0m6.779s
sys 0m4.476s


In this case the mutability is contained within a single class: the bit set is itself a val and is never modified after being generated at construction. That said, the process for determining when mutable data should be used is still somewhat opaque.

Tuesday, March 29, 2011

Data Models and Cassandra

In a pair of previous posts we considered how we might use CouchDB to store and query a collection of tweets and tweet authors. The goal was to explore the document-oriented data model by constructing queries across multiple types of documents in a loose approximation to a join statement in SQL. Cassandra implements a data model built on ColumnFamilies which differs from both SQL and document-oriented databases such as CouchDB. Since we already have a problem space defined we begin our exploration of Cassandra by attempting to answer the same questions using a new data model.

The examples below are implemented in Python with a little help from a few excellent libraries:


  • Pycassa for interacting with Cassandra.

  • Twitter tools when we have to interact with Twitter via their REST API



We'll begin by looking at how we might query a Cassandra instance populated with data in order to find the answers to the questions in our problem space. We'll close by briefly discussing how to get the data into Cassandra.

We should be all set; bring on the questions!

For each author: what language do they write their tweets in and how many followers do they have?



The organizing structure of the Cassandra data model is the column family, defined within a keyspace. The keyspace is exactly what it sounds like: a collection of keys, each identifying a distinct entity in your data model. Each column family represents a logical grouping of data about these keys. This data is represented by one or more columns, which are really not much more than tuples containing a name, a value and a timestamp. A column family can contain one or more columns for a key in the keyspace, and as it turns out you have a great deal of flexibility here; columns in a column family don't have to be defined in advance and do not have to be the same for all keys.

We begin by imagining a column family named "authors" in a keyspace defined by the user's Twitter handle or screen name. Assume that for each of these keys the "authors" column family contains a set of columns, one for each property returned by the "user/show" resource found in the Twitter REST API. Let's further assume that included in this data are fields named "lang" and "followers_count" and that these fields correspond to exactly the data we're looking for. We can satisfy our requirement by using a range query to identify all keys that fall within a specified range. In our case we want to include all alphanumeric screen names so we query across the range of printable ASCII characters. The Pycassa API makes this very easy [1]:
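A sketch of that query against the Pycassa 1.x API (connection details assumed):

import pycassa

pool = pycassa.ConnectionPool('twitter')
authors = pycassa.ColumnFamily(pool, 'authors')

# Scan the printable-ASCII key range, pulling back just the two columns
# we care about
for key, cols in authors.get_range(start='!', finish='~',
                                   columns=['lang', 'followers_count']):
    print 'Key: %s, language: %s, followers: %s' % \
        (key, cols['lang'], cols['followers_count'])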



The result is exactly what we wanted:

[@varese src]$ python Query1.py
Key: Alanfbrum, language: en, followers: 73
Key: AloisBelaska, language: en, followers: 8
Key: DASHmiami3, language: en, followers: 11
Key: DFW_BigData, language: en, followers: 21


How many tweets did each author write?



Okay, so we've got the idea of a column family; we can use them to define a set of information for keys in our keyspace. Clearly this is a useful organizing principle, but in some cases we need a hierarchy that goes a bit deeper. The set of tweets written by an author illustrates just such a case: tweets are written by a specific author, but each tweet has data of its own (the actual content of the tweet, a timestamp indicating when it was written, perhaps some geo-location info, etc.). How can we represent this additional level of hierarchy?

We could define a new keyspace consisting of some combination of the author's screen name and the tweet ID but this doesn't seem terribly efficient; identifying all tweets written by an author is now unnecessarily complicated. Fortunately Cassandra provides a super column family which meets our needs exactly. The value of each column in a super column family is itself a collection of regular columns.

Let's apply this structure to the problem at hand. Assume that we also have a super column family named "tweets" within our keyspace. For each key we define one or more super columns, one for each tweet written by the author identified by the key. The value of any given super column is a collection of columns, one for each field contained in the results we get when we search for tweets using Twitter's search API. Once again we utilize a range query to access the relevant keys:
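A sketch; each row in the super column family holds one super column per tweet, so counting tweets is just counting super columns.

import pycassa

pool = pycassa.ConnectionPool('twitter')
tweets = pycassa.ColumnFamily(pool, 'tweets')

for key, supercols in tweets.get_range(start='!', finish='~'):
    print 'Authors: %s, tweets written: %d' % (key, len(supercols))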



Running this script gives us the following:

[@varese src]$ python Query2.py
Authors: Alanfbrum, tweets written: 1
Authors: AloisBelaska, tweets written: 1
Authors: DASHmiami3, tweets written: 1
Authors: DFW_BigData, tweets written: 1
...
Authors: LaariPimenteel, tweets written: 2
Authors: MeqqSmile, tweets written: 1
Authors: Mesoziinha, tweets written: 2


How many tweet authors are also followers of @spyced?



This problem presented the largest challenge when trying to model this problem space using CouchDB. Somehow we have to relate one type of resource (the set of followers for a Twitter user) to the set of authors defined in our "authors" column family. The Twitter REST API exposes the set of user IDs that follow a given user, so one approach to this problem might be to obtain these IDs and, for each of them, query the "authors" table to see if we have an author with a matching ID. As for the user to search for... well, when we were working with CouchDB we used @damienkatz so it only seems logical that we use Jonathan Ellis (@spyced) in this case.

Newer versions of Cassandra provide support for indexed slices on a column family. The database maintains an index for a named column within a column family, enabling very efficient lookups for rows in which the column matches a simple query. Query terms can test for equality or whether a column value is greater than or less than an input value [2]. Multiple search clauses can be combined within a single query, but in our case we're interested in strict equality only. Our solution to this problem looks something like the following:
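A sketch of this approach. The index helpers come from pycassa.index; the Twitter tools call is an assumption (the exact response shape varied across API versions).

import pycassa
from pycassa.index import create_index_clause, create_index_expression
import twitter

pool = pycassa.ConnectionPool('twitter')
authors = pycassa.ColumnFamily(pool, 'authors')

# Fetch the IDs of everybody following @spyced
api = twitter.Twitter(domain='api.twitter.com')
follower_ids = api.followers.ids(screen_name='spyced')

count = 0
for follower_id in follower_ids:
    # Use the secondary index on id_str to look for a matching author
    expr = create_index_expression('id_str', str(follower_id))
    clause = create_index_clause([expr], count=1)
    for key, cols in authors.get_indexed_slices(clause):
        print key
        count += 1
print count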



The result looks pretty promising:

[@varese src]$ python Query3.py
tlberglund
evidentsoftware
sreeix
...
joealex
cassandra
21


Some spot checking against the Twitter Web interface seems to confirm these results. [3]

Populating the data



So far we've talked about accessing data from a Cassandra instance that has already been populated with the data we need. But how do we get it in there in the first place? The answer to this question is a two-step process; first we create the relevant structures within Cassandra and then we use our Python tools to gather and add the actual data.

My preferred tool for managing my Cassandra instance is the pycassaShell that comes with Pycassa. This tool makes it easy to create the column families and indexes we've been working with:

SYSTEM_MANAGER.create_keyspace('twitter',replication_factor=1)
SYSTEM_MANAGER.create_column_family('twitter','authors',comparator_type=UTF8_TYPE)
SYSTEM_MANAGER.create_column_family('twitter','tweets',super=True,comparator_type=UTF8_TYPE)
SYSTEM_MANAGER.create_index('twitter','authors','id_str',UTF8_TYPE)


There are plenty of similar tools around; your mileage may vary.

When it comes to the heavy lifting, we combine Pycassa and Twitter tools into a simple script that does everything we need:
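A sketch of that script; the Twitter tools calls (a search for some term plus users/show for each author) and the str() flattening of column values are assumptions.

import pycassa
import twitter

pool = pycassa.ConnectionPool('twitter')
authors = pycassa.ColumnFamily(pool, 'authors')
tweets = pycassa.ColumnFamily(pool, 'tweets')

search = twitter.Twitter(domain='search.twitter.com')
api = twitter.Twitter(domain='api.twitter.com')

for result in search.search(q='cassandra')['results']:
    author = result['from_user']
    # One super column per tweet, keyed by tweet ID, under the author's row
    cols = dict((k, str(v)) for k, v in result.items())
    tweets.insert(author, {str(result['id']): cols})
    # One row per author containing the user/show fields
    user = api.users.show(screen_name=author)
    authors.insert(author, dict((k, str(v)) for k, v in user.items()))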



[1] For sake of brevity this discussion omits a few details, including the configuration of a partitioner and tokens in order to use range queries effectively and how keys are ordered. Consult the project documentation and wiki for additional details.

[2] Here "greater than" and "less than" are defined in terms of the type provided at index creation time.

[3] You could complain that we're cheating a bit here. When we were working with CouchDB we were tasked with joining two distinct "types" of resources using a map-reduce operation applied to documents within the database; that was the whole point of the exercise. Here we're just querying the DB in response to some external data source. This is true to a point, but in my defense it's worth noting that we could easily create a "followers" column family containing the followers @spyced and then execute the same logic against this column family rather than the Twitter REST API directly. This isn't a very satisfying answer, however; this issue could very well be taken up in a future post.

Monday, March 7, 2011

A Few More Words on Laziness

A few updates to the recent post on lazy evaluation. While most of the discussion below is directed at the Ruby implementation the same principles could also be applied to the Scala and/or Clojure versions. The implementations in these two languages are already fairly speedy, however, so for now we'll focus on bringing the Ruby version into line.

So, without further ado...

Don't blame JRuby


Of the three implementations discussed in the previous post the Ruby version was by far the worst in terms of performance. This fact should in no way be interpreted as a criticism of JRuby itself; the fault rests entirely on the code being executed (and by extension the abilities of the programmer who wrote it), not the platform it runs on. To prove the point, consider the following:


[@varese ruby]$ more Euler10Primes.rb
require 'prime.rb'

puts Prime.each(2000000).reduce(0) { |i,total| total + i }
[@varese ruby]$ time jruby --1.9 Euler10Primes.rb
142913828922

real 0m27.737s
user 0m20.807s
sys 0m6.181s


Clearly we have work to do.

Remember that part about testing fewer candidates?


Our initial algorithm tested a candidate set of all odd numbers beginning with three. The underlying logic is two-fold:


  • Any even number is by definition divisible by two (a prime) and therefore not prime

  • We can exclude these numbers by generating candidates in a clever way (starting from an odd value and incrementing by two) rather than testing every candidate to see if it's even or odd



As a first optimization we consider this process for generating candidates. Is there any room for improvement here? As it turns out there is; approximately 20% of the time we're testing a candidate that could be excluded with zero computation. Permit me to explain.

Multiples of five (a prime) always end in zero or five. Anything ending in zero must be even (and therefore already excluded) but so far we haven't excluded anything ending in five. And since one out of every five of our current candidate pool ends in five we're doing exactly 20% more work than we need to. Okay, so how do we keep these values out of our set of candidates? We know that odd numbers must end with a fixed range of values (1,3,5,7 or 9). If we exclude five from that list and then add the remaining values to successive multiples of ten we'll generate the full set of potential candidate values.
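A sketch of a generator built along these lines; note that 2 and 5 never appear as candidates, so they have to be seeded into the prime list separately.

def candidates
  Enumerator.new do |y|
    base = 0
    loop do
      # The odd offsets other than five, added to successive multiples of ten
      [1, 3, 7, 9].each do |offset|
        c = base + offset
        y << c if c > 2    # candidates begin at three
      end
      base += 10
    end
  end
end

# candidates.take(8) => [3, 7, 9, 11, 13, 17, 19, 21]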

This optimization alone yields some benefit:


[@varese ruby]$ time jruby Euler10.rb
142913828922

real 2m42.012s
user 0m14.394s
sys 2m23.462s


There's still plenty of room for improvement.

Laziness can be fleeting


Languages that fully support lazy evaluation try not to return values until absolutely necessary. Suppose you want to return the first three items from a lazily defined sequence in Clojure. You might use "(take 3 coll)", but note that this doesn't return an array containing three values; it returns another lazy sequence. You could thus chain operations together without completely evaluating (or "realizing" as the Clojure folks are fond of saying) anything. Given a predicate "pred", for example, you could identify which of the these values satisfy pred using "(filter pred (take 3 coll))". This expression also returns a lazy sequence; if you want to return the values you need something like "(doall (filter pred (take 3 coll)))"

Ruby does not support lazy evaluation. We've been using "yield" to approximate lazy evaluation but it's only an approximation. In Ruby "coll.take 5" will return an array containing the first five items in the collection rather than another Enumerable that will generate these values as needed. As a result we're once again doing more work than we need to. And unfortunately this is exactly what we find in our initial Ruby implementation:


if @primearr.take_while { |i| i <= root }.all? { |i| @nat % i != 0 }
  # take_while realizes a full array of every prime up to root before
  # all? ever runs, so the early exit in all? buys us very little
  @primearr << @nat
  yield @nat
end


On the plus side we're only considering primes up to the square root of our candidate. On the negative side we completely negate the fail-fast functionality of the "all?" method. Wouldn't it be better if we could exit this process as soon as we find a prime that divides evenly into the candidate?

This problem is actually a bit trickier than it looks. We can exit our test by finding a prime that divides evenly into our candidate (indicating that the candidate isn't prime) or by finding a prime greater than the square root of the candidate (indicating that the candidate is prime). How can we distinguish between these two cases? One way is to combine both conditions into a single predicate and find the first prime that satisfies it. If that prime is greater than the square root of the candidate we have a prime; otherwise we don't. We know this find operation will always succeed since eventually we'll encounter a prime greater than the square root of the candidate.

The heart of the resulting code looks something like the following sketch (the candidate generation fix described above plugs into the surrounding "each" method):
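
root = Math.sqrt(@nat)
# find the first prime that either exceeds the square root (the
# candidate is prime) or divides the candidate evenly (it isn't);
# find stops as soon as either condition is satisfied
p = @primearr.find { |i| i > root || @nat % i == 0 }
if p > root
  @primearr << @nat
  yield @nat
end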



So, how'd we do?


[@varese ruby]$ time jruby Euler10.rb
142913828922

real 0m43.730s
user 0m33.408s
sys 0m9.040s


That's more like it! This code overtakes the original Clojure implementation to move into second place. We're still slower than the implementation using the prime number generator that comes with JRuby but we've made up quite a bit of ground.

You don't need a JVM to be lazy


If you're interested in lazy evaluation you could do a lot worse than spending some time with Haskell. The language fully embraces lazy evaluation which, according to some people, matters a great deal to the utility of functional programming.

An initial implementation of our algorithm in Haskell, using the same candidate generation and a self-referential list of primes, looks something like this:
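
module Main where

-- candidates: multiples of ten plus 1, 3, 7 or 9, skipping evens and
-- anything ending in five
candidates :: [Integer]
candidates = [b + d | b <- [0,10..], d <- [1,3,7,9], b + d >= 3]

-- self-referential trial division: a candidate is prime if no prime
-- up to its square root divides it
primes :: [Integer]
primes = 2 : filter isPrime candidates
  where isPrime n = all (\p -> n `mod` p /= 0)
                        (takeWhile (\p -> p * p <= n) primes)

main :: IO ()
main = print (sum (takeWhile (< 2000000) primes))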



Unfortunately the performance is dreadful:


[@varese haskell]$ ghc -o euler10 euler10.hs
[@varese haskell]$ time ./euler10
142913828922

real 4m50.253s
user 0m8.976s
sys 4m35.131s


Once again it's important to realize that these numbers don't reflect on Haskell or the GHC in any way. They do reflect on my utter lack of experience and/or skill with this language. There's almost certainly a better (and probably more idiomatic) way to implement this algorithm in Haskell, although in my defense even this performance is orders of magnitude better than what we get using the Sieve of Eratosthenes implementation from Hutton's text:


[@varese haskell]$ more euler10sieve.hs
module Main where

primes = sieve [2..]
sieve(p:xs) = p:sieve[x|x <- xs, x `mod` p /= 0]

main = print (foldr (+) 0 (takeWhile (<2000000) primes))
[@varese haskell]$ ghc -o euler10sieve euler10sieve.hs
[@varese haskell]$ time ./euler10sieve
142913828922

real 401m25.043s
user 0m0.069s
sys 394m13.061s


In the end, of course, this probably indicates only that the Sieve of Eratosthenes is not an ideal solution to this particular problem.

Thursday, March 3, 2011

In Defense of Laziness and Being Self-Centered

On a whim I decided to dig into a few more Project Euler problems. Problem 10 was up next, and on the face of things the issue seemed fairly straightforward: compute the sum of all prime numbers less than two million. Unfortunately the simple implementation of the Sieve of Eratosthenes referenced in the Scala documentation (and used in a solution to a previous problem) promptly blows the stack when trying to perform the necessary computations. Clearly we need a more efficient way to compute large collections of primes.

It may seem counter-intuitive but the brute force method of trial division works very well in this situation. In addition to being relatively easy to implement there are several entry points that might be used to increase performance. We might try to select fewer candidates for testing. Or perhaps we can minimize the number of tests for any specific candidate. Ideally we could do both.

As an initial optimization we observe that 2 is the only even prime number; if our candidate sequence returns 2 as a fixed initial value and then generates only odd values we'll keep our work down. We keep the number of comparisons on each candidate low by utilizing the principle that for an integer N we need only compare primes less than or equal to the square root of N in order to determine primality. The square root of two million is a little more than 1414, so we only need to compare a relatively small set of primes for each candidate.

An implementation of this algorithm in Scala looks something like this:
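
package org.fencepost

// a sketch of the idea: "primes" is a self-referential Stream, with 2
// given up front and later values vetted against the primes defined so far
object Euler10 {

  lazy val primes: Stream[Int] = 2 #:: Stream.from(3, 2).filter(pred)

  // a candidate is prime if no prime up to its square root divides it
  def pred(n: Int): Boolean =
    primes.takeWhile(p => p * p <= n).forall(n % _ != 0)

  def main(args: Array[String]) {
    println("Result: " + primes.takeWhile(_ < 2000000).foldLeft(0L)(_ + _))
  }
}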



This code is quite performant on my commodity laptop:


[@varese src]$ time scala org.fencepost.Euler10
Result: 142913828922

real 0m18.876s
user 0m15.149s
sys 0m3.042s


Note that in this code the stream "primes" is a self-referential data type. An initial value of 2 is specified, but all subsequent values are determined by filtering the candidate pool using the primes defined (and returned) so far. Don't let the indirection of the "pred" function fool you; it's there only to make the code a bit easier to read. We could just as easily inline the logic defined in "pred" in the filter call within "primes".

The definition of self-referential data types is entirely dependent upon the idea of lazy evaluation. If a self-referential type were evaluated immediately we'd quickly find ourselves in a recursive loop with no termination condition. Using lazy evaluation, however, values are only computed as they are needed. We provide a function to generate new values from previous values and enough initial values to provide the necessary input(s) to the first invocation of this function; lazy evaluation handles the rest.

Clojure also includes extensive support for lazy evaluation within sequences; we should be able to use them to implement something similar. A bit of work is required to get the scoping right but eventually we come up with something like the following:
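
;; the scoping trick: declare gives prime? something to refer to
;; before the sequence itself is defined in terms of prime?
(declare primes)

;; a candidate is prime if no known prime up to its square root divides it
(defn prime? [n]
  (every? #(not= 0 (mod n %))
          (take-while #(<= (* % %) n) primes)))

(def primes
  (lazy-cat [2] (filter prime? (iterate #(+ % 2) 3))))

(println (reduce + (take-while #(< % 2000000) primes)))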



The Clojure implementation isn't as fast as our Scala version but it is still respectable; understand that we are computing just under 150,000 primes here:


[@varese clojure]$ time clojure euler10.clj
142913828922

real 1m2.860s
user 0m51.556s
sys 0m6.327s


As it happens I've been working through the Ruby chapter of Bruce Tate's Seven Languages in Seven Weeks. Bruce strongly encourages experimentation with the various languages described in this text (and I could use the practice) so let's attempt to implement the same algorithm in Ruby. The language doesn't offer support for self-referential data types or lazy evaluation, but we can get close with an implementation of Enumerable that uses this algorithm to generate the values our "each" method will yield. The code I came up with looks something like the following:
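
class Primes
  include Enumerable

  # each yields 2 and then every odd number that survives trial
  # division by the primes discovered so far
  def each
    @primearr = [2]
    @nat = 1
    yield 2
    loop do
      @nat += 2
      root = Math.sqrt(@nat)
      if @primearr.take_while { |i| i <= root }.all? { |i| @nat % i != 0 }
        @primearr << @nat
        yield @nat
      end
    end
  end
end

puts Primes.new.take_while { |p| p < 2000000 }.reduce(0) { |total, i| total + i }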



I would ask the experienced Ruby hackers in the back to please keep the snickering down.

We test this implementation using JRuby, specifically RC2 of the upcoming 1.6 release. Why JRuby over the alternatives? The other languages we've considered here are JVM languages so it seemed sensible to remain on the JVM unless there was a compelling reason not to. And since JRuby is generally regarded as the fastest Ruby implementation going there was little incentive to leave.


[@varese ruby]$ time jruby Euler10.rb
142913828922

real 3m4.725s
user 0m17.733s
sys 2m41.426s


We've definitely violated Project Euler's "one minute rule" but we haven't strayed into the realm of completely unacceptable performance. That said, there's no doubt that we've taken a step backwards.

Self-referential data types can be a very concise way to represent a stream of data, even a stream of infinite length. They're just one of the ways lazy evaluation can help you think about problems in different ways. If your language of choice supports this feature it's worth your time to experiment with it.

Friday, February 4, 2011

Forks and joins in Scala

A short while ago Alex Miller (aka puredanger) wrote up a blog entry detailing how to use Clojure to interact with the new fork/join concurrency framework to be included with Java7 (and already available from Doug Lea's site). The meat of Alex's solution is a Java wrapper for Clojure's IFn type which integrates nicely with the fork/join framework. At the time there was some discussion about whether a Scala implementation would require an equivalent amount of scaffolding. It turns out that supporting this functionality in Scala requires a bit more effort than one might expect, although the problems one faces are quite different from those Alex worked through in Clojure. We review the steps to a working Scala implementation and see what we can learn from them.

Before we move on, a few notes about fork/join itself. A full overview is outside the scope of this post; however, details can be found in the jsr166y section of Doug Lea's site. A solid (although slightly out-of-date) review can be found in a developerWorks article by Brian Goetz.

We begin as simply as possible: a straight port of Alex's code (in this case an implementation of Fibonacci using fork/join) to Scala. We create a named function fib that will be recursively called by our RecursiveTasks. We also use an implicit conversion to map no-arg anonymous functions onto RecursiveTask instances using the technique for supporting SAM interfaces discussed in an earlier post. A condensed sketch of that first attempt (names are approximate):
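
package org.fencepost.forkjoin

import jsr166y.{ForkJoinPool, RecursiveTask}

object ForkJoinTestFailOne {

  // map a no-arg anonymous function onto a RecursiveTask
  implicit def fn2task(fn: () => Int): RecursiveTask[Int] =
    new RecursiveTask[Int] {
      def compute() = fn()
    }

  def fib(n: Int): Int =
    if (n <= 1) n
    else {
      val f1 = () => fib(n - 1)
      f1.fork()
      val f2 = () => fib(n - 2)
      (f2 compute) + (f1 join)
    }

  def main(args: Array[String]) {
    val pool = new ForkJoinPool
    println(pool.invoke(fn2task(() => fib(20))))
  }
}


The resulting code looks pretty good, but an attempt to build with sbt gives an unexpected error: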


[info] == test-compile ==
[info] Source analysis: 2 new/modified, 0 indirectly invalidated, 0 removed.
[info] Compiling test sources...
[error] .../src/test/scala/org/fencepost/forkjoin/ForkJoinTestFailOne.scala:23: method compute cannot be accessed in jsr166y.RecursiveTask[Int]
[error] (f2 compute) + (f1 join)
[error] ^
[error] one error found


A bit of investigation reveals the problem: RecursiveTask.compute() is a protected abstract method in the fork/join framework. The compute method in the class returned by our implicit conversion consists of nothing more than a call to fib.apply(), but there's no way to know this when fib is defined. It follows that an attempt to access RecursiveTask.compute() from within the body of fib is (correctly) understood as an attempt to access a protected method.

How does Alex get around this problem in Clojure? He doesn't; the problem is actually resolved via the Java wrapper. Note that in his Java code compute() has been "elevated" to be a public method. He's thus able to call this method without issue from his "fib-task" function. Without this modification his code fails with similar errors [1].

Strangely, there's no clean way to resolve this issue within Scala itself. The language provides protected and private keywords to restrict access to certain methods but there is no public keyword; methods are assumed to be public if not annotated otherwise. More to the point, our callers hold references typed as RecursiveTask, on which compute() remains protected no matter what a subclass declares. We work around this constraint by way of an egregious hack: we define a new subclass of RecursiveTask containing a surrogate method that provides public access to compute, and we return this subclass from our implicit conversion.
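Sketched out (the subclass name is illustrative):


// the hack: a subclass with a public surrogate for the protected compute()
abstract class ComputableTask[T] extends RecursiveTask[T] {
  def computeResult(): T = compute()
}

implicit def fn2task[T](fn: () => T): ComputableTask[T] =
  new ComputableTask[T] {
    def compute(): T = fn()
  }

// call sites now read (f2 computeResult) + (f1 join)
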

Our new implementation now compiles, but when we try to actually execute the test with sbt we see a curious error; the test starts up and then simply hangs:


[info] == test-compile ==
[info]
[info] == copy-test-resources ==
[info] == copy-test-resources ==
[info]
[info] == test-start ==
[info] == test-start ==
[info]
[info] == org.fencepost.forkjoin.ForkJoinTestFailTwo ==
[info] Test Starting: testFibonacci


Well, this isn't good. What's going on here?

The key point to understand is that the implicit conversion is called every time an existing value needs to be converted to a different type. As written, the calls to f1.fork() and f1.join() both require conversion, and the implicit conversion function in play here returns a new instance on each invocation. This means that even though the input to the conversion function is identical in both cases (f1), the object that invokes fork() is a different object from the one that invokes join(). For fork/join tasks this matters; join() is called on an object that was never forked. Voila, an unterminated wait.

We can resolve this issue in one of two ways:


  • Caching previous values returned by the implicit conversion function and re-using them when appropriate

  • Explicitly specifying the type of f1 and f2 as our wrapper subclass. This has the effect of forcing implicit conversion at the time of assignment rather than when we call fork() and join()



I chose the first approach, mainly because it seemed a bit cooler. Oh, and it's a bit cleaner logically as well. The final result runs without issue.
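
A sketch of the caching conversion, reusing the ComputableTask wrapper from above (a production version would want a thread-safe map):


import scala.collection.mutable.HashMap

val taskCache = new HashMap[() => Int, ComputableTask[Int]]

implicit def fn2taskCached(fn: () => Int): ComputableTask[Int] =
  taskCache.getOrElseUpdate(fn, new ComputableTask[Int] {
    def compute(): Int = fn()
  })

// the same closure now always converts to the same task instance, so
// fork() and join() are invoked on one object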





[1] Using Alex's code (as published on his blog):


bartok ~/git/puredanger_forkjoin $ clojure process.clj
(1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181)


After changing IFnTask.compute() to protected (with no other changes):


bartok ~/git/puredanger_forkjoin $ clojure process.clj
(Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at jline.ConsoleRunner.main(Unknown Source)
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: No matching field found: compute for class revelytix.federator.process.IFnTask (process.clj:0)
at clojure.lang.Compiler.eval(Compiler.java:5440)
at clojure.lang.Compiler.load(Compiler.java:5857)
at clojure.lang.Compiler.loadFile(Compiler.java:5820)
at clojure.main$load_script.invoke(main.clj:221)
at clojure.main$script_opt.invoke(main.clj:273)

Thursday, January 20, 2011

Beautiful Code 1: On Pointers in Go and Regular Expressions

Before we proceed permit me a word or two of explanation.

A few weeks ago I began working through Beautiful Code and almost immediately it became clear that this was going to be a problem. The first chapter documents an implementation of regular expressions in C, yet pondering the ideas contained within this code led to a digression on pointers and reference types in Go. That digression forms the meat of the post that follows. A quick review of the remaining chapters suggests that this probably isn't a one-time thing. This post likely represents the first of a series; thus the title. We'll see how that works out.

Thanks for listening. We now return you to your regularly scheduled blog post.

The first chapter of Oram & Wilson's work contains an analysis by Brian Kernighan (yes, that Brian Kernighan) of an elegant bit of C code. The code in question was written by Rob Pike a little over a decade ago as a simple yet useful example for The Practice of Programming. The code combines a number of C's features (strings as pointers to the first character in an array of characters, null-terminated strings, pointer arithmetic) with recursion to create a compact and elegant implementation of a simple regular expression syntax. Kernighan observes that:

"Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers."


That's not hyperbole, folks.

The algorithm implemented by this code is subtle without being cryptic. In order to determine whether a given regex matches a string s of length n, consisting of characters s(0) through s(n - 1), we compare the regex against every suffix of s, i.e. every substring that also ends in s(n - 1). C represents a string as a null-terminated character array referenced by a pointer to its first character, so we can enumerate all relevant suffixes simply by incrementing that pointer via pointer arithmetic, recursively calling our comparison function on each suffix generated by this process.

Go also supports pointers in order to provide call-by-reference semantics, so in theory the same algorithm could be implemented in this language as well. Attempting to do so seemed like an interesting experiment, not least because Rob is somewhat involved with the Go language. Unfortunately, according to the specification, Go does not allow pointers to refer to individual characters in a string, and pointer arithmetic isn't supported in any capacity [1]. All is not lost, however; Go does include a reference type for arrays and their sub-arrays in the form of slices. Using slices we can perform the processing described above in an (arguably) more explicit syntax. The results can be found on github.
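
For the curious, a sketch of the slice-based translation of Pike's matcher (the version on github may differ in detail):


package main

import "fmt"

// matchHere reports whether re matches at the beginning of text.
func matchHere(re, text string) bool {
	switch {
	case len(re) == 0:
		return true
	case len(re) >= 2 && re[1] == '*':
		return matchStar(re[0], re[2:], text)
	case re == "$":
		return len(text) == 0
	case len(text) > 0 && (re[0] == '.' || re[0] == text[0]):
		// re-slicing stands in for the pointer arithmetic of the C version
		return matchHere(re[1:], text[1:])
	}
	return false
}

// matchStar reports whether c* followed by re matches at the beginning of text.
func matchStar(c byte, re, text string) bool {
	for {
		if matchHere(re, text) {
			return true
		}
		if len(text) == 0 || (text[0] != c && c != '.') {
			return false
		}
		text = text[1:]
	}
}

// match reports whether re matches anywhere in text by walking the
// suffixes of text, just as the C version increments a char pointer.
func match(re, text string) bool {
	if len(re) > 0 && re[0] == '^' {
		return matchHere(re[1:], text)
	}
	for {
		if matchHere(re, text) {
			return true
		}
		if len(text) == 0 {
			return false
		}
		text = text[1:]
	}
}

func main() {
	fmt.Println(match("ab*c$", "ac"))  // true
	fmt.Println(match("^b.d", "abcd")) // false: anchored at the start
}
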

A curious side effect: Python also supports slices on lists so in the end our implementation looks quite a bit like the Python implementation of this algorithm.

[1] If you don't support referencing individual array elements a lot of the arguments for pointer arithmetic don't make quite as much sense.

Friday, January 7, 2011

Transmutation of Scala closures into SAM instances

Okay, it's not actually alchemy. But it is pretty cool nonetheless.

A brief refresher on terminology: a single abstract method (or SAM) interface or abstract class contains exactly one method that concrete implementations must define. Java includes a number of interfaces that satisfy this property, including Runnable, Callable and Comparable.

The basic idea at play here is that a closure or anonymous function (either will do here) can be used in place of a SAM implementation if the parameters and return type of the closure match up to those of the method. That assumes your language cares about such things, of course; duck typing makes this less of an issue (as we'll see).

JRuby automatically performs this operation via closure conversion (scroll down to the bottom of the page). Stuart Sierra has recently published a macro for doing something similar in Clojure. Even Java is considering including this feature in an eventual closure implementation (see Brian Goetz's writeup for details). Why should Scala miss out on all the fun?

Let's take a look at some code to make this discussion more tangible. A simple example of closure conversion in JRuby can be found here. This is where duck typing helps us out; the only real requirement on our closure is that we actually return something (in order to clearly distinguish between Runnable and Callable). Our objective is to implement a Scala unit test that does something similar. Any such approach will be built around Scala's support for implicit conversion of types, but in this case a bit of care and feeding is required to line up the parameters and return type of the closure with those of the SAM interface's lone method. The basic approach works as follows:


  1. The implicit conversion accepts a function with the same parameter list and return value as the lone method in the SAM interface

  2. The conversion then returns a new concrete instance of the SAM interface. The implementation of the method needs to do nothing more than invoke apply() on the input function



The resulting ScalaTest class can be found here.
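
A condensed sketch of those two steps, using Callable (names here are illustrative):


import java.util.concurrent.{Callable, Executors}

object CallableConversion {

  // step 1: accept a function whose signature lines up with Callable's
  // lone call() method; step 2: return a Callable that just invokes it
  implicit def fn2callable[T](fn: () => T): Callable[T] =
    new Callable[T] {
      def call(): T = fn()
    }

  def main(args: Array[String]) {
    val exec = Executors.newSingleThreadExecutor
    val task: Callable[Int] = () => 6 * 7  // conversion forced by the declared type
    println(exec.submit(task).get())       // prints 42
    exec.shutdown()
  }
}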