Tuesday, July 10, 2012

Ruby, Perl and Eloquence

In an attempt to make my Ruby code a bit more idiomatic I've been spending a bit of time recently with Russ Olsen's excellent Eloquent Ruby. There are many reasons to love writing Ruby code, not least of which is that Ruby deploys the same terse but expressive power of Perl while employing better overall principles of programming. The effect isn't universal; on occasion my problems with Ruby look quite a bit like my problems with Perl. Given the overall elegance of the language it seems likely that there's a "better" (or at least more idiomatic) way to accomplish my goal. And so I turn to Eloquent Ruby.

Consider the following as an example of this tension.

Perl has a well-deserved reputation for efficiently processing text files with regular expressions. We'll consider an example from another text I've been spending a bit of time with: Hofstadter's seminal Godel, Escher, Bach. A simple implementation of the productions of the MIU system in Perl [1] might look like the following:

Reasonable enough, but there's a lot of magic going on here. We're relying on the "magic" variable $_ to access the current line, and to make things worse we have to obtain those lines using the INFILE identifier that only has meaning due to a side effect of the open() call [2]. There are also those "magic" $1 and $2 variables for accessing capture groups in a regex.

The Ruby version is both slightly shorter and a bit cleaner:

We've made some nice strides here. The use of File.new() allows us to avoid open() and its side effects. The use of a code block allows us to remove the global $_ in favor of a scoped variable line.

But we're still stuck with $1 and $2 for those capture groups.

One can imagine an elegant object-oriented solution based on match objects. Any such implementation would have to accomplish three things:
  1. The match object will be used as the condition of an if/unless expression so nil should be returned if there's no match
  2. The match object should be bound to a variable name in scope
  3. References to capture groups in the if-clause should use the scoped variable rather than the $1,$2, etc.
But remember, this exercise is only useful if we don't have to compromise on elegance. If all we're after is an explicit object-oriented solution we could go with the Python version:

That's probably not what we want. [3]

After pondering this question for a bit we realize we may not be in such a bad spot. /regex/.match(str) already returns nil if there is no match so our first requirement is satisfied. Assignment is just another expression, so our match object (or nil) will be returned to the if-expression test, helping us with our second goal. And match objects provide access to capture groups using []. So long as the assigned variable is in scope we should have everything we need. A bit of scuffling [4] brings us to the following:
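A minimal sketch of that approach follows. The helper name miu_productions is hypothetical, and limiting it to the first two MIU rules (xI becomes xIU, Mx becomes Mxx) is an assumption made to keep things short. Each match object is bound to a scoped variable inside the if-test and capture groups are read with [] rather than $1.

```ruby
# Hypothetical helper (not the original listing): applies two MIU rules to a
# single string using scoped match objects instead of $1 and $2.
def miu_productions(str)
  results = []

  # Rule 1: a string ending in I may have a U appended.
  # match() returns nil on failure, so the assignment doubles as the test.
  if (m = /\A(.*I)\z/.match(str))
    results << "#{m[1]}U"
  end

  # Rule 2: Mx may become Mxx.
  if (m = /\AM(.*)\z/.match(str))
    results << "M#{m[1]}#{m[1]}"
  end

  results
end

p miu_productions("MI")   # => ["MIU", "MII"]
```

The parentheses around each assignment are there only to signal that the assignment is intentional; the behaviour is the same without them.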

This example is free of any "magic" variables, although we have sacrificed a bit on the clarity front. It's also worth noting that we could have accomplished something very similar in Perl:

This implementation is hardly idiomatic. It's also quite a bit less clear than our earlier efforts in the language.

Where does this leave us? Do we keep in touch with our Perl roots and live with $1 in order to keep things terse and expressive? Do we sacrifice a bit of clarity and go with an object-oriented approach? Or do we do something else entirely?

Answers to this question (and others like it) are what I'm hoping to get out of Eloquent Ruby.

[1] We're ignoring nasty things like error handling and complex edge cases in order to keep the conversation focused.

[2] We could use lexical file handles here but that doesn't really change the underlying point. Even in that case we still have to call open() in order for $fh to be useful.

[3] Python does a lot of things very, very well, but this solution to this problem seems unnecessarily verbose.

[4] The requirement to declare foo in advance when using the modifier form of if was a bit surprising. Shifting to an if expression removed this requirement. The upcoming Perl version also didn't require this advance declaration when using an equivalent to the modifier form. An MRI quirk, perhaps?

Tuesday, April 24, 2012

Small Facets of Ruby: redcar and rvm

An occasional series of brief discussions on some small aspect of Ruby and/or its ecosystem. Some bits may turn into gems in the future, but in general topics discussed here will likely be small enough that a gem would be overkill.

I spend most of my time in the terminal, but when a graphical text editor is useful I'm a fan of redcar. Early versions were a bit rough but the project has progressed nicely and the editor now appears to be quite stable. Redcar has always been solid in terms of syntax highlighting: out-of-the-box we find support for Ruby, Clojure, Scala and Haskell (!) and many others. The only slight hiccup is that redcar requires JRuby, and while I'm often using JRuby anyway it would be nice to have redcar accessible even when working with MRI or Rubinius.

It seems like rvm should be able to help us out here. We should be able to install redcar into its own gemset and create a script that switches to JRuby with that gemset and starts up the editor. Any such script shouldn't interfere with the Ruby interpreter in use in the current shell. A dead simple approach doesn't get very far:

[@varese ~]$ more bin/redcar
rvm use jruby-1.6.7@redcar
[@varese ~]$ bin/redcar

RVM is not a function, selecting rubies with 'rvm use ...' will not work.
You need to change your terminal settings to allow shell login.
Please visit https://rvm.io/workflow/screen/ for example.

Fortunately Wayne includes some excellent documentation about sourcing the necessary functions into your scripts in order to set up the environment as expected. A small change gives us the following:

A quick test verifies that we're up and running:

[@varese ~]$ which ruby
[@varese ~]$ bin/redcar
Using /home/mersault/.rvm/gems/jruby-1.6.7 with gemset redcar
Redcar 0.13 ( java )
[@varese ~]$ which ruby

Monday, February 20, 2012

Ruby and Concurrency: Maintaining Purity in a World of Actors

It's time to come clean.

The previous post on designing actors with Akka and JRuby took a bit of a shortcut. Not a big one, mind you, but a shortcut nonetheless. Fortunately the use of this shortcut provides a window into a few interesting aspects of actor design, so despite my embarrassment it's worth spending a bit of time here.

Let's start with the constraints imposed when designing with actors. We've already seen a few of these constraints: all mutable state must be contained within an actor, the outside world (including other actors) can only interact with an actor by sending messages, and these messages are handled one at a time so there's no contention on access to mutable state. All this talk of message handling might lead one to wonder: exactly what can an actor do when it receives those messages?

To no one's great surprise there are some constraints here as well. Generally speaking the message handler in an actor consists of some combination of the following actions:

  • Creating other actors

  • Sending messages to other actors

  • Retrieving or updating managed state

There's certainly nothing horrible in that list either. None of these constraints seem terribly onerous. So where did we go wrong?

The problem starts with the Controller's handling of :next messages. Each candidate value is sent off in a message to every model for evaluation, in each case returning a future that can be used to access the model's response when it's provided. The first candidate that gets a positive response from all the models is our next prime. The implementation returns the expected value, but there's a big problem here: how did we observe the response from any of the models if we're still in the message handler for a :next message? Remember that messages are processed one at a time and we're not done with the current message yet. The model is an actor and actors communicate with messages; it seems reasonable to expect the response from the model to be delivered by a message sent to the controller. So how did we see that message if we're not done with the current message? If we can generalize a bit: how do futures (or a request-response messaging pattern in general) fit into an actor-based design?

The short answer appears to be that they don't really fit at all. A better approach would be for the controller to send messages to the models for the current candidate and then return. The models would then compute their answer and send a message containing that answer back to the controller. The message handler at the controller would then determine if all responses have been received. If every model has answered, the controller computes an overall answer and takes action: if the candidate is prime it's returned to the caller, otherwise a new candidate is generated and the process starts all over again. [1]
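The message flow described above can be sketched in plain Ruby without Akka. Everything in this sketch is an assumption made for illustration: the class names, the message names (:next, :prime?, :answer, :add), the array standing in for a mailbox and the replies list standing in for a reply channel. The point is only that no handler ever waits on a response; every handler sends messages and returns.

```ruby
# Plain-Ruby simulation of the message flow; no Akka involved.
# Mailbox entries are [recipient, message] pairs delivered one at a time.
class Model
  def initialize(mailbox)
    @mailbox = mailbox
    @primes = []
  end

  def on_receive(msg)
    type, value, sender = msg
    case type
    when :add      # store a newly discovered prime
      @primes << value
    when :prime?   # answer with a message rather than a return value
      passed = @primes.none? { |p| (value % p).zero? }
      @mailbox << [sender, [:answer, value, passed]]
    end
  end
end

class Controller
  attr_reader :replies

  def initialize(mailbox, models)
    @mailbox = mailbox
    @models = models
    @candidate = 1
    @answers = []
    @replies = []   # stands in for the reply channel to the caller
  end

  def on_receive(msg)
    type, value, passed = msg
    case type
    when :next
      try_candidate(@candidate + 1)
    when :answer
      return unless value == @candidate
      @answers << passed
      return if @answers.size < @models.size   # still waiting on a model
      if @answers.all?
        # Candidate is prime: hand it to one model and reply to the caller.
        @mailbox << [@models[@replies.size % @models.size], [:add, value]]
        @replies << value
      else
        try_candidate(@candidate + 1)
      end
    end
  end

  private

  # Send the candidate to every model, then return immediately.
  def try_candidate(candidate)
    @candidate = candidate
    @answers = []
    @models.each { |m| @mailbox << [m, [:prime?, candidate, self]] }
  end
end

# Hypothetical driver: sends :next, then delivers messages until none remain.
def next_prime(mailbox, controller)
  mailbox << [controller, [:next]]
  until mailbox.empty?
    recipient, message = mailbox.shift
    recipient.on_receive(message)
  end
  controller.replies.last
end

mailbox = []
models = Array.new(2) { Model.new(mailbox) }
controller = Controller.new(mailbox, models)
p Array.new(4) { next_prime(mailbox, controller) }   # => [2, 3, 5, 7]
```

Note that the controller has to remember which candidate it is waiting on and how many answers have arrived. That bookkeeping is the price of giving up the blocking request-response style.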

So what does this new approach look like in Ruby? [2] Our controller is now a bit longer but after a bit of cleanup the code reads reasonably well:

The model doesn't really change much:

Complete code (including tests) can be found on github.

[1] Viktor Klang mentioned a non-blocking solution in comments on the original post. Any non-blocking implementation seemed to be constrained by a variation on the theme mentioned above; how do we preserve a communication channel to the original caller (a caller who is not an actor) if the message handler for the :next message doesn't return a response directly? It wasn't until I came across Akka's notion of a reply channel (and the ability to preserve it as an instance variable in our actor) that this problem was resolved.

[2] I briefly considered explicitly modelling the actors as state machines using the intriguing statemachine gem but decided against it; formally defining the state machines involved didn't add much value to actors that are as lightweight as ours. Larger systems with more complex actors would very likely benefit from the use of this gem.

Thursday, January 5, 2012

Ruby and Concurrency: Design with Actors and Akka

In a semi-recent post we looked at how we might define actors in JRuby using a combination of Akka and code blocks. That post was very focused on the process of creating actors; we didn't necessarily consider how to build systems with actors and/or whether Ruby might be helpful in doing so. There are plenty of issues to ponder here, but before we dig in let's take a step back and talk briefly about some of the characteristics of actors.

Actors implement a shared-nothing model for concurrency. Mutable system state should be contained by one or more actors in the system. This state is not exposed to external entities directly; it can only be accessed or modified via messages sent to the actor. Since the actor is the only player in this drama that can access the state it contains there is no need to lock or synchronize state access. It should be fairly clear that an actor consists of some amount of mutable system state and some logic for handling incoming messages... and not much more.

Okay, sounds great... but how does it work in practice? We'll consider this question by way of an algorithm that by now should be pretty familiar: prime number generation using the Sieve of Eratosthenes. We considered this chestnut (or perhaps you prefer "war horse" at this point) in a previous post discussing an implementation of this algorithm in the Go language. Let's review that implementation and see what else we can do with it.

The support of lightweight goroutines in Go encouraged a pipeline implementation with one goroutine per discovered prime. We can think of the "state" of this system as the total set of discovered primes. Each goroutine knows only about the prime it contains and the channel where candidates should be sent if they "pass" (i.e. do not divide evenly into its prime number). Once the goroutine is created its state doesn't change. New state is added by creating a new goroutine for a newly-discovered prime. State is never deleted; once a prime is discovered removing it from consideration is nonsensical. As a consequence state is completely distributed; no entity in the system knows about all discovered primes.

We also note that the algorithm described here isn't very friendly to parallel decomposition [1]. Candidate values are compared to previously discovered primes one at a time: if the candidate passes the test it's allowed to move on, otherwise no further evaluation occurs. This technique is referred to as "fail-fast" behaviour; if a value fails a test it doesn't lead to any wasteful extra work. The concurrent approach is quite different: compare the candidate to all discovered primes at once and return success only if all tests pass. Comparisons are done independently, so even if the "first" [2] comparison fails all other comparisons still execute. We lose our fail-fast behaviour but gain the ability to decompose the entire process into smaller jobs that can execute in parallel. A trade-off in engineering... surprising, I know.
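The trade-off is easy to see if we count comparisons. The sketch below uses two hypothetical helpers, fail_fast_check and check_all, which are assumptions for illustration and not part of either implementation. The second runs its comparisons sequentially here, but because each is independent of the others they are the kind of work that could be farmed out in parallel.

```ruby
# Hypothetical helpers: each returns [passed, number_of_comparisons].

# Sequential, fail-fast: stop at the first prime that divides the candidate.
def fail_fast_check(candidate, primes)
  checks = 0
  passed = primes.none? do |p|
    checks += 1
    (candidate % p).zero?
  end
  [passed, checks]
end

# Evaluate-everything: every comparison runs regardless of the others.
def check_all(candidate, primes)
  divides = primes.map { |p| (candidate % p).zero? }
  [divides.none?, divides.size]
end

primes = [2, 3, 5, 7]
p fail_fast_check(10, primes)   # => [false, 1]
p check_all(10, primes)         # => [false, 4]
```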

We'll set out to implement the Sieve of Eratosthenes in Ruby using JRuby and Akka. Our new implementation will have more of a bias towards parallelism; this time we're okay with throwing away some work. Clearly we'll need an actor to compare a candidate to one or more discovered primes; that is, after all, why we're here. We can think of these actors as maintaining the state of our system (just like the goroutines in the Go implementation) so we'll borrow a term from MVC and call these "model" actors. Concurrently evaluating candidates against these models implies an organizing actor that is aware of all models; it seems natural to call this the "controller" actor. To keep things simple we don't want to support an unlimited number of models so we create a fixed-size "pool" and distribute system state (i.e. the set of discovered primes) between these models. When a new prime is discovered the controller will be responsible for adding that prime to an existing model.

The implementation in Ruby looks like this:

Ruby offers a number of features which help us out here. As shown in this implementation messages can be as simple as a list with a leading symbol (used to indicate the message "type") and a payload contained in the remainder of the list. This follows a similar convention found in Erlang, although that language uses tuples rather than lists. Support for destructuring/multiple assignment makes working with these messages quite simple.
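As a small illustration of that convention, here is a sketch of a handler that pulls a message apart with multiple assignment. The method name describe and the message types shown are hypothetical; they only demonstrate the shape of the idiom.

```ruby
# Hypothetical handler: the leading symbol names the message type and the
# remainder of the list is the payload.
def describe(msg)
  type, *payload = msg
  case type
  when :add
    prime = payload.first
    "add prime #{prime}"
  when :prime?
    candidate, reply_to = payload
    "test #{candidate} for #{reply_to}"
  else
    "unknown message #{type.inspect}"
  end
end

puts describe([:add, 7])                   # prints "add prime 7"
puts describe([:prime?, 9, "controller"])  # prints "test 9 for controller"
```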

Our previous work building actors from code blocks doesn't apply here due to an implementation detail so we define both the controller and model actors as distinct classes. As it turns out this change isn't much of a problem; we're able to implement both classes, a helper Enumerable and a second Enumerable wrapping the controller in less than 140 lines with comments. Full code (including tests) can be found on github.

Spend a little time with JRuby and Akka and it becomes clear that they work well together and form a good basis for building concurrent applications in Ruby.

[1] This is a bit of a loaded statement; you will get some parallelism as multiple candidates move through the "stages" (i.e. goroutines) of the pipeline. That said, this notion of parallelism applies to the system as a whole. The process of determining whether a single candidate is or is not a prime number is still very sequential.

[2] "First" here means nothing more than "whichever test happens to complete and return a value first"

Friday, November 25, 2011

Musings on List Comprehensions, Functional Programming and Python

For simplicity and elegance in your programming constructs the list comprehension is hard to beat. [1] A list comprehension can filter an input list, transform it or do both, all in a single expression and with very readable syntax. At its best a list comprehension is "beautiful code" distilled: very clear and expressive with no unnecessary noise. You can, of course, make list comprehensions "ugly" but at least you have to try a bit to do so.

List comprehensions have their roots in functional programming and several modern functional languages include support for them. We'll see comprehensions in Haskell, Clojure and Scala. [2] Python also includes support for comprehensions; in fact the basis for Guido's argument to remove the map and filter functions from py3k was that expressions using these functions could be easily re-written as comprehensions. We'll also consider comprehensions in Python, including differences between the Python implementation and those in other languages and what effect those differences may have.

Let's consider a fairly straightforward problem: given a list of (x,y) coordinates in some two-dimensional space provide a list of the coordinates that are within some fixed distance of the origin (0,0). Our application also requires that results be displayed in polar coordinates, so for extra bonus points we should return our results in that notation. We can solve the problem quite easily in Haskell and Clojure using list comprehensions:

In Scala we use for expressions to accomplish something very similar:

And finally, a straightforward solution in Python:

Note that in each of these solutions we're doing a bit more work than we need to. Our implementation computes the distance from the origin twice, once when filtering values and again when generating the final output of the transformation process. This seems unnecessary; a better option would be to somehow define this value as intermediate state. This state could then be available to both filter and transform expressions. Haskell and Clojure support the introduction of intermediate bindings in their comprehension syntax:

Scala also allows for intermediate bindings within a for expression:

What about Python? It turns out we cannot solve this problem in Python using a single comprehension; the syntax doesn't allow for the introduction of intermediate state which can be used in either the predicate or transform expression. On the face of it this seems a bit odd; the language encourages the use of comprehensions for filtering and/or transformation while providing a less robust version of that very construct. To some degree this discrepancy reflects differing language goals. Guido's post on the history of list comprehensions seems to indicate that the motivation for adding these features was pragmatic; the syntax is an elegant way to express most filter and transform operations. Functional languages use list comprehensions as "syntactic sugar" for monadic effects [3] that don't really have an equivalent in standard Python usage. The syntax may look the same, but if you're coming from a functional perspective they can feel just a bit off. The same is true for a few other common functional idioms:

  • Lazy evaluation - List comprehensions in Python are not lazily evaluated. Generator expressions, which look very similar to list comprehensions, are lazily evaluated.

  • Higher-order functions - Anonymous functions are supported in Python but these functions are famously limited to a single expression. Functions can return functions but for non-trivial functions a named function must be declared and returned.

A couple things should be noted here. First, let us clearly state that Python is not and does not claim to be a functional programming language. While absolutely true, this fact doesn't change the underlying point. Moving from functional concepts back into Python can be a bit jarring; some things look similar but don't behave quite like you'd expect.

It's also worth noting that the inability to solve this problem with list comprehensions in Python doesn't mean that this problem cannot be solved in idiomatic Python. We wish to return our intermediate state as well as filter results based on its value; this dual use allows us to solve the problem with nested comprehensions. The inner comprehension will generate the final representation (including the intermediate state) and the outer comprehension will filter results based on that representation. In Python this looks something like:

This works only because the intermediate state is also returned in the final result. If that state were not explicitly returned (i.e. if its values were used as input to a conditional expression which returned, say, a string value describing the distance) this solution would not apply.

We can also solve this problem using generators. Using the state maintained by the generator we can iterate through the list, compute the intermediate state and yield a value only when we've satisfied our predicate. A generator-based solution would look something like:

Finally, none of these comments should be construed as a criticism of Python, the design choices that went into the language or the inclusion of list comprehensions generally. The pragmatic case for inclusion of this feature seems very strong. This post is interested only in the interplay between these features and similar features in other languages.

[1] Some languages (perhaps most notably Scala) use the term "for comprehension" or "for expression", and in some of these languages (Scala again) these constructs are more flexible than a list comprehension. That said, it's fairly straightforward to make Scala's for expressions behave like conventional list comprehensions.

[2] A purist might object that Scala is designed to mix features of object-oriented and functional languages, but the bias in favor of functional constructs justifies Scala's inclusion here.

[3] As an example, note that in Haskell list comprehensions can be replaced with do-notation. See the Haskell wiki for details.

Monday, September 12, 2011

Ruby and Concurrency: The Mechanics of Akka and JRuby

I'm interested in concurrency. I'm also interested in Ruby. There doesn't seem to be much reason to keep these two guys apart anymore.

This post marks the beginning of an occasional series on the topic of using Ruby to write concurrent code. Ruby doesn't yet have a big reputation in the world of concurrent and/or parallel applications, but there is some interesting work being done in this space. And since problems in concurrency are notoriously difficult to reason about we could probably do a lot worse than attempt to address those problems in a language designed to make life easier for the developer.

We begin with Akka, an excellent library designed to bring actors, actor coordination and STM to Scala and, to a slightly lesser degree, the JVM generally. Our initial task is simple: we wish to be able to define an Akka actor by providing a message handler as a code block. We'll start by attempting to implement the actor as a standalone class and then abstract that solution into code which requires the input block and nothing else.

A simple initial implementation looks something like this:

[@varese ruby]$ ruby --version
jruby 1.6.2 (ruby-1.8.7-p330) (2011-05-23 e2ea975) (OpenJDK Client VM 1.6.0_22) [linux-i386-java]
[@varese ruby]$ ruby akka_sample1.rb
ArgumentError: Constructor invocation failed: ActorRef for instance of actor [org.jruby.proxy.akka.actor.UntypedActor$Proxy0] is not in scope.
You can not create an instance of an actor explicitly using 'new MyActor'.
You have to use one of the factory methods in the 'Actor' object to create a new actor.
Either use:
'val actor = Actor.actorOf[MyActor]', or
'val actor = Actor.actorOf(new MyActor(..))'
(root) at akka_sample1.rb:20

Well, that didn't work very well. Apparently the class we defined in JRuby is being exposed to the Java lib as a proxy object and that proxy's class is unknown to the Akka runtime. No problem; Akka supports a factory model for actor creation, and by using that approach the underlying class of our actor should become a non-issue. With a few simple changes we're ready to try again:

[@varese ruby]$ ruby akka_sample2.rb
Received message: foo

We now have a working actor, but we still have some work to do; remember, we want to be able to define arbitrary actors by supplying just a code block. We need a few additional pieces to make this work:

  • A generic actor implementation whose state includes a block or Proc instance. The onReceive method of this actor could then simply call the contained block/Proc, passing the input message as an arg.

  • An ActorFactory implementation which takes a code block as an arg, stores it in internal state and then uses that block to build an instance of the generic actor described above on demand.

A first cut at this concept might look something like this:

[@varese ruby]$ ruby akka_sample3.rb
ArgumentError: wrong number of arguments for constructor
create at akka_sample3.rb:29
(root) at akka_sample3.rb:37

What went wrong here? UntypedActor is a concrete class with a defined no-arg constructor. That constructor is being called in favor of the one we've provided, and as a consequence our block never gets into the mix. There's almost certainly a cleaner way to solve this using JRuby, but for the moment we can get around the problem (in an admittedly ugly way) by providing a setter on our generic actor class:

[@varese ruby]$ ruby akka_sample4.rb
Message recieved: foo

We now have what we wanted.
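The overall shape of that workaround can be sketched in plain Ruby. Nothing in this sketch touches Akka: BlockActor and BlockActorFactory are hypothetical stand-ins, and in the real code the actor would extend UntypedActor and the factory would be handed to Akka for actor creation. The sketch only shows how a stored block reaches the generic actor through a setter rather than a constructor argument.

```ruby
# Plain-Ruby stand-ins for the generic actor and its factory (assumed names).
class BlockActor
  # Setter used because only the no-arg constructor gets called.
  attr_writer :handler

  # Delegate every incoming message to the stored block.
  def onReceive(msg)
    @handler.call(msg)
  end
end

class BlockActorFactory
  def initialize(&block)
    @block = block
  end

  # Build a generic actor on demand and inject the stored block.
  def create
    actor = BlockActor.new
    actor.handler = @block
    actor
  end
end

received = []
factory = BlockActorFactory.new { |msg| received << msg }
factory.create.onReceive("foo")
p received   # => ["foo"]
```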

If you're interested in this topic note that Nick Sieger has covered similar ground (including the interaction between JRuby and Akka) here. Nick's article draws on some very good work done by Daniel Ribeiro late last year. The code referenced in Daniel's article is available on Github. I didn't come across Daniel's post until my own was nearly done but there is quite a bit of overlap between his code and mine. That said, I recommend taking a look at both articles, if for no other reason than the fact that both authors are much better at writing Ruby code than I am.

Friday, September 2, 2011

Cassandra and Clojure: Things To Bytes And Back Again

In a previous post we briefly discussed the idea of using idiomatic Clojure to access data contained in a Cassandra instance, including transparent conversion to and from Clojure data. We'll explore an implementation of this idea, although we won't address the question of laziness, in part because there are sizable trade-offs to consider. For example, any solution that provides lazy evaluation must do so while also attempting to minimize the number of trips to the underlying data store. This question may be picked up in yet more future work, but for now we'll continue on. We've upgraded to Cassandra 0.8.2 and Clojure 1.2, and we're using a new data model (see below), but for the most part we'll try to pick up where we left off.

At the core the Cassandra data model is centered around columns. Each of these columns contains both a name and a value, both of which are represented as binary data. This model is very simple, and while it may appear limiting it is in reality quite flexible. The lack of any pre-defined data types avoids any "impedance mismatch" resulting from structured data being shoehorned into data types that don't really fit. We're free to represent column values in any way we see fit; if we can convert it into bytes it's fair game. Our problem thus devolves into a question of serialization, and suddenly there are many suitors vying for our attention. Among the set of well-known serialization packages we find Google's Protocol Buffers, Thrift and Avro. And since we're working in a language with good Java interop we can always have Java's built-in serialization (or something similar like Kryo) available. Finally we're always free to roll our own.

Let's begin by ruling out that last idea. There are already well-tested third-party serialization libraries so unless we believe that all of them suffer from some fatal error it's difficult to justify the risk and expense of creating something new. We'd also like our approach to have some level of cross-platform support so native Java serialization is excluded (along with Kryo). We also need to be able to encode and decode arbitrary data without defining message types or their payload(s) in advance, a limitation that rules out Protocol Buffers and Thrift. The last man standing is Avro, and fortunately for us he's a good candidate. Avro is schema-based but the library includes facilities for generating schemas on the fly by inspecting objects via Java reflection. Also included is a notion of storing schemas with data, allowing the original object to be easily reconstructed as needed. The Avro data model includes a rich set of basic types as well as support for "complex" types such as arrays and maps.

We'll need to implement a Clojure interface into the Avro functionality; this can be as simple as methods to encode and decode data and schemas. At least some implementations of Avro data files use a pre-defined "meta-schema" (consisting of the schema for the embedded data and that data itself) for storing both items. Consumers of these files first decode the meta-schema then use the discovered schema to decode the underlying data. We'll follow a similar path for our library. We'll also extend our Cassandra support a bit in order to support the insertion of new columns for a given key and column family.
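The decode-the-envelope-first idea can be sketched without Avro at all. The example below is a stand-in under loud assumptions: it uses Ruby's JSON library instead of Avro, the "schema" is just a type name, and the encode and decode helpers are hypothetical. It shows only the two-step pattern: read the envelope, then use the schema found there to interpret the payload.

```ruby
require "json"

# Hypothetical stand-in for Avro: each value is stored in an envelope that
# holds a schema name alongside the payload serialized as a string.
def schema_for(value)
  case value
  when true, false then "boolean"
  when Integer     then "int"
  when String      then "string"
  else raise ArgumentError, "unsupported type: #{value.class}"
  end
end

def encode(value)
  JSON.generate("schema" => schema_for(value), "data" => value.to_s)
end

def decode(bytes)
  envelope = JSON.parse(bytes)   # step 1: decode the envelope itself
  raw = envelope["data"]
  case envelope["schema"]        # step 2: let the stored schema drive decoding
  when "boolean" then raw == "true"
  when "int"     then Integer(raw)
  when "string"  then raw
  else raise ArgumentError, "unknown schema: #{envelope["schema"]}"
  end
end

p decode(encode(459))    # => 459
p decode(encode(true))   # => true
```

Callers never need to know in advance what type a column holds; the envelope carries that information with the data.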

Our core example will be built around a collection of employee records. We want to create a report on these employees using attributes defined in various well-known columns. We'd like to access the values in these columns as simple data types (booleans, integers, perhaps even an array or a map) but we'd like to do so through a uniform interface. We don't want to access certain columns in certain ways, as if we "knew" that a specific column contained data of a specific type. Any such approach is by definition brittle if the underlying data model should shift in flight (as data models are known to do).

After finishing up with our Clojure libraries we implement a simple app for populating our Cassandra instance with randomly-generated employee information:

[varese clojure]$ ~/local/bin/clojure add_employee_data.clj
Added user gFc0pVnLKPnjrrLx: [158 true [77 73 99 58 31 64 1 37 70 69]]
Added user 5gGh9anHwFINpr5t: [459 true [34 71 28 1 2 84 11 33 37 28]]
Added user pGRMeXBTFoBIWhud: [945 true [45 83 51 45 11 4 80 68 73 27]]

We're now ready to create our reporting application. Using Avro and a consistent meta-schema this comes off fairly easily:

[varese clojure]$ ~/local/bin/clojure get_employee_data.clj
Found 10 users
Username: 5gGh9anHwFINpr5t
Userid: 459
Userid is greater than zero
Active: true
User is active

And in order to verify our assumptions about simple cross-platform support we create a Ruby version of something very much like our reporting application:

[varese 1.8]$ ruby get_employee_data.rb
Username: 5gGh9anHwFINpr5t, user ID: 459, active: true
User ID is greater than zero
User is active
Username: 76v8iEJcc79Huj9L, user ID: 469, active: false
User ID is greater than zero
User is not active

This code meets our basic requirement, but as always there were a few stumbling blocks along the way. Avro includes strings as a primitive type, but unfortunately the Java API (which we leverage for our Clojure code) returns string instances as a Utf8 type. We can get a java.lang.String from these objects, but doing so requires another toString() method call that (logically) is completely unnecessary. We also don't fully support complex types. The Avro code maps the Clojure classes representing arrays and maps onto a "record" type that includes the various fields exposed via getters. Supporting these types requires the ability to reconstruct the underlying object based on these fields, and doing so reliably is beyond the scope of this work. Finally, we were forced to use Ruby 1.8.x for the Ruby examples since the Avro gem apparently doesn't yet support 1.9.

Full code can be found on github.