Friday, September 2, 2011

Cassandra and Clojure: Things To Bytes And Back Again

In a previous post we briefly discussed the idea of using idiomatic Clojure to access data contained in a Cassandra instance, including transparent conversion to and from Clojure data. We'll explore an implementation of this idea here, although we won't address the question of laziness, in part because there are sizable trade-offs to consider. For example, any solution that provides lazy evaluation must do so while also minimizing the number of trips to the underlying data store. We may return to this question in future work, but for now we'll continue on. We've upgraded to Cassandra 0.8.2 and Clojure 1.2, and we're using a new data model (see below), but for the most part we'll try to pick up where we left off.

The Cassandra data model is centered on columns. Each column contains a name and a value, both of which are represented as binary data. This model is very simple, and while it may appear limiting it is in reality quite flexible. The lack of any pre-defined data types avoids the "impedance mismatch" that results from structured data being shoehorned into types that don't really fit. We're free to represent column values in any way we see fit; if we can convert it into bytes, it's fair game. Our problem thus reduces to a question of serialization, and suddenly there are many suitors vying for our attention. Among the well-known serialization packages we find Google's Protocol Buffers, Thrift and Avro. And since we're working in a language with good Java interop, Java's built-in serialization (or something similar, like Kryo) is always available. Finally, we're free to roll our own.
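
To make the "everything is bytes" point concrete, here's a minimal sketch of building a column by hand against the 0.8-era Thrift API; the string-to-bytes conversion and the helper name are purely illustrative.

(import '(org.apache.cassandra.thrift Column)
        '(java.nio ByteBuffer))

;; Build a Thrift Column from a string name and value. Any value we can
;; reduce to bytes could be dropped into the value slot instead.
(defn string->column [col-name col-value]
  (doto (Column.)
    (.setName (ByteBuffer/wrap (.getBytes col-name "UTF-8")))
    (.setValue (ByteBuffer/wrap (.getBytes col-value "UTF-8")))
    (.setTimestamp (System/currentTimeMillis))))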

Let's begin by ruling out that last idea. There are already well-tested third-party serialization libraries, so unless we believe that all of them suffer from some fatal flaw it's difficult to justify the risk and expense of creating something new. We'd also like our approach to have some level of cross-platform support, so native Java serialization is excluded (along with Kryo). We also need to be able to encode and decode arbitrary data without defining message types or their payload(s) in advance, a limitation that rules out Protocol Buffers and Thrift. The last man standing is Avro, and fortunately for us it's a good candidate. Avro is schema-based, but the library includes facilities for generating schemas on the fly by inspecting objects via Java reflection. Also included is the notion of storing schemas with data, allowing the original object to be easily reconstructed as needed. The Avro data model includes a rich set of basic types as well as support for "complex" types such as arrays and maps.
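
As a quick illustration of that reflection support, Avro's ReflectData can derive a schema directly from an ordinary Java class. A small sketch (java.awt.Point stands in for a real domain class here, just to show the generated JSON):

(import '(org.apache.avro.reflect ReflectData)
        '(java.awt Point))

;; Ask Avro to infer a schema via reflection; Point's two public int fields
;; become the fields of an Avro record.
(println (.toString (.getSchema (ReflectData/get) Point)))
;; prints something like:
;; {"type":"record","name":"Point","namespace":"java.awt",
;;  "fields":[{"name":"x","type":"int"},{"name":"y","type":"int"}]}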

We'll need to implement a Clojure interface to the Avro functionality; this can be as simple as functions to encode and decode data and schemas. At least some implementations of Avro data files use a pre-defined "meta-schema" (describing a record that contains the schema for the embedded data as well as that data itself) to store both items together. Consumers of these files first decode the meta-schema, then use the discovered schema to decode the underlying data. We'll follow a similar path for our library. We'll also extend our Cassandra support a bit in order to allow insertion of new columns for a given key and column family.
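
A rough sketch of the shape those functions might take is below. This isn't the actual library code, and for brevity the "meta-schema" is simplified to a map carrying the schema's JSON next to the encoded bytes, but the idea is the same: a reader can always recover the schema before touching the data.

(import '(org.apache.avro Schema)
        '(org.apache.avro.reflect ReflectData ReflectDatumWriter)
        '(org.apache.avro.generic GenericDatumReader)
        '(org.apache.avro.io EncoderFactory DecoderFactory)
        '(java.io ByteArrayOutputStream))

;; Encode a value (simple types and Java-bean style objects), returning the
;; generated schema as JSON together with the Avro-encoded bytes.
(defn encode [obj]
  (let [schema  (.getSchema (ReflectData/get) (class obj))
        out     (ByteArrayOutputStream.)
        encoder (.binaryEncoder (EncoderFactory/get) out nil)]
    (.write (ReflectDatumWriter. schema) obj encoder)
    (.flush encoder)
    {:schema (.toString schema) :bytes (.toByteArray out)}))

;; Decode bytes using the schema that was stored alongside them.
(defn decode [{schema-json :schema data :bytes}]
  (let [schema  (Schema/parse schema-json)
        decoder (.binaryDecoder (DecoderFactory/get) data nil)]
    (.read (GenericDatumReader. schema) nil decoder)))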

Our core example will be built around a collection of employee records. We want to create a report on these employees using attributes defined in various well-known columns. We'd like to access the values in these columns as simple data types (booleans, integers, perhaps even an array or a map) but we'd like to do so through a uniform interface. We don't want to access certain columns in certain ways, as if we "knew" that a specific column contained data of a specific type. Any such approach is by definition brittle if the underlying data model should shift in flight (as data models are known to do).

After finishing up with our Clojure libraries we implement a simple app for populating our Cassandra instance with randomly-generated employee information:



[varese clojure]$ ~/local/bin/clojure add_employee_data.clj
Added user gFc0pVnLKPnjrrLx: [158 true [77 73 99 58 31 64 1 37 70 69]]
Added user 5gGh9anHwFINpr5t: [459 true [34 71 28 1 2 84 11 33 37 28]]
Added user pGRMeXBTFoBIWhud: [945 true [45 83 51 45 11 4 80 68 73 27]]
...
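
For reference, the heart of that populating step looks roughly like the sketch below. The attribute names are guesses based on the output above, and insert-column is a hypothetical helper wrapping our Cassandra write support; the real add_employee_data.clj is in the repository.

;; Generate a random employee and write one Cassandra column per attribute,
;; each value carrying its own schema via encode (sketched earlier).
;; insert-column is a hypothetical helper, not part of the actual library.
(defn random-employee []
  {"userid" (rand-int 1000)
   "active" (zero? (rand-int 2))
   "scores" (int-array (repeatedly 10 #(rand-int 100)))})

(defn add-employee [username]
  (let [employee (random-employee)]
    (doseq [[column-name value] employee]
      (insert-column "employees" username column-name (encode value)))
    (println "Added user" username (vals employee))))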


We're now ready to create our reporting application. Using Avro and a consistent meta-schema, this comes together fairly easily:



[varese clojure]$ ~/local/bin/clojure get_employee_data.clj
Found 10 users
Username: 5gGh9anHwFINpr5t
Userid: 459
Userid is greater than zero
Active: true
User is active
...
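
The read side is where the uniform interface pays off: fetch the well-known columns for a user, hand each raw value to decode, and branch on the result without knowing how it was stored. A rough sketch (get-column is a hypothetical helper around our Cassandra read support, and decode was sketched earlier):

;; get-column is assumed to return the stored schema-plus-bytes value for a
;; given row key and column name.
(defn report-on [username]
  (let [userid (decode (get-column "employees" username "userid"))
        active (decode (get-column "employees" username "active"))]
    (println "Username:" username)
    (println "Userid:" userid)
    (when (pos? userid)
      (println "Userid is greater than zero"))
    (println "Active:" active)
    (println (if active "User is active" "User is not active"))))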


And in order to verify our assumptions about simple cross-platform support, we create a Ruby version of something very much like our reporting application:



[varese 1.8]$ ruby get_employee_data.rb
Username: 5gGh9anHwFINpr5t, user ID: 459, active: true
User ID is greater than zero
User is active
Username: 76v8iEJcc79Huj9L, user ID: 469, active: false
User ID is greater than zero
User is not active
...


This code meets our basic requirements, but as always there were a few stumbling blocks along the way. Avro includes strings as a primitive type, but the Java API (which we leverage for our Clojure code) returns string values as instances of its Utf8 type. We can get a java.lang.String from these objects, but it requires an extra toString() call that (logically) is completely unnecessary. We also don't fully support complex types. The Avro code maps the Clojure classes representing arrays and maps onto a "record" type that exposes the various fields via getters. Supporting these types requires the ability to reconstruct the underlying object from those fields, and doing so reliably is beyond the scope of this work. Finally, we were forced to use Ruby 1.8.x for the Ruby examples, since the Avro gem apparently doesn't yet support 1.9.
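
In practice the Utf8 issue amounts to a small conversion step like the following (an illustrative helper of ours, not part of Avro's API):

(import '(org.apache.avro.util Utf8))

;; Avro's Java API hands back org.apache.avro.util.Utf8 where we'd prefer a
;; java.lang.String, so decoded values get passed through a check like this.
(defn avro->string [v]
  (if (instance? Utf8 v) (.toString v) v))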


Full code can be found on GitHub.
