At its core, the Cassandra data model is centered on columns. Each column contains a name and a value, both represented as binary data. This model is very simple, and while it may appear limiting it is in reality quite flexible. The lack of any pre-defined data types avoids the "impedance mismatch" that results from shoehorning structured data into types that don't really fit. We're free to represent column values in any way we see fit; if we can convert it into bytes it's fair game. Our problem thus reduces to a question of serialization, and suddenly there are many suitors vying for our attention. Among the well-known serialization packages we find Google's Protocol Buffers, Thrift and Avro. And since we're working in a language with good Java interop we also have Java's built-in serialization (or something similar, like Kryo) available. Finally, we're always free to roll our own.
Let's begin by ruling out that last idea. There are already plenty of well-tested third-party serialization libraries, so unless we believe all of them suffer from some fatal flaw it's difficult to justify the risk and expense of creating something new. We'd also like our approach to have some level of cross-platform support, so native Java serialization is excluded (along with Kryo). We also need to encode and decode arbitrary data without defining message types or their payload(s) in advance, a requirement that rules out Protocol Buffers and Thrift. The last man standing is Avro, and fortunately for us it's a strong candidate. Avro is schema-based, but the library includes facilities for generating schemas on the fly by inspecting objects via Java reflection. Also included is a notion of storing schemas alongside data, allowing the original object to be easily reconstructed as needed. The Avro data model includes a rich set of primitive types as well as support for "complex" types such as arrays and maps.
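To make that reflection point concrete, here's a quick illustration (independent of the library we're about to build) of Avro deriving schemas at runtime via its reflect API:

(import '(org.apache.avro.reflect ReflectData))

; Ask Avro for a schema by inspecting a class; simple types map
; directly onto Avro primitives.
(println (.getSchema (ReflectData/get) Integer)) ; "int"
(println (.getSchema (ReflectData/get) Boolean)) ; "boolean"
(println (.getSchema (ReflectData/get) String))  ; "string"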
We'll need to implement a Clojure interface to the Avro functionality; this can be as simple as functions to encode and decode data and schemas. At least some implementations of Avro data files use a pre-defined "meta-schema" (consisting of the schema for the embedded data plus that data itself) to store both items together. Consumers of these files first decode the meta-schema and then use the discovered schema to decode the underlying data. We'll follow a similar path for our library. We'll also extend our Cassandra support a bit to allow inserting new columns for a given key and column family.
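Before moving on it's worth sketching what such a library might look like. To be clear, this is a hedged sketch rather than the actual fencepost.avro source (which may differ considerably): the function names match those used below, but the internals are assumptions. The one fixed point is the meta-schema the Ruby consumer at the end expects, an Avro map of bytes containing "schema" and "data" entries. We also assume Avro 1.5 or later (Schema$Parser replaced the older static Schema/parse).

; A sketch of the encode/decode pair, NOT the actual fencepost.avro source.
(import '(org.apache.avro Schema Schema$Parser)
        '(org.apache.avro.reflect ReflectData ReflectDatumWriter ReflectDatumReader)
        '(org.apache.avro.generic GenericDatumReader)
        '(org.apache.avro.io EncoderFactory DecoderFactory)
        '(org.apache.avro.util Utf8)
        '(java.io ByteArrayOutputStream)
        '(java.nio ByteBuffer))

; The meta-schema shared by all producers and consumers: a map of bytes
; holding the embedded data's schema and the encoded data itself.
(def ^:private meta-schema
  (.parse (Schema$Parser.) "{\"type\": \"map\", \"values\": \"bytes\"}"))

(defn- write-datum [schema datum]
  ; Serialize a single datum to a byte array using Avro's binary encoding.
  (let [out (ByteArrayOutputStream.)
        enc (.binaryEncoder (EncoderFactory/get) out nil)]
    (.write (ReflectDatumWriter. schema) datum enc)
    (.flush enc)
    (.toByteArray out)))

(defn- buf->bytes [^ByteBuffer buf]
  ; Copy the readable portion of a ByteBuffer into a fresh byte array.
  (let [arr (byte-array (.remaining buf))]
    (.get (.duplicate buf) arr)
    arr))

(defn encode_with_schema [datum]
  ; Derive a schema via reflection, then bundle schema and data together
  ; under the meta-schema so consumers need no prior knowledge of types.
  (let [schema (.getSchema (ReflectData/get) (class datum))]
    (write-datum meta-schema
                 {"schema" (ByteBuffer/wrap (.getBytes (str schema) "UTF-8"))
                  "data"   (ByteBuffer/wrap (write-datum schema datum))})))

(defn decode_from_schema [^bytes encoded]
  ; Decode the meta-schema wrapper first, then use the embedded schema to
  ; decode the payload. Note that Avro hands back map keys as Utf8 objects.
  (let [meta-val (.read (GenericDatumReader. meta-schema) nil
                        (.binaryDecoder (DecoderFactory/get) encoded nil))
        schema (.parse (Schema$Parser.)
                       (String. (buf->bytes (.get meta-val (Utf8. "schema"))) "UTF-8"))]
    (.read (ReflectDatumReader. schema) nil
           (.binaryDecoder (DecoderFactory/get)
                           (buf->bytes (.get meta-val (Utf8. "data"))) nil))))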
Our core example will be built around a collection of employee records. We want to create a report on these employees using attributes defined in various well-known columns. We'd like to access the values in these columns as simple data types (booleans, integers, perhaps even an array or a map) but we'd like to do so through a uniform interface. We don't want to access certain columns in certain ways, as if we "knew" that a specific column contained data of a specific type. Any such approach is by definition brittle if the underlying data model should shift in flight (as data models are known to do).
After finishing up with our Clojure libraries we implement a simple app for populating our Cassandra instance with randomly generated employee information:
; Populate data for a set of random users to a Cassandra instance.
;
; Users consist of the following set of data:
; - a username [String]
; - a user ID [integer]
; - a flag indicating whether the user is "active" [boolean]
; - a list of location IDs for each user [list of integer]
;
; User records are keyed by username rather than user IDs, mainly because at the moment
; we only support strings for key values. The Cassandra API exposes keys as byte arrays
; so we could extend our Cassandra support to include other datatypes.
(use '[fencepost.avro])
(use '[fencepost.cassandra])

(import '(org.apache.commons.lang3 RandomStringUtils)
        '(java.util Random))

; Utility function to combine our Avro lib with our Cassandra lib
(defn add_user [client username userid active locationids]
  (let [userid_data (encode_with_schema userid)
        active_data (encode_with_schema active)
        locationids_data (encode_with_schema locationids)]
    (insert client username "employee" "userid" userid_data)
    (insert client username "employee" "active" active_data)
    (insert client username "employee" "locationids" locationids_data)))

; Generate a list of random usernames
(let [client (connect "localhost" 9160 "employees")]
  (dotimes [n 10]
    (let [username (RandomStringUtils/randomAlphanumeric 16)
          random (Random.)
          userid (.nextInt random 1000)
          active (.nextBoolean random)
          locationids (into [] (repeatedly 10 #(.nextInt random 100)))]
      (add_user client username userid active locationids)
      (println (format "Added user %s: [%s %s %s]" username userid active locationids)))))
[varese clojure]$ ~/local/bin/clojure add_employee_data.clj
Added user gFc0pVnLKPnjrrLx: [158 true [77 73 99 58 31 64 1 37 70 69]]
Added user 5gGh9anHwFINpr5t: [459 true [34 71 28 1 2 84 11 33 37 28]]
Added user pGRMeXBTFoBIWhud: [945 true [45 83 51 45 11 4 80 68 73 27]]
...
We're now ready to create our reporting application. Using Avro and a consistent meta-schema this comes together fairly easily:
; Retrieve information from the Cassandra database about one of our employees
(use '[fencepost.avro])
(use '[fencepost.cassandra])

(defn evaluate_user
  "Gather information for the specified user and display a minimal report about them"
  [slices username]
  ; Note that the code below says nothing about types. We specify the column names we
  ; wish to access but whatever Cassandra + Avro supplies for the value of that column
  ; is what we get.
  (let [user_data (range_slices_columns slices username)
        userid (decode_from_schema (user_data :userid))
        active (decode_from_schema (user_data :active))
        locationids (decode_from_schema (user_data :locationids))]
    (println (format "Username: %s" username))
    (println (format "Userid: %s" userid))
    (println (if (> userid 0) "Userid is greater than zero" "Userid is not greater than zero"))
    (println (format "Active: %s" active))
    (println (if active "User is active" "User is not active"))
    ; Every user should have at least one location ID.
    ;
    ; Well, they would if we were able to successfully handle an Avro record.
    ;(assert (> (count locationids) 0))
    ))

(let [client (connect "localhost" 9160 "employees")
      key_slices (get_range_slices client "employee" "!" "~")
      keys (range_slices_keys key_slices)]
  (println (format "Found %d users" (count keys)))
  (doseq [k keys]
    (evaluate_user key_slices k)))
[varese clojure]$ ~/local/bin/clojure get_employee_data.clj
Found 10 users
Username: 5gGh9anHwFINpr5t
Userid: 459
Userid is greater than zero
Active: true
User is active
...
And to verify our assumptions about simple cross-platform support, we create a Ruby version of something very much like our reporting application:
require 'rubygems'
require 'stringio'
require 'avro'
require 'cassandra'

def evaluate_avro_data bytes
  # Define the meta-schema
  meta_schema = Avro::Schema.parse("{\"type\": \"map\", \"values\": \"bytes\"}")
  # Read the meta source and extract the contained data and schema
  meta_datum_reader = Avro::IO::DatumReader.new(meta_schema)
  meta_val = meta_datum_reader.read(Avro::IO::BinaryDecoder.new(StringIO.new(bytes)))
  # Build a new reader which can handle the indicated schema
  schema = Avro::Schema.parse(meta_val["schema"])
  datum_reader = Avro::IO::DatumReader.new(schema)
  val = datum_reader.read(Avro::IO::BinaryDecoder.new(StringIO.new(meta_val["data"])))
end

client = Cassandra.new('employees', '127.0.0.1:9160')
client.get_range(:employee, {:start_key => "!", :finish_key => "~"}).each do |k,v|
  userid = evaluate_avro_data v["userid"]
  active = evaluate_avro_data v["active"]
  locationids = evaluate_avro_data v["locationids"]
  puts "Username: #{k}, user ID: #{userid}, active: #{active}"
  puts "User ID #{(userid > 0) ? "is" : "is not"} greater than zero"
  puts "User #{active ? "is" : "is not"} active"
  # Ruby's much more flexible notion of truthiness makes the tests above somewhat less
  # compelling. For extra validation we add the following
  raise "Oops, it's not a number" unless userid.is_a? Fixnum
end
[varese 1.8]$ ruby get_employee_data.rb
Username: 5gGh9anHwFINpr5t, user ID: 459, active: true
User ID is greater than zero
User is active
Username: 76v8iEJcc79Huj9L, user ID: 469, active: false
User ID is greater than zero
User is not active
...
This code meets our basic requirements, but as always there were a few stumbling blocks along the way. Avro includes strings as a primitive type, but unfortunately the Java API (which we leverage from our Clojure code) returns string values as instances of its Utf8 type. We can get a java.lang.String from these objects, but doing so requires an extra toString() call that (logically) should be completely unnecessary. We also don't fully support complex types. Avro's reflection support maps the Clojure classes representing arrays and maps onto a "record" type made up of whatever fields those classes expose via getters. Supporting these types requires the ability to reconstruct the underlying object from those fields, and doing so reliably is beyond the scope of this work. Finally, we were forced to use Ruby 1.8.x for the Ruby examples since the Avro gem apparently doesn't yet support 1.9.
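To illustrate that first point, the extra hop looks something like this (some_bytes here is a hypothetical Avro-encoded string value):

; Avro's Java API returns org.apache.avro.util.Utf8 rather than java.lang.String,
; so decoded strings need an explicit (and arguably redundant) conversion.
(let [decoded (decode_from_schema some_bytes)]
  (instance? org.apache.avro.util.Utf8 decoded)  ; => true
  (str decoded))                                 ; the java.lang.String we actually wanted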
Full code can be found on GitHub.