Tuesday, May 24, 2011

Cassandra and Clojure: The Beginning of a Beautiful Friendship

Soon after I began working with Cassandra it became clear to me that if you were in the market for a platform for creating applications that interact with this database you could do a lot worse than Clojure. The lack of a query language [1] suggests that filtering and slicing lists of keys and columns might be a fairly common activity for apps powered by Cassandra. And while many languages support the map/filter/reduce paradigm Clojure's use of sequences throughout the core suggest a natural means to integrate this data into the rest of your application.

Cassandra itself provides an API that uses the Thrift protocol for manipulating data. We'll use this interface to implement a simple proof-of-concept application that might serve as a testbed for manipulating data managed by Cassandra in idiomatic Clojure. Note that the Clojure ecosystem already includes several open-source projects that connect Clojure to Cassandra: these include clj-cassandra and clj-hector, the latter leveraging the Hector Java client to do it's work. In order to keep things simple we choose to avoid any of these third-party libraries; it's not as if the Thrift interface imposes a heavy burden on us. Let's see how far we can get with what's already in the packaging.

That sounds great... so what exactly are we trying to do? Beginning with the database generated during our previous work with Cassandra we should be able to access sets of keys within a keyspace and a set of columns for any specific key. These structures should be available for manipulation in idiomatic Clojure as sequences. Ideally these sequences would be at least somewhat lazy and transparently support multiple datatypes. [2]

Using the Thrift interface requires working with a fair number of Java objects representing return types and/or arguments to the various exposed functions. My Clojure isn't yet solid enough to hash out Java interop code without flailing a bit so we kick things off with a Scala implementation. This approach allows us to simplify the interop problem without sacrificing the functional approach, all within a language that is by now fairly familiar.

The Scala code includes a fair number of intermediate objects but is otherwise fairly clean:

Translating this code into Clojure is more straightforward than expected:

These results seem fairly promising, although we're nowhere near done. This code assumes that all column names and values are strings, a perfectly ridiculous assumption. We also don't offer any support for nested data types, although in fairness this was a failing of our earlier work as well. Finally we haven't built in much support for lazy evaluation; we lazily convert column names to Clojure keywords but that's about it. But fear not, gentle reader; we'll revisit some or all of these points in future work.

[1] At least until CQL arrives in Cassandra 0.8
[2] We won't be able to meet these last two goals in our initial implementation, but with any luck we'll be able to revisit them in future work

Sunday, May 1, 2011

Scala, Bloom Filters and Shakespeare: Introduction

The Problem

Maybe it was a recent New York Times review of a new production of "Macbeth" starring John Douglas Thompson. Maybe it was a reflection on Kurosawa's use of Shakespearean plots and devices within some of his best work. Maybe it was just one of those random thoughts that pop into your head while walking out the door in the morning. It's origin may be hidden from me now, but recently a question entered my mind and refused to leave:

What percentage of the words in an arbitrary play by Shakespeare cannot be found in a common modern dictionary?

There's no doubt that some of Shakespeare's language would now be considered archaic. Quite a bit of it is still in common use today, however, and some usages have made their way into the lexicon via the modest success enjoyed by the Bard through the years. So how big is that set of archaic words? Is it a majority, a small minority or something else entirely?

After pondering a few approaches to this question it became clear that there was room for a fair amount of further work. A standalone program could help us answer this question but there was no reason to stop there. The solution to this problem lends itself naturally to decomposition, and that fact in turn suggests that this problem might be an interesting vehicle for exploring more complex solutions. With any luck we'll get to this exploration in later posts, but for now our goal is to describe the problem space and provide the general framework of a solution.

A Few Solutions

Broadly speaking we will answer this question by constructing a filter containing all the terms in our dictionary. We will then extract all individual words from the script of an arbitrary play and apply our filter to each of these words. Simple, right?

We might consider using a Bloom filter to implement our filter. We would expect a Bloom filter to be a very compact yet very efficient representation of the contents of our input dictionary. The issue is that any Bloom filter would produce some number of false positives, values that for which our filter should return false (i.e. "not present") but return true instead. This problem is offset somewhat by the fact that a negative result from a Bloom filter will always be accurate, but in the end this approach will give us at best a close approximation to the actual value. For our purposes this should be adequate enough.

Note: this post is not intended to be a primer on Bloom filters or their implementation. If you're looking for such a thing there are already excellent sources available.

An alternate approach might be to implement a filter using a search index such as Lucene. The result would not be as compact as a Bloom filter but it would certainly offer a more accurate result.

To make things interesting we choose to create both versions. We can live with a good approximation so we focus on the Bloom filter implementation. The Lucene implementation will be used primarily as a baseline for comparing performance and accuracy of results. In order to justify this approach we impose the following additional constraints on the Bloom filter implementation:

  • It must outperform the Lucene version

  • It should use a minimal amount of memory

The resulting code can be found on github.

Observations Along the Way

Extracting words from plays

Shakespeare's plays are available in XML format from ibiblio. We are still tasked with extracting individual words from these documents, however, and our concern for resource usage prevents us from simply reading the entire document into memory and querying it via XPath or XQuery. In the end we resolved this issue by implementing what is at heart a finite state machine (FSM) driven by inputs obtained from the XML pull parser provided by the Scala library.

Minimizing false positives

There are known approaches to constraining the number of false positives produced by a Bloom value to some reasonable maximum. Unfortunately the size of our data set and our resource constraints prevent us from leveraging these techniques. The number of false positives is also inflated by some of our implementation choices.

The dictionary used in our implementation was provided by the "words" package in Fedora Core 13. The dictionary in this package contains approximately 480,000 words. Assuming distinct internal hash values for every word our Bloom filter should support all 19-bit input values (since the base-2 log of 480,000 is 18.87). To avoid congestion we strive for a filter that is ~50% full, which in turn requires support of all 20-bit inputs. Note that our space constraints make the use of a second hash function unwise; we'd need a much larger capacity for a population size as large as ours. Also note that Bloom filters use a bit set for a backing store and the memory usage of a bit set is driven by the largest value contained in that set. It follows that our filter will use no more than (2 ^ 20)/8 = 131KB of memory.

Unfortunately a few of our implementation choices help to artificially inflate our false positive rate. These include:

  • Scala's bit set data structure accepts only non-negative values while the MurmurHash function we use internally can generate arbitrary 32-bit output (including the full range of negative integers on the JVM). We resolve this issue by using the absolute value of the MurmurHash output, the result of which is that two output values now result in the same output from our internal function(s).

  • Bit set size is constrained by masking some number of bits from the MurmurHash output. This technique also causes multiple MurmurHash outputs to result in the same output from our internal function(s).

The results below suggest a false positive rate on the order of 4%, an exceedingly high value.

Immutable and mutable data structures

The end of a previous post described my struggles with deciding when algorithms based on mutable data were acceptable within Scala's ethos. The language has an obvious bias against mutable data, but it's a bias that seems quite reasonable; the arguments in favor of immutability are quite convincing. Yet mutability is explicitly supported by the language and is understood to be useful in some circumstances.

An example presented itself while completing the Bloom filter implementation. The original construction used a foldLeft operation to reduce the set of internal hash values into a single bit set:

private val bloom = (BitSet() /: (vals map evalAllFns))(_ ++ _)

Unfortunately this approach required the creation of a new BitSet instance for each input value, a potentially costly process. The performance was quite poor:

[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/lucene-core-3.1.0.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words lucene
Filter constructed in 12295 milliseconds
Total words in script: 16347
Matches: 14328
Searching completed in 51 milliseconds

real 0m17.116s
user 0m9.666s
sys 0m4.980s
[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 133928 milliseconds
Total words in script: 16347
Matches: 15046
Searching completed in 10 milliseconds

real 2m19.174s
user 0m27.765s
sys 1m40.000s

Clearly searching wasn't the problem. The performance was vastly improved by shifting to something based on mutable data structures:

import scala.collection.mutable.BitSet

private val bloom = BitSet()
for (aval <- (vals flatMap evalAllFns)) { bloom += aval }

[mersault@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 7525 milliseconds
Total words in script: 16347
Matches: 14992
Searching completed in 9 milliseconds

real 0m11.872s
user 0m6.779s
sys 0m4.476s

In this case the mutability is contained within a single class: the bit set is itself a val and is never modified after being generated at construction. That said, the process for determining when mutable data should be used is still somewhat opaque.