Sunday, May 1, 2011

Scala, Bloom Filters and Shakespeare: Introduction

The Problem

Maybe it was a recent New York Times review of a new production of "Macbeth" starring John Douglas Thompson. Maybe it was a reflection on Kurosawa's use of Shakespearean plots and devices within some of his best work. Maybe it was just one of those random thoughts that pop into your head while walking out the door in the morning. It's origin may be hidden from me now, but recently a question entered my mind and refused to leave:

What percentage of the words in an arbitrary play by Shakespeare cannot be found in a common modern dictionary?

There's no doubt that some of Shakespeare's language would now be considered archaic. Quite a bit of it is still in common use today, however, and some usages have made their way into the lexicon via the modest success enjoyed by the Bard through the years. So how big is that set of archaic words? Is it a majority, a small minority or something else entirely?

After pondering a few approaches to this question it became clear that there was room for a fair amount of further work. A standalone program could help us answer this question but there was no reason to stop there. The solution to this problem lends itself naturally to decomposition, and that fact in turn suggests that this problem might be an interesting vehicle for exploring more complex solutions. With any luck we'll get to this exploration in later posts, but for now our goal is to describe the problem space and provide the general framework of a solution.

A Few Solutions

Broadly speaking we will answer this question by constructing a filter containing all the terms in our dictionary. We will then extract all individual words from the script of an arbitrary play and apply our filter to each of these words. Simple, right?

We might consider using a Bloom filter to implement our filter. We would expect a Bloom filter to be a very compact yet very efficient representation of the contents of our input dictionary. The issue is that any Bloom filter would produce some number of false positives, values that for which our filter should return false (i.e. "not present") but return true instead. This problem is offset somewhat by the fact that a negative result from a Bloom filter will always be accurate, but in the end this approach will give us at best a close approximation to the actual value. For our purposes this should be adequate enough.

Note: this post is not intended to be a primer on Bloom filters or their implementation. If you're looking for such a thing there are already excellent sources available.

An alternate approach might be to implement a filter using a search index such as Lucene. The result would not be as compact as a Bloom filter but it would certainly offer a more accurate result.

To make things interesting we choose to create both versions. We can live with a good approximation so we focus on the Bloom filter implementation. The Lucene implementation will be used primarily as a baseline for comparing performance and accuracy of results. In order to justify this approach we impose the following additional constraints on the Bloom filter implementation:

  • It must outperform the Lucene version

  • It should use a minimal amount of memory

The resulting code can be found on github.

Observations Along the Way

Extracting words from plays

Shakespeare's plays are available in XML format from ibiblio. We are still tasked with extracting individual words from these documents, however, and our concern for resource usage prevents us from simply reading the entire document into memory and querying it via XPath or XQuery. In the end we resolved this issue by implementing what is at heart a finite state machine (FSM) driven by inputs obtained from the XML pull parser provided by the Scala library.

Minimizing false positives

There are known approaches to constraining the number of false positives produced by a Bloom value to some reasonable maximum. Unfortunately the size of our data set and our resource constraints prevent us from leveraging these techniques. The number of false positives is also inflated by some of our implementation choices.

The dictionary used in our implementation was provided by the "words" package in Fedora Core 13. The dictionary in this package contains approximately 480,000 words. Assuming distinct internal hash values for every word our Bloom filter should support all 19-bit input values (since the base-2 log of 480,000 is 18.87). To avoid congestion we strive for a filter that is ~50% full, which in turn requires support of all 20-bit inputs. Note that our space constraints make the use of a second hash function unwise; we'd need a much larger capacity for a population size as large as ours. Also note that Bloom filters use a bit set for a backing store and the memory usage of a bit set is driven by the largest value contained in that set. It follows that our filter will use no more than (2 ^ 20)/8 = 131KB of memory.

Unfortunately a few of our implementation choices help to artificially inflate our false positive rate. These include:

  • Scala's bit set data structure accepts only non-negative values while the MurmurHash function we use internally can generate arbitrary 32-bit output (including the full range of negative integers on the JVM). We resolve this issue by using the absolute value of the MurmurHash output, the result of which is that two output values now result in the same output from our internal function(s).

  • Bit set size is constrained by masking some number of bits from the MurmurHash output. This technique also causes multiple MurmurHash outputs to result in the same output from our internal function(s).

The results below suggest a false positive rate on the order of 4%, an exceedingly high value.

Immutable and mutable data structures

The end of a previous post described my struggles with deciding when algorithms based on mutable data were acceptable within Scala's ethos. The language has an obvious bias against mutable data, but it's a bias that seems quite reasonable; the arguments in favor of immutability are quite convincing. Yet mutability is explicitly supported by the language and is understood to be useful in some circumstances.

An example presented itself while completing the Bloom filter implementation. The original construction used a foldLeft operation to reduce the set of internal hash values into a single bit set:

private val bloom = (BitSet() /: (vals map evalAllFns))(_ ++ _)

Unfortunately this approach required the creation of a new BitSet instance for each input value, a potentially costly process. The performance was quite poor:

[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/lucene-core-3.1.0.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words lucene
Filter constructed in 12295 milliseconds
Total words in script: 16347
Matches: 14328
Searching completed in 51 milliseconds

real 0m17.116s
user 0m9.666s
sys 0m4.980s
[@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 133928 milliseconds
Total words in script: 16347
Matches: 15046
Searching completed in 10 milliseconds

real 2m19.174s
user 0m27.765s
sys 1m40.000s

Clearly searching wasn't the problem. The performance was vastly improved by shifting to something based on mutable data structures:

import scala.collection.mutable.BitSet

private val bloom = BitSet()
for (aval <- (vals flatMap evalAllFns)) { bloom += aval }

[mersault@varese scala_bloom]$ time scala -classpath target/scala_2.8.1/classes:lib/commons-lang-2.6.jar org.fencepost.bloom.FilterApplication macbeth.xml /usr/share/dict/words bloom
Filter constructed in 7525 milliseconds
Total words in script: 16347
Matches: 14992
Searching completed in 9 milliseconds

real 0m11.872s
user 0m6.779s
sys 0m4.476s

In this case the mutability is contained within a single class: the bit set is itself a val and is never modified after being generated at construction. That said, the process for determining when mutable data should be used is still somewhat opaque.

No comments:

Post a Comment