Sunday, October 31, 2010

base64 in Clojure: An initial attempt

After working through most of Stuart Halloway's excellent Programming Clojure I found myself in search of a meaty example to exercise my Clojure chops. Re-implementing the Project Euler examples in another language didn't seem terribly appealing. Maybe a base64 implementation; I hadn't implemented the encoding in a while and it is a reasonable exercise when learning a new language. So what kind of support does Clojure already have?

As expected there's no base64 support in the core. clojure-contrib does include some support, although the documentation doesn't inspire a great deal of confidence:


This is mainly here as an example. It is much slower
than the Apache Commons Codec implementation or
sun.misc.BASE64Encoder.


Intriguing; looks like we have a winner!

The only complexity in a base64 implementation is handling the one and two byte cases. Any code must support returning the last two bits of the first byte (or the last four bits of the second byte) as distinct values in these cases as well as combining these bits with the leading bits of the second and third bytes (respectively) in other cases. My initial implementation was a fairly straightforward adaptation of this principle. The core consists of a letfn definition with functions for handling the one, two and three byte cases. Each function returns two values: a list of all 6-bit elements discovered so far as well a "remainder" value representing the trailing bits in the one and two byte cases. This structure leads to better function re-use; as an example the two-byte function can build on the one-byte function. A wrapper function calls the correct value for a given set of bytes. Now we need only partition our byte array into groups of three or less and apply the wrapper function to each group. A combination of the partition and map functions do exactly that.

So how'd we do? We implement a simple test program to generate a set of random data, map a base64 implementation on to each sample and then realize the lazy sequence by converting it into a vector. We time this conversion (for a common set of sample data) for our implementation, the clojure-contrib implementation and commons-codec. The following run is fairly representative:


bartok ~/git/base64-clojure $ source set_classpath.sh
bartok ~/git/base64-clojure $ clojure fencepost/test/compare_base64.clj
Commons-codec
"Elapsed time: 69.074926 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
clojure-contrib
"Elapsed time: 1221.124746 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=
fencepost/base64_letfn
"Elapsed time: 375.829153 msecs"
RFJlYWkqIF10RCc1PXlpO3FcIDY6dkJuQkhHNWxRJkwkOFdgciFGfHwhYz5EYG15PjxWdFJ9Ml83TGFoeltHSTs+ST9mdj0rfSZrcVNIKn5oKSdTI3U5a1FqIzBvIkRIV0BkK3xCZjtrSWYuTiM1XWB9UW14W2dVLFM=


These results do confirm that that clojure-contrib implementation is very, very slow in relation to commons-codec. We've reduced that runtime by roughly a third but we're still five or six times slower than commons-codec. Surely we can do better; in the next installment we'll see how to do so.

As usual all code can be found on github.

No comments:

Post a Comment