This past Tuesday we (Vermonster and Greg) hosted the fourth installment of Newer Yankee Workshop. This one was on the über-trendy topic of NoSQL data stores. We covered two: Basho’s Riak and Apache’s CouchDB. After this past NoSQL Live event, I personally became very excited about Riak, a relatively new player in NoSQL. CouchDB is a nice RESTful document data store and has been around longer than Riak.
The weather was crap, another day of solid rain. Although we filled our capacity on EventBrite, I wasn’t actually expecting many to trudge through a soggy walk to our office. But I was pleasantly surprised with a solid showing – we had about 17 people in attendance.

Using a crude slide show as a guide, we had a super brief overview of NoSQL, Brewer’s CAP theorem, and an introduction to our two data stores. Most everyone in the crowd was armed with OS X, outside of a few running Ubuntu, so installing and starting Riak and CouchDB (for development) was a snap. I recommended people use the Riak binaries (although I think some built from source), and Greg instructed folks to use a simple one-click-installer for CouchDB. We got our RVM on, built a gemset for the workshop and loaded a few gems.
After everyone was up and running with two shiny NoSQL databases, we presented a simple business problem. The context is a project we are working on here at Vermonster called OpenDocket. It is still very early on, and respecting brevity, I won’t go into too many details in this post. But this did introduce two concepts we modeled: 1) Docket Items and 2) Motions. A docket item could have several motions. And a motion belongs to a docket item. A motion also has a number of yea’s and nay’s (such that more yea’s than nay’s means the motion has passed).

We modeled things in both data stores, using Ripple as an ORM for Riak and Couch Potato for CouchDB. You can see the two versions of the code on GitHub (see the riak and couch branches). Everyone cloned the repo, ran spec ./spec in each branch and saw green. Excellent!
Next we walked through the code describing some of the features of each database.
Both CouchDB and Riak feature MapReduce functionality. As the data is distributed in these kinds of data stores, collecting and searching (i.e. mapping) also needs to be distributed. Reduction is a phase after map and aggregates the results. Someone made a nice analogy (and I apologize for forgetting your name) with Ruby, and Harold suggested one for SQL:

For those too young to have seen (or too old to remember) analogies on their SATs – these read “map is to reduce” as “enumerable is to inject in Ruby” and “map is to reduce” as “where clause is to aggregate functions in SQL”. Examples of aggregate functions in SQL are things like COUNT and SUM.
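To make the Ruby half of that analogy concrete, here is a small sketch in plain Ruby with no database involved. The MOTIONS data and the two helper names are made up for illustration.

```ruby
# Made-up motions; in the workshop these lived in Riak or CouchDB.
MOTIONS = [
  { "title" => "Adopt budget",  "yeas" => 7, "nays" => 2 },
  { "title" => "Rename street", "yeas" => 3, "nays" => 5 },
  { "title" => "Extend hours",  "yeas" => 6, "nays" => 6 }
].freeze

# "Map" then "reduce": one value per motion, folded into a single total.
def total_yeas(motions)
  motions.map { |m| m["yeas"] }.inject(0) { |sum, n| sum + n }
end

# The SQL flavour: a WHERE clause filters rows, COUNT collapses them.
def passed_count(motions)
  motions.select { |m| m["yeas"] > m["nays"] }.count
end

puts total_yeas(MOTIONS)   # prints 16
puts passed_count(MOTIONS) # prints 1
```

The difference in a distributed store is that the map step runs near the data on each node, and only the mapped values travel to the reduce step.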
MapReduce functions in both data stores can be written in JavaScript. In fact, it is probably worth mentioning that both use a nice RESTful interface with JSON encoded data to do their business, so JavaScript is a natural choice. Relatedly, Riak can also process MapReduce functions written in Erlang.
We used MapReduce to demonstrate the has many relationship between Docket Item and Motion. In Riak, using the Ripple ORM, this is expressed as:
ruby --
motions = Riak::MapReduce.new(Ripple.client).
  add("motions").
  map(motions_map_function, :keep => true).run
To take a step back, Riak’s MapReduce specification comprises two components, an input and a query. The add method allows you to add a bucket and tags as the input. The map method adds to the query portion (others include link and reduce).
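As a rough picture of those two components, here is a sketch that builds the kind of JSON body a MapReduce request carries. It assumes the "inputs"/"query" layout of Riak’s HTTP MapReduce interface; map_reduce_spec is a made-up helper name, not part of Ripple.

```ruby
require "json"

# Builds a MapReduce job description: an "inputs" part (here a whole bucket)
# and a "query" part made of phases (here a single JavaScript map phase).
# map_reduce_spec is a hypothetical helper for this sketch only.
def map_reduce_spec(bucket, map_source, keep: true)
  {
    "inputs" => bucket,
    "query"  => [
      { "map" => { "language" => "javascript", "source" => map_source, "keep" => keep } }
    ]
  }
end

puts JSON.pretty_generate(map_reduce_spec("motions", "function(v){ return [v.key]; }"))
```

Each call to map, link or reduce on the Ruby side appends one more phase to the query list.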
It’s probably worth mentioning some naming conventions. In Riak there are buckets that contain objects, identified by keys. This is similar to other key-value stores, including Amazon S3 (which you may already be familiar with and using).
The motions_map_function is this method (which is essentially JavaScript):
ruby --
def motions_map_function
  <<-JAVASCRIPT
    function(v){
      var results = [];
      var motion = JSON.parse(v.values[0].data);
      motion["key"] = v.key;
      if(motion.docket_item_key == '#{self.key}') results.push(motion);
      return results;
    }
  JAVASCRIPT
end
The map function in Riak can actually be expanded a bit (something we should have spent more time on in the workshop). There are three inputs to the function:
javascript --
function(item, keydata, arg)
item is the value found at the current key. It has four attributes: bucket, key, vclock, and values. NOTE: values is an array because there could be siblings, as during a vclock conflict.
keydata is the data passed in the input portion of the call.
arg is any :arg => ... value passed in the actual map method.
Riak reduce (not shown for this has_many example) takes two arguments:
javascript --
function(valueList, arg)
valueList is the output of the map phase.
arg is what’s passed into the reduce method.
Both have an optional :keep parameter that tells Riak whether to save the results of the phase or not. For example, if you had both a map and reduce function, you would likely not keep the map results, but would keep the reduce.
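To make the phase mechanics concrete, here is a toy in-memory runner in plain Ruby. Nothing in it talks to Riak: ToyJob, toy_object and toy_total_yeas are invented names, the phase functions are Ruby blocks rather than JavaScript, and keydata is simply passed as nil. It only shows how map output feeds reduce and what keep does.

```ruby
require "json"

# A toy stand-in for a MapReduce job; hypothetical, for illustration only.
class ToyJob
  def initialize(objects)
    @objects = objects
    @phases  = []
  end

  def map(keep: false, arg: nil, &fn)
    @phases << [:map, fn, keep, arg]
    self
  end

  def reduce(keep: false, arg: nil, &fn)
    @phases << [:reduce, fn, keep, arg]
    self
  end

  # Runs the phases in order and returns only the results of kept phases.
  def run
    current = @objects
    kept = []
    @phases.each do |type, fn, keep, arg|
      current =
        if type == :map
          # A map function is called once per item and returns a list.
          current.flat_map { |item| fn.call(item, nil, arg) }
        else
          # A reduce function is called with the whole list of values.
          fn.call(current, arg)
        end
      kept << current if keep
    end
    kept.length == 1 ? kept.first : kept
  end
end

# Mimics the shape of the item handed to a map function.
def toy_object(key, attrs)
  { "bucket" => "motions", "key" => key, "vclock" => nil,
    "values" => [{ "data" => JSON.generate(attrs) }] }
end

def toy_total_yeas(objects)
  ToyJob.new(objects)
        .map(keep: false) { |item, _keydata, _arg| [JSON.parse(item["values"][0]["data"])["yeas"]] }
        .reduce(keep: true) { |values, _arg| [values.inject(0) { |sum, n| sum + n }] }
        .run
end

TOY_MOTIONS = [
  toy_object("m1", { "yeas" => 7, "nays" => 2 }),
  toy_object("m2", { "yeas" => 3, "nays" => 5 })
].freeze

p toy_total_yeas(TOY_MOTIONS) # prints [10]
```

With keep set to true on the map phase as well, run would hand back both the per-motion values and the total.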
One additional note: Ripple uses an attribute _type to track the klass of the object, so it knows what to instantiate when processing retrieved documents.
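The idea behind a type attribute can be sketched in a few lines of plain Ruby. This is not how Ripple is implemented; the structs, KNOWN_TYPES and instantiate_document are all hypothetical names for this sketch.

```ruby
require "json"

# Stand-in model classes for the sketch.
DocketItem = Struct.new(:title)
Motion     = Struct.new(:title, :yeas, :nays)

# Hypothetical lookup from stored type names to classes.
KNOWN_TYPES = { "DocketItem" => DocketItem, "Motion" => Motion }.freeze

# Reads the type field from a stored JSON document and builds the matching object.
def instantiate_document(json, type_field = "_type")
  doc   = JSON.parse(json)
  klass = KNOWN_TYPES.fetch(doc[type_field]) # raises KeyError for unknown types
  attrs = klass.members.map { |name| doc[name.to_s] }
  klass.new(*attrs)
end

motion = instantiate_document('{"_type":"Motion","title":"Adopt budget","yeas":7,"nays":2}')
puts motion.class # prints Motion
puts motion.yeas  # prints 7
```

Passing a different type_field covers stores that use another attribute name for the same purpose.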
With CouchDB and Couch Potato as our ORM, any time you would like to view results, you first create a view spec. View specs contain map and reduce methods. Couch Potato provides some syntactic sugar that automatically creates MapReduce methods for common views.
For instance, we have a view called by_id:
ruby --
view :by_id, :key => :_id
You might notice the key is _id; this is CouchDB’s convention for the document identifier. As Greg pointed out, this is kinda nice: in ActiveRecord, there’s lots of effort making docket_item.id marry up to the primary id field in the database, all without colliding with Ruby’s built-in Object#id. Additionally, Couch Potato will generate a MapReduce function from this key syntax. In this case, the map in CouchDB looks like this:
javascript --
function(doc) {
  if(doc.ruby_class && doc.ruby_class == 'DocketItem') {
    emit(doc['_id'], 1);
  }
}
And the reduce looks like this:
javascript --
function(key, values) {
  return sum(values);
}
You will notice that where Ripple used _type to identify the Ruby class, Couch Potato uses the key ruby_class. To retrieve data, we pass the view spec as an argument to a #view method (the same CouchPotato.database.view call that appears in the motions method further down).
And we have docket items!
The docket_item.motions has-many relationship to motions was demonstrated by creating a view in the Motion class.
ruby --
view :by_docket_item_id, :key => :docket_item_id
The auto-generated map function looks like this:
javascript --
function(doc) {
  if(doc.ruby_class && doc.ruby_class == 'Motion') {
    emit(doc['docket_item_id'], 1);
  }
}
(The reduce looks the same as above)
So then in the DocketItem, we have a method that calls this passing in the docket_item’s id as a key:
ruby --
def motions
  if self._id
    view_spec = Motion.by_docket_item_id( :key => self._id )
    CouchPotato.database.view(view_spec)
  else
    []
  end
end
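To see what such a view does, here is a rough in-memory imitation in plain Ruby. It is not Couch Potato or CouchDB code: ToyView and the sample documents are invented for this sketch. The map step emits key/value rows for documents of the right ruby_class, and querying with a key returns the matching rows.

```ruby
# Toy imitation of a view: the map block emits rows, which are kept sorted.
class ToyView
  def initialize(docs, &map_fn)
    @rows = []
    docs.each do |doc|
      emit = ->(key, value) { @rows << { "key" => key, "value" => value, "id" => doc["_id"] } }
      map_fn.call(doc, emit)
    end
    # Sort by key, then by document id, so the order is deterministic.
    @rows.sort_by! { |row| [row["key"], row["id"]] }
  end

  # Like querying the view with a single key.
  def rows(key:)
    @rows.select { |row| row["key"] == key }
  end

  # Like the sum(values) reduce: adds up the emitted values for a key.
  def reduce(key:)
    rows(key: key).inject(0) { |sum, row| sum + row["value"] }
  end
end

TOY_DOCS = [
  { "_id" => "d1", "ruby_class" => "DocketItem" },
  { "_id" => "m1", "ruby_class" => "Motion", "docket_item_id" => "d1" },
  { "_id" => "m2", "ruby_class" => "Motion", "docket_item_id" => "d1" },
  { "_id" => "m3", "ruby_class" => "Motion", "docket_item_id" => "d2" }
].freeze

BY_DOCKET_ITEM_ID = ToyView.new(TOY_DOCS) do |doc, emit|
  emit.call(doc["docket_item_id"], 1) if doc["ruby_class"] == "Motion"
end

p BY_DOCKET_ITEM_ID.rows(key: "d1").map { |row| row["id"] } # prints ["m1", "m2"]
p BY_DOCKET_ITEM_ID.reduce(key: "d1")                        # prints 2
```

Since every row carries the value 1, the reduce doubles as a count of motions per docket item.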
One Riak feature I am very excited about is link walking. This feature lets you create links from one object to another, tagging each link as you go. It is similar to having named edges in graph databases.
Consider the relationship Docket Item has many Motions. Previously we showed this with a MapReduce method. But this could have just been implemented using links. This technique was shown by first creating links in an after_save in motion.rb.
ruby --
after_save :create_links
def create_links
  if self.key && self.docket_item_key
    self.robject.links << Riak::Link.new("/riak/docket_items/#{self.docket_item_key}", "docket_item")
    self.robject.store
    d = docket_item
    d.send(:robject).links << Riak::Link.new("/riak/motions/#{self.key}", "motions")
    d.send(:robject).store
  end
end
One curve here is robject. This is the Riak object (in Ripple, there are two namespaces, Riak and Ripple; the former is a lower-level interface to the Riak server). This gives us access to the links getter/setter. Also worth noting: the robject saves with a store method.
So let’s examine what’s going on here. Basically, we create two links after each motion is saved.

And this sort of thing would happen for any number of motions for Docket Item 1.

Link walking is performed by starting at a document object, then indicating which bucket to enter. Riak will follow any link that connects the object to any object in the specified bucket. To restrict which links to follow, a tag parameter can be supplied when walking. This basically means “follow any link to this bucket that is tagged that”. As Riak uses REST, you can actually test this in your web browser:
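A link-walk URL has the shape /riak/<bucket>/<key>/<bucket>,<tag>,<keep>. As a rough sketch, a small Ruby helper can assemble one. Note that link_walk_path is a made-up name, and the host, port and key below are placeholders rather than values from the workshop.

```ruby
# Builds the path for a link walk over the REST interface.
# Each step becomes "bucket,tag,keep"; "_" stands for "any" or "default".
# link_walk_path is a hypothetical helper for this sketch.
def link_walk_path(bucket, key, steps)
  segments = steps.map do |step|
    [step.fetch(:bucket, "_"), step.fetch(:tag, "_"), step.fetch(:keep, "_")].join(",")
  end
  (["", "riak", bucket, key] + segments).join("/")
end

# Placeholder node address and key; substitute your own.
EXAMPLE_HOST = "http://127.0.0.1:8098"
puts EXAMPLE_HOST + link_walk_path("docket_items", "some-key", [{ bucket: "motions" }])
# prints http://127.0.0.1:8098/riak/docket_items/some-key/motions,_,_
```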
This would follow any links from the docket item with the specified key to the motions bucket with any tag (that’s the first _; the last _ is a keep flag). Using Ripple, we can perform a link-walk with a simple walk, as seen in the motions_via_links method:
ruby --
def motions_via_links
  motions = robject.walk(:bucket => "motions").flatten
  motions.map{|m| instantiate(m) }
end
Where :bucket and :tag are parameters.
To build on this, suppose we created a link when a motion passes, with a passed_motions tag (and could also create one for failed_motions).

That would allow us to do things like walk only the passed motions, e.g. robject.walk(:bucket => "motions", :tag => "passed_motions").
This allows for very interesting slicing and dicing of how data is related without writing methods in the application layer. AND link walking can be part of a MapReduce method, way cool indeed! Here’s an example getting all the passing motion titles.
ruby --
Riak::MapReduce.new(client).
  add("motions").
  link(:bucket => "motions", :tag => "passed_motions", :keep => false).
  map("function(v){ return [JSON.parse(v.values[0].data).title]; }", :keep => true).run
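The bucket-and-tag filtering behind a link walk can be imitated in a few lines of plain Ruby. This is only a toy model under my own assumptions: ToyObject, ToyStore and build_toy_store are invented names, and links are stored as simple [bucket, key, tag] triples.

```ruby
# Toy model of linked objects: each object carries [bucket, key, tag] links.
ToyObject = Struct.new(:bucket, :key, :links)

class ToyStore
  def initialize
    @objects = {}
  end

  def put(object)
    @objects[[object.bucket, object.key]] = object
  end

  def get(bucket, key)
    @objects[[bucket, key]]
  end

  # Follows links from `object` into `bucket`, optionally restricted by tag.
  def walk(object, bucket:, tag: nil)
    object.links
          .select { |b, _k, t| b == bucket && (tag.nil? || t == tag) }
          .map { |b, k, _t| get(b, k) }
          .compact
          .uniq
  end
end

def build_toy_store
  store = ToyStore.new
  store.put(ToyObject.new("motions", "m1", [["docket_items", "d1", "docket_item"]]))
  store.put(ToyObject.new("motions", "m2", [["docket_items", "d1", "docket_item"]]))
  store.put(ToyObject.new("docket_items", "d1", [
    ["motions", "m1", "motions"],
    ["motions", "m2", "motions"],
    ["motions", "m1", "passed_motions"]
  ]))
  store
end

store = build_toy_store
item  = store.get("docket_items", "d1")
p store.walk(item, bucket: "motions").map(&:key)                        # prints ["m1", "m2"]
p store.walk(item, bucket: "motions", tag: "passed_motions").map(&:key) # prints ["m1"]
```

The same object can be reached through several differently tagged links, which is what makes the slicing and dicing possible.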
After all this, we issued a challenge, “Find the difference between yeas and nays”. Some used CouchDB and others used Riak and everyone had their hands on creating actual MapReduce functions.
Victor was quick on the draw and came up with the following Riak solution:
ruby --
def self.delta_yeas_nays
  motions = Riak::MapReduce.new(Ripple.client).
    add("motions").
    map(delta_map_function, :keep => false).
    reduce(delta_reduce_function, :keep => true).
    run
end
private
# A map function that emits yeas minus nays for each motion
def self.delta_map_function
  <<-JAVASCRIPT
    function(v){
      var results = [];
      var motion = JSON.parse(v.values[0].data);
      results.push((motion.yeas || 0) - (motion.nays || 0));
      return results;
    }
  JAVASCRIPT
end
# A reduce function that sums the deltas; Riak expects reduce phases to return a list
def self.delta_reduce_function
  <<-JAVASCRIPT
    function(results){
      var sum = 0;
      for(var i = 0; i < results.length; i++) { sum += results[i]; }
      return [sum];
    }
  JAVASCRIPT
end
And the following CouchDB solution:
ruby --
view :delta_yeas_nays,
  :map => "function(doc){ emit(doc._id, (doc.yeas || 0) - (doc.nays || 0)); }",
  :reduce => "function(keys, values){ return sum(values) }",
  :type => :raw,
  :include_docs => false,
  :results_filter => lambda {|results| results['rows'].first['value'] }
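One thing worth keeping in mind with reduce functions in both stores: they may be run over partial results and then again over the outputs of those runs (CouchDB calls this rereduce), so a reduce should give the same answer either way. Here is a quick plain-Ruby sanity check of that property for the delta sum. The data and method names are made up for this sketch.

```ruby
# Made-up motions; the third one has no nays recorded.
DELTA_MOTIONS = [
  { "yeas" => 7, "nays" => 2 },
  { "yeas" => 3, "nays" => 5 },
  { "yeas" => 6 },
  { "yeas" => 1, "nays" => 4 }
].freeze

# Map: one delta per motion, returned as a list; missing counts are 0.
def delta_map(motion)
  [(motion["yeas"] || 0) - (motion["nays"] || 0)]
end

# Reduce: takes a list and returns a one-element list, so it can be fed back in.
def delta_reduce(values)
  [values.inject(0) { |sum, n| sum + n }]
end

# Reduce everything in one go.
def delta_all_at_once(motions)
  delta_reduce(motions.flat_map { |m| delta_map(m) })
end

# Reduce each batch separately, then reduce the partial results.
def delta_in_batches(motions, batch_size)
  partials = motions.each_slice(batch_size).flat_map do |batch|
    delta_reduce(batch.flat_map { |m| delta_map(m) })
  end
  delta_reduce(partials)
end

p delta_all_at_once(DELTA_MOTIONS)   # prints [6]
p delta_in_batches(DELTA_MOTIONS, 2) # prints [6]
```

A sum passes this check trivially; something like an average would not, unless the reduce carries counts along with the totals.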
As some attendees dove into TDD mode, it was quickly obvious that without database transactions, it is difficult to test without manually clearing the database. I think that a simple extension to database_cleaner to support Riak would go a long way.
Being able to TDD the reduce functions themselves using jspec would also be terrific.
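On the manual-clearing point, here is a minimal sketch of the idea in plain Ruby: wrap each test in a helper that empties the buckets before and after. FakeBucket is a hypothetical in-memory stand-in and with_clean_buckets is a made-up helper; a real suite would delete keys from the actual test database instead.

```ruby
# Hypothetical in-memory stand-in for a bucket.
class FakeBucket
  def initialize
    @data = {}
  end

  def store(key, value)
    @data[key] = value
  end

  def keys
    @data.keys
  end

  def delete(key)
    @data.delete(key)
  end
end

# Runs the block against buckets that are emptied before and after,
# even if the block raises.
def with_clean_buckets(*buckets)
  clear = -> { buckets.each { |bucket| bucket.keys.each { |key| bucket.delete(key) } } }
  clear.call
  yield
ensure
  clear.call
end

MOTIONS_BUCKET = FakeBucket.new
MOTIONS_BUCKET.store("stale", "left over from an earlier run")

with_clean_buckets(MOTIONS_BUCKET) do
  MOTIONS_BUCKET.store("m1", "yeas: 7")
  p MOTIONS_BUCKET.keys # prints ["m1"]
end
p MOTIONS_BUCKET.keys   # prints []
```

Listing every key in a bucket can be slow on a real cluster, so this only makes sense for small test datasets.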
This was a really fun workshop to put on, and preparing for it really pushed our learning of NoSQL. I’d like to thank everyone that made it out through the sogginess.
One thing not covered was embedded documents. Partially because we ran out of time in our preparation and partially because we ran out of time in the workshop.