MongoDB: A first look

The entire subject of two talks and mentioned in several others, MongoDB definitely generated buzz at TekX this year. It's long been in favor in the tech community in Lawrence and has been used for some data crunching on a few projects at the local paper. Even with all of this exposure, I had yet to sit down and actually explore it.

That changed Friday afternoon while I sat at O'Hare waiting on my flight back to Lawrence (which subsequently got canceled). I had installed Mongo earlier in the week and opened up a bunch of tabs on the various intros and tutorials available on the Mongo wiki. The rest of this article is a mix of stream-of-consciousness notes from playing around with Mongo for the first time and some of my reflections from this past week.

Note on typefaces

I use both Mongo and mongo throughout this article. The first, the title-case Mongo, refers to the software as a whole. Whenever you see mongo in lowercase and in monospace, it's referring to the Mongo client program you run from the command line.

Installation

On a Mac, it's a breeze. I use Homebrew to manage software on my Mac, so a quick brew install mongodb was all I needed and a minute later I was ready to go.

Starting Up the Server

Mongo is run by the mongod process. I don't know if it's pronounced mongo-d or mon-god though. It's a fun play on words if the latter is the case.

Homebrew includes a basic configuration to get up and running, so I run mongod with that inside a screen instance, which lets me leave it running in the background while I use the mongo tool to interact with it.

Interacting with Mongo

I started out with the basic tutorial to get going. It looks like it needs some love, though. It shows the version in the startup as 0.9.8. Homebrew ships with 1.4.2, and I did find a few things that were out of date. No, I haven't been a good open source community member and submitted fixes yet.

The first thing that's different from a traditional RDBMS is that you don't have to explicitly create a database. It's pretty straightforward: from within mongo, type use <database>. This creates a brand new database for you and you're off. For the examples below, I'm using use mydb to select mydb as my database.

It's kind of nice to just be able to connect and go, but it feels odd. Not good or bad, just odd. Sort of like the first time you run git checkout inside a repository to switch branches when you're used to Subversion.

The shell feels like a JavaScript console. I don't have access to the source code in my off-line mode, so I can't confirm whether it actually is one. The syntax seems remarkably similar, so it's at least JavaScript inspired.

Adding Records

Mongo stores documents, not rows of columns. This distinction allows Mongo to ignore schema, continuing the theme of leaving it up to the developer. Those documents can be made up of any number of key-value pairs that look remarkably like JSON. Need to store a new data point? Just add it as a field to a document and you're set.

Here's an example inspired by Mongo's tutorial for adding a few records:

> person = {name: "Travis Swicegood"}
> city = {city: "Lawrence", state: "KS"}
> db.things.save(person)
> db.things.save(city)

Here I created two new objects with various data attached to them, then saved them both inside the things collection. Collections in Mongo are like tables in the SQL world. You don't have to create a collection; you just reference it on the db object and you're set.
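
To make the "just start using it" idea concrete, here is a toy in-memory store in plain JavaScript where a collection springs into existence the first time it's touched. It's only a sketch of the concept: ToyDb, collection, and save are hypothetical names I made up for illustration, and none of this is how Mongo is implemented.

```javascript
// Toy in-memory store where collections are created on first use.
// ToyDb and its methods are hypothetical names, not Mongo's API.
class ToyDb {
  constructor() {
    this.collections = new Map();
    this.nextId = 1;
  }

  // Return the named collection, creating an empty one if it does not exist yet.
  collection(name) {
    if (!this.collections.has(name)) {
      this.collections.set(name, []);
    }
    return this.collections.get(name);
  }

  // Store a copy of the document with a generated _id; no schema is checked.
  save(name, doc) {
    const stored = { ...doc, _id: this.nextId++ };
    this.collection(name).push(stored);
    return stored;
  }
}

const db = new ToyDb();
db.save("things", { name: "Travis Swicegood" });
db.save("things", { city: "Lawrence", state: "KS" });
console.log(db.collection("things").length);
```

Nothing declares the things collection or the shape of its documents up front; both simply come into being on the first save.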

Comparing this to the equivalent code for a relational database, I've got to say I love this. There's no boilerplate code to get going. I didn't have to create a database or any tables. I just started using them. This appeals to my laziness (err, I mean desire for efficiency), but it also looks very promising for teaching someone new. Every abstract idea you can remove is one less potential stumbling block for someone starting out.

Back to the data I entered. Notice that the two records have no fields in common. Documents in a Mongo collection are just a series of keys and values; they can be whatever you want them to be. This is perfect for lazy migrations: migrating the data as it's requested instead of doing it all at once. Ming, a Python wrapper around Mongo, already provides this. This is especially useful for large sites with lots of data that may or may not ever be requested again.
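
Here's a rough idea of what a lazy migration could look like, sketched in plain JavaScript with no server involved. Everything specific in it is an assumption for illustration: the schemaVersion field, the migrate helper, and the split of name into firstName and lastName are hypothetical and aren't taken from Mongo or Ming.

```javascript
// Hypothetical lazy migration: upgrade a document only when it is read.
// The schemaVersion field and the field names are assumptions for illustration.
const CURRENT_VERSION = 2;

function migrate(doc) {
  // Documents already at the current version pass through untouched.
  if (doc.schemaVersion === CURRENT_VERSION) {
    return doc;
  }
  // Older documents (no schemaVersion) stored a single "name" string.
  const [firstName, ...rest] = (doc.name || "").split(" ");
  return {
    ...doc,
    firstName: firstName,
    lastName: rest.join(" "),
    schemaVersion: CURRENT_VERSION,
  };
}

const oldDoc = { name: "Travis Swicegood" };
const upgraded = migrate(oldDoc);
console.log(upgraded);
```

In a real application the upgraded document would be saved back after the read, so each record pays the migration cost at most once, and records nobody asks for are never touched.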

Finding Records

Now that the records are there, it's time to find them. Calling find on the db.things collection brings them back:

> db.things.find()
{ "_id" : ObjectId("4bf9a96b7d04f51b48499011"), "name" : "Travis Swicegood" }
{ "_id" : ObjectId("4bf9a96f7d04f51b48499012"), "city" : "Lawrence", "state" : "KS" }

That gives me everything. The find method takes optional parameters to filter the results. This is actually a good time to bring up the built-in help in mongo. Entering only the name of any function (i.e., without calling it) displays the implementation of the function:

> db.things.find
function (query, fields, limit, skip) {
    return new DBQuery(
        this._mongo, this._db, this, this._fullName,
        this._massageObject(query), fields, limit, skip);
}

Note: I changed the formatting so it's more easily viewable online.

The parameters are optional (like all JavaScript function parameters), so you can pass in as many or as few as you want. Filtering the results is done by providing a hash for the query parameter (the first one). For example, to find my record:

> db.things.find({name: "Travis Swicegood"})
{ "_id" : ObjectId("4bf9a96b7d04f51b48499011"),
  "name" : "Travis Swicegood" }

One thing you can't do is full-text searching. I can't ask for all of the records that begin with Travis or that contain a portion of my name. The current recommendation (at least via the wiki) is to build your own list of keywords as an array, then search that array. For example:

> var person2 = {name: "Travis Swicegood",
>                name_field: ["Travis", "Swicegood"]};
> db.things.save(person2)
> db.things.find({name_field: "Travis"})
{ "_id" : ObjectId("4bf9afa17d04f51b48499014"),
  "name" : "Travis Swicegood", 
  "name_field" : [ "Travis", "Swicegood" ] }

For something like a name, this can be useful. For full-text searching of an article, it's probably best to delegate searching off to something like Solr and let Mongo focus on storage and retrieval.
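
The keyword-array approach can be sketched in plain JavaScript. This is a simplified stand-in, not Mongo's actual matching code: keywords and matches are hypothetical helpers, and lowercasing the words is my own assumption (the example above keeps the original case).

```javascript
// Build a keyword array from a string so each word can be matched exactly.
// Lowercasing is an assumption here; the article's example keeps the original case.
function keywords(text) {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// Simplified stand-in for an equality query: a field matches when it equals the
// value, or when it is an array that contains the value.
function matches(doc, query) {
  return Object.keys(query).every((key) => {
    const field = doc[key];
    return Array.isArray(field) ? field.includes(query[key]) : field === query[key];
  });
}

const person = { name: "Travis Swicegood", name_field: keywords("Travis Swicegood") };
console.log(person.name_field);
console.log(matches(person, { name_field: "travis" }));
console.log(matches(person, { name_field: "trav" }));
```

The last call shows the limit of the technique: whole keywords match, but a fragment like "trav" does not, which is why real full-text search is better handed off to a dedicated tool.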

Querying for sub-objects

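Of course, I had to try sub-objects to see if they would work. After saving a document that nests person2 and city under the person and city keys, I can query on the whole sub-object: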

> db.things.find({person: person2})
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"), 
  "person" : { "name" : "Travis Swicegood",
               "name_field" : [ "Travis", "Swicegood" ],
               "_id" : ObjectId("4bf9afa17d04f51b48499014") },
  "city" : { "city" : "Lawrence",
             "state" : "KS",
             "_id" : ObjectId("4bf9a96f7d04f51b48499012") } }

You can also query using the dot-notation to "reach through" an object and look at its children. This returns the same result as the previous query:

> db.things.find({"person.name_field": "Travis"})
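
To show what "reaching through" means, here's a minimal sketch in plain JavaScript of resolving a dotted path against a nested object. getPath is a hypothetical helper of my own, not something the mongo shell provides, and it only covers plain nested objects.

```javascript
// Resolve a dotted path such as "person.name_field" against a nested object.
// Returns undefined when any step along the path is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value !== null && typeof value === "object" ? value[key] : undefined),
    doc
  );
}

const doc = {
  person: { name: "Travis Swicegood", name_field: ["Travis", "Swicegood"] },
  city: { city: "Lawrence", state: "KS" },
};

console.log(getPath(doc, "person.name_field"));
console.log(getPath(doc, "city.state"));
console.log(getPath(doc, "city.zip.code"));
```

A missing step anywhere along the path simply yields undefined rather than an error, which fits documents that don't all share the same fields.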

Limiting returned columns

The ability to dynamically add fields to a record definitely provides a breeding ground for massive documents with lots of keys. Most of the time a small subset of those keys is all that's needed. The second parameter in find provides us with that functionality:

> db.things.find({person: person2}, {city:1})  
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"), 
  "city" : { "city" : "Lawrence",
             "state" : "KS",
             "_id" : ObjectId("4bf9a96f7d04f51b48499012") } }

Likewise, you can reach through the object and pull out a subfield:

> db.things.find({person: person2}, {"city.state":1})
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"), 
  "city" : { "state" : "KS" } }

These examples bring up a syntax thing with Mongo that I'm not crazy about: the use of the number one. It's the standard C style: 1 is true, 0 is false. I'd love to see the client and the libraries adopt an intent-revealing name. Granted, this is a minor niggle, but the little things are what make a good system an amazing one.
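
As a sketch of what an intent-revealing name might look like, here's a simplified projection in plain JavaScript with a named constant standing in for the bare 1. INCLUDE and project are hypothetical names of my own, not part of the mongo client, and this version only handles top-level keys (no dotted subfields).

```javascript
// Hypothetical named constant in place of the bare 1.
const INCLUDE = 1;

// Simplified projection: keep only the top-level keys flagged with INCLUDE,
// and always keep _id, as in the shell output shown earlier.
function project(doc, fields) {
  const result = {};
  if ("_id" in doc) result._id = doc._id;
  for (const key of Object.keys(fields)) {
    if (fields[key] === INCLUDE && key in doc) result[key] = doc[key];
  }
  return result;
}

const doc = {
  _id: "4bf9b02b7d04f51b48499015",
  person: { name: "Travis Swicegood" },
  city: { city: "Lawrence", state: "KS" },
};

console.log(project(doc, { city: INCLUDE }));
```

The call reads as "include city" rather than "city is 1", which is the whole point of the complaint.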

A few issues

The docs, being community run and with Mongo still relatively new, are a little loose. Looking through them, I found a bunch of examples that don't work the way they're documented.

Another potential issue (or at least something you need to be aware of) is that Mongo's geospatial support isn't 100% there yet. They only provide 2d, and the math they use assumes that 1° of longitude covers the same distance at the poles as it does at the equator. For many applications, this isn't a huge issue, but if precision is important, Mongo's not ready for this type of use.
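
To get a feel for how much that flat assumption can matter, here's a small sketch that computes the ground distance covered by one degree of longitude at different latitudes. It assumes a perfectly spherical Earth with a mean radius of 6371 km, which is itself an approximation, and kmPerDegreeLongitude is a hypothetical helper, not anything from Mongo.

```javascript
// Spherical-Earth approximation; the mean radius of 6371 km is an assumption.
const EARTH_RADIUS_KM = 6371;

// Distance in km covered by one degree of longitude at a given latitude.
function kmPerDegreeLongitude(latitudeDegrees) {
  const latitudeRadians = (latitudeDegrees * Math.PI) / 180;
  const kmPerDegreeAtEquator = (Math.PI / 180) * EARTH_RADIUS_KM;
  return kmPerDegreeAtEquator * Math.cos(latitudeRadians);
}

// 39 is roughly the latitude of Lawrence, KS.
for (const latitude of [0, 39, 60, 89]) {
  console.log(latitude, kmPerDegreeLongitude(latitude).toFixed(1));
}
```

Under this model a degree of longitude is roughly 111 km at the equator, about half that at 60°, and shrinks toward nothing near the poles, so treating it as constant skews distances more the farther you get from the equator.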

One thing that I'm looking forward to is Mongo's sharding. That is going to allow Mongo to scale horizontally really well. Some of the initial test results look amazing. What will be really interesting is to see how well it scales down. It's one thing to have over 300,000 ops/sec on a bigger box, another thing to be able to manage it on something like a 1 GB instance on Rackspace Cloud Servers.

Two Biggest Issues

First, Mongo's a master-slave system. It appears really robust, but whenever a box takes on a special role I start to get nervous. One of the promises of "NoSQL" is that it provides a tremendous amount of resilience. Any time you start to add special nodes you're taking away from that.

For example, if you're running 5 homogeneous servers and one goes down, the other 4 can pick up the slack, assuming you're not running 5 servers at peak capacity. This makes failure planning easy: figure up the amount of CPU time you need to handle your load, provision that many servers, then add enough servers to be comfortable when they start failing. Need 3 servers? Provision 5 and you can have two failures before you peg your machines.
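
The arithmetic behind that is simple enough to sketch. This assumes identical servers and load that spreads evenly across whichever ones are still alive, which is the homogeneous case described here; failuresTolerated and utilization are hypothetical helper names.

```javascript
// Failure budget for a pool of identical servers.
// needed: servers required to carry peak load; provisioned: servers actually running.
function failuresTolerated(needed, provisioned) {
  return Math.max(0, provisioned - needed);
}

// Average load on each surviving server, as a fraction of its capacity.
function utilization(needed, provisioned, failed) {
  const alive = provisioned - failed;
  return alive > 0 ? needed / alive : Infinity;
}

console.log(failuresTolerated(3, 5));
for (let failed = 0; failed <= 2; failed++) {
  console.log(failed, utilization(3, 5, failed));
}
```

With 3 needed and 5 provisioned, each box runs at 60% when everything is healthy, 75% after one failure, and 100% after two, which is the point where the machines are pegged.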

This isn't to say Mongo can't handle failures. Its current model is to rebalance the load when one of the servers goes out. mongos is the tool to read up on for handling this. Unfortunately, I haven't been able to dive into it yet. The only way to know for sure is to build up a cluster then start killing servers. Of course, this type of testing is preferred for any data storage system.

Second, the license. I'm not anti-AGPL, but there's some ambiguity. The Mongo team has addressed this both on the wiki and through an in-depth blog post. According to that, I can write up a service such as MongoHQ and as long as I don't actually change the mongod or mongos code I'm fine.

On the other hand, most of the readings of the AGPL I've come across suggest that code that talks to AGPL software is itself subject to the AGPL. I don't have any doubts about 10gen, but if they don't always own the copyright…

Of course, those last two paragraphs come with the caveat that I am not a lawyer.

I think Mongo is an amazingly compelling piece of software in the non-standard database realm. With the upcoming sharding and what I would have to imagine is an imminent fix to the geospatial queries, Mongo's definitely worth a look.