TravisSwicegood.com

9 April

The Future of Relational Database Management Systems

There's been a lot of buzz around cloud computing with Google's new App Engine environment. I'm not going to discuss the merits or issues with their new product and instead focus on the fact that they're giving developers access to the BigTable via its DataStore.

This is the second major player to release access to an internal storage system that's document based, Amazon's SimpleDB being the other. In the open-source space we have CouchDB which is a distributed database that's completely document based. I see a definite trend from creating databases with relationships to store everything in one big blob.

And this makes sense. There are three motivators for normalized databases: reducing storage space, reducing processing required to find multiple instances of data, reducing number of places to update when something changes. The first two are hardware related. Computers and disks are ridiculously fast and cheap right now. For less than $10k you can have a machine that can outpace a half million dollar machine from less than a decade ago.

That leaves having to update multiple places when you change data. This is a technology problem and one that a lot of developers will recognize. It looks eerily similar to what we've been doing with caches for a long time. We grab the data, cache it the way we want, then retrieve that and ignore the shiny RDBMS that we have.

The problem with this is the same one you face with updating data in a de-normalized database. The default way is to just timeout and grab data after its old enough that we consider it stale. It's rudimentary, but works. The next step is having your data layer smart enough to expire the cache itself when it changes. Taking that further you end up with code that expires and primes the cache whenever there's an update.

Any route you take with caching, you end up with two data layers, the raw, relational database and the cached views of it. So it seems that document databases are the next logical step. Instead of creating this layer on top of you data to cache it, just store it as blob of data.

The one thing I haven't seen yet with these document-based systems is some sort of trigger mechanism in place, though, to make changes in one place ripple through the system. That can exist at the database level or the ORM level. It just has to be transparent to the developer.

Another interesting approach is what MySQL is planning for MySQL 6. They'll have communication with memcached from within MySQL via a user-defined function (UDF). There's already a project on forge.mysql.com that brings this functionality, but its still in alpha. Something like this coupled with the use of triggers could address these issues completely by making the MySQL server simply a data storage and backup mechanism that you only touch during development.

5 comments

Interesting post. You seem to put some weight on invalidating the cache. I don't think this is so important; it's rarely business critical to expire it immediately when the data changes, so why not just leave it to expire itself with a reasonably set TTL, and refresh it only then?

Also, when something is really critical, I guess you shouldn't cache it in the first place. Maybe there are rare exceptions, such as very expensive calculations, that must always be fresh to the second, but are only changed a few times a day. Somehow I don't believe there are many cases like these.

Also, there are ways of refetching the resource without directly invalidating it in the cache layer, like incrementally versioning each resource that gets modified. For example: if you change the image with id=6, instead of invalidating the image in cache, you rather insert a new image (id=7), so that the requests will be for id=7. The image is not there, so cache it on demand. And leave the older image to stale out. It's a very crude example and it's not always applicable, but it's very efficient.
Heya Travis,
terrific post and thanks for the link love :)

Just FYI: CouchDB actually comes with a trigger mechanism that notifies your code when a database is changed. You code can then kick-off any changes you need from there. Hope I understood you correctly.

Cheers Jan
--

@ gasper_k: I think being able to trigger an invalidation of the cache is a huge plus. A classic example would be something like a customer's phone number. It rarely changes, so in theory you could put it in a cache layer forever. Except they do occasionally change. If you can trigger the invalidation just on the change, you can remove TTLs completely and let your cache fill up until it needs to start pruning. Then it can intelligently prune based on most recent access instead of an arbitrary configuration setting.

Regarding whether or not it should be cached, being able to remove a hit to the database is huge. Look at a profile of any decent size application and you'll see the majority of the time is spent talking to the database. If you magnify that by 5 or 10 servers, all talking to the same DB, all of the time, you can quickly outstrip what your DB server can handle. The only time I wouldn't cache is when the data changes so rapidly that it can't be accessed from within the cache without already being invalid.

@Jan Glad to provide some links. The CouchDB project has definitely got some cool potential. I look forward to seeing where it ends up.

Regarding the trigger mechanism, that's exactly what I'm talking about. The ability to specify a callback when something changes would be perfect. I would assume you're even doing RESTful callbacks too? That is just too sweet. I've really got to plan some time out to play with Couch more.
I guess it depends on the case. I can just as easily imagine a case where some data is frequently changed, but doesn't have to be accurate for display purposes. For example: post count in an active forum. It changes all the time, but you can't nearly afford to invalidate it upon every change -- you would effectively hog the whole server just to display an accurate post count.

In such cases, it's far more reasonable to have it live with a TTL of, say, 15 seconds, and refetch it. Very few users will ever notice it's inaccurate.

I guess it just depends on the case, but I think that cases where data is changed frequently, and doesn't have to be accurate upon every retrieval, can be found more often than cases where data is changed seldom, and have to be accurate. Not talking about banks here, of course, but as web is becoming more and more user-generated, we can all live with a few seconds of stale data when we view forums, blogs, facebook profiles and whatnot.

As for the automatic cache cleaning, I think memcached removes data upon expiration (TTL), and also if the cache is full with a LRU strategy, so the rarely-used data is removed in favour of the more often accessed.

regards,
Gasper
"The one thing I haven't seen yet with these document-based systems is some sort of trigger mechanism in place, though, to make changes in one place ripple through the system."

Why would this be required if the "Blob" store is abstracting the storage architecture?

If the data persists across two or more caches it's irrelevant to the higher level (think Google FS). Maybe I'm not getting it.

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)