Monday, October 31, 2011

MongoDB Rocks My World

I am delighted to announce that I'll be speaking at several MongoDB-related events over the next couple of months. So if you're in or near Dallas, Seattle, or Silicon Valley, I'd love it if you could make it to one of my talks. 10gen (the makers of MongoDB) do a great job of putting on regional conferences that are extremely reasonably priced (typically $30 student and $50 early bird), and there's usually a nice after-party for each one, so there's no excuse not to go.

So anyway, here are the conferences I'll be speaking at:
I also wanted to say a few words on why I enjoy using MongoDB and why it is such a good fit for a lot of the problems I face...

MongoDB is Flexible

You probably expected me to say something about MongoDB being web-scale, and while that's true, it's not the biggest reason I use it. What I like the most is that you can organize your data the way you want to without a lot of restrictions forced on you by the DBMS you're using. For those who don't know, MongoDB is what's called a 'document-oriented' database. Rather than storing "rows" in "tables" like you do in a relational database, you store "documents" in "collections."

What is a document? Well, in MongoDB it's basically a JSON object. (I'm actually lying. It's technically called BSON because it's a binary format and you get a couple of extra data types for 'free' in MongoDB including datetimes, ObjectIds, and regular expressions, but it's basically JSON.) So if your RDBMS-based application is doing some queries, and every query has a one-to-many type join (or worse yet you're doing several queries to grab all the data you need to render a page), in MongoDB you would just store all the data you need in a single document, which can be retrieved by a single query. Fast and flexible.
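
To make the single-document idea concrete, here is a minimal sketch in plain Python, with no MongoDB server or driver involved. The blog-post data, the field names, and the helpers `join_post` and `to_json` are all made up for illustration; the point is only to contrast the relational shape (two tables plus a join) with the document shape (one nested object).

```python
import json
from datetime import datetime, timezone

# Relational shape: two "tables" tied together by a foreign key (hypothetical data).
posts = [{"id": 1, "title": "MongoDB Rocks My World"}]
comments = [
    {"post_id": 1, "author": "alice", "text": "Nice post"},
    {"post_id": 1, "author": "bob", "text": "What about Django?"},
]


def join_post(post_id):
    """Emulate the one-to-many join needed to render a single page."""
    post = next(p for p in posts if p["id"] == post_id)
    related = [c for c in comments if c["post_id"] == post_id]
    return dict(post, comments=related)


# Document shape: everything the page needs lives in one nested object,
# so one lookup by _id would fetch it all.
post_document = {
    "_id": 1,
    "title": "MongoDB Rocks My World",
    "posted": datetime(2011, 10, 31, tzinfo=timezone.utc),
    "comments": [
        {"author": "alice", "text": "Nice post"},
        {"author": "bob", "text": "What about Django?"},
    ],
}


def to_json(doc):
    """Serialize for display. Plain JSON has no datetime type, so we fall
    back to str(); BSON stores datetimes natively."""
    return json.dumps(doc, default=str, indent=2)


if __name__ == "__main__":
    print(to_json(post_document))
```

Both shapes carry the same information; the difference is that the document version needs no join at read time.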

Another way MongoDB is flexible is that it allows you to use ad-hoc queries and indexes. In some NoSQL solutions, you need to define views or map-reduce jobs to get at your data. Rather than forcing you to define all that up-front (or pay a penalty at runtime), MongoDB includes a BSON-based query language and indexes. What this means is that you don't have to clutter your data model with every possible path someone might want to use someday to access your data. For me, MongoDB strikes a sweet spot between explicit query models like Hadoop or CouchDB and a full query compiler like SQL databases.
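
As a rough illustration of what a query expressed as a document looks like, here is a toy matcher in plain Python. It is not MongoDB's implementation and not a driver API: `matches`, `find`, and the sample `projects` data are hypothetical, and only a tiny subset of the real query language (equality plus the `$gt` and `$in` operators) is mimicked.

```python
def matches(doc, query):
    """Return True if doc satisfies every condition in query (toy subset)."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            # Operator form, e.g. {"stars": {"$gt": 50}}
            for op, arg in cond.items():
                if op == "$gt":
                    if value is None or not value > arg:
                        return False
                elif op == "$in":
                    if value not in arg:
                        return False
                else:
                    raise ValueError("unsupported operator: %s" % op)
        elif value != cond:
            # Plain form, e.g. {"language": "python"} means equality
            return False
    return True


def find(collection, query):
    """Linear scan over a list of dicts; a real database would consult an
    index here instead of touching every document."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical sample data
projects = [
    {"name": "allura", "language": "python", "stars": 120},
    {"name": "ming", "language": "python", "stars": 40},
    {"name": "other", "language": "java", "stars": 300},
]

if __name__ == "__main__":
    print(find(projects, {"language": "python", "stars": {"$gt": 50}}))
```

The query is just data, so you can build it at runtime for whatever access path you happen to need, rather than declaring a view for it ahead of time.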

MongoDB is Fast

Ok, so there is a little bit about "web-scale." MongoDB supports both replication and sharding, which means that it scales really well over several servers. But that's not really what I'm talking about. What I'm talking about is single-server speed. MongoDB uses a nifty little trick (mmap) to map its entire database into memory, where it can treat reads and writes to the database as reads and writes to RAM. For many common operations, this is really fast. Like tens of thousands of writes per second fast. This means that for a site like SourceForge, which is getting millions of hits a day, we're comfortable serving the majority of requests off a single MongoDB server. And by setting up replication (which we have now) and journalling (which we plan on adding Real Soon Now) with MongoDB, you can be quite safe as well. Replication also means that backups are a snap and don't require our server to go through any downtime.
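
The mmap trick itself can be sketched with Python's standard `mmap` module. This only illustrates memory-mapped file I/O in general and says nothing about MongoDB's actual storage code; the helper names, the 4096-byte file size, and the offset are assumptions chosen for the example.

```python
import mmap
import os
import tempfile


def write_through_mmap(path, offset, payload):
    """Map the whole file into memory and write to it like a byte array."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            # A slice assignment into the mapping is a write to the file's pages.
            mm[offset:offset + len(payload)] = payload
            # Ask the OS to push the dirty pages back to disk.
            mm.flush()


def mmap_roundtrip():
    """Write via mmap, then read the bytes back with ordinary file I/O."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\x00" * 4096)  # mmap needs a non-empty file to map
        write_through_mmap(path, 16, b"hello")
        with open(path, "rb") as f:
            f.seek(16)
            return f.read(5)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(mmap_roundtrip())
```

The program never issues an explicit write call for the payload; it assigns into memory and leaves the paging to the operating system.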

So great, it's fast in production. But it's also fast (and easy to set up) for development. The MongoDB server is just a single binary application that you can run yourself from the command line, and it keeps all its data in a single directory. This is really nice for a development setup like mine, where I'm running everything needed for Allura (the SF.net platform) on a VMware virtual machine. It's just plain simple to set up a single node, and setting up larger systems with replication and sharding isn't *that* much harder.

MongoDB is Simple

While there are many parts of MongoDB that I'm sure took a good bit of complexity in the server to get right (shard migration, journalling, and the new aggregation framework come to mind), the basic ideas are easy to get your mind around. Some of the other NoSQL solutions do some really mind-stretchingly awesome stuff (Cassandra's write scaling and CouchDB's multi-master replication come to mind). MongoDB has some nice and interesting features as well, but its real strength is that it does what it does really well. It's fast (performance-wise and development-wise), flexible, stable, and scalable. And because a) the 10gen engineers are awesome and b) it's open source, when there is a feature or technique enough people want/need, MongoDB grows that feature. But historically, it has grown that feature in a way that fits in really well with the existing feature set and doesn't require an enormous cognitive leap to grasp.

10gen

MongoDB's creators are an awesome group of folks. They're throwing conferences every 6 weeks or more frequently somewhere, they're incredibly responsive to support questions, they're actually really smart, and they've developed an incredible community around MongoDB. And just for fun, they've decided to single-handedly support the coffee mug industry. Seriously, though, the support they provide for MongoDB is nothing short of stellar, from n00b to Mongo master, and that's probably the biggest reason MongoDB has seen the kind of uptake it has.

So if you haven't ever had the chance to use MongoDB, I'd urge you to grab a copy and start playing around with it. And if you're curious, try to make it out to one of the MongoDB conferences or user groups. All the folks I've met at these are quite friendly and love to help newer users along. So yeah, I guess I'm a MongoDB fanboy. Think I've gone off the deep end? Any other features of MongoDB you love? Please let me know in the comments. I'd love to hear from you!

12 comments:

  1. Anonymous 4:03 PM

    What is the best way to use mongoDB with django?
    Should it be used with the ORM?

  2. @Anonymous: there's a lot to be said for just using the 'raw' Python driver pymongo (http://blog.fiesta.cc/post/12167731260/on-object-document-mapping is a post advocating such an approach). I am the author of Ming (http://sf.net/projects/merciless/), a library providing an object/document mapper for MongoDB, so I like it quite a bit, but I don't know about using it with Django. MongoEngine (http://mongoengine.org) seems to take a more Django-ORM approach than Ming, so you might want to look at that as well (Ming tries to look more like SQLAlchemy).

  3. But why use a non-relational db with a framework (django) that works with objects on a relational db? In my opinion the best feature of django's orm is that it transforms your objects and saves them to a relational db. Isn't that so?

  4. Well, while Django's ORM can certainly give a relational data store the *appearance* of being an object store, it's still a relational datastore under the covers. This can become important particularly when you're using a large number of related objects, which can lead to a) slow running queries or b) large numbers of fast queries, either of which can end up slowing down your web app.

    You also have to worry more about things like up-front database design and migrations when dealing with relational databases, where you can get a lot further in a NoSQL solution with hand-waving. (You will *eventually* have to deal with these issues in a big project, but you can delay them until after you're pretty satisfied with your application design).

  5. Given that I've never worked on a large django application: don't you think memcached (django supports it, right?) is a good solution for speeding up queries?

    I've read some discussions on a python web developers group on linkedin, that many django developers are using memcached and/or redis.

  6. Memcached + a relational database is a reasonable solution, particularly if you already have data in a relational model or you need to share access to data with other applications which use it in different ways. But once you add memcached to the mix, what you have is actually an ad hoc NoSQL solution (memcached for performance, RDBMS for persistence). While that should work, it's more moving parts, making it easier (in my experience) to make mistakes than it would be to just use a NoSQL solution to begin with.

    I'm not saying MongoDB is a fit for every problem, I'm just saying you should check it out. It's a flexible and powerful tool to keep in your belt, and you can become proficient in it *very* quickly.

  7. Thanks for your explanation. I tried MongoDB a year ago with php and I think it is a good solution for web scalability.

    My reluctance to use mongodb with django comes from the fact that you can't use standard django but must set up django-nonrel (I heard that in a discussion at the Django-MongoDB talk at MongoTorino). Anyway, as I said before, I'm not experienced with large django applications. So probably when I run into some of the problems you've explained in recent comments, you can tell me: "I told you so!" ;)

    Anyway thanks for this great discussion!

  8. Anonymous 11:42 PM

    How do you avoid storing duplicate data in Mongo? or do you care?

  9. Anonymous 2:20 AM

    Replication (Replica sets) in MongoDB is weird. I don't feel it's an optimistic design.

  10. I have started playing around with mongoDB using IronPython.

    The Python driver should not be used with IronPython because it didn't survive some unit tests. And in the .NET driver there is an issue with a bulk insert via a MongoCollection. The error:

    cannot invoke explicitly implemented generic interface methods

    But the issue is already reported and I hope it will be fixed soon.
    But mongoDB has a great concept which is implemented very cool. So from my side: Try it and you won't use a relational database again.

  11. @Davide: Certainly going with anything other than the Django ORM, you lose some of the Django special-ness (like the excellent admin tool). You probably sacrifice a lot of pluggable applications as well. Those are points I had not thought of, and I'm not a Django developer, so I guess it's not that surprising. If you have any more information about django-nonrel that could be added to the discussion, I'm sure it would be appreciated.

    @Anonymous #1: You can avoid duplicates in MongoDB by creating unique indexes, which will enforce the uniqueness constraint on insert/update.

    @Anonymous #2: I'm not sure what's weird about replica sets in MongoDB. It's pretty flexible, allowing you to even specify on a per-operation basis how many replicas get the data. Maybe you find the asynchronous nature of replication off-putting? It seems like most DBs do it similarly.

  12. Hi! I’m the Community Manager at DZone.com and I’d like to discuss our Most Valuable Bloggers program with you. It is a free program that promotes quality developer blogs such as this one. E-mail me at lgordon {at} dzone{dot}com if you are interested! I hope to hear from you.
