Tuesday, September 25, 2012

MongoDB Schema Design at Scale

I had the recent opportunity to present a talk at MongoDB Seattle on Schema Design at Scale. It's basically a short case study on what steps the MongoDB Monitoring Service (MMS) folks did to evolve their schema, along with some quantitative performance comparisons between different schemas. Given that one of my most widely read blog posts is still MongoDB's Write Lock, I decided that my blog readers would also be interested in the quantitative comparison as well.

MongoDB Monitoring Service

First off, I should mention that I am not now, nor have I ever been, an employee of 10gen, the company behind MongoDB. I am, however, a longtime MongoDB user, and have seen a lot of presentations and use cases on the database. My knowledge of MMS's internal design comes from watching publically-available talks. I don't have any inside knowledge or precise performance numbers, so I decided to do some experiments on my own to see the impact of different schema designs they might have used to build MMS.

So what is MMS, anyway? The MongoDB Monitoring Service is a free service offered by 10gen to all MongoDB users to monitor several key performance indicators on their MongoDB installations. The way it works is this:

  • You download a small script that you run on your own servers that will periodically upload performance statistics to MMS.
  • You access reports through the MMS website. You can graph per-minute performance of any of the metrics as well as see historical trends.

Eating your own dogfood

When 10gen designed MMS, they decided that it would not only be a useful service for those who have deployed MongoDB, but that it would also be a showcase of MongoDB's performance, keeping the performance graphs updated in real time across all customers and servers. To that end, they store all the performance metrics in MongoDB documents and get by on a modest (I don't know exactly how modest) cluster of MongoDB servers.

To that end, it was extremely important in the case of MMS to use the hardware they had allocated efficiently. Since this service is available for real-time reporting 24 hours per day, they had to make design the system to be responsive even under "worst-case" conditions, avoiding anything in the design that would cause an uneven performance during the day.

Building an MMS-like system

Since I don't have access to the actual MMS software, I decided to build a system that's similar to MMS. Basically, what I wanted was a MongoDB schema that would allow me to keep per-minute counters on a collection of different metrics (we could imagine something like a web page analytics system using such a schema, for example).

In order to keep everything compact, I decided to keep a day's statistics inside a single MongoDB document. The basic schema is the following:

{
    _id: "20101010/metric-1",
    metadata: {
        date: ISODate("2000-10-10T00:00:00Z"),
        metric: "metric-1" },
    daily: 5468426,
    hourly: {
        "00": 227850,
        "01": 210231,
        ...
        "23": 20457 },
    minute: {
        "0000": 3612,
        "0001": 3241,
        ...
        "1439": 2819 }
}

Here, we keep the date and metric we're storing in a "metadata" property so we can easily query it later. Note that the date and metric name are also embedded in the _id field as well (that will be important later). Actual metric data is stored in the daily, hourly, and minute properties.

Now if we want to update this document (say, to record a hit to a web page), we can use MongoDB's in-place update operators to increment the appropriate daily, hourly, and per-minute counters. To further simplify things, we'll use MongoDB's "upsert" feature to create a document if it doesn't already exist (this prevents us from having to allocate the documents ahead-of-time). The first version of our update method, then, looks like this:

def record_hit(coll, dt, measure):
    sdate = dt.strftime('%Y%m%d')
    metadata = dict(
        date=datetime.combine(
            dt.date(),
            time.min),
        measure=measure)
    id='%s/%s' % (sdate, measure)
    minute = dt.hour * 60 + dt.minute
    coll.update(
        { '_id': id, 'metadata': metadata },
        { '$inc': {
                'daily': 1,
                'hourly.%.2d' % dt.hour: 1,
                'minute.%.4d' % minute: 1 } },
        upsert=True)

To use this to record a "hit" to our website, then, we would simply call it with our collection, the current date, and the measure being updated:

>>> record_hit(db.daily_hits, datetime.utcnow(), '/path/to/my/page.html')

Measuring performance

To measure the performance of this approach, I created a 2-server cluster on Amazon EC2: one server to run MongoDB and one to run my benchmark code to do a bunch of record_hit() calls, simulating different times of day to see the performance over multiple 24-hour periods. This is what I found:

Initial Schema Performance

Ouch! For some reason, we see the performance of our system steadily decrease from 3000-5000 writes per second to 200-300 writes per second as the day goes on. This, it turns out, happens because our "in-place" update was not, in fact, in-place.

Growing documents

MongoDB allows you great flexibility when updating your documents, even allowing you to add new fields and cause the documents to grow in size over time. And as long as your documents don't grow too much, everything just kind of works. MongoDB will allocate some "padding" to your documents, assuming some growth, and as long as you don't outgrow your padding, there's really very little performance impact.

Once you do outgrow your padding, however, MongoDB has to move your document to another location. As your documeng gets bigger, this takes longer (more bytes to copy and all that). So documents that grow and grow and grow are a real performance-killer with MongoDB. And that's exactly what we have here. Consider the first time we call record_hit during a day. After this, the document looks like the following:

{
    _id: ...,
    metadata: {...},
    daily: 1,
    hourly: { "00": 1 }, 
    minute: { ""0000": 1 }
}

Then we record a hit during the second minute of a day and our document grows:

{
    _id: ...,
    metadata: {...},
    daily: 2,
    hourly: { "00": 2 }, 
    minute: { ""0000": 1, "0001": 1 }
}

Now, even if we're only recording a single hit per minute, our document had to grow 1439 times, and by the end of the day it takes up substantially more space than it did when we recorded our first hit just after midnight.

Fixing with preallocation

The solution to the problem of growing documents is preallocation. However, we'd prefer not to preallocate all the documents at once (this would cause a large load on the server), and we'd prefer not to manually schedule documents for preallocation throughout the day (that's just a pain). The solution that 10gen decided upon, then, was to randomly (with a small probability) preallocate tomorrow's document each time we record a hit today.

In the system I designed, this preallocation is performed at the beginning of record_hit:

def record_hit(coll, dt, measure):
    if PREALLOC and random.random() < (1.0/2000.0):
        preallocate(coll, dt + timedelta(days=1), measure)
    # ... 

Our preallocate function isn't that interesting, so I'll just show the general idea here:

def preallocate(coll, dt, measure):
    metadata, id = # compute metadata and ID
    coll.update( 
       { '_id': id },
       { '$set': { 'metadata': metadata },
         '$inc': { 
             'daily': 0,
             'hourly.00': 0,
             # ...
             'hourly.23': 0,
             'minute.0000': 0,
             # ...
             'minute.1439: 0 } },
       upsert=True)

There are two important things to note here:

  • Our preallocate function is safe. If by some chance we call preallocate on a date/metric that already has a document, nothing changes.
  • Even if preallocate is never called, record_hit is still functionally correct, so we don't have to worry about the small probability that we get through a whole day without preallocating a document.

Now with these changes in place, we see much better performance:

Performance with Preallocation

We've actually improvide performance in two ways using this approach:

  • Preallocation means that our documents never grow, so they never get moved
  • By preallocating throughout the day, we don't have a "midnight problem" where our upserts all end up inserting a new document and increasing load on the server.

We do, however, have a curious downward trend in performance throughout the day (though much less drastic than before). Where did that come from?

MongoDB's storage format

To figure out the downward performance through the day, we need to take a brief detour into the actual format that MongoDB uses to store data on disk (and memory), BSON. Normally, we don't need to worry about it, since the pymongo driver converts everything so nicely into native Python types, but in this case BSON presents us a performance problem.

Although MongoDB documents, such as our minute embedded document, are represented in Python as a dict (which is a constant-speed lookup hash table), BSON actually stores documents as an association list. So rather than having a nice hash table for minute, we actually have something that looks more like the following:

minute = [
    [ "0000", 3612 ],
    [ "0001", 3241 ],
    # ...
    [ "1439", 2819 ] ]

Now to actually update a particular minute, the MongoDB server performs the something like the following operations (psuedocode, with lots of special cases ignored):

inc_value(minute, "1439", 1)

def inc_value(document, key, value)
    for entry in document:
        if entry[0] == key:
            entry[1] += value
            break

The performance of this algorithm, far from our nice O(1) hash table, is actually O(N) in the number of entries in the document. In the case of the minute document, MongoDB has to actually perform 1439 comparisons before it finds the appropriate slot to update.

Fixing the downward trend with hierarchy

To fix the problem, then, we need to reduce the number of comparisons MongoDB needs to do to find the right minute to increment. The way we can do this is by splitting up the minutes into hours. Our daily stats document now looks like the following:

{ _id: "20101010/metric-1",
  metadata: {
    date: ISODate("2000-10-10T00:00:00Z"),
    metric: "metric-1" },
  daily: 5468426,
  hourly: {
    "0": 227850,
    "1": 210231,
    ...
    "23": 20457 },
  minute: {
    "00": {        
        "0000": 3612,
        "0100": 3241,
        ...
    }, ...,
    "23": { ..., "1439": 2819 }
}

Our record_hit and preallocate routines have to change a bit as well:

def record_hit_hier(coll, dt, measure):
    if PREALLOC and random.random() < (1.0/1500.0):
        preallocate_hier(coll, dt + timedelta(days=1), measure)
    sdate = dt.strftime('%Y%m%d')
    metadata = dict(
        date=datetime.combine(
            dt.date(),
            time.min),
        measure=measure)
    id='%s/%s' % (sdate, measure)
    coll.update(
        { '_id': id, 'metadata': metadata },
        { '$inc': {
                'daily': 1,
                'hourly.%.2d' % dt.hour: 1,
                ('minute.%.2d.%.2d' % (dt.hour, dt.minute)): 1 } },
        upsert=True)

def preallocate(coll, dt, measure):
    '''Once again, simplified for explanatory purposes'''
    metadata, id = # compute metadata and ID
    coll.update( 
       { '_id': id },
       { '$set': { 'metadata': metadata },
         '$inc': { 
             'daily': 0,
             'hourly.00': 0,
             # ...
             'hourly.23': 0,
             'minute.00.00': 0,
             # ...
             'minute.00.59': 0,
             'minute.01.00': 0,
             # ...
             'minute.23.59': 0 } },
       upsert=True)

Once we've added the hierarchy and re-run our experiment, we get the nice, level performance we'd like to see:

Performance with Hierarchical Minutes

Conclusion

It's always nice to see "tips and tricks" borne out through actual, quantitative results, so this was probably the most enjoyable talk I've ever put together. The things I got out of it were the following:

  • Growing documents is a very bad thing for performance. Avoid it if at all possible.
  • Awareness of the BSON specification and data representation can actually be quite useful when diagnosing performance problems.
  • To get the best performance out of your system, you need to actually run it (or a highly representative stand-in). Actually seeing the results of performance tweaking in graphical form is incredibly helpful in targeting your efforts.

The source code for all these updates is available in my mongodb-sdas Github repo, and I welcome any feedback either there, or here in the comments. In particular, I'd love to hear of any performance problems you've run into and how you got around them. And of course, if you've got a really perplexing problem, I'm always available for consulting by emailing me at Arborian.com.

23 comments:

  1. Anonymous1:16 PM

    Rick,

    Thank you very much for putting this together. I have to say that this is one of the best blog posts I have ever read on MongoDB design and scalability. And it has just convinced me of getting rid of growing documents in my models.

    Leo

    ReplyDelete
    Replies
    1. Leo,

      Thanks so much for the comment! I'm glad you found the post useful. It's always nice to see real numbers I think (especially in scatter plots ;-) )

      Delete
  2. Great post! Implement, measure, adjust.

    ReplyDelete
  3. Anonymous9:16 AM

    Wonderful! Thank you, we need more of this kind of posts :)

    ReplyDelete
    Replies
    1. Thanks for the comment! I'm glad you liked it.

      Delete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Great insights! especially with regard to the 0(N) object traversal on updates.

    ReplyDelete
    Replies
    1. Thanks for the comment! I have to admit, the insight on O(N) object traversal is not mine; it was discovered by the 10gen MMS team when optimizing the monitoring service. It was a fascinating and weird enough 2nd order effect that I thought others would be interested as well.

      Thanks again!

      Delete
  6. Anonymous10:31 AM

    "23": { ..., "1439": 2819 }
    Is that really correct? Shouldn't it be something like:
    "23": { ..., "2339": 2819 }
    Or have I missed out something?

    ReplyDelete
    Replies
    1. In this case, I simply kept the minute of the day (numbered 0-1439) as the key of the embedded document. You could also use

      "23": { "00":..., "59": ... }

      ... which is probably what I would do in the future.

      Delete
  7. I am just curious
    if accessing 'dict' keys in mongo document is slow
    is accessing 'list' indexes slow too ?

    could we replace
    'minute': { '0000': N0, '0001': N1, ... '1439': N1439 }
    with
    'minute': [N0...N1439]

    ReplyDelete
    Replies
    1. Hi Sergey,

      Thanks for the comment! It turns out that arrays in BSON are actually stored as dicts where they keys are "1", "2", "3", etc. (surprising but true!), so you wouldn't actually get any faster using them. I"ve brought up this interesting design decision with 10gen folks multiple times, so someday we may have better-performing arrays, but for now they're just documents with a different type code.

      Delete
    2. Anonymous5:48 PM

      What about using alternative connector like Monary which instead of encapsulating into dictionaries, it relies on NumPy arrays?

      https://bitbucket.org/djcbeach/monary/wiki/Home

      Delete
    3. Hi Anon,

      Monary looks pretty cool; I hadn't heard of it before. I don't think it's really applicable to this case, however, as it's focused on reading a "stripe" of values from many documents into a single numpy array. What I'm trying to do here is to repeatedly update a single document.

      Thanks for the comment, though. Monary looks very interesting; I'm going to have to find a place to use it. :-)

      Delete
  8. Simple and insightful article, thank you Rick.
    I'm making schema changes right-away :)

    ReplyDelete
    Replies
    1. Thanks for the comment, Vivek! Glad you found it useful :)

      Delete
  9. This comment has been removed by the author.

    ReplyDelete
  10. Anonymous5:45 AM

    I want to structure my data this way but I need a counter for each value(daily, hourly, minute). Any idea of how to store that efficiently? I firstly tried this solution but I want my data to be more compact.

    ReplyDelete
    Replies
    1. Well, you can store the three counters in three collections. I don't think there's a good way to make it too much more compact, though; you don't want to "pack" the arrays since that means MongoDB can't do an in-place update, and BSON's not particularly efficient as a storage protocol (compared with relational DBs, that is).

      Delete
  11. Anonymous5:34 PM

    Great post. I have some questions for you. You are using "20101010/metric-1" as an _id. So you are using a single collection for multiple metrics. How do you query that?
    Also, having in mind that I have multiple devices with multiple metrics what's better to do. Create one collection per device and store multiple metrics or create one collection for every single metric?

    ReplyDelete
    Replies
    1. If you only want a single metric, the query is straightforward. If you need multiple metrics, you can do a regex query on _id for '^20101010/', which will be reasonably fast. If you're *always* getting the same set of metrics, however, I'd recommend storing them alongside one another in the document instead.

      The question of whether to use multiple *collections* or a single collection is pretty much a wash performance-wise. Separating your documents into different collections is only really *necessary* when you have different query patterns (and therefore a need for different indexes, sharding approaches, etc.).

      Hope this helps!

      Delete