Monday, July 09, 2012

Multi-Master Replication in MongoDB

Since starting my own consulting business, I've had the opportunity to work with lots of interesting technologies. Today I wanted to tell you about one I got to develop for MongoDB: multi-master replication.

MongoMultiMaster usage (or tl;dr)

Before getting into the details of how MongoMultiMaster works, here's a short intro on how to actually use it. Included in the distribution is a command-line tool named mmm. To use mmm, you need to set up a topology for your replication in YAML, assigning UUIDs to your servers (the particular values of the UUIDs don't matter, only that they're unique). In this case, I've pointed MMM to two servers running locally:

server_a:
  id: '2c88ae84-7cb9-40f7-835d-c05e981f564d'
  uri: 'mongodb://localhost:27019'
server_b:
  id: '0d9c284b-b47c-40b5-932c-547b8685edd0'
  uri: 'mongodb://localhost:27017'

Once this is done, you can view or edit the replication config. Here, we'll clear the config (if any exists) and view it to ensure all is well:

$ mmm -c test.yml clear-config
About to clear config on servers: ['server_a', 'server_b'], are you sure? (yN) y
Clear config for server_a
Clear config for server_b
$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017

=== server_a Replication Config
=== server_b Replication Config

Next, we'll set up two replicated collections:

$ mmm -c test.yml replicate --src=server_a/test.foo --dst=server_b/test.foo
$ mmm -c test.yml replicate --src=server_a/test.bar --dst=server_b/test.bar

And confirm they're configured correctly:

$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017

=== server_a Replication Config
=== server_b Replication Config
     - test.foo <= server_a/test.foo
     - test.bar <= server_a/test.bar

Now, let's make the replication bidirectional:

$ mmm -c test.yml replicate --src=server_b/test.foo --dst=server_a/test.foo
$ mmm -c test.yml replicate --src=server_b/test.bar --dst=server_a/test.bar

And verify that it's correct...

$ mmm -c test.yml dump-config
=== Server Config ===
server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017

=== server_a Replication Config
     - test.foo <= server_b/test.foo
     - test.bar <= server_b/test.bar
=== server_b Replication Config
     - test.foo <= server_a/test.foo
     - test.bar <= server_a/test.bar

Now we can run the replicator:

$ mmm -c test.yml run

And you get a program that will replicate updates, bidirectionally, between server_a and server_b. So how does this all work, anyway?

Replication in MongoDB

For those who aren't familiar with MongoDB replication, MongoDB uses a modified master/slave approach. To use the built-in replication, you set up a replica set with one or more mongod nodes. These nodes then elect a primary node to handle all writes; all the other servers become secondaries, able to serve reads but unable to accept any writes. The secondaries then replicate all writes on the primary to themselves, staying as closely in sync as possible.

There's lots of cool stuff you can do with replication, including setting up "permanent secondaries" that can never be elected primary (for backup), keeping a fixed delay between a secondary and its primary (for protection against accidental catastrophic user error), making replicas data-center aware (so you can make your writes really safe), etc. But in this article, I want to focus on one limitation of the replica set configuration: it is impossible to replicate data from a secondary; all secondaries read from the same primary.
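
To make those options concrete, here is a sketch of a replica-set config document illustrating the features above. The hostnames are hypothetical; a document like this is what you'd hand to rs.initiate() or a replSetReconfig command:

```python
# Hypothetical replica-set configuration illustrating priority-zero
# ("permanent secondary"), delayed, and hidden members.
config = {
    '_id': 'rs0',
    'members': [
        # an ordinary member, eligible to be elected primary
        {'_id': 0, 'host': 'db1.example.com:27017'},
        # priority 0: can never be elected primary (e.g. a backup node)
        {'_id': 1, 'host': 'backup.example.com:27017', 'priority': 0},
        # a hidden member that stays one hour behind the primary,
        # protecting against catastrophic user error
        {'_id': 2, 'host': 'delayed.example.com:27017',
         'priority': 0, 'slaveDelay': 3600, 'hidden': True},
    ],
}
```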

The client problem

My client has a system where there is a "local" system and a "cloud" system. On the local system, there is a lot of writing going on, so he initially set up a replica set where the local database was set as primary, with the cloud database set as a permanent (priority-zero) secondary. Though it's a little bit weird to configure a system like this (more often, the primary is remote and you keep a local secondary), it wasn't completely irrational. There were, however, some problems:

  • The goal of high availability wasn't really maintained, as the replica set only had a single server capable of being a primary.
  • It turned out that only a small amount of data needed to be replicated to the cloud server, and replicating everything (including some image data) was saturating the upstream side of his customers' DSL connections.
  • He discovered that he really would like to write to the cloud occasionally and have those changes replicated down to the local server (though these updates would usually be to different collections than the local server was writing to frequently).

What would really be nice to have is a more flexible replication setup where you could

  • Specify which collections get replicated to which servers
  • Maintain multiple MongoDB "primaries" capable of accepting writes
  • Handle bidirectional replication between primaries

First step: build triggers

And so I decided to build a so-called multi-master replication engine for MongoDB. It turns out that MongoDB's method of replication is for the primary to write all of its operations to an oplog. Following the official replication docs on the MongoDB site and some excellent articles by 10gen core developer Kristina Chodorow on replication internals, the oplog format, and using the oplog, along with a bit of reverse-engineering and reading the MongoDB server code, it's possible to piece together exactly what's going on with the oplog.

It turns out that the first thing I needed was a way to get the most recent operations in the oplog and to "tail" it for more. In Python, this is fairly straightforward:

>>> # checkpoint is the ts value of the last operation we've processed
>>> spec = dict(ts={'$gt': checkpoint})
>>> q = conn.local.oplog.rs.find(
...     spec, tailable=True, await_data=True)
>>> q = q.sort('$natural')
>>> for op in q:
...     handle(op)  # do some stuff with the operation "op"

What this does is set up a find on the oplog that will "wait" until operations are written to it, similar to the tail -f command on Unix systems. Once we have the operations, we need to look at the format of operations in the oplog. We're interested in insert, update, and delete operations, each of which has a slightly different format. Insert operations are fairly simple:

{ op: "i",  // operation flag
  ts: {...},  // timestamp of the operation
  ns: "database.collection",  // the namespace the operation affects
  o: {...}  // the document being inserted
}

Updates have the following format:

{ op: "u",    // operation flag
  ts: {...},  // timestamp of the operation
  ns: "database.collection", // the namespace the operation affects
  o: { ... }, // new document or set of modifiers
  o2: { ... }, // query specifier telling which doc to update
  b: true     // 'upsert' flag
}

Deletes look like this:

{ op: "d",    // operation flag
  ts: {...},  // timestamp of the operation
  ns: "database.collection", // the namespace the operation affects
  o: { ... }, // query specifier telling which doc(s) to delete
  b: false    // 'justOne' flag -- ignore it for now
}
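
Putting the three formats together, the core of a replicator is a small dispatcher that applies each oplog entry to a destination. Here's a minimal sketch of the idea (not MMM's actual code); get_collection is a hypothetical helper that resolves a namespace string like 'test.foo' to a destination pymongo collection:

```python
def apply_op(get_collection, op):
    """Apply a single oplog entry to the destination.

    get_collection maps a namespace string like 'test.foo' to a
    collection object exposing pymongo-style insert/update/remove.
    """
    coll = get_collection(op['ns'])
    if op['op'] == 'i':
        # plain insert of the new document
        coll.insert(op['o'])
    elif op['op'] == 'u':
        # o2 selects the document; o is the new doc or modifier spec;
        # b carries the upsert flag
        coll.update(op['o2'], op['o'], upsert=op.get('b', False))
    elif op['op'] == 'd':
        # o is the query specifier for the delete
        coll.remove(op['o'])
```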

Once that's all under our belt, we can actually build a system of triggers that will call arbitrary code. In MongoMultiMaster (mmm), we do this by using the Triggers class:

>>> from mmm.triggers import Triggers
>>> t = Triggers(pymongo_connection, None)
>>> def my_callback(**kw):
...     print kw
>>> t.register('test.foo', 'iud', my_callback)
>>> t.run()

Now, if we make some changes to the foo collection in the test database, we can see those oplog entries printed.
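
Under the hood, a trigger system like this can be little more than a table mapping (namespace, operation) pairs to callbacks, fed by the tailing loop shown earlier. A toy version (not MMM's actual implementation) might look like:

```python
class ToyTriggers(object):
    """Toy trigger dispatcher mapping (namespace, op) pairs to callbacks."""

    def __init__(self):
        self._callbacks = {}

    def register(self, ns, ops, callback):
        # ops is a string like 'iud' naming the operations to watch
        for op in ops:
            self._callbacks.setdefault((ns, op), []).append(callback)

    def dispatch(self, entry):
        # call each callback registered for this entry's namespace and
        # op flag, passing the oplog fields as keyword arguments
        for cb in self._callbacks.get((entry['ns'], entry['op']), []):
            cb(**entry)
```

A real implementation would feed dispatch() from the tailable oplog cursor and track a checkpoint as it goes.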

Next step: build replication

Once we have triggers, adding replication is straightforward enough. MMM stores its current replication state in an mmm collection in the local database. The format of documents in this collection is as follows:

{ _id: ..., // some UUID identifying the server we're replicating FROM
  checkpoint: Timestamp(...), // timestamp of the last op we've replicated
  replication: [
    { dst: 'slave-db.collection',  // the namespace we're replicating TO
      src: 'master-db.collection', // the namespace we're replicating FROM
      ops: 'iud' },  // operations to be replicated (might not replicate 'd')
    ...
  ]
}
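
For illustration, here's how a replicate command might extend one of these state documents. This is a hedged sketch; add_replication is a hypothetical helper, not MMM's actual API:

```python
def add_replication(state_doc, src_ns, dst_ns, ops='iud'):
    """Add one src -> dst rule to a server's replication state document."""
    rule = {'src': src_ns, 'dst': dst_ns, 'ops': ops}
    rules = state_doc.setdefault('replication', [])
    if rule not in rules:  # adding the same rule twice is a no-op
        rules.append(rule)
    return state_doc

# build up the rules for one source server
doc = {'_id': 'some-server-uuid', 'checkpoint': None}
add_replication(doc, 'test.foo', 'test.foo')
add_replication(doc, 'test.bar', 'test.bar')
```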

The actual replication code is fairly trivial, so I won't go into it except to mention how MMM solves the problem of bidirectional replication.

Breaking loops

One of the issues that I initially had with MMM was getting bidirectional replication working. Say you have db1.foo and db2.foo set up to replicate to one another. Then you can have a situation like the following:

  • User inserts doc into db1.foo
  • MMM reads insert into db1.foo from oplog and inserts doc into db2.foo
  • MMM reads insert into db2.foo from oplog and inserts doc into db1.foo
  • Rinse & repeat...

In order to avoid this problem, operations in MMM are tagged with their source server's UUID. Whenever MMM sees a request to replicate an operation back to the server it originated on, it drops it, breaking the loop:

  • User inserts doc into db1.foo on server DB1
  • MMM reads insert into db1.foo and inserts doc into db2.foo on DB2, tagging the document with the additional field {mmm:DB1}
  • MMM reads the insert into db2.foo, sees the tag {mmm:DB1} and decides not to replicate that document back to DB1.
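
The check itself is tiny. A sketch of the idea, using the mmm tag field described above (the helper names are hypothetical, not MMM's API):

```python
def tag(doc, source_uuid):
    """Return a copy of doc tagged with the UUID of its source server."""
    tagged = dict(doc)
    tagged['mmm'] = source_uuid
    return tagged

def should_replicate(doc, dest_uuid):
    """Skip documents already tagged as originating on the destination."""
    return doc.get('mmm') != dest_uuid
```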

Conclusion & caveats

So what I've built ends up being a just-enough solution for the client's problem, but I thought it might be useful to some others, so I released the code on GitHub as mmm. If you end up using it, some things you'll need to be aware of include:

  • Replication can fall behind if you're writing a lot. This is not handled at all.
  • Replication begins at the time when mmm run was first called. You should be able to stop/start mmm and have it pick up where it left off.
  • Conflicts between masters aren't handled; if you're writing to the same document on both heads frequently, you can get out of sync.
  • Replication inserts a bookkeeping field into each document to signify the server UUID that last wrote the document. This expands the size of each document slightly.

So what do you think? Is MMM a project you'd be interested in using? Any glaring omissions to my approach? Let me know in the comments below!

14 comments:

  1. Very cool :) It's really nice how mongo provides all the tools you need (oplog, tailable cursors, etc) to be able to implement something like this.

  2. I have been advised by someone who should know that secondaries can indeed replicate from other secondaries in MongoDB replica sets for more complex topologies, but there is still only a single primary capable of accepting writes.

  3. Must ask: Why not go just with CouchDB?

    1. MongoDB provides a few advantages over CouchDB, including a more flexible query syntax and better performance. A full comparison between the two is a bit beyond the scope of a comment, though. For this particular use case, my client is already committed to using MongoDB (there's a little too much code written to pivot quickly at this point), including MongoDB-specific bits like gridfs.

      Incidentally, my client is actually the same person that made the decision to go with MongoDB and not CouchDB at SourceForge, though he's not with SourceForge any more.

  4. Strictly speaking, this isn't adding multi-master to Mongo. What it *is* doing is replicating between independent replica sets, each of which still has its own single master. Not that I'm complaining - it's a very useful and practical demonstration of how to extend MongoDB.

    1. Thanks for the comment! While it's true that I didn't modify the mongod binary in any way, I would say that if you set up bidirectional replication in MMM for a collection, you do have a system that behaves as a multi-master system. Once you have MMM running, the replica sets aren't really "independent" anymore. But we might be splitting hairs over semantics. Anyway, thanks for taking the time to read and respond to the article!

  5. Anonymous, 5:12 PM

    Really nice and informative post, thanks for this.
    I have a question though, what approach would you take for tailing the oplog if 'falling behind' is not an option ?

    1. Thanks for the comment.

      I'm not sure what you mean by "'falling behind' is not an option"; did you envision some mechanism for throttling writes to the system? In that case, you'd probably need to build a higher-level abstraction. MMM doesn't have any facility for doing that at present.

    2. Anonymous, 10:19 PM

      Yeah, imagine throttling writes so that we may not be able to keep in sync with the oplog (it will eventually start overwriting while we are still processing some elements). I think this is what secondaries have to deal with. Do you have any suggestions on how to tackle this ? I thought about mimicking mongodb's approach. I just started studying the source code, so I'm not sure how difficult it is, perhaps their solution involves some kind of locking which I wouldn't be able to do in the app layer.

      Thanks for your reply.

  6. Replies
    1. Thanks for the comment. Glad you enjoyed it!

  7. Looks very promising! I am going to try it and let you know...

    Regarding "Replication can fall behind if you're writing a lot. This is not handled at all"...

    Did you run any experiments on this? Does this also work for primary MongoDB servers located in different locations?

    1. No, I haven't run any experiments with replicas falling behind. There is another project that seems to have more momentum here: http://blog.mongodb.org/post/29127828146/introducing-mongo-connector

  8. What about the case where you might want many replica sets to sync with many?

    For example: Replica set A syncs to set B, B syncs to set C, and B also syncs to D. Services running on each replica set would determine whether their replication partner is up and, if not, establish a new replication partner (on a higher-cost route). So if B were to drop out, D might connect to A and C to D. With this sort of configuration, site-based clusters could operate and continue to operate without some of the limitations of MongoDB replica sets.
