Since starting my own consulting business, I've had the opportunity to work with lots of interesting technologies. Today I wanted to tell you about one I got to develop for MongoDB: multi-master replication.
MongoMultiMaster usage (or tl;dr)
Before getting into the details of how MongoMultiMaster works, here's a short intro on how to actually use it. Included in the distribution is a command-line tool named mmm. To use mmm, you need to set up a topology for your replication in YAML, assigning UUIDs to your servers (the particular values of the UUIDs don't matter, only that they're unique). In this case, I've pointed MMM at two servers running locally:
    server_a:
      id: '2c88ae84-7cb9-40f7-835d-c05e981f564d'
      uri: 'mongodb://localhost:27019'
    server_b:
      id: '0d9c284b-b47c-40b5-932c-547b8685edd0'
      uri: 'mongodb://localhost:27017'
Once this is done, you can view or edit the replication config. Here, we'll clear the config (if any exists) and view it to ensure all is well:
    $ mmm -c test.yml clear-config
    About to clear config on servers: ['server_a', 'server_b'], are you sure? (yN) y
    Clear config for server_a
    Clear config for server_b
    $ mmm -c test.yml dump-config
    === Server Config ===
    server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
    server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
    === server_a Replication Config
    === server_b Replication Config
Next, we'll set up two replicated collections:
    $ mmm -c test.yml replicate --src=server_a/test.foo --dst=server_b/test.foo
    $ mmm -c test.yml replicate --src=server_a/test.bar --dst=server_b/test.bar
And confirm they're configured correctly:
    $ mmm -c test.yml dump-config
    === Server Config ===
    server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
    server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
    === server_a Replication Config
    === server_b Replication Config
        - test.foo <= server_a/test.foo
        - test.bar <= server_a/test.bar
Now, let's make the replication bidirectional:
    $ mmm -c test.yml replicate --src=server_b/test.foo --dst=server_a/test.foo
    $ mmm -c test.yml replicate --src=server_b/test.bar --dst=server_a/test.bar
And verify that it's correct...
    $ mmm -c test.yml dump-config
    === Server Config ===
    server_a (2c88ae84-7cb9-40f7-835d-c05e981f564d) => mongodb://localhost:27019
    server_b (0d9c284b-b47c-40b5-932c-547b8685edd0) => mongodb://localhost:27017
    === server_a Replication Config
        - test.foo <= server_b/test.foo
        - test.bar <= server_b/test.bar
    === server_b Replication Config
        - test.foo <= server_a/test.foo
        - test.bar <= server_a/test.bar
Now we can run the replicator:
    $ mmm -c test.yml run
And you get a program that will replicate updates, bidirectionally, between server_a and server_b.
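If you want a quick sanity check that replication is actually happening, you can write to each server and watch the document show up on the other. The snippet below is illustrative only: it assumes the local ports from the test.yml above, the pymongo Connection class, and that mmm run is active in another terminal.

    >>> from pymongo import Connection
    >>> foo_a = Connection('localhost', 27019).test.foo
    >>> foo_b = Connection('localhost', 27017).test.foo
    >>> foo_a.insert({'_id': 'a-doc', 'written_on': 'server_a'})
    >>> foo_b.insert({'_id': 'b-doc', 'written_on': 'server_b'})
    >>> # give the replicator a moment to catch up, then both servers
    >>> # should report both documents
    >>> foo_a.count(), foo_b.count()   # expect (2, 2) once replication catches up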
So how does this all work, anyway?
Replication in MongoDB
For those who aren't familiar with MongoDB replication, MongoDB uses a modified master/slave approach. To use the built-in replication, you set up a replica set with one or more mongod nodes. These nodes then elect a primary node to handle all writes; all the other servers become secondaries, able to handle reads but not writes. The secondaries then replicate all writes on the primary to themselves, staying in sync as much as possible.
There's lots of cool stuff you can do with replication, including setting up "permanent secondaries" that can never be elected primary (for backup), keeping a fixed delay between a secondary and its primary (for protection against accidental catastrophic user error), making replicas data-center aware (so you can make your writes really safe), etc. But in this article, I want to focus on one limitation of the replica set configuration: it is impossible to replicate data from a secondary; all secondaries read from the same primary.
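For reference, here's roughly what that kind of configuration looks like when you initiate a replica set from pymongo. This is just an illustration of the options mentioned above, not part of MMM; the host names are made up, and the mongod processes would need to be started with --replSet rs0.

    >>> from pymongo import Connection
    >>> conn = Connection('local-db', 27017)
    >>> conn.admin.command('replSetInitiate', {
    ...     '_id': 'rs0',
    ...     'members': [
    ...         {'_id': 0, 'host': 'local-db:27017'},                  # can be elected primary
    ...         {'_id': 1, 'host': 'backup-db:27017', 'priority': 0},  # "permanent secondary"
    ...         {'_id': 2, 'host': 'delayed-db:27017', 'priority': 0,
    ...          'hidden': True, 'slaveDelay': 3600},                  # stays an hour behind
    ...     ]})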
The client problem
My client has a system where there is a "local" system and a "cloud" system. On the local system, there is a lot of writing going on, so he initially set up a replica set where the local database was set as primary, with the cloud database set as a permanent (priority-zero) secondary. Though it's a little bit weird to configure a system like this (more often, the primary is remote and you keep a local secondary), it wasn't completely irrational. There were, however, some problems:
- The goal of high availability wasn't really maintained, as the replica set only had a single server capable of being a primary.
- It turned out that only a small amount of data needed to be replicated to the cloud server, and replicating everything (including some image data) was saturating the upstream side of his customers' DSL connections.
- He discovered that he really would like to write to the cloud occasionally and have those changes replicated down to the local server (though these updates would usually be to different collections than the local server was writing to frequently).
What would really be nice to have is a more flexible replication setup where you could:
- Specify which collections get replicated to which servers
- Maintain multiple MongoDB "primaries" capable of accepting writes
- Handle bidirectional replication between primaries
First step: build triggers
And so I decided to build a so-called multi-master replication engine for MongoDB. It turns out that MongoDB's method of replication is for the primary to write all of its operations to an oplog. Following the official replication docs on the MongoDB site and some excellent articles by 10gen core developer Kristina Chodorow on replication internals, the oplog format, and using the oplog, along with a bit of reverse-engineering and reading the MongoDB server code, it's possible to piece together exactly what's going on with the oplog.
It turns out that the first thing I needed was a way to get the most recent operations in the oplog and to "tail" it for more. In Python, this is fairly straightforward:
    >>> spec = dict(ts={'$gt': checkpoint})
    >>> q = conn.local.oplog.rs.find(
    ...     spec, tailable=True, await_data=True)
    >>> q = q.sort('$natural')
    >>> for op in q:
    ...     # do some stuff with the operation "op"
    ...     pass
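One wrinkle: a tailable cursor eventually dies (for instance, when it reaches the end of the oplog and the await times out), at which point the for loop above simply ends. In practice you re-query from the last timestamp you saw, roughly like this. This is a sketch of the idea, not MMM's actual code, and tail_oplog is a name I'm making up here.

    import time

    def tail_oplog(conn, checkpoint):
        """Yield oplog entries newer than checkpoint, forever."""
        while True:
            q = conn.local.oplog.rs.find(
                dict(ts={'$gt': checkpoint}),
                tailable=True, await_data=True).sort('$natural')
            for op in q:
                checkpoint = op['ts']   # remember where we are
                yield op
            time.sleep(1)               # cursor died; re-query from the checkpoint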
What this does is set up a find on the oplog that will "wait" until operations are written to it, similar to the tail -f command on UNIX systems. Once we have the operations, we need to look at the format of operations in the oplog. We're interested in insert, update, and delete operations, each of which has a slightly different format. Insert operations are fairly simple:
{ op: "i", // operation flag ts: {...} // timestamp of the operation ns: "database.collection" // the namespace the operation affects o: {...} // the document being inserted }
Updates have the following format:
{ op: "u", // operation flag ts: {...} // timestamp of the operation ns: "database.collection" // the namespace the operation affects o: { ... } // new document or set of modifiers o2: { ... } // query specifier telling which doc to update b: true // 'upsert' flag }
Deletes look like this:
{ op: "u", // operation flag ts: {...} // timestamp of the operation ns: "database.collection" // the namespace the operation affects o: { ... } // query specifier telling which doc(s) to delete b: false // 'justOne' flag -- ignore it for now }
Once that's all under our belt, we can actually build a system of triggers that will call arbitrary code. In MongoMultiMaster (mmm), we do this by using the Triggers class:
    >>> from mmm.triggers import Triggers
    >>> t = Triggers(pymongo_connection, None)
    >>> def my_callback(**kw):
    ...     print kw
    >>> t.register('test.foo', 'iud', my_callback)
    >>> t.run()
Now, if we make some changes to the foo collection in the test database, we can see those oplog entries printed.
Next step: build replication
Once we have triggers, adding replication is straightforward enough. MMM stores its current replication state in an mmm collection in the local database. The format of documents in this collection is as follows:
    {
        _id: ...,                     // some UUID identifying the server we're replicating FROM
        checkpoint: Timestamp(...),   // timestamp of the last op we've replicated
        replication: [
            { dst: 'slave-db.collection',   // the namespace we're replicating TO
              src: 'master-db.collection',  // the namespace we're replicating FROM
              ops: 'iud' },                 // operations to be replicated (might not replicate 'd')
            ...
        ]
    }
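To make that format concrete, here's a rough sketch (not MMM's actual code) of how such a state document could drive replication, reusing the hypothetical tail_oplog and apply_op helpers sketched earlier; dest_conn, source_conn, and SOURCE_UUID are assumed names for pymongo connections to the destination and source servers and the source server's UUID.

    state = dest_conn.local.mmm.find_one({'_id': SOURCE_UUID})   # server we replicate FROM
    for op in tail_oplog(source_conn, state['checkpoint']):
        for rule in state['replication']:
            if op['ns'] == rule['src'] and op['op'] in rule['ops']:
                dst_db, dst_coll = rule['dst'].split('.', 1)
                apply_op(dest_conn[dst_db][dst_coll], op)
        # advance the checkpoint so a stop/start picks up where it left off
        dest_conn.local.mmm.update({'_id': SOURCE_UUID},
                                   {'$set': {'checkpoint': op['ts']}})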
The actual replication code is fairly trivial, so I won't go into it except to mention how MMM solves the problem of bidirectional replication.
Breaking loops
One of the issues that I initially had with MMM was getting bidirectional replication working. Say you have db1.foo and db2.foo set up to replicate to one another. Then you can have a situation like the following:
- User inserts doc into db1.foo
- MMM reads insert into db1.foo from oplog and inserts doc into db2.foo
- MMM reads insert into db2.foo from oplog and inserts doc into db1.foo
- Rinse & repeat...
In order to avoid this problem, operations in MMM are tagged with their source server's UUID. Whenever MMM sees a request to replicate an operation originating on db1 back to db1, it drops it, breaking the loop (a small sketch of the check follows the steps below):
- User inserts doc into db1.foo on server DB1
- MMM reads insert into db1.foo and inserts doc into db2.foo on DB2, tagging the document with the additional field {mmm: DB1}
- MMM reads the insert into db2.foo, sees the tag {mmm: DB1}, and decides not to replicate that document back to DB1.
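In code, the loop-breaking check ends up being very small. This is a sketch of the idea rather than MMM's exact implementation, but the mmm field matches the tag shown in the steps above:

    def should_replicate(op, dest_server_uuid):
        """Skip any document that originally came from the destination server."""
        doc = op.get('o', {})
        return doc.get('mmm') != dest_server_uuid   # tag holds the origin server's UUID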
Conclusion & caveats
So what I've built ends up being a just-enough solution for the client's problem, but I thought it might be useful to some others, so I released the code on GitHub as mmm. If you end up using it, some things you'll need to be aware of include:
- Replication can fall behind if you're writing a lot. This is not handled at all.
- Replication begins at the time when mmm run was first called. You should be able to stop/start mmm and have it pick up where it left off.
- Conflicts between masters aren't handled; if you're writing to the same document on both heads frequently, you can get out of sync.
- Replication inserts a bookkeeping field into each document to signify the server UUID that last wrote the document. This expands the size of each document slightly (an example is shown below).
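Concretely, a document saved as {'_id': 1, 'x': 2} on server_a would show up on server_b looking something like the following. The exact layout is illustrative, but the mmm field is the tag described in the loop-breaking section, holding server_a's UUID from the config above.

    # what the replicated copy on server_b might look like (illustrative):
    {'_id': 1, 'x': 2, 'mmm': '2c88ae84-7cb9-40f7-835d-c05e981f564d'}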
So what do you think? Is MMM a project you'd be interested in using? Any glaring omissions to my approach? Let me know in the comments below!
Comments

Very cool :) It's really nice how mongo provides all the tools you need (oplog, tailable cursors, etc.) to be able to implement something like this.
I have been advised by someone who should know that secondaries can indeed replicate from other secondaries in MongoDB replica sets for more complex topologies, but there is still only a single primary capable of accepting writes.
ReplyDeleteMust ask: Why not go just with CouchDB?
MongoDB provides a few advantages over CouchDB, including a more flexible query syntax and better performance. A full comparison between the two is a bit beyond the scope of a comment, though. For this particular use case, my client is already committed to using MongoDB (there's a little too much code written to pivot quickly at this point), including MongoDB-specific bits like GridFS.
Incidentally, my client is actually the same person that made the decision to go with MongoDB and not CouchDB at SourceForge, though he's not with SourceForge any more.
Strictly speaking, this isn't adding multi-master to Mongo. What it *is* doing is replicating between independent replica sets, each of which still has its own single master. Not that I'm complaining - it's a very useful and practical demonstration of how to extend MongoDB.
Thanks for the comment! While it's true that I didn't modify the mongod binary in any way, I would say that if you set up bidirectional replication in MMM for a collection, you do have a system that behaves as a multi-master system. Once you have MMM running, the replica sets aren't really "independent" anymore. But we might be splitting hairs over semantics. Anyway, thanks for taking the time to read and respond to the article!
Really nice and informative post, thanks for this.
I have a question though: what approach would you take for tailing the oplog if "falling behind" is not an option?
Thanks for the comment.
I'm not sure what you mean by "'falling behind' is not an option"; did you envision some mechanism for throttling writes to the system? In that case, you'd probably need to build a higher-level abstraction. MMM doesn't have any facility for doing that at present.
Yeah, imagine throttling writes so that we may not be able to keep in sync with the oplog (it will eventually start overwriting while we are still processing some elements). I think this is what secondaries have to deal with. Do you have any suggestions on how to tackle this ? I thought about mimicking mongodb's approach. I just started studying the source code, so I'm not sure how difficult it is, perhaps their solution involves some kind of locking which I wouldn't be able to do in the app layer.
Thanks for your reply.
Nice work, thanks!
Thanks for the comment. Glad you enjoyed it!
Looks very promising! I am going to try it and let you know...
Regarding "Replication can fall behind if you're writing a lot. This is not handled at all": did you run any experiments on this? Does this also work for primary MongoDB servers located in different locations?
No, I haven't run any experiments with replicas falling behind. There is another project that seems to have more momentum here: http://blog.mongodb.org/post/29127828146/introducing-mongo-connector
What about the case where you might want many replica sets to sync with many?
For example: Replica set A syncs to set B, B syncs to set C, and B also syncs to D. Services running on each replica set would determine if its replication partner is up or not and establish a new replication partner (on a higher-cost route). So if B were to drop out, D might connect to A and C to D. With this sort of configuration, site-based clusters could operate and continue to operate without some of the limitations of mongo replication sets.