Friday, June 08, 2012

Schema Maintenance with Ming and MongoDB

Continuing on in my series on MongoDB and Python, this article introduces the Python MongoDB toolkit Ming and what it can do to simplify your MongoDB code and ease maintenance. If you're just getting started with MongoDB, you might want to read the previous articles in the series first:

And now that you're all caught up, let's jump right in with Ming....

Why Ming?

If you've come to MongoDB from the world of relational databases, you have probably been struck by just how easy everything is: no big object/relational mapper needed, no new query language to learn (well, maybe a little, but we'll gloss over that for now), everything is just Python dictionaries, and it's so, so fast! While this is all true to some extent, one of the big things you give up with MongoDB is structure.

MongoDB is sometimes referred to as a schema-free database. (This is not technically true; I find it more useful to think of MongoDB as having dynamically typed documents. The collection doesn't tell you anything about the type of documents it contains, but each individual document can be inspected.) While this can be nice, as it's easy to evolve your schema quickly in development, it's easy to get yourself in trouble the first time your application tries to query by a field that only exists in some of your documents.
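
To make that pitfall concrete, here is a minimal sketch in plain Python (no database needed). The documents and the has_client helper are hypothetical; they only mimic how an equality match on a field skips documents that lack the field.

```python
# Hypothetical documents, as they might come back from a single collection.
docs = [
    {"username": "rick", "client_id": 1},
    {"username": "jenny"},                 # written before client_id existed
    {"username": "mark", "client_id": 2},
]

def has_client(doc, client_id):
    # A missing field never matches an equality test on a concrete value.
    return doc.get("client_id") == client_id

matches = [d["username"] for d in docs if has_client(d, 1)]
missing = [d["username"] for d in docs if "client_id" not in d]

print(matches)   # ['rick']
print(missing)   # ['jenny']
```

Nothing fails loudly here: jenny's document is simply invisible to any query on client_id, which is what makes this kind of drift hard to notice.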

The fact of the matter is that even if the database cares nothing about your schema, your application does, and if you play too fast and loose with document structure, it will come back to haunt you in the end. The main reason Ming was created at SourceForge was to deal with just this problem. We wanted a (thin) layer on top of pymongo that would do a couple of things for us:

  • Make sure that we don't put malformed data into the database
  • Try to 'fix' malformed data coming back from the database

So, without belaboring the point of its existence, let's jump into Ming.

Defining your schema

When using Ming, the first thing you need to do is to tell it what your documents look like. For this, Ming provides the collection function.

from datetime import datetime

from ming import collection, Field, Session
from ming import schema as S

session = Session()
MyDoc = collection(
    'user', session,
    Field('_id', S.ObjectId),
    Field('username', str),
    Field('created', datetime, if_missing=datetime.utcnow),
    ...)

There are a few things to note above:

  • The MongoDB collection name is passed as the first argument to collection
  • The Session object is used to abstract away the pymongo connection. We will see how to configure it below.
  • Each field in our schema gets its own Field definition. Fields contain a name, a schema item (S.ObjectId, str, and datetime in this example), and optional arguments that affect the field.
  • The special if_missing keyword argument allows you to supply default arguments which will be 'filled in' by Ming. If you pass a function, as above, the function will be called to generate a default value.
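
The if_missing behaviour described above can be sketched in a few lines of plain Python. This is not Ming's implementation; apply_defaults is a hypothetical helper that only illustrates the rule "use the value as-is, unless it is callable, in which case call it". It uses datetime.now rather than datetime.utcnow purely to keep the sketch free of deprecation warnings on newer Pythons.

```python
from datetime import datetime

def apply_defaults(doc, defaults):
    """Return a copy of doc with absent keys filled in from defaults.

    A callable default is invoked to produce the value; any other
    default is used as-is.
    """
    filled = dict(doc)
    for key, default in defaults.items():
        if key not in filled:
            filled[key] = default() if callable(default) else default
    return filled

# Hypothetical defaults mirroring the 'user' schema in this article.
defaults = {"client_id": None, "created": datetime.now}

user = apply_defaults({"username": "rick"}, defaults)
explicit = apply_defaults({"username": "mark", "client_id": 7}, defaults)

print(sorted(user))            # ['client_id', 'created', 'username']
print(explicit["client_id"])   # 7
```

Passing the function itself (not the result of calling it) matters: each document gets its own timestamp at the moment the default is applied.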

Schema items bear a bit more explanation. Ming internally always works with objects from the ming.schema module, but it also provides shortcuts to ease schema definitions. The translation between shortcut and ming.schema.SchemaItem appears below:

shorthand     SchemaItem            Notes
None          Anything
int           Int
str           String                Unicode
float         Float
bool          Bool
datetime      DateTime
[]            Array(Anything())     Any valid array
[int]         Array(Int())
{str:None}    Object({str:None})    Any valid object
{"a": int}    Object({"a": int})    Embedded schema

Note above that we can create complex schemas using Ming. A blog post might have the following definition, for example:

BlogPost = collection(
   'blog.post', session,
   Field('_id', S.ObjectId),
   Field('posted', datetime, if_missing=datetime.utcnow),
   Field('title', str),
   Field('author', dict(
       username=str,
       display_name=str)),
   Field('text', str),
   Field('comments', [
       dict(
           author=dict(
               username=str,
               display_name=str),
           posted=S.DateTime(if_missing=datetime.utcnow),
           text=str) ]))

Note in the schema above that author is an embedded document, and comments is an embedded array of documents.
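
To give a feel for how a nested shorthand schema can be checked against a document, here is a rough, self-contained sketch. The validate function is hypothetical and far simpler than what Ming does: it assumes missing fields are acceptable, rejects unknown fields, and does not handle the {str: None} wildcard form or default values.

```python
from datetime import datetime

def validate(schema, value):
    """Check value against a shorthand schema; raise ValueError on mismatch."""
    if schema is None:                      # None: anything is accepted
        return value
    if isinstance(schema, list):            # [] or [item_schema]
        if not isinstance(value, list):
            raise ValueError("expected a list, got %r" % (value,))
        item_schema = schema[0] if schema else None
        return [validate(item_schema, item) for item in value]
    if isinstance(schema, dict):            # {"field": schema, ...}
        if not isinstance(value, dict):
            raise ValueError("expected a document, got %r" % (value,))
        unknown = set(value) - set(schema)
        if unknown:
            raise ValueError("unexpected fields: %s" % sorted(unknown))
        return {key: validate(sub, value[key])
                for key, sub in schema.items() if key in value}
    if not isinstance(value, schema):       # plain type such as int or str
        raise ValueError("expected %s, got %r" % (schema.__name__, value))
    return value

# A cut-down version of the blog post schema, written as plain shorthand.
comment_schema = {"author": {"username": str, "display_name": str},
                  "posted": datetime, "text": str}
post_schema = {"title": str, "comments": [comment_schema]}

good = {"title": "Hello", "comments": [
    {"author": {"username": "rick", "display_name": "Rick"},
     "text": "First!"}]}
validated = validate(post_schema, good)

try:
    validate(post_schema, {"title": 42})
except ValueError as exc:
    error = str(exc)

print(error)   # expected str, got 42
```

The recursion follows the shape of the schema: lists validate each item, dicts validate each field, and plain types bottom out in an isinstance check.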

Indexing

If we expected to do a lot of queries on user.username, we could add an index simply by updating the code above to read:

    ...
    Field('username', str, index=True)
    ...

Creating the indexes in the schema like this has the nice property that Ming will ensure that those indexes exist the first time it touches the database. We can also set a unique index on a field by using the unique optional argument:

    ...
    Field('username', str, unique=True)
    ...

Ming also supports specifying compound indexes by using the Index object in the collection definition. Suppose we wished to keep a separate list of users, scoped by client_id. In this case, the schema might look more like the following:

from datetime import datetime

from ming import collection, Field, Index, Session
from ming import schema as S

session = Session()
MyDoc = collection(
    'user', session,
    Field('_id', S.ObjectId),
    Field('client_id', S.ObjectId, if_missing=None),
    Field('username', str),
    Field('created', datetime, if_missing=datetime.utcnow),
    Index('client_id', 'username', unique=True),
    ...)

In the example above, the index would be created as follows:

db.user.ensure_index([('client_id', 1), ('username', 1)], unique=True)

By default, each key in an index created by Ming is sorted in ascending order. If you want to change this, you can explicitly specify the sort order for the index:

    ...
    Index(('client_id', -1), ('username', 1), unique=True)
    ...
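
The translation from Index arguments to a pymongo-style key list can be sketched as follows. normalize_index_keys is a hypothetical helper written for this illustration, not part of Ming; the only thing taken from pymongo is the convention that 1 means ascending and -1 means descending.

```python
ASCENDING, DESCENDING = 1, -1   # the same values pymongo uses

def normalize_index_keys(*fields):
    """Turn field names or (name, direction) pairs into a key list."""
    keys = []
    for field in fields:
        if isinstance(field, str):
            # A bare field name defaults to ascending order.
            keys.append((field, ASCENDING))
        else:
            name, direction = field
            if direction not in (ASCENDING, DESCENDING):
                raise ValueError("bad sort direction: %r" % (direction,))
            keys.append((name, direction))
    return keys

default_order = normalize_index_keys('client_id', 'username')
explicit_order = normalize_index_keys(('client_id', -1), ('username', 1))

print(default_order)    # [('client_id', 1), ('username', 1)]
print(explicit_order)   # [('client_id', -1), ('username', 1)]
```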

Connection and configuration

Once we've defined our schema, we can use it by binding the session to the appropriate MongoDB database using ming.datastore:

from ming import datastore

session.bind = datastore.DataStore(
    'mongodb://localhost:27017', database='test')

More typically, we will create our session as a named session and bind it somewhere else in our application (perhaps in our startup script):

import ming

session = ming.Session.by_name('test')

...

ming.config.configure_from_nested_dict(dict(
    test=dict(
        master='mongodb://localhost:27017', 
        database='test')
    ))

By using named sessions, you can decouple your schema definition code from the actual configuration of your database connection. This is often useful when you will be reading connection information from a configuration file, for instance.
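
The decoupling works because looking up a session by name always hands back the same object, so it can be bound later from anywhere. A minimal sketch of that registry pattern, using a hypothetical NamedSession class and configure function (not Ming's actual code, and with a plain dict standing in for a real datastore):

```python
class NamedSession:
    """Hypothetical stand-in for a session that can be looked up by name."""

    _registry = {}

    def __init__(self, bind=None):
        self.bind = bind            # connection details, filled in later

    @classmethod
    def by_name(cls, name):
        # Create the session on first use, then always return the same one.
        if name not in cls._registry:
            cls._registry[name] = cls()
        return cls._registry[name]

def configure(settings):
    """Bind each named session to the options given for it."""
    for name, options in settings.items():
        NamedSession.by_name(name).bind = options

# Schema module: grabs the session before any configuration exists.
session = NamedSession.by_name('test')
unbound = session.bind

# Startup script: configuration arrives later, perhaps from a file.
configure({'test': {'master': 'mongodb://localhost:27017',
                    'database': 'test'}})

print(unbound)                    # None
print(session.bind['database'])   # test
```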

Querying and updating

To show how Ming supports querying and updating, let's go back to our simple User schema above:

from datetime import datetime

from ming import collection, Field, Index, Session
from ming import schema as S

session = Session()
MyDoc = collection(
    'user', session,
    Field('_id', S.ObjectId),
    Field('client_id', S.ObjectId, if_missing=None),
    Field('username', str),
    Field('created', datetime, if_missing=datetime.utcnow),
    Index('client_id', 'username', unique=True),
    ...)

Now let's insert some data:

>>> import pymongo
>>> conn = pymongo.Connection()
>>> db = conn.test
>>> db.user.insert([
...     dict(username='rick'),
...     dict(username='jenny'),
...     dict(username='mark')])
[ObjectId('4fd24c96fb72f08265000000'), 
 ObjectId('4fd24c96fb72f08265000001'), 
 ObjectId('4fd24c96fb72f08265000002')]

To get the data back out, we simply use the collection's manager property m:

>>> MyDoc.m.find().all()
[{'username': u'rick', 
  '_id': ObjectId('4fd24c96fb72f08265000000'), 
  'client_id': None, 
  'created': datetime.datetime(2012, 6, 8, 19, 8, 28, 522073)}, 
 {'username': u'jenny', 
  '_id': ObjectId('4fd24c96fb72f08265000001'), 
  'client_id': None, 
  'created': datetime.datetime(2012, 6, 8, 19, 8, 28, 522195)}, 
 {'username': u'mark', 
  '_id': ObjectId('4fd24c96fb72f08265000002'), 
  'client_id': None, 
  'created': datetime.datetime(2012, 6, 8, 19, 8, 28, 522315)}]

Notice how Ming has filled in the values we omitted when creating the user documents. In this case, it's actually filling them in as they are returned from the database. We can drop down to the pymongo layer to see this by using the m.collection property on MyDoc:

>>> list(MyDoc.m.collection.find())
[{u'username': u'rick', 
  u'_id': ObjectId('4fd24c96fb72f08265000000')}, 
 {u'username': u'jenny', 
  u'_id': ObjectId('4fd24c96fb72f08265000001')}, 
 {u'username': u'mark', 
  u'_id': ObjectId('4fd24c96fb72f08265000002')}]

Now let's remove the documents we created and create some using Ming:

>>> MyDoc.m.remove()
>>> 
>>> MyDoc(dict(username='rick')).m.insert()
>>> MyDoc(dict(username='jenny')).m.insert()
>>> MyDoc(dict(username='mark')).m.insert()
>>> 
>>> MyDoc.m.collection.find_one()
{u'username': u'rick', 
 u'_id': ObjectId('4fd24f95fb72f08265000003'), 
 u'client_id': None, 
 u'created': datetime.datetime(2012, 6, 8, 19, 16, 37, 565000)}

Note that when we created the documents using Ming, we see the default values stored in the database.

Another thing to note above is that when we inserted the new documents, we didn't have to specify the collection. Ming documents are actually dict subclasses, but they "remember" where they came from. To update a document, all we need to do is to call .m.save() on the document:

>>> doc = MyDoc.m.get(username='rick')
>>> import bson
>>> doc.client_id=bson.ObjectId()
>>> doc.username
u'rick'
>>> doc.client_id
ObjectId('4fd250bdfb72f08265000006')
>>> doc.m.save()
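
The idea of a dict subclass that "remembers" its collection can be sketched without a database. Everything here is hypothetical (the STORE dict stands in for MongoDB, and Document and InstanceManager are illustration-only names); it is meant to show the shape of the design, not how Ming implements it.

```python
STORE = {}   # collection name -> {_id: saved document}; stand-in for MongoDB

class InstanceManager:
    """Holds the persistence methods so they stay out of the dict namespace."""

    def __init__(self, doc):
        self.doc = doc

    def save(self):
        rows = STORE.setdefault(self.doc.collection_name, {})
        rows[self.doc["_id"]] = dict(self.doc)

class Document(dict):
    collection_name = None   # set by each subclass

    @property
    def m(self):
        return InstanceManager(self)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails: fall back to keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

class User(Document):
    collection_name = "user"

doc = User({"_id": 1, "username": "rick"})
doc.client_id = 42
doc.m.save()

print(STORE["user"][1])
# {'_id': 1, 'username': 'rick', 'client_id': 42}
```

Keeping persistence methods behind m means ordinary dict methods such as get keep their usual meaning on the document itself.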

If you'd prefer to use MongoDB's atomic updates, you can use the manager method update_partial:

>>> MyDoc.m.update_partial(
...     dict(username='rick'), 
...     {'$set': { 'client_id': None}})
{u'updatedExisting': True, u'connectionId': 232, 
 u'ok': 1.0, u'err': None, u'n': 1}

More to come

There's a lot more to Ming, which I'll cover in future articles, including data polymorphism, eager and lazy data migration, GridFS support, and an object-document mapper providing object-relational type capabilities.

So what do you think? Is Ming something that you would use for your projects? Have you chosen one of the other MongoDB mappers? Please let me know in the comments below.

Other announcements

If you're looking for MongoDB and Python training classes, please sign up to hear about it when I start offering them, and to get a 25% discount on registration. And if you happen to be attending the SouthEast LinuxFest, I'd love it if you'd drop by my talk on building your first MongoDB application on Saturday morning at 11:30.

5 comments:

  1. Hi, everybody likes to write his own object mapper for MongoDB. I wrote mine 3 years ago: https://github.com/svetlyak40wt/pymongo-bongo but currently it is abandoned, because pymongo is good enough.

    I have never used Ming, but some of its syntax looks ugly to me.

    For example, why don't use `MyDoc.get` instead of `MyDoc.m.get`? And why not `MyDoc(username='rick').insert()` instead of `MyDoc(dict(username='rick')).m.insert()`?

    1. Agreed with big 40.
      In the world of MongoDB ODMs, Ming is the one with ugliest language.

      We also use our own ODM based on plain pymongo + validations system

    2. Thanks for the comments! I had intended to reply earlier, but apparently blogger ate my responses.

      The reason that you can't use .get() is because Ming collection objects are subclasses of dict, and I didn't want to hide the dict.get method.

      Passing a dict into the constructor rather than keyword arguments was for the purpose of being able to pass other keyword arguments as well, although I agree that it's kind of ugly.

      And I do disagree that pymongo is "good enough", as it doesn't provide any support for schema enforcement / documentation, but to each his own I suppose! There is also quite a bit more to Ming that I'll cover in future posts, going far beyond what pymongo provides, but I think that if Ming were nothing but schema validation it'd still be useful.

      Thanks again for the comments!

    3. Leandro: do you have some examples of what your ODM code looks like? Ming is constantly evolving and actively maintained, so it might be nice to add some syntactic sugar as we go.

      Thanks again!

    4. Ming, being inspired by SQLAlchemy, has two layers; using the lower one gives the syntax shown in the post.

      While the foundation layer is the most flexible one, you will probably end up using the ODM layer more often, as it has a declarative syntax and usage similar to SQLA ORM+Declarative.

      You can take a look at the Ming ODM tutorial for a quick introduction to the ODM layer http://merciless.sourceforge.net/odm.html

      The TurboGears2 ming support also provides a quick overview of the ODM layer: http://www.turbogears.org/2.1/docs/main/Ming.html
