Wednesday, July 04, 2012

Relational Mapping with MongoDB, Ming, and Python

Continuing on in my series on MongoDB and Python, this article delves into the object-document mapper (ODM) included with the Python MongoDB toolkit Ming. If you're just getting started with MongoDB, or with Ming, you might want to read the previous articles in the series first.

And now that you're all caught up, let's take a look at what makes up an object-document mapper, anyway...

But wait! Didn't I already describe an object-document mapper in previous Ming posts? Well, yes and no. While it's true that the 'base' level of Ming allows you to map documents from MongoDB into instances of Python classes, those classes are really just slightly-glorified dicts. Some of the things that I expect in a real ODM are missing, however:

  • Automatic persistence of updated documents - If you change an instance of ming.Document, it's up to you to make sure that change gets persisted back to MongoDB. This leads to more verbose code and generally more errors, as it's easy to forget to .m.save() a document when you're done.
  • An Identity Map - If you're working with several documents at once, you can get into a situation where you have two documents in memory that both represent the same document in MongoDB. This can cause consistency problems, particularly if the documents are both modified, but in different ways.
  • A Unit of Work - When you're doing several updates, it's nice to be able to "batch up" the updates and flush them to MongoDB all at once, particularly in a NoSQL database like MongoDB which has no multi-document transactions. You don't get true transactional behavior with a unit of work, but you can at least choose to skip the flush step and the database doesn't change.
  • Support for Relationships Between Documents - Maybe this is just my relational background showing through, but I like to be able to construct object graphs in RAM that aren't necessarily represented by embedded documents in MongoDB.

So, continuing in Ming's homage to SQLAlchemy, there is a second layer to Ming that provides all these features: the ming.odm package.

Defining an ODM model

Since everyone seems to like building either a blog or a wiki for their examples, we'll take a similar approach here, but with a twist. Suppose we're building a blogging platform that supports multiple sites, each of which can contain multiple blogs. We might model our data as follows:

[Figure: Blogging Platform Data Model]

Note in particular the 1:N relationships between Site and Blog, and between Blog and Post. Here's an example set of documents that we might store:

// Site
{ _id: ObjectId(1234...1),
  domain: 'blogs-r-us.com' }

// Blog
{ _id: ObjectId(1234...2),
  site_id: ObjectId(1234...1),
  name: 'My Awesome Blog' }

// Post
{ _id: ObjectId(1234...3),
  blog_id: ObjectId(1234...2),
  title: 'Frsit Psot',
  text: 'This is my blog. There are many...' }

Note that we've modeled everything "relationally"; a Post has a "foreign key" into Blog, which has a "foreign key" into Site. In order to actually model this in the Ming ODM, we first need a bit of setup:

from ming import Session
from ming.odm import ODMSession

doc_session = Session()
odm_session = ODMSession(doc_session=doc_session)

Because the ODM is doing so much more than the schema enforcement layer, we need a new session to tie things together. The ODMSession provides the identity map and unit of work mentioned above, as well as a connection to the regular Session (which in turn ties everything to a particular MongoDB database).

Once we have this set up, we can start defining our model. Let's look at the definition of a Site first:

from ming import schema as S
from ming.odm.declarative import MappedClass
from ming.odm.property import FieldProperty, RelationProperty

class Site(MappedClass):
    class __mongometa__:
        session = odm_session
        name = 'site'

    _id = FieldProperty(S.ObjectId)
    domain = FieldProperty(str)

    blogs = RelationProperty('Blog')

There are a couple of things to note here:

  • The Field from the schema level is replaced by a FieldProperty for ODM-level things.
  • The one-to-many relation between Site and Blog is represented by a RelationProperty. Since we haven't defined Blog yet, we'll use a string to reference Blog rather than the class itself. There's some magic involved in figuring out what to do with RelationPropertys; there will be more on that later.

Moving along, we can define Blog as follows:

from ming.odm.property import ForeignIdProperty

class Blog(MappedClass):
    class __mongometa__:
        session = odm_session
        name = 'blog'

    _id = FieldProperty(S.ObjectId)
    name = FieldProperty(str)

    site_id = ForeignIdProperty(Site)
    site = RelationProperty(Site)
    posts = RelationProperty('Post')

Here, we've introduced the ForeignIdProperty to represent our "foreign key" construct. Ming actually uses declared ForeignIdPropertys to guess what to do with RelationPropertys. A couple of things to note here:

  • We can use the Site class rather than the string "Site" since Site has been declared.
  • We don't need to specify a schema for the site_id field. Because it is a ForeignIdProperty, Ming knows to use the same validation for it that it uses for Site._id.
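To give a feel for the kind of "guessing" involved, here's a toy sketch in plain Python 3. It is not Ming's actual code; infer_relation and the foreign_ids dictionary are names I made up for illustration, and the rule shown (exactly one foreign id linking the two classes decides the direction) is an assumption about how such inference could work.

```python
def infer_relation(own_foreign_ids, other_foreign_ids, own_name, other_name):
    """Guess the direction of a relation from declared foreign ids.

    Each *_foreign_ids argument maps a field name to the name of the class
    that field points at, e.g. {'site_id': 'Site'} for Blog.
    (Hypothetical helper, not part of Ming.)
    """
    # Foreign ids on our side that point at the other class
    to_other = [f for f, target in own_foreign_ids.items() if target == other_name]
    # Foreign ids on the other side that point back at us
    to_own = [f for f, target in other_foreign_ids.items() if target == own_name]
    if len(to_other) == 1 and not to_own:
        return ('many-to-one', to_other[0])
    if len(to_own) == 1 and not to_other:
        return ('one-to-many', to_own[0])
    raise ValueError('ambiguous or missing foreign id between %s and %s'
                     % (own_name, other_name))


# The foreign ids declared in the model above, written out by hand
foreign_ids = {'Site': {}, 'Blog': {'site_id': 'Site'}, 'Post': {'blog_id': 'Blog'}}

print(infer_relation(foreign_ids['Blog'], foreign_ids['Site'], 'Blog', 'Site'))
# ('many-to-one', 'site_id')
print(infer_relation(foreign_ids['Site'], foreign_ids['Blog'], 'Site', 'Blog'))
# ('one-to-many', 'site_id')
```

The same foreign id answers both questions: Blog.site is the "one" side reached through site_id, and Site.blogs is the list of blogs whose site_id points back at the site.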

To finish out our model, the Post class looks similar:

class Post(MappedClass):
    class __mongometa__:
        session = odm_session
        name = 'post'

    _id = FieldProperty(S.ObjectId)
    title = FieldProperty(str)
    text = FieldProperty(str)

    blog_id = ForeignIdProperty(Blog)
    blog = RelationProperty(Blog)

And we're done!

Using the Model to Manipulate Data

Once everything's defined, we can create some data as follows:

>>> import blog as B
>>> from ming.datastore import DataStore
>>> B.doc_session.bind = DataStore(
...     'mongodb://localhost:27017',
...     database='blog')
>>> site = B.Site(domain='blogs-r-us.com')
>>> blog = B.Blog(name='My Awesome Blog', site=site)
>>> post = B.Post(title='Frsit Psot', text='This is my blog.', blog=blog)
>>> print blog
<Blog _id=ObjectId(...)
  site_id=ObjectId(...) name='My
  Awesome Blog'>

Notice how Ming has helpfully filled in the ForeignIdProperty values based on the objects passed into the constructor. Now let's look at the database:

>>> list(B.doc_session.db.blog.find())
[]

Recall that one of the features of the ODM is the unit of work. To flush all our changes to the database, we need to explicitly tell the ODMSession to do so. Before we do, let's take a look at it:

>>> print B.odm_session
<session>
  <UnitOfWork>
    <new>
      <Site _id=ObjectId(...)
          domain='blogs-r-us.com'>
      <Blog _id=ObjectId(...)
          site_id=ObjectId(...) name='My
          Awesome Blog'>
      <Post text='This is my blog.'
          blog_id=ObjectId(...)
          _id=ObjectId(...) title='Frsit
          Psot'>
    <clean>
    <dirty>
    <deleted>
  <imap (3)>
    Site : ... => <Site _id=ObjectId(...)
        domain='blogs-r-us.com'>
    Blog : ... => <Blog _id=ObjectId(...)
        site_id=ObjectId(...) name='My
        Awesome Blog'>
    Post : ... => <Post text='This is my blog.'
        blog_id=ObjectId(...)
        _id=ObjectId(...) title='Frsit
        Psot'>

There are a couple of things to note here:

  • The ODMSession is tracking the new Site, Blog, and Post objects we've created in its unit of work. Since each of these objects is in the new state, they will be insert()ed when the unit of work is flush()ed.
  • The ODMSession is also tracking the objects in its identity map. The purpose of the identity map is to make sure that if you perform two queries from MongoDB that return the same document, they will be represented by the same object in memory. More on this later.

Now let's go ahead and flush() and take a look at the database:

>>> B.odm_session.flush()
>>> list(B.doc_session.db.blog.find())
[{u'_id': ObjectId(...), u'site_id': ObjectId(...),
u'name': u'My Awesome Blog'}]

And we see that the blog has been stored back into MongoDB. We can look at the ODMSession once again and see that all the objects have now moved into the clean state:

>>> print B.odm_session
<session>
  <UnitOfWork>
    <new>
    <clean>
      <Site _id...
      <Blog _id...
      <Post text...
    <dirty>
    <deleted>
  <imap (3)>...

Now, let's try modifying the post title and looking at the session again:

>>> post.title = 'First Post'
>>> print B.odm_session
<session>
  <UnitOfWork>
    <new>
    <clean>
      <Site _id=...
      <Blog _id=...
    <dirty>
      <Post text...
    <deleted>
  <imap (3)>...

Note how the Post is now dirty. On the next flush(), it will be .save()d back to MongoDB:

>>> B.odm_session.flush()
>>> list(B.doc_session.db.post.find())
[{u'text': u'This is my blog.', u'blog_id': ObjectId(...), u'_id': Obj
ectId(...), u'title': u'First Post'}]
>>> print B.odm_session
<session>
  <UnitOfWork>
    <new>
    <clean>
      <Site _id=...
      <Blog _id=...
      <Post text=...
    <dirty>
    <deleted>
  <imap (3)>...

... and it's clean again. Similarly we can .delete() an object (e.g. post.delete()) to mark it as deleted, causing it to be remove()d on the next flush().
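If the new/clean/dirty/deleted bookkeeping feels abstract, here's a minimal sketch of the idea in plain Python 3. It is a toy, not Ming's implementation: ToyUnitOfWork and its method names are made up, a dict stands in for the MongoDB collection, and you have to call mark_dirty() yourself, whereas Ming notices modifications for you.

```python
class ToyUnitOfWork:
    """Tracks documents by state and applies them to a store on flush().

    Hypothetical illustration only; `store` is a plain dict {_id: doc}
    standing in for a MongoDB collection.
    """

    def __init__(self, store):
        self.store = store
        self.new, self.clean, self.dirty, self.deleted = {}, {}, {}, {}

    def add(self, doc):
        # Freshly created documents start out in the "new" state
        self.new[doc['_id']] = doc

    def mark_dirty(self, doc):
        # Only clean documents become dirty; new ones get written anyway
        if doc['_id'] in self.clean:
            self.dirty[doc['_id']] = self.clean.pop(doc['_id'])

    def mark_deleted(self, doc):
        for state in (self.new, self.clean, self.dirty):
            state.pop(doc['_id'], None)
        self.deleted[doc['_id']] = doc

    def flush(self):
        # new -> insert, dirty -> save; both end up clean afterwards
        for _id, doc in list(self.new.items()) + list(self.dirty.items()):
            self.store[_id] = dict(doc)
            self.clean[_id] = doc
        # deleted -> remove
        for _id in self.deleted:
            self.store.pop(_id, None)
        self.new.clear()
        self.dirty.clear()
        self.deleted.clear()


store = {}
uow = ToyUnitOfWork(store)
post = {'_id': 1, 'title': 'Frsit Psot'}
uow.add(post)
print(store)                 # {} (nothing is written until flush)
uow.flush()
post['title'] = 'First Post'
uow.mark_dirty(post)
uow.flush()
print(store[1]['title'])     # First Post
```

The point is the same as in the sessions above: nothing touches the store until flush(), so skipping the flush leaves the database unchanged.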

Querying using the ODM

Putting data into the database is all well and good, but to be useful we should be able to retrieve it as well. Doing so requires the use of the .query property of our classes, which serves the same purpose as the .m property in "base" Ming, or the .objects property in Django. First, let's clear out the ODM session and retrieve all the posts:

>>> B.odm_session.clear()
>>> posts = B.Post.query.find().all()
>>> posts
[<Post text=u'This is my blog.'
  blog_id=ObjectId(...)
  _id=ObjectId(...) title=u'First
  Post'>]
>>> B.odm_session
<session>
  <UnitOfWork>
    <new>
    <clean>
      <Post text=...
    <dirty>
    <deleted>
  <imap (1)>
    Post : ... 

Now let's query again to get another posts list to see the identity map in action:

>>> posts1 = B.Post.query.find().all()
>>> posts[0]
<Post text=u'This is my blog.'
  blog_id=ObjectId(...)
  _id=ObjectId(...) title=u'First
  Post'>
>>> posts1[0]
<Post text=u'This is my blog.'
  blog_id=ObjectId(...)
  _id=ObjectId(...) title=u'First
  Post'>
>>> posts[0] is posts1[0]
True

Interesting - if we perform a query that returns the same document, we get back the same Python object. This is important for maintaining consistency in cases where you might arrive at the same object via two different query paths. The precise guarantee that the identity map provides is as follows:

Any two references to an instance of the same class in the same session with
the same `_id` value are the *same instance*.
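Here's a rough sketch of how that guarantee can be provided, in plain Python 3. Again, this is a toy under my own assumptions rather than Ming's code: ToyIdentityMap, its load() method, and the stripped-down Post class are all hypothetical names.

```python
class ToyIdentityMap:
    """Maps (class, _id) to the single in-memory instance for that document.

    Hypothetical illustration only, not Ming's implementation.
    """

    def __init__(self):
        self._objects = {}

    def load(self, cls, doc):
        key = (cls, doc['_id'])
        # Only build a new instance the first time we see this (class, _id)
        if key not in self._objects:
            self._objects[key] = cls(**doc)
        return self._objects[key]

    def clear(self):
        self._objects.clear()


class Post:
    def __init__(self, _id, title):
        self._id = _id
        self.title = title


imap = ToyIdentityMap()
raw = {'_id': 3, 'title': 'First Post'}   # pretend two queries returned this
first = imap.load(Post, dict(raw))
second = imap.load(Post, dict(raw))
print(first is second)   # True
```

Note that the key includes the class, so a Blog and a Post that happened to share an _id would still be distinct objects, and that clearing the map (like clearing the session) means the next load builds a fresh instance.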

Now let's see what Blog this post is a part of:

>>> post.blog
<Blog _id=ObjectId(...)
  site_id=ObjectId(...) name=u'My
  Awesome Blog'>
>>> print B.odm_session
<session>
  <UnitOfWork>
    <new>
    <clean>
      <Post text...
      <Blog _id...
    <dirty>
    <deleted>
  <imap (2)>...

Behind the scenes, Ming has queried the database to retrieve the Blog whose _id value matches the blog_id ForeignIdProperty value from the Post. Likewise, the blog's posts property contains a list including the post we created:

>>> blog = post.blog
>>> blog.posts
I[<Post text=u'This is my blog.'
  blog_id=ObjectId(...)
  _id=ObjectId(...) title=u'First
  Post'>]
>>> blog.posts[0] is post
True
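To illustrate the general idea of resolving a relation through a foreign id on attribute access, here's a small sketch in plain Python 3 using descriptors. Everything here is hypothetical (ToyManyToOne, ToyOneToMany, and the module-level DB dict standing in for the database and session); Ming's RelationProperty does considerably more.

```python
DB = {'blog': {}, 'post': {}}   # collection name -> {_id: object}; a stand-in for MongoDB


class ToyManyToOne:
    """Descriptor: follow a foreign id on this object to the one related object."""

    def __init__(self, collection, fk_field):
        self.collection = collection
        self.fk_field = fk_field

    def __get__(self, instance, owner):
        if instance is None:
            return self
        # A real ODM would issue a query (via the session) at this point
        return DB[self.collection].get(getattr(instance, self.fk_field))


class ToyOneToMany:
    """Descriptor: find all objects whose foreign id points back at this one."""

    def __init__(self, collection, fk_field):
        self.collection = collection
        self.fk_field = fk_field

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return [obj for obj in DB[self.collection].values()
                if getattr(obj, self.fk_field) == instance._id]


class Blog:
    posts = ToyOneToMany('post', 'blog_id')

    def __init__(self, _id, name):
        self._id, self.name = _id, name
        DB['blog'][_id] = self


class Post:
    blog = ToyManyToOne('blog', 'blog_id')

    def __init__(self, _id, title, blog_id):
        self._id, self.title, self.blog_id = _id, title, blog_id
        DB['post'][_id] = self


blog = Blog(2, 'My Awesome Blog')
post = Post(3, 'First Post', blog_id=2)
print(post.blog is blog)       # True
print(blog.posts[0] is post)   # True
```

Only the foreign id is stored on the object; the related objects are looked up each time the attribute is read.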

Automating the Session

So one of the things that I wanted in an ODM was automatic persistence of documents. In all the above examples, we actually have to manually call flush() each time we want to persist changes. To rectify this, at least in the context of web applications, Ming provides WSGI middleware that will flush at the end of each web request if everything went through fine, or clear the session if there is an error. To use the middleware, simply wrap your WSGI application in MingMiddleware:

from ming.odm.middleware import MingMiddleware

app = some_application_factory()
app = MingMiddleware(app)

To have everything work well, you should also use a ThreadLocalODMSession provided by Ming when defining your classes. The signature is the same as the ODMSession:

from ming import Session
from ming.odm import ThreadLocalODMSession

doc_session = Session()
odm_session = ThreadLocalODMSession(doc_session=doc_session)

By default, MingMiddleware will clear the session without flushing if any exception is raised during request handling. The one exception is webob.exc.HTTPRedirection, which is typically not considered an error. You can modify this behavior by passing the flush_on_errors keyword argument to the MingMiddleware constructor.
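The flush-or-clear pattern itself is simple enough to sketch in plain Python 3. This is a simplified illustration, not MingMiddleware's source: ToyFlushMiddleware and RecordingSession are hypothetical names, the session is passed in explicitly, and I'm assuming it exposes flush() and clear().

```python
class ToyFlushMiddleware:
    """WSGI middleware: flush the session on success, clear it on error.

    Hypothetical illustration; exceptions listed in flush_on_errors are
    treated as "not really errors" and still trigger a flush.
    """

    def __init__(self, app, session, flush_on_errors=()):
        self.app = app
        self.session = session
        self.flush_on_errors = tuple(flush_on_errors)

    def __call__(self, environ, start_response):
        try:
            result = self.app(environ, start_response)
        except self.flush_on_errors:
            # e.g. a redirect raised as an exception: persist changes anyway
            self.session.flush()
            raise
        except Exception:
            # A genuine error: throw away pending changes
            self.session.clear()
            raise
        else:
            self.session.flush()
            return result


class RecordingSession:
    """Fake session that just records which methods were called."""

    def __init__(self):
        self.calls = []

    def flush(self):
        self.calls.append('flush')

    def clear(self):
        self.calls.append('clear')


def hello_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


session = RecordingSession()
app = ToyFlushMiddleware(hello_app, session)
app({}, lambda status, headers: None)
print(session.calls)   # ['flush']
```

Because the pending changes live in the unit of work until flush(), clearing the session on an error is all it takes to make sure a failed request writes nothing.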

Feedback Encouraged

There are certainly still some rough edges around Ming's ODM, but it's still being actively developed and maintained, so we're always interested in feedback. So what do you think? Does the ODM add enough to Ming's functionality to convince you to try it out? Let me know in the comments below!
