Tuesday, January 31, 2012

Moving along with PyMongo

In a previous post, I introduced MongoDB and its Python driver pymongo. Here, I continue in the same vein, describing in more detail how you can become productive using pymongo.

Connecting to the Database

In the last post, I gave a brief overview of how to connect to a database hosted on the MongoLab service using pymongo, so I won't go into detail on that here. Instead, I will mention a few options you'll want to be aware of.

Some of the connection options set the default policy for the safety of your data:

  • safe MongoDB by default operates in a "fire-and-forget" mode where all data-modifying operations are optimistically assumed to succeed. Turning on safe mode changes this, waiting for a response from the database server indicating that an operation has succeeded or failed before returning from a data-modifying operation.
  • journal In version 1.8, MongoDB introduced journaling to provide single-server durability. MongoDB's flexible approach to balancing safety and performance, however, means that if your application wants to make sure its data has really been saved, you need to wait for the journal to be written.
  • fsync This is the really, really safe option. With or without a journal, this will wait until your data is on the physical disk before it returns from update operations.
  • w Before journaling, MongoDB used (and still uses) replication to ensure the durability of your data. The w parameter allows you to control how many servers (or which set of servers) your update has been replicated to before returning from a data-modifying operation. Be aware that this parameter can significantly slow down your updates, particularly if you are requiring them to be replicated to different datacenters.
  • read_preference This option allows you to specify how you'd like to handle queries. By default, even in a replica set configuration, all your queries will be routed to the primary server in the replica set to ensure strong consistency. Using read_preference you can change this policy, allowing your queries to be distributed across the secondaries of your replica set for increased performance at the cost of moving to "eventual consistency."
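
Several of these options also have counterparts in the MongoDB connection string format, where they are spelled w, journal, and readPreference. Here is a minimal sketch of building such a URI with only the standard library. The helper name build_mongo_uri and the option values are my own illustrative assumptions, not part of pymongo:

```python
from urllib.parse import urlencode


def build_mongo_uri(host="localhost", port=27017, **options):
    """Return a mongodb:// URI carrying the given connection options.

    Hypothetical helper for illustration; it does not validate option names.
    """
    base = "mongodb://%s:%d/" % (host, port)
    if not options:
        return base
    pairs = []
    for key in sorted(options):  # sorted so the output is deterministic
        value = options[key]
        if isinstance(value, bool):
            # Connection strings spell booleans in lowercase.
            value = "true" if value else "false"
        pairs.append((key, value))
    return base + "?" + urlencode(pairs)


# Example values only: wait for 2 servers, wait for the journal,
# and allow reads from secondaries.
uri = build_mongo_uri(w=2, journal=True, readPreference="secondaryPreferred")
print(uri)
```

Keeping the safety policy in one URI string makes it easy to move between a relaxed development setup and a stricter production one through configuration alone.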

One thing that's nice about the pymongo connection is that it's automatically pooled. What this means is that pymongo maintains a pool of connections to the MongoDB server that it reuses over the lifetime of your application. This is good for performance since it means pymongo doesn't need to go through the overhead of establishing a connection each time it does an operation. Mostly, this happens automatically. You do, however, need to be aware of the connection pooling, since you need to manually notify pymongo that you're "done" with a connection in the pool so it can be reused.

Now that all that is out of the way, the easiest way to connect to a MongoDB database from Python is below (assuming you are running a MongoDB server locally, and that you have installed ipython and pymongo):

In [1]: import pymongo

In [2]: conn = pymongo.Connection()

Inserting Documents

Inserting documents begins by selecting a database. To create a database, you do... well, nothing, actually. The first time you refer to a database, the MongoDB server creates it for you automatically. Once you have your database, you need to decide which "collection" to store your documents in. To create a collection, you do... right - nothing. So here we go:

In [3]: db = conn.tutorial

In [4]: db.test
Out[4]: Collection(Database(Connection('localhost', 27017), u'tutorial'), u'test')

In [5]: db.test.insert({'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[5]: ObjectId('4f25bcffeb033049af000000')

One thing to note here is that the insert command returned us an ObjectId value. This is the value that pymongo generated for the _id property, the "primary key" of a MongoDB document. We can also manually specify the _id if we want (and we don't have to use ObjectIds):

In [6]: db.test.insert({'_id': 42, 'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[6]: 42

A note on document structure

A note is in order here about the kind of documents you can insert into a collection. Technically, the type of document is described by the BSON spec, but practically you can think of it as JSON plus a few extra types. Some of the types you should be aware of are:

  • primitive types Python ints, strings, and floats are automatically handled appropriately by pymongo.
  • object Objects are represented by pymongo as regular python dicts. Technically, in BSON, objects are ordered dictionaries, so if you need or want to rely on the ordering you can use the bson module, included with pymongo, to create such objects. Objects have strings as their keys and can have any valid BSON type as their values.
  • array Arrays are represented by pymongo as Python lists. Again, any valid BSON type can be used as the values, and the values in an array do not need to be of the same type.
  • ObjectId ObjectIds can be thought of as globally unique identifiers. They are often used to generate "safe" primary keys for collections without the overhead of using a sequence generator as is often done in relational databases.
  • Binary Strings in BSON are stored as UTF-8 - encoded unicode. To store non-unicode data you must use the bson.Binary wrapper around a Python string.
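
To make the "string keys, nested values" rule concrete, here is a rough sketch of a checker for a simplified subset of these types. The name check_document is hypothetical, and it deliberately ignores ObjectId, dates, and the other BSON extras so that it runs with nothing but the standard library. It is written for Python 3, where bytes stands in for binary data:

```python
def check_document(value, path="document"):
    """Raise TypeError if `value` falls outside a simplified BSON-like subset.

    Illustration only: real BSON supports more types than are listed here.
    """
    simple = (type(None), bool, int, float, str, bytes)
    if isinstance(value, simple):
        return
    if isinstance(value, list):
        # Arrays may mix types, so each element is checked on its own.
        for i, item in enumerate(value):
            check_document(item, "%s.%d" % (path, i))
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError("%s has a non-string key: %r" % (path, key))
            check_document(item, "%s.%s" % (path, key))
        return
    raise TypeError("%s has unsupported type %s" % (path, type(value).__name__))


# The document inserted earlier passes the check.
check_document({'name': 'My Document', 'ids': [1, 2, 3], 'subdocument': {'a': 2}})
```

The path argument is there only to make error messages point at the offending field, using the same dot notation MongoDB uses in queries.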

Documents in MongoDB also have a size limit (16MB in the latest version), which is enough for many use-cases, but is something you'll need to be aware of when designing your schemas.

Batch inserts

MongoDB and pymongo also allow you to insert multiple documents with a single API call (and a single trip to the server). This can significantly speed up your inserts, and is useful for things like batch loads. To perform a multi-insert, you simply pass a list of documents to the insert() method rather than a single document:

In [7]: db.test.insert([{'a':1}, {'a':2}])
Out[7]: [ObjectId('4f25c0aceb033049af000001'), ObjectId('4f25c0aceb033049af000002')]

You may have noticed that the structure of the documents inserted in this snippet differs (significantly!) from the documents inserted before. MongoDB does not make any requirements that documents share structure with one another. Analogous to Python's dynamic typing, MongoDB is a dynamically typed database, where the structure of the document is stored along with the document itself. Practically, I've found it useful to group similarly structured documents into collections, but it's definitely not a hard-and-fast rule.
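
For the batch loads mentioned above, one possible approach is to split a large stream of documents into fixed-size lists and pass each list to insert(). The sketch below shows only the splitting step. The helper name chunked and the default batch size of 1000 are arbitrary assumptions of mine, not a pymongo recommendation:

```python
def chunked(documents, batch_size=1000):
    """Yield successive lists of at most `batch_size` documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit whatever is left over after the last full batch.
        yield batch


docs = [{'a': i} for i in range(5)]
batches = list(chunked(docs, batch_size=2))
print([len(b) for b in batches])
```

Each element of batches is a plain list of dicts, which is the shape the multi-insert call shown earlier expects.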

Querying

OK, now that you've got your data into the database, how do you get it back out again? That's the function of the find() method on collection objects. With no parameters, it will return all the documents in a collection as a Python iterator:

In [8]: db.test.find()
Out[8]: <pymongo.cursor.Cursor at 0x7f2910068b90>

In [9]: list(db.test.find())
Out[9]: 
[{u'_id': ObjectId('4f25bce9eb033049a0000000'),
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'),
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}},
 {u'_id': 42,
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25c0aceb033049af000001'), u'a': 1},
 {u'_id': ObjectId('4f25c0aceb033049af000002'), u'a': 2}]

Most of the time, however, you'll want to select particular documents to return. You do this by providing a query as the first parameter to find. Queries are represented as BSON objects as well, and are similar to query-by-example that you may have seen in other database technologies. For instance, to retrieve all documents that have the name 'My Document', you would use the following query:

In [13]: list(db.test.find({'name':'My Document'}))
Out[13]: 
[{u'_id': ObjectId('4f25bce9eb033049a0000000'),
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'),
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}},
 {u'_id': 42,
  u'ids': [1, 2, 3],
  u'name': u'My Document',
  u'subdocument': {u'a': 2}}]

MongoDB also allows you to 'reach inside' embedded documents using the dot notation. Here are some examples of how you can use that:

In [22]: db.testq.insert([
   ....:         { 'a': { 'b': 1 }, 'c': [{'d': 1}, {'d':2}, {'d':3}]},
   ....:         { 'a': { 'b': 2 }, 'c': [{'d': 2}, {'d':3}, {'d':4}]},
   ....:         { 'a': { 'b': 3 }, 'c': [{'d': 3}, {'d':4}, {'d':5}]}
   ....:         ])
Out[22]: 
[ObjectId('4f25c89beb033049af000009'),
 ObjectId('4f25c89beb033049af00000a'),
 ObjectId('4f25c89beb033049af00000b')]

In [23]: # reach inside objects

In [24]: list(db.testq.find({'a.b': 1}))
Out[24]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

In [25]: list(db.testq.find({'a.b': 2}))
Out[25]: 
[{u'_id': ObjectId('4f25c89beb033049af00000a'),
  u'a': {u'b': 2},
  u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]

In [26]: # find objects where *any* value in an array matches 

In [27]: list(db.testq.find({'c': {'d':1}}))
Out[27]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

In [28]: # reach into an array

In [29]: list(db.testq.find({'c.d': 2}))
Out[29]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'),
  u'a': {u'b': 2},
  u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]

In [30]: list(db.testq.find({'c.1.d': 2}))
Out[30]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
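
If the matching rules in these examples feel a bit magical, it can help to emulate them in plain Python. The following is a simplified sketch of equality matching with dot notation. It is my own toy model (the names resolve and matches are hypothetical), not how MongoDB implements queries, and it handles only the cases shown above:

```python
def resolve(value, parts):
    """Yield every value reachable from `value` by following the path `parts`."""
    if not parts:
        yield value
        return
    head, rest = parts[0], parts[1:]
    if isinstance(value, dict):
        if head in value:
            yield from resolve(value[head], rest)
    elif isinstance(value, list):
        # A numeric path component selects one array position...
        if head.isdigit() and int(head) < len(value):
            yield from resolve(value[int(head)], rest)
        # ...and any component also fans out over embedded documents.
        for item in value:
            if isinstance(item, dict) and head in item:
                yield from resolve(item[head], rest)


def matches(doc, query):
    """True if every dotted path in `query` equals, or is contained in, the doc."""
    for path, expected in query.items():
        found = False
        for candidate in resolve(doc, path.split('.')):
            if candidate == expected:
                found = True
            elif isinstance(candidate, list) and expected in candidate:
                found = True  # "any value in an array matches"
            if found:
                break
        if not found:
            return False
    return True


DOCS = [
    {'a': {'b': 1}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]},
    {'a': {'b': 2}, 'c': [{'d': 2}, {'d': 3}, {'d': 4}]},
    {'a': {'b': 3}, 'c': [{'d': 3}, {'d': 4}, {'d': 5}]},
]

for query in ({'a.b': 1}, {'c': {'d': 1}}, {'c.d': 2}, {'c.1.d': 2}):
    hits = [doc['a']['b'] for doc in DOCS if matches(doc, query)]
    print(query, '->', hits)
```

The printed lists identify the matching documents by their a.b value, which makes it easy to compare against the session output above.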

One other thing that's important to be aware of is that you can return only a subset of the fields in a query. (By default, _id is always returned unless you explicitly suppress it.) Here is an example:

In [31]: list(db.testq.find({'c.1.d':2}, {'c':1}))
Out[31]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

In [32]: # we can also reach inside when specifying which fields to return

In [33]: list(db.testq.find({'c.1.d':2}, {'a.b':1}))
Out[33]: [{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}}]
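
The field selection can be emulated in the same spirit. This is a rough sketch that keeps _id plus the requested dotted fields, following nested dicts only. The function name project is hypothetical, and it skips the finer points of the real behavior (such as selecting fields inside arrays or suppressing _id):

```python
def project(doc, fields):
    """Return a copy of `doc` holding `_id` and the dotted `fields` only."""
    result = {}
    if '_id' in doc:
        result['_id'] = doc['_id']  # included by default, as noted above
    for dotted in fields:
        source, target = doc, result
        parts = dotted.split('.')
        for i, part in enumerate(parts):
            if not isinstance(source, dict) or part not in source:
                break  # the path does not exist in this document
            if i == len(parts) - 1:
                target[part] = source[part]
            else:
                target = target.setdefault(part, {})
                source = source[part]
    return result


doc = {'_id': 9, 'a': {'b': 1, 'x': 0}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]}
print(project(doc, ['c']))
print(project(doc, ['a.b']))
```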

We can also restrict the number of results returned from a query by skipping some documents and limiting the total number returned:

In [66]: list(db.testq.find())
Out[66]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'),
  u'a': {u'b': 2},
  u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'),
  u'a': {u'b': 3},
  u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]

In [67]: list(db.testq.find().skip(1).limit(1))
Out[67]: 
[{u'_id': ObjectId('4f25c89beb033049af00000a'),
  u'a': {u'b': 2},
  u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
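
A common use of skip and limit is pagination. Here is a small sketch of the arithmetic involved, with itertools.islice standing in for the server so it runs without a database. The helper names page_window and emulate_skip_limit are my own, and the 1-based page numbering is an assumption:

```python
from itertools import islice


def page_window(page, per_page):
    """Translate a 1-based page number into (skip, limit) values."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return (page - 1) * per_page, per_page


def emulate_skip_limit(items, skip, limit):
    """Mimic cursor.skip(skip).limit(limit) on any Python iterable."""
    return list(islice(items, skip, skip + limit))


skip, limit = page_window(page=2, per_page=1)
print(skip, limit)
print(emulate_skip_limit(['first', 'second', 'third'], skip, limit))
```

With a real collection, the two numbers would be passed to skip() and limit() on the cursor, as in the session above.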

The query and advanced query pages in the MongoDB docs provide the full query syntax, which includes inequalities, size and type operations, and more.

Indexing

Like any database, MongoDB can only perform so well by scanning collections for matches. To provide better performance, MongoDB can use indexes on its collections. The normal method of specifying an index in MongoDB is by "ensuring" its existence, in keeping with the pattern of having things spring into existence when they're needed. To create an index on our 'test' collection, for instance, we would do the following:

In [13]: db.test.drop()

In [14]: db.test.ensure_index('a')
Out[14]: u'a_1'

In [15]: db.test.insert([
   ....: {'a': 1, 'b':2}, {'a':2, 'b':3}, {'a':3, 'b':4}])
Out[15]: 
[ObjectId('4f28261deb033053bc000000'),
 ObjectId('4f28261deb033053bc000001'),
 ObjectId('4f28261deb033053bc000002')]

In [16]: db.test.find({'a':2})
Out[16]: <pymongo.cursor.Cursor at 0x24f7b90>

In [17]: db.test.find_one({'a':2})
Out[17]: {u'_id': ObjectId('4f28261deb033053bc000001'), u'a': 2, u'b': 3}

Ok, well that's not actually all that exciting. However, MongoDB provides an explain() method that lets us see whether our index is getting used:

In [18]: db.test.find({'a':2}).explain()
Out[18]: 
{u'allPlans': [{u'cursor': u'BtreeCursor a_1',
   u'indexBounds': {u'a': [[2, 2]]}}],
 u'cursor': u'BtreeCursor a_1',
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 1,
 u'nscannedObjects': 1}

There are several important things to note about the result here. The most important is the cursor type, BtreeCursor a_1. This means that it's using an index, and in particular the index named a_1 that we created above. If the field is not indexed, we get a different query plan from MongoDB:

In [19]: db.test.find({'b':2}).explain()
Out[19]: 
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
 u'indexBounds': {},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 3,
 u'nscannedObjects': 3}

Here, MongoDB is using a BasicCursor. For you SQL experts out there, this is equivalent to a full table scan, and is very inefficient. Note also that when we queried the indexed field, nscanned and nscannedObjects were both one, meaning that it only had to check one object to satisfy the query, whereas in the case of our unindexed field, we had to check every document in the collection.
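
If you find yourself reading many of these plans, a couple of small helpers can summarize them. This sketch relies only on the fields shown in the output above (cursor, n, nscanned). Other server versions may report plans in a different shape, so treat it as tied to this format. The helper names are hypothetical:

```python
def uses_index(plan):
    """True if the plan's cursor type indicates a B-tree index was used."""
    return plan.get('cursor', '').startswith('BtreeCursor')


def scan_ratio(plan):
    """Index entries scanned per document returned, or None if none returned."""
    returned = plan.get('n', 0)
    if returned == 0:
        return None
    return plan['nscanned'] / returned


# Trimmed copies of the two explain() results shown above.
indexed = {'cursor': 'BtreeCursor a_1', 'n': 1, 'nscanned': 1}
unindexed = {'cursor': 'BasicCursor', 'n': 1, 'nscanned': 3}

print(uses_index(indexed), scan_ratio(indexed))
print(uses_index(unindexed), scan_ratio(unindexed))
```

A ratio close to 1 means the query touched little more than what it returned. A large ratio is a hint that an index is missing or not selective enough.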

MongoDB has an extremely fast query that it can use in some cases where it doesn't have to scan any objects, only the index entries. This happens when the only data you're returning from a query is part of the index:

In [36]: db.test.find({'a':2}, {'a':1, '_id':0}).explain()
Out[36]: 
...
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': True,
 u'isMultiKey': False,
...

Note here the indexOnly field is true, specifying that MongoDB only had to inspect the index (and not the actual collection data) to satisfy the query.

Another thing that's nice about the MongoDB index system is that it can use compound indexes (indexes that include more than one field) to satisfy some queries. In this case, you should specify the direction of each field since MongoDB stores indexes as B-trees. An illustration is probably best. First, we'll drop our a_1 index and ensure a new index on a and b, both ascending:

In [44]: db.test.drop_index('a_1')

In [45]: db.test.ensure_index([('a', 1), ('b', 1)])
Out[45]: u'a_1_b_1'

Now, let's see what happens when we query by just a:

In [55]: db.test.find({'a': 2}).explain()
Out[55]: 
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[2, 2]],
  u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...

MongoDB's optimizer here is "smart" enough to use the compound (a,b) index to query just the a value. What if we query just the b value?

In [56]: db.test.find({'b': 2}).explain()
Out[56]: 
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...

Oops! That doesn't work because the index is sorted with keys (a,b). Key order also becomes important when we want to sort our results:

In [64]: db.test.find().sort([ ('a', 1), ('b', 1)]).explain()
Out[64]: 
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[{u'$minElement': 1}, {u'$maxElement': 1}]],
  u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...

In [65]: db.test.find().sort([ ('a', 1), ('b', -1)]).explain()
Out[65]: 
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...

So if we sort in the same order as our index, we can use the B-tree index to sort the results. If we sort in a different order, we can't, so MongoDB has to actually load the entire result set into RAM and sort it there. (MongoDB can actually use an index for the exact reverse of our sort order as well, so [(a, -1), (b, -1)] would have worked just fine.) In a collection of 3 documents, this isn't important, but as your data grows, this can become quite slow, in some cases actually returning an error because the result set is too large to sort in RAM.
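
The two rules of thumb from this section (queries need the leading key of a compound index, and sorts need the index order or its exact reverse) can be written down as a simplified model. This is a sketch of the behavior observed above, not of MongoDB's actual optimizer, and both function names are hypothetical:

```python
def usable_for_query(index, query_fields):
    """True if the query constrains the leading key of the index."""
    return index[0][0] in query_fields


def usable_for_sort(index, sort):
    """True if `sort` walks a prefix of `index` forwards or exactly backwards."""
    if not sort or len(sort) > len(index):
        return False
    prefix = index[:len(sort)]
    if [key for key, _ in prefix] != [key for key, _ in sort]:
        return False
    same = all(i_dir == s_dir for (_, i_dir), (_, s_dir) in zip(prefix, sort))
    reverse = all(i_dir == -s_dir for (_, i_dir), (_, s_dir) in zip(prefix, sort))
    return same or reverse


index = [('a', 1), ('b', 1)]
print(usable_for_query(index, {'a'}), usable_for_query(index, {'b'}))
for sort in ([('a', 1), ('b', 1)], [('a', 1), ('b', -1)], [('a', -1), ('b', -1)]):
    print(sort, usable_for_sort(index, sort))
```

If you really need the mixed a ascending, b descending order, the model suggests the fix: declare the index with those same directions.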

Deleting data

Deleting data in MongoDB is fairly straightforward. All you need to do is to pass a query to the remove() method on the collection, and MongoDB will delete all documents from the collection that match the query. (Note that deletes can still be slow if you specify the query inefficiently, as they use indexes just like queries do.)

In [72]: list(db.testq.find())
Out[72]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'),
  u'a': {u'b': 2},
  u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'),
  u'a': {u'b': 3},
  u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]

In [73]: db.testq.remove({'a.b': {'$gt': 1}})

In [74]: list(db.testq.find())
Out[74]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 1},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

Updating data

In many cases, an update in MongoDB is as straightforward as calling save() on a python dict:

In [76]: doc = db.testq.find_one({'a.b': 1})

In [77]: doc['a']['b'] = 4

In [78]: db.testq.save(doc)
Out[78]: ObjectId('4f25c89beb033049af000009')

In [79]: list(db.testq.find())
Out[79]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 4},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

save() can also be used to insert documents if they don't exist yet (this check is done by checking the dict to be saved for the _id key).

Unlike some other NoSQL solutions, MongoDB allows you to do quick, in-place updates of documents using special modifier operations. For instance, to set a field to a particular value, you can do the following:

In [81]: db.testq.update({'a.b': 4}, {'$set': {'a.b': 7}})

In [82]: list(db.testq.find())
Out[82]: 
[{u'_id': ObjectId('4f25c89beb033049af000009'),
  u'a': {u'b': 7},
  u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

MongoDB provides several different modifiers you can use to update documents in place, including the following (for more details see updates):

  • $inc Increment a numeric field (generalized; can increment by any number)
  • $set Set certain fields to new values
  • $unset Remove a field from the document
  • $push Append a value onto an array in the document
  • $pushAll Append several values onto an array
  • $addToSet Add a value to an array if and only if it does not already exist
  • $pop Remove the last (or first) value of an array
  • $pull Remove all occurrences of a value from an array
  • $pullAll Remove all occurrences of any of a set of values from an array
  • $rename Rename a field
  • $bit Bitwise updates
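
To get a feel for what the modifiers do to a document, here is a toy emulation of just $set and $inc on a Python dict, including dotted paths that step into arrays by position. It is a sketch for building intuition (the names _walk and apply_update are hypothetical) and it does not cover the other modifiers or the server's error handling:

```python
import copy


def _walk(doc, dotted):
    """Return (container, key) for the last step of a dotted path."""
    parts = dotted.split('.')
    current = doc
    for part in parts[:-1]:
        if isinstance(current, list):
            current = current[int(part)]  # numeric step into an array
        else:
            current = current.setdefault(part, {})
    last = parts[-1]
    if isinstance(current, list):
        last = int(last)
    return current, last


def apply_update(doc, update):
    """Return a modified copy of `doc`; only $set and $inc are emulated."""
    result = copy.deepcopy(doc)
    for path, value in update.get('$set', {}).items():
        container, key = _walk(result, path)
        container[key] = value
    for path, amount in update.get('$inc', {}).items():
        container, key = _walk(result, path)
        if isinstance(container, dict):
            # A missing field is treated as starting from zero.
            container[key] = container.get(key, 0) + amount
        else:
            container[key] += amount
    return result


doc = {'a': {'b': 4}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]}
print(apply_update(doc, {'$set': {'a.b': 7}}))

# The dotted key is an ordinary string, so it can be built from a variable.
ordinal = 1
print(apply_update(doc, {'$inc': {'c.%d.d' % ordinal: 1}}))
```

Since the update document is just a dict with string keys, ordinary string formatting is enough to target an array position chosen at runtime.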

There's a lot more that I could cover, but hopefully that whets your appetite to learn more about MongoDB. In future posts, I'll discuss how to use GridFS (a "filesystem" on top of MongoDB) and MongoDB's various aggregation options, as well as how you can use Ming to simplify certain operations.

11 comments:

  1. Thanks for this detailed & accurate intro, Rick.

  2. Thanks for the comment. Glad you liked it!

  3. One point of clarification: I mentioned that BSON documents are ordered dicts. This is true, but unfortunately you can't count on the order remaining consistent, as the keys in a document may be re-sorted alphabetically if the document needs to be moved (as it does when it outgrows its padding). This behavior is documented in https://jira.mongodb.org/browse/SERVER-2592 .

    I should also mention that you can use bson.son.SON (in pymongo) or collections.OrderedDict (in Python 2.7+ standard library) to create documents with a particular order, though as I mentioned, if the documents grow, the keys may be reordered. Thanks to Bernie Hackett for these tips.

  4. Really useful info for a beginner like me! Thank you very much! :)

    1. Hi Vincent,

      Glad you found the tutorial useful, and thanks for the comment!

  5. Anonymous, 6:52 PM

    Thanks for the article. I'm a bit surprised that it can only sort what will fit in RAM, though.

    1. Thanks for the comment! I was also surprised when I discovered the sorting limitation. Index and schema design can help a lot with that, however.

  6. Nicely written, useful tutorial. Thanks

    1. Glad you liked it, and thanks for the comment!

  7. Rick,
    I am stuck on the syntax for updating. I want to include a variable and embedded doc. In your example on updating above, it would be like this:

    ordinal = 1
    [{u'_id': ObjectId('4f25c89beb033049af000009'),
    u'a': [{u'x': 4, u'y': 5, u'z': 6}, {u'x': 7, u'y': 8, u'z': 9},{u'x': 1, u'y': 2, u'z': 3}]
    u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]

    so what I am trying to do is increment say 'y' from the second embeded doc with:

    db.col.update({'_id':'ObjectId('4f25c89beb033049af000009')'}, {'$inc':{'a'[ordinal]'y':1}},upsert=True,multi=False)

    I can't figure out the proper syntax to put 'y' after the ordinal variable that selects the embedded doc.

    Can you help me?

  8. What you want is this:

    db.col.update(
    {'_id': ObjectId('4f25c89beb033049af000009')},
    { '$inc': { 'a.1.y': 1 } })
