In a previous post, I introduced MongoDB and its Python driver pymongo. Here, I continue in the same vein, describing in more detail how you can become productive using pymongo.
Connecting to the Database
In the last post, I gave a brief overview of how to connect to a database hosted on the MongoLab service using pymongo, so I won't go into detail on that here. Instead, I will mention a few options you'll want to be aware of.
Some of the connection options set the default policy for the safety of your data:
- safe MongoDB by default operates in a "fire-and-forget" mode where all data-modifying operations are optimistically assumed to succeed. Turning on safe mode changes this, waiting for a response from the database server indicating that an operation has succeeded or failed before returning from a data-modifying operation.
- journal In version 1.8, MongoDB introduced journaling to provide single-server durability. MongoDB's flexible approach to balancing safety and performance, however, means that if your application wants to make sure its data has really been saved, you need to wait for the journal to be written.
- fsync This is the really, really safe option. With or without a journal, this will wait until your data is on the physical disk before it returns from update operations.
- w Before journaling, MongoDB used (and still uses) replication to ensure the durability of your data. The w parameter allows you to control how many servers (or which set of servers) your update has been replicated to before returning from a data-modifying operation. Be aware that this parameter can significantly slow down your updates, particularly if you are requiring them to be replicated to different datacenters.
- read_preference This option allows you to specify how you'd like to handle queries. By default, even in a replica set configuration, all your queries will be routed to the primary server in the replica set to ensure strong consistency. Using read_preference you can change this policy, allowing your queries to be distributed across the secondaries of your replica set for increased performance at the cost of moving to "eventual consistency."
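As a rough illustration of how these policies can be written down, the sketch below assembles a MongoDB connection string that carries comparable settings. This goes beyond the original examples: it assumes a driver that accepts `w`, `journal`, and `readPreference` as connection-string options, the host and the values are placeholders, and `build_uri` is a hypothetical helper that uses only the Python standard library.

```python
from urllib.parse import urlencode

def build_uri(host="localhost", port=27017, **options):
    """Assemble a mongodb:// URI with the given options as a query string."""
    uri = "mongodb://{}:{}/".format(host, port)
    if options:
        uri += "?" + urlencode(options)
    return uri

# Placeholder policy: wait for a majority of replica-set members and for the
# journal, and allow reads from a secondary when one is available.
uri = build_uri(w="majority", journal="true", readPreference="secondaryPreferred")
print(uri)
```

Keeping the policy in one string makes it easy to see, and to change, the trade-off between safety and speed for a whole application.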
One thing that's nice about the pymongo connection is that it's automatically pooled. What this means is that pymongo maintains a pool of connections to the MongoDB server that it reuses over the lifetime of your application. This is good for performance, since it means pymongo doesn't need to go through the overhead of establishing a connection each time it does an operation. Mostly, this happens automatically. You do, however, need to be aware of the connection pooling, since you need to manually notify pymongo that you're "done" with a connection in the pool so it can be reused.
Now that all that is out of the way, the easiest way to connect to a MongoDB database from Python is below (assuming you are running a MongoDB server locally, and that you have installed IPython and pymongo):
In [1]: import pymongo
In [2]: conn = pymongo.Connection()
Inserting Documents
Inserting documents begins by selecting a database. To create a database, you do... well, nothing, actually. The first time you refer to a database, the MongoDB server creates it for you automatically. So once you have your database, you need to decide which "collection" to store your documents in. To create a collection, you do... right - nothing. So here we go:
In [3]: db = conn.tutorial
In [4]: db.test
Out[4]: Collection(Database(Connection('localhost', 27017), u'tutorial'), u'test')
In [5]: db.test.insert({'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[5]: ObjectId('4f25bcffeb033049af000000')
One thing to note here is that the insert command returned us an ObjectId value. This is the value that pymongo generated for the _id property, the "primary key" of a MongoDB document. We can also manually specify the _id if we want (and we don't have to use ObjectIds):
In [6]: db.test.insert({'_id': 42, 'name': 'My Document', 'ids': [1,2,3], 'subdocument': {'a':2}})
Out[6]: 42
A note on document structure
A note is in order here about the kind of documents you can insert into a collection. Technically, the type of document is described by the BSON spec, but practically you can think of it as JSON plus a few extra types. Some of the types you should be aware of are:
- primitive types Python ints, strings, and floats are automatically handled appropriately by pymongo.
- object Objects are represented by pymongo as regular Python dicts. Technically, in BSON, objects are ordered dictionaries, so if you need or want to rely on the ordering you can use the bson module, included with pymongo, to create such objects. Objects have strings as their keys and can have any valid BSON type as their values.
- array Arrays are represented by pymongo as Python lists. Again, any valid BSON type can be used as the values, and the values in an array do not need to be of the same type.
- ObjectId ObjectIds can be thought of as globally unique identifiers. They are often used to generate "safe" primary keys for collections without the overhead of using a sequence generator as is often done in relational databases.
- Binary Strings in BSON are stored as UTF-8-encoded unicode. To store non-unicode data you must use the bson.Binary wrapper around a Python string.
Documents in MongoDB also have a size limit (16MB in the latest version), which is enough for many use-cases, but is something you'll need to be aware of when designing your schemas.
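To make the "JSON plus a few extra types" idea concrete, here is a small sketch that checks whether a Python value has the general shape described above: string-keyed dicts, lists of mixed values, and simple scalars. It is only an approximation under stated assumptions: `is_document_like` is a hypothetical helper, Python 3 `bytes` stands in for binary data, and types such as ObjectId or dates are deliberately left out. The real encoding rules live in the bson module.

```python
from collections import OrderedDict

# Scalar types accepted by this sketch (an assumption, not the full BSON list).
SCALARS = (str, int, float, bool, bytes, type(None))

def is_document_like(value):
    """Return True if value is built only from string-keyed dicts, lists and scalars."""
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_document_like(v)
                   for k, v in value.items())
    if isinstance(value, list):
        # Array elements may be of different types.
        return all(is_document_like(v) for v in value)
    return isinstance(value, SCALARS)

# An OrderedDict keeps the key order explicit when ordering matters.
doc = OrderedDict([
    ("name", "My Document"),
    ("ids", [1, "two", 3.0]),
    ("subdocument", {"a": 2}),
    ("payload", b"\x00\x01"),
])

print(is_document_like(doc))
print(is_document_like({1: "integer keys are not allowed"}))
```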
Batch inserts
MongoDB and pymongo also allow you to insert multiple documents with a single API call (and a single trip to the server). This can significantly speed up your inserts, and is useful for things like batch loads. To perform a multi-insert, you simply pass a list of documents to the insert() method rather than a single document:
In [7]: db.test.insert([{'a':1}, {'a':2}])
Out[7]: [ObjectId('4f25c0aceb033049af000001'), ObjectId('4f25c0aceb033049af000002')]
You may have noticed that the structure of the documents inserted in this snippet differs (significantly!) from that of the documents inserted before. MongoDB does not require that documents share structure with one another. Analogous to Python's dynamic typing, MongoDB is a dynamically typed database, where the structure of the document is stored along with the document itself. Practically, I've found it useful to group similarly structured documents into collections, but it's definitely not a hard-and-fast rule.
Querying
OK, now that you've got your data into the database, how do you get it back out again? That's the function of the find() method on collection objects. With no parameters, it will return all the documents in a collection as a Python iterator:
In [8]: db.test.find()
Out[8]: <pymongo.cursor.Cursor at 0x7f2910068b90>
In [9]: list(db.test.find())
Out[9]:
[{u'_id': ObjectId('4f25bce9eb033049a0000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': 42, u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25c0aceb033049af000001'), u'a': 1},
 {u'_id': ObjectId('4f25c0aceb033049af000002'), u'a': 2}]
Most of the time, however, you'll want to select particular documents to return. You do this by providing a query as the first parameter to find. Queries are represented as BSON objects as well, and are similar to query-by-example that you may have seen in other database technologies. For instance, to retrieve all documents that have the name 'My Document', you would use the following query:
In [13]: list(db.test.find({'name':'My Document'}))
Out[13]:
[{u'_id': ObjectId('4f25bce9eb033049a0000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': ObjectId('4f25bcffeb033049af000000'), u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}},
 {u'_id': 42, u'ids': [1, 2, 3], u'name': u'My Document', u'subdocument': {u'a': 2}}]
MongoDB also allows you to 'reach inside' embedded documents using the dot notation. Here are some examples of how you can use that:
In [22]: db.testq.insert([
   ....: { 'a': { 'b': 1 }, 'c': [{'d': 1}, {'d':2}, {'d':3}]},
   ....: { 'a': { 'b': 2 }, 'c': [{'d': 2}, {'d':3}, {'d':4}]},
   ....: { 'a': { 'b': 3 }, 'c': [{'d': 3}, {'d':4}, {'d':5}]}
   ....: ])
Out[22]:
[ObjectId('4f25c89beb033049af000009'),
 ObjectId('4f25c89beb033049af00000a'),
 ObjectId('4f25c89beb033049af00000b')]
In [23]: # reach inside objects
In [24]: list(db.testq.find({'a.b': 1}))
Out[24]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [25]: list(db.testq.find({'a.b': 2}))
Out[25]:
[{u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
In [26]: # find objects where *any* value in an array matches
In [27]: list(db.testq.find({'c': {'d':1}}))
Out[27]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [28]: # reach into an array
In [29]: list(db.testq.find({'c.d': 2}))
Out[29]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
In [30]: list(db.testq.find({'c.1.d': 2}))
Out[30]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
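If the dot-notation rules feel opaque, it can help to model them in plain Python. The sketch below is not how MongoDB implements matching; `resolve` and `matches` are hypothetical helpers that only cover equality queries. They follow a dotted path through dicts, treat a numeric path component as an array position, and otherwise try the path against every embedded document in an array.

```python
def resolve(value, parts):
    """Yield every value reachable from value by following the path parts."""
    if not parts:
        yield value
        return
    head, rest = parts[0], parts[1:]
    if isinstance(value, dict):
        if head in value:
            yield from resolve(value[head], rest)
    elif isinstance(value, list):
        # A numeric component selects one array position ...
        if head.isdigit() and int(head) < len(value):
            yield from resolve(value[int(head)], rest)
        # ... and the same path is also tried on each embedded document.
        for item in value:
            if isinstance(item, dict):
                yield from resolve(item, parts)

def matches(doc, query):
    """Equality-only matching: every dotted path must reach the wanted value."""
    for path, wanted in query.items():
        found = any(
            candidate == wanted
            or (isinstance(candidate, list) and wanted in candidate)
            for candidate in resolve(doc, path.split('.'))
        )
        if not found:
            return False
    return True

docs = [
    {'a': {'b': 1}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]},
    {'a': {'b': 2}, 'c': [{'d': 2}, {'d': 3}, {'d': 4}]},
    {'a': {'b': 3}, 'c': [{'d': 3}, {'d': 4}, {'d': 5}]},
]

# Report which documents (identified by their a.b value) match each query.
for query in ({'a.b': 1}, {'c': {'d': 1}}, {'c.d': 2}, {'c.1.d': 2}):
    print(query, [d['a']['b'] for d in docs if matches(d, query)])
```

Run against the same three documents as the session above, the sketch selects the same documents for each of the four queries.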
One other thing that's important to be aware of is that you can return only a subset of the fields in a query. (By default, _id is always returned unless you explicitly suppress it.) Here is an example:
In [31]: list(db.testq.find({'c.1.d':2}, {'c':1}))
Out[31]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
In [32]: # we can also reach inside when specifying which fields to return
In [33]: list(db.testq.find({'c.1.d':2}, {'a.b':1}))
Out[33]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}}]
We can also restrict the number of results returned from a query by skipping some documents and limiting the total number returned:
In [66]: list(db.testq.find())
Out[66]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'), u'a': {u'b': 3}, u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]
In [67]: list(db.testq.find().skip(1).limit(1))
Out[67]:
[{u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]}]
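Skipping and limiting behave much like slicing an iterator. As a minimal analogy (not pymongo code; `page` is a hypothetical helper), the same effect on any Python iterable looks like this:

```python
from itertools import islice

def page(results, skip=0, limit=None):
    """Drop the first skip items, then return at most limit of the rest."""
    stop = None if limit is None else skip + limit
    return list(islice(results, skip, stop))

values = [1, 2, 3]
second_only = page(values, skip=1, limit=1)
print(second_only)
```

The analogy also hints at the cost: skipped items still have to be walked past, so large skip values get slower as they grow.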
The query and advanced query pages in the MongoDB docs provide the full query syntax, which includes inequalities, size and type operations, and more.
Indexing
Like any database, MongoDB can only perform so well by scanning collections for matches. To provide better performance, MongoDB can use indexes on its collections. The normal method of specifying an index in MongoDB is by "ensuring" its existence, in keeping with the pattern of having things spring into existence when they're needed. To create an index on our 'test' collection, for instance, we would do the following:
In [13]: db.test.drop()
In [14]: db.test.ensure_index('a')
Out[14]: u'a_1'
In [15]: db.test.insert([
   ....: {'a': 1, 'b':2}, {'a':2, 'b':3}, {'a':3, 'b':4}])
Out[15]:
[ObjectId('4f28261deb033053bc000000'),
 ObjectId('4f28261deb033053bc000001'),
 ObjectId('4f28261deb033053bc000002')]
In [16]: db.test.find({'a':2})
Out[16]: <pymongo.cursor.Cursor at 0x24f7b90>
In [17]: db.test.find_one({'a':2})
Out[17]: {u'_id': ObjectId('4f28261deb033053bc000001'), u'a': 2, u'b': 3}
OK, well, that's not actually all that exciting. However, MongoDB provides an explain() method that lets us see whether our index is getting used:
In [18]: db.test.find({'a':2}).explain()
Out[18]:
{u'allPlans': [{u'cursor': u'BtreeCursor a_1', u'indexBounds': {u'a': [[2, 2]]}}],
 u'cursor': u'BtreeCursor a_1',
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 1,
 u'nscannedObjects': 1}
There are several important things to note about the result here. The most important is the cursor type, BtreeCursor a_1. This means that it's using an index, and in particular the index named a_1 that we created above. If the field is not indexed, we get a different query plan from MongoDB:
In [19]: db.test.find({'b':2}).explain()
Out[19]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
 u'indexBounds': {},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 0,
 u'n': 1,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 3,
 u'nscannedObjects': 3}
Here, MongoDB is using a BasicCursor. For you SQL experts out there, this is equivalent to a full table scan, and is very inefficient. Note also that when we queried the indexed field, nscanned and nscannedObjects were both one, meaning that it only had to check one object to satisfy the query, whereas in the case of our unindexed field, we had to check every document in the collection.
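The difference between the two plans can be imitated in a few lines of plain Python. This is only a toy model, assuming a sorted list searched with bisect in place of MongoDB's B-tree; `full_scan`, `build_index`, and `index_lookup` are hypothetical names, and the second value each function returns plays the role of nscanned.

```python
from bisect import bisect_left, bisect_right

docs = [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}, {'a': 3, 'b': 4}]

def full_scan(docs, field, value):
    """Check every document, like a BasicCursor."""
    scanned, hits = 0, []
    for doc in docs:
        scanned += 1
        if doc.get(field) == value:
            hits.append(doc)
    return hits, scanned

def build_index(docs, field):
    """Return sorted keys plus the position of each key's document."""
    pairs = sorted((doc[field], pos) for pos, doc in enumerate(docs))
    return [key for key, _ in pairs], [pos for _, pos in pairs]

def index_lookup(docs, index, value):
    """Binary-search the sorted keys and touch only the matching documents."""
    keys, positions = index
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return [docs[pos] for pos in positions[lo:hi]], hi - lo

print(full_scan(docs, 'b', 2))
print(index_lookup(docs, build_index(docs, 'a'), 2))
```

With three documents the gap is trivial, but the scan grows linearly with the collection while the indexed lookup only touches the entries that match.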
MongoDB has an extremely fast query that it can use in some cases where it doesn't have to scan any objects, only the index entries. This happens when the only data you're returning from a query is part of the index:
In [36]: db.test.find({'a':2}, {'a':1, '_id':0}).explain()
Out[36]:
...
 u'indexBounds': {u'a': [[2, 2]]},
 u'indexOnly': True,
 u'isMultiKey': False,
...
Note here the indexOnly field is true, specifying that MongoDB only had to inspect the index (and not the actual collection data) to satisfy the query.
Another thing that's nice about the MongoDB index system is that it can use compound indexes (indexes that include more than one field) to satisfy some queries. In this case, you should specify the direction of each field since MongoDB stores indexes as B-trees. An illustration is probably best. First, we'll drop our a_1 index and ensure a new index on a and b, both ascending:
In [44]: db.test.drop_index('a_1')
In [45]: db.test.ensure_index([('a', 1), ('b', 1)])
Out[45]: u'a_1_b_1'
Now, let's see what happens when we query by just a:
In [55]: db.test.find({'a': 2}).explain()
Out[55]:
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[2, 2]], u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...
MongoDB's optimizer here is "smart" enough to use the compound (a,b) index to query just the a value. What if we query just the b value?
In [56]: db.test.find({'b': 2}).explain()
Out[56]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...
Oops! That doesn't work because the index is sorted with keys (a,b). Key order also becomes important when we want to sort our results:
In [64]: db.test.find().sort([ ('a', 1), ('b', 1)]).explain()
Out[64]:
...
 u'cursor': u'BtreeCursor a_1_b_1',
 u'indexBounds': {u'a': [[{u'$minElement': 1}, {u'$maxElement': 1}]],
                  u'b': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
...
In [65]: db.test.find().sort([ ('a', 1), ('b', -1)]).explain()
Out[65]:
{u'allPlans': [{u'cursor': u'BasicCursor', u'indexBounds': {}}],
 u'cursor': u'BasicCursor',
...
So if we sort in the same order as our index, we can use the B-tree index to sort the results. If we sort in a different order, we can't, so MongoDB has to actually load the entire result set into RAM and sort it there. (MongoDB can actually use an index for the exact reverse of the index order as well, so [('a', -1), ('b', -1)] would have worked just fine.) In a collection of 3 documents, this isn't important, but as your data grows, this can become quite slow, in some cases actually returning an error because the result set is too large to sort in RAM.
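The rules from this section can be summed up as a pair of small predicates. This is a deliberately simplified model covering only the cases shown above (a single-field query, and a sort that is either the index order or its exact reverse); `can_filter` and `can_sort` are hypothetical helpers, and the real query optimizer handles more situations than this.

```python
def can_filter(index, field):
    """A compound index helps a single-field query only on its leading key."""
    return index[0][0] == field

def can_sort(index, sort):
    """True if sort is the index order or its exact reverse."""
    reverse = [(name, -direction) for name, direction in index]
    return sort == index or sort == reverse

index = [('a', 1), ('b', 1)]

print(can_filter(index, 'a'), can_filter(index, 'b'))
print(can_sort(index, [('a', 1), ('b', 1)]))
print(can_sort(index, [('a', 1), ('b', -1)]))
print(can_sort(index, [('a', -1), ('b', -1)]))
```

Thinking in these terms when you design a compound index (which field leads, and which sort orders you actually need) helps keep queries from falling back to an in-RAM sort.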
Deleting data
Deleting data in MongoDB is fairly straightforward. All you need to do is to pass a query to the remove() method on the collection, and MongoDB will delete all documents from the collection that match the query. (Note that deletes can still be slow if you specify the query inefficiently, as they use indexes just like queries do.)
In [72]: list(db.testq.find())
Out[72]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]},
 {u'_id': ObjectId('4f25c89beb033049af00000a'), u'a': {u'b': 2}, u'c': [{u'd': 2}, {u'd': 3}, {u'd': 4}]},
 {u'_id': ObjectId('4f25c89beb033049af00000b'), u'a': {u'b': 3}, u'c': [{u'd': 3}, {u'd': 4}, {u'd': 5}]}]
In [73]: db.testq.remove({'a.b': {'$gt': 1}})
In [74]: list(db.testq.find())
Out[74]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 1}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
Updating data
In many cases, an update in MongoDB is as straightforward as calling save() on a Python dict:
In [76]: doc = db.testq.find_one({'a.b': 1})
In [77]: doc['a']['b'] = 4
In [78]: db.testq.save(doc)
Out[78]: ObjectId('4f25c89beb033049af000009')
In [79]: list(db.testq.find())
Out[79]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 4}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
save() can also be used to insert documents if they don't exist yet (this check is done by checking the dict to be saved for the _id key).
Unlike some other NoSQL solutions, MongoDB allows you to do quick, in-place updates of documents using special modifier operations. For instance, to set a field to a particular value, you can do the following:
In [81]: db.testq.update({'a.b': 4}, {'$set': {'a.b': 7}})
In [82]: list(db.testq.find())
Out[82]:
[{u'_id': ObjectId('4f25c89beb033049af000009'), u'a': {u'b': 7}, u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
MongoDB provides several different modifiers you can use to update documents in place, including the following (for more details see updates):
- $inc Increment a numeric field (generalized; can increment by any number)
- $set Set certain fields to new values
- $unset Remove a field from the document
- $push Append a value onto an array in the document
- $pushAll Append several values onto an array
- $addToSet Add a value to an array if and only if it does not already exist
- $pop Remove the last (or first) value of an array
- $pull Remove all occurrences of a value from an array
- $pullAll Remove all occurrences of any of a set of values from an array
- $rename Rename a field
- $bit Bitwise updates
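To get a feel for what a few of these modifiers do, here is a pure-Python sketch that applies them to an ordinary dict. It is an illustration only, not how the server works: `apply_update` and `_locate` are hypothetical helpers, only six modifiers are covered, and the last component of each dotted path is assumed to name a field of a (possibly embedded) document rather than an array position.

```python
def _locate(doc, path):
    """Walk a dotted path; return the containing dict and the final field name."""
    node = doc
    parts = path.split('.')
    for part in parts[:-1]:
        if isinstance(node, list):
            node = node[int(part)]           # numeric component: array position
        else:
            node = node.setdefault(part, {})  # create missing sub-documents
    return node, parts[-1]

def apply_update(doc, update):
    """Apply a small subset of modifiers to doc in place and return it."""
    for op, fields in update.items():
        for path, arg in fields.items():
            parent, key = _locate(doc, path)
            if op == '$set':
                parent[key] = arg
            elif op == '$inc':
                parent[key] = parent.get(key, 0) + arg
            elif op == '$unset':
                parent.pop(key, None)
            elif op == '$push':
                parent.setdefault(key, []).append(arg)
            elif op == '$addToSet':
                values = parent.setdefault(key, [])
                if arg not in values:
                    values.append(arg)
            elif op == '$pull':
                if key in parent:
                    parent[key] = [v for v in parent[key] if v != arg]
            else:
                raise ValueError('unsupported modifier: ' + op)
    return doc

doc = {'a': {'b': 4}, 'c': [{'d': 1}, {'d': 2}, {'d': 3}]}
apply_update(doc, {'$set': {'a.b': 7}})
apply_update(doc, {'$inc': {'c.1.d': 10}})   # reach into the second array element
apply_update(doc, {'$push': {'tags': 'x'}})
apply_update(doc, {'$addToSet': {'tags': 'x'}})  # already present, so no change
print(doc)
```

The point of the real modifiers is that the server performs the change itself, so you don't have to fetch the document, edit it in Python, and save it back.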
There's a lot more that I could cover, but hopefully that whets your appetite to learn more about MongoDB. In future posts, I'll discuss how to use GridFS (a "filesystem" on top of MongoDB) and MongoDB's various aggregation options, as well as how you can use Ming to simplify certain operations.
Thanks for this detailed & accurate intro, Rick.
Thanks for the comment. Glad you liked it!
One point of clarification: I mentioned that BSON documents are ordered dicts. This is true, but unfortunately you can't count on the order remaining consistent, as the keys in a document may be re-sorted alphabetically if the document needs to be moved (as it does when it outgrows its padding). This behavior is documented in https://jira.mongodb.org/browse/SERVER-2592 .
I should also mention that you can use bson.son.SON (in pymongo) or collections.OrderedDict (in Python 2.7+ standard library) to create documents with a particular order, though as I mentioned, if the documents grow, the keys may be reordered. Thanks to Bernie Hackett for these tips.
Really useful info for a beginner like me! Thank you very much! :)
Hi Vincent,
Glad you found the tutorial useful, and thanks for the comment!
Thanks for the article. I'm a bit surprised that it can only sort what will fit in RAM, though.
Thanks for the comment! I was also surprised when I discovered the sorting limitation. Index and schema design can help a lot with that, however.
Nicely written, useful tutorial. Thanks
Glad you liked it, and thanks for the comment!
Rick,
I am stuck on the syntax for updating. I want to include a variable and embedded doc. In your example on updating above, it would be like this:
ordinal = 1
[{u'_id': ObjectId('4f25c89beb033049af000009'),
u'a': [{u'x': 4, u'y': 5, u'z': 6}, {u'x': 7, u'y': 8, u'z': 9},{u'x': 1, u'y': 2, u'z': 3}]
u'c': [{u'd': 1}, {u'd': 2}, {u'd': 3}]}]
so what I am trying to do is increment say 'y' from the second embeded doc with:
db.col.update({'_id':'ObjectId('4f25c89beb033049af000009')'}, {'$inc':{'a'[ordinal]'y':1}},upsert=True,multi=False)
I can't figure out the proper syntax to put 'y' after the ordinal variable that selects the embedded doc.
Can you help me?
What you want is this:
db.col.update(
{'_id': ObjectId('4f25c89beb033049af000009')},
{ '$inc': { 'a.1.y': 1 } })
Hi
If document is assigned to a variable, for example var a = {"Name":"Raj","City":Hyd} and I want to print or use a.Name using pymongo, getting an error.
One more thing findOne() is not working but find_One() is working. Why is such a difference in calling function.
In Python, to access the keys of a dictionary, you have to use the notation a['Name'], not a.Name.
Also, this is a fairly old post, and some things have changed in pymongo since it was published. I believe that the use of find_one() in pymongo instead of findOne() is one such example.