Wednesday, January 18, 2012

Getting Started with MongoDB and Python

If you've been following this blog for a while, you've seen me mention MongoDB more than once. One exciting thing for me is that I'll be co-teaching a tutorial at PyCon this year on Python and MongoDB that will cover MongoDB, PyMongo, and Ming. So to hopefully whet your appetite for learning more at the tutorial, I thought I'd write a few posts covering MongoDB, PyMongo, and Ming from a beginner's perspective.

What is MongoDB?

From MongoDB.org:

MongoDB (from "humongous") is a scalable, high-performance, open source NoSQL database.

Well, that's not all that enlightening, so I'll expand a bit here on MongoDB's features...

MongoDB is a document database

MongoDB is a document database, which means that instead of storing "rows" in "tables" like you do in a relational database, you store "documents" in "collections." Documents are basically JSON objects (technically BSON). This distinguishes MongoDB from other NoSQL-type databases such as key-value stores (e.g. Tokyo Cabinet), column-family stores (e.g. Cassandra), and column stores (e.g. MonetDB).
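
For instance, a blog post might be stored as a single document with nested and list-valued fields. Here's an illustrative sketch (these field names are made up, not a schema the rest of this post uses):

{'title': 'Getting Started with MongoDB and Python',
 'author': {'name': 'Rick', 'joined': '2011-05-01'},
 'tags': ['mongodb', 'python'],
 'comment_count': 2}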

MongoDB has a flexible query language

This is one thing that makes MongoDB a pleasure to work with, particularly if you come from another NoSQL database where querying is either restrictive (key-value stores can only be queried by key) or cumbersome (something like CouchDB, which requires you to write a map-reduce query). MongoDB has a BSON-based query language that's a bit more restrictive than SQL but still lets you get a lot done.

Here's an example of a simple MongoDB query that we use at SourceForge to find all the blog posts for a project:

blog_post.find({'state':'published','app_config_id':{'$in':app_config_ids}})

There are several other operators, such as '$lt', '$nin', '$not', and '$or', that allow you to construct quite complex queries, though you are still somewhat more restricted than in SQL (even with a single table).
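
To give a flavor of how those operators combine, here's a sketch of a slightly more involved query (the 'timestamp' field, the 'one_week_ago' value, and the 'hidden_ids' list are hypothetical):

blog_post.find({
    'state': {'$nin': ['draft', 'deleted']},
    '$or': [{'timestamp': {'$lt': one_week_ago}},
            {'app_config_id': {'$nin': hidden_ids}}]})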

MongoDB is fast and scalable

A single MongoDB node is able to comfortably serve 1000s of requests per second on cheap hardware. When you need to scale beyond that, you can use either replication (keeping several copies of the data on different servers) or sharding (partitioning the data across servers). MongoDB even includes logic to automatically load-balance your shards as your database and load increase.
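
As a minimal sketch, connecting to a (hypothetical) three-member replica set with PyMongo 2.1+ looks something like this; the host names and set name below are made up:

import pymongo

# connect to any reachable member; the driver discovers the rest of the set
conn = pymongo.ReplicaSetConnection(
    'db1.example.com:27017,db2.example.com:27017,db3.example.com:27017',
    replicaSet='rs0')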

Getting Started with MongoDB

While MongoDB is fairly straightforward to install on (64-bit) systems, there are also a couple of companies, MongoLab and MongoHQ, that provide a free tier of MongoDB hosting, which is great for getting started. I've been using MongoLab for my own projects (for no particular reason), I can recommend them, and they're what I have experience with, so that's what I'll cover here.

Let's assume you sign up for a MongoLab account. Once you've done this, you can create a database using their web-based control panel. When you click on the database, you'll see the connection info at the top of the page:

(Your server name and port number may be different.) At this point, most tutorials would tell you to install and launch the 'mongo' command-line tool to begin exploring your database. We'll skip that here and use the Python driver, PyMongo, directly. I like to use virtualenv and IPython myself, so that's the approach I'll take here:

$ virtualenv mongo
... install messages ...
$ source mongo/bin/activate
(mongo) $ pip install pymongo ipython
... install messages ...
(mongo) $ ipython
... banner message ...

Now that we're in IPython, we'll go ahead and connect to the database and create a document.

In [1]: import pymongo

In [2]: conn = pymongo.Connection('mongodb://tutorial-test:u3ZYh136@ds029187.mongolab.com:29187/tutorial-test')

In [3]: db = conn['tutorial-test']

In [4]: db.test_collection.insert({})
Out[4]: ObjectId('4f16f5c7eb03306a92000000')

In [5]: db.test_collection.find()
Out[5]: <pymongo.cursor.Cursor at 0x7fbb9006f350>

In [6]: list(db.test_collection.find())
Out[6]: 
[{u'_id': ObjectId('4f16f5c7eb03306a92000000')}]
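
Just to make the example a little less empty, here's the kind of richer document you could insert and query next (the field names here are made up for illustration):

db.test_collection.insert({'title': 'First post', 'tags': ['mongodb', 'python']})
db.test_collection.find_one({'tags': 'mongodb'})  # matches any document whose tags list contains 'mongodb'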

Well, that's it for now. I'll be posting several follow-up articles in this series that will go into more detail on how to do various queries and updates using PyMongo, the MongoDB Python driver, as well as how to effectively use Ming, so stay tuned!

11 comments:

  1. FYI I deleted the user and DB I used in the post, so don't go trying any funny business ;-).

  2. Warning to others

    blog_post.find({'state':'published','app_config_id':{'$in':app_config_ids}})

    I just wasted an hour finding out that this syntax does not work for me.

    I have a collection "articles" with a field "title" and one document with title
    "Hadoop Development Environment OS X"

    These work:

    temp = articles.find({"title": "Hadoop Development Environment OS X"});

    temp = articles.find({"title":{"$in":["Hadoop Development Environment OS X"]}})

    This returns nothing:
    keys = ["Hadoop"]
    temp = articles.find({"title":{"$in":keys}})

    i.e. all I can get back is an exact match, not a partial match.

    Either I am misunderstanding (though I worked from an internet example the author said works fine), or there is a problem with the driver.

    I am fairly confident I did not misunderstand the mongo docs.

    Replies
    1. Thanks for the comment. I'm sorry if the example was confusing. The example I used was looking for blog articles with an *exact match* with one of the app_config_ids passed in.

      $in always looks for an exact match against one of the items in a list. If you want to find a partial match (such as a prefix), you need to use either '$regex' or a compiled Python regular expression. For instance, if you're trying to find the articles starting with 'Hadoop', the query would be

      articles.find({'title': {'$regex': '^Hadoop.*'}})
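
      or, equivalently, with a compiled Python regular expression from the standard re module:

      import re
      articles.find({'title': re.compile(r'^Hadoop')})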

      Again, sorry for the confusion. I'll try to be more explicit in the future. Thanks again for commenting.

  3. Ah, I worked out the "perfect match" just after posting this. My fault for trying to do things in a rush. Then I spent a while looking for how to use regexes.

    That did the job and got me a bit further on.

    Many thanks.

  4. Cool rundown, thanks Rick! In case anyone who is learning MongoDB finds it useful, I just launched a free tool called querymongo.com that translates MySQL syntax into MongoDB syntax. Hope someone can use it to get up to speed faster!

    Replies
    1. Cool tool, Bob. Thanks for the comment!

  5. Anonymous, 10:27 PM

    Sorry for such a basic question here, but I'm trying to find some real-world examples of how people actually get a pile of documents (physical documents like Word docs or Excel sheets) into a MongoDB collection. I've read a lot of articles that demonstrate how you can manually code information with JSON syntax using the doc ID, a first-name field and value, a last-name field and value, etc. But if I've got a folder full of, say, 10,000 Word docs with customer info in each one and I want to be able to query that pile of docs and pull up, say, a result set that contains all customers from Iowa, how would I do that? How are all those documents parsed into JSON and then dumped as document objects into the collection? Is there some ETL type of program that does that? (And if so, what would it be?) I've googled like crazy trying to find an answer to that, but come up with zilch.

    Replies
    1. MongoDB "documents," as it seems you've discovered, are completely different from Word documents. Since MongoDB stores structured data (JSON/BSON), extracting the structure you're looking for from a pile of unstructured Word documents is something you'd have to handle yourself, maybe with an ETL tool (I don't really know much about the market there) or some custom coding.

      For Excel documents, the story is a little better, as at least Excel has rows and columns, which can be exported as a CSV file. CSV files can then be imported using the mongoimport (I think that's the name) tool that comes with the MongoDB distribution. Of course, what you end up with there is documents that look a whole lot like table rows, since they actually came from spreadsheet rows.
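
      A typical mongoimport invocation looks something like this (the database, collection, and file names are just placeholders):

      mongoimport --db mydb --collection customers --type csv --headerline --file customers.csv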

      Hope this helps!

  6. Anonymous, 1:21 PM

    Rick, thanks for replying so quickly. Would you mind sharing what method you have used in your real-world projects to get information into MongoDB? What physical form did the original info you had to deal with come in, and how did you dump it into the collection?

    Replies
    1. My experience with MongoDB has mostly been using it as the main data store for a new application. Migrations, when they happened, were typically Python scripts that extracted data from one source and inserted it into the database. In some cases it was CSV data, in others it was result sets from queries against MySQL or Postgres databases.
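
      As a rough sketch of what one of those CSV migration scripts might look like (the file name, database name, and connection details here are made up):

      import csv
      import pymongo

      conn = pymongo.Connection('mongodb://localhost:27017/')
      customers = conn['mydb']['customers']

      # each CSV row (a dict mapping column name to value) becomes one MongoDB document
      with open('customers.csv', 'rb') as f:
          for row in csv.DictReader(f):
              customers.insert(row)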
