Friday, May 25, 2012

GridFS: The MongoDB Filesystem

In some previous posts on MongoDB, Python, and pymongo, I introduced the NoSQL database MongoDB and how you can use it from Python. This post goes beyond the basics of MongoDB and pymongo to give you a taste of MongoDB's take on filesystems: GridFS.

Why a filesystem?

If you've been working with MongoDB for a while, you may have heard about the 16 MB document size limit. When I started using MongoDB (around version 0.8), the limit was actually 4 MB. What this means is that everything works just fine and your system is screaming fast, until you try to create a document that's 4.001 MB, and boom: nothing works any more. For us at SourceForge, that meant we had to restructure our schema and use less embedding.

But what if it's not something that can be restructured? Maybe your site allows users to upload large attachments of unknown size. In such cases you can probably get away with using a Binary field type and crossing your fingers, but a better solution, in my opinion, is to actually store the contents of your upload in a series of documents (let's call them "chunks") of limited size. Then you can tie them all together with another document that specifies all the file metadata.
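
If you were rolling this yourself, the core of it might look something like the sketch below. (This is just an illustration of the idea, not any real API: the my_files and my_chunks collection names and the save_file helper are made up.)

import math
import pymongo
from bson import Binary

CHUNK_SIZE = 256 * 1024   # split payloads into 256k pieces

db = pymongo.Connection().gridfs_test

def save_file(data, filename):
    # one metadata document ties all the chunks together
    file_id = db.my_files.insert(
        {'filename': filename, 'length': len(data), 'chunkSize': CHUNK_SIZE})
    # one document per chunk, numbered so they can be reassembled in order
    num_chunks = int(math.ceil(len(data) / float(CHUNK_SIZE)))
    for n in range(num_chunks):
        chunk = data[n * CHUNK_SIZE:(n + 1) * CHUNK_SIZE]
        db.my_chunks.insert({'files_id': file_id, 'n': n, 'data': Binary(chunk)})
    return file_id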

GridFS to the rescue

Well, that's exactly what GridFS does, but with a nicer API and a few more bells and whistles than you'd probably build on your own. It's important to note that GridFS, implemented in all the MongoDB language drivers, is a convention and an API, not something that's provided natively by the server. As far as the server is concerned, it's all just collections and documents.

The GridFS schema

GridFS actually stores your files in two collections, named fs.files and fs.chunks by default, although you can change the fs prefix to something else if you'd like. The fs.files collection is where reading or writing a file begins. A typical fs.files document looks like the following:

{
  // unique ID for this file
  "_id" : <unspecified>,
  // size of the file in bytes
  "length" : data_number,
  // size of each of the chunks.  Default is 256k
  "chunkSize" : data_number,
  // date when object first stored
  "uploadDate" : data_date,
  // result of running the "filemd5" command on this file's chunks
  "md5" : data_string
}

The fs.chunks collection contains all the data for your files:

{
  // object id of the chunk in the fs.chunks collection
  "_id" : <unspecified>,
  // _id of the corresponding files collection entry
  "files_id" : <unspecified>,
  // chunks are numbered in order, starting with 0
  "n" : chunk_number,
  // the chunk's payload as a BSON binary type
  "data" : data_binary,
}
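
Because GridFS is just a convention, reading a file back amounts to fetching the fs.files document and then streaming the matching fs.chunks documents in ascending n order. Here's a rough sketch of what a driver does under the hood (simplified; real drivers also verify the md5, buffer by chunkSize, and complain about missing chunks):

def read_file(db, file_id):
    # look up the metadata document first
    info = db.fs.files.find_one({'_id': file_id})
    if info is None:
        raise IOError('no file with _id %r' % file_id)
    # then stitch the chunks back together in order
    chunks = db.fs.chunks.find({'files_id': file_id}).sort('n', 1)
    return ''.join(str(chunk['data']) for chunk in chunks)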

In the Python gridfs package (included with the pymongo driver), several other optional fields can be set as well:

filename: This is the 'human' name for the file, which may be path-delimited to simulate directories.
contentType: This is the MIME type of the file.
encoding: This is the Unicode encoding used for text files.

You can also add your own attributes to files. At SourceForge, we used things like project_id or forum_id to allow the same filename to be uploaded to multiple places on the site without worrying about namespace collisions. To keep your code future-proof, you should put any custom attributes inside an embedded metadata document, just in case the gridfs spec expands to incorporate more fields.

Using GridFS

So with all that out of the way, how do you actually use GridFS? It's pretty straightforward. The first thing you need is a reference to a GridFS filesystem:

>>> import pymongo
>>> import gridfs
>>> conn = pymongo.Connection()
>>> db = conn.gridfs_test
>>> fs = gridfs.GridFS(db)

Basic reading and writing

Once you have the filesystem, you can start putting stuff in it:

>>> with fs.new_file() as fp:
...     fp.write('This is my new file. It is teh awezum!')

Let's examine the underlying collections to see what actually happened:

>>> list(db.fs.files.find())
[{u'length': 38,
  u'_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'uploadDate': datetime.datetime(2012, 5, 25, 15, 39, 37, 55000),
  u'md5': u'332de5ca08b73218a8777da69293576a',
  u'chunkSize': 262144}]
>>> list(db.fs.chunks.find())
[{u'files_id': ObjectId('4fbfa7b9fb72f096bd000000'),
  u'_id': ObjectId('4fbfa7b9fb72f096bd000001'),
  u'data': Binary('This is my new file. It is teh awezum!', 0),
  u'n': 0}]

You can see that there's really nothing surprising or mysterious happening here; it's just mapping the filesystem metaphor onto MongoDB documents. In this case, our file was small enough that it didn't need to be split into chunks. We can force it to be split by specifying a small chunkSize when creating the file:

>>> with fs.new_file(chunkSize=10) as fp:
...     fp.write('This is file number 2. It should be split into several chunks')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f5950>
>>> fp._id
ObjectId('4fbfa8ddfb72f0971c000000')
>>> list(db.fs.chunks.find(dict(files_id=fp._id)))
[{... u'data': Binary('This is fi', 0), u'n': 0},
 {... u'data': Binary('le number ', 0), u'n': 1},
 {... u'data': Binary('2. It shou', 0), u'n': 2},
 {... u'data': Binary('ld be spli', 0), u'n': 3},
 {... u'data': Binary('t into sev', 0), u'n': 4},
 {... u'data': Binary('eral chunk', 0), u'n': 5},
 {... u'data': Binary('s', 0), u'n': 6}]

Now, if we actually want to read the file back as a file, we'll need to use the gridfs API:

>>> with fs.get(fp._id) as fp_read:
...     print fp_read.read()
...
This is file number 2. It should be split into several chunks

Treating it more like a filesystem

There are several other convenience methods bundled into the GridFS object to give more filesystem-like behavior. For instance, new_file() takes any number of keyword arguments that will get added onto the fs.files document being created:

>>> with fs.new_file(
...     filename='file.txt', 
...     content_type='text/plain', 
...     my_other_attribute=42) as fp:
...     fp.write('New file')
...
>>> fp
<gridfs.grid_file.GridIn object at 0x1010f59d0>
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'my_other_attribute': 42,
 u'filename': u'file.txt',
 u'length': 8,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 53, 1, 973000),
 u'_id': ObjectId('4fbfaaddfb72f0971c000008'), u'md5':
 u'681e10aecbafd7dd385fa51798ca0fd6'}

Better would be to encapsulate my_other_attribute into the metadata property:

>>> with fs.new_file(
...     filename='file2.txt', 
...     content_type='text/plain', 
...     metadata=dict(my_other_attribute=42)) as fp:
...     fp.write('New file 2')
...
>>> db.fs.files.find_one(dict(_id=fp._id))
{u'contentType': u'text/plain',
 u'chunkSize': 262144,
 u'metadata': {u'my_other_attribute': 42},
 u'filename': u'file2.txt',
 u'length': 10,
 u'uploadDate': datetime.datetime(2012, 5, 25, 15, 54, 5, 67000),
 u'_id':ObjectId('4fbfab1dfb72f0971c00000a'),
 u'md5': u'9e4eea3dec28d8346b52f18086437ac7'}
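
One nice side effect is that those custom attributes are directly queryable in fs.files, which is how we avoided filename collisions between projects at SourceForge. A quick illustration, reusing the file we just created (nothing here is special gridfs API, just a normal find_one plus fs.get):

>>> doc = db.fs.files.find_one(
...     {'filename': 'file2.txt', 'metadata.my_other_attribute': 42})
>>> fs.get(doc['_id']).read()
'New file 2'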

We can also "overwrite" files by filename, but since GridFS actually indexes files by _id, it doesn't get rid of the old file, it just versions it:

>>> with fs.new_file(filename='file.txt', content_type='text/plain') as fp:
...     fp.write('Overwrite the so-called "New file"')
...

Now, if we want to retrieve the file by filename, we can use get_version or get_last_version:

>>> fs.get_last_version('file.txt').read()
'Overwrite the so-called "New file"'
>>> fs.get_version('file.txt', 0).read()
'New file'

Since we've been uploading files with a filename property, we can also list the files in gridfs:

>>> fs.list()
[u'file.txt', u'file2.txt']

We can also remove files, of course:

>>> fp = fs.get_last_version('file.txt')
>>> fs.delete(fp._id)
>>> fs.list()
[u'file.txt', u'file2.txt']
>>> fs.get_last_version('file.txt').read()
'New file'

Note that since only one version of "file.txt" was removed, we still have a file named "file.txt" in the filesystem.

Finally, gridfs also provides convenience methods for determining whether a file exists and for quickly writing a short file into gridfs:

>>> fs.exists(fp._id)
False
>>> fs.exists(filename='file.txt')
True
>>> fs.exists({'filename': 'file.txt'}) # equivalent to above
True
>>> fs.put('The quick brown fox', filename='typingtest.txt')
ObjectId('4fbfad74fb72f0971c00000e')
>>> fs.get_last_version('typingtest.txt').read()
'The quick brown fox'

So that's the whirlwind tour of GridFS. I'd love to hear how you're using GridFS in your project, or if you think it might be a good fit, so please drop me a line in the comments.

20 comments:

  1. Great intro to GridFS, thanks Rick. While I'd heard of GridFS, I'd never paid attention to what it was, how it worked, or how to use it. Your post explains it all very well - cool stuff!

  2. Anonymous: The truth is, GridFS is not production-ready. You will have insane problems, and it will reduce performance on your main collections by 70%.

    Either use a second MongoDB cluster just for GridFS, or just don't :) and use the filesystem.

    1. I'm not sure what exactly you were doing to cause the problems you describe, but your experience certainly doesn't square with mine working at SourceForge. There's no magic to GridFS; reading any file under 256k will cause two document fetches from MongoDB; up to 512k, 3 document fetches, etc. Remember that *all* the gridfs magic (well, except for md5 computation, IIRC) happens in the *client*.
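
      (As a rough formula, with the default 256k chunkSize: reads per file = 1 + ceil(file length / chunkSize), i.e. one fetch for the fs.files document plus one per chunk.)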

      Perhaps if you're constantly writing to GridFS, you might cause problems, but that's due more to MongoDB's global read/write lock that I covered in a previous post.

      If you actually have any test cases that cause performance degradation due to gridfs usage, I'd be more than interested in seeing them....

  3. After playing with GridFS a bit, people start to wonder how they can serve files to web browsers from it. I've written a Python server for this, at https://bitbucket.org/btubbs/khartoum/.

    1. Thanks for the comment. I'll definitely have to check out Khartoum. I should also mention nginx-gridfs, an nginx module with similar functionality. I haven't used it, but if you want to serve frequently-changing large files, it's probably worth checking out.

  4. Shameless self-promotion: http://xm.x-infinity.com/2012/04/as-were-to-move-our-terabytes-of-files.html

    1. mod_gridfs looks interesting, thanks for pointing it out!

  5. Anonymous: What's the benefit of using this over the server's filesystem? Wouldn't an association be simpler and faster than trying to fit a file into the db? Unless of course you want to copy the db and ship it off to another server, but then again there are many file sync options out there...
    Please enlighten me.

    1. One benefit to using gridfs over the server's native filesystem is that gridfs will be available to your application servers automatically, without having to worry about setting up NFS. Another is that as you grow your MongoDB cluster, adding shards and replicas, the gridfs performance can scale as well. It's really a question (to me) of reducing the number of moving parts that can break.

      Already using MongoDB for the majority of your app's data and want to support multiple app servers, but some of your objects are too big to fit in a bson.Binary field? Gridfs is probably the shortest path to completion. Already have a filesystem shared between your app servers? Then that filesystem might be the best place to put stuff. Never intend to grow beyond the need for a single application server? Might as well use the filesystem.

  6. How many files do you store? We currently store hundreds of millions of mostly small files on a single server. However, file systems don't handle this well.

    We're looking at moving to CouchDB or Mongo. We eventually want to be able to store billions of small files.

    1. Thanks for the comment, David!

      I don't have the exact data on the number of files stored on the SourceForge MongoDB gridfs, but my off-the-cuff response is that 100s of millions of files should be fine as long as whatever you're using to query is well-indexed. Whether you can get good performance out of such a system depends on your usage patterns, hardware, etc.

      If your files are guaranteed to be under 16MB and frequently smaller than say 256kB, you're probably better off using bson.Binary objects to store them inside documents. Assuming you're usually reading or writing an entire file at once, this will perform much better than GridFS. With the advent of MongoDB 2.2, you might also consider storing your files in a separate database on the main server to reduce the impact that GridFS has on your "normal" MongoDB performance.
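
      A minimal sketch of the bson.Binary approach (the attachments collection and field names here are made up for illustration):

      >>> from bson import Binary
      >>> db.attachments.insert({
      ...     'filename': 'small.txt',
      ...     'data': Binary('tiny file contents')})
      ObjectId('...')
      >>> db.attachments.find_one({'filename': 'small.txt'})['data']
      Binary('tiny file contents', 0)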

      Hope that helps!

    2. Thanks Rick. That was very helpful!

  7. Hey Rick -
    Does GridFS require you to have an underlying NFS layer to provide storage, or does it store everything locally on each server? Are you aware of a limit on the number of servers that can be part of the same database with Mongo and GridFS? If we expand a network out to be 100 servers wide, I think we would have issues with replica sets, since I thought Mongo's limit is 12 members?

    I guess I'm just thinking through practical applications for a large scale website.

  8. This tutorial is very helpful. It really helped me get started with GridFS. Thanks a lot.

    By the way, can you let me know how you got the colors in your Python shell? Thanks.

    1. Thanks for the comment!

      I get the colors via Pygments (pygments.org)

      -Rick

  9. Hi Rick, very informative post, thanks. I have a Django app that processes a file and yields N encrypted segments, each to be stored on a different server, with segment sizes ranging from a few KB to 70 MB. Do you think GridFS would be a wise choice for such an application?

  10. If you're thinking you'll be storing the file's segments as 'chunks' in gridfs, it would only really work well if all the segments were the same size (and it looks like that's not the case). You could still store the segments as independent files in gridfs, though.

    1. Segments are all the same size; what matters to me is efficiency, since the segments retrieved by the Django server from the GridFS servers are processed and compiled together to yield the real file.
      Is there any specific configuration needed on the servers where MongoDB is installed to serve uploads/downloads?

    2. If you're trying to ensure that MongoDB places each segment on a different shard, you can do that with shard tags (see https://docs.mongodb.org/manual/core/tag-aware-sharding/ and https://docs.mongodb.org/manual/tutorial/administer-shard-tags/). You could set up gridfs to store 'chunks' of the size of your segments (this is a client option). Then if you want to force every 'chunk #0' to shard 0, you would use the 'n' field in the chunks collection as the shard key and then tag shard 0 with 'chunk #0'. This is pretty fragile, however (since you have to tag each 'n' individually), and you might be better off writing a custom gridfs-like layer that includes a shard_id in each chunk. Hopefully this helps!
