Bulk-like inserts with Django

The Django ORM is really great. It allows you to design and code your web application in a very simple and readable way.

Some projects have special needs, or are very demanding on the DB, and in those cases the ORM can become a bottleneck. Fortunately, Django offers several ways to get out of your way.

Inserting thousands of entries in the DB

After the Overmind 0.1 release, I have implemented the caching of provider images, so that the node creation form loads instantly. Once implemented though, a problem quickly became apparent: Amazon’s EC2 has nearly 3000 AMI images. Saving all of them using the Django ORM is extremely slow. It takes 10 minutes!

While Overmind will get a work queue and the user won’t have to wait after creating a new cloud provider, 10 minutes is way beyond reasonable. Then again, I really don’t want to start entering SQL code in this early stage of Overmind development, as we all know that “Premature optimization is the root of all evil”[1].

Fortunately, Django has a nifty and often overlooked feature: transactions.

When you look for a solution to bulk inserts with Django, most answers recommend raw SQL statements, and when someone proposes using transactions, the reply is that they are not real bulk inserts and won't bring very big performance improvements. Let us put that to the test.

Benchmarking transactions

The test consists of creating an EC2 provider in Overmind and measuring how long the action takes to complete. In the process, the nearly 3000 EC2 images will be imported into the local Overmind SQLite DB.

The simple Image model:

class Image(models.Model):
    '''OS image model'''
    image_id = models.CharField(max_length=20)
    name     = models.CharField(max_length=30)
    provider = models.ForeignKey(Provider)
    favorite = models.BooleanField(default=False)

    def __unicode__(self):
        return self.name

    class Meta:
        unique_together = ('provider', 'image_id')

The test has been run 3 times with and without transactions. To that end, the only change was adding a decorator to the Provider.import_images() method:

...
    @transaction.commit_on_success()
    def import_images(self):
...

Results

without transactions: 600 seconds (10 minutes)
with transactions: 16 seconds
Improvement factor: 37.5
Time reduction: 97.3%


The results are outstanding. This is no trivial improvement: the import now takes only 2.7% of the time it needed before!
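The speedup is easy to reproduce outside Django. A minimal standalone sketch using Python's sqlite3 module (not Overmind code; the table and data are made up) shows why: without a transaction every INSERT pays for its own commit, while one wrapping transaction, like commit_on_success around import_images(), pays that cost only once.

```python
import sqlite3

# Standalone illustration (not Overmind code; table and data are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (image_id TEXT, name TEXT)")

rows = [("ami-%04d" % i, "image %d" % i) for i in range(3000)]

# Without transactions, every INSERT is committed on its own,
# forcing a disk sync per row:
#
#     for row in rows:
#         conn.execute("INSERT INTO image VALUES (?, ?)", row)
#         conn.commit()
#
# Wrapping the loop in one transaction pays that cost only once:
with conn:  # opens a transaction, commits on success
    for row in rows:
        conn.execute("INSERT INTO image VALUES (?, ?)", row)

print(conn.execute("SELECT COUNT(*) FROM image").fetchone()[0])  # 3000
```

On an in-memory DB the difference is small; on a real on-disk SQLite file, where each commit means a sync, it is dramatic.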

If you want more performance, there are custom solutions for real bulk saving, and if all else fails, Django even lets you run raw SQL statements. django.db.transaction, though, should be a beautifully simple and effective solution for a lot of projects.
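As a taste of what "real bulk saving" looks like, here is a hedged sketch that hands all rows to the database driver in a single executemany() call, skipping per-object ORM overhead entirely. It uses the stdlib sqlite3 module for a self-contained example; inside Django you would obtain a cursor from django.db.connection instead, and the table name here is hypothetical.

```python
import sqlite3

# Hedged sketch of "real" bulk saving: one executemany() call instead
# of one ORM save() per object. Table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (image_id TEXT, name TEXT, favorite INTEGER)")

rows = [("ami-%04d" % i, "image %d" % i, 0) for i in range(3000)]

with conn:
    # One prepared statement, executed for every row in the batch.
    conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM image").fetchone()[0])  # 3000
```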


References
[1]: Famous quote by Donald Knuth, from “Structured Programming with go to Statements” (1974)


8 thoughts on “Bulk-like inserts with Django”

  1. Great post. I use transaction.commit_manually, which allows me to commit manually, since I am pushing a job onto the queue and the job starts by reading the object from the database.

  2. Miquel says:

    Thanks Yashh.
    Yes, there are other ways to use transactions. I left the rest out to keep the post concise.

    There are 4 possibilities.
    – Adding transactions to your middleware: ‘django.middleware.transaction.TransactionMiddleware’,
    All your views will only commit on success.
    – transaction.commit_on_success
    – transaction.commit_manually
    and, starting in Django 1.3 as a context manager:
    – with transaction.commit_on_success():

    Can you elaborate a bit on the way you use commit_manually and how it benefits queued jobs?

  3. So I work at pinterest.com, which manages a lot of images. When an image is uploaded / scraped, we compute one thumbnail and write it to the database. After the db write we put the job on the queue, then write to the activity table, and then return HTML / JSON.

    During this time I saw a bunch of errors in the worker log, where our task typically starts by reading the image from the database. You might be wondering why read from the database when you could just pass the data to the queue. We do a bunch of stuff on the image like resizing, short urls, emailing other users, updating counts etc., so having an ORM object is good. In this process I saw errors like “image does not exist”. I had to catch that error and retry the task after 15 secs. After digging in a bit I found that the task was being picked up before the transaction was committed. So once I write the image and activity to the db, I commit manually and then put the task onto the queue, which solved the error. Hope that helps. Let me know if you need any more insights.

  4. Thanks for sharing Yashh.

    Very interesting. In your case, you use transaction.commit_manually not to improve performance but to avoid a particular “race condition”.

    Great to know!
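    That commit-then-enqueue ordering can be sketched in plain Python. A list stands in for the job queue, sqlite3 for the DB, and all names are illustrative, not Pinterest's actual code: the point is only that the commit happens before the enqueue.

```python
import sqlite3

# Sketch of commit-then-enqueue. A plain list stands in for the job
# queue; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit mode; manage transactions by hand
conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, name TEXT)")
queue = []

def upload_image(name):
    conn.execute("BEGIN")
    cur = conn.execute("INSERT INTO image (name) VALUES (?)", (name,))
    image_id = cur.lastrowid
    conn.execute("COMMIT")  # commit first, so the row is visible...
    queue.append(image_id)  # ...before any worker can pick up the job

def worker(image_id):
    # The worker re-reads the object from the DB; because the commit
    # happened before the enqueue, the row is guaranteed to exist.
    return conn.execute(
        "SELECT name FROM image WHERE id = ?", (image_id,)
    ).fetchone()[0]

upload_image("pin.jpg")
print(worker(queue.pop(0)))  # pin.jpg
```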

  5. Matthew Silvera says:

    # transaction.commit_manually example
    # https://gist.github.com/740059

    import sys, traceback
    from django.core.management.base import BaseCommand
    from django.db import transaction

    class Command(BaseCommand):
        help = '''import lots and lots of stuff but don't lock the db'''

        @transaction.commit_manually
        def handle(self, *args, **options):
            # commit_manually belongs on handle(), not on the class,
            # so that the decorator wraps the actual work
            path = args[0]  # file to import, passed on the command line
            with open(path) as file_handle:
                count = 0
                for line in file_handle:
                    try:
                        # savepoint per line: one bad line only rolls
                        # back itself, not the whole pending batch
                        sid = transaction.savepoint()
                        dangerous_import(line)
                        transaction.savepoint_commit(sid)
                        count += 1
                        if count == 1000:
                            transaction.commit()  # flush every 1000 lines
                            count = 0
                    except Exception:
                        print('***Failed: %s' % line)
                        traceback.print_exc(file=sys.stdout)
                        transaction.savepoint_rollback(sid)
            transaction.commit()
    
  6. Cool! that command could come in handy. Great addition to the post too.

    PS: I took the liberty of adding tags to your code so that it keeps indentation.
