Faster Updates with MongoDB 2.4

You may not be ready to make the move to MongoDB 2.6 yet, but you may still have trouble getting the best performance from bulk update operations. MongoDB 2.6 has a new Bulk API, but use that API with a 2.4 server and it’ll be just as slow as not using the API at all. This means that slow updates, especially with a MongoDB replica set, are still a problem on MongoDB 2.4. At MongoHQ we have strategies for dealing with that performance problem.

The key to increasing update speed is that MongoDB gives you a lot of control over how database operations are acknowledged by a server, and more still when dealing with a replica set. For a standalone server, that ranges from not checking at all (no longer the default, thankfully), through acknowledgment that the operation has been acted on, up to confirmation that the operation has been written to the journal. In a replica set you can also require that the operation be replicated to a set number of servers, or to a majority of them. For everything but the no-checking option, the getLastError command is called with parameters that reflect how concerned the client is with the progress of the write – the write concern. The more concern, the longer the various write operations will take to complete or fail.
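As a minimal sketch in Ruby of those write concern levels, assuming the pre-2.6 1.x driver and using a placeholder URI, database and collection name:

require 'mongo'

client     = Mongo::MongoClient.from_uri("mongodb://example.host:27017") # placeholder URI
collection = client.db("test")["examples"]                               # placeholder names

collection.insert({ "a" => 1 }, { :w => 0 })              # unacknowledged: no getLastError at all
collection.insert({ "a" => 2 }, { :w => 1 })              # acknowledged by the server
collection.insert({ "a" => 3 }, { :w => 1, :j => true })  # confirmed as written to the journal
collection.insert({ "a" => 4 }, { :w => 2 })              # replicated to at least one secondary
collection.insert({ "a" => 5 }, { :w => "majority" })     # replicated to a majority of the set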

Bulk Insertion

MongoDB has been able to bulk insert records for some time. Bulk inserts reduce the waiting by attempting to insert all the documents at once; by default, if there’s an invalid document or some other error occurs, the rest of the inserts are discarded. By doing it all at once, though, there’s only one wait on getLastError. Bulk inserts are triggered by passing an array of documents to the insert method so, for example, in a test application we could create an array of simple documents and insert that array into the collection:

docarray = []

for i in 1..1000 do
    newdoc = { "myid" => i, "name" => "bar#{i}" }
    docarray << newdoc
end

# One insert call means only one wait on getLastError for all 1000 documents
@collection.insert(docarray)

The insert method returns a single ObjectId when it inserts a single document and so, when handed an array for a bulk insert, it returns an array of ObjectIds. There are two options available for handling errors. :continue_on_error defaults to false but, when set to true, if there is an error inserting a document from the array, the driver will continue processing the rest of the array. To discover which documents failed to insert, another option, :collect_on_error, aggregates the failed documents into a set which is also returned by the method.
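For example, a sketch of both options in use (note that with :collect_on_error set, the shape of the return value changes to include the collected failures):

# Keep going past bad documents and gather the failures as we go
result = @collection.insert(docarray,
                            :continue_on_error => true,
                            :collect_on_error  => true)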

Bulking Up Updates

But for updates, there is no such convenience. Each update operation will send a getLastError and wait for the result, and the higher the write concern, the longer those waits will be. There is a way to approximate the all-or-nothing bulk insert behavior, but this is where you need to understand your update and your data. To make use of this technique, you need to know that you can repeatedly apply your update and the resulting records will always be the same – that is to say, your update operations must be idempotent. If we look at some of the update operators MongoDB supports, we can see that the result of applying $set and $unset multiple times would always be the same, but the result of doing the same with $inc or $rename would not be. Array operators would not be idempotent either – multiple applications of $pop would, eventually, lead to an empty array, and $push would accept new values for the array forever (for limited values of forever). Also remember that if your update uses a query, the records found by that query could, unless it is very specific, vary over multiple calls, making the whole operation non-idempotent.
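To make the distinction concrete, here is a small illustration (ours, not from the benchmark code) of an idempotent and a non-idempotent update on one of the test documents:

# Running this $set twice leaves the document exactly as one run would: idempotent
@collection.update({ "myid" => 1 }, { "$set" => { "name" => "foo1" } })
@collection.update({ "myid" => 1 }, { "$set" => { "name" => "foo1" } })

# Running this $inc twice adds 2 rather than 1: not idempotent, unsafe to replay
@collection.update({ "myid" => 1 }, { "$inc" => { "counter" => 1 } })
@collection.update({ "myid" => 1 }, { "$inc" => { "counter" => 1 } })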

But, if your update operation is idempotent, then we can look at optimising the update. For simplicity, we’ll use that example insert to create a set of records, then look at the performance of various update strategies. The first thing we’ll do is open the database with a default write concern of 2. With a replica set, that means each operation has to be applied on the primary and confirmed on a secondary server.

require 'mongo'
@mongo_client = Mongo::MongoClient.from_uri(ENV["MONGOHQ_URL"], { :w => 2 })
@db         = @mongo_client.db("test")  # example database name
@collection = @db["bulktest"]           # example collection name

Now, we’ll create 1000 records as above and then update each one, one at a time.

@counts = 1000  # matching the 1000 documents inserted above

for i in 1..@counts do
    @collection.update({ "myid" => i }, { "$set" => { "name" => "foo#{i}" } })
end

And in our test, running from a Heroku instance, that takes nearly 5 seconds to complete, around 200 operations per second. If we take the write concern down to 1, purely by adding { :w => 1 } to the update options, performance improves slightly: 3.7 seconds, or 270 ops per second.
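For clarity, that write concern 1 run is the same loop with just the options hash added:

for i in 1..@counts do
    @collection.update({ "myid" => i }, { "$set" => { "name" => "foo#{i}" } }, { :w => 1 })
end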

Now though, we want to see the effect of a write concern of 0. This means no call to getLastError after each update, so we’ll make that call ourselves, once, at the end, and only then with a write concern of 2, like so:

# Fire off all the updates unacknowledged...
for i in 1..@counts do
    @collection.update({ "myid" => i }, { "$set" => { "name" => "foo#{i}" } }, { :w => 0 })
end

# ...then wait once for the whole batch to be applied and replicated
error = @db.get_last_error({ :w => 2 })

And if we run this benchmark code, it completes in 0.3 seconds of real time, which would equate to over 3000 operations per second. Obviously, with only a thousand records, that figure is somewhat inflated; if we push the number of records up to ten thousand, we get 100 ops/sec with w at 2, 126 ops/sec with w at 1, and 550 ops/sec with w at 0.
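As a sketch of how such a run might be timed (the Benchmark usage here is our illustration, not the original test harness):

require 'benchmark'

elapsed = Benchmark.realtime do
  for i in 1..@counts do
    @collection.update({ "myid" => i }, { "$set" => { "name" => "foo#{i}" } }, { :w => 0 })
  end
  @db.get_last_error({ :w => 2 })  # the single wait for the whole batch
end
puts "#{(@counts / elapsed).round} ops/sec"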

It’s a great improvement in update performance, but it comes with strings attached. You can only use the technique with an idempotent update because, if there is an error, you won’t know which updates failed and will have to reapply the entire set of operations. If the update were not idempotent, that replay would corrupt any data already updated.
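A sketch of what that replay logic might look like, checking the result of the final getLastError; the apply_all_updates helper is hypothetical, standing in for the w:0 update loop above:

# Hypothetical retry wrapper: safe to replay only because $set is idempotent
def apply_with_retry(attempts = 3)
  attempts.times do
    apply_all_updates                      # hypothetical: the w:0 loop shown above
    error = @db.get_last_error({ :w => 2 })
    return true if error["err"].nil?       # getLastError reports no error
  end
  false
end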

Of course, with MongoDB 2.6’s new Bulk API and write protocol, the most reliable way to get faster bulk updates is to move to MongoDB 2.6 and use that API. In the next part of this series, we’ll look at what 2.6 has to offer in detail and give examples of 2.6 bulk operations in action in various languages.

Written by Dj Walker-Morgan

Content Curator at Compose, Dj has been both a developer and writer since Apples came in ][ flavors and Commodores had Pets.

  • Zardosht Kasheff

    Interesting and useful hack. I wonder, with TokuMX, would a multi-statement transaction allow this optimization to extend to non-idempotent updates? I think you can run all the statements in a multi-statement transaction, commit the transaction, and then call getLastError with the appropriate write concern. getLastError will return when the entire transaction has satisfied the appropriate write concern.