Improve Performance by Removing Query Logic

You know what the following is?

db.blog_posts.find({status: {$in: ["draft", "waiting on proof", "approved"]}}).sort({_id: -1})

If you said “a slow query in MongoDB”, then you are correct. It violates the tenet that MongoDB is most efficient with just one operator per query. In MongoDB, $in and $sort are both operators. Only one operator can be used with an indexed query. MongoDB uses simple, single b-tree indexes and cannot find the union of two indexes.  For a quick brush-up on MongoDB performance, see MongoDB Indexing Best Practices.

Let’s assume the defined problem we are solving is finding the latest unpublished blog posts.  How can we rewrite the query to use just a single operator?

The Fix

The fix here is to move the logic from your query to your documents.  Instead of looking for an array of values in status, add a new attribute called ‘unpublished_created_at’. At the application level, if your document is one of the unpublished statuses, set the value of this new attribute to your created_at, else set to null if document is published. Then create a sparse index on ‘unpublished_created_at.’

Your new query would look like:

db.blog_posts.find().sort({unpublished_created_at: -1})

Why a date instead of a boolean?

Booleans have low cardinality (they have just two possible values: true and false).  Indexing a low-cardinality field gains us little. Booleans do not make good index-able attributes.  Indexes bring much more benefit when the indexed attribute can take on many values, like dates can.

We’ve combined the logic of “unpublished” and “_id: -1″ into one simple field. The logic in the query is pre-computed into the field, and the database does not re-compute on each query.  When optimizing for scale, the performance boost will be significant.

  • Alexio Cassani

    The nosql world requires a totally new way of thinking and design software domain models and articles like this one are very useful to increase the knowledge of doing it! Thank you guys|

  • Diego Woitasen

    I don’t agree at all with this post. It’s correct that if you have the possibility you have work on your documents to have simpler queries. But it’s possible to have an indexes query when you have two operators. I have that working in producction and I’m sure that works because explain() says that (o explain is laying to me).

    • mattparlane

      One problem with your method is that you need an index on {status:1,_id:-1}, as well as the mandatory index on {_id:1} — you’re effectively indexing on _id twice.

      With the method in this post, you only need an index on {unpublished_created_at: -1} (plus the mandatory index), so you have less redundancy.

      • Diego Woitasen

        Yes, I agree… but sometimes things get more complex than expected :)

    • John

      My guess is that the explain isn’t telling you what you think it is. It is most likely either telling you multiple indexes it used in the plans (but it only chose one of those plans) or it is telling you that the seconds field in the index it did [MIN, MAX] as the range which will not help the query plan at all, that is basically just telling that field to accept any value. The fact is that an index can’t be used to do ranges on multiple values.

  • Diego Mastalli

    The proposed solution can be useful in some circumstances to optimize and save resources but of course a compound index is a good and standard solution. The NoSQL word is a denormalized word and mongoDB use memory mappad files. MongoDB performance are really good when you have enough memory for your data and indexes. MongoDB comes out of the box with tools for scale reads and writes by replica and sharding. You must be careful on your schema desing (short fileds suggested by MongoDB and avoid indexes not useful) but finally you have to choose MongoDB for its horizontal scale capabailities.

  • Amrith Kumar

    Really? This seems like an awful lot of work for what should be a relatively simple query? Not to mention all the complexity of denormalization, anomalies, and maintaining that new boolean … http://bit.ly/15BC2kV

    • john page

      See Ilearnasigoalong.blogspot.co.UK for an explanation of how this problem and solution are exactly the same in an RDBMS and this isn’t a NoSQL issue.

      • Amrith Kumar

        I don’t believe your observations accurately reflect a general issue with RDBMS’, maybe just something to do with the testing you conducted.

        See complete response at http://bit.ly/16XvdrA

  • Alex

    This should just be a fix of this problem. Mongo needs a better handling of this!!

    • John

      This isn’t really a thing that needs to be fixed by mongodb. This is a general database and data model issue. You have the same issues in a relation database. You are doing 2 range queries and an index can’t be ordered on 2 ranges. So you have to find ways to either index the most cardinal thing or change the data model to fit the problem.