Improve Performance Using Separate Collections & Other Rants About Data Lifetimes

GTO

Last week, we proposed using a query schema to match your query requirements. While doing this, we moved query logic to the document by precomputing the fastest query possible. Readers’ responses ranged from “yep, that is how it is done” to “SQL is better, and that is crazy”. Go back and look at the comments.

With SQL, you had a guiding mantra: “store data once”. With NoSQL, there are many methods to solve the same problem, and without a guiding mantra, everyone chooses different routes. With a new week, we have a new solution. For our next solution, we are fixing the same example using separate collections for our data.

What was our prior example?

Last week, we proposed the unoptimized query:

> db.blog_posts.find({status: {$in: ["draft", "approved", "waiting for proof"]}}).sort({_id: -1})

The query finds unpublished blog posts sorted by _id. The query also violates MongoDB’s indexing constraints and fails hard at scale.

We are creating partitions of data

This blog posts example is simple to understand. It represents many types of data found in the real world. Fundamentally, our example has data partitions: published and unpublished. In the data’s lifetime, it moves through a publishing workflow. Any new document starts as something that will be modified and flexible, with high writes. The final state of the document requires it to be static and reads must be fast and responsive.

Blog Post Data Performance

The blog posts lifetime takes the data through different performance expectations. A simple publishing site with a few thousand posts does not hit the scale required to think about lifetime. On the other hand, when dealing with large publishing datasets, transitioning data through the lifetime is important for matching performance to expectations.

An unpublished blog post is not the same type of data as a published blog post. So, we can store it in another location.

Separate collections & separate databases

With MongoDB, and most other NoSQL databases, using separate databases is as easy as using separate collections. MongoDB cannot perform JOINS, so there is little gain from combining collections, other than organizational. Because of the nature of the published versus unpublished data, you could separate them not only by collection, but by database like the following:

fast_blog_posts.published – your fast, collection running on an SSD backed database, and replicated on a 3-node replica set.

slow_blog_posts.unpublished – your slower, bloated collection with ideas of failed attempts at publishing. As the premier posts rise to a published state, they are moved to the published collection. This can run on a single-server cost-efficient environment.

How does this solve the original problem?

The original problem was converting the following into a performant MongoDB query:

> db["fast_blog_posts"].published.find({status: {$in: ["draft", "approved", "waiting for proof"]}}).sort({_id: -1})

Instead of the above query for unpublished posts, you would actually run the following query:

> db["slow_blog_posts"].unpublished.find().sort({_id: -1})

This works because the separate collections partition data. The only data in the unpublished collection is unpublished blog posts. Previously, we were expecting a “status” key to partition the data.

MongoHQ has different databases for different needs. We have fast SSD-backed databases for the published blog posts, and we have standard performance databases for the unpublished blog posts. We have plans to match every data requirement.

  • Sagish

    Thanks for an educational post

  • Abishek.R.Srikaanth

    I get to learn something new every time. Thanks for this

  • Shaun Befort

    Is the newly published post COPIED or MOVED? If it’s copied, you now have data duplication with two copies of the same exact post. Which posts gets updated when a change needs to be made, the unpublished one or the published one? Sounds like an advertisement against NoSQL databases to me. With SQL, it’s simple. Just index two fields.