Schema-Less Is (Usually) a Lie

People are fond of categorizing trendy database engines by what they’re not … “schema-less” is just one example. Overly broad, negative definitions are not the most helpful thing in the world (my whiteboard is technically schema-less and NoSQL). “Schema-less” goes the extra mile: not only is it overly broad, it’s normally not even true. MongoDB users deal with schemas, and our customers frequently run into problems that need to be fixed with schema changes. As a result, we tend to think about schema in two parts: the “data schema” and the “query schema”.


The Query Schema

MongoDB-based applications have another flavor of schema, which we tend to call the “query schema”. The query schema exists in both the application code and the DB engine. Indexes on collections in MongoDB are very much “schema” and are one of the most important concepts to master for an application running at scale. While a flexible data schema can work well, a query schema should be mostly static and should ensure that queries match up to indexes as precisely as possible. Getting this portion of a schema wrong can result in all kinds of pain down the road, since indexes are expensive to build on large data sets and successful sharding needs a solid base to build on.
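To make “queries match up to indexes” a little more concrete, here is a minimal Python sketch of a simplified prefix rule: a compound index helps a query only as far as the query filters on the index’s leading fields. The function name, the field names, and the example index are all hypothetical, and a real query planner weighs much more (sort order, range predicates, selectivity), so read this as an illustration of the idea and not as MongoDB’s actual algorithm.

```python
from typing import Sequence, Set


def usable_index_prefix(index_fields: Sequence[str], query_fields: Set[str]) -> int:
    """Return how many leading index fields the query filters on.

    Simplified model (an assumption, not the real planner): a compound
    index is useful only for a prefix of its fields, so counting stops
    at the first index field the query does not filter on.
    """
    count = 0
    for field in index_fields:
        if field not in query_fields:
            break
        count += 1
    return count


# Hypothetical compound index on (customer_id, status, created_at).
index = ["customer_id", "status", "created_at"]

# Query filters on every indexed field: the whole index is usable.
full_match = usable_index_prefix(index, {"customer_id", "status", "created_at"})

# Query skips "status": only the leading "customer_id" part helps.
partial_match = usable_index_prefix(index, {"customer_id", "created_at"})

# Query omits the leading field: in this model the index does not help.
no_match = usable_index_prefix(index, {"status"})
```

In this simplified model a result of 0 means the query gets no help from the index, which is the kind of mismatch a mostly static query schema is meant to prevent.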

This is, incidentally, one of the reasons 10gen recommends scaling MongoDB vertically as much as practical. The need to shard early (less than 100GB of data) is normally an indication of a screwed-up query schema, and a sign that sharding will probably be incredibly painful.

Is MongoDB really schema-less?

There are very few databases that are actually schema-less. The term arguably works for Solr/Lucene since fields to be indexed can be defined at the “document” level, but nearly every other data store has a schema defined somewhere. The differentiator between databases is almost always which bits of the schema live in application code vs the database engine. Quality SQL databases have very strong schema support in the database engine. Dynamo-type DBs (Cassandra, Riak, etc.) mostly push data/query schema to the application level. MongoDB sits between the two and seems to have struck a balance that gives developers a lot of power.
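As a rough illustration of what it looks like when part of the data schema lives in application code, here is a small Python sketch that checks a document against a field map before it would be written. The schema, the field names, and the validate function are hypothetical and not taken from any particular driver or library; real applications would more likely lean on an existing validation layer.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical data schema kept in application code:
# field name -> (expected type, required?)
USER_SCHEMA: Dict[str, Tuple[type, bool]] = {
    "email": (str, True),
    "name": (str, True),
    "age": (int, False),
}


def validate(document: Dict[str, Any], schema: Dict[str, Tuple[type, bool]]) -> List[str]:
    """Return a list of schema violations; an empty list means the document passes."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in document:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(document[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors


# A document that satisfies the schema (optional "age" omitted).
ok_errors = validate({"email": "a@example.com", "name": "Ada"}, USER_SCHEMA)

# A document missing "name" and carrying "age" as a string.
bad_errors = validate({"email": "a@example.com", "age": "41"}, USER_SCHEMA)
```

The database engine would happily store either document; the schema here exists only because the application enforces it.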

  • William P. Riley-Land

    It tickles me when I hear that servers built on Mongo are schema less. Mongoose is great at defining and enforcing schemata in the application layer, for example.

    • mark

      Stay away from Mongoose. You’ll save yourself some headache. Worst decision our team has ever made (that along with, don’t believe all the github stars). Use the official node mongodb driver which is FASTER and simpler.

      • William P. Riley-Land

        My friend, Mongoose is awesome. Where query performance is suffering, it’s quite easy to use the “lean” option to pretty much match native performance. That being said, I wouldn’t say most query sizes are large enough to noticeably affect performance for most applications. (100s of documents per page load is not an issue, in my experience.)

        In any case, Mongoose provides extra features over the native driver, and of course those features have overhead, but Mongoose provides plenty of ways to bypass its extra features selectively, like “Model::update,” for example.

        And the whole concept of middleware for your data layer is killer.

        • Mike McNeil

          Agreed, lifecycle hooks = awesome. Mongoose (as well as the best parts of Hibernate) has had a strong influence on the next generation of Waterline (the ORM for Sails).

      • Mike McNeil

        I concur with William. Also, has worked a treat for us and our customers (although I get your pain– it’s important to normalize the socket syntax with the controllers you use to handle HTTP code)

  • Jason Crawford

    Yes. A database can have a rigid schema or a flexible one, and it can be implicit or explicit, but there’s always a schema.

  • Joe

    Good comments all around!