Many databases have a wealth of “knowledge” that can be unearthed with a little bit of digging. If you have a database with 5GB of data, there’s not much to it, just bring up a copy of the database, perform your analysis, and tear it back down. But what if you have hundreds of GB you want to analyze? On demand copies are prohibitively slow to get going, and you really don’t want to hammer an active production database with long running queries, map reduces, etc.
We’ve seen customers tackle this problem a few different ways. Depending on your data set one of these may work for you … most of these come down to “maintaining a fresh copy of the data”:
Develop an incremental ETL process
ETL means “Extract, Transform, Load”, and is a generic acronym for the process of pulling data out of your canonical data store (probably the MongoDB instance backing your transactional application, if you’re reading this), transforming the data into an analytics appropriate schema, and loading it into a separate database.
This kind of setup requires some development work, but it’s a great excuse to really think hard about the structure and value of the data you’re collecting. At a high level, coming up with a decent ETL process means thinking hard about the life cycle of various entries in your DB, and determining how to “capture” changes to your live data for later processing.
With Mongo, the best way of tracking writes will depend largely on the workload. You can do anything from marking documents as dirty, to querying over ranges of ObjectID values, to monitoring the built in oplog with a tool like MongoRiver. Many ETL processes work better with the simpler batch loads since transformations and writes can be intensive. Sometimes, though, it makes sense to manage a live synced transformation of your data.
Tracking writes will minimize the hit against the production DB during the “Extract” portion of the process. Rather than querying the full database every time you need to run some analysis, you can save yourself some pain by only transferring the documents that have changed. If you can flag documents as being dirty with an ‘updated_at’ field and a ‘deleted_at’ field, then you can run incremental updates of your analytics database after you take one full snapshot. If you want to get fancy, you can store this changeset information in something non-mongo as well, i.e. keep a list of changed document ids in redis etc to reduce the performance impact.
Periodically refresh an analytics DB (mongodump / mongorestore)
MongoDB’s dump / restore process will do incremental writes in some scenarios, which makes it a reasonable choice for maintaining an “offline” copy of your data and refreshing it periodically. If you have regular mongodumps (from a backup system, for example), this may be a good option. It’s workable on just about any database, including sharded clusters, though running a new dump is slow and can impact production DB performance.
This does have the upside of being a very simple solution, and may be an “easy win” with properly configured backups. And to ease the performance problems you can target specific collections in order to get a subset of your data.
Run a dedicated slave for reporting
Some analytics queries should be run against an always-in-sync secondary replica set member. You can lessen the impact on your performance either by using your current secondary, or creating a specific secondary member (This gets complicated if you are sharded, as you will have to do this for each shard). This is rather hacky, and makes it difficult to add extra indexes for easier analytics. It is, however, the least developer intensive option since it relies on normal MongoDB features.
Schema: Build the document you want to query
With the right kind of Schema, MongoDB is pretty good at analytics. If you do take the time to figure out an ETL process, you should think hard about how you want to query your docs. We covered this last year in First Steps of an Analytics Platform with MongoDB.
Bonus option: Use Hadoop with one of the above
Consider using the MongoDB Hadoop Connector with one of the above strategies. You can run the connector against an analytics slave for source data, or run against online members with well optimized indexes. You can also use Hadoop as part of the ETL process, mentioned above, rather than writing everything from scratch. If you have in-house Hadoop expertise this is worth investigating.
There are a few options, and this is a question we get asked about fairly frequently. You can learn surprising things about your data if you can analyze it well, and that’s an important problem to think about as your data grows.