Many of you know I publish a newsletter monthly. One thing I love about it is that after almost a decade of writing it regularly, the list has grown considerably. And I’m always surprised at how many former colleagues are actually reading it.
So that is a really gratifying thing. Thanks to those who are, and if you’re not already on there, sign up here.
Join 28,000 others and follow Sean Hull on twitter @hullsean.
Recently a CTO & former customer of mine reached out. He asked:
“I’m interested to hear your thoughts on the pros and cons of using a json column to embed data (almost like a poor-man’s Mongo) vs having a separate table for the bill of materials.”
Interesting question. Here are my thoughts.
1. Be clean or muddy?
In my view, these types of design decisions are always about tradeoffs.
The old advice was to normalize everything to start with. Then, as you’re performance tuning, denormalize in special cases where it’ll eliminate messy joins. Those special cases then also need handling at insert & update time, since you’d have duplicated data.
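To make that tradeoff concrete, here’s a minimal sketch of the two shapes, using SQLite as a stand-in relational store. The table and column names are made up for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Normalized: the customer name lives in one place; reads need a join.
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (100, 1, 49.99);
""")
row = db.execute("""
    SELECT c.name, o.total FROM orders o
    JOIN customers c ON c.id = o.customer_id WHERE o.id = 100
""").fetchone()
print(row)  # ('Acme', 49.99)

# Denormalized: duplicate the name into orders to eliminate the join.
# Faster reads, but every customer rename now has to touch two tables.
db.executescript("""
    CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY, customer_name TEXT, total REAL);
    INSERT INTO orders_denorm VALUES (100, 'Acme', 49.99);
""")
row = db.execute(
    "SELECT customer_name, total FROM orders_denorm WHERE id = 100"
).fetchone()
print(row)  # ('Acme', 49.99)
```

Same answer either way; the difference only shows up at scale and at write time.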
NoSQL & Mongo introduce all sorts of new choices. So too does Postgres with JSON column data.
We know that putting everything in one table will be blazingly fast, as you don’t need to join. Reads will cache cleanly, and hopefully be based on a single ID or a small set of ID lookups.
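That single-ID pattern, with the bill of materials embedded as a JSON blob, might look roughly like this. Again SQLite stands in, and the schema and keys are hypothetical:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, doc TEXT)")

# The whole bill of materials rides along in one document.
bom = {"name": "widget", "parts": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
db.execute("INSERT INTO products VALUES (?, ?)", (1, json.dumps(bom)))

# A single indexed lookup by primary key -- no joins involved.
doc = json.loads(db.execute("SELECT doc FROM products WHERE id = 1").fetchone()[0])
print(doc["parts"][0]["sku"])  # A1
```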
2. Go relational
For example you might choose MySQL or Postgres as your datastore, and use it for what it’s good at. Keep your data in rows & columns, so you can later query it in arbitrary ways. That’s the discipline up front, and the benefit & beauty down the line.
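The payoff of rows & columns is that later, ad-hoc questions are just SQL. A sketch, assuming a made-up `bom` table (SQLite as stand-in):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bom (product_id INTEGER, sku TEXT, qty INTEGER);
    INSERT INTO bom VALUES (1, 'A1', 2), (1, 'B7', 1), (2, 'A1', 4);
""")

# An "arbitrary" question nobody planned for at design time:
# total demand for each part across all products.
rows = db.execute(
    "SELECT sku, SUM(qty) FROM bom GROUP BY sku ORDER BY sku"
).fetchall()
print(rows)  # [('A1', 6), ('B7', 1)]
```

With the parts embedded in a JSON blob instead, a report like this means unpacking every document first.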
I would shy away from the NoSQL add-ons that some relational vendors have added, to compete with their newer database cousins. This starts to feel like a fashion contest after a while.
3. Go distributed
If you’d like to go the NoSQL route, you could choose MongoDB, for example. You’ll gain advantages like being distributed out of the box, eventual consistency, and easy coding & integration with applications.
The downside is you’ll have to rearrange and/or pipeline the data into a relational store or warehouse (Redshift?) if & when you need arbitrary reports from it. For example, there may be new reports & ways of slicing & dicing the data that you can’t foresee right now.
4. Hazards of muddy models
Given those two options, I lean away from muddying the waters. My feeling is that features like JSON blobs in Postgres, and the memcached plugin in MySQL, are things the db designers added to compete in the fashion show with the NoSQL offerings, and still keep you in their ecosystem. But those interfaces within the relational (legacy?) databases are often cumbersome and clunky compared to their NoSQL cousins like Mongo.
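Here’s a taste of that clunkiness: reaching inside a JSON blob from SQL means path expressions rather than plain columns. This sketch uses SQLite’s `json_extract` (it assumes a SQLite build with the JSON1 extension, which ships with modern Python); Postgres has its own `->`/`->>` operators in the same spirit:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, doc TEXT)")
db.execute("""
    INSERT INTO products VALUES
    (1, '{"name": "widget", "parts": [{"sku": "A1", "qty": 2}]}')
""")

# Pulling one value out of the blob takes a JSON path expression --
# workable, but clunkier than either a real column or a Mongo query.
qty = db.execute(
    "SELECT json_extract(doc, '$.parts[0].qty') FROM products WHERE id = 1"
).fetchone()[0]
print(qty)  # 2
```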
5. Tradeoffs of isolation
Daniel Abadi and Jose Faleiro published an interesting article on a closely related topic, “Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item.”
The upshot is that in databases you can choose *TWO* of these three characteristics: fairness, isolation & throughput.
Relational databases sacrifice throughput for fairness & isolation. Distributed databases sacrifice isolation to bring you gains in throughput & horizontal scalability of writes.
That’s a lot of big words to say one simple thing.
You can’t have it both ways.