Should we be muddying the relational waters? Use cases for MySQL & Mongodb

muddy sewer tunnels

Many of you know I publish a newsletter monthly. One thing I love about it is that after almost a decade of writing it regularly, the list has grown considerably. And I’m always surprised at how many former colleagues are actually reading it.

So that is a really gratifying thing. Thanks to those who are, and if you’re not already on there, signup here.

Join 28,000 others and follow Sean Hull on twitter @hullsean.

Recently a CTO & former customer of mine reached out. He asked:

“I’m interested to hear your thoughts on the pros and cons of using a json column to embed data (almost like a poor-man’s Mongo) vs having a separate table for the bill of materials.”

Interesting question. Here are my thoughts.

1. Be clean or muddy?

In my view, these type of design decisions are always about tradeoffs.  

The old advice was normalize everything to start off with.  Then as you’re performance tuning, denormalize in special cases where it’ll eliminate messy joins.  The special cases would then also need to be handled at insert & update time, as you’d have duplication of data.

NoSQL & mongo introduce all sorts of new choices.  So too Postgres with json column data.  

We know that putting everything in one table will be blazingly fast, as you don’t need to join.  So reads will be cached cleanly, and hopefully based on single ID or a small set of ID lookups.  

Also: Is the difference between dev & ops a four-letter word?

2. Go relational

For example you might choose MySQL or Postgres as your datastore, use it for what it’s good at.  Keep your data in rows & columns, so you can later query it in arbitrary ways.  That’s the discipline up front, and the benefit & beauty down the line.

I would shy away from the NoSQL add-ons that some relational vendors have added, to compete with their newer database cousins. This starts to feel like a fashion contest after a while.

Related: Is automation killing old-school operations?

3. Go distributed

If you’d like to go the NoSQL route, for example you could choose Mongodb. You’ll gain advantages like distributed-out-of-the-box, eventually consistent, and easy coding & integration with applications.

Downside being you’ll have to rearrange and/or pipeline to a relational or warehouse (redshift?) if & when you need arbitrary reports from that data.  For example there may be new reports & ways of slicing & dicing the data that you can’t forsee right now.

Read: Do managers underestimate operational cost?

4. Hazards of muddy models

Given those two options, I’m erring against the model of muddying the waters.  My feeling is that features like JSON blobs in Postgres, and the memcache plugin in MySQL are features that the db designers are adding to compete in the fashion show with the NoSQL offerings, and still keep you in their ecosystem.  But those interfaces within the relational (legacy?) databases are often cumbersome and clunky compared to their NoSQL cousins like Mongo.

Also: Is the difference between dev & ops a four-letter word?

5. Tradeoffs of isolation

Daniel Abadi and Jose Faleiro published an interesting article on a very related topic Why MongoDB, Cassandra, HBase, DynamoDB, and Riak will only let you perform transactions on a single data item.

The upshot is that in databases you can choose *TWO* of these three characteristics. Fairness, Isolation & Throughput.

Relational databases sacrifice throughput for fairness & isolation. Distributed databases sacrifice isolation to bring you gains in throughput & horizontal scalability of writes.

That’s a lot of big words to say one simple thing.

You can’t have it both ways.

Also: Is the difference between dev & ops a four-letter word?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

  • Akmal Chaudhri

    I have been tracking NoSQL databases for several years, collecting publicly available data on skills and vendors. The NoSQL market is still tiny. Considerations and summary of data in Section 2 of this very large slide deck: Slides regularly updated with new data as I find it.

    • Sean Hull

      Thx Akmal. Great deck. Did you author all of those titles on the bookshelf slide? Quite impressiive!

      • Akmal Chaudhri

        Sean, I was editor of those books. However, I did co-author an IBM red book a few years ago.

        • Sean Hull

          Impressive none the less.

          Do you have a blog, Akmal?

          • Akmal Chaudhri

            Sean, unfortunately I don’t. I contribute to the NoSQL forum on LinkedIn and many Facebook forums on Big Data and NoSQL (mainly in Asia). I write occasionally for O’Reilly Radar: And several longer reports for O’Reilly will be published soon. I may start a blog, but feel other people, such as yourself, have more interesting things to say 🙂

          • Sean Hull

            Thx Akmal. I do appreciate that.

            Wel do you have any topic ideas? Love to hear those.

          • Akmal Chaudhri

            Sean, one area that would be really good to cover, IMHO, is energy/efficiency. What cost savings can be achieved or not? For example, I may need 300 servers in order to achieve the throughput I am looking for with product A and 50 with product B. What impact does this have on my running costs, such as electricity, air con, so on? Would be interesting to see something like this. Do you know of any such detailed studies in the NoSQL space? Thanks.

          • Sean Hull

            I don’t know of such a study.

            In the physical server world, I can imagine measuring that. But I’m wondering how Amazon might go about that with it’s cloud virtualization.

          • Akmal Chaudhri

            Ok Sean. No worries. ‘ll reach out to you on LinkedIn and stay connected, so if you hear of something. Cheers.