Tag Archives: nosql

5 ways startups misstep on scalability


Join 27,000 others and follow Sean Hull on twitter @hullsean.

1. Ignoring the database

Yes, your internet site sits on top of a database. Have you forgotten to take care of it?

Like a garden, it must be watered & tended, and a gardener will always scold you for leaving plants to wither. As a database administrator by background, I’ve seen a lot of this. Truth be told, a neglected database is in large part the cause of slowness.


Are you writing your own queries? If you’re using an ORM middleware, you may be leaving this heavy lifting to a library. The queries these libraries generate are often inefficient. Avoid the Vietnam of computer science.


Now that you’re writing SQL, are you testing & tuning each query? Think of it like turning off apps on your phone that you’re not using. It saves memory, battery, and general headache.
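One way to test a query is to ask the database for its execution plan before and after adding an index. Here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for your production database (MySQL and Postgres offer their own EXPLAIN). The `users` table, the `idx_users_email` index and the `query_plan` helper are all made up for illustration.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable detail text.
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"

# Without an index the lookup has to scan the whole table.
before = query_plan(conn, lookup, ("user42@example.com",))

# With an index on the filtered column the plan switches to an index search.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = query_plan(conn, lookup, ("user42@example.com",))

print(before)
print(after)
```

The exact wording of the plan varies by SQLite version, but the first plan should report a scan and the second should mention the index by name.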


Now that you’ve gotten in the habit of caring for your database, keep at it. Monitor its health regularly. Use NewRelic as a service, or Cacti, Ganglia or Collectd if you’d like to roll your own. Real data can reap real benefits.

Read: Are SQL Databases Dead

2. Shortage of caching

You’ve heard it before, we’ll say it again, make sure you’re caching. But where?

Content Network
Amazon has its CloudFront, Rackspace uses Akamai. There are many choices but the results are the same. Static assets such as html & css files, images & video content, all part of the page you are serving, get dished out closer to your users. It’s like only asking them to go to a corner deli for a soda, rather than the closest supermarket.

Webserver tier

There are many things you can do to cache at the webserver. In particular you can configure it to tell browsers to cache objects, for example with the Cache-Control header. Setting a longer time-to-live there means browsers keep objects instead of fetching them again on every visit. You can always expire them manually. You can also compress objects before sending them. See How to cache websites & boost speed.
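To make that concrete, here is a rough sketch of the headers a webserver might attach to a static asset. In practice you would set this in your Apache or nginx configuration rather than in application code. The `static_response` function and the one-day default are my own assumptions, and the sketch assumes the browser has said it accepts gzip.

```python
import gzip

def static_response(body: bytes, max_age_seconds: int = 86400):
    """Build headers and a compressed payload for a cacheable static asset."""
    compressed = gzip.compress(body)
    headers = {
        # Lets browsers and CDNs reuse the object for max_age_seconds.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Tells the browser the payload must be gunzipped before use.
        "Content-Encoding": "gzip",
        "Content-Length": str(len(compressed)),
    }
    return headers, compressed

css = b"body { margin: 0; padding: 0; }\n" * 200
headers, payload = static_response(css)
print(headers["Cache-Control"], len(css), "->", len(payload), "bytes")
```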

Between webserver & database

Are you using Memcache or Redis? Caching here can reduce load on your database by as much as 10x. That’s like getting 10 servers for free, or one large one that costs 10x the price!

Most languages such as PHP provide libraries to interact with memcache. Whenever you make a call out to your database, first check memcache. If you find your key, fetch the value & done. Otherwise grab the answer from the database, and pop it into memcache.
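That check-the-cache-first flow can be sketched in a few lines. This is only an illustration: `DictCache` is a hypothetical in-process stand-in for a real memcache client, and `fetch_user` and `load_from_db` are made-up names.

```python
class DictCache:
    """In-process stand-in for a memcache client (hypothetical, get/set only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

def fetch_user(user_id, cache, load_from_db):
    """Check the cache first; on a miss, query the database and cache the answer."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is not None:
        return value  # cache hit: the database is never touched
    value = load_from_db(user_id)
    cache.set(key, value)
    return value

db_calls = []

def load_from_db(user_id):
    # Stands in for a real query; records each call so we can count them.
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}

cache = DictCache()
first = fetch_user(7, cache, load_from_db)
second = fetch_user(7, cache, load_from_db)
print(len(db_calls))  # the second lookup is served from the cache
```

A real deployment would also set an expiry on each key and delete or refresh the key whenever the underlying row changes.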

At the database

Databases of all kinds, be they Postgres, Oracle, or MySQL, cache data in memory, and some also offer a query cache (MySQL did until it was removed in version 8.0). If yours has one, be sure you’ve enabled & tuned it. Also check that your buffer cache is sizeable enough to fit your most frequently hit data. A hit ratio may provide you a cheap guesstimate on this.
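As a rough illustration, a buffer cache hit ratio is just the share of read requests that never had to go to disk. On MySQL’s InnoDB the two counters usually used for this are Innodb_buffer_pool_read_requests and Innodb_buffer_pool_reads. The function name and the sample numbers below are made up.

```python
def buffer_pool_hit_ratio(read_requests, disk_reads):
    """Fraction of logical reads served from memory, or None if there were no reads."""
    if read_requests == 0:
        return None
    return 1.0 - disk_reads / read_requests

# Hypothetical counter values, e.g. as read from SHOW GLOBAL STATUS.
ratio = buffer_pool_hit_ratio(read_requests=1_000_000, disk_reads=50_000)
print(f"{ratio:.1%}")
```

Treat the number as a hint rather than a verdict. A high ratio can still hide a handful of very slow queries.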

Related: Why a four letter word divides dev and ops

3. Missing metrics collection

In a recent article Why Scalability is big business I talked about collecting metrics. These are invaluable.

If you’re a home owner or renting, and want to know what you spent on energy in the past year, what do you do? You look at your heating bills for the winter months. Similarly, collecting real data with a tool like Cacti, or a service like NewRelic, allows you to do the same thing with your servers & infrastructure.

Real hindsight and real visibility help everyone, from operations teams to business units evaluating past problems.

Also: Why a killer title can make or break your content efforts

4. Not building feature flags

Tractor trailers use two tires on every axle. If one fails, you are still on the road. Planes use redundant engines. Having switches built into your application to turn off non-essential features may seem abstract when your deadlines for features are looming.

But operational switches for your devops team should be seen as good foundation, and solid bedrock to build on. It means you can do the maintenance that you will need to do, and do it without interrupting customers. It also means when your site gets hammered, and we hope that day will come, you can adjust the dials, and not go down.
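A feature flag can be as simple as a lookup the application consults before doing optional work. Here is a minimal sketch. The `FeatureFlags` class, the `recommendations` flag and the page structure are all hypothetical, and a real setup would load flags from a config store that operations can change without a deploy.

```python
class FeatureFlags:
    """Tiny in-memory flag store; unknown flags default to off."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled

def render_product_page(flags, product):
    """Always render the essentials; add expensive extras only when switched on."""
    page = {"title": product["title"], "price": product["price"]}
    if flags.is_enabled("recommendations"):
        # Non-essential and costly: the first thing to turn off under heavy load.
        page["recommendations"] = ["related-1", "related-2"]
    return page

flags = FeatureFlags({"recommendations": True})
product = {"title": "Widget", "price": 9.99}

normal = render_product_page(flags, product)
flags.set("recommendations", False)  # the site is getting hammered
degraded = render_product_page(flags, product)
print(sorted(normal), sorted(degraded))
```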

Related: Is Amazon RDS Difficult to Manage

5. Building on a single database

Various NoSQL databases like MongoDB, Cassandra & HBase are distributed out of the box. Keep in mind, though, that they make various tradeoffs to achieve this.

Meanwhile the vast majority of web applications are still built on reliable relational databases. But they don’t scale seamlessly. Build a read-only mode into your application and you’ll thank yourself for years to come. This means your users can still browse, even while the master database is offline. What’s more, it means you can scale more easily.

Avoid solutions that try to scale writes across multiple servers. Partitioning, aka sharding, is terribly complex to get right, both in planning & layout. Let’s not forget backups: how do we piece together a puzzle of 8 shards, with 8 pieces to every backup? It’s a recipe for trouble. There are some new cluster options for MySQL, such as Galera. Oracle has its own take. But in the end you’ll do better to get a bigger box for your central datastore and keep it central.

Related: How to Deploy on Amazon with Vagrant

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

What is eventually consistent and will it work for you?


In the tech world we’re fond of inventing all sorts of technical terms that are, admittedly, kind of confusing.


I recently attended an excellent talk called Data at Scale, part of the Database Month series that Eric Benari hosts. In it Mark Uhrmacher presented some phenomenal solutions which worked for flash sale site ideeli. They allowed the company to support its incredible business model, where 15% of traffic would happen in 15 minutes every day. He called it a “self-imposed denial of service attack”, an interesting analogy.

What occurred to me though, is that a lot of companies and startups struggle to understand which database solutions will work for them, and what the strengths and weaknesses of each are, and further what tradeoffs they’ll grapple with.

One concept that we hear a lot is “eventually consistent”. Many of the new NoSQL databases achieve their speed & availability this way. But what’s it all about?

Let’s change a smartphone contact

I’m sure you have a smartphone in your pocket, and for demonstration’s sake I’ll use the iPhone configured with iCloud.

Let’s go ahead and dial up your *OWN* contact card. Click “Edit” and go ahead and change something. Let’s change your title to “rock star”. Now click “Done”. We’ll wait a minute. Now go to your desktop and open up Contacts. Scroll through to your contact and verify that the Title field now shows “rock star”.

How does all this happen? When you click the “Done” button, the iPhone sends changes up to iCloud. iCloud then lets your laptop know a change has happened and the two then sync up.

Now let’s run through the same exercise, but change it in two places. We’ll change the smartphone contact to “Founder” and the desktop Contacts record title to “Consultant”. Wait a little bit and you’ll notice they will both eventually show “Consultant”.


How long were laptop & phone out of sync?

As you probably noticed, iCloud seems to lean in favor of the desktop client. It’s not clear to me what rules it uses here, nor does it seem to be configurable. Nevertheless, eventually both the desktop and smartphone will have the same contact card for you. Quite a feat of magic!

Read: Why high availability is so very hard to deliver

Handling collisions

There is only one *YOU* and presumably your digital rolodex reflects that too. You have one and only one contact card. Or do you? As far as these digital tools are concerned there are actually THREE! One on your desktop, one in iCloud and one on your phone. Each time you make a change in any of those places, it syncs *UP* to iCloud and then down to the other devices.

Collisions happen if you make changes in two places. Imagine if you’re a road warrior and your laptop was offline for some days, or your smartphone for that matter. In those cases that syncing would happen much later, and collisions more likely.
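One common way to settle a collision is “last write wins”: keep whichever change carries the newest timestamp. I don’t know what rules iCloud actually applies, so treat the sketch below as an assumption rather than a description of Apple’s sync. The `Edit` record and the timestamps are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    device: str
    title: str
    timestamp: float  # seconds on some shared clock (an assumption in itself)

def resolve(edits):
    """Last-write-wins: keep the edit with the newest timestamp."""
    return max(edits, key=lambda e: e.timestamp)

phone = Edit("phone", "Founder", timestamp=100.0)
desktop = Edit("desktop", "Consultant", timestamp=101.5)

winner = resolve([phone, desktop])
print(winner.device, winner.title)
```

Notice that the losing edit simply disappears. That is fine for a job title and much less fine for data you cannot afford to lose.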

Related: Why the twitter IPO made a shocking admission on scalability

In the high frequency world of online databases

With online databases, all of this becomes vastly more complex. Web based applications may have 100,000 simultaneous users. Some may be coming from EMEA while others come from the Americas. It gets pretty darn complex when you have databases in each of those regions.

We deploy applications this way so that one datacenter, say the one in the East Coast region, can fail while all the others still operate. They can still change data, read and write, without being impacted by the New York outage.

Once that datacenter is restored, the databases will then sync up and reconcile missing data.


MariaDB and Amazon RDS read replicas

MySQL and its variants, such as MariaDB, Percona and Amazon RDS, can do something like this with read-replicas. The read-only copies of the database are asynchronous and take time to catch up to changes. You can have the read-only copies in different regions.

This improves availability for browsing your application, but not for making changes. In other words MySQL can use this method to scale reads but not writes. That’s why I recommend your applications also support a browse-only mode, so that browsing stays available even if your authoritative master dies.
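Here is a toy sketch of how an application might route reads and writes so that browsing survives a master outage. Everything in it is hypothetical: `Database` is a dictionary standing in for a real connection, and replication is simulated with an explicit `replicate()` call.

```python
class ReadOnlyMode(Exception):
    """Raised when a write is attempted while the master is unavailable."""

class Database:
    """Toy stand-in for a database connection (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.online = True

class Router:
    def __init__(self, master, replica):
        self.master = master
        self.replica = replica

    def read(self, key):
        # Reads prefer the replica, so browsing survives a master outage.
        source = self.replica if self.replica.online else self.master
        return source.rows.get(key)

    def write(self, key, value):
        if not self.master.online:
            raise ReadOnlyMode("master offline: site is in browse-only mode")
        self.master.rows[key] = value

    def replicate(self):
        # Asynchronous replication, simulated as an explicit catch-up step.
        self.replica.rows = dict(self.master.rows)

router = Router(Database("master"), Database("replica"))
router.write("title", "rock star")
stale = router.read("title")   # replica has not caught up yet
router.replicate()
fresh = router.read("title")   # now the change is visible

router.master.online = False   # the master dies
still_readable = router.read("title")
try:
    router.write("title", "Founder")
    write_blocked = False
except ReadOnlyMode:
    write_blocked = True
print(stale, fresh, still_readable, write_blocked)
```

The stale read before `replicate()` runs is the replica lag described above, in miniature.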

Although you can try to do the same for writes by sharding your MySQL instances, this starts to get very messy very fast. Imagine backing up 10 shards, 10x the complexity, and even more when you want to go and do a restore.

Read: Why devops talent is in short supply

Amazon’s DynamoDB

Amazon’s DynamoDB is a technology based around the original Dynamo whitepaper, which attempts to solve a whole class of problems by relaxing strict consistency in favor of eventual consistency.

What you get is more availability, because it’s hard for the whole cluster to go down. That’s great for applications because they can continue to operate if one or more nodes fail. It also scales writes, which is a sort of holy grail in the database world as it’s typically hard to do.

But remember all this comes at a cost. Traditionally scaling writes is hard to do because all changes are kept in one place. You maintain a single authoritative master. If you want to imagine why this matters, think back to our smartphone example. We changed our contact card on our phone and our desktop at the same time. One of those two changes won the battle. But that’s a case where we’re not overly concerned.

If you imagine a bank doing the same thing, and you wire $1000 via phone and desktop, you can quickly see that there is a whole class of applications that won’t be happy with eventually consistent. Your web application may be one of those. Or it may not. Consider carefully before you go with Amazon RDS or DynamoDB as your datastore.
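The bank scenario is easy to simulate. In this sketch, which is purely illustrative, two replicas each hold their own copy of a $1000 balance and each approves a $1000 wire before they have had a chance to sync. The `Account` class is made up.

```python
class Account:
    """One copy of an account balance, checked only against local state."""

    def __init__(self, balance):
        self.balance = balance
        self.wires = []

    def wire(self, amount):
        if amount > self.balance:
            return False  # insufficient funds as far as this copy knows
        self.balance -= amount
        self.wires.append(amount)
        return True

# Eventually consistent: phone and desktop talk to different replicas.
phone_replica = Account(1000)
desktop_replica = Account(1000)
phone_ok = phone_replica.wire(1000)
desktop_ok = desktop_replica.wire(1000)
total_wired = sum(phone_replica.wires) + sum(desktop_replica.wires)

# Single authoritative master: both requests hit the same copy.
master = Account(1000)
first_ok = master.wire(1000)
second_ok = master.wire(1000)

print(total_wired, first_ok, second_ok)
```

With two replicas both wires go through, so $2000 leaves an account that held $1000. With a single master the second wire is refused.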

Read: Why startups need more than great developers to achieve scalability


A handy guide for PHP and MongoDB Web Development

What makes a beginner’s guide handy is when it speaks to your intuition. It anticipates the burning questions that follow from a newbie trying to grasp new concepts and it quickly answers them. PHP and MongoDB Web Development – Beginner’s Guide is one such guide.

I hadn’t heard of Packt Publishing or Rubayeet Islam before picking up this title and I must say I’m impressed. Packt is based in Birmingham, with offices in Mumbai, and part of its business model is to give part of the royalties earned from its books to the open source projects they cover.

I already had a working knowledge of MongoDB, mostly from an operational perspective. If you are new to MongoDB you’ll certainly appreciate how this book is structured. It cuts to the chase, diving right into the nuts and bolts of installing the pieces you’ll need such as database and drivers, and getting your first application running.

From there they take you through a basic web application step-by-step, with chapters on session management, MapReduce and GridFS. Every time I flip through the pages of a technical book, I find I always have questions in the background: ‘What about performance?’ or ‘How do I troubleshoot these pieces as I’m building them?’

What I liked about this book is that almost as quickly as I’d formulate some question about performance, I’d happen upon answers in the book, as if it knew what would come to my mind at each point of the

I was thinking about tuning and application performance and then found chapter 9 which discusses MongoDB’s explain facility, similar to that of MySQL. From there they cover index creation, hints, and finally profiling. These are all important topics for a developer, ones that he or she should have in mind while building their applications. So I was happy to see good coverage of that even in a self-avowed beginners’ guide.

Building apps that talk to both MySQL and MongoDB

Another interesting chapter was one introducing the idea of building an application that can talk to both MySQL and MongoDB and using those two datastores for different purposes. Again while I’m reading it I start thinking about operational concerns, and I start asking how one would support such an architecture. And then just like clockwork, Islam answers that very question.

He explains the challenges around data consistency and operational support in detail. It’s a great way to introduce a topic without necessarily pushing that adoption per se. Islam is clearly an experienced programmer, with much reasoned advice to share.

The book had great utility but I do have a few complaints.

First off the font is a little funky, and hard to read after a while. In that same vein, some of the screenshots are very wide and as such were zoomed down. This made those tiny and not very readable. Also the screenshots aren’t really consistent, some are black on white and some white text on black terminal which ended up being impossible to read.

Lastly I would have liked to see more use case discussions. Particularly, when should I consider a NoSQL database like MongoDB over a relational database? Which types of applications are really well suited? Which aren’t? How does it compare with other NoSQL databases? The same goes for GridFS. There was some caution there after the material was introduced, but more discussion about which applications it is well suited for would be useful.

Those few complaints aside, the book is overall very good and perhaps the publishers will consider improving the type and diagrams in the next edition. It definitely sticks to its cover page motto “Learn by doing: less theory, more results”.

Relational Database – What is it and why is it important?

A relational database is the warehouse of your data. Your crown jewels. It’s your Excel spreadsheet or filing cabinet writ large. You use them every day and may not know it. Your smartphone stores its contact database in a relational database, most likely SQLite, the ever present but ever invisible embedded database platform. Your online bank at Citibank or Chase stores all your financial history, statements, contact info, personal data and so forth, all in a relational database.

  • organized around records
  • data points are columns in a table
  • relationships are enforced with constraints
  • indexing data brings high-speed access
  • SQL is used to get data in and out of the db
  • triggers, views, stored procs & materialized views may also be supported

Like Excel, relational databases are organized around records. A record is like a 3×5 card with a number of different data points on it. Say you have 3×5 cards for your addressbook. Each card holds one address, phone number, email, picture, notes and so forth. By organizing things nicely on cards, with predictable fields on each card such as first name, last name, birthday etc, you can then search on those data points. Want all the people in your addressbook with a birthday of July 5th? No problem.
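Here is what that addressbook might look like as an actual table, sketched with Python’s built-in sqlite3 module. The table layout, the sample people and the choice to store birthdays as YYYY-MM-DD text are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each row is one 3x5 card; each column is a predictable field on the card.
conn.execute(
    """
    CREATE TABLE contacts (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        birthday   TEXT,          -- stored as YYYY-MM-DD
        email      TEXT UNIQUE    -- a constraint: no two cards share an email
    )
    """
)
conn.executemany(
    "INSERT INTO contacts (first_name, last_name, birthday, email) VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Lee", "1980-07-05", "ann@example.com"),
        ("Bob", "Ng", "1975-01-20", "bob@example.com"),
        ("Cy", "Ortiz", "1992-07-05", "cy@example.com"),
    ],
)

# Everyone with a birthday of July 5th, whatever the year.
july_fifth = conn.execute(
    "SELECT first_name, last_name FROM contacts "
    "WHERE strftime('%m-%d', birthday) = '07-05' ORDER BY last_name"
).fetchall()
print(july_fifth)
```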

While relational databases have great advantages, they require a lot of work to get all of your information into neatly organized files. What’s more, the method for getting things into and out of them, SQL, is a quirky and not very friendly language. Relational databases also have trouble clustering and scaling horizontally. NoSQL databases have made some headway in these departments, but at costs to consistency and reliability of data.

As servers continue to get larger, it becomes rarer that a web-facing database really needs more than a single server. If it’s tuned right, that is. Looking to the future, the landscape will probably continue to be populated by a mix of traditional relational databases, new NoSQL type databases, key-value stores, and other new technologies yet to be dreamed up.

Sean Hull asks on Quora – What is an rdbms and why are they important?

NOSQL Database – What is it and why is it important?

NOSQL is a sort of all-encompassing term which includes very simple key/value databases like Memcache, along with more sophisticated non-relational databases such as MongoDB and Cassandra.

Relational databases have been around since the 1970s so they’re a very mature technology. In general they support transactions, allowing you to make changes to your data in a discrete, controlled manner, and they support constraints such as uniqueness, primary and foreign keys, and check constraints. Furthermore they use SQL, or Structured Query Language, to access (i.e. fetch) data, and also to modify data by inserting, updating or deleting records.

SQL though is by no means simple, and developers over the years have come to avoid it like the plague. For good reason. Furthermore RDBMSs, aka relational database management systems, don’t horizontally scale well at all. To some degree you can get read-only scalability with replication, but with a lot of challenges. Write-based scaling has been a much tougher problem to solve. Even Oracle’s RAC, or Real Application Clusters (formerly Parallel Server), faces a lot of challenges keeping its internal caches in sync over special data interconnects. The fact is that changes to your data, whether on your iPhone, desktop addressbook or office directory, take time to propagate to various systems. Until that data is propagated, you’re looking at stale data.

Enter NOSQL databases like MongoDB which attempt to address some of these concerns. For starters data is not read from or written to the database using the old SQL language, but rather using an object-oriented method which developers find very convenient and intuitive. What’s more, it supports a lot of different types of indexing for fast lookups of specific data later.

NOSQL databases don’t just win fans on the development side of the house, but in Operations too, as they scale very well. MongoDB for instance has clustering built-in, and promises an “eventually consistent” model to work against.

To be sure, a lot of high-profile companies are using NOSQL databases, but in general they are in use for very specific needs. What’s more, it remains to be seen whether many of those databases, as they grow in size and are stretched across more general applications, will need to be migrated to more traditional relational datastores later.

Sean Hull asks on Quora – What is NOSQL and why is it important?