Tag Archives: uptime

Some irresistible reading for March – outages, code, databases, legacy & hiring

via GIPHY

I decided this week to write a different type of blog post. Because some of my favorite newsletters are lists of articles on topics of the day.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

Here’s what I’m reading right now.

1. On Outages

While everyone is scrambling to figure out why part of the internet went down … wait is S3 is part of the internet, really? While I’m figuring out if it is a service of Amazon, or if Amazon is so big that Amazon *is* the internet now…

Let’s look at s3 architectural flaws in depth.

Meanwhile Gitlab had an outage too in which they *gasp* lost data. Seriously? An outage is one thing, losing data though. Hmmm…

And this article is brilliant on so many levels. No least because Matthew knows that “post truth” is a trending topic now, and uses it his title. So here we go, AWS Service status truth in a post truth world. Wow!

And meanwhile the Atlantic tries to track down where exactly are those Amazon datacenters?

Also: Is Amazon too big to fail?

2. On Code

Project wise I’m fiddling around with a few fun things.

Take a look at Guy Geerling’s Ansible on a Mac playbooks. Nice!

And meanwhile a very nice deep dive on Amazon Lambda serverless best practices.

Brandur Leach explains how to build awesome APIs aka ones that are robust & idempotent

Meanwhile Frans Rosen explains how to 0wn slack. And no you don’t want this. ūüôā

Related: 5 surprising features in Amazon’s serverless Lambda offering

3. On Hiring & Talent

Are you a rock star dev or a digital nomad? Take a look at the 12 best international cities to live in for software devs.

And if you’re wondering who’s hiring? Well just about everyone!

Devs are you blogging? You should be.

Looking to learn or teach… check out codementor.

Also: why did dev & ops used to be separate job roles?

4. On Legacy Systems

I loved Drew Bell’s story of stumbling into home ownership, attempting to fix a doorbell, and falling down a familiar rabbit hole. With parallels to legacy software systems… aka any older then oh say five years?

Ian Bogost ruminates why nothing works anymore… and I don’t think an hour goes by where I don’t ask myself the same question!

Also: Are we fast approaching cloud-mageddon?

5. On Databases

If you grew up on the virtual world of the cloud, you may have never touched hardware besides your own laptop. Developing in this world may completely remove us from understanding those pesky underlying physical layers. Yes indeed folks containers do run in “virtual” machines, but those themselves are running on metal, somewhere down the stack.

With that let’s not forget that No, databases are not for containers… but a healthy reminder ain’t bad..

Meanwhile Larry’s mothership is sinking…(hint: Oracle) Does anybody really care? Now’s the time to revisit Mike Wilson’s classic The difference between god and Larry Ellison.

Read: Are SQL Databases Dead?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Why Dropbox didn’t have to fail

dropbox outage dec 2015

Dropbox is currently experiencing a *major* outage. See the dropbox status page to get an update.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

I’ve written about outages a lot before. Are these types of major failures avoidable? Can we build better, with redundant services so everything doesn’t fall over at once?

Here’s my take.

1. Browse only mode

The first thing Dropbox can do to be more resilient is to build a browsing only mode into the application. Often we hear about this option for performaing maintenance without downtime. But it’s even more important during a real outage like Dropbox is currently experiencing.

Not if but *when* it happens, you don’t have control over how long it lasts. So browsing only can provide you with real insurance.

For a site like Dropbox it would mean that the entire website is still up and operating. Customers can browse their documents, view listings of files & download those files. However they would not be able to *add* or change files during the outage. Thus only a very small segment of customers is interrupted, and it becomes a much smaller PR problem to manage.

Facebook has experienced outages of service. People hardly notice because they’ll often only see a message when they try to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

A browsing only mode can make a big difference, keeping most of the site up even when transactions or publish are blocked.

Drupal is an open source platform that powers big publishing sites like Adweek, hollywoodreporter.com & economist.com. It supports a browsing only mode out of the box. An outage like this one would only stop editors from publishing new stories temporarily. It would be a huge win to sites that get 50 to 100 million with-an-m visitors per month.

Also: Is Amazon too big to fail

2. Redundancy

There are lots of components to a web infrastructure. Two big ones are webservers & databases. Turns out Dropbox could make both tiers redundant. How do we do it?

On the database side, you can take advantage of Amazon’s RDS & either read-replicas or Multi-AZ. Each have different service characteristics, so you’ll need to evaluate your app to figure out what works best.

You can also host MySQL, Percona or Mariadb direclty on Amazon instances yourself & then use replication.


Using redundant components like placing webservers and databases in multiple regions, Dropbox could avoid a major outage like they’re experiencing this weekend.

Wondering about MySQL versus RDS? Here are some uses cases.

Now that you’re using multiple zones & regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

Related: Are SQL databases dead?

3. Feature flags

On/off switches are something we’re all familiar with. We have them in the fuse box in our house or apartment. And you’ll also find a bigger larger shutoff in the basement.

Individual on/off switches are valuable because they allow us to disable inessential features. We can build them into heavier parts of a website, allowing us to shutdown features in an emergency. Host components in multiple availability zones for extra piece of mind.

Read: 5 Things toxic to scalability

4. Simian armies

Netflix has taken a more progressive & proactive approach to outages. They introduce their own! Yes that’s right they bake redundancy & automation right into all of their infrastructure, then have a loose canon piece of software called Chaos Monkey that periodically kills servers. Did I hear that right? Yep it actually nocks components offline, to actively test the system for resiliency.

Take a look at the Netflix blog for details on intentional load & stress testing.

Also: When hosting data on Amazon turns bloodsport

5. Multiple clouds

If all these suggestions aren’t enough for you, taking it further you could do what George Reese of enstratus recommends and use multiple cloud providers. Not being dependant on one company could help in many situations, not just the ones described here.

Basic Amazon EC2 best practices require building redundancy into your infrastructure. Virtual servers & on-demand components are even less reliable than commodity hardware we’re familiar with. Because of that, we must use Amazon’s automation to insure us against expected failure.

Also: Why I like Etsy’s site performance report

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

10 reaons active-active is hard and how to solve it

Multi-master replication provides redundant copies of your most important business assets. What’s more it allows applications to scale out, which is perfect for cloud hosting solutions like Amazon Web Services.

But when you decide you need to scale your write capacity, you may be considering active-active setup. This is dangerous, messy and prone to failure. We’ve outlined how.

Click through to the end for multi-master solutions that work with MySQL.

Reason 1 – auto_increment introduces new problems

o MySQL’s auto_inc settings make it more difficult to change servers around in your overall replication topology

o Using auto_inc settings can cause MySQL to introducing gaps in your primary keys which is a waste of space

o Such a solution would require all tables to have auto_inc primary keys

NEXT: Reason 2 – MySQL replication is brittle to start with

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

No iPhones Were Harmed in the Creation of this Outage

Apple’s recent iMessage outage had some users confused. What do you mean I can’t text my favorite cat photos?? How can Apple do this to me!?!?

What happened?

Apple provides services to everyone who uses it’s platform. iCloud for example stores your contacts, calendar, photos, apps and documents in the cloud. No more syncing to itunes to make sure all your stuff is backed up. It’s automatic in the cloud. Yes or course unless iCloud is down.

Same goes for iMessage. Apple has quietly introduced this, as a more feature rich version of text messaging. It’s great until the service isn’t available. What gives?

All these services are backed magically or not so magically by computer servers. These computers sit in datacenters, managed by operations teams, and to some degree with automation. All the things that brought down AWS & AirBNB & Reddit with it could also take out Apple. A serious storm like Sandy also presents real risks.

[quote]
iMessage is a text and SMS replacement service for iPhones & iPads. It is more feature rich, offering device synchronization, group texting & return receipt. But in a very big way it is also an attempt for Apple to muscle into the market and further extend it’s platform reach.
[/quote]

100% uptime ain’t easy

Even for firms that promise insanely good uptime, five nines remains very very hard to achieve in practice.

For starters all the components behind your service, need to be redundant. Multiple load balancers, webservers, caching servers, and of course databases that hold all your business assets.

But as the repeated AWS outages attest, even redundancy here isn’t enough. You also need to use multiple cloud providers. Here you can mirror across clouds so even an outage in one won’t bring down your business.

What about in the world of messaging? Well you can bet your customers don’t likely know or care about high availability, uptime, or any of these other web operations buzzwords. But they sure understand when they can’t use their service. It may give companies like Apple pause as they try to stretch themselves into areas outside their core business of iphones, ipads, and the IOS platform itself.

iMessage – messaging standards power play

When I first upgraded to an iPhone 4S, the first thing I noticed was the light blue bubbles when texting certain people. Why was that, I wondered? I quickly found out about iMessage, which was conveniently configured, to replace my old and trusty text messaging.

Texts or SMS work across all phones, smartphone or not, and apple or not. But open standards don’t lend themselves well to market muscle and dominance. So it makes sense that Apple would be pushing into this space. I met more than one blackberry owner who loved using bbm to keep in touch with colleagues. It’s like your own private club. And that muscle further strengthens Apple’s platform overall. Just take a look at how the Android Ecosystem is broken if you need an example of what not to do.

The flip side is it means you have more to manage. More servers, more services, more dimensions to your business. More frequent outages that can tarnish your reputation.

[quote]
A lot complaining and publicity like the iMessage outage received, may just be an indication that you’re big enough for people to care.
[/quote]

Alternatives abound…

There is huge competition in the messaging space. The outage and it’s publicity further underline this fact.

For example on the iPhone for messaging there is ChatOn, Whatsapp, LINE, SKYPE & wechat just to name a few.

Interestingly, while researching this article, I downloaded WhatsApp to give it a try. Only 99 cents, why not. Turns out that they had not one, but two outages, just a week ago. Seems Apple isn’t the only one experiencing growing pains.

A lot of complaining and publicity could be a sign that you’re big enough for people to care!

Read this far? Grab our Scalable Startups monthly.

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importants. ¬†Any outage can be seen by customers as a frustration, or opportunity to move to a competitor. ¬†What’s more for a site that also includes e-commerce, it can mean real lost sales.

Zero Downtime describes a site without service interruption. ¬†To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. ¬†If you’re using cloud hosting, are you redundant to alternate availability zones and regions? ¬†Are you using geographically distributed load balancing? ¬†Do you have multiple clustered databases on the backend, and multiple webservers load balanced.

All of these requirements will increase uptime, but may not bring you close to zero downtime. ¬†For that you’ll need thorough testing. ¬†The solution is to pull the trigger on sections of your infrastructure, and prove that it fails over quickly without noticeable outage. ¬†The ultimate test is the outage itself.

Sean Hull on Quora: What is zero downtime and why is it important?

Feature Flags – What are they and why are they important?

Feature flags are switches that developers architect into their web applications to allow a feature to be turned on or off.  It is simple sounding in description, but harder to implement or enable after the fact.

These switches allow the systems team to operationalize new application functionality.  It allows the ability to turn hot button features on or off as needed.  This can be bring a tremendous power and flexibility to the operations team for deployments where traffic patterns and site usage patterns cannot be known in advance.   It can increase uptime and availability of the overall site, by minimizing the impact any new feature might have.

Feature flags can also be implemented as feature dials, allowing the feature to be exposed to a percentage of users, select users, or some other meaningful way to turn it up or down gradually.

Sean Hull asks on Quora: What are feature flags and why are they important?