Why Healthcare.gov desperately needs techops

healthcare.gov logo

Join 15,000 others and follow Sean Hull on twitter @hullsean.

1. Tech-what? A quick education

Techops is operational excellence. It’s the handoff when code is complete. These are the folks who are up at 2am when a website is down. They manage servers, keep the pipes clean, and the hackers out. They also help plan for capacity needs, and may help with load testing too.

If you’re new to technology, imagine a movie set. Story writers (programmers) have already done their part. The producers (venture capital folks) have financed the project. The director (architect) is there trying to put the vision together. But the folks who manage everything on set, from sound guys, camera guys, lighting people, and all the coordination, this is operations. In web application deployments it is devops, sysops or techops.

Also: Why the Twitter IPO is afraid of scalability

2. In contrast with Obama election campaign

Notice how phenomenally well Obama for America project was run. Like a finely tuned machine. Harper Reed and team pulled off one of the most data backed election campaigns in history.

That project used AWS cloud technologies to the fullest, from devops tools like Puppet and Asgard, collaboration tools like Campfire & Github, and superb monitoring & instrumentation tools NewRelic and Chartbeat.

Clearly Obama knows how to run an election. Something is drastically different with the healthcare.gov project. Too many cooks in the kitchen, perhaps?

Read: Why your startup needs professional techops

3. A failure in capacity planning

Many popular news outlets covered the outage, but most pointed to “bugs”, which caused the outage. But when a site dies under load, while it’s working in test & Q/A, that’s a failure of load testing, and capacity planning.

I would wager a good bet, database tuning would definitely help as it’s the most common and prevalent cause of

Read this: What four letter word divides dev and ops?

4. More testing & more Agility needed

Modern software projects take advantage of continuous integration & agile methods. That is they make small incremental changes. Developers build unit tests, and the code is always in a working state. There is no multi-month dev cycle, where your current software is in doubt.

Reports indicate that the healthcare.gov software was being designed & developed using this old and most agree inferior method of software development, the waterfall method. New Yorker criticises it in Don’t go chasing waterfalls.

Read: Why devops talent is in short supply

5. Caching is desperately needed

All high performance, high scale websites need to take advantage of various types of caching as I’ve discussed in detail before. From browser caching, to page & object caching on the server side.

Hayden James investigated in depth, and found healthcare.gov severely lacking. Again this is a huge failure in techops, sysops or devops. It’s not a bug, and not something the developers are responsible to deliver.

Read: Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

When fat fingers take down your business

apple sad mac fail

Join 14,000 others and follow Sean Hull on twitter @hullsean.

Github goes nuclear

I was flipping through reddit last night, and hit this crazy story. strange pushes on GitHub. For those who don’t know, github houses source code. It’s version control for the software world. Lots of projects use it, to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server. They overwrote not only their own work, but the work of hundreds of other plugins unrelated to his own project.

This is like doing a demolition to put up a new building, and taking down all the buildings on your block and the next. Not very neighborly, to say the list. They’re still at the time of this writing, doing cleanup, and digging through the rubble.

Read: Why DynamoDB can increase availability

How to kill a database

I worked a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers, and was managed by Rightscale. Well it was actually managed by an west coast outsourcing shop, whose specialty was managing deployments on Righscale.

The site kept it’s information in a MySQL database. They had various scripts to spinup slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching over production customers to a database that took way way way too long to rebuild. As their automation didn’t perform checksums to bulletproof the setup it couldn’t know that all the data wasn’t finished moving!

Customers sure did notice though when the site fell over. Yes this was a failure of automation. But not of the Rightscale platform, but of the outsourcing firm managing the process, checking the pieces and components and ensuring the computer systems did their thing to completion. Huge fail!

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over why perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn’t failure and perfection. It’s *how* they fail, and how they get back up on their feet. What type of planning did they do for disaster recovery like many firms in NYC did before and after Sandy.

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let’s think about *reducing* it from happening, and when it does, reducing the fallout. One thing you can do is signup for scalable startups where we share tips once a month on the topic. Meanwhile try to put these best practices into play.

1. Test your DR plan by running real life fire drills
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!

Read this: Why a four letter word divides dev and ops

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters