Tag Archives: weboperations

Why startups need techops

devops divide

I was at a talk recently on node.js. Even if I’m not working with a technology directly, it’s exciting to see what’s out there, and node.js is bringing some hyper fast performance to a certain category of web applications.

During the keynote, the speaker mentioned a service to deploy applications on. I can’t name names unfortunately but it was a cloud solution on top of which you could deploy your application. Go this route
and you can do without an operations team. Avoid overhead of hiring ops, he claimed. And hey, then you can hire more developers!

To be fair I’ve heard much of the same thing at DBA or linux conferences. I can’t count the number of stories that start with “what some idiot developer did that took down our production systems…”.

Yes, it seems dev & ops are still just a tad bit adverserial.

Join 13,000 others and follow Sean Hull on twitter @hullsean.

1. My little known origins as a developer

Many colleagues and clients I’ve met in the New York City startup industry know me primarily as an operations & scalability guy. I tune databases, infrastructure and components to make things lightening fast.

I spent my earliest years at university on the computer lab operations staff. We watched and managed, made sure level zero backups were taken care of, and moved the tapes. Directly after college, I started at a software firm. I did C++ GUI development on the Mac, using the toolbox libraries with Metrowerks Codewarrior. I built split windows, and scroll bars, and displayed rows of data with nice resizable columns. All this wasn’t built into the class library, so for a lot of it we needed to roll our own solution.

We always had a long list of features coming from the business units. I also fielded many support calls, often from the windows platform as the code there hadn’t been managed and built as carefully. But that too was instructive as you could feel the pain of customers day-to-day challenges. It also illustrated the tradeoffs between new code and features, and existing bug fixes and support.

Also: Why generalists are better at scaling the web

2. A trip through the dot-com bubble as Oracle DBA

Through a circuitous path, I moved to New York in the mid-nineties and joined a startup. I had the opportunity to wear a lot of hats there, and apply my computer lab and Linux operating systems experience to the challenge of managed Oracle. I got a lot more involved with operations quick.

As the dot-com bubble grew, I saw a hot and growing demand for Oracle DBAs as most startups used Oracle, but the talent was in short supply. In one startup 80 million dollars was on the line as performance hobbled the website, and investors feared the worst.

Read: Why the Twitter IPO made a shocking admission about scalability

3. Different priorities & mandates

I remember working at Starmedia a media darling at the time. I was analyzing the database & server systems, and finding some code & jobs running during peak daytime hours. Management claimed that could not be the case. Yet for the next days and weeks I saw the same jobs running. I held strong and spoke truth to power as they say. That’s not always easy when you have a lot of investors, screaming CTOs and 100+ hour weeks. But eventually the source of the job was located, and disabled. And the website returned to it’s speedy self.

These experiences though do underline in my mind the different priorities and focus that developers and operations staff have.

Techops, system administrators & DBAs are typically averse to change. They fight it tooth and nail. That isn’t because they like to be curmudgeons though. They are typically very concerned about the business, but from a dramatically different perspective of stability, and reliability, even at 2am in the morning. They are concerned about the longevity of data, consistency, and durability of it.

Developers on the other hand have a different mandate. They are responsible for new business features, solutions to business requirements. Rapid prototyping & reactive or agile is embraced because it means you can deliver quicker to the business.

Crucially, both of these folks care very much for the business. Just with very different priorities.

Check this: Why AirBNB didn’t have to fail

4. Can developers do operations for you?

In a lot of small startups, the initial phase is obviously on building a product. That’s the build phase, and not surprisingly you hire a lot of developers. As you should. But as you grow you may find the operational tasks that are defaulting to one or more developers are taking more and more of their time. As your customer base grows and you’ve seen your first few spikes, it’s time to start thinking about hiring for a real ops role.

In summary, yes they can, but perhaps not well.

Related: How to hire a developer that you can work with

5. Volume discount, made to order or instant coffee

You may choose to go with instant coffee, by bringing someone in-house. You may find the right talent is hard to find. I wrote about this: Why techops and DBAs are in short supply.

Alternatively you may prefer a volume discount from one of the larger remote DBA or managed support solutions such as Oracle’s, Pythian or Percona. These guys all provide great service, but keep in mind how big of a fish you are. You’ll likely work through a ticketing system, and in some cases different engineers will look at your systems at different times. You will likely need either a very hands-on technical CTO or other in-house person to take ownership, and manage things closely.

The third option is a made-to-order coffee. Yes you pay more for Toby’s, Blue Bottle, or Ninth Street Espresso but you get what you pay for as they say. A boutique shop or independent consultant will provide a lot more hand holding, help your internal staff get up to speed, and communicate intimately about the process. If you’re a more non-technical CTO, or you’re very busy running the business, this solution may make a lot of sense for you.

Also: Why cloud detractors need a history lesson

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Cloud Computing – Disciplined Deployments

With traditional managed hosting solutions, we have best practices, we have business continuity plans, we have disaster recovery, we document our processes and all the moving parts in our infrastructure.  At least we pay lip service to these goals, though from time to time we admit to getting side tracked with bigger fish to fry, high priorities and the emergency of the day.  We add “firedrill” to our todo list, promising we’ll test restoring our backups.  But many times we find it is in the event of an emergency that we are forced to find out if we actually have all the pieces backed up and can reassemble them properly.

** Original article — Intro to EC2 Cloud Deployments **

Cloud Computing is different.  These goals are no longer be lofty ideals, but must be put into practice.  Here’s why.

  1. Virtual servers are not as reliable as physical servers
  2. Amazon EC2 has a lower SLA than many managed hosting providers
  3. Devops introduces new paradigm, infrastructure scripts can be version controlled
  4. EC2 environment really demands scripting and repeatability
  5. New flexibility and peace of mind

Unreliable Servers

EC2 virtual servers can and will die.  Your spinup scripts and infrastructure should consider this possibility not as some far off anomalous event, but a day-to-day concern.  With proper scripts and testing of various scenarios, this should become manageable.  Use snapshots to backup EBS root volumes, and build spinup scripts with AMIs that have all the components your application requires.  Then test, test and test again.

Amazon EC2’s SLA – Only 99.95%

The computing industry throws around the 99.999% or five-nines uptime SLA standard around a lot.  That amounts to less than six minutes of downtime.  Amazon’s 99.95% allows for 263 minutes of downtime.  Greater downtime merely gets you a credit on your account.  With that in mind, repeatable processes and scripts to bring your infrastructure back up in different availability zones or even different datacenters is a necessity.  Along with your infrastructure scripts, offsite backups also become a wise choice.  You should further take advantage of availability zones and regions to make your infrastructure more robust.  By using private IP addresses and network, you can host a MySQL database slave in a separate zone, for instance.  You can also do GDLB or Geographically Distributed Load Balancing to send customers on the west coast to that zone, and those on the east coast to one closer to them.  In the event that one region or availability zone goes out, your application is still responding, though perhaps with slightly degraded performance.

Devops – Infrastructure as Code

With traditional hosting, you either physically manage all of the components in your infrastructure, or have someone do it for you.  Either way a phone call is required to get things done.  With EC2, every piece of your infrastructure can be managed from code, so your infrastructure itself can be managed as software.  Whether you’re using waterfall method, or agile as your software development lifecycle, you have the new flexibility to place all of these scripts and configuration files in version control.  This raises manageability of your environment tremendously.  It also provides a type of ongoing documentation of all of the moving parts.  In a word, it forces you to deliver on all of those best practices you’ve been preaching over the years.

EC2 Environment Considerations

When servers get restarted they get new IP addresses – both private and public.  This may affect configuration files from webservers to mail servers, and database replication too, for example.  Your new server may mount an external EBS volume which contains your database.  If that’s the case your start scripts should check for that, and not start MySQL until it finds that volume.  To further complicate things, you may choose to use software raid over a handful of EBS volumes to get better performance.

The more special cases you have, the more you quickly realize how important it is to manage these things in software.  The more the process needs to be repeated, the more the scripts will save you time.

New Flexibility in the Cloud

Ultimately if you take into consideration less reliable virtual servers, and mitigate that with zones and regions, and automated scripts, you can then enjoy all the new benefits of the cloud.

  • autoscaling
  • easy test & dev environment setup
  • robust load & scalability testing
  • vertically scaling servers in place – in minutes!
  • pause a server – incurring only storage costs for days or months as you like
  • cheaper costs for applications with seasonal traffic patterns
  • no huge up-front costs