Category Archives: Web Operations

Is automation killing old-school operations?

puppet logo

Join 27,000 others and follow Sean Hull on twitter @hullsean.

I was shocked to find this article on ReadWrite: The Truth About DevOps: IT Isn’t Dead; It’s not even Dying. Wait a second, do people really think this?

Truth is I have heard whispers of this before. I was at a meetup recently where the speaker claimed “With more automation you can eliminate ops. You can then spend more on devs”. To an audience of mostly developers & startup founders, I can imagine the appeal.

1. Does less ops mean more devs?

If you’re listening to a platform service sales person or a developer who needs more resources to get his or her job done, no one would be surprised to hear this. If we can automate away managing the stack, we’ll be able to clear the way for the real work that needs to be done!

This is a very seductive perspective. But it may be akin to taking on technical debt, ignoring the complexity of operations and the perspective that can inform a longer view.

chef logo

Puppet Labs’ Luke Kanies says “Become uniquely valuable. Become great at something the market finds useful.”. I couldn’t agree more.

Read: Are SQL Databases Dead?

2. What happens when developers leave?

I would argue that ops have a longer view of product lifecycle. I for one have been brought in to many projects after the first round of developers have left, and teams are trying to support that software five years after the first version was built.

That sort of long term view, of how to refresh performance, and revitalize code is a unique one. It isn’t the “building the future” mindset, the sexy products, and disruptive first mover “we’re changing the world” mentality.

It’s a more stodgy & conservative one. The mindset is of reliability, simplicity, and long term support.

Also: How to hire a developer that doesn’t suck

3. What’s your mandate?

From what I’ve seen, devs & ops are divided by a four letter word.

That word I believe is “risk”. Devs have a mandate from the business to build features & directly answer to customer requests today. Ops have a mandate to reliability, working against change and thinking in terms of making all that change manageable.

Different mandates mean different perspectives.

Related: What is Devops & why is it important?

4. Can infrastructure live as code?

Puppet along with infrastructure automation & configuration management tools like Chef offer the promise of fully automated infrastructure. But the truth is much much more complex. As typical technology stacks expand from load balancer, webserver & database, to multiple databases, caching server, search server, puppet masters, package repositories, monitoring & metrics collection & jump boxes we’re all reaching a saturation point.

Yes automation helps with that saturation, but ultimately you need people with those wide ranging skills, to manage the complex web of dependencies when things fail.

And fail they will.

Check out: Why are MySQL DBA’s and ops so hard to find?

5. ORM’s and architecture

If you aren’t familiar, ORM’s are a rather dry sounding name for a component that is regularly overlooked. It’s a middleware sitting between application & database, and they drastically simplify developers lives. It helps them write better code and get on with the work of delivering to the business. It’s no wonder they are popular.

But as Ward Cunningham elloquently explains, they are surely technical debt that eventually must get paid. Indeed.

There is broad agreement among professional DBA’s. Each query should be written, each one tuned, and each one deployed. Just like any other bit of code. Handing that process to a library is doomed to failure. Yet ORM’s are still evolving, and the dream still lives on.

And all that because devs & ops have a completely different perspective. We need both of them to run modern internet applications. Lets not forget folks. :)

Read this: Do managers and CTO’s underestimate operational costs?

Want more? Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Is Hunter Walk right about operations & startups?

The.Rohit - Flickr

The.Rohit – Flickr

Join 26,000 others and follow Sean Hull on twitter @hullsean.

Hunter Walk blogged recently about the importance of building great operations teams. And while he was speaking primarily about business operations, the startup technical operations teams are equally difficult to get right.

1. performance & scalability

As your grows like Birchbox, your customer growth curve may begin to look like a hockey stick. That’s a good problem to have. Will your web application be able to keep up with the onslaught of traffic those customers bring?

Getting performance and scalability just right, will mean fewer site crashes during those key moments when all eyes are on your site.

Also: Is top operations talent hard to find?

2. Operations is key to architecture

Developers will always have strong opinions on architecture. However they may be heavily influenced by their own mandate, features, deliverability & deadlines. So it’s no surprise that they may sometimes choose to build on ORM’s, the middleware brought to you by Hibernate, Cake PHP, Active Record & the like.

And while these technologies seem a necessity in todays modern architectures, they play havoc with your long term scalability. Strong technical operations teams mean a better vision in this area. Heading off your reliance on these technologies will mean managing technical debt before it takes down your country.

Read: Are generalists better at scaling the web?

3. Operations informs strategy

Did you build in those operational switches to turn off the heaviest code, when your site gets overloaded? Operations strategy can help you see these problems on the horizon before they overwhelm you.

Have you considered building a browse only mode for your site? If you’ve ever visited Facebook or Yelp after hours you may have been greeted with the message “We can’t save your comments. Please try again later”. A small innocuous message to end users doesn’t disrupt their enjoyment of the site terribly. But from an technical operations perspective it’s huge. It means teams can perform backups, upgrades and maintenance without interrupting day-to-day activity on the site.

Related: Is scalability a big business?

4. Operations means resilience

We only learn real disaster recovery lessons from storms like Sandy. That’s because resilience highlighted best when it is a real & urgent need.

In technical operations, getting backups right & testing your recovery plan all form key steps in your path to excellence. Get them right before you need them, and ensure repeatability.

Read: Is high availability a real possibility?

5. Operations means technical strength

At the end of the day, getting technical operations right, means you can move from strength to strength. It means building on a solid foundation the likes of Google, Facebook, Foursquare & Etsy. It means you can evolve & grow with your customers, and meet their needs confidently.

Check out: Do startup CEO’s underestimate operational costs?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Cloud Deployment Interview

What does a cloud computing expert need to know? In part one of the cloud interview guide we covered some basic unix & Linux systems administration skills, and cloud computing and infrastructure concepts. Those are key starting points. You might also want to jump to part 3 cloud dba, architecture and management interview questions.

In this second part, let’s dig into deploying applications in the cloud, and day to day operations skills. There’s a lot of material here. We recommend picking a few questions out of the bunch and focusing on those questions, rather than trying to cover all of them.

Also while on the topic of hiring, keep in mind that Hiring is a Numbers Game.

1. Deploying in the Cloud

Deploying applications into virtual or cloud datacenters involves understanding and evaluating providers. Many just deploy on Amazon EC2 as it is far and away the largest cloud hosting solution, with the most robust offering.

You might also like our MySQL DBA Interview Guide as well.

o What sets amazon apart from the other cloud providers?

There are probably two things that set Amazon apart from other cloud infrastructure solutions. EBS or elastic block storage being one. Although the others have storage solutions, and Rackspace is working on their own virtualized storage, Amazon seems to be the furthest ahead with their offering. It is fully virtual, allows arbitrary chunks of storage to be attached to instances, and allows instances to boot of ebs volumes.

The other major point is that since Amazon has grown so large, so quickly, it has more datacenters, in more geographically dispersed areas than other providers. Since these are organized into logical resources, and can be accessed through API, it makes your application infrastructure truly virtual.

o What are some other large cloud providers?

Joyent, Rackspace cloud, Storm on Demand, GoGrid and VoxCloud. There are certainly many others. Take a look at this Quora post: Most Reliable Cloud Providers.

o Tell one vendor management story.

Everyone who has managed operations, has worked with vendors at one point or another. For example if you’ve worked with Rackspace you know that it’s pretty easy to get a human on the line. Amazon on the other hand allows you to do-it-yourself for everything, and only later added on a support service option. So their service pattern and history are different.

Also check out 3 Things CEOs should know about the cloud.

o How do you troubleshoot a problems?

There isn’t really a right or wrong answer to this question, but it’s a nice starting point to discussion. It can also help illustrate a candidates communication skills, and how specifically they walk through solving a problem. What problem they choose as an illustration, and how they work through to a resolution is an important indicator of operations experience.

[quote]
Pros and cons of Amazon versus Rackspace, configuration management & automation and cloud management solutions like Scalr and Rightscale… these and other skills are a important for a cloud deployment expert.
[/quote]

o What is puppet and chef?

Puppet is a configuration management system which allows ops teams to build templates for servers, and deploy many servers based on those templates. It further allows centralized control of configuration, to automate the management of a large number of servers.

Chef grew out of frustrations of Puppet, and is a sort of next generation configuration management system.

The term infrastructure as code may be thrown around. Since all cloud resources can be provisioned through API calls, everything in server deployment can be *theoretically* done via code, from spinup of servers, to installing packages, to configuring, code checkout, seeding databases and more.

Also our article What is Infrastructure provisioning and why is it important.

o What are some of the pros and cons of configuration management for operations?

Pros include allowing a smaller team to automate the deployment of a large fleet of servers, standardization, and consistency. Cons include complexity when needing to do surgical, urgent changes, and complexity when coming into an existing environment that you’ve inherited.

o How is rightscale different? What does it provide?

Rightscale is a layer on top of your cloud provider. They provide a common interface and dashboard from which to deploy servers. Templating, automation, and multi-cloud support make it a great solution for teams that have less technical expertise on staff or less hands to manage things.

o How about scalr?

They’re another management solution, that supports multiple cloud providers. They offer templating, and auto-scaling too.

While you’re here, take a look at our Myth of Five Nines – Why HA is Overrated.

2. Day to day skills

o What type of programming experience do you have?

The answer is that every ops guy or girl should be able to code, just as every developer should have some basic operational experience. Should and does are often two different things, so ask for some examples.

o shell scripts

Bash, csh, Perl and Python are all part of the Linux administrators toolbox. Writing backup scripts, log rotation, automating routine tasks and so forth are all common needs of an operations expert.

Regular expressions are a part of Unix and used in scripting to search files, cronjobs, and ETL jobs. Ask for some basic examples.

o What is continuous integration?

The old model of code deployment was called waterfall, and allowed long careful planning, coding of new features, testing, and finally deployment. The cycle could take weeks or months and iterative change took a lot of time. Continuous integration also known as agile deployments, allows a much more frequent in some cases many times per day deployment of changes.

o What are metrics good for?

Just like in website visitor tracking, and business analytics, server level analytics and tracking is possible. Collecting server metrics such as load averages, memory, disk and cpu usage over time can be invaluable. When an application slows or server stalls, checking historical metrics can often quickly reveal problems or causes.

What are some examples? nagios, ganglia, cacti, munin, opennms

o What is unit testing?

This allows for software to be build in small testable compontents. When the compontents are coded, tests are also written that test whether they are operating properly, and whether dependencies are also installed and working.

[quote]
Metrics, monitoring, load testing, firewalls, security & patching, Saas, Paas and IaaS there is a wide swath of skills needed to be competent as a web operations engineer. You’ve got your work cut out for you!
[/quote]
o What is load testing?

By performing some benchmarks, load testing can make estimates about how the application and code will perform when more users are hitting it.

o Security & networking

Sometimes a systems administrator is a generalized admin and sometimes there is a networking specialist on staff who doesn’t allow anyone else to touch that domain.

o What are firewall rules?

Unix services use port numbers to expose those services to the world. Since all servers on the internet are identified by IP addresses, firewall rules are defined around IP addresses or groups of them, and the ports they’re allowed to access.

o What is DNS?

DNS stands for domain name services. This is the sort of yellow pages of the internet. DNS allows a server name to be converted to it’s underlying IP address. It’s a very important service for any network, and generally includes many backup servers for when the primaries experience problems.

o What is a virtual private network?

A VPC provides a network link between a physical datacenter or your offices network, and your cloud provider. It allows you to elastically grow your existing datacenter using virtual resources, while treating those new boxes more like servers in your existing datacenter. IP addresses and subnets are controlled by your existing network rules and admins.

o Why is security important in web operations?

Since your business assets are primarily stored in digital form, the security of those assets depends on the security of your computer systems. Passwords, firewalls and encryption are all relevant.

o Why is patching software important?

Since security is a moving target, and vulnerabilities are constantly being discovered in software, patching and updates are important. Staying fairly current in applying patches means you network and systems will be more secure.

o What is intrusion detection?

Bugs in software open up vulnerabilities and ways into systems. Intrusion detection attempts to detect that such intrusions and avoid further damage.


o What is Saas – Software as a Service?

An example is dropbox, and other so-called hold-my-data type solutions fall into this category.

o What is Iaas – Infrastructure as a Service?

This is raw iron, the virtualized datacenters, hosting providers such as Amazon, GoGrid, Joyent, and Rackspace.

o What is Paas – platform as a service?

Solutions such as heroku, squarespace, wpengine and engineyard fall into this category. Some provide a platform such as the WordPress CMS, with arbitrary scaling options. Others like Heroku and EngineYard allow Ruby applications to be deployed without the need for a lot of fuss at the operational level.

We’re not done yet. In part three of this series, we’ll hit on dba skills, and a series of general questions that cut across the spectrum of web operations. Or jump back to part one of the cloud interview guide.

Read this far? Grab our newsletter – startup scalability.

How to hire a developer that doesn't suck

xkcd_goodcode

Strip by Randall Munroe; xkcd.com

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk averse operations team.

By the way we also have a MySQL DBA Interview Questions article which is quite popular.

Also take a look at our AWS & EC2 Interview questions piece.

Lastly we have a great Oracle DBA Hiring Guide.

It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing starts.

If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?

Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability. Continue reading

Scale Quickly Like Birchbox – Startup Scalability 101

One of the great things about the Internet is how it has made it easier to put great ideas into practice. Whether the ideas are about improving people’s lives or a new way to sell and old-fashioned product, there’s nothing like a good little startup tale of creative disruption to deliver us from something old and tired.

We work with a lot of startup firms and we love being part of the atmosphere of optimism and ingenuity, peppered with a bit of youthful zeal – something very indie-rock-and-roll about it. But whether they are just starting out or already picking up pace every startup faces the same challenges to scale a business. Recently, we were reminded of this when we watched Inc’s video interview with Birchbox founders, Hayley Barna and Katia Beauchamp. Continue reading

5 things toxic to scalability

The.Rohit - Flickr

Slow Traffic - Penguin Crossing

Check out our followup post 5 More Things Deadly to Scalability

If you’re using MySQL checkout 5 ways to boost MySQL scalability.

1. Object Relational Mappers

ORMs are popular among developers but not among performance experts.  Why is that?  Primarily these two engineers experience a web application from entirely different perspectives.  One is building functionality, delivering features, and results are measured on fitting business requirements.  Performance and scalability are often low priorities at this stage.  ORMs allow developers to be much more productive, abstracting away the SQL difficulties of interacting with the backend datastore, and allowing them to concentrate on building the features and functionality.

[quote]
Scalability is about application, architecture and infrastructure design, and careful management of server components.
[/quote]

On the performance side the picture is a bit different.  By leaving SQL query writing to an ORM, you are faced with complex queries that the database cannot optimize well.  What’s more ORMs don’t allow easy tweaking of queries, slowing down the tuning process further.

2. Synchronous, Serial, Coupled or Locking Processes

Locking in a web application operates something like traffic lights in the real world.  Replacing a traffic light with a traffic circle often speeds up traffic dramatically.  That’s because when you’re out somewhere in the country where there’s very little traffic, no one is waiting idly at a traffic light for no reason.  What’s more even when there’s a lot of traffic, a traffic circle keeps things flowing.  If you need locking, better to use InnoDB tables as they offer granular row level locking than table level locking like MyISAM tables.

Avoid things like semi-synchronous replication that will wait for a message from another node before allowing the code to continue.  Such waits can add up in a highly transactional web application with many thousands of concurrent sessions.

Avoid any type of two-phase commit mechanism that we see in clustered databases quite often.  Multi-phase commit provides a serialization point so that multiple nodes can agree on what data looks like, but they are toxic to scalability.  Better to use technologies that employ an eventually consistent algorithm.

3. One Copy of Your Database

Without replication, you rely on only one copy of your database.  In this configuration, you limit all of your webservers to using a single backend datastore, which becomes a funnel or bottleneck.  It’s like a highway that is under construction, forcing all the cars to squeeze into one lane.  It’s sure to slow things down.  Better to build parallel roads to start with, and allow the application aka the drivers to choose alternate routes as their schedule and itinerary dictate.

Using MySQL? Checkout our our howto Easy Replication Setup with Hotbackups.

4. Having No Metrics

Having no metrics in place is toxic to scalability because you can’t visualize what is happening on your systems.  Without this visual cue, it is hard to get business units, developers and operations teams all on the same bandwagon about scalability issues.  If teams are having trouble groking this, realize that these tools simple provide analytics for infrastructure.

There are tons of solutions too, that use SNMP and are non-invasive.  Consider Cacti, Munin, OpenNMS, Ganglia and Zabbix to name a few.  Metrics collections can involve business metrics like user registrations, accounts or widgets sold.  And of course they should also include low level system cpu, memory, disk & network usage as well as database level activity like buffer pool, transaction log, locking sorting, temp table and queries per second activity.

5. Lack of Feature Flags

Applications built without feature flags make it much more difficult to degrade gracefully.  If your site gets bombarded by a spike in web traffic and you aren’t magically able to scale and expand capacity, having inbuilt feature flags gives the operations team a way to dial down the load on the servers without the site going down.   This can buy you time while you scale your webservers and/or database tier or even retrofit your application to allow multiple read and write databases.

Without these switches in place, you limit scalability and availability.

Want more?  Signup for our scalability newsletter.

5 Tips for Better Database Change Management

Deploying new code that includes changes to your database schema doesn’t have to be a process fraught with stress and burned fingers. Follow these five tips and enjoy a good nights sleep.

1. Deploy with Roll Forward & Rollback Scripts

When developers check-in code that requires schema changes, that release should also require two scripts to perform database changes. One script will apply those changes, alter tables to add columns, change data types, seed data, clean data, create new tables, views, stored procedures, functions, triggers and so forth. A release should also include a rollback script, which would return tables to their previous state. Continue reading

5 Scalability Pitfalls to Avoid

1. Object Relational Mappers

Software development has always made use of libraries, off-the-shelf components that are shared between different projects.  These allow you to stand on the shoulders of others and build bigger things.  Frameworks do the same thing, they provide a context from which to build on.  Ruby on Rails for example provides a great starting framework from which to build web applications, managing sessions in an elegant way. Continue reading

4 Considerations Migrating to The Cloud

When migrating to the cloud consider security and resource variability, the cultural shift for operations and the new cost model. Continue reading

Top 3 Questions From Clients

1. This page or area of the website is very slow, why?

There are a lot of components that make up modern internet websites, and a lot of places to get stuck in the mud.  Website performance starts with the browser, what caching it is doing, their bandwidth to your server, what the webserver is doing (caching or not and how), if the webserver has sufficient memory, and then what the application code is doing and lastly how it is interacting with the backend database. Continue reading