Tag Archives: devops

Why startups need techops

devops divide

I was at a talk recently on node.js. Even if I’m not working with a technology directly, it’s exciting to see what’s out there, and node.js is bringing some hyper fast performance to a certain category of web applications.

During the keynote, the speaker mentioned a service to deploy applications on. I can’t name names unfortunately but it was a cloud solution on top of which you could deploy your application. Go this route
and you can do without an operations team. Avoid overhead of hiring ops, he claimed. And hey, then you can hire more developers!

To be fair I’ve heard much of the same thing at DBA or linux conferences. I can’t count the number of stories that start with “what some idiot developer did that took down our production systems…”.

Yes, it seems dev & ops are still just a tad bit adverserial.

Join 13,000 others and follow Sean Hull on twitter @hullsean.

1. My little known origins as a developer

Many colleagues and clients I’ve met in the New York City startup industry know me primarily as an operations & scalability guy. I tune databases, infrastructure and components to make things lightening fast.

I spent my earliest years at university on the computer lab operations staff. We watched and managed, made sure level zero backups were taken care of, and moved the tapes. Directly after college, I started at a software firm. I did C++ GUI development on the Mac, using the toolbox libraries with Metrowerks Codewarrior. I built split windows, and scroll bars, and displayed rows of data with nice resizable columns. All this wasn’t built into the class library, so for a lot of it we needed to roll our own solution.

We always had a long list of features coming from the business units. I also fielded many support calls, often from the windows platform as the code there hadn’t been managed and built as carefully. But that too was instructive as you could feel the pain of customers day-to-day challenges. It also illustrated the tradeoffs between new code and features, and existing bug fixes and support.

Also: Why generalists are better at scaling the web

2. A trip through the dot-com bubble as Oracle DBA

Through a circuitous path, I moved to New York in the mid-nineties and joined a startup. I had the opportunity to wear a lot of hats there, and apply my computer lab and Linux operating systems experience to the challenge of managed Oracle. I got a lot more involved with operations quick.

As the dot-com bubble grew, I saw a hot and growing demand for Oracle DBAs as most startups used Oracle, but the talent was in short supply. In one startup 80 million dollars was on the line as performance hobbled the website, and investors feared the worst.

Read: Why the Twitter IPO made a shocking admission about scalability

3. Different priorities & mandates

I remember working at Starmedia a media darling at the time. I was analyzing the database & server systems, and finding some code & jobs running during peak daytime hours. Management claimed that could not be the case. Yet for the next days and weeks I saw the same jobs running. I held strong and spoke truth to power as they say. That’s not always easy when you have a lot of investors, screaming CTOs and 100+ hour weeks. But eventually the source of the job was located, and disabled. And the website returned to it’s speedy self.

These experiences though do underline in my mind the different priorities and focus that developers and operations staff have.

Techops, system administrators & DBAs are typically averse to change. They fight it tooth and nail. That isn’t because they like to be curmudgeons though. They are typically very concerned about the business, but from a dramatically different perspective of stability, and reliability, even at 2am in the morning. They are concerned about the longevity of data, consistency, and durability of it.

Developers on the other hand have a different mandate. They are responsible for new business features, solutions to business requirements. Rapid prototyping & reactive or agile is embraced because it means you can deliver quicker to the business.

Crucially, both of these folks care very much for the business. Just with very different priorities.

Check this: Why AirBNB didn’t have to fail

4. Can developers do operations for you?

In a lot of small startups, the initial phase is obviously on building a product. That’s the build phase, and not surprisingly you hire a lot of developers. As you should. But as you grow you may find the operational tasks that are defaulting to one or more developers are taking more and more of their time. As your customer base grows and you’ve seen your first few spikes, it’s time to start thinking about hiring for a real ops role.

In summary, yes they can, but perhaps not well.

Related: How to hire a developer that you can work with

5. Volume discount, made to order or instant coffee

You may choose to go with instant coffee, by bringing someone in-house. You may find the right talent is hard to find. I wrote about this: Why techops and DBAs are in short supply.

Alternatively you may prefer a volume discount from one of the larger remote DBA or managed support solutions such as Oracle’s, Pythian or Percona. These guys all provide great service, but keep in mind how big of a fish you are. You’ll likely work through a ticketing system, and in some cases different engineers will look at your systems at different times. You will likely need either a very hands-on technical CTO or other in-house person to take ownership, and manage things closely.

The third option is a made-to-order coffee. Yes you pay more for Toby’s, Blue Bottle, or Ninth Street Espresso but you get what you pay for as they say. A boutique shop or independent consultant will provide a lot more hand holding, help your internal staff get up to speed, and communicate intimately about the process. If you’re a more non-technical CTO, or you’re very busy running the business, this solution may make a lot of sense for you.

Also: Why cloud detractors need a history lesson

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five More Things Deadly to Scalability

The.Rohit - Flickr
The.Rohit – Flickr

Join 19k others and follow Sean Hull on twitter @hullsean.

1. Slow Disk I/O – RAID 5 – Multi-tenant EBS

Disk is the grounding of all your servers, and the base of their performance. True with larger and larger main memory, much is available in cache, a server still needs to constantly read from disk and flush things from memory. So it’s a very very important component to performance and scalability.

What’s wrong with Raid 5?

Raid 5 was designed to give you more space, using fewer disks. It’s often used in a server with few slots or because ops misunderstand how bad it will impact performance. On a database server it can be particularly bad.

All writes see a performance hit. What’s worse is if you lose a disk, the RAID though technically still on line, will perform SO SLOWLY as to be offline. And a rebuild takes many hours. Worse still is the risk to lose another drive during that rebuild. What if you have order a drive and it takes a couple of days?

RAID 10 is the solution. Mirror each set of two drives, then stripe over those. Even with only four slots available, it’s worth it. Good read performance, good write performance, and protection.

What the heck is multi-tentant?

In the cloud, you share servers, network & disk just like you do apartments in a building. Hence the name. Amazon’s EBS or elastic block storage, extends this metaphor, offering you the welcome flexibility of a storage network. But your bottleneck can be fighting with other tenants on that same network.

Default servers do have this problem, however AWS has addressed this serious problem with a little known but VERY VERY useful feature called Provisioned IOPS. It’s a technical name, but means you can lock in reliable disk I/O. Just what the scalability doctor ordered.

Check out our original post: 5 Things Toxic to Scalability

2. Using the database for Queuing

MySQL is good at a lot of things, but it’s not ideal for managing application queues. Do you have a table like JOBS in your database, with a status column including values like “queued”, “working”, and “completed”? If so you’re probably using the database to queue work in your application.

It’s not a great use of MySQL because of locking problems that come up, as well as the search and scan to find the next task.

Luckily their are great solutions for developers. RabbitMQ is a great queuing solution, as is Amazon’s SQS solution. What’s more as external services they’re easier to scale.

[quote]
Scalability becomes key to your business, as you customer base grows. But it doesn’t have to be impossible. Disk I/O, caching, queuing and searching are all key areas where you can make a big dent, in a manageable way. Juggle your technical debt too, and you’re golden!
[/quote]

Also take a look at: Why Generalists are Better at Scaling the Web

3. Using Database for full-text searching

Oracle has full text search support, why shouldn’t we assume the same in MySQL? Well MySQL *does* have this, but in many versions only with the old MyISAM storage engine. It has it’s set of corruption problems, and isn’t really very performant.

Better to use a proven search solution like Apache Solr. It is built specifically for search, includes excellent library support for developers of most modern web languages and best of all is easy to SCALE! Just add more servers on your network, or distributed globally.

For folks interested in the bleeding edge, Fulltext is coming to Innodb crash safe & transactional storage engine in 5.6. That said you’re still probably better off going with an external solution like Solr or Sphinx and the MySQL Sphinx SE plugin.

[mytweetlinks]

How to find A Mythical MySQL DBA

4. Insufficient Caching at all layers

Cache, cache, and cache some more. Your webservers should use a solid memcache or other object cache between them & the database. All those little result sets will sit in resident memory, waiting for future web pages that need them.

Next use a page cache such as varnish. This sits in front of the webserver, think of it as a mini-webserver that handles very simple pages, but in a very high speed way. Like a pack of motorbikes riding down an otherwise packed freeway, they speed up your webserver to do more complex work.

Browser caching is also important. But you can’t get at your customers browsers, or can you? Well not directly, but you can instruct them what things to cache. Do that with proper expires headers. Have your system administrator configure apache to support this.

Also: Tweaking Disqus to Find Experts & Drive Traffic

5. Too much technical debt

Technical debt can bite. What is it? As you’re developing an new idea, you’ll build prototypes. As those get deployed to customers, change gets harder, and past things you glossed over because problems. One team leaves, and another inherits the application, and the problems multiple. Overtime you’re building your technical debt as your team spends more time supporting old code and fixing bugs, and less time building new features. At some point a rewrite of problem code becomes necessary.

Also: How I increased my blog pagerank to 5

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Here’s a sample

Hacking Job Search – Three Meaty Ideas

Also find the author on twitter @hullsean.

Demand for talented engineers has never been higher. It is in fact the dirty little secret of the startup industry, that there are simply not enough qualified folks to fill the positions.

What this means for you is that you have a lot of options. What it means for a hiring manager is that you will have to work even harder to find the right candidate. Just going to a recruiter isn’t enough. Use your network, go to meetups, follow Gary’s Guide daily.

Also check out our Mythical MySQL DBA piece where we talk about the shortage of DBAs and operations folks.

Further if you’ve dabbled in freelance or independent consulting, I wrote an interesting an in depth look at Why do people leave consulting. Understanding this can help avoid it in your own career, or avoid your resources leaving for better shores.

Find us on twitter @hullsean and linkedin where we share content and ideas everyday!

1. Build your reputation

As they say, your reputation precedes you. So start building it now. Fulltime or freelance, you want to be known.

Speaking, yes you can do it. Start with some small meetups, volunteer to speak on a topic. A ten person room is easier than 30, 50 or 100. Once you have a couple under your belt, fill out a CFP for Velocity, OSCon or some software developers conference. There are many.

Blog – if you’re not already doing so you should. Start with once a week. Comment on industry topics, controversial ideas, or engineering know-how. Prospects can look at this and learn a lot more than from a business card.

Write a book, yes you can. It may sound impossible, but the truth is that publishers are always looking for technical writers. Pick a topic near and dear to you. It’ll also give you endless material for your blog.

Go to meetups, you really need to be getting out there and networking. Get some Moo Business Cards and start working on your elevator pitch!

Social media – being active here helps your blog, and helps people find you. Twitter is a great place to do this. Interact with colleagues and startup founders, VCs and more. If you’re a hiring manager or CTO, you may find great programmers and devops this way.

We also wrote a more in-depth article Consulting and Freelance 101. It’s a three part guide with a lot of useful nugets.

Also take a look at our MySQL DBA Interview Guide which is as helpful to devops and DBAs as it is to managers hiring them!

[quote]
Above all else, build your network & your reputation. It will put you in front of more people as a person, not a commodity or a resume in a pile of hundreds.
[/quote]

2. Qualify prospects

You definitely don’t want to take the first offer you get, and managers don’t want to hire the first candidate that comes along. You want two or three to choose from. Best way to do this is to have options.

If you’re a candidate, network or work through your colleagues. When you do get a lead, be sure you’re speaking to an economic buyer. If you’re not you’ll need to try to find that person who actually signs the checks. They are the ones who ultimately make the decision, so you want to sell yourself to them.

Get a Deposit – I know I know, if it’s your first freelance job, you don’t want to scare them off. Or maybe you do? The only prospects that would be scared off by this are ones who may not pay down the line. Dragging their feet with a deposit can also mean bureaucratic red tap, so be patient too.

Sara Horowitz has an excellent book Freelancers bible, we recommend you grab a copy right now!

Commodity You Are Not so don’t sell yourself as one. What do I mean? You are not an interchangeable part. You have special skills, you have personality, you have things that you’re particularly good at. These traits are what you need to focus on. The dime-a-dozen skills should sit more in the background.

You’ll also need to price and package your services. We talked about this in-depth in Consulting Essentials – Getting the Business.

We also think there is a reason Why Generalists are better at scaling the web.

3. Play the numbers game

For hiring managers this doesn’t mean working through recruiters that might be bringing subpar talent, it means networking through industry events, meetups, startup pitch and venture capital events. There are a few every single day in NYC and there’s no reason not to go to some of them.

For candidates, be eyeing a few different companies, and following up on more than one prospect. You should really think of this process as an integral and enjoyable part of your career, not a temporary in between stage. Networking doesn’t happen overnight, but from a regular process of meeting and engaging with colleagues over years and years in an industry.

At the end of the day hiring is a numbers game so you should play it as such. Keep searching, and always be watching the horizon.

Read this far? Grab our Scalable Startups for more tips and special content.

Cloud DBA and Management Interview

What does a cloud computing expert need to know? This is the last of a three part guide to interviewing for a cloud operations position. You can find them here – part one Operations Interview and part two Deployment Interview.

Here’s my guide to do just that.

1. Database administration experience

Although in some shops the DBA role is a completely separate one, there are many others where the Linux and Operations teams manage these services as well. We do have a some other material Oracle DBA Interview questions and MySQL DBA Interview Guide. Here’s a taste of what to expect.

o What is RAID? Which type is best?

RAID is a way to share a whole bunch of disks on one server. Databases like Oracle or MySQL do a lot of writing and reading from disk. If there are more disks sharing this work, it’s like you have more waiters in your restaurant. Faster serivce.

Although some folks still hang onto RAID 5 as an option, it’s generally a very bad one. It has a serious write penalty because of parity checking it must perform. Most databases do a lot of writing, even when user transactions are not doing INSERT or UPDATE. What’s more if a disk fails, RAID 5 although technically online, will be so slow as to be effectively unusable while the long slow rebuild happens.

What’s the answer then? RAID 10! It mirrors each volume, and then stripes across those mirrored sets. Fast I/O, fast recovery. Done & done.

o What are the tradeoffs with more indexes versus fewer?

In all relational databases, you build indexes on data. Indexes are just like the ones you think of in the yellow pages, phonebooks of yore. An index on first name means you can look up Obama by Barack as well. Index on street addresses means you can lookup on the White House. So the more indexes you have, the more different ways you can search for & fetch what you want.

On the other hand the penalty here, is that whenever you add new data & records to this database, all those indexes must be updated. That’s overhead, which slows down writes.

So the tradeoff is more indexes – faster fetching, slower writing. Fewer indexes slower fetching, faster writing.

o What do NoSQL databases eliminate? How do they achieve great speed?

There are quite a few different types of NoSQL databases. So I’m generalizing quite a lot here. One thing NoSQL databases eliminate is the ability to JOIN data across different columns. By removing this great feature of relational databases, they dramatically simplify the underlying implementation. No free lunch!

What else? Many of these databases cut corners on what’s called durability. What is durability? Imagine you are in a lecture hall and bring your notebook or are waiting tables, and taking orders. It might be quicker to do so without writing things down. You keep it all in your head. Great, but what if you forget something? You have to go ask for the order again! It may be faster, but more prone to error. Losing data is not something to be taken lightly. NoSQL databases don’t always flush data to permanent storage.

[quote]
Whether or not an web operations candidate uses command line may seem like a small issue. But it speaks to what their DNA is, and the strength of their foundation. Strength and comfort on the command line is key.
[/quote]

o What is Amazon RDS? When should I use it?

Amazon has a managed relational database solution called RDS. It’s basically MySQL, Oracle or SQL Server, but modified so you can’t shoot yourself in the foot. Administrative tasks are simplified, but so are your configuration options.

I wrote an in-depth Amazon RDS use cases article. It mostly covers MySQL, but the general rules apply to Oracle & SQL Server. At the end of the data RDS is a lot less configurable and flexible. But if you don’t have a regular DBA on staff, it will probably simplify your administration of these servers.

o What are read-replicas? What about Multi-az?

Read-replicas are read-only copies of your data. Using MySQL these are fairly stock master-slave configurations. Note since they’re the standard technology, they’re still asyncronous. So yes the read-replica can lag behind.

Multi-az is a proprietary technology, and Amazon doesn’t disclose what’s under the hood. However it’s likely running on top of something like DRBD which is a distributed filesystem. This allows the underlying disk I/O to be mirrored across the internet, and to another availability zone. You’ll enjoy syncronous copies of your data, and no data consistency problems. Keep in mind those that the alternate server is offline or cold and can take time to come online.

o What is the primary bottleneck of hosting databases in the cloud? How has Amazon recently addressed this?

As I explained above disk I/O remains the largest bottleneck for relational databases, even if the entire dataset fits in memory. Why? Because sorting, joining, and rearranging data can take orders of magnitude more memory to magically do in memory. And that’s not even talking about durability guarentees.

The cloud has traditionally lagged quite a lot behind physical servers in terms of disk I/O so some internet firms have shyed away from moving to the cloud. EBS volumes were typically limited to a few hundred IOPs.

Amazon’s recently announced Provisioned IOPs. It’s a mouthful of a name for a very big development. It means you can provision how fast you want those virtual disks to be. For individual volumes the limit seems to be 2000 IOPs but you can also software raid across many of those virtual disks. For Amazon RDS the limit is reportedly 10,000 IOPs. This new feature will make a huge difference for hosting large high I/O databases in Amazon’s cloud.

2. Architecture & Management Questions

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?

As large applications are architected to build hardware components, and resources in the cloud, the API they work through becomes key. Sticking to an open standard for this API means you can change cloud vendors and/or build on multiple ones. We talked about this multi-cloud solution as a key way to avoid outages like AirBNB and Reddit experienced when AWS had an outage.

Following on the heels of that article, we were quoted about multi-cloud by Brandon Butler in his Network World piece .

o Do you use command line tools? Why?

A good web operations candidate should be very comfortable with command line tools. Everything in Linux is command line. It’s like broadway acting to movie acting, or literature to books. It’s the original source, much more powerful, what’s more it indicates and requires much stronger theoretical knowledge of the underlying systems being managed.

o What can go wrong with backups? How do we test them?

Everything can go wrong with them. They can fail to complete. Be backups of the wrong service or resource. Even the backup software itself can have bugs. The only way to sleep well at night is if you run firedrills and restore your application and data top to bottom.

o Should we encrypt filesystems in the cloud? What are the risks?

This depends on your environment and how sensitive your data is. If you’re collecting credit card data for instance, it may be key. However some surprising blips may push other applications to encrypt as well. Bugs in the hypervisor could potentially make your data vulnerable. What’s more if the cloud provider gets subpeonaed, it may well capture your server and data into the net. Better safe than sorry. Remember you don’t know where your data actually resides, but you do control who has access if you’re encrypted.

We wrote a very in-depth piece on Deploying on Amazon EC2 where we discuss questions such as encryption in more depth.
o Should we use offsite backups?

It’s definitely worth doing this. One more layer of insurance.

o What is load balancing? Why is it difficult with databases?


Load balancing puts a digital traffic circle into your infrastructure, giving you two roads or paths to resources. However those resources have to be exactly the same. With databases you are constantly writing to tables, and updating records. When you scale those horizontally, it becomes impossible to keep track of changes.

[quote]
Relational databases are inherently difficult to scale. Most environments scale a single authoritative master vertically, and add multiple read-only slaves horizontally to allow the appplication to serve more customers.
[/quote]


o Why use a package manager? Can we install from source?

Package managers simplify the installation of software components. A team such as Redhat, Ubuntu or Debian builds a distribution, and compiles all components storing them in a repository. Installing packages this way allows your setup to be standard across servers. This allows more automation, and is simpler for another admin to figure out what you have, down the line when it passes to someone elses shoulders.

Installing from source is generally a bad idea. Although it allows you to tweak and configure each piece of software the way you want, tightly and efficiently, it also means everything is custom. No commoditization advantages.

o What is horizontal scalability?

This involves adding more hardware, more individual servers to service the same application and users.

o What is vertical scalability?

This means scaling up or growing your existing single server, so it is larger, has more memory, cpu or faster disk.

o What can go wrong with automatic failover?

Just about everything. Applications and services can stall, disks can fail, servers can hang. What’s more networks can exhibit latency. Automatic failover is ultimately a piece of software or algorithm trying to diagnose and handle situations. And it does so based on a very small list of rules or heuristics. The real world is messy, so this can often lead to false failure detection, and potentially loss of data.

o How do cloud vendors implement vertical scalability?

This may vary dramatically between cloud providers. Ultimately, however since virtualization allows you to boot a disk image onto any hardware, you can snapshot your current root volume or disk and then boot it on another server, one that is larger, smaller and so forth. About the only thing you need to watch out for is 32 versus 64 bit questions.

If you haven’t already, don’t forget to checkout the rest of this series – part one Operations Interview and part two Deployment Interview.

Read this far? Grab our newsletter – startup scalability.

10 ways I avoid trouble in database operations

1. Avoid destructive commands

From time to time I’m working with new recruits and bringing them up to speed in operations. The first thing I emphasize is care with destructive commands.

What do I mean here? Well there are all sorts of them. SQL commands such as DROP table & DROP database. But also TRUNCATE and DELETE are all destructive. They’re easy to execute but harder to undo. Think of all the steps it would take to restore from your backup.

If you are logged in as root there are many many ways to shoot your own foot. I hope you know this right? rm has lots of options that can be very difficult to step back from like -r (recursive) and -f (force). Better to not use the command at all and just move the file or directory you’re working on by renaming it. You can always delete later.

2. Set your command prompts

When working on the command line, your prompt is crucial. You check it over and over to make sure you’re working on the right box. At the OS, your prompt can tell you if you’re root or not, what directory you’re sitting in, and what’s the hostname of the box. With a few different terminals open, it’s very easy to execute a heavy loading command or destructive command on the wrong box. Check thrice, cut once!

You can also set your mysql prompt too. This can provide similar insurance. It can tell you the database schema you’re set at default, and the user you’re logged in as. Hostname or localhost too. It is one more piece in the risk aversion puzzle.

3. Perform backups & test them

I know I know, we’re all doing backups already. Well I sure hope so. But if you’re getting on a system for the first time, it should be your very initial impulse to check and find out what types of backups are being done. If they’re not, you should set them up. I don’t care how big the database is. If it’s an obstacle, you need to sell or educate management on what might happen if. Paint some ugly scenarios. It’s not always easy to see urgency in these things without a good war story or two.

We wrote a guide to using xtrabackup for hotbackups. These can be done online even while your production database is serving customers without table locking or other downtime.

4. Stay off production machines

This may sound funny to some of you, but I live by it. If it ain’t broke, don’t go and try to fix it! You don’t need to be on all these boxes all the time. That goes for other folks too. Don’t give devs access to every production box. Too many hands in the pie so to speak. Also limit root users. But again if those systems are running well, you don’t have to login to them and poke around every five minutes. This just brings more chances for operator error.

5. Avoid change as much as possible

This one might sound controversial but it’s saved me more than once.

I worked at one firm a few years back managing the MySQL servers. The Oracle DBA was going on vacation for a few weeks so I was picking up the reigns for a bit. I met with the DBA for some brain dump sessions, and he outlined the main things that can and do go wrong. He also asked that I avoid any table alterations.

Sure enough ten days into his vacation, a problem arose in the application. One page on the site was failing silently. There was a missing field which needed to be added. I resisted. A fight ensued. Suddenly a lot of money was at stake if this change wasn’t pushed through. I continued to resist. I explained that if such a change were not done correctly, it very likely would break replication, pushing a domino of other things to break and causing an unpredictable mess.

I also knew I only had to hold on for a few more days. The resident dba would be returning and he could juggle the change. You see Oracle was setup to use multi-master replication those changes needed to go through a rather complex process to be applied. Done incorrectly the damage would have taken days to cleanup and caused much more financial damage.

The DBA was very thankful at my resistance and management somewhat magically found a solution to the application & edit problem.

Push back is very important sometimes.

[quote]
Many of these ten tips are great characteristics to select for in the DBA hiring process. If you’re a candidate, emphasize your caution and track record with uptime. If you’re a manager, ask candidates about how they handle these situations. We wrote a MySQL DBA hiring guide too.

[/quote]

6. Monitor important things

You should monitor your OS syslog and MySQL error log for starters. But also your slow query log for new activity, analyze them and send the reports along to devs. Provide analysis. Monitor your partitions. You don’t ever want disks to fill up. Monitor load average, and have a check that the database login or some other simple transaction can succeed. You can even monitor your backups to make sure they complete without error. Use your judgement to decide what checks satisfy these requirements.

7. Use one or more slaves & checksum

MySQL slave databases are a great way to provide insurance. You can use a lagging slave to provide insurance against operator error, or one of those destructive commands we mentioned above. Have it lag a few hours behind so you’ll have that much insurance. At night this slave may be fresh enough to use for backups.

Also since mysql uses statement based replication, data can get out of sync over time. Those problems may or may not flag errors. So use a tool to compare your master and slave for data consistency. We wrote a howto on using checksums to do just that.

8. Be very careful of automatic failover

Automation is wonderful when it works. We dream of a data center that works like clockwork, with robots that never sleep. We can work towards this ideal, and in some cases get close. But it’s important to also understand that failure is by nature *not* what we predicted. The myriad ways that complex systems can fail boggles the mind, and surprises even seasoned veterans of operations. So maintain a heathy suspicion of this type of automation. Understand that if you automate things to happen in this crucial time, you can potentially put yourself in an even *more* compromised position than simply failing.

Sometimes monitoring, alerting, and manual intervention are the more prudent path. Your mileage may vary of course.

9. Be paranoid

It takes many years of doing ops to realize you can never be paranoid enough. Already checked that you’re on the right host, and about to execute some command? Quit the shell prompt and check again. Go back and ask the team if that table really needs to be dropped. Try to rephrase what you’re about to do in different words. Email out again to the team and wait some time before you pull the trigger. Check one more time that you have a fresh backup.

Delay that destructive command as long as you possibly can.

10. Keep it simple

I know I know, we all want to use that new command or tool, or jump on the latest hardware and take it for a spin. We want to build beautiful architectures that perform great feats of magic. But the fewer moving parts, the less things that can go wrong. And in ops, your job is stability and availability. Can you avoid using multi-master replication and go with just basic master-slave replication in MySQL? That’s simpler. Can you have fewer schemas or fewer filter rules? Can you skip the complicated HA layer, and use monitoring and manual failover?

Made it this far? Grab our newsletter.

The four-letter-word dividing Dev and Ops

devops divide

What’s that word?  RISK

Operations teams are tasked with stability and uptime. That means working against change, limiting or slowing it down where possible.

Developers are tasked with features and delivering business solutions. For that an ORM layer seems appealing for example. It speeds up & simplifies coding. At the same time it eliminates database drudgery.

For ops who are tasked with uptime, an ORM is a force against scalability. I’ve outlined five things toxic to scalability. They work against performance.

The question remains – do devops folks solve the problem?

Consider the banking crisis

Bankers are tasked with making money for their shareholders. To do this they innovate with financial products. Though you may argue they are unscrupulous at times, capitalism and shareholder value drive them to find profit.

Meanwhile the government’s job is to provide a level playing field.  They enact rules, regulate and provide oversight and auditing. As with operations, this is a conservative role, that avoids risk, and seeks stability, growth and avoidance of recessions and depressions.

These tradeoffs exist in many disciplines. The trick is how we find the balance.

There is an equally interesting question of decoupling in internet architectures. I’ll write a future piece on similar parallels I see in the economy at large.

Road War Story – Hacking Inflight Solutions

 

The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there three months previous, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.

 

 

12 hours to liftoff

I was scrambling to pack my luggage to go on summer vacation.  I was bound for SF at the moment and my flight was leaving in the morning.  I was trying to wrap up loose ends and my former boss was entreating me – “Can you help us?  Our replication setup has just melted down.  We need you to cleanup the mess.”

The so-called pain point

After a few more early am Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases.  (For the curious, here is how) I find countless records different between the two databases.

Now my flight is boarding, so I pack up the laptop and find my seat.  As soon as the seat belt lights flash off, I’m flipping open my macbook at getting inflight wifi working.  Through the flight I’m on SKYPE with the team, with command line terminals open to the servers.  Discuss, debug, troubleshoot – rinse, repeat.

From there I write up a report and explain to the team & CTO the problem.  Syncing that many different records is too risky.  We’d have to review all the statements one-by-one.  I’d rather rebuild replication from scratch.

From there the CTO gives the go ahead, and with the help of Percona’s xtrabackup to do online hotbackups, we are able to fix replication without downtime. Amen to that!

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot.  Traffic returns to a smooth predictable 2 million pageviews per day.  That’s smooth and predictable on a site that gets 50 million a month!   The database loads are calm and steady, as our all of our nerves.   In the coming days we continue to monitor the situation, and write up lengthly root cause analysis of the situation.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial
  • Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of your team
  • Everyone knows eachother virtual or in real life, and are comfortable with the parts they play.  A team that can work together is crucial, whether it’s all fulltime folks, some consultants, some outsourced or wherever they may be.  Each has a role to play, and communication and team work brings it all together.

  • Have laptop will travel
  • I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things
  • If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t login to the systems you’re working on regularly, you better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

    CTOs and Directors of Operations take note

  • Small & Nimble wins the day
  • I’ve used this value proposition before when speaking to prospects.  You can hire a big firm, and be a small fish to them.  Small fish means you’re gonna get less attention.  OR you can hire a small firm or contractor.  Then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk averse operations folks
  • In developers you’re building technology, features, and forging ahead into new solutions.  The role is more to create waves, and break barriers.  How can we enable new business processes and so forth?

    In hiring operations personnel you want stability.  Look for individuals who are more risk averse.  This conservative streak is a countering force.  Ops teams are tasked with that job of bringing a steady state to your business services.  They don’t want to wake up at 2am in the morning.

Scalability Rules for managers and startups

Scalability RulesAbbott and Fisher’s previous book, The Art of Scalability received good reviews for shifting the way we think about scalability from merely splitting databases and adding servers, to include the human factors that weigh heavily on its success. Together with the authors’ distinguished pedigree (PayPal, Amazon, and eBay between them), I picked up a copy of their second book, Scalability Rules – 50 Principles for Scaling Web Sites without a second thought.

If Art was about laying a strong foundation for a scalable organization then Rules is the reference point for when you actually tackle the growth challenges. It acts as a reminder when you come to a crossroad of decision-taking, to keep with the principles of scaling. Each guiding principle is clearly explained and illustrated with examples. It also prescribes how and when to apply the rules. Continue reading Scalability Rules for managers and startups

How to hire a developer that doesn't suck

xkcd_goodcode
Strip by Randall Munroe; xkcd.com

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk averse operations team.

By the way we also have a MySQL DBA Interview Questions article which is quite popular.

Also take a look at our AWS & EC2 Interview questions piece.

Lastly we have a great Oracle DBA Hiring Guide.

It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing starts.

If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?

Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability. Continue reading How to hire a developer that doesn't suck

Why generalists are better at scaling the web

Join 12,100 others and follow Sean Hull on twitter @hullsean.

Recently at Surge 2011, the annual  conference on scalability  and performance, Google’s CIO Ben Fried gave an illuminating keynote address. His main insight was that generalists are the people that will lead engineering teams in successfully scaling the web.

Read: Why devops talent is in short supply

In a world where the badge of Specialist or Expert is prized, this was refreshing perspective from an industry bigwig. As tech professionals, or any professional for that matter, we don’t welcome the label of generalist. The word suggests a jack-of-all-trades and master of none. But the generalist is no less an expert than the specialist. Generalists can get their hands greasy with the tools to fix bugs in the machine but they are especially good at mobilizing the machine itself; with their talents of broad vision, and perspective they can direct an entire team to accomplish tasks efficiently. This ability to see big-picture can not be underestimated especially during times of crisis or pressure to meet targets. For a team to scale the web effectively, you’re going to need a good mix of both types of personalities.

Also: Why a killer title can make or break your content efforts

Continue reading Why generalists are better at scaling the web