Category Archives: CTO/CIO

Crisis Management in the Crosshairs – Sandy


The news this past week has brought endless images of devastation. All across the metropolitan region, the damage is apparent.

More than once in conversation I’ve commented “That’s similar to what I do.” The response is often one of confusion. So I go on to clarify. Web operations is every bit about disaster recovery and crisis management in the datacenter. If you saw Con Edison down in the trenches you might not know how that power gets to your building, or what all those pipes down there do, but you know when it’s out! You know when something is out of order.

That’s why datacenter operations can learn so much about crisis management from the handling of Hurricane Sandy.

This is a followup to our popular article last week Real Disaster Recovery Lessons from Sandy.

1. Run Fire Drills

Nothing can substitute for real world testing. Run your application through its paces, pull the plugs, pull the power. You need to know what’s going to go wrong before it happens. Put your application on life support, and see how it handles. Fail over to backup servers, restore the entire application stack and components from backups.
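Here is a minimal sketch of one piece of a fire drill: restoring a backup into a scratch directory and checking that it is intact. The names `sha256_of` and `restore_drill` are hypothetical, and the file copy is only a stand-in for whatever your real restore procedure is.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(backup_path, expected_sha256):
    """Restore a backup into a scratch directory and verify its checksum."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(backup_path).name
        # Assumption: a plain copy stands in for the real restore step.
        shutil.copy2(backup_path, restored)
        return sha256_of(restored) == expected_sha256
```

Run something like this on a schedule, and a corrupt or missing backup shows up during a drill rather than during the real emergency.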

2. Let the Pros Handle Cleanup

This week Fred Wilson blogged about a small data room his family managed, for their personal photos, videos, music and so forth. He ruminated on what would have happened to that home datacenter, were he living there today when Sandy struck.

It’s a story many of us can relate to, and it points to obvious advantages of moving to the cloud. Handing things over to the pros means basic best practices will be followed. EBS storage, for example, is redundant, so a single hard drive failure won’t take you out. What’s more, S3 offers geographically distributed redundant copies of your data.

After last week’s AWS outage I wrote that AirBNB & Reddit didn’t have to fail. What’s more in the cloud, disaster recovery is also left to the professionals.

[quote]
Web Operations teams do what Con Edison does, but for the interwebs. We drill down into the bowels of our digital city, find the wires that are crossed, and repair them. Crisis management rules the day. I can admire how quickly they’ve brought NYC back up and running after the wrath of storm Sandy.
[/quote]

3. Have a few different backup plans

Watching New Yorkers find alternate means of transportation into the city has been nothing short of inspirational. Trains not running? A bus service takes its place. L trains not crossing the river? A huge stream of bikes takes to the Williamsburg Bridge to get workers to where they need to go.

Deploying on Amazon can be a great cloud option, but consider using multiple cloud providers to give you even more redundancy. Don’t put all your eggs in one basket.

Some very important things to remember about MySQL backups.

4. Keep Open Lines of Communication

While recovery continued apace, city dwellers below 34th street looked to text messages, and old school radios to get news and updates. When would power be restored? Does my building use gas or steam to heat? Why are certain streets coming back online, while others remain dark?

During an emergency like this one, it becomes obvious how important lines of communication are. So too in datacenter crisis management, key people from business units, operations teams, and dev all must coordinate. Orchestrating that is an art all by itself. A great CTO knows how to do this.

Read this far? Grab our monthly newsletter, Scalable Startups.

Cloud Operations Interview

What does a cloud computing expert need to know? How do you hire a cloud computing expert? Competition for operations & DBAs is fierce, so you’ll want to know how to find the best.

If you’re a systems administrator or ops guy, you may want to prepare for an interview for such a position. Meanwhile, if you’re a director of IT or operations, a recruiter or manager in HR, you’ll want to have some idea how to find the right candidate.

Here’s my guide to do just that. You may also jump to part two Cloud Deployment Interview or the last part three Cloud DBA, Architecture and Management Interview.

1. Solid unix systems administrator

At the top of the list, a cloud operations expert needs to understand Unix and more importantly Linux. Here are some sample questions to get the conversation moving:

o What is web operations and what have you done day-to-day?

Prepare some stories.

o What’s your favorite feature of the linux kernel?

This is an open ended question, but a systems administrator should have some knowledge here. The kernel is the most basic piece of software that runs when a computer boots up, whether it is a desktop or a server. This piece of software coordinates everything, manages resources, and directs traffic.

o Name some distributions of linux. What is a distro?

Linux is built by a collaborative team of thousands on the internet. That’s what makes it open source. A distribution includes the operating system, along with a collection of software to go along with it. All the supporting utilities, libraries and servers must be compiled and held in a repository. That’s what makes up a distribution. Debian, Red Hat and Ubuntu are a few popular ones.

[quote]
A cloud operations expert needs to have a wide ranging skillset, from unix administration, architecture, scalability, database & webserver administration, troubleshooting & performance, load & stress testing. You’ll also want someone who has learned hard lessons from some failures, has some war stories to tell and has a hard nose for stability.
[/quote]

o What’s the difference between apache and nginx?

These two pieces of software are both webservers, that is they speak the HTTP protocol, and can serve HTML pages. They also have a myriad of plugins to support different languages and features. The difference? Nginx (pronounced engine-X) is a newer incarnation. It’s been rearchitected from the ground up, building on all the things learned from Apache over the years. It’s tighter, more efficient code, and easier to configure.

You might also enjoy our Intro to EC2 Cloud Deployments Guide.

o What is a key value store? examples?

There are lots of examples of these types of databases. They are a very simple memory cache that can interface with most applications. Memcache is a popular example of a key value store. Redis, CouchDB and Voldemort can also do this.
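To make the idea concrete, here is a toy in-memory key value store with optional expiry, roughly the shape of what a cache like Memcache offers. The class `TinyKeyValueStore` and its method names are made up for illustration and are not the API of any real product.

```python
import time


class TinyKeyValueStore:
    """A toy in-memory key value store with optional per-key expiry."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds=None):
        # Store the value alongside the moment it should expire (or None).
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            # Expired entries are dropped lazily, on read.
            del self._data[key]
            return default
        return value
```

A candidate should be able to explain why an application would call `get` before hitting the database, and what happens when the cache is cold.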

o What is a page cache? Reverse proxy cache? examples?

These are all the same thing. They are basically a very minimal webserver without all the plugins or bells and whistles. You put one of these in front of your webserver to handle all the easy stuff, and speed up overall throughput. Varnish is a popular example.
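As a rough sketch of the idea, the class below sits in front of an "origin" function (standing in for your real webserver) and answers repeat requests from memory. `CachingProxy` and `origin` are hypothetical names, and a real cache like Varnish also handles expiry, headers and invalidation, which this ignores.

```python
class CachingProxy:
    """Answer repeat requests from memory instead of asking the origin."""

    def __init__(self, origin):
        self.origin = origin  # callable that takes a path and returns a body
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, path):
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        # Cache miss: do the expensive work once, then remember the result.
        self.misses += 1
        body = self.origin(path)
        self.cache[path] = body
        return body
```

The hit and miss counters matter. The ratio between them is usually the first thing you look at when deciding whether a cache is earning its keep.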

o What filesystem do you prefer?

This is a bit arcane, but one should have some opinions here. xfs is a popular filesystem, though ext3 and ext4 are also common. Emphasize the journaling aspect here. Journaling means that if you pull the cord or your server crashes, the filesystem can recover upon reboot. It does this by journaling changes, much how a database keeps a redolog cache of recent changes to database tables.
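The journaling idea can be sketched in a few lines. This is an analogy only, not how xfs or ext4 are implemented: every change is appended to a journal file and flushed to disk before it is applied, so after a crash the state can be rebuilt by replaying the journal. `JournaledStore` is a hypothetical name.

```python
import json
import os


class JournaledStore:
    """Toy store that journals each change to disk before applying it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}
        self._replay()

    def _replay(self):
        # On startup, rebuild state from whatever made it into the journal.
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "r", encoding="utf-8") as journal:
            for line in journal:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # a torn final write from a crash: ignore it
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Write and sync the journal entry first, then update memory.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self.data[key] = value
```

Pull the cord halfway through a `put` and the worst case is one incomplete line at the end of the journal, which recovery skips.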

o Command line tools

There are lots of commands in the day-to-day toolbox of a web ops expert. Here are some examples:
rsync (pronounced our-sync) – sync files between servers & do checksums to allow easy restarts
scp (pronounced s-c-p) – secure copy, similar to rsync but no checksums, so less reliable
curl (pronounced kurl) – diagnose & test urls and HTTP from the command line
cron (pronounced cron) – run commands at scheduled times
ssh (pronounced s-s-h) – secure shell, the most basic tool to reach a cloud server
ifconfig (pronounced if-config) – check the network interfaces on the server
vi/emacs (pronounced v-i and e-macks) – terminal editors, to modify config files
uptime (pronounced up-time) – display the current load average of the server
top (pronounced top) – interactive display of system metrics like memory, load, swap & processes
ps (pronounced p-s) – shows running processes on the server
/var/log/messages – essential system logfile

o What are application servers? How are they different from webservers?

Tomcat & Glassfish are two examples of application servers. These handle heavier weight languages & applications like Java. An application server is on some level just a more heavy-duty webserver, and these days Apache can be thought of as an application server also.


2. Cloud concepts

o What is virtualization? What is a hypervisor?

Virtualization allows you to run one or more computers within a computer. You can do virtualization on a desktop, sharing network, memory, cpu and disk resources among a number of virtual servers. But more importantly in cloud computing or IaaS offerings you can do virtualization at the datacenter level. The hypervisor layer is a datacenter virtualization technology that provisions server resources, and balances shared network and disk resources.

o What is an image?

In the Amazon world, the AMI or Amazon Machine Image is a snapshot of a server state at one moment in time. This image is taken at the block level, and includes the master boot record, the first block on disk that a server boots from. All that is the state of a server, when it is shut down, is what is stored on disk or in this image. All config files, logfiles, and anything else written to disk.

o What is multi-tenant?

This means that there are multiple servers sharing resources. The tenants are the customers who each want to get the server, cpu, memory, network and disk that they paid for.

o What is the downside to shared resources?

Contention for resources is always the challenge. If your fellow tenants are not very thirsty, this can work to your advantage. But if they’re also heavy users, the hypervisor layer has to manage the balancing act. You may get a spike of disk I/O at one point, but later get a dearth. This can cause a relational database like MySQL or Oracle to suddenly look stalled.

o What is instance-store? What is ebs?

Instance store servers were Amazon’s original offering, where servers had their own local (and slow) storage. This storage was ephemeral, so all machine state was lost when the server was terminated or failed. These servers also boot slowly. EBS, also known as Elastic Block Store, is a virtualized storage option, similar to NAS or NFS. You can create arbitrary chunks of storage, and attach them to servers, all from command line APIs. Cool!

o What is virtual private cloud?

With the VPC offering, Amazon drops a router into your existing datacenter. You can then provision virtual servers to your heart’s content, and they all appear to be servers in your existing datacenter. Elastically scale, within the network and security model you’re already using.

o What is a hybrid approach to cloud adoption?

Keeping your investments in hardware and datacenter is obviously an appealing option for firms that have a large existing environment. A hybrid approach with a VPC allows you to get your feet wet, but still keep essential applications on physical servers.

o What is Amazon EC2?

Elastic Compute Cloud refers to the virtual servers you spin up in Amazon Web Services.

o What is Amazon RDS, Oracle RDS, Mysql RDS?

Amazon has various relational and non-relational database offerings. RDS stands for relational database service.

RDS or roll your own – which is better? Here are some use cases to help you decide.

o What is multi-az?

Amazon’s infrastructure offering isn’t just a single datacenter with servers. The beauty of what they’ve built is that they offer a number of datacenters (called availability zones) in each of many regions such as Northern Virginia, Oregon and Singapore.

Incidentally multi-az is a key feature to how businesses can protect themselves from failure. Amazon recently had an outage, but AirBNB, Reddit & Foursquare didn’t have to fail.
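A small sketch shows why spreading across zones helps. The server names and the helpers `spread_across_zones` and `survivors` are made up for illustration. The zone labels only follow Amazon's naming style. No AWS API is called here.

```python
from itertools import cycle


def spread_across_zones(server_names, zones):
    """Assign servers to availability zones round-robin."""
    placement = {}
    for name, zone in zip(server_names, cycle(zones)):
        placement[name] = zone
    return placement


def survivors(placement, failed_zone):
    """Return the servers still running after one zone fails."""
    return sorted(name for name, zone in placement.items() if zone != failed_zone)


placement = spread_across_zones(
    ["web1", "web2", "web3", "web4"],
    ["us-east-1a", "us-east-1b"],
)
# Losing us-east-1a still leaves web2 and web4 serving traffic.
print(survivors(placement, "us-east-1a"))
```

If every server had been placed in the same zone, the survivor list would be empty, which is roughly what happened to the sites that went dark.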

o What does a CDN do? How does it work? examples?

A CDN is a content delivery network. Remember all those files that make up a webpage? Images, video, css files? Turns out serving these components from servers *closer* to your customer makes their webpages load much faster. CDNs are networks of servers that hold the content of your pages, and serve them faster.

It works by replacing content paths with a special one from your provider. A simple change in your code will allow content to dynamically load from across the web. Cool!

CloudFront is Amazon’s offering coupled with S3 for file storage. Akamai is another big provider.
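The path replacement can be as simple as the sketch below. The hostname `cdn.example.com`, the extension list and the helper `cdn_url` are all assumptions for illustration. Your provider will hand you the real hostname.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these are the file types we want served from the CDN.
STATIC_EXTENSIONS = (".png", ".jpg", ".gif", ".css", ".js", ".mp4")


def cdn_url(asset_path, cdn_host="cdn.example.com"):
    """Point static assets at the CDN host; leave dynamic paths alone."""
    parts = urlsplit(asset_path)
    if parts.path.lower().endswith(STATIC_EXTENSIONS):
        return urlunsplit(("https", cdn_host, parts.path, parts.query, ""))
    return asset_path
```

With this in place, `cdn_url("/images/logo.png")` becomes a URL on the CDN host, while a dynamic path like `/checkout` is returned untouched and still hits your own servers.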

We’re not done yet. In part two on deployments and part three of this series, we’ll hit on other important skills a cloud ops expert should have including scripting, database administration (Our MySQL Interview Guide), scalability, performance, configuration management, metrics, monitoring, and some all important war stories!

Here are some questions to pique your interest:

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?
o Do you use command line tools? why?
o What can go wrong with backups? how do we test them?
o Should we encrypt filesystems in the cloud? what are the risks?
o Should we use offsite backups?
o What is DRBD?
o Why is auditing important? access control?
o What is load balancing? why is it difficult with databases?
o How do you perform a benchmark? perform load testing?
o Why use a package manager? can we install from source?

Our Deploying MySQL on Amazon EC2 Guide is also related to this interview process.


Read this far? Grab our newsletter – startup scalability.

AirBNB didn't have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but Heroku, Reddit, Minecraft, Flipboard & Coursera also went down with it. It’s not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

As it turns out, Amazon has a service level agreement of 99.95% availability. We’ve long since said goodbye to the five nines. HA is overrated.
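To put those percentages in perspective, here is the back-of-the-envelope arithmetic. The helper name `allowed_downtime_hours` is hypothetical, and the calculation assumes a 365-day year with the percentage applied over the whole year.

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime permitted over a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


# 99.95% over a year works out to roughly 4.38 hours of downtime.
print(round(allowed_downtime_hours(99.95), 2))

# Five nines (99.999%) allows only about 5.26 minutes a year.
print(round(allowed_downtime_hours(99.999) * 60, 2))
```

That gap, hours versus minutes, is the difference between what the SLA promises and what five nines would demand.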

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s multi-az or alternately read-replicas. Each has different service characteristics so you’ll have to evaluate your application to figure out what will work for you.

Then there is the option to host mysql or Percona directly on Amazon servers yourself and use replication.

[quote]Using redundant components like placing webservers and databases in multiple regions, AirBNB could avoid an Amazon outage like Monday’s that affected only Northern Virginia.[/quote]
When do I want RDS versus mysql? Here are some use cases for RDS versus roll your own MySQL.

Now that you’re using multiple zones and regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website was up and operating. Customers could browse and view listings. Only when they went to book a room would they encounter an error. This would be a very small segment of their customers, and a much less painful PR problem.

Facebook has experienced intermittent outages of its service. People hardly notice because they’ll often only see a message when they are trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publish are blocked.
[/quote]

Drupal, an open source CMS that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com, uses this technology. It supports a browsing only mode out of the box. An Amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win for sites that get 50 to 100 million with-an-m pageviews per month.
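In application code, a browsing only mode can be as small as one switch checked on every write path. The sketch below is a toy with made-up names (`ListingSite`, `ReadOnlyModeError`), not how AirBNB or Drupal implement it.

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the site is browse-only."""


class ListingSite:
    def __init__(self):
        self.read_only = False
        self.listings = {1: "Loft in Williamsburg"}
        self.bookings = []

    def view_listing(self, listing_id):
        # Reads keep working no matter what mode the site is in.
        return self.listings[listing_id]

    def book(self, listing_id, guest):
        # Writes check the switch first and fail with a friendly message.
        if self.read_only:
            raise ReadOnlyModeError(
                "Booking is temporarily unavailable; browsing still works."
            )
        self.bookings.append((listing_id, guest))
        return True
```

During an outage of the primary database, operations flips `read_only` to true, and only the customers who try to book see an error.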

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy duty parts of your site, and you can disable those in an emergency. Host components in multiple availability zones for extra peace of mind.

One of our all time most popular posts 5 Things Toxic to Scalability included some indepth discussion of feature flags.
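A feature flag does not need to be fancy. Here is a minimal sketch, with the flag name `recommendations` and the helpers `FeatureFlags` and `render_homepage` invented for illustration. In practice the flags would live in a config store that operations can change without a deploy.

```python
class FeatureFlags:
    """A tiny on/off switchboard for site features."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name):
        # Unknown flags are treated as off, the safe default.
        return self._flags.get(name, False)

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False


def render_homepage(flags):
    """Build the list of page sections, skipping disabled heavy features."""
    sections = ["listings"]
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections
```

When the database is struggling, calling `flags.disable("recommendations")` sheds that load immediately while the core of the page keeps rendering.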

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!

Take a look at the Netflix blog for details on intentional load & stress testing.
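The core idea fits in a few lines. This is a toy simulation, not Netflix's actual tool: `chaos_round` and `service_is_up` are hypothetical names, and nothing here touches real servers.

```python
import random


def chaos_round(healthy_instances, rng=None):
    """Pick one healthy instance at random and 'terminate' it."""
    rng = rng or random.Random()
    victim = rng.choice(sorted(healthy_instances))
    remaining = set(healthy_instances) - {victim}
    return victim, remaining


def service_is_up(remaining, minimum_needed=1):
    """The service survives if enough instances are still healthy."""
    return len(remaining) >= minimum_needed


victim, remaining = chaos_round({"web1", "web2", "web3"})
print("terminated:", victim, "service up:", service_is_up(remaining))
```

If a single round like this can take your service down, you have found a single point of failure on your own schedule instead of Amazon's.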

6. Use multiple cloud providers

If all of the above isn’t enough for you, taking it further you’d do as George Reese of enstratus recommends and use multiple cloud providers. Not being beholden to one company could help in more situations than just these types of service disruptions too.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.

Read this far? Grab our newsletter on scalability and startups!

Why do people leave consulting?

Join 12,100 others and follow Sean Hull on twitter @hullsean.

As a long time freelancer, it’s a question that’s intrigued me for some time. I do have some theories…

First, definitions… I’m not talking about working for a large consulting firm. Although this role may be called “consultant”, my meaning is consultant as sole proprietor, entrepreneur, gun for hire or lone wolf.

1. Make more money in a fulltime role

I’ve met a lot of people who fall into this trap. They take a fulltime role simply because it pays better. That raises a lot of questions…

o Are you pricing right?

You could be pricing too high to get *enough* work. You may also be pricing too low to cover benefits, health insurance and so forth. Or perhaps you can’t sell to your rate. You can be smart skills-wise, but do you feel your client’s pain? Are you good at being a businessman? Consistent?

o Can you sell, and put together an appealing proposal?

o Can you execute to the client’s satisfaction?

o Can you followup consistently while accounts payable gets tied up in knots?

o Can you followup if your client executes past their spend?

Running a business is complicated, and a lot of expenses can be hard to juggle. You will find times when a client may have spent a little faster than their revenue, and have trouble finding money when the invoice arrives. Followup, patience and persistence are key.

Read: Why high availability is so very hard to deliver

Want more? We wrote an in depth 3 part guide to consulting.

2. Make a consistent paycheck in a fulltime position

o Are you networking enough?

If you take a longterm gig and get comfortable, your pipeline can dry up. And your pipeline is the key to your longterm strength, and regular business. You must get out there, and let people know about you, your services, and your availability.

If you don’t network regularly, post across the web, engage on social media channels, blog regularly and so forth, you’ll likely just land a series of 6-12 month fulltimeish gigs through recruiters or headshops.

Related: 5 ways to evaluate independent consultants

[quote]Being a freelancer or entrepreneur involves wearing many hats. Finding business involves networking & marketing. Delivering to their needs involves emotional intelligence. And actually getting paid on time is a whole artform in itself. Leave a good taste in their mouth and your reputation will spread quickly by word of mouth.[/quote]

o Do you really *LIKE* being an entrepreneur?

Are you consistent? Consulting is like running a marathon, if you burn out you may give up!

Have a large web property or application which is experiencing some growing pains? Take a look at how we do performance reviews. It may be just what you’re looking for.

Related: MySQL interview guide for managers and candidates alike

3. Do you like the lifestyle of larger corporate environments?

o Fulltime roles allow for much more jedi sword play. Maneuvering up the ranks involves relationship building as much as consulting, but with a more well defined ladder to climb.

o Sometimes you’ll find pass the buck and pointing fingers quite common.

o There are roles involving managing people and processes. These less often lend themselves to short term or situational consulting arrangements. If you lean towards those roles

Trying to hire top tech talent? Here’s our MySQL DBA hiring guide & interview questions

[quote]Working as a sole proprietor for a couple of decades has taught me to be very entrepreneurial. It is every bit about building a real-world startup[/quote]

4. Want to do more cutting edge & at the keyboard work

Consulting can and often does allow you to bump into the latest technologies, and get your feet wet with what cutting edge firms are doing. However in a fulltime role you can more completely immerse yourself in the technology, and those long term solutions.

Also: Why devops talent is in short supply

o You can take part in R&D – Google’s 20% projects, for example

o You can build hypothetical projects

o You can work in more idealistic environments, operations and even lectures & training

Though you can certainly do all of this as a freelancer, you have to build enough capital, and so forth to make it work.

Juggling job roles as a consultant isn’t easy. What a CTO must never do.

5. Don’t like running a small business

Consulting as a sole proprietor and staying in business for almost twenty years, I’ve learned that it is every bit about running a small business or startup.

A. Acquiring customers, networking, marketing
B. Understanding their needs and delivering to improve their position
C. Pricing in a way your customers understand
D. Offering value to your customers, at a competitive price
E. Managing relationships so your brand or reputation precedes you
F. Making sure payments and invoicing aren’t a hurdle, followup
G. Pacing yourself like a marathon runner – keep doing what you’re doing right

Read this far? Get our scalable startups monthly newsletter. We cover these topics in detail, year in and year out.

Anatomy of a Performance Review

A lot of firms come to us with a specific scalability problem. “Our user base is growing rapidly and the website is falling over!” Or they’re selling more widgets, “Our shopping cart is slowing down and we’re seeing users abandon their purchases”. These are real startup growing pains, so what to do?

We like to take a measured approach with these types of challenges, so we thought it would be helpful to run through a hypothetical scenario and see how we work.

Related: Why website speed is crucial to business

Having trouble with scalability? Check out our 5 things toxic to scalability piece.

1. Contract outline

First we talk on the phone, or meet face to face and discuss what’s happening. Do you have one page that’s problematic? Is the website slow during certain hours? Or are you seeing erratic behavior and can’t point to a single source?

From there we outline a course of action, based on:

o talking with team, devs & architects
o reviewing systems first hand
o identifying bottlenecks and trouble spots

With this outline we’ll include an estimate of the number of work days it’ll take to complete. We’ll then send that back to you for review, exchange a deposit and set a start date.

2. Meet team & discuss architecture

Next we’ll meet the team and review the problems in more technical detail. If you’re in NYC we’ll probably make a stop into your offices and have a warm meet & greet. If you’re located further afield we can either meet over a skype call, or arrange for us to travel to your location for the start of the engagement.

3. Measure current throughput

In order to get a sense of the current state of the systems we’ll measure some system metrics. This could be load average or queries per second or other MySQL internal metrics. We’ll also look at some business metrics such as speed of an ecommerce checkout, or a speed test on a particularly slow page.

These metrics are designed to create a baseline of where things are before any changes are made.

[quote]Measuring both business and system metrics before and after changes allows a rough ROI measurement to be done. This goes a long way towards justifying the expense of a performance review, current and future.[/quote]
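As a rough illustration of taking a baseline, the sketch below times an operation several times and keeps the median. `measure_baseline` is a hypothetical helper, and the `operation` you pass in would be whatever stands in for your slow page or checkout step.

```python
import statistics
import time


def measure_baseline(operation, runs=5):
    """Time a callable several times and summarize the results."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_seconds": statistics.median(timings),
        "max_seconds": max(timings),
    }
```

The median is used rather than a single run because one-off timings are noisy. Save the result somewhere safe, since the same measurement gets repeated after the changes go in.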

4. Review systems, configurations & setups

Next we’ll jump on the various systems and review configurations. This includes webservers, caching servers and the database servers as necessary. We’ll review memory settings, important configurations, all the dials and switches.

Along with this we’ll also review development and architecture. Are you using Java with Hibernate, a popular ORM? Or perhaps CakePHP? Are you writing custom SQL code? Are developers up to speed with EXPLAIN and query profiling? For that matter is code in version control?

Just looking for a DBA? Check out our MySQL Hiring Guide.

5. Report on actionable advice & findings

Perhaps the most essential and useful part of an initial engagement is our overall findings and review report. We’ve found these are very valuable to firms as they speak to a lot of folks up and down the business hierarchy. They speak to management about high level architectural problems and structural or process related challenges. And they can speak well to developers and operations teams as they provide a third party birds eye view of day-to-day activities.

Take a look at a sample report we’ve prepared for Acme StartUp, Inc.

6. Discuss which steps to move on

From here we’ll meet again. In particular we’ll review the actionable advice. Some changes will be low cost, requiring no downtime, while others might require a downtime window. Further medium term changes might require refactoring some code and deploying. Typically the larger longer term architecture changes will also be outlined.

Based on time & costs, we’ll decide together which changes are a priority. Obviously we’ll want to move on low hanging fruit first, and move forward from there.

Want to learn more about us? Check out our testimonials and our about page.

7. Take action on agreed changes

Once we’ve decided which changes we’ll make, we’ll schedule downtime windows as needed and make the changes to systems. From there we’ll carefully observe everything for stability, and no adverse effects.

8. Measure throughput again

Based on the throughput measurements in #3 above, we’ll perform those same benchmarks again. We’ll check low level system metrics, along with higher level business & user based throughput. Both of these are important as they can provide different perspectives on changes made.

For example if the system metrics improve markedly, but the business or user metrics do not, we know our change had some effect on overall performance, but likely we did not identify the bottleneck which is directly causing the business slowdown.
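The comparison can be sketched as below. The helper names and the 10% "meaningful" threshold are assumptions for illustration, and the metrics are assumed to be lower-is-better, like page load time in seconds.

```python
def percent_improvement(before, after):
    """Improvement for a lower-is-better metric such as response time."""
    return (before - after) / before * 100


def diagnose(system_gain_pct, business_gain_pct, meaningful_pct=10.0):
    """Interpret the two gains together, as described above."""
    system_better = system_gain_pct >= meaningful_pct
    business_better = business_gain_pct >= meaningful_pct
    if system_better and business_better:
        return "both improved"
    if system_better:
        return "system improved, business bottleneck not yet identified"
    if business_better:
        return "business improved without a clear system-level cause"
    return "no meaningful change"


# A query that dropped from 2.0s to 0.5s is a 75% improvement.
print(percent_improvement(2.0, 0.5))
```

If checkout time barely moved while the query got 75% faster, `diagnose` lands in the second branch, and the hunt for the real bottleneck continues.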

9. Summarize findings & performance gain

In the most likely case they both improve markedly, and we can measure the improvements from our entire process of performance review.

This can be helpful in measuring overall return on investment for the engagement. ROI is obviously an important exercise as we want to know that the money is well spent.
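The usual rough ROI formula is gain minus cost, divided by cost. The figures below are invented purely for illustration, and estimating the gain (recovered sales, avoided hardware spend) is the hard part that this sketch takes as given.

```python
def roi_percent(estimated_gain, engagement_cost):
    """Return on investment as a percentage of the engagement cost."""
    return (estimated_gain - engagement_cost) / engagement_cost * 100


# Hypothetical numbers: a $10,000 engagement that recovers $25,000 in sales.
print(roi_percent(25000, 10000))  # 150.0
```

A negative result means the engagement cost more than it returned, which is worth knowing too.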

10. Document solutions & recommendations

The last step is to document what we did and what we learned. This allows us to carry forward that knowledge and keep applying it to the development and operations process. This allows the business to continue adding value from the engagement even after it’s completed.

Read this far? Grab our newsletter.

Why you should attend Percona Live 2012

What I loved about Percona Live 2011

Last year I was excited to go to Percona Live for the first time in NYC. I arrived just in time to hear Harrison Fisk from Facebook speak about some of the awesome tweaks they’re running with MySQL there. It’s not everyday that you get to hear from top MySQL engineers how they’re using the technology and what their biggest challenges are. If they can make MySQL hum, so can the rest of us!

Afterward, outside in the foyer, I ran into all sorts of luminaries in the MySQL space. Percona folks like Peter Zaitsev & Vadim Tkachenko, plus other big names like Baron Schwartz, Harrison, and Ronald Bradford. I ran into people from firms like Yahoo, Google, Daniweb, Pythian, SkySQL & Palomino.

You might also like our Setup MySQL Replication with Hotbackups as well as How to deploy MySQL on Amazon EC2 servers articles.

What to expect at Percona Live NYC 2012

This year’s event next month features rockstar engineers from an incredible lineup of firms including Etsy, New Relic, Youtube, Paypal, Tumblr, SugarCRM, Square, and of course a few from Percona themselves. I promise you this, these talks won’t be salesy or in any way a waste of your time and money. They will be thoroughly technical talks, with cutting edge insights and advice from those in the trenches using the technology everyday.

If I wasn’t heading to Oracle Open World for the publishers seminar & MySQL Connect, I would most certainly be there. In fact I had originally been slated to talk about point-in-time recovery in MySQL. Oh well, I’m sure I’ll catch you at the Percona Live in April 2013.

If you do decide to attend please enjoy a 15% discount with code “SeanHull” !

Looking to hire top MySQL talent? Check out our MySQL DBA Hiring Guide with advice for managers, recruiters, and candidates too! We also have an enduringly popular article about the mythical MySQL DBA and why they’re hard to find.

Also if you’ve read this far, please grab my newsletter scalable startups.

Beware the sales wolf in sheep suits

Recently a colleague called me up to get my opinion.

[quote]We’re in the process of standardizing our systems on Red Hat Linux, but management and higher ups are convinced we should deploy Oracle on Oracle’s own Linux distribution. Which is better?[/quote]

Therein lies the eternal drama in organizations, the push & pull between dollars and technology best practices.

We had a similar experience with a MySQL deployment, and solution framed by Oracle sales.

Battle lines are drawn

Clearly the battle lines are drawn now. Between director of operations & team versus management & business stakeholders, between high level and the trenches, or between the systems that support your business and day-to-day running of them.

Business units & management are tasked with budgets, cost management, and long term thinking about trajectory and what’s best for the business. Operations teams are tasked with the day-to-day stability, the command line perspective.

What is the sales team’s position?

Sales guys at Oracle have a job to sell licenses. This isn’t good or bad, it’s their driver. Understanding all the drivers will help us align them.

Sales guys sell to management, so they will likely frame all their stories around management concerns. Also Oracle’s history here is fairly clear. Get customers locked into Oracle up and down the stack, and they become more and more beholden to you as their primary provider. As customers become more dependent, Oracle can begin to squeeze more and more out of them.

Nothing personal, this is how money is made. But understand the goal.

How do OS choices affect the business bottom line?

Standardizing across the enterprise reduces costs & reduces operational complexity. This can reduce the risk of operator error & other downtime, which increases with a more heterogeneous environment.

On the Oracle distribution side, you likely have tweaks to make Oracle run better. However don’t forget the profit motive. Some tweaks may be conveniently “overlooked” in favor of profit. For example for many years the Oracle installer would not complete without error on many Linux systems. Imagine all the professional services that are sold around running through a complex install. Streamlining such an install would *reduce* profits. Don’t laugh.

What happens on the front lines?

On the front lines of course are the ops teams & DBAs, actually installing and supporting enterprise software. Let’s not forget these guys are at the command line. They know inordinately more about what’s really happening down in the trenches. You may find them repeatedly rolling their eyes at salesmen’s claims.

However they are not the colorful storytellers or communicators that salesmen are, so they may

Want to hire a DBA? Here’s our MySQL interview hiring guide. We also wrote a similar one for Oracle DBA Interview questions.

Align each division’s interests

Despite cultural differences, business management & operations teams should work hard to connect, and align with one another.

Operations should make an effort to better understand the business bottom line. Money doesn’t grow on trees as they say, and choices have to be based on budget, and real-world needs. We’d all like to sit in a university and program or build things just to create something new, but in a business there are market pressures. All teams should reflect on those.

Management should also make an effort to understand ops teams’ needs. Why are my ops teams telling me a different story than the Oracle sales guys? Fight the urge to bond with the sales folks, despite their smooth delivery, great suits and peer positioning.

Weigh short and long term tradeoffs

List out advantages & tradeoffs on all sides. These should be technical and business bullet points. Brainstorming a full list like this, and having the whole team discuss the list openly will help the team together come up with a more realistic outcome. Some questions to ask…

1. What are the advantages & disadvantages of having multiple providers for your technology stack?
2. Which solutions are open and which are proprietary? What are the tradeoffs there?
3. What does your team have subject matter expertise in?
4. Are there real technical advantages to one solution or the other?
5. Are there real cost advantages to one solution or the other?
6. Are there expertise advantages & training savings to go one direction?
7. Is the technology widely used in your industry? Will additional or replacement operations experts be easy or hard to find?

Read this far? Grab our scalable startups newsletter.

Juggling apples & oranges in the datacenter


In which a few choice words become one serious accident…

The Backstory

More than five years ago now, I worked for a shop in the business of news & information around the legal and real estate sectors. It was a fairly large organization with a number of Oracle and MySQL backed applications. The whole place ran on Sun servers, with a team of systems administrators, developers, and of course editors & content folks.

I was the primary database administrator for almost an entire year back then. I reported directly to the CTO. She was bright, competent and great to work for.

Although she had a technical background, she often spoke about products and gave very high level directives when making requests. This was made more confusing as the environment lacked naming conventions. So often product names didn’t match server or database names.

I tended to take the very paranoid approach. I’d ask over and over for clarification, and let some time pass before actually executing on a request.

A Changing of the Guard

After many months as a contractor DBA, the firm finally located a fulltime guy to replace me. It’s no easy task finding a DBA these days, especially for MySQL.

He was a very bright guy with a lot of technical knowledge. A bit green, but fully capable of managing an enterprise database shop.

Looking for a top-notch DBA? Here’s our MySQL interview questions & hiring guide. We also have one for hiring an Oracle DBA.

Nuking the database

After two weeks on the job, something unpleasant happened.

Imagine a chef working with cooks & confusing dishes with vegetables.

[quote]
Chef says, “Toss the avocado”
Cook throws the avocado salad in the trash thinking it’s rotten.
Chef comes back later asking quizzically, “I wanted you to mix it up!”
[/quote]

In the datacenter the conversation went something like this…

[quote]
CTO: Drop the journal database & rebuild.
DBA: Ok. Give me a few minutes.
CTO: What did you do? The whole application is offline now!
[/quote]

From there scrambling ensued. After nearly six hours of screaming and firefighting, everything was finally restored from backups and the application brought back online.

Naming – product or components?

Semantics is very important. Those in the trenches tend to take requests word-for-word while those managing the troops tend to make requests in terms of products, divisions & the vantage point of the business.

That’s why naming conventions can be so important. You don’t want to be talking about apples when you really mean oranges.

Living with dysfunction

As environments grow over years and years, they tend to evolve into a spaghetti of confusing names & relationships. It’s the nature of enterprise environments.

- big confusion can mean big mistakes
- check & recheck – be risk averse and a bit paranoid
- check yourself, your shell, your hostname, your login
- ask questions & clarify repeatedly
- let some time pass before executing a destructive command
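
The checklist above can be sketched as a small pre-flight check to run before anything destructive. This is only an illustration, not a tool from the original story: the function name `preflight`, the host `db-staging`, the login `dba` and the database name `journal_archive` are all hypothetical. The idea is simply to compare where you are against where you think you are, and to make the operator retype the exact target name.

```python
import getpass
import socket


def preflight(expected_host, expected_user, target, typed_confirmation,
              current_host=None, current_user=None):
    """Return a list of reasons NOT to run a destructive command.

    An empty list means every check passed. current_host and current_user
    default to the real hostname and login; they can be overridden so the
    check is easy to try out safely.
    """
    host = current_host if current_host is not None else socket.gethostname()
    user = current_user if current_user is not None else getpass.getuser()

    problems = []
    if host != expected_host:
        problems.append(f"on host {host!r}, expected {expected_host!r}")
    if user != expected_user:
        problems.append(f"logged in as {user!r}, expected {expected_user!r}")
    # Retyping the exact target name forces a second look at what is being dropped.
    if typed_confirmation != target:
        problems.append(
            f"confirmation {typed_confirmation!r} does not match target {target!r}"
        )
    return problems


# Hypothetical example: the operator believes they are on staging,
# but the shell is actually on production, and they typed the product name.
for problem in preflight("db-staging", "dba", "journal_archive", "journal",
                         current_host="db-prod", current_user="dba"):
    print("STOP:", problem)
```

A check like this doesn’t replace asking for clarification. It only catches the cases where the shell, host or name in front of you isn’t the one you had in mind.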

Made it this far? Grab our newsletter.

Where’s my 80 million dollars?

Way back in the heyday of the dot-com boom, the year was 1999.

I worked for a medium size internet startup called Method Five. When I came on board they were having a terrible time with their site performance.

Website crashing

When I first met the team, I was tasked with performance problems. After all their flagship web property kept crashing, and it didn’t look good to investors. As with most web properties in those days it was a home-grown datacenter in the back of the office, running on Sun Microsystems hardware, with Oracle on the backend and Apache serving webpages.

Also: Why a killer title can make or break your content efforts

Negotiating an acquisition

As became clear after day one, the project was particularly sensitive. They were negotiating a huge acquisition by a firm called Xceed Corp. The sticking point? Their crashing website did not show their technology prowess in a particularly positive light. To say the least!

Read: Why high availability is so very hard to deliver

Investigation

As it turns out the site had all the right players, from systems administrators to a DBA who sat watch over the Oracle systems.

As I dug into the systems, I found a serious smoking gun. It seems the Oracle software was configured to use just 5M of memory out of about 256M free. Just like MySQL, the server must be configured to use available memory upon startup. There are myriad caches and buffers which need to be attended to. By today’s standards these numbers probably sound absurd. Nevertheless the DBA wasn’t familiar with the basic memory settings, and so the system was terribly bottlenecked.
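
To put those numbers in perspective, here is a rough sketch of the arithmetic. It is not how Oracle or MySQL report memory; the function names and the 50% threshold are assumptions chosen purely for illustration.

```python
def cache_utilization(configured_mb, available_mb):
    """Fraction of available memory the database caches are configured to use."""
    if available_mb <= 0:
        raise ValueError("available_mb must be positive")
    return configured_mb / available_mb


def is_undersized(configured_mb, available_mb, minimum_fraction=0.5):
    """Flag a configuration using less than minimum_fraction of available memory.

    The 0.5 default is an arbitrary illustrative threshold, not a vendor
    recommendation.
    """
    return cache_utilization(configured_mb, available_mb) < minimum_fraction


# The figures from the story: 5M configured out of about 256M free.
share = cache_utilization(5, 256)
print(f"{share:.1%} of available memory in use")  # 2.0% of available memory in use
print("undersized:", is_undersized(5, 256))       # undersized: True
```

In other words the database was working with roughly one fiftieth of the memory sitting idle on the box.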

Read this: Why a four letter word divides dev and ops

Problem Solved

We then ordered some urgent changes to the system, configuring all of Oracle’s caches to use up the precious memory available.

Immediately the website unlocks, transactions begin flowing, and webpages are returning quickly. End users pull their noggins off their keyboards, and the executives begin breathing a sigh of relief. The site was literally 1000x faster during peak.

Related: MySQL interview guide for managers and candidates alike

Acquisition

Shortly thereafter the acquisition goes through for a cool 5 million in cash and 80 million in stock.

Where’s my cut?! You might be asking that question. But my policy is almost always to defer to something concrete and tangible, aka fees and real compensation. I did not negotiate any stock in the deal.

Another popular war story we wrote: A CTO Must Never Do This….

Read: Why devops talent is in short supply

Lessons Learned

o Don’t believe received wisdom. Check and double check what’s really happening.
o Use the memory and resources you have available.
o Measure capacity, and isolate bottlenecks in the system
o Decouple services wherever possible
o Problems are as often people and process as they are with technology

Also: 5 more things deadly to scalability

Made it this far? Grab our newsletter!

You're Too Young To Be My Boss

About a year ago I engaged with a firm to do some operations work on their site. They provided services to colleges and universities.

When they first reached out to me, they were rather quick to respond to my proposal. They seemed to think the quote was very reasonable. I also did some due diligence of my own, checking the guy’s profile on the about page. I noticed he was 25, rather young, but I didn’t think much else of it.

We discussed whether they wanted fixed hours. Since those would limit my availability we both agreed a more flexible approach made sense. This worked well for me as I tend to shift and schedule time liberally, so I can be efficient & flexible with clients, but still have a life too.

Trouble Brewing

As we began to interact the first week, I sensed something amiss. My thought was that the first week you work with a client, they feel you out. They see how you work, when you work, how much gets done and so forth. This provides a benchmark with which to measure you. If either party is unhappy with how things are going, they discuss and make adjustments accordingly.

What was happening in this case was the guy started pestering me. I began to get incessant messages on instant messenger asking for updates. I had none. I explained that I would contact him as things were completed, or if I had questions.

This was only two days into the project. I’d barely gained access to the servers!

The Fever Pitch

After discussing my concerns on the phone, the gentleman kind of glossed them over. From there the pestering continued. I explained that I could not be available to him any hour of the day, while the engagement only provided for one half of a week. This began to interrupt me from other client work, so I had to sign off of instant messenger. Not good.

The Pot Boils Over

We spoke again on Monday briefly, and decided to connect the following day. From there the pestering began anew, and I began to lose my patience. I insisted that we speak on the phone before work would continue. I felt the problem was deteriorating and discussing over text would only make things worse.

He emailed me back as I was then offline. In his email he ordered me to come online. While he sat in a meeting, he explained, he could not take a call! Nevertheless he insisted we resolve it during the meeting. Distracted no less.

[quote]It was then that I started receiving text messages on my personal mobile phone from the guy, pestering me to get online so we could resolve our communication problem! You can’t make this stuff up![/quote]

The Fallout

Eventually we did both get on the phone, and I explained I had reached my wits’ end. After only ten short days of working together, we had both set strong precedents and they were obviously not compatible. He asked if I would stay on longer, and reconsider working together, and I said I would think about it.

I chose not to dig a deeper hole, and let him know I wouldn’t be invoicing for the previous week’s work.

The Lessons

o beware age differences – in our case an 18 year gap
o pay attention to management styles – self-starters don’t need micromanaging
o be patient & keep communicating
o allow for an exit strategy that is amenable to both parties

Read this far? You’ll love our newsletter. Get Scalable Startups. No Spam. No Selling.