Category Archives: Cloud Computing

Cloud DBA and Management Interview

What does a cloud computing expert need to know? This is the last of a three-part guide to interviewing for a cloud operations position. You can find the others here – part one Operations Interview and part two Deployment Interview.

Here’s my guide to do just that.

1. Database administration experience

Although in some shops the DBA role is a completely separate one, there are many others where the Linux and Operations teams manage these services as well. We also have some other material: Oracle DBA Interview questions and MySQL DBA Interview Guide. Here’s a taste of what to expect.

o What is RAID? Which type is best?

RAID is a way to share a whole bunch of disks on one server. Databases like Oracle or MySQL do a lot of writing and reading from disk. If there are more disks sharing this work, it’s like you have more waiters in your restaurant. Faster service.

Although some folks still hang onto RAID 5 as an option, it’s generally a very bad one. It has a serious write penalty because of the parity calculations it must perform. Most databases do a lot of writing, even when user transactions are not doing INSERT or UPDATE. What’s more, if a disk fails, RAID 5, although technically online, will be so slow as to be effectively unusable while the long slow rebuild happens.

What’s the answer then? RAID 10! It mirrors each volume, and then stripes across those mirrored sets. Fast I/O, fast recovery. Done & done.
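To put rough numbers on that write penalty, here is a back-of-the-envelope sketch in Python. The penalty factors (4 back-end I/Os per write for RAID 5, 2 for RAID 10) are the commonly cited rule-of-thumb values, and the disk count and per-disk IOPS are made-up assumptions, not measurements.

```python
# Rule-of-thumb back-end I/Os generated by one random write.
# RAID 5: read data, read parity, write data, write parity = 4
# RAID 10: write to both sides of the mirror = 2
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def effective_write_iops(level, disks, iops_per_disk):
    """Estimate random-write IOPS for an array (ignores caches and controllers)."""
    raw_iops = disks * iops_per_disk
    return raw_iops // WRITE_PENALTY[level]


# Hypothetical array: 8 disks at 150 IOPS each.
for level in ("raid5", "raid10"):
    print(level, effective_write_iops(level, disks=8, iops_per_disk=150))
```

With the same eight disks, this estimate gives RAID 10 twice the write throughput of RAID 5. Real arrays will differ, but the ratio is the point to make in an interview.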

o What are the tradeoffs with more indexes versus fewer?

In all relational databases, you build indexes on data. Indexes are just like the ones you think of in the yellow pages, phonebooks of yore. An index on first name means you can look up Obama by Barack as well. Index on street addresses means you can lookup on the White House. So the more indexes you have, the more different ways you can search for & fetch what you want.

On the other hand the penalty here, is that whenever you add new data & records to this database, all those indexes must be updated. That’s overhead, which slows down writes.

So the tradeoff is more indexes – faster fetching, slower writing. Fewer indexes slower fetching, faster writing.
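A toy illustration of that tradeoff, as a sketch only: the `Table` class below is hypothetical and uses plain Python dictionaries as indexes, which is far simpler than what a real database does. It counts how much extra work each insert causes.

```python
class Table:
    """Toy table: a list of rows plus one dict-based index per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: {} for col in indexed_columns}
        self.index_updates = 0  # counts the extra work done on writes

    def insert(self, row):
        self.rows.append(row)
        # Every index must be maintained on every insert.
        for col, index in self.indexes.items():
            index.setdefault(row[col], []).append(row)
            self.index_updates += 1

    def find(self, col, value):
        if col in self.indexes:
            return self.indexes[col].get(value, [])  # direct lookup
        return [r for r in self.rows if r[col] == value]  # full scan


people = Table(indexed_columns=["first", "last", "street"])
people.insert({"first": "Barack", "last": "Obama", "street": "1600 Pennsylvania Ave"})
print(people.find("first", "Barack")[0]["last"])
print(people.index_updates)
```

One insert into a table with three indexes costs three index updates. A table with no indexes pays nothing on insert, but every lookup has to scan all the rows.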

o What do NoSQL databases eliminate? How do they achieve great speed?

There are quite a few different types of NoSQL databases. So I’m generalizing quite a lot here. One thing NoSQL databases eliminate is the ability to JOIN data across different tables. By removing this great feature of relational databases, they dramatically simplify the underlying implementation. No free lunch!

What else? Many of these databases cut corners on what’s called durability. What is durability? Imagine you are in a lecture hall and bring your notebook or are waiting tables, and taking orders. It might be quicker to do so without writing things down. You keep it all in your head. Great, but what if you forget something? You have to go ask for the order again! It may be faster, but more prone to error. Losing data is not something to be taken lightly. NoSQL databases don’t always flush data to permanent storage.
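In code, the difference between writing it down and keeping it in your head often comes down to whether the software forces a flush to disk. Here is a minimal Python sketch; the file name and the `write_order` helper are made up for illustration, and real databases do this with far more care.

```python
import os
import tempfile


def write_order(path, order, durable):
    """Append one order to a file; optionally force it to permanent storage."""
    with open(path, "a") as f:
        f.write(order + "\n")
        if durable:
            f.flush()             # push Python's buffer to the operating system
            os.fsync(f.fileno())  # ask the OS to flush its cache to the disk


path = os.path.join(tempfile.mkdtemp(), "orders.log")
# Quick: the data may still sit in an OS cache when this call returns.
write_order(path, "table 4: two coffees", durable=False)
# Slower: the call returns only after the OS reports the data flushed.
write_order(path, "table 7: one tea", durable=True)
```

Skipping the `os.fsync` call is faster on every write, which is the kind of corner a database can cut to gain speed.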

[quote]
Whether or not a web operations candidate uses command line may seem like a small issue. But it speaks to what their DNA is, and the strength of their foundation. Strength and comfort on the command line is key.
[/quote]

o What is Amazon RDS? When should I use it?

Amazon has a managed relational database solution called RDS. It’s basically MySQL, Oracle or SQL Server, but modified so you can’t shoot yourself in the foot. Administrative tasks are simplified, but so are your configuration options.

I wrote an in-depth Amazon RDS use cases article. It mostly covers MySQL, but the general rules apply to Oracle & SQL Server. At the end of the day RDS is a lot less configurable and flexible. But if you don’t have a regular DBA on staff, it will probably simplify your administration of these servers.

o What are read-replicas? What about Multi-az?

Read-replicas are read-only copies of your data. Using MySQL these are fairly stock master-slave configurations. Note since they’re the standard technology, they’re still asynchronous. So yes the read-replica can lag behind.

Multi-az is a proprietary technology, and Amazon doesn’t disclose what’s under the hood. However it’s likely running on top of something like DRBD, which is a distributed replicated block device. This allows the underlying disk I/O to be mirrored across the network to another availability zone. You’ll enjoy synchronous copies of your data, and no data consistency problems. Keep in mind though that the alternate server is offline or cold and can take time to come online.

o What is the primary bottleneck of hosting databases in the cloud? How has Amazon recently addressed this?

As I explained above, disk I/O remains the largest bottleneck for relational databases, even if the entire dataset fits in memory. Why? Because sorting, joining, and rearranging data can take orders of magnitude more memory than the data itself, so that work spills to disk. And that’s not even talking about durability guarantees.

The cloud has traditionally lagged quite a lot behind physical servers in terms of disk I/O so some internet firms have shied away from moving to the cloud. EBS volumes were typically limited to a few hundred IOPS.

Amazon recently announced Provisioned IOPS. It’s a mouthful of a name for a very big development. It means you can provision how fast you want those virtual disks to be. For individual volumes the limit seems to be 2000 IOPS but you can also software RAID across many of those virtual disks. For Amazon RDS the limit is reportedly 10,000 IOPS. This new feature will make a huge difference for hosting large high I/O databases in Amazon’s cloud.

2. Architecture & Management Questions

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?

As large applications are architected to provision servers and other resources in the cloud, the API they work through becomes key. Sticking to an open standard for this API means you can change cloud vendors and/or build on multiple ones. We talked about this multi-cloud solution as a key way to avoid outages like the ones AirBNB and Reddit experienced when AWS had an outage.

Following on the heels of that article, we were quoted about multi-cloud by Brandon Butler in his Network World piece.

o Do you use command line tools? Why?

A good web operations candidate should be very comfortable with command line tools. Everything in Linux is command line. It’s like Broadway acting to movie acting, or literature to books. It’s the original source and much more powerful. What’s more, it indicates and requires much stronger theoretical knowledge of the underlying systems being managed.

o What can go wrong with backups? How do we test them?

Everything can go wrong with them. They can fail to complete. They can be backups of the wrong service or resource. Even the backup software itself can have bugs. The only way to sleep well at night is if you run fire drills and restore your application and data top to bottom.
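A very small version of such a restore test can be automated. The Python sketch below archives a directory, restores it somewhere else, and compares checksums file by file. It assumes a flat directory of ordinary files, and the file names are invented for the example. A real test would also start the application against the restored data.

```python
import hashlib
import os
import tarfile
import tempfile


def checksum(path):
    """Return the SHA-256 hex digest of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def backup_and_verify(source_dir, archive_path):
    """Archive source_dir, restore it somewhere else, and compare every file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname="data")

    restore_dir = tempfile.mkdtemp()
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(restore_dir)

    for name in os.listdir(source_dir):
        original = os.path.join(source_dir, name)
        restored = os.path.join(restore_dir, "data", name)
        if not os.path.exists(restored) or checksum(original) != checksum(restored):
            return False
    return True


# Hypothetical data directory with a single file in it.
source = tempfile.mkdtemp()
with open(os.path.join(source, "customers.csv"), "w") as f:
    f.write("id,name\n1,Alice\n")

ok = backup_and_verify(source, os.path.join(tempfile.mkdtemp(), "backup.tar.gz"))
print(ok)
```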

o Should we encrypt filesystems in the cloud? What are the risks?

This depends on your environment and how sensitive your data is. If you’re collecting credit card data for instance, it may be key. However some surprising blips may push other applications to encrypt as well. Bugs in the hypervisor could potentially make your data vulnerable. What’s more, if the cloud provider gets subpoenaed, your server and data may well get caught in the net. Better safe than sorry. Remember you don’t know where your data actually resides, but you do control who has access if you’re encrypted.

We wrote a very in-depth piece on Deploying on Amazon EC2 where we discuss questions such as encryption in more depth.

o Should we use offsite backups?

It’s definitely worth doing this. One more layer of insurance.

o What is load balancing? Why is it difficult with databases?


Load balancing puts a digital traffic circle into your infrastructure, giving you two roads or paths to resources. However those resources have to be exactly the same. With databases you are constantly writing to tables, and updating records. When you scale those horizontally, it becomes very difficult to keep track of changes across every copy.

[quote]
Relational databases are inherently difficult to scale. Most environments scale a single authoritative master vertically, and add multiple read-only slaves horizontally to allow the application to serve more customers.
[/quote]
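The single-master, many-replicas pattern can be sketched in a few lines of Python. This is a simplified illustration, not a production router: the `ReplicatedDatabase` class and server names are hypothetical, and it ignores transactions and replica lag.

```python
import itertools


class ReplicatedDatabase:
    """Route writes to a single master and spread reads across replicas."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql):
        verb = sql.strip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master       # only one authoritative copy takes writes
        return next(self._replicas)  # any read-only copy can answer a read


db = ReplicatedDatabase("master", ["replica-1", "replica-2"])
print(db.route("INSERT INTO orders VALUES (1)"))
print(db.route("SELECT * FROM orders"))
print(db.route("SELECT * FROM orders"))
```

Reads balance nicely across the replicas, but every write still lands on the one master. That is why the write side is the hard part to scale.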


o Why use a package manager? Can we install from source?

Package managers simplify the installation of software components. A team such as Red Hat, Ubuntu or Debian builds a distribution, and compiles all components, storing them in a repository. Installing packages this way allows your setup to be standard across servers. This allows more automation, and is simpler for another admin to figure out what you have, down the line when it passes to someone else’s shoulders.

Installing from source is generally a bad idea. Although it allows you to tweak and configure each piece of software the way you want, tightly and efficiently, it also means everything is custom. No commoditization advantages.

o What is horizontal scalability?

This involves adding more hardware, more individual servers to service the same application and users.

o What is vertical scalability?

This means scaling up or growing your existing single server, so it is larger, has more memory, cpu or faster disk.

o What can go wrong with automatic failover?

Just about everything. Applications and services can stall, disks can fail, servers can hang. What’s more networks can exhibit latency. Automatic failover is ultimately a piece of software or algorithm trying to diagnose and handle situations. And it does so based on a very small list of rules or heuristics. The real world is messy, so this can often lead to false failure detection, and potentially loss of data.
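Here is what one of those simple rules might look like, as a hedged Python sketch. The function name and the threshold of three are assumptions chosen for illustration; real failover tools combine several such checks.

```python
def should_fail_over(health_checks, threshold=3):
    """Return True only after `threshold` consecutive failed checks.

    health_checks is a list of booleans, oldest first (True = healthy).
    """
    consecutive_failures = 0
    for healthy in health_checks:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
    return consecutive_failures >= threshold


# One missed check, perhaps caused by network latency: no failover.
print(should_fail_over([True, True, False, True]))
# Three misses in a row: the rule fires, rightly or wrongly.
print(should_fail_over([True, False, False, False]))
```

The rule cannot tell a dead master from a congested network. Set the threshold low and you get false failovers, set it high and real outages last longer.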

o How do cloud vendors implement vertical scalability?

This may vary dramatically between cloud providers. Ultimately, however since virtualization allows you to boot a disk image onto any hardware, you can snapshot your current root volume or disk and then boot it on another server, one that is larger, smaller and so forth. About the only thing you need to watch out for is 32 versus 64 bit questions.

If you haven’t already, don’t forget to check out the rest of this series – part one Operations Interview and part two Deployment Interview.

Read this far? Grab our newsletter – startup scalability.

Cloud Deployment Interview

What does a cloud computing expert need to know? In part one of the cloud interview guide we covered some basic Unix & Linux systems administration skills, and cloud computing and infrastructure concepts. Those are key starting points. You might also want to jump to part three, Cloud DBA, Architecture and Management interview questions.

In this second part, let’s dig into deploying applications in the cloud, and day to day operations skills. There’s a lot of material here. We recommend picking a few questions out of the bunch and focusing on those questions, rather than trying to cover all of them.

Also while on the topic of hiring, keep in mind that Hiring is a Numbers Game.

1. Deploying in the Cloud

Deploying applications into virtual or cloud datacenters involves understanding and evaluating providers. Many just deploy on Amazon EC2 as it is far and away the largest cloud hosting solution, with the most robust offering.

You might also like our MySQL DBA Interview Guide as well.

o What sets Amazon apart from the other cloud providers?

There are probably two things that set Amazon apart from other cloud infrastructure solutions. EBS or Elastic Block Store being one. Although the others have storage solutions, and Rackspace is working on their own virtualized storage, Amazon seems to be the furthest ahead with their offering. It is fully virtual, allows arbitrary chunks of storage to be attached to instances, and allows instances to boot off EBS volumes.

The other major point is that since Amazon has grown so large, so quickly, it has more datacenters, in more geographically dispersed areas than other providers. Since these are organized into logical resources, and can be accessed through an API, it makes your application infrastructure truly virtual.

o What are some other large cloud providers?

Joyent, Rackspace cloud, Storm on Demand, GoGrid and VoxCloud. There are certainly many others. Take a look at this Quora post: Most Reliable Cloud Providers.

o Tell one vendor management story.

Everyone who has managed operations, has worked with vendors at one point or another. For example if you’ve worked with Rackspace you know that it’s pretty easy to get a human on the line. Amazon on the other hand allows you to do-it-yourself for everything, and only later added on a support service option. So their service pattern and history are different.

Also check out 3 Things CEOs should know about the cloud.

o How do you troubleshoot a problem?

There isn’t really a right or wrong answer to this question, but it’s a nice starting point to discussion. It can also help illustrate a candidate’s communication skills, and how specifically they walk through solving a problem. What problem they choose as an illustration, and how they work through to a resolution is an important indicator of operations experience.

[quote]
Pros and cons of Amazon versus Rackspace, configuration management & automation and cloud management solutions like Scalr and Rightscale… these and other skills are important for a cloud deployment expert.
[/quote]

o What are Puppet and Chef?

Puppet is a configuration management system which allows ops teams to build templates for servers, and deploy many servers based on those templates. It further allows centralized control of configuration, to automate the management of a large number of servers.

Chef grew out of frustrations with Puppet, and is a sort of next generation configuration management system.

The term infrastructure as code may be thrown around. Since all cloud resources can be provisioned through API calls, everything in server deployment can be *theoretically* done via code, from spinup of servers, to installing packages, to configuring, code checkout, seeding databases and more.
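The core idea behind infrastructure as code is a description of the desired state, plus code that works out how to get there. The Python sketch below shows only the planning step. The role names are made up, and the `plan` function is a hypothetical stand-in for what tools like Puppet or Chef do in a far more complete way. No cloud API is called.

```python
def plan(desired, actual):
    """Compare desired server counts with what is running; return actions."""
    actions = []
    for role in sorted(set(desired) | set(actual)):
        diff = desired.get(role, 0) - actual.get(role, 0)
        if diff > 0:
            actions.append(("launch", role, diff))
        elif diff < 0:
            actions.append(("terminate", role, -diff))
    return actions


# The "code" that describes the infrastructure (role names are made up).
desired = {"web": 4, "db": 2}
actual = {"web": 2, "db": 2, "cache": 1}
print(plan(desired, actual))
```

Running the plan again once the environment matches the description produces no actions, which is the property that makes this style of automation safe to repeat.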

Also see our article What is Infrastructure provisioning and why is it important.

o What are some of the pros and cons of configuration management for operations?

Pros include allowing a smaller team to automate the deployment of a large fleet of servers, standardization, and consistency. Cons include complexity when needing to do surgical, urgent changes, and complexity when coming into an existing environment that you’ve inherited.

o How is Rightscale different? What does it provide?

Rightscale is a layer on top of your cloud provider. They provide a common interface and dashboard from which to deploy servers. Templating, automation, and multi-cloud support make it a great solution for teams that have less technical expertise on staff or fewer hands to manage things.

o How about Scalr?

They’re another management solution, that supports multiple cloud providers. They offer templating, and auto-scaling too.

While you’re here, take a look at our Myth of Five Nines – Why HA is Overrated.

2. Day to day skills

o What type of programming experience do you have?

The answer is that every ops guy or girl should be able to code, just as every developer should have some basic operational experience. Should and does are often two different things, so ask for some examples.

o shell scripts

Bash, csh, Perl and Python are all part of the Linux administrator’s toolbox. Writing backup scripts, log rotation, automating routine tasks and so forth are all common needs of an operations expert.

Regular expressions are a part of Unix and used in scripting to search files, cronjobs, and ETL jobs. Ask for some basic examples.
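As one basic example a candidate might offer, here is a Python sketch that pulls fields out of a log line. The log format is a simplified syslog-style line assumed for this example. Real logs vary, so the pattern would need adjusting.

```python
import re

# Matches lines like: "Nov  1 03:12:45 web01 sshd[2211]: Failed password for root"
# (a simplified syslog-style format, assumed for this example)
LINE = re.compile(r"^(\w{3}\s+\d+ [\d:]{8}) (\S+) (\w+)\[(\d+)\]: (.*)$")


def parse(line):
    """Return (timestamp, host, service, pid, message) or None if no match."""
    match = LINE.match(line)
    return match.groups() if match else None


sample = "Nov  1 03:12:45 web01 sshd[2211]: Failed password for root"
timestamp, host, service, pid, message = parse(sample)
print(host, service, message)
```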

o What is continuous integration?

The old model of code deployment was called waterfall, and allowed long careful planning, coding of new features, testing, and finally deployment. The cycle could take weeks or months and iterative change took a lot of time. Continuous integration, often associated with agile deployments, allows much more frequent deployment of changes, in some cases many times per day.

o What are metrics good for?

Just like in website visitor tracking, and business analytics, server level analytics and tracking is possible. Collecting server metrics such as load averages, memory, disk and cpu usage over time can be invaluable. When an application slows or server stalls, checking historical metrics can often quickly reveal problems or causes.

What are some examples? Nagios, Ganglia, Cacti, Munin and OpenNMS.
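Once the history is collected, even a crude check can point at the moment things went wrong. The Python sketch below flags samples that jump well above the running average. The load numbers are invented and the factor of 2 is an arbitrary assumption. The tools listed above do this with proper thresholds and graphs.

```python
def find_spikes(samples, factor=2.0):
    """Return indexes where a sample exceeds `factor` times the average so far."""
    spikes = []
    total = 0.0
    for i, value in enumerate(samples):
        if i > 0 and value > factor * (total / i):
            spikes.append(i)
        total += value
    return spikes


# Hypothetical one-minute load averages collected every five minutes.
load = [0.8, 1.0, 0.9, 1.1, 4.5, 1.0]
print(find_spikes(load))
```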

o What is unit testing?

This allows for software to be built in small testable components. When the components are coded, tests are also written that check whether they are operating properly, and whether dependencies are also installed and working.
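A minimal example using Python's standard `unittest` module. The `disk_usage_percent` function is a made-up component, chosen only to have something small to test.

```python
import unittest


def disk_usage_percent(used_bytes, total_bytes):
    """The small component under test."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(100 * used_bytes / total_bytes, 1)


class DiskUsageTest(unittest.TestCase):
    def test_half_full(self):
        self.assertEqual(disk_usage_percent(50, 100), 50.0)

    def test_rejects_zero_total(self):
        with self.assertRaises(ValueError):
            disk_usage_percent(1, 0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiskUsageTest)
result = unittest.TextTestRunner().run(suite)
```

Each test checks one behavior, so when a test fails it points directly at the component that broke.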

[quote]
Metrics, monitoring, load testing, firewalls, security & patching, SaaS, PaaS and IaaS: there is a wide swath of skills needed to be competent as a web operations engineer. You’ve got your work cut out for you!
[/quote]

o What is load testing?

By performing some benchmarks, load testing can make estimates about how the application and code will perform when more users are hitting it.
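The shape of a load test can be sketched in a few lines of Python. The `handle_request` function here is a stand-in that just sleeps for 10 ms. In a real test it would call your application, and you would more likely reach for a dedicated load testing tool than hand-rolled threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(_):
    """Stand-in for a real request; sleeps 10 ms and returns its own latency."""
    start = time.perf_counter()
    time.sleep(0.01)
    return time.perf_counter() - start


def load_test(users, requests_per_user):
    """Fire requests from `users` concurrent workers and summarize latency."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(handle_request, range(total)))
    return {
        "requests": total,
        "avg_latency": sum(latencies) / total,
        "max_latency": max(latencies),
    }


print(load_test(users=5, requests_per_user=4))
```

Rerunning with more users and watching how the average and maximum latency change is what lets you estimate behavior under heavier traffic.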

o Security & networking

Sometimes a systems administrator is a generalized admin and sometimes there is a networking specialist on staff who doesn’t allow anyone else to touch that domain.

o What are firewall rules?

Unix services use port numbers to expose those services to the world. Since all servers on the internet are identified by IP addresses, firewall rules are defined around IP addresses or groups of them, and the ports they’re allowed to access.
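The matching logic behind a rule set can be shown with Python's standard `ipaddress` module. This is a sketch of the concept only. The rules and addresses below are invented for illustration, and a real firewall such as iptables or an EC2 security group has its own syntax.

```python
import ipaddress

# Each rule: (allowed source network, allowed destination port).
# Addresses are private or documentation ranges chosen for illustration.
RULES = [
    ("0.0.0.0/0", 80),       # anyone may reach the webserver
    ("0.0.0.0/0", 443),
    ("10.0.0.0/8", 3306),    # only internal hosts may reach MySQL
    ("203.0.113.7/32", 22),  # a single admin address may use ssh
]


def is_allowed(source_ip, port, rules=RULES):
    """Default deny: allow only if some rule matches both network and port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(network) and port == allowed_port
        for network, allowed_port in rules
    )


print(is_allowed("198.51.100.20", 80))
print(is_allowed("198.51.100.20", 3306))
print(is_allowed("10.1.2.3", 3306))
```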

o What is DNS?

DNS stands for Domain Name System. This is the sort of yellow pages of the internet. DNS allows a server name to be converted to its underlying IP address. It’s a very important service for any network, and generally includes many backup servers for when the primaries experience problems.

o What is a virtual private network?

A VPN provides a network link between a physical datacenter or your office’s network, and your cloud provider. It allows you to elastically grow your existing datacenter using virtual resources, while treating those new boxes more like servers in your existing datacenter. IP addresses and subnets are controlled by your existing network rules and admins.

o Why is security important in web operations?

Since your business assets are primarily stored in digital form, the security of those assets depends on the security of your computer systems. Passwords, firewalls and encryption are all relevant.

o Why is patching software important?

Since security is a moving target, and vulnerabilities are constantly being discovered in software, patching and updates are important. Staying fairly current in applying patches means your network and systems will be more secure.

o What is intrusion detection?

Bugs in software open up vulnerabilities and ways into systems. Intrusion detection attempts to detect such intrusions and avoid further damage.


o What is SaaS – Software as a Service?

SaaS delivers a finished application over the web. Dropbox is an example, and other so-called hold-my-data type solutions fall into this category.

o What is IaaS – Infrastructure as a Service?

This is raw iron, the virtualized datacenters, hosting providers such as Amazon, GoGrid, Joyent, and Rackspace.

o What is PaaS – Platform as a Service?

Solutions such as Heroku, Squarespace, WP Engine and Engine Yard fall into this category. Some provide a platform such as the WordPress CMS, with arbitrary scaling options. Others like Heroku and Engine Yard allow Ruby applications to be deployed without the need for a lot of fuss at the operational level.

We’re not done yet. In part three of this series, we’ll hit on DBA skills, and a series of general questions that cut across the spectrum of web operations. Or jump back to part one of the cloud interview guide.


Cloud Operations Interview

What does a cloud computing expert need to know? How do you hire a cloud computing expert? Competition for operations & DBAs is fierce, so you’ll want to know how to find the best.

If you’re a systems administrator or ops guy, you may want to prepare for an interview for such a position. Meanwhile, if you’re a director of IT or operations, a recruiter or manager in HR, you’ll want to have some idea how to find the right candidate.

Here’s my guide to do just that. You may also jump to part two Cloud Deployment Interview or the last part three Cloud DBA, Architecture and Management Interview.

1. Solid Unix systems administrator

At the top of the list, a cloud operations expert needs to understand Unix and more importantly Linux. Here are some sample questions to get the conversation moving:

o What is web operations and what have you done day-to-day?

Prepare some stories.

o What’s your favorite feature of the Linux kernel?

This is an open ended question, but a systems administrator should have some knowledge here. The kernel is the most basic piece of software that runs when a computer boots up, whether it is a desktop or a server. This piece of software coordinates everything, manages resources, and directs traffic.

o Name some distributions of Linux. What is a distro?

Linux is built by a collaborative team of thousands on the internet. That’s what makes it open source. A distribution includes the operating system, along with a collection of software to go along with it. All the supporting utilities, libraries and servers must be compiled and held in a repository. That’s what makes up a distribution. Debian, Red Hat and Ubuntu are a few popular ones.

[quote]
A cloud operations expert needs to have a wide ranging skillset, from unix administration, architecture, scalability, database & webserver administration, troubleshooting & performance, load & stress testing. You’ll also want someone who has learned hard lessons from some failures, has some war stories to tell and has a hard nose for stability.
[/quote]

o What’s the difference between Apache and Nginx?

These two pieces of software are both webservers, that is they respond to the HTTP protocol, and can serve HTML pages. They also have a myriad of plugins to support different languages and features. The difference? Nginx (pronounced engine-X) is a newer incarnation. It’s been rearchitected from the ground up, building on all the things learned from Apache over the years. It’s tighter, more efficient code, and easier to configure.

You might also enjoy our Intro to EC2 Cloud Deployments Guide.

o What is a key value store? Examples?

There are lots of examples of these types of databases. They are a very simple memory cache that can interface with most applications. Memcached is a popular example of a key value store. Redis, CouchDB and Voldemort can also do this.
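To show how little there is to the basic idea, here is a toy key value cache in Python with optional expiry. The `KeyValueCache` class is a hypothetical sketch. It lives in one process, and it is not how any of the products above are implemented.

```python
import time


class KeyValueCache:
    """Minimal in-memory key value store with optional expiry."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # expired entries are dropped on read
            return default
        return value


cache = KeyValueCache()
cache.set("user:42:name", "Alice")
cache.set("session:abc", "logged-in", ttl=0)  # expires immediately
print(cache.get("user:42:name"))
print(cache.get("session:abc"))
```

There are no tables, no joins and no query language, only set and get by key. That simplicity is where the speed comes from.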

o What is a page cache? Reverse proxy cache? Examples?

These are all the same thing. They are basically a very minimal webserver without all the plugins or bells and whistles. You put one of these in front of your webserver to handle all the easy stuff, and speed up overall throughput. Varnish is a popular example.

o What filesystem do you prefer?

This is a bit arcane, but one should have some opinions here. xfs is a popular filesystem, though ext3 and ext4 are also common. Emphasize the journaling aspect here. Journaling means that if you pull the cord or your server crashes, the filesystem can recover upon reboot. It does this by journaling changes, much as a database keeps a redo log of recent changes to database tables.

o Command line tools

There are lots of commands in the day-to-day toolbox of a web ops expert. Here are some examples:
rsync (pronounced our-sync) – sync files between servers & do checksums to allow easy restarts
scp (pronounced s-c-p) – secure copy, similar to rsync but no checksums, so less reliable
curl (pronounced kurl) – diagnose & test urls and HTTP from the command line
cron (pronounced cron) – run commands at scheduled times
ssh (pronounced s-s-h) – secure shell, the most basic tool to reach a cloud server
ifconfig (pronounced if-config) – check the network interfaces on the server
vi/emacs (pronounced v-i and e-macks) – terminal editors, to modify config files
uptime (pronounced up-time) – display the current load average of the server
top (pronounced top) – interactive display of system metrics like memory, load, swap & processes
ps (pronounced p-s) – shows running processes on the server
/var/log/messages – essential system logfile

o What are application servers? How are they different from webservers?

Tomcat & Glassfish are two examples of application servers. These handle heavier weight languages & applications like Java. An application server on some level is just a more heavy-duty webserver, and these days Apache can be thought of as an application server also.


2. Cloud concepts

o What is virtualization? What is a hypervisor?

Virtualization allows you to run one or more computers within a computer. You can do virtualization on a desktop, sharing network, memory, cpu and disk resources among a number of virtual servers. But more importantly in cloud computing or IaaS offerings you can do virtualization at the datacenter level. The hypervisor layer is a datacenter virtualization technology that provisions server resources, and balances shared network and disk resources.

o What is an image?

In the Amazon world, the AMI or Amazon Machine Image is a snapshot of a server state at one moment in time. This image is taken at the block level, and includes the master boot record, the first block on disk that a server boots from. The entire state of a server, when it is shut down, is what is stored on disk and therefore in this image: all config files, logfiles, and anything else written to disk.

o What is multi-tenant?

This means that there are multiple servers sharing resources. The tenants are the customers who each want to get the server, cpu, memory, network and disk that they paid for.

o What is the downside to shared resources?

Contention for resources is always the challenge. If your fellow tenants are not very thirsty, this can work to your advantage. But if they’re also heavy users, the hypervisor layer has to manage the balancing act. You may get a spike of disk I/O at one point, but later get a dearth. This can cause a relational database like MySQL or Oracle to suddenly look stalled.

o What is instance-store? What is EBS?

Instance store servers were Amazon’s original offering, where servers had their own local (and slow) storage. This storage was ephemeral, so all machine state was lost when the instance was terminated or failed. These servers also boot slowly. EBS, also known as Elastic Block Store, is a virtualized storage option, similar to NAS or NFS. You can create arbitrary chunks of storage, and attach them to servers, all from command line APIs. Cool!

o What is virtual private cloud?

With the VPC offering, Amazon drops a router into your existing datacenter. You can then provision virtual servers to your heart’s content, and they all appear to be servers in your existing datacenter. Elastically scale, within the network and security model you’re already using.

o What is a hybrid approach to cloud adoption?

Keeping your investments in hardware and datacenter is obviously an appealing option for firms that have a large existing environment. A hybrid approach with a VPC allows you to get your feet wet, but still keep essential applications on physical servers.

o What is Amazon EC2?

Elastic Compute Cloud refers to the virtual servers you spin up in Amazon Web Services.

o What is Amazon RDS, Oracle RDS, MySQL RDS?

Amazon has various relational and non-relational database offerings. RDS stands for relational database service.

RDS or roll your own – which is better? Here are some use cases to help you decide.

o What is multi-az?

Amazon’s infrastructure offering isn’t just a single datacenter with servers. The beauty of what they’ve built is that they offer a number of datacenters (called availability zones) in each of many regions such as Northern Virginia, Oregon and Singapore.

Incidentally multi-az is a key feature to how businesses can protect themselves from failure. Amazon recently had an outage, but AirBNB, Reddit & Foursquare didn’t have to fail.

o What does a CDN do? How does it work? Examples?

A CDN is a content delivery network. Remember all those files that make up a webpage? Images, video, CSS files? Turns out serving these components from servers *closer* to your customer makes their webpages load much faster. CDNs are networks of servers that hold the content of your pages, and serve them faster.

It works by replacing content paths with a special one from your provider. A simple change in your code will allow content to dynamically load from across the web. Cool!

CloudFront is Amazon’s offering coupled with S3 for file storage. Akamai is another big provider.
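That simple change in your code might look something like the Python sketch below, which rewrites static asset paths to point at a CDN. The hostname `cdn.example.com` is a placeholder and the regular expression covers only a handful of file types. Many frameworks provide an asset host setting that does the same job.

```python
import re

# Hypothetical CDN hostname; a real one comes from your provider.
CDN_HOST = "https://cdn.example.com"

# Root-relative src/href attributes pointing at static file types.
STATIC_ASSET = re.compile(r'(src|href)="(/[^"]+\.(?:png|jpg|gif|css|js))"')


def use_cdn(html, cdn_host=CDN_HOST):
    """Rewrite static asset paths so browsers fetch them from the CDN."""
    return STATIC_ASSET.sub(lambda m: f'{m.group(1)}="{cdn_host}{m.group(2)}"', html)


page = '<img src="/images/logo.png"><a href="/listings">Browse</a>'
print(use_cdn(page))
```

The image now loads from the CDN while the link to the dynamic listings page is left alone.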

We’re not done yet. In part two on deployments and part three of this series, we’ll hit on other important skills a cloud ops expert should have including scripting, database administration (Our MySQL Interview Guide), scalability, performance, configuration management, metrics, monitoring, and some all important war stories!

Here are some questions to pique your interest:

o Why does the API battle between Amazon & Eucalyptus (FOSS) matter?
o Do you use command line tools? Why?
o What can go wrong with backups? How do we test them?
o Should we encrypt filesystems in the cloud? What are the risks?
o Should we use offsite backups?
o What is DRBD?
o Why is auditing important? Access control?
o What is load balancing? Why is it difficult with databases?
o How do you perform a benchmark? Perform load testing?
o Why use a package manager? Can we install from source?

Our Deploying MySQL on Amazon EC2 Guide is also related to this interview process.

You may also jump to part two Cloud Deployment Interview or the last part three Cloud DBA, Architecture and Management Interview.


AirBNB didn't have to fail

Today part of Amazon Web Services failed, taking down with it a slew of startups that all run on Amazon’s Cloud infrastructure. AirBNB was one of the biggest, but Heroku, Reddit, Minecraft, Flipboard & Coursera also went down with it. It’s not the first time. What the heck happened, and why should we care?

1. Root Cause

The AWS service allows companies like AirBNB to build web applications, and host them on servers owned and managed by Amazon. The so-called raw iron of this army of compute power sits in datacenters. Each datacenter is a zone, and there are many in each of their service regions including US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), South America (Sao Paulo), and AWS GovCloud.

Today one of those datacenters in the Northern Virginia region had a failure. What does this mean? Essentially firms like AirBNB that hosted their applications ONLY in Northern Virginia experienced outages.

As it turns out, Amazon has a service level agreement of 99.95% availability. We’ve long since said goodbye to the five nines. HA is overrated.

2. Use Redundancy

Although there are lots of pieces and components to a web infrastructure, two big ones are webservers and database servers. Turns out AirBNB could make both of these tiers redundant. How do we do it?

On the database side, you can use Amazon’s multi-az or alternately read-replicas. Each has different service characteristics so you’ll have to evaluate your application to figure out what will work for you.

Then there is the option to host mysql or Percona directly on Amazon servers yourself and use replication.

[quote]Using redundant components like placing webservers and databases in multiple regions, AirBNB could avoid an Amazon outage like Monday’s that affected only Northern Virginia.[/quote]

When do I want RDS versus MySQL? Here are some use cases for RDS versus roll your own MySQL.

Now that you’re using multiple zones and regions for your database the hard work is completed. Webservers can be hosted in different regions easily, and don’t require complicated replication to do it.

3. Have a browsing only mode

Another step AirBNB can take to be resilient is to build a browsing only mode into their application. Often we hear about this option for performing maintenance without downtime. But it’s even more valuable during a situation like this. In a real outage you don’t have control over how long it lasts or WHEN it happens. So a browsing only mode can provide real insurance.

For a site like AirBNB this would mean the entire website was up and operating. Customers could browse and view listings; only when they went to book a room would they encounter an error. This would be a very small segment of their customers, and a much less painful PR problem.
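Here is a minimal sketch of the idea in Python. Everything in it is hypothetical (the `handle_request` function, the `read_only` switch, the paths); it is not how AirBNB's application works, just one way a browsing only mode could be wired in so that reads keep working while writes get a friendly error.

```python
from dataclasses import dataclass

SAFE_METHODS = {"GET", "HEAD"}  # requests that only read data


@dataclass
class Response:
    status: int
    body: str


def handle_request(method: str, path: str, read_only: bool) -> Response:
    """Serve reads normally; refuse writes while the site is in browsing only mode."""
    if read_only and method not in SAFE_METHODS:
        # 503 signals a temporary problem rather than a broken page.
        return Response(503, "Booking is temporarily unavailable. You can still browse listings.")
    if method in SAFE_METHODS:
        return Response(200, f"Showing {path}")
    return Response(201, f"Saved change to {path}")


print(handle_request("GET", "/listings/42", read_only=True).status)  # 200
print(handle_request("POST", "/bookings", read_only=True).status)    # 503
```

In a real application the `read_only` value would likely come from configuration that operations staff can flip during an outage, rather than being passed in by hand.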

Facebook has experienced intermittent outages of its service. People hardly notice because they’ll often only see a message when they are trying to comment on someone’s wall post, send a message or upload a photo. The site is still operating, but not allowing changes. That’s what a browsing only mode affords you.

[quote]A browsing only mode can make a big difference, keeping most of the site up even when transactions or publishing are blocked.[/quote]

Drupal, an open source CMS that powers sites like Adweek.com, TheHollywoodReporter.com, and Economist.com, uses this technology. It supports a browsing only mode out of the box. An Amazon outage like this one would only stop editors from publishing new stories temporarily. A huge win for sites that get 50 to 100 million with-an-m pageviews per month.

4. Web Applications need Feature Flags

Feature flags give you an on/off switch. Build them into heavy duty parts of your site, and you can disable those in an emergency. Host components in multiple availability zones for extra peace of mind.
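As a rough illustration, a feature flag can be as simple as a named boolean checked before the expensive code path runs. The `FeatureFlags` class and the flag names below are made up for this sketch; production systems usually keep flags in a database or config service so they can be flipped without a deploy.

```python
class FeatureFlags:
    """In-memory on/off switches for parts of a site (hypothetical helper)."""

    def __init__(self, **flags: bool) -> None:
        self._flags = dict(flags)

    def enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo can't switch something on by accident.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


flags = FeatureFlags(recommendations=True, search=True)


def render_listing(listing_id: int) -> str:
    page = f"Listing {listing_id}"
    if flags.enabled("recommendations"):
        # Stand-in for a heavy query we want to be able to switch off.
        page += " + similar listings"
    return page


print(render_listing(7))          # Listing 7 + similar listings
flags.disable("recommendations")  # emergency: shed the heavy feature
print(render_listing(7))          # Listing 7
```

The page still renders with the flag off; it just skips the costly part.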

One of our all-time most popular posts, 5 Things Toxic to Scalability, includes some in-depth discussion of feature flags.

5. Consider Netflix’s Simian Army

Netflix takes a very progressive approach to availability. They bake redundancy and automation right into all of their infrastructure. Then they run an app called the Chaos Monkey which essentially causes outages, randomly. If resilience from constantly falling and getting back up can’t make you stronger, I don’t know what can!
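To make the idea concrete, here is a toy simulation in Python. It is not Netflix's Chaos Monkey and it touches no real servers; the `Instance` class, the instance and zone names, and the `MIN_HEALTHY` threshold are all invented for illustration. It kills one random instance, then checks whether enough capacity survives.

```python
import random
from dataclasses import dataclass

MIN_HEALTHY = 2  # assumption: the service needs at least two live instances


@dataclass
class Instance:
    name: str
    zone: str
    alive: bool = True


def unleash_monkey(fleet: list[Instance], rng: random.Random) -> Instance:
    """Terminate one randomly chosen live instance and return it."""
    victim = rng.choice([i for i in fleet if i.alive])
    victim.alive = False
    return victim


def service_healthy(fleet: list[Instance]) -> bool:
    """The service survives if enough instances are still running."""
    return sum(i.alive for i in fleet) >= MIN_HEALTHY


fleet = [
    Instance("web-1", "us-east-1a"),
    Instance("web-2", "us-east-1b"),
    Instance("web-3", "us-east-1c"),
]
victim = unleash_monkey(fleet, random.Random())
print(f"Killed {victim.name} in {victim.zone}; healthy: {service_healthy(fleet)}")
```

With three instances and a minimum of two, the fleet survives one kill. Unleash the monkey a second time and the health check fails, which is the kind of weakness this sort of testing is meant to expose before a real outage does.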

Take a look at the Netflix blog for details on intentional load & stress testing.

6. Use multiple cloud providers

If all of the above isn’t enough for you, taking it further you’d do as George Reese of enstratus recommends and use multiple cloud providers. Not being beholden to one company could help in more situations than just these types of service disruptions too.

Basic EC2 Best Practices mean building redundancy into your infrastructure. Multiple cloud providers simply take that one step further.

Why I Wrote the Book – Oracle and Open Source

Back in the late ’90s New York City was deep in the dot-com boom. Silicon Alley was being born, and a thousand internet startups were sprouting. Everyone was hiring, it was an exciting time to work in technology!

Join 11,500 others and follow Sean Hull on twitter @hullsean.

Trend Spotting Circa 2000

As an independent consultant, I had the opportunity to work at quite a few startups. The technology stack was identical at almost all of them. Sun Microsystems hardware, Apache webservers, and Oracle on the backend. The database was always the sticking point, and developers struggled to get their queries right.

It was an interesting role to hold. Most career DBAs worked at large Fortune 500 firms, the old stodgy kind where nothing ever changes. Few of the Oracle old guard, the kind you’d meet at User Groups or conferences, had much exposure to Linux, and they certainly didn’t trust it.

Also: Here’s how to do a scalability performance review

Meanwhile in the startup scene in NYC I was seeing the cutting edge uses of the technology, with more and more shops switching to Linux and commodity hardware. There was even talk of *gasp* Oracle porting to Linux. There was a real rumor mill around all of this.

Oracle and Open Source Published – 2001

Seeing this shift towards commodity hardware, and the tremendous demand for Oracle married with open source technologies, I pitched O’Reilly and Associates with a book idea. Let’s talk about what’s happening in the trenches. How and when does Oracle – the most commercial of relational databases, work with Open Source technologies? What is in the mix? What are real firms using it for? What tools and technologies can help firms grow faster?

Related: Oracle DBA Interview questions for managers, candidates & recruiters alike

These were the questions my co-author and I sought to answer, and to judge from the response I think we did a very good job. As that push continued, Oracle eventually ported its enterprise database to Linux. This was a seismic shift that meant existing Oracle customers would spend a lot less on hardware, and thus have more to spend on Oracle licenses. Win-win except for Sun. The trend continued with Oracle pushing Apache into the mix as well.

Fast Forward a Decade

Now a decade later, Oracle has bought its former partner Sun, and in so doing owns MySQL too.

Read this: Top MySQL Interview questions for Devops, managers & recruiters

What new trends are happening? We hear an incessant drum of hype around cloud computing. In many ways the trend parallels what happened a decade ago. See our related piece a history lesson for cloud detractors. How so?

[quote]Commoditization: push towards new platforms, driven by cost. [/quote]

But this is slowed by an equally large stumbling block.

[quote]Performance: new cloud servers can’t compete with their big iron cousins. Not yet at least.[/quote]

Interested in Amazon EC2? We wrote an Intro to EC2 Cloud Deployments article which digs in deeper.

What’s Next for Datacenters

Commoditization will continue, driving costs downward. This will provide more gravity to cloud migrations for firms big and small.

Performance will improve. Cloud services like Amazon EC2 will get bigger & better, as will the all important network & disk subsystems.

Also: 5 things toxic to scalability

Big enterprises are already dipping their feet in the water with VPC technology, tying their existing datacenter to a cloud. They can grow elastically while still having feet firmly planted on the ground.

As large enterprises begin to get experience behind the wheel, it’ll chip away at the stranglehold of Oracle and the huge taxation type licensing that firms struggle with today. Where salesforce.com had a huge impact, workday.com will be even bigger.

[quote]The cloud will finally disrupt the last old guard industry – enterprise software.[/quote]

Ask Me Questions – Scalability, Performance, Cloud Computing

I blog a lot about various topics near and dear to me. So I thought I’d turn the tables a bit, and offer the microphone up to readers. Do you have questions on any of these topics? Feel free to drop a note in the comments.

Business & Consulting

Are you a freelancer or independent consultant?  Struggling with some part of the business?  Or are you a CTO or Director looking to hire short term talent?  Glad to offer up advice and suggestions, just fire away!

Scalability

The goal of every hypergrowth company, from Pinterest to Facebook, Zynga to FourSquare.  How do they achieve it?  What architecture decisions make those applications grow effortlessly to meet user demand?  I’m sure you have some questions!

Performance

It’s on everyone’s mind, especially in the context of deploying in the cloud.  How best to achieve good performance?  Are you struggling with something specific?

High Availability

We want our systems to be available 24x7x 3million days a year!  Is it feasible?  What should we aim for in the real world?  Does our technology mix and hosting factor into the equation?  Ask away.

Startup Challenges

I’ve worked with a lot of startups over the years. From the dot-com heydays to the more sober field we see before us today. Many struggle with similar challenges. I’m glad to offer up comments and suggestions from what I’ve learned over the years.

Cloud Computing

What really works in the cloud? Is Amazon EC2 the only way to go? What are the pros and cons of going with a Rackspace type provider that also has great service and the physical data center option? How will I get good performance? These and other questions are on everyone’s lips. Feel free to comment if you have some concerns.

3 things CEOs should know about the Cloud

You’ve heard all the buzz and spiel about the cloud, and there’re good reasons to want to get there. On-demand compute power makes new levels of scalability possible. Low up front costs means moving capital expenditure to operating expenditure and saving a bundle in the process. We won’t give you any more of the rah rah marketing hoopla. You’ve heard enough of that. We’ll gently play devil’s advocate for a moment, and give you a few things to think about when deploying applications with a cloud provider. Our focus is mainly on Amazon EC2.

You might also be interested in a wide reaching introduction to deploying on Amazon EC2.

  1. Funky Performance

    One of the biggest hurdles we see clients struggle with on Amazon EC2 is performance. This is rooted in the nature of shared resources. Computer servers, just like desktops, rely on CPUs, Memory, Network and Disk. In the virtual datacenter, you can be given more than your fair share without you even knowing it. More bandwidth, more CPU, more disk? Who would complain? Well, if your application behaves erratically while you suddenly compete for disk resources, you’ll quickly feel the flip side of that coin. Stocks go up, and they can just as easily come right back down.

    Variability around disk I/O seems to be the one that hits applications the hardest, especially the database tier of many web applications. If your application requires extremely high database transaction throughput, you would do well to consider physical servers and a real RAID array to host your database server. Read more about IOPs

  2. Uncertain Reliability – A Loaded Gun

    Everybody has heard the saying, don’t hand someone a loaded gun. In the case of Amazon servers, you really do load your applications onto fickle and neurotic servers.

    Imagine you open a car rental business. You could have two brand new fully reliable cars to rent out to customers. Your customers would be very happy, but you’d have a very small business. Alternatively you could have twenty used Pintos. You’d have some breaking down a lot, but as long as you keep ten of them rented at a time, your business is booming.

    In the Amazon world you have all the tools to keep your Ford Pintos running, but it’s important to think long and hard about reliability, redundancy, and automation. Read more about Failures, Lessons & the Chaos Monkey

  3. Iffy Support

    Managed hosting providers vary drastically in terms of the support you can expect. Companies like Rackspace, Servint or Datapipe have Support built into their DNA. They’ve grown up around having a support tech that your team can reach when they’re having trouble.

    Amazon takes the opposite approach. They give you all the tools to do everything yourself. But in a crunch it can be great to have that service available to help troubleshoot and diagnose a problem. Although they’re now offering support contracts, it’s not how they started out.

    If you have a crack operations team at your disposal, or you hire a third party provider like Heavyweight Internet Group, Amazon Web Services gives you the flexibility and power to build phenomenal and scalable architectures. But if you’re a very small team without tons of technical know-how, you may well do better with a service-oriented provider like Rackspace et al.

A few more considerations…

  • Will your cloud provider go out of business?
  • Could a Subpoena against your provider draw you into the net?
  • Since you don’t know where your sensitive data is, should you consider encryption?
  • Should I keep additional backups outside of the cloud?
  • Should I use multiple cloud providers?
  • Should I be concerned about the lack of perimeter security?

A History lesson for Cloud Detractors

Computing history

We’ve all seen cloud computing discussed ad nauseam on blogs, on Twitter, Quora, Stack Exchange, your mom’s Facebook page… you get the idea. The tech bloggers and performance experts often pipe in with their graphs and statistics showing clearly that dollar-for-dollar, cloud hosted virtual servers can’t compete with physical servers in performance, so why is everyone pushing them? It’s just foolhardy, they say.

On the other end, management and their bean counters would simply roll their eyes saying this is why the tech guys aren’t running the business.

Seriously, why the disconnect? Open source has always involved a lot of bushwhacking…

Seattle Web Tech Meetup Nov 21

I’ll be one of two speakers at the next Seattle Web Technology Bi-Weekly Meet up on Nov 21 at the Citrus Lounge.

They’ve sexed it up a little by calling it a face-off between Windows Azure and Amazon EC2 (no prizes for guessing which side I represent) but really it’s going to be a primer on the Platform-as-a-service and Infrastructure-as-a-service models. I expect some lively discussions during Q&A.

I’ll be covering questions such as what cloud computing is, what EC2 provides, what datacenter automation is, and the differences between a standard datacenter like Rackspace and Amazon EC2. Meanwhile you folks who’ve made large investments in, say, Exchange servers will be able to pose questions to Marcus Wendt of Composite C1.

It’s Amex sponsored and you’ll get a ticket good for a beer or a Citrus signature drink with which you can get cozy and warm up by the fireplace while Marcus and I are beamed through a flat screen with our respective presentations. If you’re in Seattle drop in. I hope to see you or at least, hear you there.