iHeavy Insights 82 – Better Practices

Best practices is a term we hear thrown around a lot.  But like that New Year’s diet, it too often ends up being more talk than action.

Manage Processes

Operator error, i.e. typing the wrong command, is always a risk.  Logging into the wrong server to drop a database, or typing the dump command such that you dump data into the database rather than out of it: these are risks that operations folks face every day.

Accountability is important, so be sure all of your systems folks log in to their own accounts.  Apply the least-privilege model: give permissions on an as-needed basis.

Set prompts with big bold names that indicate production servers and their purpose.  Automate repetitive commands that are prone to typos.
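
As a minimal sketch of both ideas, the Python below builds a loud banner for production hosts and refuses a destructive command unless the operator retypes the host name.  The "prod" naming convention, the function names and the host names are all assumptions made for illustration, not a standard.

```python
# Hypothetical convention: production host names contain the string "prod".
PRODUCTION_MARKER = "prod"


def is_production(hostname: str) -> bool:
    """Return True when the host name follows the assumed production convention."""
    return PRODUCTION_MARKER in hostname.lower()


def banner(hostname: str, purpose: str) -> str:
    """Build a prompt prefix that is hard to miss on production hosts."""
    if is_production(hostname):
        return f"*** PRODUCTION: {hostname} ({purpose}) *** "
    return f"{hostname} ({purpose}) "


def confirm_destructive(hostname: str, typed: str) -> bool:
    """Allow a destructive command on production only if the operator
    retyped the exact host name; non-production hosts are not gated."""
    if not is_production(hostname):
        return True
    return typed.strip() == hostname
```

A wrapper script around commands such as a database drop could call confirm_destructive() before doing anything, so that a typo or a wrong terminal window costs a retype rather than a restore.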

Don’t be afraid to give developers read-only accounts on production servers.

Communicate Clearly

Regular team meetings, à la Agile stand-ups, are a great way to encourage folks to communicate.  Bring the developers and operations folks together.  Ask everyone in turn to voice their current todos, their concerns and the risks they see.  Encourage everyone to listen with an open mind.  Consider different perspectives.

Communication is a cultural attribute, so it comes from the top.  Encourage this as a CTO or CIO by asking questions, communicating your concerns, and repeating your own requests in different words.  Listen to what your team is saying, repeat and rephrase those concerns, and say how and when they will be addressed.

Document Processes

A culture of documenting services and processes is healthy.  It provides a central location and knowledge base for the team.  It also prevents sliding into the situation where only one team member understands how to administer critical business components.  Were that person to be unavailable or to leave the company, you’re stuck reverse engineering your infrastructure and guessing at architectural decisions.

Better Practices

Rather than think of best practices as something you need to achieve today, think of them as an ongoing day-to-day quest for improvement.

  • repetitive manual processes – employ automation & script those processes where possible.
  • where steps require investigation and research – document them
  • where production changes are involved – communicate with business units, qa & operations
  • always be improving – striving for better practices

Amazon Web Services – What is it and why is it important?

Amazon Web Services is a division of Amazon the bookseller, but this part of the business is devoted solely to infrastructure and internet servers.  These are the building blocks of data centers, the workhorses of the internet.  AWS’s offering of Cloud Computing solutions allows a business to set up, or “spin up” in the jargon of cloud computing, new compute resources at will.  Need a small single-CPU 32-bit Ubuntu server with two 20G disks attached?  It’s one command and 30 seconds away!

As we discussed previously, Infrastructure Provisioning has evolved dramatically over the past fifteen years, from something that took time and cost a lot to the fast, automatic process that it is today with cloud computing.  This has also brought with it a dramatic culture shift in the way that systems administration is done: from a fairly manual process of configuring physical machines and software, one that took weeks to set up new services, to a scriptable and automatable process that can take seconds.

This new realm of cloud computing infrastructure and provisioning is called Infrastructure as a Service or IaaS, and Amazon Web Services is one of the largest providers of such compute resources.  They’re not the only ones of course.  Others include:

  • Rackspace Cloud
  • Joyent
  • GoGrid
  • Terremark
  • 3Tera
  • IBM
  • Microsoft
  • Enomaly
  • AT&T

Cloud Computing is still in its infancy, but is growing quickly.  Amazon themselves had a major data center outage in April that we discussed in detail.  It sent some hot internet startups into a tailspin!

More discussion of Amazon Web Services on Quora – Sean Hull

Point-in-time Recovery – What is it and why is it important?

Web-facing database servers receive a barrage of activity 24 hours a day.  Sessions are managed for users logging in, ratings are clicked and comments are added.  Even more complex are web-based ecommerce applications.  All of this activity is organized into small chunks called transactions.  They are discrete sets of changes.  If you’re editing a word processing document, it might autosave every five minutes.  If you’re doing something in Excel it may provide a similar feature.  There is also a built-in mechanism for undo and redo of recent edits you have made.  These are all analogous to transactions in a database.

These are important because all of these transactions are written to logfiles.  They make replication possible, by replaying those changes on another database server downstream.

If you have lost your database server because of hardware failure or instance failure in EC2, you’ll be faced with the challenge of restoring your database server.  How is this accomplished?  Well, the first step would be to restore from the last full backup you have, perhaps a full database dump that you perform every day late at night.  Great, now you’ve restored to 2am.  How do you get the rest of your data?

That is where point-in-time recovery comes in.  Since those transactions were being written to your transaction logs, all the changes made to your database since the last full backup must be reapplied.  In MySQL this transaction log is called the binlog, and there is a mysqlbinlog utility that reads the transaction log files and prints out those statements so they can be replayed through the mysql client.  You’ll tell it the start time, in this case 2am when the backup happened.  And you’ll tell it the end time, which is the point-in-time you want to recover to.  That time will likely be the time you lost your database server hardware.
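
As a rough sketch of that replay step, the Python below only assembles the shell command; it does not run it.  The --start-datetime and --stop-datetime options belong to mysqlbinlog itself, while the function name, the binlog file names and the timestamps are made up for illustration.

```python
import shlex
from datetime import datetime


def binlog_replay_command(binlogs, start, stop):
    """Build a shell pipeline that replays binlog events between two times.

    binlogs: binlog file names in the order they were written.
    start:   time of the last full backup (already restored).
    stop:    the point in time to recover to.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    args = [
        "mysqlbinlog",
        f"--start-datetime={start.strftime(fmt)}",
        f"--stop-datetime={stop.strftime(fmt)}",
        *binlogs,
    ]
    # mysqlbinlog prints SQL; piping it into the mysql client applies it.
    return " ".join(shlex.quote(a) for a in args) + " | mysql"


if __name__ == "__main__":
    # Hypothetical example: backup taken at 2am, server lost at 2:35pm.
    print(binlog_replay_command(
        ["mysql-bin.000041", "mysql-bin.000042"],
        datetime(2011, 5, 1, 2, 0, 0),
        datetime(2011, 5, 1, 14, 35, 0),
    ))
```

Printing the command rather than executing it leaves room for a human to check the time window before anything touches the restored database.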

Point-in-time recovery is crucial to high availability, so be sure to back up your binlogs right alongside the full database backups that you take every night.  If you lose the server or disk that the database is hosted on, you’ll want an alternate copy of those binlogs available for recovery!

Quora discussion on Point-in-time Recovery by Sean Hull

Migrating to the Cloud – Why and why not?

A lot of technical forums and discussions have highlighted the limitations of EC2 and how it loses on performance when compared to physical servers of equal cost.  They argue that you can get much more hardware and bigger iron for the same money.  So it then seems foolhardy to turn to the cloud.  Why this mad rush to the cloud then?  Of course if all you’re looking at is performance, it might seem odd indeed.  But another way of looking at it is, if performance is not as good, it’s clearly not the driving factor to cloud adoption.

CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”

Another question, “Is it a good idea to deploy your database in the cloud?”  It depends!  Let’s take a look at some of the strengths and weaknesses, then you decide.

8 big strengths of the cloud

  1. Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
  2. Easier roll out of patches and upgrades
  3. Reduced operational headache – scripting and automation become central
  4. Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
  5. Low initial investment
  6. Auto-scaling – set thresholds and deploy new capacity automatically
  7. Easy compromise response – take the server offline and spin up a new one
  8. Easy setup of dev, qa & test environments
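
To make the auto-scaling item concrete, here is a minimal sketch of a threshold rule in Python.  The CPU thresholds, the server limits and the one-server-at-a-time step are assumptions picked for illustration; a real deployment would use whatever metrics and policies its cloud provider exposes.

```python
def desired_servers(current, cpu_percent,
                    scale_up_at=75.0, scale_down_at=25.0,
                    minimum=2, maximum=20):
    """Return how many servers to run, given average CPU load.

    Adds one server above the upper threshold, removes one below the
    lower threshold, and always stays within [minimum, maximum].
    """
    if cpu_percent > scale_up_at:
        target = current + 1
    elif cpu_percent < scale_down_at:
        target = current - 1
    else:
        target = current
    return max(minimum, min(maximum, target))
```

Keeping a gap between the two thresholds avoids flapping, where capacity is added and removed on every small swing in load.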

Some challenges with deploying in the cloud

  1. Big cultural shift in how operations is done
  2. Lower SLAs and less reliable virtual servers – mitigate with automation
  3. No perimeter security – new model for managing & locking down servers
  4. Where is my data? – concerns over compliance and privacy
  5. Variable disk performance – can be problematic for MySQL databases
  6. New procurement process can be a hurdle

Many of these challenges can be mitigated.  The promise of infrastructure deployed in the cloud is huge, so careful, gradual adoption is perhaps the best option for many firms.  Mitigate the weaknesses of the cloud as follows:

  • Use encrypted filesystems and backups where necessary
  • Also keep offsite backups in-house or at an alternate cloud provider
  • Mitigate variable EBS performance – cache at every layer of your application stack
  • Employ configuration management & automation tools such as Puppet & Chef

Quora discussion – Why or why not to migrate to the cloud?

Migration to MySQL – What is it and why is it important?

MySQL is a relational database that backs many internet websites and enterprise applications.  Like all enterprise software, it has a whole complement of features which are well documented, such as data types, storage engines, transactional behaviors and so forth.  It also has a set of processes, many of which involve how software operates on Linux servers, such as how it gets installed, where binaries and libraries will get placed, where to find logfiles, and how to move directories and set permissions.  Finally, it is important to understand the culture, which in this case is Unix-based and built on forum discussions and community contributions as an open-source project.

MySQL can do much of the workhorse kind of stuff you see in databases like Oracle or SQL Server, but sometimes it achieves those goals in very different ways.  For instance there are many open-source projects that support and surround the database, such as mysqltuner (an analysis script), innotop (a Unix top-like utility for monitoring ongoing activity in the database), and Maatkit (a whole suite of tools that build on and expand the features already present in the MySQL database).

Some Limitations in MySQL

  1. Complex queries, and subqueries specifically, can be problematic in MySQL.  If you’re used to writing huge queries in Oracle, and having the CBO figure everything out for you, you’ll be in for a surprise with MySQL.  Keep your queries simple, keep the proper columns indexed and avoid complex joins where possible.  The EXPLAIN facility is at your disposal.  Use it!
  2. Vertical scalability problems – primarily addressed in 5.5, the latest version of MySQL.  Previously the database did not scale well on boxes with more than four processors.  SMP or Symmetric Multiprocessing servers were less common 10-15 years ago when MySQL was in its infancy, and development is slowly catching up with the big iron of today.
  3. There is no flashback table, tablespace or database that you might find in other databases such as Oracle.  You can achieve the same thing with point-in-time recovery, so keep regular backups of your database, and also back up the transaction logs.
  4. MySQL can do JOINs, but only with the nested loops algorithm.  It can’t do sort merge join or hash join.
  5. MyISAM is the default table type and storage engine in versions before 5.5.  It is not crash safe and not transactional.  On new installations it’s recommended that you change this to InnoDB and use InnoDB for most if not all of your tables.  It’s very reliable and very fast!
  6. There is a query cache, but it caches result sets not query plans!  It also has some performance issues and shows some erratic behavior on larger SMP boxes.  Query plans are cached on a session basis, but when a session is closed and reopened, MySQL must reparse and reexecute that query.
  7. MySQL does not have a facility like Oracle’s Real Application Clusters.  It does have NDB Cluster which is an all-in-memory clustering solution.  Despite its promise, it tends to have very serious performance problems with any type of join, and is mainly good for single table index-based lookups.  If managed well it can increase availability but will probably reduce performance.
  8. MySQL’s default replication solution is statement based.  Although it is easy to set up, it breaks almost as easily, sometimes with resolvable errors, and sometimes silently.  Consider row-based replication, and definitely make use of Maatkit’s mk-table-checksum and mk-table-sync tools.  Also be sure to do thorough and regular monitoring of your replication setup.
  9. There are no in-built materialized views or snapshots in MySQL.  There is an open-source project called Flexviews by Justin Swanhart that provides this facility to the MySQL community.
  10. MySQL provides stored procedures, triggers and functions as a regular feature of the database.  However I would use them with caution.  They are very difficult to edit, troubleshoot and diagnose when they are causing trouble.  Also, as with the query plan caching, stored procedures are cached at the session level, so they can be expensive to execute over and over again in different areas of your application.  They can cause real performance problems.
  11. There is no in-built mechanism for auditing of the kind you find in relational databases such as Oracle 11g.
  12. Only b-tree indexes are supported, no bitmap indexes, index-organized tables, clustered indexes or other more exotic index types.
  13. ALTER TABLE is generally a locking and blocking operation.  For example if you add a new column or change a column’s data type, the entire table will be locked for the duration of the operation.  This will be a surprise coming from the Oracle world where these types of operations can routinely be done online.
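
To illustrate the join limitation in item 4, the sketch below contrasts a nested loops join with a hash join over plain Python lists of dictionaries.  It is a teaching illustration of the two algorithms, not MySQL’s implementation, and the table and column names are invented.

```python
def nested_loop_join(outer, inner, key):
    """For every outer row, scan every inner row: O(len(outer) * len(inner))."""
    result = []
    for o in outer:
        for i in inner:
            if o[key] == i[key]:
                result.append({**o, **i})
    return result


def hash_join(outer, inner, key):
    """Build a hash table on the inner rows once, then probe it per outer row."""
    buckets = {}
    for i in inner:
        buckets.setdefault(i[key], []).append(i)
    result = []
    for o in outer:
        for i in buckets.get(o[key], []):
            result.append({**o, **i})
    return result


if __name__ == "__main__":
    users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
    orders = [{"user_id": 1, "total": 10}, {"user_id": 1, "total": 5},
              {"user_id": 3, "total": 7}]
    print(nested_loop_join(users, orders, "user_id"))
    print(hash_join(users, orders, "user_id"))
```

Both functions return the same rows; the difference is the amount of work.  With nested loops, an index on the join column plays the role of the hash table by turning each inner scan into a lookup, which is one reason the indexing advice in item 1 matters so much.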

MySQL’s Strengths Are Numerous

  1. Install from a package using yum or apt-get.  Fast & simple!
  2. Works great in the cloud, using MySQL Community distro, Percona distro, or Amazon’s own RDS solution.
  3. Comes out-of-the-box with an excellent command line shell providing all sorts of features and power that are constant frustrations on the Oracle side.  Command history, standard input/output redirection support, a full complement of features and options, and easy autologin with a user level my.cnf file which fits in nicely with the global settings as well.
  4. A simpler mechanism to serve unique id columns with the auto-increment data type.  Although Oracle’s sequence method is extremely scalable, for many developers it is troublesome and confusing.
  5. Good support of the LIMIT clause allowing an easier method for developers to fetch a subset of data.
  6. A huge community of users, forums, and support in third party applications such as monitoring (Nagios etc…) as well as metrics collection (Munin, Cacti, OpenNMS, Ganglia etc.)
  7. Great visibility of system variables with SHOW VARIABLES.  Many can be changed dynamically as well, just like Oracle.
  8. Great visibility of internal system state with SHOW PROCESSLIST.
  9. System counters for all sorts of internal instrumentation data using SHOW STATUS and SHOW ENGINE INNODB STATUS.  Ultimately it is not as comprehensive as Oracle’s own data dictionary and millions of instrumentation counters.  However Oracle could take a huge page out of the MySQL book in terms of usability.  The obfuscation of Oracle’s internal kernel state makes it all but unusable by most.
  10. innotop, a utility much like the Unix top facility that all Unix & Linux folks love, provides instant visibility into what queries are running, what work is being done, and what is blocking.  Oracle could really take a page from this playbook, as this tool is invaluable.
  11. The incredible Maatkit, a veritable goldmine of great community contributed powertools.  Query analyzers, profilers, log tools, replication tools, data archiver, a find facility, and a whole lot more!

Sean Hull discusses further on Quora – What considerations are important when migrating to MySQL?

Scalability – What is it and why is it important?

Scaling comes in a few different flavors.  Vertical scaling involves growing the computing power of a single server, adding memory, faster or more CPUs and/or faster disk I/O.

Horizontal scaling involves adding computing resources or servers in parallel and then load balancing across them.

Scalability refers to how well an application facilitates scaling.  With web applications, the middle tier aka the webservers are fairly easy to scale horizontally and most enterprise class applications already do this with commercial load balancers, either hardware or software.

Doing the same with the database tier, however, can be trickier.  Enter MySQL replication to facilitate fairly painless horizontal scalability.  Build your application architecture with read-only transactions and write/update transactions segmented apart, and you can send the latter to one master database, and the former to a handful of replicated slaves.  With a typical web application that is less than 10% writes and more than 90% reads, there is the potential to add as many as 5-10 servers horizontally to increase application throughput by as much as 500-1000%.
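
The read/write segmentation described above can be sketched as a tiny router.  This is an illustration only: the class name and server labels are invented, and classifying a statement by its first word is a deliberate simplification that a real application would replace with explicit read and write connection handles.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the replicas."""

    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.master = master
        # Round-robin over the replicas, forever.
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        """Return the server that should run this statement."""
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() in self.READ_VERBS:
            return next(self._replicas)
        return self.master


if __name__ == "__main__":
    router = ReadWriteRouter("master", ["replica1", "replica2"])
    for stmt in ("SELECT * FROM users", "UPDATE users SET name = 'x'",
                 "SELECT 1"):
        print(stmt, "->", router.route(stmt))
```

Because replication is asynchronous, a read that must see a write made a moment ago may still need to go to the master; the sketch does not attempt to handle that case.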

Sean Hull asks on Quora: What is scalability and why is it important?

Amazon EC2 Outage – Failures, Lessons and Cloud Deployments

Now that we’ve had a chance to take a deep breath after last week’s AWS outage, I’ll offer some comments of my own.  Hopefully just enough time has passed to begin to have a broader view, and put events in perspective.

Despite what some reports may have announced, Amazon wasn’t down, but rather a small part of Amazon Web Services went down.  A failure?  Yes.  Beyond their service level agreement of 99.95%?  Yes also.  Survivable?  Yes to this last question too.
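
To put the 99.95% figure in perspective, the short calculation below converts an availability percentage into a downtime budget.  It assumes the percentage is measured over a 365-day year; the function name is made up for illustration.

```python
def allowed_downtime_hours(sla_percent, period_hours=365 * 24):
    """Hours of downtime permitted by an availability percentage."""
    return period_hours * (1 - sla_percent / 100.0)


if __name__ == "__main__":
    # 99.95% of a 365-day year leaves about 4.38 hours of downtime.
    print(round(allowed_downtime_hours(99.95), 2))
```

Under that assumption the budget is a little under four and a half hours a year, so an outage lasting most of a day or longer is well beyond it.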

Learning From Failure

The business management conversation du jour is all about learning from failure, rather than trying to avoid it.  Harvard Business Review’s April issue headlined with “The Failure Issue – How to Understand It, Learn From It, and Recover From It”.  The Economist’s April 16th issue had some similarly interesting pieces, one by Schumpeter, “Fail often, fail well”, and another in the April 23rd issue, “Lessons from Deepwater Horizon and Fukushima”.

With all this talk of failure there is surely one takeaway.  Complex systems will fail and it is in the anticipation of that failure that we gain the most.  Let’s stop howling and look at how to handle these situations intelligently.

How Do You Rebuild A Website?

In the cloud you will likely need two things: (a) scripts to rebuild all the components in your architecture, which spin up servers, fetch source code, fetch software and configuration files, configure load balancers and mount your database, and more importantly (b) a database backup from which you can rebuild your current dataset.
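
One way to picture such rebuild scripts is as an ordered runbook that stops at the first failure.  The Python below is only a skeleton under that assumption: the step names are placeholders, and each action would in practice call your own provisioning and restore tooling.

```python
def run_rebuild(steps):
    """Run (name, action) steps in order and stop at the first failure.

    Returns the names of the completed steps and an error message,
    or None for the message if every step succeeded.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Placeholder actions; real ones would provision and configure servers.
    steps = [
        ("spin up servers", lambda: None),
        ("fetch source code and configuration", lambda: None),
        ("restore database from backup", lambda: None),
        ("configure load balancer", lambda: None),
    ]
    print(run_rebuild(steps))
```

Knowing exactly which steps completed makes it easier to resume a rebuild partway through instead of starting over during an outage.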

Want to stick with EC2?  Build out your infrastructure in an alternate availability zone or region and you’re back up and running in hours.  Or better yet, have an alternate cloud provider on hand to handle these rare outages.  The choice is yours.

Mitigate risk?  Yes indeed, failure is more common in the cloud, but recovery is also easier.  Failure should pressure the adoption of best practices and force discipline in deployments, not make you more of a gunslinger!

Want to see an extreme example of how this can play in your favor?  Read Jeff Atwood’s discussion of the so-called Chaos Monkey, a component whose sole job is to kill off servers in the Netflix environment at random.  Now that type of gunslinging will surely keep everyone on their toes!  Here’s a Wired article that discusses Chaos Monkey.

George Reese of enStratus discusses the recent failure at length.  Though I would argue with calling Amazon’s outage the Cloud’s Shining Moment, all of his points are wise and this is the direction we should all be moving.

Going The Way of Commodity Hardware

Though it is still not obvious to everyone, I’ll spell it out loud and clear.  Like it or not, the cloud is coming.  Look at these numbers.

Furthermore the recent outage also highlights how many internet sites rely on cloud computing, and on Amazon EC2.

Way back in 2001 I authored a book for O’Reilly called “Oracle and Open Source”.  In it I discussed the technologies I was seeing in the real world.  Oracle on the backend and Linux, Apache, and PHP, Perl or some other language on the frontend.  These were the technologies that startups were using.  They were fast, cheap and, with the right smarts, reliable too.

Around that time Oracle started smelling the coffee and ported its enterprise database to Linux.  The equation for them was simple.  Customers that were previously paying tons of money to their good friend and confidant Sun for hardware could now spend 1/10th as much on hardware and shift a lot of that leftover cash to, you guessed it, Oracle!  The hardware wasn’t as good, but who cares because you can get a lot more of it.

Despite a long entrenched and trusted brand like Sun being better and more reliable, guess what?  Folks still switched to commodity hardware.  Now this is so obvious, no one questions it.  But the same trend is happening with cloud computing.

Performance is variable, disk I/O can be iffy, and what’s more the recent outage illustrates front and center, the servers and network can crash at any moment.  Who in their right mind would want to move to this platform?

If that’s the question you’re stuck on, you’re still stuck on the old model.  You have not truly comprehended the power to build infrastructure with code, to provision through automation, and really embrace managing those components as software.  As the internet itself has the ability to route around political strife and network outages, so too does cloud computing bring that power to mom & pop web shops.

Conclusions

  • Have existing investments in hardware?  Slow and cautious adoption makes most sense for you.
  • Have seasonal traffic variations?  An application like this is uniquely suited to the cloud.  In fact some of the gaming applications which can autoscale to 10x or 100x servers under load are newly solvable with the advent of cloud computing.
  • Are you currently paying a lot for disaster recovery systems that primarily lie idle?  Script your infrastructure for rebuilding from bare metal, and save that part of the budget for more useful projects.