Category Archives: All

Book Review – Rework

Rework is chock full of ideas

Jason Fried and David Heinemeier Hansson’s new book REWORK is one of the best startup business books I’ve read since Alan Weiss’ Million Dollar Consulting.  If you’re already a fan of their Signal vs. Noise blog, you’ll be familiar with their terse style: sharp and to the point.

Which is why you can pick it up and read it in a few hours.  You’ll want to, because it’s well written and pared down to essentials.  In fact the book reads like their workflow advice: less mass, do it yourself, cut out the fat, concentrate on essentials.  They are clearly practicing what they preach, which I like.

8 Questions to ask an AWS Expert

If you’re headhunting a cloud computing expert, specifically someone who knows Amazon Web Services (AWS) and EC2, you’ll want a battery of questions to assess their knowledge.  As with any technical interview, focus on concepts and the big picture.  As the 37signals folks like to say, “hire for attitude, train for skill”.  Absolutely!

If you want more general info about Amazon Web Services, read our Intro to EC2 Deployments.

  1. Explain Elastic Block Store (EBS).  What type of performance can you expect?  How do you back it up?  How do you improve performance?

    EBS is a virtualized SAN, or storage area network.  That means it is RAID storage to start with, so it’s redundant and fault tolerant.  If disks die in that RAID you don’t lose data.  Great!  It is also virtualized, so you can provision and allocate storage, and attach it to your server, with various API calls.  No calling the storage expert and asking him or her to run specialized commands from the hardware vendor.

    Performance on EBS can be variable.  That is, it can go above the SLA performance level, then drop below it.  The SLA provides you with an average disk I/O rate you can expect.  This can frustrate some folks, especially performance experts who expect reliable and consistent disk throughput on a server.  Traditional physically hosted servers behave that way.  Virtual AWS instances do not.

    Back up EBS volumes by using the snapshot facility, via an API call or via a GUI like ElasticFox.

    Improve performance by using Linux software RAID and striping across four volumes.

  2. What is S3?  What is it used for?  Should encryption be used?

    S3 stands for Simple Storage Service.  You can think of it like FTP storage: you can move files to and from it, but not mount it like a filesystem.  AWS automatically stores your snapshots there, as well as your AMIs.  Encryption should be considered for sensitive data, as S3 is a proprietary technology developed by Amazon itself and as yet unproven from a security standpoint.

  3. What is an AMI?  How do I build one?

    AMI stands for Amazon Machine Image.  It is effectively a snapshot of the root filesystem.  Commodity hardware servers have a BIOS that points to the master boot record in the first block of a disk.  A disk image, though, can sit anywhere physically on a disk, so Linux can boot from an arbitrary location on the EBS storage network.

    Build a new AMI by first spinning up an instance from a trusted AMI, then adding packages and components as required.  Be wary of putting sensitive data onto an AMI.  For instance, your access credentials should be added to an instance after spinup.  With a database, mount an outside volume that holds your MySQL data after spinup as well.

  4. Can I vertically scale an Amazon instance?  How?

    Yes.  This is an incredible feature of AWS and cloud virtualization.  Spin up a new instance larger than the one you are currently running.  Stop that new instance, then detach its root EBS volume and discard it.  Next stop your live instance and detach its root volume.  Note the unique device ID and attach that root volume to your new server.  Then start the new server again.  Voila, you have scaled vertically in place!

  5. What is auto-scaling?  How does it work?

    Autoscaling is a feature of AWS which allows you to configure and automatically provision and spin up new instances without the need for your intervention.  You do this by setting thresholds and metrics to monitor.  When those thresholds are crossed, a new instance of your choosing will be spun up, configured, and rolled into the load balancer pool.  Voila, you’ve scaled horizontally without any operator intervention!

    With MySQL databases autoscaling can get a little dicey, so we wrote a guide to autoscaling MySQL on Amazon EC2.

  6. What automation tools can I use to spin up servers?

    The most obvious way is to roll your own scripts and use the AWS API tools.  Such scripts could be written in bash, Perl or another language of your choice.  The next option is to use a configuration management and provisioning tool like Puppet or the newer Opscode Chef.  You might also look towards a tool like Scalr.  Lastly, you can go with a managed solution such as RightScale.

  7. What is configuration management?  Why would I want to use it with cloud provisioning of resources?

    Configuration management has been around for a long time in web operations and systems administration, yet its cultural popularity has been limited.  Most systems administrators configure machines the way software was developed before version control – that is, by manually making changes on servers.  Each server can then be, and usually is, slightly different.  Troubleshooting, though, is straightforward, as you log in to the box and operate on it directly.  Configuration management brings a large automation tool into the picture, managing servers like the strings of a puppet.  This forces standardization, best practices, and reproducibility, as all configs are versioned and managed.  It also introduces a new way of working, which is the biggest hurdle to its adoption.

    Enter the cloud, and configuration management becomes even more critical.  That’s because virtual servers such as Amazon’s EC2 instances are much less reliable than physical ones.  You absolutely need a mechanism to rebuild them as-is at any moment.  This pushes best practices like automation, reproducibility and disaster recovery to center stage.

    While on the subject of configuration management, take a quick peek at our hiring a devops guide.

  8. Explain how you would simulate perimeter security using the Amazon Web Services model.

    The traditional perimeter security we’re already familiar with, using firewalls and so forth, is not supported in the Amazon EC2 world.  Instead, AWS supports security groups.  One can create a security group for a jump box with ssh access – only port 22 open.  From there a webserver group and a database group are created.  The webserver group allows 80 and 443 from the world, but port 22 *only* from the jump box group.  Further, the database group allows port 3306 from the webserver group and port 22 from the jump box group.  Add any machines to the webserver group and they can all hit the database.  No one from the world can, and no one can directly ssh to any of your boxes except the jump box.

    Want to further lock this configuration down?  Only allow ssh access from specific IP addresses on your network, or allow just your subnet.
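
To make the jump box / webserver / database layout concrete, here is a minimal sketch that models those three groups as plain data and checks whether a connection would be let through.  The group names, the WORLD constant and the is_allowed helper are our own assumptions for illustration; this is not the AWS API, just the rule logic a candidate should be able to explain.

```python
# Hypothetical model of the three security groups described above.
# Names and rule layout are assumptions for illustration, not the AWS API.
WORLD = "0.0.0.0/0"

# Each group lists (port, allowed_source) pairs; a source is WORLD or a group name.
RULES = {
    "jumpbox": [(22, WORLD)],
    "webserver": [(80, WORLD), (443, WORLD), (22, "jumpbox")],
    "database": [(3306, "webserver"), (22, "jumpbox")],
}


def is_allowed(source, dest_group, port):
    """Return True if traffic from `source` may reach `dest_group` on `port`.

    `source` is either a group name or WORLD for an arbitrary internet host.
    """
    for rule_port, rule_source in RULES.get(dest_group, []):
        if rule_port != port:
            continue
        if rule_source == WORLD or rule_source == source:
            return True
    return False
```

With this model, a call such as is_allowed(WORLD, "database", 3306) is refused while is_allowed("webserver", "database", 3306) is accepted, which is the whole point of the layout.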

The New Commodity Hardware Craze aka Cloud Computing

Does anyone remember 15 years ago when the dot-com boom was just starting?  A lot of companies were running on Sun.  Sun was the best hardware you could buy for the price.  It was reliable, and a lot of engineers had experience with the operating system, SunOS, a flavor of Unix.

Yet suddenly companies were switching to cheap crappy hardware.  The stuff failed more often, had lower quality control, and cheaper and slower buses.  Despite all of that, cutting edge firms and startups were moving to commodity hardware in droves.  Why was it so?

7 Ways to Troubleshoot MySQL

MySQL databases are the great workhorses of the internet.  They back tons of modern websites, from blogs and checkout carts to huge sites like Facebook.  But these technologies don’t run themselves.  When you’re faced with a system that is slowing down, you’ll need the right tools to diagnose and troubleshoot the problem.  MySQL has a huge community following, and that means scores of great tools for your toolbox.  Here are 7 ways to troubleshoot MySQL.

5 Ways to Avoid EC2 Outages

1. Backup outside of the Cloud

Some of the high profile companies affected by Amazon’s April 2011 outage could have recovered had they kept a backup of their entire site outside of the cloud.  With any hosting provider, managed traditional data center or cloud provider, alternate backups are always a good idea.  A MySQL logical backup and/or incremental backup can be copied regularly offsite or to an alternate cloud provider.  That’s real insurance!

3 Ways to Boost Cloud Scalability

Deploying in the Amazon cloud is touted as a great way to achieve high scalability while paying only for the computing power you use. How do you get the best scalability from the technology?

5 Ways to Boost MySQL Scalability

There are a lot of scalability challenges we see with clients over and over. The list could easily include 20, 50 or even 100 items, but we narrowed it down to the five biggest issues we see.

1. Tune those queries

By far the biggest bang for your buck is query optimization. Queries can be functionally correct and meet business requirements without ever being stress tested for high traffic and high load. This is why we often see clients with growing pains and scalability challenges as their site becomes more popular. It also makes sense: it wouldn’t necessarily be a good use of time to tune a query for some page off in a remote corner of your site that doesn’t receive real-world traffic. So some amount of reactive tuning is common and appropriate.

Enable the slow query log and watch it. Use mk-query-digest, the great tool from Maatkit, to analyze the log. Also make sure the log_queries_not_using_indexes flag is set.  Once you’ve found a heavy, resource-intensive query, optimize it!  Use the EXPLAIN facility, use a profiler, look at index usage and create missing indexes, and understand how the query is joining and/or sorting.
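
To give a feel for what a log digest does, here is a rough sketch of the core idea: strip the literal values out of each query so that similar queries group together, then rank the groups by total time.  This is not how mk-query-digest is implemented; the fingerprint and digest helpers and the sample data are our own, and we assume the (query, seconds) pairs have already been pulled out of the slow query log.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Collapse literals so queries that differ only in values group together."""
    sql = sql.strip().lower()
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals (no escape handling)
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numeric literals
    return re.sub(r"\s+", " ", sql)      # normalize whitespace


def digest(entries):
    """Sum time per fingerprint; `entries` is an iterable of (sql, seconds).

    Returns (fingerprint, total_seconds, count) tuples, slowest first.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for sql, seconds in entries:
        bucket = totals[fingerprint(sql)]
        bucket[0] += seconds
        bucket[1] += 1
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    return [(fp, total, count) for fp, (total, count) in ranked]


# Made-up sample entries, standing in for parsed slow log records.
sample = [
    ("SELECT * FROM orders WHERE id = 17", 2.5),
    ("SELECT * FROM orders WHERE id = 42", 3.0),
    ("SELECT name FROM users WHERE email = 'a@example.com'", 0.5),
]
```

Calling digest(sample) puts the orders lookup at the top of the list, since the two variants are counted as one query.  The group at the top is the one to run through EXPLAIN first.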

2. Employ Master-Master Replication

Master-master active-passive replication, otherwise known as circular replication, can be a boon for high availability, but also for scalability.  That’s because you immediately have a read-only slave for your application to hit as well.  Many web applications exhibit an 80/20 split, where 80% of activity is read or SELECT and the remainder is INSERT and UPDATE.  Configure your application to send read traffic to the slave or rearchitect so this is possible.  This type of horizontal scalability can then be extended further, adding additional read-only slaves to the infrastructure as necessary.
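
Here is a minimal sketch of what sending read traffic to the slave can look like at the application layer.  The QueryRouter class and the hostnames are hypothetical, and the prefix check is deliberately naive: a real application would also have to keep reads inside a transaction, and anything like SELECT ... FOR UPDATE, on the master.

```python
import itertools


class QueryRouter:
    """Send reads to replicas round-robin and everything else to the master."""

    READ_PREFIXES = ("select", "show")

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master when no replica is configured.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql):
        """Return the host that should run `sql`."""
        statement = sql.lstrip().lower()
        if statement.startswith(self.READ_PREFIXES):
            return next(self._replicas)
        return self.master
```

A router built as QueryRouter("db-master", ["db-replica-1", "db-replica-2"]) would alternate SELECTs between the two replicas and send INSERT and UPDATE statements to db-master.  Adding another read-only slave is then just one more entry in the list.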

If you’re setting up replication for the first time, we recommend you do it using hot backups. Here’s how.

Keep in mind that a MySQL replica has a tendency to drift, often silently, from the master. Data can really get out of sync without throwing errors! Be sure to bulletproof your setup with checksums.
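
The idea behind checksumming is simple enough to sketch: compute a digest of each table’s rows on both servers and flag any table where the digests differ.  The helpers below are our own illustration and assume the rows have already been fetched into Python tuples; in practice you would reach for a purpose-built tool, which checksums in chunks on the servers themselves rather than pulling every row across the network.

```python
import hashlib


def table_checksum(rows):
    """Hash rows in sorted order so the order they were fetched in does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def drifted_tables(master_tables, replica_tables):
    """Return the sorted names of tables whose checksums differ between servers.

    Both arguments map a table name to a list of row tuples.
    """
    return sorted(
        name
        for name in master_tables
        if table_checksum(master_tables[name])
        != table_checksum(replica_tables.get(name, []))
    )
```

Any table name this returns has data on the replica that no longer matches the master, even though replication never reported an error.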

3. Use Your Memory

It sounds very basic and straightforward, yet there are often details overlooked.  At minimum be sure to set these:

  • innodb_buffer_pool_size
  • key_buffer_size (MyISAM index caching)
  • query_cache_size – though beware of issues on large SMP boxes
  • thread_cache_size & table_cache (table_open_cache in newer versions)
  • innodb_log_file_size & innodb_log_buffer_size
  • sort_buffer_size, join_buffer_size, read_buffer_size, read_rnd_buffer_size
  • tmp_table_size & max_heap_table_size

4. RAID Your Disk I/O

What is underneath your database?  You don’t know?  Well, please find out!  Are you using RAID 5?  This is a big performance hit.  RAID 5 is slow for inserts and updates.  It is also almost non-functional during a rebuild if you lose a disk.  Very, very slow performance.  What should you use instead?  RAID 10, mirroring and striping, with as many disks as you can fit in your server or RAID cabinet.  A database does a lot of disk I/O even if you have enough memory to hold the entire database.  Why?  Sorting requires rearranging rows, as do group by, joins, and so forth.  Plus the transaction log is disk I/O as well!

Are you running on EC2?  In that case EBS is already fault tolerant and redundant.  So give your performance a boost by striping-only across a number of EBS volumes using the Linux md software raid.
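
Why does striping help?  Consecutive chunks of data land on different volumes, so a large read or write is spread across all of them at once.  The sketch below shows the standard RAID 0 address mapping; the locate_block helper and its parameter names are ours, and Linux md takes care of all this for you.

```python
def locate_block(logical_block, num_volumes, chunk_blocks=1):
    """Map a logical block to (volume_index, block_on_volume) under RAID 0."""
    chunk = logical_block // chunk_blocks            # which chunk overall
    offset_in_chunk = logical_block % chunk_blocks   # position inside that chunk
    volume = chunk % num_volumes                     # chunks rotate across volumes
    chunk_on_volume = chunk // num_volumes           # how far down that volume
    return volume, chunk_on_volume * chunk_blocks + offset_in_chunk
```

With four volumes, logical blocks 0 through 7 map to volumes 0, 1, 2, 3, 0, 1, 2, 3, so a sequential scan keeps all four EBS volumes busy.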

Also check out our Intro to EC2 Cloud Deployments.

Also of interest: autoscaling MySQL on EC2.

5. Tune Key Parameters

These additional parameters can also help a lot with performance.

innodb_flush_log_at_trx_commit=2

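This speeds up inserts & updates dramatically by being a little bit lazy about flushing the InnoDB log buffer.  With a value of 2 the log is still written at each commit but only flushed to disk about once per second, so an operating system crash or power loss can cost you up to roughly the last second of transactions.  Do your own research, but for most environments this setting is recommended.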

innodb_file_per_table

InnoDB was developed like Oracle, with the tablespace model for storage.  Apparently the kernel developers didn’t do a very good job, because the default setting of a single shared tablespace turns out to be a performance bottleneck: contention for file descriptors and so forth.  This setting makes InnoDB create a tablespace and underlying datafile for each table, just like MyISAM does.
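
If you generate your MySQL configuration from scripts, as you should once you are doing configuration management, the two settings above might be rendered something like this.  The render_mysqld_section helper is hypothetical; only the two parameter names and the value 2 come from the discussion above.

```python
def render_mysqld_section(settings):
    """Render a dict of settings as a my.cnf [mysqld] section.

    A value of None produces a bare flag with no "=value" part.
    """
    lines = ["[mysqld]"]
    for name, value in settings.items():
        lines.append(name if value is None else f"{name}={value}")
    return "\n".join(lines) + "\n"


print(render_mysqld_section({
    "innodb_flush_log_at_trx_commit": 2,
    "innodb_file_per_table": None,
}))
```

Keep in mind that innodb_file_per_table only applies to tables created after it is enabled; existing tables stay in the shared tablespace until they are rebuilt.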

iHeavy Insights 82 – Better Practices

Best practices: it’s a term we hear thrown around a lot.  But like that New Year’s diet, it too often ends up more talk than action.

Manage Processes

Operator error, i.e. typing the wrong command, is always a risk.  Logging into the wrong server to drop a database, or typing the dump command such that you load data into the database instead of dumping it out: these are risks that operations folks face every day.

Accountability is important, so be sure all of your systems folks log in to their own accounts.  Apply the least privileges model, and give permissions on an as-needed basis.

Set prompts with big bold names that indicate production servers and their purpose.  Automate repetitive commands that are prone to typos.

Don’t be afraid to give developers read-only accounts on production servers.

Communicate Clearly

Regular team meetings, à la Agile stand-ups, are a great way to encourage folks to communicate.  Bring the developers and operations folks together.  Ask everyone in turn to voice their current todos, their concerns and the risks they see.  Encourage everyone to listen with an open mind.  Consider different perspectives.

Communication is a cultural attribute, so it comes from the top.  Encourage it as a CTO or CIO by asking questions, communicating your concerns, and repeating your own requests in different words.  Listen to what your team is saying, then repeat and rephrase their concerns, along with how and when they will be addressed.

Document Processes

A culture of documenting services and processes is healthy.  It provides a central location and knowledge base for the team.  It also keeps you from sliding into the situation where only one team member understands how to administer critical business components.  Were that person to be unavailable or to leave the company, you’d be stuck reverse engineering your infrastructure and guessing at architectural decisions.

Better Practices

Rather than think of best practices as something you need to achieve today, think of them as an ongoing day-to-day quest for improvement.

  • repetitive manual processes – employ automation & script those processes where possible.
  • where steps require investigation and research – document it
  • where production changes are involved – communicate with business units, qa & operations
  • always be improving – striving for better practices

Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith

In search of a good book on Chef itself, I picked up this new title from O’Reilly.  It’s one of their new format books, small in size at only 75 pages.

There was some very good material in this book.  Mr. Nelson-Smith’s writing style is good, readable, and informative.  The discussion of risks of infrastructure as code was instructive.  With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one.  So an honest discussion of the risks of such an approach is bold and much needed.  I also liked the introduction to Chef itself, and the discussion of installation.

Chef isn’t really the main focus of this book, unfortunately.  The book spends a lot of time introducing us to Agile development, and specifically test driven development.  While these are lofty goals, and this is the first time I’ve seen the topic treated in relation to provisioning cloud infrastructure, I did feel too much time was spent on it.

Software Unit Testing – What is it and why is it important?

Software is composed of individual components.  As developers build these units, they also build tests to verify them for correctness.  These tests can verify the environment, they can verify data, they can verify edge cases, and they can include test harnesses.  In essence they verify that the code meets the design specification.

There are a few key advantages to the unit testing approach:

  1. Self-Documenting – The tests themselves provide a type of documentation for the system as a whole.
  2. Advances Refactoring – At a later date you may need to repair, rewrite or refactor portions of code.  Previously built unit tests are a tremendous help in making sure your changes still meet the previous design specification.
  3. Simplifies Functional Testing – With unit testing as an ongoing concern, the final components will likely perform more reliably, and if not, the tests & self-documentation may point to how or why they fail to meet some specification.
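
As a small illustration, here is what a unit and its tests might look like using Python’s built-in unittest module.  The apply_discount function is a made-up example; note how the tests cover a typical case, the edge cases, and invalid input, and double as documentation of the intended behavior.

```python
import unittest


def apply_discount(price, percent):
    """Return `price` reduced by `percent`, rejecting out-of-range input."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_edge_cases(self):
        # No discount and full discount are the boundaries of the valid range.
        self.assertEqual(apply_discount(80.0, 0), 80.0)
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 101)
```

Saved to a file, these tests run with python -m unittest.  If someone later refactors apply_discount, the same tests tell them right away whether it still meets the original specification.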

Sean Hull Quora Discussion – What is software unit testing?