Tag Archives: capacity planning

Why Healthcare.gov desperately needs techops


Join 15,000 others and follow Sean Hull on twitter @hullsean.

1. Tech-what? A quick education

Techops is operational excellence. It’s the handoff when code is complete. These are the folks who are up at 2am when a website is down. They manage servers, keep the pipes clean, and the hackers out. They also help plan for capacity needs, and may help with load testing too.

If you’re new to technology, imagine a movie set. The story writers (programmers) have already done their part. The producers (venture capital folks) have financed the project. The director (architect) is there putting the vision together. But the people who manage everything on set, from sound and camera to lighting and all the coordination, are operations. In web application deployments that's devops, sysops or techops.

Also: Why the Twitter IPO is afraid of scalability

2. In contrast with Obama election campaign

Notice how phenomenally well the Obama for America project was run. Like a finely tuned machine. Harper Reed and team pulled off one of the most data-backed election campaigns in history.

That project used AWS cloud technologies to the fullest: devops tools like Puppet and Asgard, collaboration tools like Campfire & Github, and superb monitoring & instrumentation tools like New Relic and Chartbeat.

Clearly Obama knows how to run an election. Something is drastically different with the healthcare.gov project. Too many cooks in the kitchen, perhaps?

Read: Why your startup needs professional techops

3. A failure in capacity planning

Many popular news outlets covered the outage, but most pointed to “bugs” as the cause. When a site dies under load while it works fine in test & Q/A, that’s a failure of load testing and capacity planning.

I would wager a good bet that database tuning would help here, as the database layer is the most common and prevalent source of this sort of failure.

Read this: What four letter word divides dev and ops?

4. More testing & more Agility needed

Modern software projects take advantage of continuous integration & agile methods. That is, they make small, incremental changes. Developers build unit tests, and the code is always in a working state. There is no multi-month dev cycle where the state of your current software is in doubt.

Reports indicate that the healthcare.gov software was designed & developed using the waterfall method, an older approach most agree is inferior. The New Yorker criticizes it in Don’t go chasing waterfalls.

Read: Why devops talent is in short supply

5. Caching is desperately needed

All high performance, high scale websites need to take advantage of various types of caching, as I’ve discussed in detail before: from browser caching to page & object caching on the server side.

Hayden James investigated in depth and found healthcare.gov severely lacking. Again, this is a huge failure in techops, sysops or devops. It’s not a bug, and not something the developers are responsible for delivering.
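
To make the distinction concrete, here's a minimal sketch of server-side object caching in Python. In production you'd reach for memcached, Redis or Varnish; the plan_details function and its values are purely hypothetical stand-ins for an expensive lookup.

```python
import time

_cache = {}  # in-memory object cache: key -> (stored_at, value)

def cached(ttl_seconds=60):
    """Memoize a function's results for ttl_seconds."""
    def decorator(fn):
        def wrapper(*args):
            key = (fn.__name__, args)
            hit = _cache.get(key)
            if hit and time.time() - hit[0] < ttl_seconds:
                return hit[1]                    # cache hit: skip the expensive call
            value = fn(*args)
            _cache[key] = (time.time(), value)   # cache miss: store for next time
            return value
        return wrapper
    return decorator

@cached(ttl_seconds=300)
def plan_details(plan_id):
    # Stand-in for an expensive database query or page-fragment render.
    return {"plan": plan_id, "premium": 321.00}
```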


Scale Quickly Like Birchbox – Startup Scalability 101

One of the great things about the Internet is how it has made it easier to put great ideas into practice. Whether the ideas are about improving people’s lives or a new way to sell an old-fashioned product, there’s nothing like a good little startup tale of creative disruption to deliver us from something old and tired.

We work with a lot of startup firms, and we love being part of the atmosphere of optimism and ingenuity, peppered with a bit of youthful zeal; there's something very indie-rock-and-roll about it. But whether they are just starting out or already picking up pace, every startup faces the same challenges in scaling a business. Recently, we were reminded of this when we watched Inc’s video interview with Birchbox founders Hayley Barna and Katia Beauchamp.

5 things toxic to scalability

The.Rohit – Flickr

Check out our followup post 5 More Things Deadly to Scalability

If you’re using MySQL checkout 5 ways to boost MySQL scalability.

1. Object Relational Mappers

ORMs are popular among developers but not among performance experts.  Why is that?  Primarily because these two groups experience a web application from entirely different perspectives.  One is building functionality and delivering features, with results measured on fitting business requirements.  Performance and scalability are often low priorities at this stage.  ORMs allow developers to be much more productive, abstracting away the SQL difficulties of interacting with the backend datastore and letting them concentrate on building features and functionality.

Scalability is about application, architecture and infrastructure design, and careful management of server components.

On the performance side the picture is a bit different.  By leaving SQL query writing to an ORM, you can end up with complex queries that the database cannot optimize well.  What’s more, ORMs don’t allow easy tweaking of queries, slowing down the tuning process further.
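
Here's a self-contained sketch of the classic N+1 query pattern, using SQLAlchemy against an in-memory SQLite database (the models are hypothetical). With echo=True you can watch the ORM issue one query for the orders, then one more per order for its customer, where a hand-written join would be a single statement the database could optimize.

```python
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship, Session

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    customer = relationship("Customer")          # lazy-loaded by default

engine = create_engine("sqlite://", echo=True)   # echo=True logs every SQL statement
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Order(customer=Customer(name="Alice")),
                     Order(customer=Customer(name="Bob"))])
    session.commit()

    # One SELECT for the orders, then one more SELECT per order for its
    # customer -- the N+1 pattern. A hand-written JOIN would be one query.
    for order in session.query(Order).all():
        print(order.customer.name)
```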

Also: Is the difference between dev & ops a four-letter word?

2. Synchronous, Serial, Coupled or Locking Processes

Locking in a web application operates something like traffic lights in the real world.  Replacing a traffic light with a traffic circle often speeds up traffic dramatically.  Out in the country where there’s very little traffic, no one sits idly at a light for no reason, and even when there’s a lot of traffic, a circle keeps things flowing.  If you need locking, better to use InnoDB tables, which offer granular row-level locking, rather than the table-level locking of MyISAM tables.

Avoid things like semi-synchronous replication that will wait for a message from another node before allowing the code to continue.  Such waits can add up in a highly transactional web application with many thousands of concurrent sessions.

Avoid any type of two-phase commit mechanism, which we see in clustered databases quite often.  Multi-phase commit provides a serialization point so that multiple nodes can agree on what the data looks like, but such points are toxic to scalability.  Better to use technologies that employ an eventually consistent algorithm.
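
To illustrate the traffic-light analogy in miniature, here's a toy Python sketch contrasting a single coarse lock with fine-grained per-account locks; the account model is hypothetical.

```python
import threading

accounts = {"a": 100, "b": 100}

# Coarse: one traffic light for the whole town. Every request waits
# here, even when it touches an unrelated account.
global_lock = threading.Lock()

def credit_coarse(name, amount):
    with global_lock:
        accounts[name] += amount

# Fine-grained: a lock per account, like a traffic circle. Requests
# against different accounts proceed in parallel.
account_locks = {name: threading.Lock() for name in accounts}

def credit_fine(name, amount):
    with account_locks[name]:
        accounts[name] += amount
```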

Related: Is automation killing old-school operations?

3. One Copy of Your Database

Without replication, you rely on only one copy of your database.  In this configuration, you limit all of your webservers to a single backend datastore, which becomes a funnel or bottleneck.  It’s like a highway under construction, forcing all the cars to squeeze into one lane.  It’s sure to slow things down.  Better to build parallel roads to start with, and allow the application, aka the drivers, to choose alternate routes as their schedule and itinerary dictate.
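
A minimal sketch of what this looks like at the application layer, assuming MySQL asynchronous replication with one primary and two read replicas (the hostnames are placeholders): route writes to the primary and fan reads out across the replicas.

```python
import random

# Hypothetical hostnames: one primary plus asynchronous read replicas.
PRIMARY = "db-primary.internal"
REPLICAS = ["db-replica1.internal", "db-replica2.internal"]

def route(sql):
    """Send writes to the primary; spread reads across the replicas."""
    is_read = sql.lstrip().lower().startswith("select")
    return random.choice(REPLICAS) if is_read else PRIMARY

print(route("SELECT * FROM plans WHERE state = 'NY'"))      # -> a replica
print(route("UPDATE plans SET premium = 321 WHERE id = 7"))  # -> the primary
```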

Using MySQL? Check out our howto Easy Replication Setup with Hotbackups.

Read: Do managers underestimate operational cost?

4. Having No Metrics

Having no metrics in place is toxic to scalability because you can’t visualize what is happening on your systems.  Without this visual cue, it is hard to get business units, developers and operations teams on the same page about scalability issues.  If teams are having trouble grokking this, realize that these tools simply provide analytics for infrastructure.

There are tons of solutions that use SNMP and are non-invasive.  Consider Cacti, Munin, OpenNMS, Ganglia and Zabbix, to name a few.  Metrics collection can involve business metrics like user registrations, accounts or widgets sold.  And of course it should also include low-level system metrics (cpu, memory, disk & network usage) as well as database-level activity (buffer pool, transaction log, locking, sorting, temp tables and queries per second).
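
For the business-metrics side, here's a minimal sketch of emitting counters and gauges over the widely used statsd UDP line protocol ("name:value|type"); the metric names are hypothetical, and it assumes a statsd collector listening on localhost:8125.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def counter(name, value=1):
    # statsd counter: "name:value|c"
    sock.sendto(f"{name}:{value}|c".encode(), ("localhost", 8125))

def gauge(name, value):
    # statsd gauge: "name:value|g"
    sock.sendto(f"{name}:{value}|g".encode(), ("localhost", 8125))

counter("signups.completed")          # business metric
counter("widgets.sold", 3)            # business metric
gauge("mysql.threads_running", 12)    # database metric
```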

Also: Are SQL Databases dead?

5. Lack of Feature Flags

Applications built without feature flags are much more difficult to degrade gracefully.  If your site gets bombarded by a spike in web traffic and you aren’t magically able to scale and expand capacity, built-in feature flags give the operations team a way to dial down the load on the servers without the site going down.  This can buy you time while you scale your webservers and/or database tier, or even retrofit your application to allow multiple read and write databases.

Without these switches in place, you limit scalability and availability.
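
A minimal sketch of what such a switch looks like in Python; the flag names and the rendering pipeline are hypothetical. The point is that ops can flip a flag in a shared config store without a code deploy.

```python
# Flags would live in a store ops can change at runtime (a file, a
# database row, or something like Redis) -- no code deploy required.
FLAGS = {
    "recommendations": True,       # expensive personalized queries
    "search_autocomplete": True,
    "avatars": True,
}

def flag_enabled(name):
    return FLAGS.get(name, False)

def expensive_recommendations():
    return "<div>recommended for you ...</div>"   # stand-in for heavy DB work

def render_homepage():
    page = ["<h1>Welcome</h1>"]
    if flag_enabled("recommendations"):   # under load, ops flips this off
        page.append(expensive_recommendations())
    return "".join(page)

print(render_homepage())
```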

Also: Is high availability overrated? The myth of five nines…


Cloud for Burst Capacity

One very strong case for cloud computing is that it can satisfy applications with seasonal traffic patterns.  One way to test the advantages of the cloud is through a hybrid approach.

Cloud infrastructure can be built completely through scripts.  You can spin up specific AMIs or machine images, automatically install and update packages, install your credentials, start up services, and you’re running.

All of these steps can be performed in advance of your need, at little cost.  Simply build and test.  When you’re finished, shut down those instances.  What you walk away with is scripts.  What do we mean?

The power here is that you carry zero cost for that burst capacity until you need it.  You’ve already built the automation scripts and have them in place.  When your capacity planning warrants it, spin up additional compute power and watch your internet application scale horizontally.  Once your busy season is over, scale back and disable your usage until you need it again.
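
A minimal sketch of such a spinup script using boto3, the AWS SDK for Python; the AMI id, instance type and region are placeholders for your own pre-baked, tested image.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def spinup(count):
    """Launch `count` instances from a pre-baked, pre-tested AMI."""
    resp = ec2.run_instances(
        ImageId="ami-12345678",     # placeholder: your tested machine image
        InstanceType="m5.large",    # placeholder: sized for your workload
        MinCount=count,
        MaxCount=count,
    )
    return [i["InstanceId"] for i in resp["Instances"]]

def teardown(instance_ids):
    # Busy season over: stop paying for the burst capacity.
    ec2.terminate_instances(InstanceIds=instance_ids)
```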

3 Ways to Boost Cloud Scalability

Deploying in the Amazon cloud is touted as a great way to achieve high scalability while paying only for the computing power you use. How do you get the best scalability from the technology?

Migrating to the Cloud – Why and why not?

A lot of technical forums and discussions have highlighted the limitations of EC2, and how it loses on performance when compared to physical servers of equal cost.  They argue that you can get much more hardware and bigger iron for the same money, so it seems foolhardy to turn to the cloud.  Why this mad rush to the cloud then?  Of course if all you’re looking at is performance, it might seem odd indeed.  But look at it another way: if performance is not as good, then performance is clearly not the driving factor behind cloud adoption.

CIOs and CTOs are often asking questions more along the lines of, “Can we deploy in the cloud and settle with the performance limitations, and if so how do we get there?”

Another question: “Is it a good idea to deploy your database in the cloud?”  It depends!  Let’s take a look at some of the strengths and weaknesses, then you decide.

8 big strengths of the cloud

  1. Flexibility in disaster recovery – it becomes a script, no need to buy additional hardware
  2. Easier roll out of patches and upgrades
  3. Reduced operational headache – scripting and automation becomes central
  4. Uniquely suited to seasonal traffic patterns – keep online only the capacity you’re using
  5. Low initial investment
  6. Auto-scaling – set thresholds and deploy new capacity automatically
  7. Easy compromise response – take the server offline and spin up a new one
  8. Easy setup of dev, qa & test environments

Some challenges with deploying in the cloud

  1. Big cultural shift in how operations is done
  2. Lower SLAs and less reliable virtual servers – mitigate with automation
  3. No perimeter security – new model for managing & locking down servers
  4. Where is my data?  — concerns over compliance and privacy
  5. Variable disk performance – can be problematic for MySQL databases
  6. New procurement process can be a hurdle

Many of these challenges can be mitigated.  The promise of infrastructure deployed in the cloud is huge, so gradual adoption is perhaps the best option for many firms.  Mitigate the weaknesses of the cloud as follows:

  • Use encrypted filesystems and backups where necessary
  • Also keep offsite backups inhouse or at an alternate cloud provider
  • Mitigate variable EBS disk performance – cache at every layer of your application stack
  • Employ configuration management & automation tools such as Puppet & Chef

Quora discussion – Why or why not to migrate to the cloud?

iHeavy Insights 81 – Web Performance Metrics

Metrics are those pesky little numbers we like to keep an eye on to see how we’re doing.  Website performance may be full of jargon and fancy terminology, but in the end the purpose is the same: watch the numbers to know how we’re doing.

In Economics

If you follow the economy, you’re probably familiar with GDP, or Gross Domestic Product, which tells you the total value of goods and services that a country produced.  What about the CPI, or Consumer Price Index?  That measures the price of a so-called basket of goods, the goods everyone must have, such as food & beverages, housing, apparel, transportation, medical care and so forth.  By measuring the CPI we get a sense of consumers’ buying power, or how far the dollar goes.

In Baseball

If you follow sports, you’ve probably heard of a player’s batting average, which is hits divided by at-bats.  A simple ratio gives a picture of the player’s past performance.  Another statistic is the RBI, or runs batted in, which tells you how many times the player caused runs to be scored.

Web Performance

Taken generally, metrics give us a quick view of a more complicated picture.  Performance metrics for websites are no different.  For instance, if we’re looking at the business or application level, we might keep track of things like:

  • user registrations
  • subscriptions sold
  • widgets sold
  • new accounts sold
  • user & social interactions
  • ratings and other gamification stats

So too at a lower level, we can capture metrics of the systems our web application runs on top of, with tools like Cacti, Munin, Ganglia, Zabbix, or OpenNMS.  The basics include:

  • cpu utilization
  • network throughput
  • disk throughput
  • memory usage
  • load average

And further down the stack, we can keep metrics on database activity, such as:

  • buffer pool usage
  • files & table I/O
  • sorting activity
  • locking and lock waits
  • queries per second
  • transaction log activity

By tracking these metrics over time, we can view graphs at a glance and see trends.  What’s more, folks from different sides of the business get visibility into each other’s needs.  Business teams can see server loads, and operations people can see real revenue and income.  That brings teams together toward a common goal.
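
As a concrete illustration, here's a minimal sketch of sampling one such database metric yourself: deriving queries per second from MySQL's cumulative Questions counter. It assumes the PyMySQL driver and a monitoring user; the credentials are placeholders.

```python
import time
import pymysql  # assumption: using the PyMySQL driver

# Placeholder credentials for a read-only monitoring account.
conn = pymysql.connect(host="localhost", user="monitor", password="...")

def questions():
    """Read MySQL's cumulative count of statements executed."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
        return int(cur.fetchone()[1])

start = questions()
time.sleep(10)                          # sample the counter over 10 seconds
qps = (questions() - start) / 10.0
print(f"queries/sec: {qps:.1f}")
```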

Performance Metrics – What are they and why are they important?

In order to understand how fast your website is, we need some numbers.  We call these fancy numbers performance metrics: objective measures that we can track over time.  We can track them against seasonality as well as website traffic and growth.  But we can also track them across feature and application changes, based on deployments, to see if new code has caused performance to improve or degrade.

Some useful business or application performance metrics include:

  • user registrations
  • accounts sold
  • widgets sold
  • user interactions & social metrics
  • so-called gamification, ratings & related

We also want to capture lower-level system metrics with a tool like Cacti, Ganglia, Munin, OpenNMS, Zabbix or similar:

  • cpu
  • memory
  • disk
  • network
  • load average

Along with the basic system level metrics you’ll want to collect some at the database level such as:

  • InnoDB Buffer Pool activity
  • Files & Tables
  • Binary log activity
  • Locking
  • Sorting
  • Temporary objects
  • Queries/second

Sean Hull asks on Quora – What are web performance metrics and why are they important?

Capacity Planning – What is it and why is it important?

Look at your website’s current traffic patterns, pageviews or visits per day, and compare that to your server infrastructure. In a nutshell, your current capacity measures the ceiling your traffic could grow to and still be supported by your current servers. Think of it as the horsepower of your application stack: load balancer, caching server, webserver and database.

Capacity planning seeks to estimate when you will reach capacity with your current infrastructure by doing load testing, and stress testing. With traditional servers, you estimate how many months you will be comfortable with currently provisioned servers, and plan to bring new ones online and into rotation before you reach that traffic ceiling.

Your reaction to capacity and seasonal traffic variations becomes much more nimble with cloud computing solutions, as you can script server spinups to match capacity and growth needs. In fact you can implement auto-scaling as well, setting rules and thresholds to bring additional capacity online – or offline – automatically as traffic dictates.

In order to do proper capacity planning, you need good data. Pageviews and visits per day can come from your analytics package, but you’ll also need more complex metrics on what your servers are doing over time. Packages like Cacti, Munin, Ganglia, OpenNMS or Zenoss can provide very useful data collection with very little overhead on the server. With these in place, you can view load average, memory & disk usage, and database or webserver threads, and correlate all that data back to your application. What’s more, with time-based data and graphs, you can compare the trends against change management and deployment data, to determine how new code rollouts affect capacity requirements.
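
As a back-of-the-envelope illustration (all the numbers below are made up), the core estimate boils down to simple arithmetic: given a traffic ceiling measured by load testing and a monthly growth rate, compute the months of headroom left.

```python
import math

# Illustrative placeholders, not real measurements.
current_pageviews_per_day = 400_000
tested_ceiling_per_day = 1_200_000     # where load tests showed degradation
monthly_growth = 0.15                  # 15% month-over-month growth

# Solve ceiling = current * (1 + g)^m for m, the months of headroom.
months = (math.log(tested_ceiling_per_day / current_pageviews_per_day)
          / math.log(1 + monthly_growth))
print(f"~{months:.1f} months until current servers hit their ceiling")
```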

Sean Hull asks about Capacity Planning on Quora.

Stress Testing – What is it and why is it important?

Stress testing applications is like putting a car through crash tests, wear-and-tear tests, and performance tests.  It’s about finding the leaks and bottlenecks before they become a limitation to growth.  In fact, stress testing is a big part of capacity planning.

There are a few different ways to stress test a web application.  You can start at the database side of the house, and just stress test the queries your application uses.  Benchmarking tools included with MySQL, such as mysqlslap, allow you to run a query or sets of queries repeatedly against the database.  You can also run them in parallel and in large batches.  All of these methods push the limits to find out how much the server can handle.

There are tools that operate by firing off repeated URL requests at the webserver, like httperf and jmeter.  These can be good for hammering away at the server, but if you want to do more complex and nuanced tests, a tool like Selenium will let you record a web browsing session and play it back to the server, many times over or in parallel, to simulate a greater load on the servers.
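
In that spirit, here's a minimal sketch of a crude load generator, a poor man's httperf, in Python. The URL, request count and concurrency are placeholders; a real test would exercise realistic user flows, as the Selenium approach above does.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://staging.example.com/"   # placeholder: never point this at production
REQUESTS = 200
CONCURRENCY = 20

def hit(_):
    with urlopen(URL, timeout=10) as resp:
        return resp.status

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(hit, range(REQUESTS)))
elapsed = time.time() - start

print(f"{REQUESTS} requests in {elapsed:.1f}s -> {REQUESTS / elapsed:.1f} req/sec")
print(f"non-200 responses: {sum(s != 200 for s in statuses)}")
```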

Sean Hull asks on Quora – What is Stress Testing and why is it important?