Tag Archives: web operations

The Art of Resistance

Sometimes, you have to be the bad guy. Be resistant to change. Here’s a story about how stubbornness pays off. As we’ve written about before, a four-letter word divides Dev & Ops.

I had one experience working as the primary MySQL DBA for an internet startup. Turns out they had Oracle for some applications too. And another DBA just to handle the Oracle stuff.

So it came time for the Oracle guy to go on vacation. Suddenly these Oracle systems landed on my shoulders. We reviewed everything in advance, then he said his goodbyes.

Almost as soon as he was out the door I started getting requests to change things.
“Oh we have to add this field”, or “oh we’d like to make this table change”.

I resisted enough to hold off development for a week.

“We’re now getting more heat from the CTO. Apparently certain pages on the site don’t save some very important content properly. It’s costing the business a lot of money. We need to make this change.”

My response –

[quote]No I can’t sanction this. If you want to do it, understand that it could well break Oracle’s replication.[/quote]

Oracle’s multi-master replication notoriously requires a lot of baby sitting. We held off for a few more days, and the Oracle DBA returned from his much needed vacation.

When discussing it afterward he said…

[quote]Am very glad you didn’t change those fields in the database. It would indeed have broken replication and caused problems with the backups![/quote]

Moral of the story…

  • apply the brakes around tight turns
  • don’t be afraid to say, we’re not going to do this before testing
  • you may have to say – we’re not going to do this at all…

Data warehousing – What is it and why is it important?

A data warehouse is a special type of database.  It is used to store large amounts of data, such as analytics, historical, or customer data, and then run large reports and data mining against it.  It is markedly different from a web-facing or high-transaction database, which typically handles a great many small transactions on constantly changing data, coming from hundreds or thousands of small user sessions.  Those transactions typically execute in times on the order of 1/100th of a second, while a data warehouse runs fewer, larger queries which can take minutes to execute.

Data warehouses are tuned for updates happening in bulk via batch jobs, and for large queries which need big chunks of memory to sort and cross-tabulate data from different tables.  Often full table scans are required because of the specialized one-off nature of these reports.  The same queries are not executed over and over.

It’s important not to mix data warehousing databases with transactional databases in the same instance, whether you are dealing with MySQL or Oracle.  That’s because they are tuned totally differently.  It would be like trying to use the same engine for a car commuting to work and for a container ship traveling around the world.  Different jobs require different databases, or databases with their dials set for different uses.

Quora discussion of data warehousing – Sean Hull

Agile – What is it and why is it important?

Agile software development seeks a more lightweight methodology for making changes and releasing software.  In the traditional waterfall approach, large pieces of software are written at once, and releases happen less frequently.  Once features are complete, the testing phase happens, followed by deployment to production.  These releases can be many weeks apart, so turnaround for new features tends to be slow.  Advocates of this approach would argue that it forces discipline in the process, and prevents haphazard releases and buggy software.

Agile methodologies seek to accelerate releases of much smaller pieces of code. These releases can happen daily or even many times a day, as developers themselves are given the levers to push code.  Agile tends to be more reactive to business needs, with less planning and requirements gathering up front.

While Agile remains the buzzword of the day, it may not work for every software development project.  It makes the most sense for web development and applications where small failures can easily be tolerated and where small teams are doing the work.

Sean Hull asks on Quora – What is Agile software development and why is it important?

Sharding – What is it and why is it important?

Sharding is a way of partitioning your datastore to benefit from the computing power of more than one server.  For instance many web-facing databases get sharded on user_id, the unique serial number your application assigns to each user on the website.

Sharding can bring you the advantages of horizontal scalability by dividing up data into multiple backend databases.  This can bring tremendous speedups and performance improvements.

Sharding, however, has a number of important costs.

  • reduced availability
  • higher administrative complexity
  • greater application complexity

High availability is a goal of most web applications, as they aim for always-on, 24×7×365 availability.  By introducing more servers, you have more components that all have to work flawlessly.  If the expected downtime of any one backend database is half an hour per month and you shard across five servers, your expected downtime increases by up to a factor of five, to 2.5 hours per month.
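The arithmetic behind that figure can be sketched in a few lines. This is a rough worst-case model, assuming that losing any one shard counts as an outage for the site and that outages on different shards never overlap; the function names and the 30-day month are just illustrative choices.

```python
def expected_downtime_hours(per_server_hours: float, servers: int) -> float:
    """Worst-case monthly downtime when any single shard being down counts as an outage.

    Assumes outages on different shards never overlap, so they simply add up.
    """
    return per_server_hours * servers


def availability_percent(downtime_hours: float, hours_in_month: float = 30 * 24) -> float:
    """Convert monthly downtime into an availability percentage."""
    return 100.0 * (1 - downtime_hours / hours_in_month)


single = expected_downtime_hours(0.5, 1)   # one unsharded database
sharded = expected_downtime_hours(0.5, 5)  # the same data spread over five shards

print(f"unsharded: {single} h/month, {availability_percent(single):.2f}% available")
print(f"sharded:   {sharded} h/month, {availability_percent(sharded):.2f}% available")
```

In practice some outages overlap and some only affect a slice of your users, so the real number is usually better than this, but it is a useful reminder that every added server is one more thing that can fail.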

Administrative complexity is an important consideration as well.  More databases means more servers to back up, more complex recovery, more complex testing, more complex replication and more complex data integrity checking.

Since sharding keeps chunks of your data on different servers, your application must accept the burden of deciding where the data is, and fetching it from there.  In some cases the application must make alternate decisions if it cannot find the data where it expects.  All of this increases application complexity and is important to keep in mind.
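To make that burden concrete, here is a minimal sketch of routing on user_id. It assumes simple modulo placement across four shards, and it uses in-memory dictionaries as stand-ins for the backend databases; the function names are hypothetical, and a real application would hold database connections instead.

```python
from typing import Optional

# Stand-ins for four backend databases (hypothetical; real code would hold connections).
SHARDS = [{} for _ in range(4)]


def shard_index(user_id: int, shard_count: int = len(SHARDS)) -> int:
    """Decide which shard owns this user: simple modulo placement on user_id."""
    return user_id % shard_count


def save_user(user_id: int, name: str) -> None:
    """Write the row to the shard that owns this user_id."""
    SHARDS[shard_index(user_id)][user_id] = name


def find_user(user_id: int) -> Optional[str]:
    """Look in the expected shard first, then fall back to checking the others."""
    home = shard_index(user_id)
    if user_id in SHARDS[home]:
        return SHARDS[home][user_id]
    # The row is not where we expected (for example, mid-migration),
    # so the application has to make an alternate decision.
    for i, shard in enumerate(SHARDS):
        if i != home and user_id in shard:
            return shard[user_id]
    return None


save_user(42, "alice")
print(shard_index(42), find_user(42))
```

Even this toy version shows the cost: every read and write now passes through placement logic that your application has to own, test, and keep in sync with where the data really lives.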

Sean Hull asks on Quora – What is Sharding and why is it important?

Object Relational Mapper – What is it and why is it important?

Object Relational Mappers or ORMs are a layer of software that sits between web developers and the database backend.  For instance if you’re using Ruby on Rails for web development, you’ll typically interact with MySQL through an ORM layer called ActiveRecord.  If you’re using Java, you may be fond of the ORM called Hibernate.

ORMs have been controversial because they expose two very different perspectives to software development.  On the one hand we have developers who are tasked with building applications, fulfilling business requirements, and satisfying functional requirements in a finite amount of time.  On the other hand we have operations teams which are tasked with managing resources, supporting applications, and maintaining uptime and availability.

Often these goals are opposing.  As many in the devops movement have pointed out, these teams don’t always work together keeping common goals in mind.  How does this play into the discussion of ORMs?

Relational databases are a technology developed in the 1970s that use an arcane language called SQL to move data in and out of them.  Advocates of ORMs would argue, rightly so, that SQL is cumbersome and difficult to write, and that having a layer of software which helps you in this task is a great benefit.  To be sure, it helps the development effort, as software designers, architects and coders can focus more of their efforts on functional requirements and less on the arcane minutiae of SQL.

Problems come when you bump up against scalability challenges.  The operations team is often tasked with supporting performance requirements.  Although this can often mean providing sufficient servers, disk, memory & CPU resources to support an application, it also means tuning the application.  Adding hardware can bring you 2x or 5x improvement.  Tuning an application can bring 10x or 100x improvement.  Inevitably this involves query tuning.

That’s where ORMs become problematic, as they don’t promote tweaking of queries.  They are a layer or buffer to keep query writing out of sight.
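One common way this shows up is the so-called N+1 pattern, where code that reads like a simple loop over objects quietly issues one query per row, as lazy loading in an ORM can do. The sketch below imitates that with Python’s built-in sqlite3 module rather than a real ORM; the tables, data and function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'sharding'), (2, 1, 'devops'), (3, 2, 'agile');
""")


def titles_n_plus_one(conn):
    """One query for the authors, then one more per author: N+1 round trips."""
    result, queries = [], 0
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    queries += 1
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result.extend((name, title) for (title,) in rows)
    return result, queries


def titles_single_join(conn):
    """The hand-tuned version: the same rows from a single JOIN."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    return rows, 1


print(titles_n_plus_one(conn))
print(titles_single_join(conn))
```

Both functions return the same rows, but the first grows by one query for every author added. With two authors nobody notices; with many thousands, it is exactly the kind of thing the operations team ends up chasing.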

In our experience as performance and scalability experts for the past fifteen years, query tuning is the single biggest thing you can do to improve your web application.  Furthermore, some of the most challenging and troublesome applications we’ve been asked to tune have been built on top of ORMs like Hibernate.

Sean Hull asks on Quora – What is an ORM and why is it important?

Capacity Planning – What is it and why is it important?

Look at your website’s current traffic patterns, pageviews or visits per day, and compare that to your server infrastructure. In a nutshell, your current capacity is the ceiling your traffic could grow to and still be supported by your current servers. Think of it as the horsepower of your application stack – load balancer, caching server, webserver and database.

Capacity planning seeks to estimate when you will reach capacity with your current infrastructure by doing load testing, and stress testing. With traditional servers, you estimate how many months you will be comfortable with currently provisioned servers, and plan to bring new ones online and into rotation before you reach that traffic ceiling.
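As a back-of-the-envelope version of that estimate, the sketch below computes how many months remain before traffic reaches the ceiling found by load testing. It assumes steady compound monthly growth, which real traffic rarely follows exactly; the function name and the example numbers are made up for illustration.

```python
import math


def months_until_capacity(current_daily_pageviews: float,
                          ceiling_daily_pageviews: float,
                          monthly_growth_rate: float) -> float:
    """Months until traffic reaches the ceiling, assuming compound monthly growth.

    monthly_growth_rate is a fraction, e.g. 0.10 for 10% growth per month.
    """
    if current_daily_pageviews >= ceiling_daily_pageviews:
        return 0.0        # already at or past capacity
    if monthly_growth_rate <= 0:
        return math.inf   # flat or shrinking traffic never reaches the ceiling
    ratio = ceiling_daily_pageviews / current_daily_pageviews
    return math.log(ratio) / math.log(1 + monthly_growth_rate)


# Example: 1M pageviews/day today, servers tested up to 2M/day, growing 10% a month.
months = months_until_capacity(1_000_000, 2_000_000, 0.10)
print(f"about {months:.1f} months of headroom")
```

Whatever number comes out, subtract the lead time you need to order, rack and test new hardware; that gives you the date by which you have to act.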

Your reaction to capacity and seasonal traffic variations becomes much more nimble with cloud computing solutions, as you can script server spinups to match capacity and growth needs. In fact you can implement auto-scaling as well, setting rules and thresholds to bring additional capacity online – or offline – automatically as traffic dictates.
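An auto-scaling rule of this kind boils down to a couple of thresholds and some limits. The sketch below is a simplified illustration, not any particular cloud provider’s API; the threshold values, server limits and function name are assumptions chosen for the example.

```python
def desired_servers(current: int,
                    avg_cpu_percent: float,
                    scale_up_at: float = 75.0,
                    scale_down_at: float = 25.0,
                    min_servers: int = 2,
                    max_servers: int = 10) -> int:
    """Return how many servers we want, given the current count and average CPU load."""
    if avg_cpu_percent > scale_up_at:
        target = current + 1   # traffic is heavy: bring one more server online
    elif avg_cpu_percent < scale_down_at:
        target = current - 1   # traffic is light: take one server offline
    else:
        target = current       # within the comfortable band: do nothing
    # Never go below the redundancy floor or above the budget ceiling.
    return max(min_servers, min(max_servers, target))


print(desired_servers(3, 90.0))  # busy
print(desired_servers(3, 10.0))  # quiet
```

The gap between the two thresholds matters: set them too close together and the rule will flap, adding and removing servers every few minutes.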

In order to do proper capacity planning, you need good data. Pageviews and visits per day can come from your analytics package, but you’ll also need more complex metrics on what your servers are doing over time. Packages like Cacti, Munin, Ganglia, OpenNMS or Zenoss can provide you with very useful data collection with very little overhead to the server. With these in place, you can view load average, memory & disk usage, database or webserver threads and correlate all that data back to your application. What’s more, with time-based data and graphs, you can compare changes against application change management and deployment data, to determine how new code rollouts affect capacity requirements.

Sean Hull asks about Capacity Planning on Quora.

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importance.  Any outage can be seen by customers as a frustration, or an opportunity to move to a competitor.  What’s more, for a site that also includes e-commerce, it can mean real lost sales.

Zero Downtime describes a site without service interruption.  To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure.  If you’re using cloud hosting, are you redundant across alternate availability zones and regions?  Are you using geographically distributed load balancing?  Do you have multiple clustered databases on the backend, and multiple load-balanced webservers?

All of these requirements will increase uptime, but may not bring you close to zero downtime.  For that you’ll need thorough testing.  The solution is to pull the trigger on sections of your infrastructure, and prove that it fails over quickly without noticeable outage.  The ultimate test is the outage itself.

Sean Hull on Quora: What is zero downtime and why is it important?

Feature Flags – What are they and why are they important?

Feature flags are switches that developers architect into their web applications to allow a feature to be turned on or off.  It sounds simple, but it is harder to implement or enable after the fact.

These switches allow the systems team to operationalize new application functionality.  They make it possible to turn hot button features on or off as needed.  This can bring tremendous power and flexibility to the operations team for deployments where traffic patterns and site usage patterns cannot be known in advance.  It can increase uptime and availability of the overall site, by minimizing the impact any new feature might have.

Feature flags can also be implemented as feature dials, allowing the feature to be exposed to a percentage of users, select users, or some other meaningful way to turn it up or down gradually.
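Here is a minimal sketch of a flag that doubles as a dial. It assumes each feature has a rollout percentage from 0 to 100 and that users are placed into stable buckets by hashing their id, so a given user keeps seeing the same thing between requests. The flag names and the in-code dictionary are hypothetical; in practice the settings would live somewhere the operations team can change without a deploy.

```python
import hashlib

# Rollout percentage per feature (hypothetical names and values).
FLAGS = {
    "star_ratings": 100,  # fully on
    "new_search": 10,     # dial: exposed to roughly 10% of users
    "comments": 0,        # switched off
}


def bucket(feature: str, user_id: int) -> int:
    """Map a (feature, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: int, flags: dict = FLAGS) -> bool:
    """A feature is on for a user if their bucket falls under the rollout percentage."""
    return bucket(feature, user_id) < flags.get(feature, 0)


for feature in FLAGS:
    exposed = sum(is_enabled(feature, user_id) for user_id in range(1000))
    print(f"{feature}: on for {exposed} of 1000 users")
```

Unknown features default to off, which is the safer failure mode: a typo in a flag name hides a feature rather than exposing one that is not ready.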

Sean Hull asks on Quora: What are feature flags and why are they important?

Devops – What is it and why is it important?

Devops is one of those fancy contractions that tech folks just love.  One part development or developer, and another part operations.  It imagines a blissful marriage where the team that develops software and builds features that fit the business, works closely and in concert with an operations and datacenter team that thinks more like developers themselves.

In the long tradition of technology companies, two separate cultures comprise these two roles.  Developers, focused on development languages, libraries, and functionality that match the business requirements, keep their gaze firmly in that direction.  The servers, network and resources those components of software are consuming are left for the ops teams to think about.

So too, ops teams are squarely focused on uptime, resource consumption, performance, availability, and always-on.  They will be the ones woken up at 4am if something goes down, and are thus sensitive to version changes, unplanned or unmanaged deployments, and resource heavy or resource wasteful code and technologies.

Lastly there are the QA teams tasked with quality assurance, testing, and making sure the ongoing stream of features doesn’t break anything previously working or introduce new show stoppers.

Devops is a new and I think growing area where the three teams work more closely together.  But devops also speaks to the emerging area of cloud deployments, where servers can be provisioned with command-line API calls, and completely scripted.  In this new world, infrastructure components all become components in software, and thus infrastructure itself, long the domain of manual processes and labor intensive tasks, becomes repeatable, and amenable to the techniques of good software development.  Suddenly version control, configuration management, and agile development methodologies can be applied to operations, bringing a whole new level of professionalism to deployments.

Sean Hull asks on Quora – What is devops and why is it important?

Degrade Gracefully – What is it and why is it important?

Websites and web applications have traffic patterns that are often unpredictable.  After all, growth in traffic is really what we’re after.  However, even with the best stress testing, it’s sometimes difficult to predict what areas of the site will get inundated, or how the site will scale.

Degrade gracefully describes an architecture built specifically to unwind in a smooth manner without any real site-wide outage.  What do we mean by that?  We mean build in operational switches to turn off components in the site.  Have a star rating on pages?  Build an on/off switch for your operations team to disable it if necessary.  Have site-wide comments, or robust search?  Allow those features to be disabled.  If possible, architect in a read-only mode for your site that you can turn on in a really difficult situation.  By operationalizing these components, you give more flexibility to the operations team, and reduce the likelihood of having a complete outage.
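A bare-bones sketch of what those operational switches might look like in application code follows. The component names come from the examples above; the class, the functions and the in-memory comment store are hypothetical stand-ins for whatever your application really uses.

```python
class SiteSwitches:
    """Switches the operations team can flip while the site is running."""

    def __init__(self):
        self.disabled = set()   # names of components currently turned off
        self.read_only = False  # emergency mode: serve pages, accept no writes


switches = SiteSwitches()


def render_page(article: str) -> list:
    """Always serve the core article; add optional components only if enabled."""
    parts = [article]
    for component in ("star_rating", "comments", "search"):
        if component not in switches.disabled:
            parts.append(component)
    return parts


def save_comment(store: list, text: str) -> bool:
    """Refuse writes politely when comments are off or the site is read-only."""
    if switches.read_only or "comments" in switches.disabled:
        return False
    store.append(text)
    return True


print(render_page("article"))         # everything on
switches.disabled.add("star_rating")  # ratings are hammering the database
print(render_page("article"))         # page still renders, minus the ratings
```

The point is that the page still renders in every state. Each switch removes load without removing the site.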

Sean Hull asks on Quora: What does degrade gracefully mean, and why is it important?