Tag Archives: devops

iHeavy Insights 82 – Better Practices

Best Practices is a term we hear thrown around a lot. But like that New Year's diet, it too often ends up being more talk than action.

Manage Processes

Operator error, i.e. typing the wrong command, is always a risk. Logging into the wrong server to drop a database, or typing the dump command so that you load data into the database rather than dumping it out, are the kinds of risks operations folks face every day.

Accountability is important, so be sure all of your systems folks log in with their own accounts. Apply the least-privilege model, granting permissions only on an as-needed basis.

Set prompts with big bold names that indicate production servers and their purpose.  Automate repetitive commands that are prone to typos.
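Here's a minimal sketch of that idea in Python. It's a hypothetical wrapper (the hostnames, backup path and database name are all placeholders), but it shows how a script, rather than a tired operator, can decide whether a destructive command is being run on the right box and in the right direction:

```python
import socket
import subprocess
import sys

# Hypothetical production hostnames -- replace with your own.
PRODUCTION_HOSTS = {"db-prod-01", "db-prod-02"}

def run_dump(database: str) -> None:
    """Dump a database to a backup file, with a guard on production hosts."""
    host = socket.gethostname()
    if host in PRODUCTION_HOSTS:
        answer = input(f"You are on PRODUCTION ({host}). Type the database name to continue: ")
        if answer != database:
            sys.exit("Aborting: confirmation did not match.")
    # Redirection is handled explicitly here, so a typo can't turn a dump
    # *out* of the database into a load *into* it.
    with open(f"/backups/{database}.sql", "w") as out:
        subprocess.run(["mysqldump", database], stdout=out, check=True)

if __name__ == "__main__":
    run_dump("customers")
```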

Don’t be afraid to give developers read-only accounts on production servers.

Communicate Clearly

Regular team meetings, a la Agile stand-ups, are a great way to encourage folks to communicate. Bring the developers and operations folks together. Ask everyone in turn to voice their current to-dos, their concerns, and the risks they see. Encourage everyone to listen with an open mind and to consider different perspectives.

Communication is a cultural attribute, so it comes from the top. As a CTO or CIO, encourage it by asking questions, communicating your own concerns, and repeating your requests in different words. Listen to what your team is saying, rephrase their concerns back to them, and spell out how and when those concerns will be addressed.

Document Processes

A culture of documenting services and processes is healthy. It provides a central location and knowledge base for the team. It also prevents sliding into a situation where only one team member understands how to administer critical business components. Were that person to be unavailable or to leave the company, you'd be stuck reverse engineering your infrastructure and guessing at architectural decisions.

Better Practices

Rather than think of best practices as something you need to achieve today, think of it as an ongoing day-to-day quest for improvement.

  • Repetitive manual processes – employ automation & script those processes where possible
  • Where steps require investigation and research – document them
  • Where production changes are involved – communicate with business units, QA & operations
  • Always be improving – keep striving for better practices

Review – Test Driven Infrastructure with Chef – Stephen Nelson-Smith

In search of a good book on Chef itself, I picked up this new title on O’Reilly.  It’s one of their new format books, small in size, only 75 pages.

There was some very good material in this book.  Mr. Nelson-Smith’s writing style is good, readable, and informative.  The discussion of risks of infrastructure as code was instructive.  With the advent of APIs to build out virtual data centers, the idea of automating every aspect of systems administration, and building infrastructure itself as code is a new one.  So an honest discussion of the risks of such an approach is bold and much needed.  I also liked the introduction to Chef itself, and the discussion of installation.

Chef isn't really the main focus of this book, unfortunately. The book spends a lot of time introducing us to Agile development, and specifically test-driven development. While these are lofty goals, and this is the first time I've seen the topic treated in relation to provisioning cloud infrastructure, I did feel too much time was spent on that.

Service Monitoring – What is it and why is it important?

Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything.  That’s why automated monitoring is so important.

So what should you monitor? You can divide your monitoring into a couple of strategic areas. Just as with metrics collection, there is business & application level monitoring, and then there is lower-level systems monitoring, which is just as important.

Business & Application Monitoring

  • If a user is getting an error page or cannot connect
  • If an e-commerce transaction is failing
  • General service outages
  • If a business goal is met – or not
  • Page timeouts or slowness

Systems Level Monitoring

  • Backups completed successfully
  • Error logs from database, webserver & other major services like email
  • Database replication is running
  • Webserver timeouts
  • Database timeouts
  • Replication failures – via error logs & checksum checks
  • Memory, CPU, Disk I/O, Server load average
  • Network latency
  • Network security

Tools that can perform this type of monitoring include Nagios, among others.
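To make the idea concrete, here is a bare-bones sketch in Python of the kind of checks such tools automate. The URL, disk path and threshold are placeholders, and a real setup would page or email someone rather than print:

```python
import shutil
import urllib.request

SITE_URL = "https://www.example.com/"    # placeholder site to check
DISK_PATH = "/"                          # filesystem to watch
DISK_ALERT_PERCENT = 90                  # alert when disk usage crosses this

def check_site() -> bool:
    """Return True if the site answers with HTTP 200 within 10 seconds."""
    try:
        with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

def check_disk() -> bool:
    """Return True if disk usage is below the alert threshold."""
    usage = shutil.disk_usage(DISK_PATH)
    return usage.used / usage.total * 100 < DISK_ALERT_PERCENT

if __name__ == "__main__":
    for name, ok in [("site reachable", check_site()),
                     ("disk below threshold", check_disk())]:
        print(f"{name}: {'OK' if ok else 'ALERT'}")
```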

Quora discussion – Web Operations Monitoring

Infrastructure Provisioning – What is it and why is it important?

In the old days…

You would have a closet in your startup company with a rack of computers.  Provisioning involved:

  1. Decide on your architectural direction: what, where & how
  2. Order the new hardware
  3. Wait weeks for the packages to arrive
  4. Set up the hardware, wire things together, power up
  5. Discover some component is missing or has failed, and order a replacement
  6. Wait longer…
  7. Finally get all the pieces set up
  8. Configure software components and go

Along came some industrious folks who realized the power and data connections to your physical location weren't reliable. So datacenters sprang up. With datacenters, most of the above steps didn't change, except that between steps 3 & 4 you would send your engineers out to the datacenter location. Trips back and forth ate up time and energy.

Then along came managed hosting. Managed hosting saved companies a lot of headaches, wasted man-hours, and other resources. It allowed your company to spend more time on what it does well, running the business, and less on managing hardware and infrastructure. Provisioning now became:

  1. Decide on architecture direction
  2. Call hosting provider and talk to sales person
  3. Wait a day or two
  4. Set up & configure software components and go

Obviously this new state of affairs improved infrastructure provisioning dramatically. It simplified the process and sped it up as well. What's more, a managed hosting provider could keep spare parts and standard components on hand in much greater volume than a small firm. That's a big plus. This evolution continued because it was a win-win for everyone. The only downside came when engineers made mistakes and the finger pointing began. But despite all of that, a provider that does only managed hosting can do it better and more reliably than you can yourself.

So where are we in the present day? We are all either doing, or looking into, cloud provisioning of infrastructure. What's cloud provisioning? It is a complete paradigm shift, but along the same trajectory as what we've described above. Now all the waiting is removed. No waiting for a sales team, or for the ordering process. That's automatic. No waiting for engineers to set up the servers, they're already set up. They are allocated by your software and scripts. Even the setup and configuration of software components, the operating system and the services to run on that server, all of it is automatic.
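As a hedged illustration of what "allocated by your software and scripts" can look like, here is a short Python sketch using boto3, the AWS SDK. The AMI id, instance type and tags are placeholders, not a recommendation:

```python
import boto3

# Connect to EC2 in a region of your choosing (placeholder region).
ec2 = boto3.resource("ec2", region_name="us-east-1")

def provision_webservers(count: int = 1):
    """Launch webserver instances entirely from code: no sales call, no waiting."""
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",      # placeholder AMI baked with your web stack
        InstanceType="t3.small",     # placeholder size
        MinCount=count,
        MaxCount=count,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "webserver"}],
        }],
    )
    return [i.id for i in instances]

if __name__ == "__main__":
    print(provision_webservers(2))
```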

This is such a dramatic shift that we are still feeling the effects of it. Traditional operations teams have little experience with this arrangement, and perhaps little trust in virtual servers. Business units are also not used to handing the trigger on infrastructure spending over to ops teams, or to scripts and software.

However, huge economic pressures, along with new operational flexibility, continue to push firms toward this new model. Gartner predicts this trend will only continue. The advantages of cloud infrastructure provisioning include:

  1. Metered payment – no huge outlay of cash for new infrastructure
  2. Infrastructure as a service – scripted components automate & reduce manual processes
  3. Devops – manage infrastructure like code, with version control and reproducibility
  4. Take unused capacity offline easily & save on those costs
  5. Disaster recovery is essentially free – reuse the same scripts to rebuild standard components
  6. Easily meet seasonal traffic requirements – spin up additional servers instantly

On Quora Sean Hull asks – What is infrastructure provisioning and why is it important?

Object Relational Mapper – What is it and why is it important?

Object Relational Mappers, or ORMs, are a layer of software that sits between web developers and the database backend. For instance, if you're using Ruby as your web development language, you'll interact with MySQL through an ORM layer called ActiveRecord. If you're using Java, you may be fond of the ORM called Hibernate.

ORMs have been controversial because they expose two very different perspectives to software development.  On the one hand we have developers who are tasked with building applications, fulfilling business requirements, and satisfying functional requirements in a finite amount of time.  On the other hand we have operations teams which are tasked with managing resources, supporting applications, and maintaining uptime and availability.

Often these goals are opposing.  As many in the devops movement have pointed out, these teams don’t always work together keeping common goals in mind.  How does this play into the discussion of ORMs?

Relational databases are a technology developed in the 70s that uses an arcane language called SQL to move data in and out. Advocates of ORMs would argue, rightly so, that SQL is cumbersome and difficult to write, and that having a layer of software to help with this task is a great benefit. To be sure, it definitely helps the development effort, as software designers, architects and coders can focus more of their efforts on functional requirements and less on the arcane minutiae of SQL.

Problems come when you bump up against scalability challenges. The operations team is often tasked with supporting performance requirements. Although this can mean providing sufficient servers, disk, memory & CPU resources to support an application, it also means tuning the application. Adding hardware can bring you a 2x or 5x improvement. Tuning an application can bring a 10x or 100x improvement. Inevitably this involves query tuning.

That's where ORMs become problematic: they don't promote tweaking of queries. They are a layer, a buffer, that keeps query writing out of sight.
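The post names ActiveRecord and Hibernate; purely as an illustration, the sketch below uses SQLAlchemy, a Python ORM, to show the two faces of the same query: the ORM call that hides the SQL, and the hand-written SQL a tuner would actually want to rewrite or hint.

```python
from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(64))

# echo=True prints the SQL the ORM generates, one way to see what the
# abstraction is doing on your behalf.
engine = create_engine("sqlite:///:memory:", echo=True)
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The ORM version: convenient, but the generated SQL stays out of sight.
    users = session.query(User).filter(User.name.like("a%")).all()

    # The raw SQL version: the same query written by hand, where a tuner
    # could add an index hint, restructure a join, or trim columns.
    rows = session.execute(
        text("SELECT id, name FROM users WHERE name LIKE 'a%'")
    ).all()
```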

In our experience as performance and scalability experts for the past fifteen years, query tuning is the single biggest thing you can do to improve your web application.  Furthermore some of the most challenging and troublesome applications we’ve been asked to tune have been built on top of ORMs like Hibernate.

Sean Hull asks on Quora – What is an ORM and why is it important?

Capacity Planning – What is it and why is it important?

Look at your website's current traffic patterns, pageviews or visits per day, and compare that to your server infrastructure. In a nutshell, your current capacity measures the ceiling your traffic could grow to while still being supported by your current servers. Think of it as the horsepower of your application stack: load balancer, caching server, webserver and database.

Capacity planning seeks to estimate when you will reach capacity with your current infrastructure by doing load testing and stress testing. With traditional servers, you estimate how many months you will be comfortable with currently provisioned servers, and plan to bring new ones online and into rotation before you reach that traffic ceiling.
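A back-of-envelope version of that estimate, in Python, might look like this. All the numbers are placeholders you would replace with your own load-test results and analytics data, and it assumes simple compounding monthly growth:

```python
# Placeholder figures -- swap in your own measurements.
current_peak_rps = 120      # peak requests per second observed today
ceiling_rps = 400           # ceiling discovered via load / stress testing
monthly_growth = 0.08       # assumed 8% traffic growth per month

months = 0
projected = float(current_peak_rps)
while projected < ceiling_rps:
    projected *= 1 + monthly_growth
    months += 1

print(f"At {monthly_growth:.0%} monthly growth, "
      f"you hit the ceiling in roughly {months} months")
```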

Your reaction to capacity and seasonal traffic variations becomes much more nimble with cloud computing solutions, as you can script server spinups to match capacity and growth needs. In fact you can implement auto-scaling as well, setting rules and thresholds to bring additional capacity online – or offline – automatically as traffic dictates.

In order to do proper capacity planning, you need good data. Pageviews and visits per day can come from your analytics package, but you'll also need more detailed metrics on what your servers are doing over time. Packages like Cacti, Munin, Ganglia, OpenNMS or Zenoss can provide very useful data collection with little overhead on the server. With these in place, you can view load average, memory & disk usage, and database or webserver threads, and correlate all that data back to your application. What's more, with time-based data and graphs, you can compare those trends against change management and deployment records, to determine how new code rollouts affect capacity requirements.

Sean Hull asks about Capacity Planning on Quora.

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importance. Any outage can be seen by customers as a frustration, or an opportunity to move to a competitor. What's more, for a site that also includes e-commerce, it can mean real lost sales.

Zero downtime describes a site without service interruption. To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. If you're using cloud hosting, are you redundant across alternate availability zones and regions? Are you using geographically distributed load balancing? Do you have multiple clustered databases on the backend, and multiple load-balanced webservers?

All of these requirements will increase uptime, but may not bring you close to zero downtime. For that you'll need thorough testing. The solution is to pull the trigger on sections of your infrastructure, and prove that they fail over quickly without a noticeable outage. The ultimate test is the outage itself.
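As a small, hypothetical sketch of "pulling the trigger", the Python snippet below polls a placeholder health URL while someone deliberately takes a redundant component offline, and reports how much downtime, if any, was actually visible from the outside:

```python
import time
import urllib.request

SITE_URL = "https://www.example.com/health"   # placeholder health-check URL

def measure_outage(duration_sec: int = 300, interval_sec: int = 5) -> float:
    """Poll the site for duration_sec; return seconds' worth of failed checks."""
    failed = 0.0
    end = time.time() + duration_sec
    while time.time() < end:
        try:
            with urllib.request.urlopen(SITE_URL, timeout=interval_sec) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        if not ok:
            failed += interval_sec
        time.sleep(interval_sec)
    return failed

if __name__ == "__main__":
    downtime = measure_outage()
    print(f"Observed roughly {downtime:.0f}s of downtime during the failover drill")
```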

Sean Hull on Quora: What is zero downtime and why is it important?

Feature Flags – What are they and why are they important?

Feature flags are switches that developers architect into their web applications to allow a feature to be turned on or off. They sound simple in description, but are harder to implement, or to enable after the fact.

These switches allow the systems team to operationalize new application functionality, turning hot-button features on or off as needed. This can bring tremendous power and flexibility to the operations team for deployments where traffic and site usage patterns cannot be known in advance. It can also increase uptime and availability of the overall site, by minimizing the impact any new feature might have.

Feature flags can also be implemented as feature dials, exposing the feature to a percentage of users, to select users, or in some other meaningful way that lets you turn it up or down gradually.
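A minimal sketch of both the switch and the dial might look like the following Python. The flag store is a hypothetical in-memory dict; in practice it would live in a config service or database so the operations team can change it at runtime:

```python
import hashlib

# Hypothetical flag definitions -- in production these would be loaded
# from a datastore the ops team can update without a deploy.
FLAGS = {
    "new_checkout": {"enabled": True, "rollout_percent": 10},
}

def flag_enabled(flag_name: str, user_id: str) -> bool:
    """True if the feature is on for this user.

    The percentage dial hashes the user id, so each user gets a stable
    answer as the rollout percentage is turned up or down.
    """
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# Usage: wrap the new code path so ops can dial it up, down, or off.
if flag_enabled("new_checkout", user_id="12345"):
    pass  # new feature code path
else:
    pass  # existing behavior
```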

Sean Hull asks on Quora: What are feature flags and why are they important?

Devops – What is it and why is it important?

Devops is one of those fancy contractions that tech folks just love.  One part development or developer, and another part operations.  It imagines a blissful marriage where the team that develops software and builds features that fit the business, works closely and in concert with an operations and datacenter team that thinks more like developers themselves.

In the long tradition of technology companies, two separate cultures have grown up around these two roles. Developers, focused on development languages, libraries, and functionality that matches the business requirements, keep their gaze firmly in that direction. The servers, network and resources those pieces of software consume are left for the ops teams to think about.

So too, ops teams are squarely focused on uptime, resource consumption, performance, availability, and being always-on. They will be the ones woken up at 4am if something goes down, and are thus sensitive to version changes, unplanned or unmanaged deployments, and resource-heavy or resource-wasteful code and technologies.

Lastly there are the QA teams, tasked with quality assurance and testing, making sure the ongoing stream of new features doesn't break anything previously working or introduce new show stoppers.

Devops is a new and, I think, growing area where the three teams work more closely together. But devops also speaks to the emerging practice of cloud deployments, where servers can be provisioned with command-line API calls and completely scripted. In this new world, infrastructure components all become components in software, and thus infrastructure itself, long the domain of manual processes and labor-intensive tasks, becomes repeatable and amenable to the techniques of good software development. Suddenly version control, configuration management, and agile development methodologies can be applied to operations, bringing a whole new level of professionalism to deployments.
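To give a flavor of "infrastructure as software" without reaching for a full tool like Chef, here is a toy Python sketch of an idempotent configuration resource, the kind of converge-to-desired-state logic such tools encapsulate and keep under version control (the file path and contents are placeholders):

```python
import os

def ensure_file(path: str, content: str, mode: int = 0o644) -> bool:
    """Converge a config file to the desired content; return True if it changed."""
    current = None
    if os.path.exists(path):
        with open(path) as f:
            current = f.read()
    if current == content:
        return False          # already in the desired state, do nothing
    with open(path, "w") as f:
        f.write(content)
    os.chmod(path, mode)
    return True               # a change was applied, and could be logged or reported

if __name__ == "__main__":
    changed = ensure_file("/tmp/myapp.conf", "max_connections = 100\n")
    print("changed" if changed else "up to date")
```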

Sean Hull asks on Quora – What is devops and why is it important?

iHeavy Insights 79 – Plumbing the Interwebs

I meet new people all the time. It's a way of life in New York. One of the first questions new people ask each other is “What do you do?”. It begins to sound like a cliche after a while, but it can also lead to endless fascinating discussions, as there are so many people with different professions in New York. Some give a titled answer: “I'm an investment banker”, “I'm an emcee”, “I'm an executive recruiter”. I find that “Web Scalability Consultant” or “Web Operations Expert” only draws confused looks.

A Plumber By Another Name

The solution of course is to tell a good story. Stories illustrate what titles and crusty vernacular cannot. I've used analogies to surgeons or mechanics; of course they all operate on something people can relate to, right in front of them: people, or the vehicles we use every day. With the internet, though, there is a huge hidden infrastructure that most people don't see. They may vaguely know it's there, but it's still hidden out of sight.

That's why I think plumbing provides such an apt visual. As it turns out, the internet is built with countless data pipes both large and small, coming into your home or lying across the bottom of the Atlantic. These pipes plug into routers, the high-speed traffic lights and traffic cops of the network. Ultimately they feed into datacenters, huge rooms filled with racks of computers holding your website's crown jewels. Therein reside the images and status updates from your Facebook profile, the banking transactions from your personal bank account or credit card, your Netflix movie stream, and the email you sent via Gmail. Even your instant messaging stream, or the data from your favorite iPhone app, is stored and retrieved from here.

Amazon Outage

The recent Amazon outage has been high profile enough that a lot of folks who don't follow the latest trends in web operations, devops, and datacenter automation still heard about the event. It turns out it's had a silver lining for Amazon, because now everyone is scrutinizing how many sites actually rely on this goliath of a hosting provider.

As it turns out, the root of the Amazon outage was indeed a plumbing problem. Amazon has shown rather high transparency, publishing intimate details of the problem and its resolution. Read more.

A misconfigured network cascaded through the system, creating countless failures. If you imagine water repairs being done in a large New York City building, the crews often ask tenants to shut off their taps, so the demand doesn't all come back at once when service is restored. Similarly, intricate problems complicated the Amazon effort, slowing down attempts to restore everything after the incident. I wrote at length about the outage if you're interested, read more.

BOOK REVIEW:  Game-Based Marketing by Zicherman & Linder

There are so many new books coming out all the time that it's tough to sift through them and find the good ones. Anyone with a website as their storefront, whether they are a product company or a services company, can gain from reading this book.

From leaderboards to frequent flyer programs, badges and more, this book is full of real-world examples where game-based principles are put into action. On the internet, where attention is an ever rarer commodity, these concepts will surely make a big difference to your business.

Amazon book link – Game Based Marketing