Tag Archives: operations

How do we lock down cloud systems from disgruntled engineers?


I worked at a customer last year, on a short-term assignment. A brilliant engineer had built their infrastructure, automated deployments, and managed all the systems. Sadly, despite all the sleepless nights and dedication, they hadn’t managed to build up a good rapport with management.

Join 32,000 others and follow Sean Hull on twitter @hullsean.

I’ve seen this happen so many times, and I do find it a bit sad. Here’s an engineer who’s working his butt off and really wants the company to succeed. He really cares about the systems. But he doesn’t connect well with people, and is often dismissive, disrespectful or talks down to people like they’re stupid. All of this burns bridges, and there are a lot of bad feelings between all parties.

How do you manage the exit process? Here’s a battery of recommendations for changing credentials & logins so that systems can’t be accessed anymore.

1. Lock out API access

You can do this by removing the administrator role or any other role their IAM user might have. That way you keep the account around *just in case*. This will also prevent them from doing anything on the console, but you can see if they attempt any logins.

Also: Is AWS too complex for small dev teams?

2. Lock out of servers

They may have the private keys for various servers in your environment. So to lock them out, scan through all the security groups, and make sure their whitelisted IPs are gone.

Are you using a bastion box for access? That’s ideal because then you only have one access point. Eliminate their login and audit access there. Then you’ve covered your bases.
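The security group scan can be scripted. Here is a minimal sketch, assuming you have already fetched your groups as a list of dictionaries shaped like the output of boto3’s `describe_security_groups` call (keys `GroupId`, `IpPermissions`, `IpRanges`, `CidrIp`). The function name, the sample groups and the address `203.0.113.7/32` are made up for illustration.

```python
def find_rules_for_cidr(security_groups, cidr):
    """Return (group_id, from_port, to_port) for each inbound rule that allows cidr."""
    matches = []
    for group in security_groups:
        for permission in group.get("IpPermissions", []):
            for ip_range in permission.get("IpRanges", []):
                if ip_range.get("CidrIp") == cidr:
                    matches.append(
                        (group["GroupId"], permission.get("FromPort"), permission.get("ToPort"))
                    )
    return matches


# Hypothetical sample data; in practice this would come from your cloud provider's API.
sample_groups = [
    {
        "GroupId": "sg-0aaa",
        "IpPermissions": [
            {"FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "203.0.113.7/32"}]},
            {"FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        ],
    },
    {
        "GroupId": "sg-0bbb",
        "IpPermissions": [
            {"FromPort": 3306, "ToPort": 3306, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
        ],
    },
]

for group_id, from_port, to_port in find_rules_for_cidr(sample_groups, "203.0.113.7/32"):
    print(f"{group_id}: ports {from_port}-{to_port} still open to the departing engineer")
```

Note that this only matches the exact CIDR string. A wider range that happens to include the engineer’s address would need a separate check.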

Related: Does Amazon eat it’s own dogfood?

3. Update deployment keys

At one of my customers the outgoing op had set up many moving parts, and automated & orchestrated all the deployment processes beautifully. However he also used his personal GitHub key inside Jenkins. So when it went to deploy, it used those credentials to get the code from GitHub. Oops.

We ended up creating a company GitHub account, then updating Jenkins with those credentials. There were of course other places in the Capistrano bits that also needed to be reviewed.

Read: Is aws a patient that needs constant medication?

4. Update dashboard logins

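Monitoring with NewRelic or Nagios? Perhaps you have a centralized dashboard for your internal apps? Or you’re using Slack? Each of these has its own logins, so review them and remove or reset the outgoing engineer’s access.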

Also: Is Amazon too big to fail?

5. Audit Non-key based logins

Have some servers outside of AWS in a traditional datacenter? Or even servers in AWS that are using usernames & passwords? Be sure to audit the full list of systems, and change passwords or disable accounts for the outgoing sysop.

Also: When hosting data on Amazon turns bloodsport?

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

5 Things I just learned from James Turnbull about Docker


I just got my hands on a copy of James Turnbull’s new book The Docker Book. It’s an excellent introduction to Linux containers & the powerful things you can do with them. It’s 335 pages covering all the introductory topics to get you up and running and then more advanced topics like working with the docker API, building services & extending docker.

Here’s what I learned…

1. Containers aren’t new

The technology we call containers today is based on the Unix chroot mechanism, which dates back to around 1980.

With traditional virtualization, we use a hypervisor layer, so we emulate hardware. The virtual machine running on top can run anything, from Windows to different flavors & versions of Unix. It appears to be a completely separate piece of hardware.

With containers we move up to the operating system level, and we create isolation between users. These users all share the same parent operating system. This means it requires dramatically less overhead. That means speed!

Docker began as an automation layer built on Linux Containers, or LXC. To applications it looks like they have their own machine, their own userspace, their own filesystem, their own network.

Also: Is Apple betting against big data?

2. No more VirtualBoxes

Are you tired of waiting for your VMs to spin up? Building dev & test environments becomes lightning fast with Docker. This accelerates software development, and makes a lot of things easier.

Also: When prospects mislead

3. Images, registries & containers

Images share some of the properties of images in hypervisor virtualization. However they are implemented with union file systems. While VirtualBox images take some time to boot, as the entire filesystem must be read & code executed anew, docker images are more like source code to the LXC subsystem.

Registries store your public and private images. The Docker Hub is one popular one. You can also host & deploy your own docker registry as your needs dictate.

Like VMs, containers can be started & stopped at will, albeit at lightning-fast speed. They can also be deleted much as a VM can be.

Also: What can new york fashion week teach Chad Dickerson about Net Neutrality?

4. Lightning fast sandboxes

As we mentioned containers are fast. Did we mention really fast?

This can facilitate unit testing & continuous integration. A lot of shops are starting to use Jenkins for continuous integration, and fast testing is key to this process.

Also: Is automation killing old-school technical operations?

5. They work with Vagrant

Are you already using Vagrant to automate deployment of virtual environments? If so the transition is easy. Here Docker becomes your provisioner.

Mark Stratmann put together a great how to, Implementing a Vagrant / Docker Dev environment which we’d recommend you take a look at. You can also head over to the Vagrant docs themselves.

Also: Which tech do startups use most?


Is Hunter Walk right about operations & startups?

The.Rohit – Flickr


Hunter Walk blogged recently about the importance of building great operations teams. And while he was speaking primarily about business operations, the startup technical operations teams are equally difficult to get right.

1. performance & scalability

As your startup grows like Birchbox, your customer growth curve may begin to look like a hockey stick. That’s a good problem to have. Will your web application be able to keep up with the onslaught of traffic those customers bring?

Getting performance and scalability just right will mean fewer site crashes during those key moments when all eyes are on your site.

Also: Is top operations talent hard to find?

2. Operations is key to architecture

Developers will always have strong opinions on architecture. However they may be heavily influenced by their own mandate: features, deliverability & deadlines. So it’s no surprise that they may sometimes choose to build on ORMs, the middleware brought to you by Hibernate, CakePHP, Active Record & the like.

And while these technologies seem a necessity in today’s modern architectures, they play havoc with your long term scalability. Strong technical operations teams mean a better vision in this area. Heading off your reliance on these technologies will mean managing technical debt before it takes down your company.

Read: Are generalists better at scaling the web?

3. Operations informs strategy

Did you build in those operational switches to turn off the heaviest code, when your site gets overloaded? Operations strategy can help you see these problems on the horizon before they overwhelm you.

Have you considered building a browse only mode for your site? If you’ve ever visited Facebook or Yelp after hours you may have been greeted with the message “We can’t save your comments. Please try again later”. A small innocuous message to end users doesn’t disrupt their enjoyment of the site terribly. But from a technical operations perspective it’s huge. It means teams can perform backups, upgrades and maintenance without interrupting day-to-day activity on the site.
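A browse only mode boils down to a switch that the write path checks before doing any work. Here is a minimal sketch of the idea; the `CommentStore` class and its `browse_only` flag are hypothetical names for illustration, not taken from any particular framework.

```python
class CommentStore:
    """Toy stand-in for the write path of a site (hypothetical, for illustration)."""

    def __init__(self):
        self.browse_only = False  # the operational switch
        self.comments = []

    def save_comment(self, text):
        if self.browse_only:
            # Reads keep working; only writes are refused.
            return "We can't save your comments. Please try again later."
        self.comments.append(text)
        return "Saved."


store = CommentStore()
print(store.save_comment("Great post"))

# Operations flips the switch before a backup or upgrade.
store.browse_only = True
print(store.save_comment("Another one"))
print(store.comments)
```

In a real site the flag would live somewhere ops can change without a deploy, such as a config service or a database row.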

Related: Is scalability a big business?

4. Operations means resilience

We only learn real disaster recovery lessons from storms like Sandy. That’s because resilience is highlighted best when it is a real & urgent need.

In technical operations, getting backups right & testing your recovery plan all form key steps in your path to excellence. Get them right before you need them, and ensure repeatability.

Read: Is high availability a real possibility?

5. Operations means technical strength

At the end of the day, getting technical operations right means you can move from strength to strength. It means building on a solid foundation like those of Google, Facebook, Foursquare & Etsy. It means you can evolve & grow with your customers, and meet their needs confidently.

Check out: Do startup CEO’s underestimate operational costs?


Why startups need techops


I was at a talk recently on node.js. Even if I’m not working with a technology directly, it’s exciting to see what’s out there, and node.js is bringing some hyper fast performance to a certain category of web applications.

During the keynote, the speaker mentioned a service to deploy applications on. I can’t name names unfortunately but it was a cloud solution on top of which you could deploy your application. Go this route and you can do without an operations team. Avoid overhead of hiring ops, he claimed. And hey, then you can hire more developers!

To be fair I’ve heard much of the same thing at DBA or linux conferences. I can’t count the number of stories that start with “what some idiot developer did that took down our production systems…”.

Yes, it seems dev & ops are still just a tad bit adversarial.


1. My little known origins as a developer

Many colleagues and clients I’ve met in the New York City startup industry know me primarily as an operations & scalability guy. I tune databases, infrastructure and components to make things lightning fast.

I spent my earliest years at university on the computer lab operations staff. We watched and managed, made sure level zero backups were taken care of, and moved the tapes. Directly after college, I started at a software firm. I did C++ GUI development on the Mac, using the toolbox libraries with Metrowerks Codewarrior. I built split windows, and scroll bars, and displayed rows of data with nice resizable columns. All this wasn’t built into the class library, so for a lot of it we needed to roll our own solution.

We always had a long list of features coming from the business units. I also fielded many support calls, often from the windows platform as the code there hadn’t been managed and built as carefully. But that too was instructive as you could feel the pain of customers day-to-day challenges. It also illustrated the tradeoffs between new code and features, and existing bug fixes and support.

Also: Why generalists are better at scaling the web

2. A trip through the dot-com bubble as Oracle DBA

Through a circuitous path, I moved to New York in the mid-nineties and joined a startup. I had the opportunity to wear a lot of hats there, and apply my computer lab and Linux operating systems experience to the challenge of managing Oracle. I got a lot more involved with operations quickly.

As the dot-com bubble grew, I saw a hot and growing demand for Oracle DBAs as most startups used Oracle, but the talent was in short supply. In one startup 80 million dollars was on the line as performance hobbled the website, and investors feared the worst.

Read: Why the Twitter IPO made a shocking admission about scalability

3. Different priorities & mandates

I remember working at Starmedia, a media darling at the time. I was analyzing the database & server systems, and finding some code & jobs running during peak daytime hours. Management claimed that could not be the case. Yet for the next days and weeks I saw the same jobs running. I held strong and spoke truth to power as they say. That’s not always easy when you have a lot of investors, screaming CTOs and 100+ hour weeks. But eventually the source of the job was located, and disabled. And the website returned to its speedy self.

These experiences though do underline in my mind the different priorities and focus that developers and operations staff have.

Techops, system administrators & DBAs are typically averse to change. They fight it tooth and nail. That isn’t because they like to be curmudgeons though. They are typically very concerned about the business, but from a dramatically different perspective of stability and reliability, even at 2am. They are concerned about the longevity, consistency, and durability of data.

Developers on the other hand have a different mandate. They are responsible for new business features, solutions to business requirements. Rapid prototyping & reactive or agile is embraced because it means you can deliver quicker to the business.

Crucially, both of these folks care very much for the business. Just with very different priorities.

Check this: Why AirBNB didn’t have to fail

4. Can developers do operations for you?

In a lot of small startups, the initial focus is obviously on building a product. That’s the build phase, and not surprisingly you hire a lot of developers. As you should. But as you grow you may find the operational tasks that are defaulting to one or more developers are taking more and more of their time. As your customer base grows and you’ve seen your first few spikes, it’s time to start thinking about hiring for a real ops role.

In summary, yes they can, but perhaps not well.

Related: How to hire a developer that you can work with

5. Volume discount, made to order or instant coffee

You may choose to go with instant coffee, by bringing someone in-house. You may find the right talent is hard to find. I wrote about this: Why techops and DBAs are in short supply.

Alternatively you may prefer a volume discount from one of the larger remote DBA or managed support solutions such as Oracle’s, Pythian or Percona. These guys all provide great service, but keep in mind how big of a fish you are. You’ll likely work through a ticketing system, and in some cases different engineers will look at your systems at different times. You will likely need either a very hands-on technical CTO or other in-house person to take ownership, and manage things closely.

The third option is a made-to-order coffee. Yes you pay more for Toby’s, Blue Bottle, or Ninth Street Espresso but you get what you pay for as they say. A boutique shop or independent consultant will provide a lot more hand holding, help your internal staff get up to speed, and communicate intimately about the process. If you’re a more non-technical CTO, or you’re very busy running the business, this solution may make a lot of sense for you.

Also: Why cloud detractors need a history lesson


Cloud Deployment Interview

What does a cloud computing expert need to know? In part one of the cloud interview guide we covered some basic unix & Linux systems administration skills, and cloud computing and infrastructure concepts. Those are key starting points. You might also want to jump to part 3 cloud dba, architecture and management interview questions.

In this second part, let’s dig into deploying applications in the cloud, and day to day operations skills. There’s a lot of material here. We recommend picking a few questions out of the bunch and focusing on those questions, rather than trying to cover all of them.

Also while on the topic of hiring, keep in mind that Hiring is a Numbers Game.

1. Deploying in the Cloud

Deploying applications into virtual or cloud datacenters involves understanding and evaluating providers. Many just deploy on Amazon EC2 as it is far and away the largest cloud hosting solution, with the most robust offering.

You might also like our MySQL DBA Interview Guide as well.

o What sets amazon apart from the other cloud providers?

There are probably two things that set Amazon apart from other cloud infrastructure solutions. EBS, or Elastic Block Store, is one. Although the others have storage solutions, and Rackspace is working on their own virtualized storage, Amazon seems to be the furthest ahead with their offering. It is fully virtual, allows arbitrary chunks of storage to be attached to instances, and allows instances to boot off EBS volumes.

The other major point is that since Amazon has grown so large, so quickly, it has more datacenters, in more geographically dispersed areas than other providers. Since these are organized into logical resources, and can be accessed through API, it makes your application infrastructure truly virtual.

o What are some other large cloud providers?

Joyent, Rackspace cloud, Storm on Demand, GoGrid and VoxCloud. There are certainly many others. Take a look at this Quora post: Most Reliable Cloud Providers.

o Tell one vendor management story.

Everyone who has managed operations, has worked with vendors at one point or another. For example if you’ve worked with Rackspace you know that it’s pretty easy to get a human on the line. Amazon on the other hand allows you to do-it-yourself for everything, and only later added on a support service option. So their service pattern and history are different.

Also check out 3 Things CEOs should know about the cloud.

o How do you troubleshoot a problem?

There isn’t really a right or wrong answer to this question, but it’s a nice starting point for discussion. It can also help illustrate a candidate’s communication skills, and how specifically they walk through solving a problem. What problem they choose as an illustration, and how they work through to a resolution, is an important indicator of operations experience.

Pros and cons of Amazon versus Rackspace, configuration management & automation and cloud management solutions like Scalr and Rightscale… these and other skills are important for a cloud deployment expert.

o What are Puppet and Chef?

Puppet is a configuration management system which allows ops teams to build templates for servers, and deploy many servers based on those templates. It further allows centralized control of configuration, to automate the management of a large number of servers.

Chef grew out of frustrations with Puppet, and is a sort of next generation configuration management system.

The term infrastructure as code may be thrown around. Since all cloud resources can be provisioned through API calls, everything in server deployment can be *theoretically* done via code, from spinup of servers, to installing packages, to configuring, code checkout, seeding databases and more.

Also our article What is Infrastructure provisioning and why is it important.

o What are some of the pros and cons of configuration management for operations?

Pros include allowing a smaller team to automate the deployment of a large fleet of servers, standardization, and consistency. Cons include complexity when needing to do surgical, urgent changes, and complexity when coming into an existing environment that you’ve inherited.

o How is rightscale different? What does it provide?

Rightscale is a layer on top of your cloud provider. They provide a common interface and dashboard from which to deploy servers. Templating, automation, and multi-cloud support make it a great solution for teams that have less technical expertise on staff or fewer hands to manage things.

o How about scalr?

They’re another management solution, that supports multiple cloud providers. They offer templating, and auto-scaling too.

While you’re here, take a look at our Myth of Five Nines – Why HA is Overrated.

2. Day to day skills

o What type of programming experience do you have?

The answer is that every ops guy or girl should be able to code, just as every developer should have some basic operational experience. Should and does are often two different things, so ask for some examples.

o shell scripts

Bash, csh, Perl and Python are all part of the Linux administrator’s toolbox. Writing backup scripts, log rotation, automating routine tasks and so forth are all common needs of an operations expert.

Regular expressions are a part of Unix and used in scripting to search files, cronjobs, and ETL jobs. Ask for some basic examples.
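As one basic example a candidate might give, here is a sketch that uses a regular expression to pull warnings and errors out of a log. The log format and the sample lines are assumptions made up for illustration; real logs will need their own pattern.

```python
import re

# Assumed format: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>ERROR|WARN) (?P<message>.+)$"
)


def find_problems(lines):
    """Return (level, message) for every WARN or ERROR line."""
    problems = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            problems.append((match.group("level"), match.group("message")))
    return problems


# Hypothetical sample log lines.
sample_log = [
    "2014-09-01 02:00:01 INFO backup started",
    "2014-09-01 02:03:17 WARN disk /dev/sda1 at 91% capacity",
    "2014-09-01 02:04:40 ERROR backup failed: no space left on device",
]

for level, message in find_problems(sample_log):
    print(level, message)
```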

o What is continuous integration?

The old model of code deployment was called waterfall, and allowed long careful planning, coding of new features, testing, and finally deployment. The cycle could take weeks or months, and iterative change took a lot of time. Continuous integration, often paired with agile deployments, merges and tests changes frequently, so that changes can be deployed much more often, in some cases many times per day.

o What are metrics good for?

Just like in website visitor tracking, and business analytics, server level analytics and tracking is possible. Collecting server metrics such as load averages, memory, disk and cpu usage over time can be invaluable. When an application slows or server stalls, checking historical metrics can often quickly reveal problems or causes.

What are some examples? nagios, ganglia, cacti, munin, opennms

o What is unit testing?

This allows for software to be built in small testable components. When the components are coded, tests are also written that test whether they are operating properly, and whether dependencies are also installed and working.
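To make that concrete, here is a small sketch using Python’s built-in `unittest` module. The component under test, `backup_filename`, is a hypothetical helper invented for this example.

```python
import datetime
import unittest


def backup_filename(db_name, day):
    """Build the name of a nightly dump file (hypothetical naming scheme)."""
    if not db_name:
        raise ValueError("db_name must not be empty")
    return f"{db_name}-{day.isoformat()}.sql.gz"


class BackupFilenameTest(unittest.TestCase):
    def test_name_includes_database_and_date(self):
        name = backup_filename("orders", datetime.date(2014, 9, 1))
        self.assertEqual(name, "orders-2014-09-01.sql.gz")

    def test_empty_database_name_is_rejected(self):
        with self.assertRaises(ValueError):
            backup_filename("", datetime.date(2014, 9, 1))


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BackupFilenameTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

Each test exercises one small behavior, so a failure points straight at the component that broke.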

Metrics, monitoring, load testing, firewalls, security & patching, SaaS, PaaS and IaaS… there is a wide swath of skills needed to be competent as a web operations engineer. You’ve got your work cut out for you!

o What is load testing?

By performing some benchmarks, load testing can make estimates about how the application and code will perform when more users are hitting it.
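A very small benchmark of this kind can be sketched with the standard library alone. This is an illustration under stated assumptions, not a substitute for a real load testing tool: `fake_request` is a hypothetical stand-in for one call to your application, and the request and worker counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request():
    """Stand-in for one call to your application (hypothetical)."""
    time.sleep(0.001)


def timed_call(handler):
    """Run handler once and return how long it took, in seconds."""
    start = time.perf_counter()
    handler()
    return time.perf_counter() - start


def run_load_test(handler, total_requests, workers):
    """Run handler total_requests times across worker threads; return sorted latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed_call, [handler] * total_requests))
    return sorted(latencies)


latencies = run_load_test(fake_request, total_requests=50, workers=5)
median = latencies[len(latencies) // 2]
p95 = latencies[int(len(latencies) * 0.95) - 1]
print(f"median {median * 1000:.2f} ms, p95 {p95 * 1000:.2f} ms")
```

Raising `workers` while watching the median and 95th percentile gives a rough picture of how response times change as more users hit the code at once.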

o Security & networking

Sometimes a systems administrator is a generalized admin and sometimes there is a networking specialist on staff who doesn’t allow anyone else to touch that domain.

o What are firewall rules?

Unix services use port numbers to expose those services to the world. Since all servers on the internet are identified by IP addresses, firewall rules are defined around IP addresses or groups of them, and the ports they’re allowed to access.
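The logic of a firewall rule can be sketched in a few lines with Python’s `ipaddress` module. The rule list here is made up for illustration (SSH from one office range, HTTPS from anywhere), and a real firewall evaluates far more than this.

```python
import ipaddress

# Hypothetical allow-list: (source network, destination port)
RULES = [
    ("203.0.113.0/24", 22),   # SSH only from the office range
    ("0.0.0.0/0", 443),       # HTTPS from anywhere
]


def is_allowed(source_ip, port, rules=RULES):
    """Return True if some rule allows source_ip to reach port."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(cidr) and port == allowed_port
        for cidr, allowed_port in rules
    )


print(is_allowed("203.0.113.9", 22))
print(is_allowed("198.51.100.4", 22))
print(is_allowed("198.51.100.4", 443))
```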

o What is DNS?

DNS stands for Domain Name System. This is the sort of yellow pages of the internet. DNS allows a server name to be converted to its underlying IP address. It’s a very important service for any network, and generally includes many backup servers for when the primaries experience problems.

o What is a virtual private cloud (VPC)?

A VPC provides a network link between a physical datacenter or your office’s network, and your cloud provider. It allows you to elastically grow your existing datacenter using virtual resources, while treating those new boxes more like servers in your existing datacenter. IP addresses and subnets are controlled by your existing network rules and admins.

o Why is security important in web operations?

Since your business assets are primarily stored in digital form, the security of those assets depends on the security of your computer systems. Passwords, firewalls and encryption are all relevant.

o Why is patching software important?

Since security is a moving target, and vulnerabilities are constantly being discovered in software, patching and updates are important. Staying fairly current in applying patches means your network and systems will be more secure.

o What is intrusion detection?

Bugs in software open up vulnerabilities and ways into systems. Intrusion detection attempts to detect such intrusions and limit further damage.

o What is SaaS – Software as a Service?

Dropbox is one example; other so-called hold-my-data solutions fall into this category too.

o What is IaaS – Infrastructure as a Service?

This is raw iron, the virtualized datacenters, hosting providers such as Amazon, GoGrid, Joyent, and Rackspace.

o What is PaaS – Platform as a Service?

Solutions such as Heroku, Squarespace, WP Engine and Engine Yard fall into this category. Some provide a platform such as the WordPress CMS, with arbitrary scaling options. Others like Heroku and Engine Yard allow Ruby applications to be deployed without the need for a lot of fuss at the operational level.

We’re not done yet. In part three of this series, we’ll hit on dba skills, and a series of general questions that cut across the spectrum of web operations. Or jump back to part one of the cloud interview guide.

Read this far? Grab our newsletter – startup scalability.

The four-letter-word dividing Dev and Ops


What’s that word?  RISK

Operations teams are tasked with stability and uptime. That means working against change, limiting or slowing it down where possible.

Developers are tasked with features and delivering business solutions. For that an ORM layer seems appealing for example. It speeds up & simplifies coding. At the same time it eliminates database drudgery.

For ops who are tasked with uptime, an ORM is a force against scalability. I’ve outlined five things toxic to scalability. They work against performance.

The question remains – do devops folks solve the problem?

Consider the banking crisis

Bankers are tasked with making money for their shareholders. To do this they innovate with financial products. Though you may argue they are unscrupulous at times, capitalism and shareholder value drive them to find profit.

Meanwhile the government’s job is to provide a level playing field.  They enact rules, regulate and provide oversight and auditing. As with operations, this is a conservative role, that avoids risk, and seeks stability, growth and avoidance of recessions and depressions.

These tradeoffs exist in many disciplines. The trick is how we find the balance.

There is an equally interesting question of decoupling in internet architectures. I’ll write a future piece on similar parallels I see in the economy at large.

The myth of five nines – Why high availability is overrated



In the Internet world 24×7 has become the de facto standard. Websites must be always on, available 24 hours a day, 365 days a year. In our pursuit of perfection, availability is being measured down to three decimal places, that is, being up 99.999% of the time; in short, five-nines.

Just like a mantra, when repeated enough it becomes second nature and we don’t give the idea a second thought. We don’t stop to consider that while it may be generally a good thing to have, is five-nines necessary and is it realistic for the business?

Also: How to hire a developer that doesn’t suck

In my dealings with small businesses, I’ve found that the ones that have been around longer, and with more seasoned managers, tend to take a more flexible and pragmatic view of the five-nines standard. Some even see periods of outage during off hours as – *gasp* – no problem at all! On the other hand it is a universal truth held by the next-big-idea startups that 24×7 is do or die. To them, a slight interruption in service will send the wrong signal to customers.

The sense I get is that businesses that have been around longer have more faith in their customers and are confident about what their customers want and how to deliver it.  Meanwhile startups who are building a customer base feel the need to make an impression and are thus more sensitive to perceived limitations in their service.

Of course the type of business you run might well inform your policy here. Short outages in payments and e-commerce sites could translate into lost revenue while perhaps a mobile game company might have a little more room to breathe.

Related: Why generalists are better at scaling the web

Sustaining five nines is too expensive for some

The truth is sustaining high availability at the standard of five-nines costs a lot of money. These costs are incurred from buying more servers, whether as physical infrastructure or in the cloud. In addition you’ll likely involve more software components and configuration complexity. And here’s a hard truth, with all that complexity also comes more risk.  More moving parts means more components that can fail. Those additional components can fail from bugs, misconfiguration, or interoperability issues.

What’s more, pushing for that marginal 0.009% increase in high availability means you’ll require more people and create more processes.

Read this: Why reddit didn’t have to fail

Complex architecture downtime

In a client engagement back in 2011, I worked with a firm in the online education space. Their architecture was quite complex. Although they had web servers and database servers (the standard internet stack), they did not have standardized operations. So they had the Apache web server on some boxes, and Nginx on others. What’s more they had different versions of each as well as different distributions of Linux, from Ubuntu to Red Hat Enterprise Linux. On the database side they had instances on various boxes, and since they weren’t all centralized they were not all being backed up.

During one simple maintenance operation, a couple of configurations were rearranged, bringing the site down and blocking e-commerce transactions for over an hour. It wasn’t a failure of technology but a failure of people and processes made worse by the hazard of an overly complex infrastructure.

In another engagement at a financial media firm, I worked closely with the CTO outlining how we could architect an absolutely zero downtime infrastructure.  When he warned that “We have no room for *ANY* downtime,” alarm bells were ringing in my head already.

Also: Why RDS doesn’t support Maria DB or Percona

When I hear talk of five-nines, I hear marketing rhetoric, not real-world risk reduction. Take for example the power grid outage that hit the Northeast in 2003. That took out power from large swaths of the country for over 24 hours. In real terms that means anyone hosted in the Northeast failed five-nines miserably: the standard allows only about 5.26 minutes of downtime per year, so a single 24-hour outage uses up roughly 274 years’ worth of that allowance!
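The arithmetic behind that comparison is easy to check. This sketch assumes a 365-day year and ignores leap years; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, percent in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label}: {allowed_downtime_minutes(percent):.2f} minutes per year")

# How many years of five-nines allowance does one 24-hour outage consume?
outage_minutes = 24 * 60
years_of_budget = outage_minutes / allowed_downtime_minutes(99.999)
print(f"A 24-hour outage uses about {years_of_budget:.0f} years of five-nines budget")
```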

For true high availability look at better management of processes

So what can we do in the real-world to improve availability?  Some of the biggest impacts will come from reducing so-called operator error, and mistakes of people and processes.

Before you think of aiming for five-nines,  first ask some of these questions:

o Do you test servers?
o Do you monitor logfiles?
o Do you have network wide monitoring in place?
o Do you verify backups?
o Do you monitor disk partitions?
o Do you watch load average?
o Do you monitor your server system logs for disk errors and warnings?
o Do you watch disk subsystem logs for errors? (the most likely component in hardware to fail is a disk)
o Do you have server analytics?  Do you collect server system metrics?
o Do you perform fire drills?
o Have you considered managed hosting?

If you’re thinking about and answering these questions you’re well on your way to improving availability and uptime.
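Two of the checks above, disk partitions and load average, can be sketched with the Python standard library. The thresholds here (90% disk usage, load above the number of cores) are assumptions picked for illustration, not recommendations, and in practice a monitoring system such as Nagios would run checks like these on a schedule.

```python
import os
import shutil


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total if usage.total else 0.0


def warnings_for(disk_percent, load_1min, cpu_count, disk_limit=90.0):
    """Return human-readable warnings for the given readings (assumed thresholds)."""
    warnings = []
    if disk_percent >= disk_limit:
        warnings.append(f"disk {disk_percent:.1f}% full")
    if load_1min > cpu_count:
        warnings.append(f"load average {load_1min:.2f} exceeds {cpu_count} cores")
    return warnings


try:
    load_1min = os.getloadavg()[0]
except (AttributeError, OSError):
    load_1min = 0.0  # load average is not available on this platform

print(warnings_for(disk_percent_used("/"), load_1min, os.cpu_count() or 1) or "all clear")
```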

Read this: Top MySQL interview questions for DBAs, hiring managers & recruiters

Want more? Grab our Scalable Startups monthly for more tips and special content. Here’s a sample

How to hire a developer that doesn't suck

Strip by Randall Munroe; xkcd.com

First things first. This is not meant to be a beef against developers. But let’s not ignore the elephant in the living room that is the divide between brilliant code writers and the risk averse operations team.

By the way we also have a MySQL DBA Interview Questions article which is quite popular.

Also take a look at our AWS & EC2 Interview questions piece.

Lastly we have a great Oracle DBA Hiring Guide.

It is almost by default that developers are disruptive with their creative coding while the guys in operations, those who deploy the code, constantly cross their fingers in the hope that application changes won’t tilt the machine. And when you’re woken up at 4am to deal with an outage or your sluggish site is costing millions in losses, the blame game and finger-pointing start.

If you manage a startup you may be faced with this problem all the time. You know your business, you know what you’re trying to build but how do you find people who can help you build and execute your ideas with minimal risk?

Ideally, you want people who can bridge the mentality divide between the programmers eager to see feature changes, the business units pushing for them, and the operations team resistant to changes for the sake of stability.

Service Monitoring – What is it and why is it important?

Data centers are complex beasts, and no amount of operator monitoring by itself can keep track of everything.  That’s why automated monitoring is so important.

So what should you monitor?  You can divide up your monitoring into a couple of strategic areas.  Just as with metrics collection, there is business & application level monitoring and then there is lower level system monitoring which is also important.

Business & Application Monitoring

  • If a user is getting an error page or cannot connect
  • If an e-commerce  transaction is failing
  • General service outages
  • If a business goal is met – or not
  • Page timeouts or slowness
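
As a rough illustration of the application-level checks above, the sketch below fetches a page and classifies the result as OK, SLOW, ERROR or DOWN. It uses only the Python standard library; the function names, the status labels and the two-second slowness threshold are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int], elapsed: float,
             slow_after: float = 2.0) -> str:
    """Turn an HTTP status code (None if unreachable) and timing into a label."""
    if status is None:
        return "DOWN"    # could not connect, or the request timed out
    if status >= 400:
        return "ERROR"   # the user would see an error page
    if elapsed > slow_after:
        return "SLOW"    # the page loaded, but too slowly
    return "OK"


def check_page(url: str, timeout: float = 5.0, slow_after: float = 2.0) -> str:
    """Fetch `url` once and report how a user would have experienced it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code    # the server answered with a 4xx or 5xx code
    except OSError:
        status = None        # connection failures and timeouts land here
    elapsed = time.monotonic() - start
    return classify(status, elapsed, slow_after)
```

Calling `check_page` against a checkout or login URL on a schedule is one simple way to notice failures before users report them.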

Systems Level Monitoring

  • Backups completed successfully
  • Error logs from database, webserver & other major services like email
  • Database replication is running
  • Webserver timeouts
  • Database timeouts
  • Replication failures – via error logs & checksum checks
  • Memory, CPU, Disk I/O, Server load average
  • Network latency
  • Network security

Tools that can perform this type of monitoring include Nagios, among others.
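
Before reaching for a full monitoring suite, a couple of the system-level checks can be sketched in a few lines of Python. The example below assumes backups land as files in a single directory and that problems show up in logs as lines containing "error" or "warning"; both are assumptions for illustration, and the function names are hypothetical.

```python
import os
import time
from typing import Optional, Sequence


def newest_backup_age_hours(backup_dir: str) -> Optional[float]:
    """Age in hours of the newest backup file in `backup_dir`.

    Returns None if there are no files, or if the newest file is empty,
    which usually means the backup job failed partway through.
    """
    files = [entry for entry in os.scandir(backup_dir) if entry.is_file()]
    if not files:
        return None
    newest = max(files, key=lambda entry: entry.stat().st_mtime)
    info = newest.stat()
    if info.st_size == 0:
        return None
    return (time.time() - info.st_mtime) / 3600.0


def count_error_lines(log_path: str,
                      keywords: Sequence[str] = ("error", "warning")) -> int:
    """Count log lines containing any of the keywords (case-insensitive)."""
    count = 0
    with open(log_path, errors="replace") as log_file:
        for line in log_file:
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                count += 1
    return count
```

A nightly job might alert when `newest_backup_age_hours` returns None or a value above 24, and when `count_error_lines` rises above its usual level.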

Quora discussion – Web Operations Monitoring

Devops – What is it and why is it important?

Devops is one of those fancy contractions that tech folks just love.  One part development or developer, and another part operations.  It imagines a blissful marriage where the team that develops software and builds features that fit the business works closely and in concert with an operations and datacenter team that thinks more like developers themselves.

In the long tradition of technology companies, two separate cultures comprise these two roles.  Developers, focused on development languages, libraries, and functionality that match the business requirements, keep their gaze firmly in that direction.  The servers, network and resources those components of software are consuming are left for the ops teams to think about.

So too, ops teams are squarely focused on uptime, resource consumption, performance, availability, and always-on.  They will be the ones woken up at 4am if something goes down, and are thus sensitive to version changes, unplanned or unmanaged deployments, and resource heavy or resource wasteful code and technologies.

Lastly there are the QA teams tasked with quality assurance, testing, and making sure the ongoing stream of features doesn’t break anything previously working or introduce new show stoppers.

Devops is a new and, I think, growing area where the three teams work more closely together.  But devops also speaks to the emerging area of cloud deployments, where servers can be provisioned with command line API calls, and completely scripted.  In this new world, infrastructure components all become components in software, and thus infrastructure itself, long the domain of manual processes and labor intensive tasks, becomes repeatable and amenable to the techniques of good software development.  Suddenly version control, configuration management, and agile development methodologies can be applied to operations, bringing a whole new level of professionalism to deployments.
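
To illustrate what treating infrastructure as software can look like, here is a toy sketch in Python. It compares a desired server list, the kind of thing you might keep in version control, against what is currently running, and works out what would need to change. The function name, server names and fields are all hypothetical, and no real cloud API is called.

```python
from typing import Dict, List


def plan_changes(desired: Dict[str, dict],
                 running: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare desired and running servers and list what needs to change.

    Both arguments map a server name to its settings. Nothing is provisioned
    here; a real tool would turn this plan into provider API calls.
    """
    desired_names = set(desired)
    running_names = set(running)
    return {
        "create": sorted(desired_names - running_names),
        "destroy": sorted(running_names - desired_names),
        "update": sorted(
            name for name in desired_names & running_names
            if desired[name] != running[name]
        ),
    }


# Hypothetical spec, as it might be checked into version control.
desired = {
    "web1": {"size": "small", "role": "web"},
    "web2": {"size": "small", "role": "web"},
    "db1": {"size": "large", "role": "db"},
}

# Hypothetical current state, as it might be reported by a cloud provider.
running = {
    "web1": {"size": "small", "role": "web"},
    "db1": {"size": "medium", "role": "db"},
    "old1": {"size": "small", "role": "web"},
}

print(plan_changes(desired, running))
```

Because the desired state is just data, changes to it can be reviewed, versioned and rolled back like any other code change.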

Sean Hull asks on Quora – What is devops and why is it important?