What does DEV OPS mean?

I was recently interviewed by Viktor Farcic. He asked me a lot of interesting questions.

Join 35,000 others and follow Sean Hull on twitter @hullsean.

One really struck me that I thought was important, about the whole devops movement.

How do you define devops?

Viktor Farcic: Moving on to a more general subject, how would you define DevOps? I’ve gotten a different answer from every single person I’ve asked.

Sean Hull: I have a lot of opinions about it actually. I wrote an article on my blog a few years ago called The Four-Letter Word Dividing Dev and Ops, with the implication being that the four-letter word might be a swear word, akin to the development team swearing at the operations team, and the operations team swearing at the development team. But the four-letter word I was referring to was “risk.”

Related: Why generalists are better at scaling the web

To summarize my article, in my view the development and the operations teams of old were separate silos in business, and they had very different mandates. Developers are tasked with writing code to build a product and to answer the needs of the customers, while directly building change into and facilitating a more sophisticated product. So, their thinking from day to day is about change and answering the requirements of the products team.
On the other hand, the operations team’s mandate is stability. It’s, “I don’t want these systems going down at 2:00 a.m.” So, over the long term, the operations teams are thinking about being as conservative as possible and having fewer moving parts, less code, and fewer new technologies. The simpler your stack is, the more reliable and robust it is, and the less likely it is to fail. I think the traditional reason why developers and operations teams were separated into silos was because of those two very different mandates.

Also: Walking the delicate balance of transparency

They’re two different ways of prioritizing your work when you think about the business and the technology. However, the downside was that those teams didn’t really communicate very well, and they were often at each other’s throats, pushing each other in opposite directions. But to answer your question, “what is DevOps?” I think of DevOps as a cultural movement that has made efforts to allow those teams to communicate better, and that’s a really good thing.

Read: One time in 2013 I had to take the fall

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Can humility help engagements succeed?

I was reading this article on Vox recently titled Intellectual Humility: the importance of knowing you might be wrong.

It caught my attention, and I think we can expand on it a bit. Here are my thoughts.

1. Admitting when you’re wrong

Of course we’ve all had moments when we’re wrong. We make a proclamation, which turns out wrong. We measure something incorrectly. Or we forecast imprecisely.

It is hard to stand on the stage. The spotlight is on you. And when you do that you can be the object of criticism, and speculation. Just like everyone you may make mistakes, but when the spotlight is on you, it can weigh heavier.

That is exactly the time to be a bit humble, acknowledge your thought process, and where you went wrong. By standing up and admitting your mixup, you will come out the other side stronger.

Related: How can we keep cloud architectures simple

2. Admitting you might be wrong

This can be harder. As engineers we like to problem solve. We spend years exploring math & science, looking for the “truth”. The more one searches for it though, sometimes the more elusive it can be.

Measurements are never exact. And theories and architectures often fail in the face of real world traffic. Applications fail. Servers fail. Outages happen. Customers, especially paying ones, will inevitably get angry, and this can backfire onto you.

Be prepared for the real world. It gets messy.

Also: What hidden things does a deposit reveal?

3. Allowing space for others to be wrong

This is a tricky one. You may know what others don’t, but it may take finesse to share that truth. You may have to sell your perspective, even while another perspective may be measurably wrong.

Be prepared to sometimes let things break a little. As hard as this is, it may allow for others to learn.

Like immunizing, sometimes failure can teach what words cannot.

Read: Can communication mixups sour an engagement?

What types of management problems plague startups?

Being an avid reader of Fred Wilson’s AVC, I’ve learned much over the years. And one thing he underscores is that *ideas* are a dime a dozen. And that great investments are in team & execution.

As a long time consultant, I’ve had the good fortune to see a lot of startups under the microscope. If you work a full-time role for 2 years, over a decade you may work at 5 companies. In the same amount of time, I’ve worked at over 65 companies.

In those years I’ve encountered great teams that are super organized, and continue to move the product forward. But I have also seen a number of symptoms that caused business problems and slowed down their march forward.

Low morale

One firm I worked at a few years back was in the space around education, specifically with a lot of microlearning products, with big customers doing corporate onboarding.

Their sales team was world class, closing bigger and bigger deals, but engineering had terrible and festering problems. As it went, they grew to have hundreds of employees in a matter of a year or two. Meanwhile the CTO was not a big people person. He didn’t like speaking in front of large groups, nor was he very hands on. When it was a small ten person startup he was super technical and talented, but as the company grew so fast, it left a leadership vacuum.

And then some bad hires grew the engineering team fast. But internally there was a lot of infighting. The original founding team worked hard and had strong direction, but the new hires all vied for control. And the ugly personalities reared their heads.

After a few short weeks, half the engineering team quit within a matter of days. A tough blow to a team already struggling to keep up with growth.

It is not easy to right a large plane in mid flight like that, carrying plenty of technical debt besides.

Related: A CTO must never do this

Bad alignment

Another place I had the pleasure of working at was a well known digital media brand that expanded into film production, recording and even investigative reporting. For all its wide ranging efforts, it presided over a huge growth business, with seemingly unlimited revenue. Impressive to be sure.

On the technology side, however, things were not so sunny. As their business grew, they planned to consolidate data from many disparate divisions. And this is a process that many growing businesses go through. Finance in one platform & database, bookings & production in another, while analytics and viewer statistics sit in yet a third. But how to report on all of that data?

As a special crack team of big data experts, we were assigned the task of building out this centralized repository of business truth. And as we built and architected that system, we needed to work closely with the operations division.

Now in this business, they were using public cloud, Amazon Web Services like many other startups. However they had a separate team of devops who presided over these accounts.

As our team was handed strict deadlines to deliver working reports & systems, we had conference calls with the Devops team. However that team was not on board with those deadlines. They pushed back and claimed such systems would take months to set up.

As we explained the expectations being pushed on our shoulders, Devops said “just push back and say no”. They advised that we “send it back up the chain”.

But what if there’s a chink in the chain?

Clearly the two teams were not aligned at all on deadlines & deliverables. And that’s not a fault of either of those teams. It straightaway falls in the lap of management to align those.

And we were somehow stuck in the middle. Ugh!

Related: How to avoid legal trouble in consulting

Loose discipline

One startup I worked at had a security and authentication app.

Here teams were fairly happy on the whole. In fact they raved about having a great boss. Indeed the boss was a very kind leader, understanding, patient, and hardworking.

However, over and over, we lacked a “decider”. Here other team members were giving each other tasks. Promises were made loosely, and then forgotten one or two weeks later. And a constant lack of direction dragged down delivery.

For my money, a promise to have a meeting at 10am is one all parties should keep, whatever their level in the business. That means not being late, not making excuses about trains, and not simply skipping the meeting with no explanation. These types of habits cause the team to grow weary, and lower the bar of expectations.

Frustrating indeed.

Related: When you have to take the fall

What does the failure of Flatiron School mean for coding bootcamps?

If you missed the news, New York’s AG announced a settlement with Flatiron School over operating without a license and false advertising.

The news of the coding bootcamp failure splashed all over Hacker News a couple of days ago.

The explosion of coding bootcamps in recent years speaks volumes, as demand for coding & software skills continues to outpace supply. What’s more, the starting salaries aren’t bad either.

But how will this affect coding bootcamps going forward?

Would you like a helping of bureaucracy?

Part of the ruling was regarding licensed teachers…

“In order to obtain a SED license, a non-degree granting career school must meet a number of criteria, including using an approved curriculum and employing a licensed director and teachers.”

One thing that sets coding bootcamps apart is that they train their own teachers. And they also use their own curricula. And while protecting consumers is certainly a worthwhile goal, the ruling means bootcamps will have to navigate government bureaucracy for approval. Some have pointed out that the process can be slow & full of red tape. Which is sort of counter to the whole agile startup private industry philosophy. We’ll see!

Also: Is Amazon too big to fail?

$75k after 3 months?

One of the claims their marketing made was that many students were making $75k after a few months of study. The ruling underscored this as particularly misleading.

As anyone who has studied computer science knows, there are a lot of foundational concepts in logic, mathematics & problem solving, which you don’t develop overnight. Hopefully this ruling will hammer home the idea that it takes a little bit more time, folks!

Read: Is AWS too complex for small dev teams? The growing demand for Cloud SRE

Does it please the crown?

One of the comments on Hacker News asks “Does it please the crown?”. By slapping these guys on the wrist, the barrier to entry will be higher. Going forward, they will have to pass more hurdles & government bureaucracy.

One of the things that sets coding schools apart is that they can train their own teachers & build their own syllabus. We’ll see if these new hurdles slow things down or not.

Related: How to build an operational datastore on Amazon Redshift with S3

Billion dollar Google bet

Almost as if to provide a counterpoint, timely news comes out of the Google camp. They have committed to $1 billion in grants to train workers.

It seems demand for coding skills will be strong for the foreseeable future.

How can startups learn from the Dyn DNS outage?

As most have heard by now, last Friday saw a serious DDoS attack against one of the major US DNS providers, Dyn.

DNS being such a critical dependency, this affected many businesses across the board. We’re talking Twitter, Etsy, GitHub, Airbnb & Reddit to name just a few. In fact Amazon Web Services itself was severely affected. And with so many companies hosting on the Amazon cloud, it’s no wonder this took down so much of the internet.

1. What happened?

According to Brian Krebs, a Mirai botnet was responsible for the attack. What’s even scarier, those requests originated from IoT devices. You know, baby monitors, webcams & DVRs. You’ve secured those right? 🙂

Brian has posted a list of the IoT device makers involved, whose products have backdoors & default passwords. Interesting indeed.

Also: Is a dangerous anti-ops movement gaining momentum?

2. What can be done?

Companies like Dyn & Cloudflare among others spend plenty of energy & engineering resources studying attacks like this one, and figuring out how to reduce risk exposure.

But what about your startup in particular? How can we learn from these types of outages? There are a number of ways that I outline below.

Also: How do we lock down systems from disgruntled engineers?

3. What are your dependencies?

After an outage like the Dyn one, it’s an opportunity to survey your systems. Take stock of what technologies, software & services you rely on. This is something your ops team can & likely wants to do.

What components does your stack rely on? Which versions are hardest to upgrade? What hardware or services do you rely on? Which APIs do you call out to? Which steps or processes are still manual?

Related: The myth of five nines

4. Put your eggs in many baskets

Awareness around your dependencies helps you see where you may need to build in redundancy. Can you set up a second cloud provider for DR? Can you use an alternate API to get data, when your primary is out? For which dependencies are your hands tied? Where are your weaknesses?
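
As a rough illustration of the alternate API question, here is a minimal Python sketch of trying providers in order. The functions `primary_api` and `secondary_api` are hypothetical stand-ins for whatever services you depend on, not real endpoints; the point is only the fallback pattern.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def fetch_with_fallback(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # real code should catch narrower exceptions
            errors.append(exc)
    raise AllProvidersFailed(f"{len(errors)} provider(s) failed: {errors}")


def primary_api() -> str:
    # Hypothetical primary source that is currently down
    raise ConnectionError("primary unreachable")


def secondary_api() -> str:
    # Hypothetical alternate source for the same data
    return "data from secondary"


if __name__ == "__main__":
    print(fetch_with_fallback([primary_api, secondary_api]))
```

A fallback like this only helps if the second basket is truly independent. Two APIs that both sit behind the same DNS provider would have gone down together.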

Read: Is AWS too complex for small dev teams?

5. Don’t assume five nines

The gold standard in technology & startup land has been 5 nines availability. This is the SLA we’re expected to shoot for. I’ve argued before (see: myth of five nines) that it’s rarely ever achieved. Outages like this one, bringing hours-long downtime, kill your 5 nines promise for years. That’s because 5 nines means only about 5 ¼ minutes of downtime per year!

Better to be realistic that outages can & will happen, manage & mitigate, and be realistic with your team & your customers.
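
The five nines arithmetic is easy to check yourself. The sketch below assumes a 365 day year, and the three hour outage is a made-up figure for illustration, not the measured length of the Dyn incident.

```python
# Downtime allowed per year for a given number of "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def years_of_budget_used(outage_minutes: float, nines: int) -> float:
    """How many years of downtime budget a single outage consumes."""
    return outage_minutes / downtime_minutes_per_year(nines)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
    # A hypothetical three-hour outage measured against a five nines target
    print(f"{years_of_budget_used(180, 5):.1f} years of budget used")
```

With these assumptions, a single three hour outage eats roughly 34 years of a five nines budget.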

Also: Is AWS a patient that needs constant medication?

How 1and1 failed me

I manage this blog myself. Not just the content, but also the technology it runs on. The systems & servers are from a hosting company called 1and1.com. And recently I had some serious problems.

The publishing platform, WordPress, was a few versions out of date. Because of that some vulnerabilities surfaced.

1. Malware from Odessa

While my eyes were on content, some Russian hackers managed to scan my server &, due to the older version of WordPress, found a way to install some malware onto the box. This would be invisible to most users, but was nevertheless dangerous. As a domain name with a fifteen year life, it has some credibility among the algorithms & search engines. There’s some trust there.

Google identified the malware, and emailed me about it. That was the first alert I got, in mid-August. It was a few days before I left for vacation, but given the severity of it, I jumped on the problem right away.

Also: Why I say Always be publishing

2. Heading off a lockout

I ordered up a new server from 1and1.com to rebuild. I then set to work moving over content, and completely reinstalled the latest version of WordPress.

Since it was within the old theme that the malware files had been hidden, I eliminated that whole directory & all files, and configured the blog with the newest WordPress theme.

Around that time I got some communication from 1and1. As it turns out they had been notified by Google as well. Makes sense.

Given the shortage of time, and my imminent vacation, I quickly called 1and1. As always their support team was there & easy to reach. This felt reassuring. I explained the issue, how it occurred and all the details of how the server & publishing system had been rebuilt from the ground up.

This was around August 24th. As I had received emails about a potential lockout, I was reassured by the support specialist that the problem had been resolved to their satisfaction.

Read: Do managers underestimate operational cost?

3. Vacation implosion

I happily left for vacation knowing that all my hard work had been well spent.

Meantime around August 25th, 1and1.com sent me further emails asking me for “additional details”. Apparently the “I’m going on vacation” note had not made it to their security division. Another day goes by and since they received no email from me the server was locked!

Being locked means it is completely unreachable. Totally offline. No bueno! That’s certainly frustrating, but websites do go down. What happened next was worse.

Since I use Mailchimp to host my newsletter, I write that well in advance each month. Just like clockwork the emails go out to my 1100 subscribers on September 1st. Many of those are opened & hundreds click on the link. And there they are faced with a blank screen & browser. Nothing. Zilch! Offline!

Also: Why I use Airbnb chat even when texting is easier

4. The aftermath

As I return to connectivity, I begin sifting through my emails. I receive quite a few from friends and colleagues explaining that they couldn’t view my newsletter. I immediately remember my conversation with 1and1, their assurances that the server won’t be locked out, and that all is well. I’m thinking “I bet that server got locked out anyway”. Damn it, I’m angry.

Taking a deep breath, I call up 1and1 and get on the line with a support tech. Being careful not to show my frustration, I explain the situation again. I also explain how my server was down for two weeks and how it was offline during a key moment when my newsletter goes out.

The tech is able to reach out to the security department & explain things again. Without any additional changes to my server or technical configuration they are then able to unlock the server. Sad proof of a bureaucratic mixup if there ever was one.

Also: Is Amazon too big to fail?

5. Reflections on complexity

For me this example illustrates the complexity in modern systems. As the internet gets more & more complex, some argue that we are building a sort of house of cards. So many moving parts, so many vendors, so many layers of software & so many pieces to patch & update.

As things get more complex, there are more cracks for the hackers to exploit. And patching those up becomes ever more daunting.

Related: Are we fast approaching cloud-mageddon?

When fat fingers take down your business

Github goes nuclear

I was flipping through Reddit last night, and hit this crazy story about strange pushes on GitHub. For those who don’t know, GitHub houses source code. It’s version control for the software world. Lots of projects use it to keep track of change management.

Jenkins is a continuous integration platform. Someone working on the project accidentally did a force push up to the server. They overwrote not only their own work, but the work of hundreds of other plugins unrelated to their own project.

This is like doing a demolition to put up a new building, and taking down all the buildings on your block and the next. Not very neighborly, to say the least. At the time of this writing, they’re still doing cleanup and digging through the rubble.

Read: Why DynamoDB can increase availability

How to kill a database

I worked at a startup a few years back that had an interesting business model. Users would sit and watch videos, and get paid for their time. Watch the video, note the code, enter the code, earn cash. Somehow the advertisers had found a way to make this work.

The whole infrastructure ran on Amazon EC2 servers, and was managed by RightScale. Well, it was actually managed by a west coast outsourcing shop, whose specialty was managing deployments on RightScale.

The site kept its information in a MySQL database. They had various scripts to spin up slaves, remaster, switch roles and so forth. Of course MySQL can be finicky and is prone to throwing surprises your way from time to time.

One time this automation failed in a big way, switching over production customers to a database that took way way way too long to rebuild. As their automation didn’t perform checksums to bulletproof the setup, it couldn’t know that the data hadn’t finished moving!

Customers sure did notice though when the site fell over. Yes, this was a failure of automation. Not of the RightScale platform, though, but of the outsourcing firm managing the process, checking the pieces and components and ensuring the computer systems did their thing to completion. Huge fail!
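
To make the missing check concrete, here is a minimal sketch of a checksum gate in Python. It is not the firm’s automation or anything RightScale provides; the row shape and the `safe_to_promote` name are assumptions for illustration. The idea is simply to refuse a switchover until the new copy hashes the same as the source.

```python
import hashlib
from typing import Iterable, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (primary key, value)


def table_checksum(rows: Iterable[Row]) -> str:
    """Hash rows in primary-key order so equal data yields equal digests."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def safe_to_promote(primary_rows: Iterable[Row], replica_rows: Iterable[Row]) -> bool:
    """Allow a switchover only when the replica matches the primary."""
    return table_checksum(primary_rows) == table_checksum(replica_rows)


if __name__ == "__main__":
    primary = [(1, "alice"), (2, "bob"), (3, "carol")]
    half_built_replica = [(1, "alice"), (2, "bob")]  # copy still in progress
    print(safe_to_promote(primary, half_built_replica))  # False
    print(safe_to_promote(primary, list(primary)))       # True
```

On a real database you would compute the checksums server side rather than pulling every row into a script, but the gate itself works the same way.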

Read: Why devops talent is in short supply

Your website will fail

Sites big and small fail. Hopefully these stories illustrate that fact. I’ve said over and over why perfect availability is a pipe dream.

At the end of the day, the difference between the successful sites and the sloppy ones isn’t failure versus perfection. It’s *how* they fail, and how they get back up on their feet. What type of planning did they do for disaster recovery, like many firms in NYC did before and after Sandy?

Also: Why startups need both devs and ops for scalability

Reducing failure

So instead of thinking about eliminating failure, let’s think about *reducing* how often it happens, and when it does, reducing the fallout. One thing you can do is sign up for Scalable Startups, where we share tips once a month on the topic. Meanwhile try to put these best practices into play.

1. Test your DR plan by running real life fire drills
2. Use more than one hosting provider, data center or cloud provider
3. Give each op or end user the least privileges they need to do their job
4. Embrace a culture of caution in operations
5. Check, double check and triple check those fat fingers!

Read this: Why a four letter word divides dev and ops

Road War Story – Hacking Inflight Solutions

 

The 2am phone call

Last summer I got my call from the president at 2am.  Actually it was my former boss at Hollywood Reporter.  I had worked there three months previous, and they had since hired an outsourced DBA solution.  Big outsource, big chops.  And big fail.

 

 

12 hours to liftoff

I was scrambling to pack my luggage to go on summer vacation.  I was bound for SF at the moment and my flight was leaving in the morning.  I was trying to wrap up loose ends and my former boss was entreating me – “Can you help us?  Our replication setup has just melted down.  We need you to cleanup the mess.”

The so-called pain point

After a few more early am Skype calls and chats, the team retired for the night and I finished packing my bags.   I snuck in an hour of sleep then headed straight for the airport.  Once through airport security, I bust out my laptop and start logging into the servers.

Although the exact cause of the replication failure remained opaque, I was asked to scan both databases and determine differences.  Out of my toolbox comes the perfect tool for the job, pt-table-checksum, and I run scans on both databases.  (For the curious, here is how) I find countless records different between the two databases.
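
For readers who have not used a checksum tool, the underlying idea can be sketched in a few lines of Python. This is not how pt-table-checksum is invoked or implemented; the in-memory tables, the chunk size and the function names are all made up for illustration. It shows why chunked checksums are handy: you learn which ranges of keys disagree, not just that something differs.

```python
import zlib
from typing import Dict, List

Table = Dict[int, str]  # hypothetical shape: primary key -> row contents


def chunk_checksums(table: Table, chunk_size: int) -> Dict[int, int]:
    """Map each chunk index to a CRC32 over the rows whose keys fall in it."""
    checksums: Dict[int, int] = {}
    for key in sorted(table):
        chunk = key // chunk_size
        row_bytes = f"{key}:{table[key]}".encode("utf-8")
        # Feed the running CRC for this chunk back in as the start value
        checksums[chunk] = zlib.crc32(row_bytes, checksums.get(chunk, 0))
    return checksums


def differing_chunks(primary: Table, replica: Table, chunk_size: int = 1000) -> List[int]:
    """Return the chunk indexes whose checksums differ between the two copies."""
    a = chunk_checksums(primary, chunk_size)
    b = chunk_checksums(replica, chunk_size)
    return sorted(c for c in a.keys() | b.keys() if a.get(c) != b.get(c))


if __name__ == "__main__":
    primary = {i: f"row{i}" for i in range(10)}
    replica = dict(primary)
    replica[7] = "changed"  # a row that drifted
    del replica[2]          # a row that never arrived
    print(differing_chunks(primary, replica, chunk_size=3))  # [0, 2]
```

Once you know which chunks differ, you can decide whether to repair them row by row or, as we did here, rebuild replication from scratch.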

Now my flight is boarding, so I pack up the laptop and find my seat. As soon as the seat belt lights flash off, I’m flipping open my MacBook and getting inflight wifi working. Through the flight I’m on Skype with the team, with command line terminals open to the servers. Discuss, debug, troubleshoot – rinse, repeat.

From there I write up a report and explain to the team & CTO the problem.  Syncing that many different records is too risky.  We’d have to review all the statements one-by-one.  I’d rather rebuild replication from scratch.

From there the CTO gives the go ahead, and with the help of Percona’s xtrabackup to do online hot backups, we are able to fix replication without downtime. Amen to that!

Now with our primary MySQL database and secondary read-only one back online, things calm down a lot. Traffic returns to a smooth predictable 2 million pageviews per day. That’s smooth and predictable on a site that gets 50 million a month! The database loads are calm and steady, as are all of our nerves. In the coming days we continue to monitor the situation, and write up a lengthy root cause analysis.

Freelancers & Consultants take note

To my recent Consulting 101 article I would add the following bullets:

  • Responsiveness is crucial
  • Be there when a client needs you, and your value goes up.  Be reliable, and loyal to those you’ve worked with.

  • Be an integral part of your team
  • Everyone knows each other, virtually or in real life, and is comfortable with the part they play. A team that can work together is crucial, whether it’s all fulltime folks, some consultants, some outsourced or wherever they may be. Each has a role to play, and communication and teamwork bring it all together.

  • Have laptop will travel
  • I never turn down a job.  There will be plenty of time for vacations and rest when the dust settles.

  • Don’t break things
  • If there is any doubt in your mind, test, and test again.  Always err on the side of caution.  Check thrice and cut once!  If you haven’t done an operation ten, twenty or fifty times before, experiment a few more times with options to be sure.  And most importantly, if you don’t login to the systems you’re working on regularly, you better make damn sure you’re on the right box, flipping the right switch, and moving the right dials.  With modern internet infrastructure, there are a hundred ways to push the wrong red button!

CTOs and Directors of Operations take note

  • Small & Nimble wins the day
  • I’ve used this value proposition before when speaking to prospects.  You can hire a big firm, and be a small fish to them.  Small fish means you’re gonna get less attention.  OR you can hire a small firm or contractor.  Then you’ll be a big fish to him or her.  Guess what?  If you’re their big fish, they’re gonna pay extra attention to every move they make, and ensure things don’t break.  They can’t afford mistakes, not to their reputation or their bottom line.  Not like the big boys can.

  • Choose passionate, yet conservative & risk averse operations folks
  • With developers you’re building technology, features, and forging ahead into new solutions. The role is more to create waves, and break barriers. How can we enable new business processes and so forth?

    In hiring operations personnel you want stability. Look for individuals who are more risk averse. This conservative streak is a countering force. Ops teams are tasked with the job of bringing a steady state to your business services. They don’t want to wake up at 2am.