How organizations can move faster with devops – a16z Sonal Chokshi interviews Nicole Forsgren & Jez Humble


We hear a lot about devops these days, and the promise is tremendous. It originally evolved out of Agile operations. But how do I get those benefits at *my* organization?

Join 38,000 others and follow Sean Hull on twitter @hullsean.

How do we become a high performing organization, to move faster and build more secure and resilient systems? That’s the $64,000 question!

A16Z strikes again! Andreessen Horowitz’s epic podcast hosts world-class guests on all sorts of startup & new technology topics. This week they interview Jez Humble and Nicole Forsgren. They run DORA, which stands for DevOps Research and Assessment, and which shows organizations how to get the advantages of devops in the real world.

Technology does not drive organizational performance

Check out section 16:04 in the podcast…


“the point of distinction comes from how you tie process and culture together technology through devops”

It’s the classic Amazon model. They’re running hundreds of experiments in production at any one time!

Related: The 4 letter word dividing dev and ops

Day one is short, day two is long

The first interesting quote that caught my attention was at 4:40…


“Day one is when we create all of these systems. Day two is when we deploy to production. We have to deploy and maintain forever and ever and ever. We hope that day two is really long.”

As a longtime ops person, this really resonates with me. Think of brownfield deployments, where a wave of developers has already finished and left, and you’re the one trying to manage what remains. Not easy!

Related: Why generalists are better at scaling the web

Mainframes or Kubernetes?

What about tooling? Is that important? Here’s what Jez has to say. Jump to 29:30…


“Implementing those technologies does *not* give you those outcomes. You can achieve those results with Mainframes. Equally you can use Kubernetes, Docker and microservices and not achieve those outcomes.”

Related: Is Amazon too big to fail?

Reducing Friction

Fast forward to timecode 28:45…


“Conway’s Law: Organizations which design systems are constrained to produce designs that are copies of the communication structures of these organizations.”

In other words, your software ends up shaped like the organization itself and the way its people communicate. Super interesting. 🙂

Related: 6 devops interview questions

Get more. Grab our exclusive monthly Scalable Startups. We share tips and special content. Our latest Why I don’t work with recruiters

Five More Things Deadly to Scalability

The.Rohit - Flickr

Join 19k others and follow Sean Hull on twitter @hullsean.

1. Slow Disk I/O – RAID 5 – Multi-tenant EBS

Disk is the foundation of all your servers, and the base of their performance. True, with larger and larger main memory much is available in cache, but a server still needs to constantly read from disk and flush things from memory. So it’s a very important component of performance and scalability.

What’s wrong with RAID 5?

RAID 5 was designed to give you more space, using fewer disks. It’s often used in a server with few slots, or because ops misunderstand how badly it will impact performance. On a database server it can be particularly bad.

All writes see a performance hit. What’s worse, if you lose a disk, the RAID, though technically still online, will perform SO SLOWLY as to be effectively offline. And a rebuild takes many hours. Worse still is the risk of losing another drive during that rebuild. What if you have to order a drive and it takes a couple of days?

RAID 10 is the solution. Mirror each set of two drives, then stripe over those. Even with only four slots available, it’s worth it. Good read performance, good write performance, and protection.
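To put rough numbers on that tradeoff, here is a minimal back-of-the-envelope sketch. It assumes the common rule of thumb that a small random write costs about four disk operations on RAID 5 (read data, read parity, write data, write parity) and two on RAID 10 (one per mirror). The function name and the 150 IOPS per-disk figure are made up for illustration, and real controllers with write caches will behave differently.

```python
def raid_summary(level, disks, disk_tb, disk_iops):
    """Estimate usable capacity and random-write IOPS for RAID 5 or RAID 10.

    The write penalties are rule-of-thumb figures, not measurements.
    """
    raw_iops = disks * disk_iops
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        usable_tb = (disks - 1) * disk_tb   # one disk's worth of space holds parity
        write_penalty = 4                   # read data + read parity + write data + write parity
    elif level == "raid10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        usable_tb = (disks // 2) * disk_tb  # every disk has a mirror
        write_penalty = 2                   # each write lands on both halves of a mirror
    else:
        raise ValueError(f"unsupported level: {level}")
    return {
        "usable_tb": usable_tb,
        "write_penalty": write_penalty,
        "write_iops": raw_iops / write_penalty,
    }


# Four 1 TB disks at an assumed 150 IOPS each, the "only four slots" case.
for level in ("raid5", "raid10"):
    print(level, raid_summary(level, disks=4, disk_tb=1.0, disk_iops=150))
```

Under these assumptions, four disks in RAID 5 give 3 TB usable but only about 150 write IOPS, while RAID 10 gives 2 TB and about 300. You trade a disk of capacity for roughly double the write throughput, before even counting the rebuild risk.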

What the heck is multi-tenant?

In the cloud, you share servers, network & disk just like you do apartments in a building. Hence the name. Amazon’s EBS or elastic block storage, extends this metaphor, offering you the welcome flexibility of a storage network. But your bottleneck can be fighting with other tenants on that same network.

Default servers do have this problem. However, AWS has addressed it with a little-known but VERY useful feature called Provisioned IOPS. It’s a technical name, but it means you can lock in reliable disk I/O. Just what the scalability doctor ordered.
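For the curious, here is a hedged sketch of what requesting such a volume might look like from Python. It only builds the parameters. The helper name, the availability zone, the size and the IOPS figure are all placeholders I picked for illustration, and the actual call (shown in a comment) assumes you have boto3 installed and AWS credentials configured.

```python
def provisioned_iops_volume_params(availability_zone, size_gib, iops):
    """Build keyword arguments for an EBS volume with Provisioned IOPS.

    "io1" is a Provisioned IOPS volume type; the values passed in are
    illustrative, so check current AWS limits before using real numbers.
    """
    return {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,        # volume size in GiB
        "VolumeType": "io1",     # Provisioned IOPS SSD
        "Iops": iops,            # the I/O rate you want to lock in
    }


params = provisioned_iops_volume_params("us-east-1a", size_gib=100, iops=2000)
print(params)

# With boto3 and credentials in place, the request would look roughly like:
#   import boto3
#   boto3.client("ec2").create_volume(**params)
```

Keeping the parameters in a plain dictionary means you can review or log exactly what you are about to ask AWS for before spending money on it.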

Check out our original post: 5 Things Toxic to Scalability

2. Using the database for Queuing

MySQL is good at a lot of things, but it’s not ideal for managing application queues. Do you have a table like JOBS in your database, with a status column including values like “queued”, “working”, and “completed”? If so you’re probably using the database to queue work in your application.

It’s not a great use of MySQL because of locking problems that come up, as well as the search and scan to find the next task.

Luckily there are great solutions for developers. RabbitMQ is a great queuing solution, as is Amazon’s SQS. What’s more, as external services they’re easier to scale.
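To see why a real queue is a better fit, here is a small sketch of the handoff pattern. It uses Python’s standard library `queue.Queue` purely as an in-process stand-in for a broker like RabbitMQ or SQS, so it is not how you would talk to either service. The function name and job names are hypothetical.

```python
import queue
import threading


def run_jobs(jobs, worker_count=2):
    """Hand jobs to workers through a queue instead of a status column."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = work.get()          # blocks until a job arrives: no table scan, no row locks
            if job is None:           # sentinel value tells this worker to stop
                work.task_done()
                break
            with results_lock:
                results.append(f"done:{job}")
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:
        work.put(job)                 # the producer just enqueues and moves on
    for _ in threads:
        work.put(None)                # one stop sentinel per worker
    work.join()                       # wait until every item has been processed
    for t in threads:
        t.join()
    return sorted(results)


print(run_jobs(["resize-image", "send-email", "build-report"]))
```

Each job is delivered to exactly one worker, and workers sleep until there is something to do. With a JOBS table you have to build both of those behaviors yourself out of locks and polling queries.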

Scalability becomes key to your business as your customer base grows. But it doesn’t have to be impossible. Disk I/O, caching, queuing and searching are all key areas where you can make a big dent, in a manageable way. Juggle your technical debt too, and you’re golden!

Also take a look at: Why Generalists are Better at Scaling the Web

3. Using Database for full-text searching

Oracle has full-text search support, so why shouldn’t we assume the same in MySQL? Well, MySQL *does* have this, but in many versions only with the old MyISAM storage engine. MyISAM has its own set of corruption problems, and isn’t really very performant.

Better to use a proven search solution like Apache Solr. It is built specifically for search, includes excellent library support for developers of most modern web languages, and best of all is easy to SCALE! Just add more servers on your network, or distribute them globally.

For folks interested in the bleeding edge, full-text search is coming to InnoDB, the crash-safe & transactional storage engine, in MySQL 5.6. That said, you’re still probably better off going with an external solution like Solr, or Sphinx with the MySQL Sphinx SE plugin.


How to find A Mythical MySQL DBA

4. Insufficient Caching at all layers

Cache, cache, and cache some more. Your webservers should use memcached or another solid object cache between them & the database. All those little result sets will sit in resident memory, waiting for future web pages that need them.
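Here is a minimal sketch of that idea, often called a read-through cache. A plain Python dictionary stands in for memcached, and `slow_query` is a made-up placeholder for a real database call, so treat this as an illustration of the pattern rather than production code.

```python
import time


class ReadThroughCache:
    """Tiny in-memory stand-in for an object cache such as memcached."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db     # called only when the cache cannot answer
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}              # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1            # served from memory, the database is never touched
            return entry[0]
        self.misses += 1
        value = self._load(key)       # fall back to the database
        self._store[key] = (value, now + self._ttl)
        return value


db_calls = []


def slow_query(user_id):
    """Hypothetical database lookup; records each call so we can count them."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = ReadThroughCache(slow_query, ttl_seconds=30.0)
first = cache.get(42)    # miss: goes to the "database"
second = cache.get(42)   # hit: answered from memory
print(len(db_calls), cache.hits, cache.misses)
```

The expiry time is the knob to watch. Too short and the database still does most of the work, too long and pages show stale data.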

Next use a page cache such as Varnish. This sits in front of the webserver. Think of it as a mini-webserver that handles very simple pages, but at very high speed. Like a pack of motorbikes riding down an otherwise packed freeway, it zips through the simple requests and frees up your webserver to do more complex work.

Browser caching is also important. But you can’t get at your customers’ browsers, or can you? Well, not directly, but you can instruct them what things to cache. Do that with proper expires headers. Have your system administrator configure Apache to support this.
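To make "expires headers" concrete, here is a small sketch of the two response headers involved. It is written in Python only to show the values a server would send; on Apache you would normally set these through server configuration rather than code. The function name and the 30-day lifetime are my own assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def caching_headers(now, max_age_seconds):
    """Return the headers that tell a browser how long it may reuse a response."""
    expires_at = now + timedelta(seconds=max_age_seconds)
    return {
        # Relative lifetime, understood by modern browsers and proxies.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        # Absolute expiry date in HTTP date format, for older clients.
        "Expires": format_datetime(expires_at, usegmt=True),
    }


thirty_days = 30 * 24 * 3600
headers = caching_headers(datetime.now(timezone.utc), thirty_days)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Long lifetimes like this suit static assets such as images, CSS and JavaScript. Pages that change often want a much shorter value, or none at all.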

Also: Tweaking Disqus to Find Experts & Drive Traffic

5. Too much technical debt

Technical debt can bite. What is it? As you’re developing a new idea, you’ll build prototypes. As those get deployed to customers, change gets harder, and things you glossed over in the past become problems. One team leaves, another inherits the application, and the problems multiply. Over time you build up technical debt as your team spends more time supporting old code and fixing bugs, and less time building new features. At some point a rewrite of the problem code becomes necessary.

Also: How I increased my blog pagerank to 5
