
What can we learn from NASA’s AWS fail?


I was just reading The Register, which is sort of the UK's version of Slashdot, and they ran a jaw-dropping headline: NASA moved 247 petabytes into AWS, and only later learned about egress costs.

OMG! Face palm. Wow.


To say this is a disaster is an understatement. Could it have been prevented? Not by strategic thinking alone. I believe a certain amount of real-world testing & prototyping is the only way.

Here are my thoughts…

1. Expect hidden costs

Every time I check, there are more AWS services. Just now I did some googling, and the count stands at around 170. Not only is it tough to keep up with all of them, but the offerings are constantly evolving, gaining new features and so forth. That means the pricing and costs are evolving too.

All this adds up to a dizzyingly complex web of interconnecting pieces, so it is nearly impossible to predict costs in advance. The solution? Prototype.

And that is what would have saved NASA. When you prototype, you build a small version, feature test it, and load test it. Out of all that comes an early cost estimate, and that estimate would have included egress charges.
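You don't even need the prototype to see the order of magnitude. Here's a minimal Python sketch, using assumed per-GB egress rates in the five to nine cent range (an assumption; real AWS data transfer pricing is tiered and changes, so treat these numbers as illustrative), of what pulling the full archive back out might cost:

    # Back-of-the-envelope egress estimate. The per-GB rates below are
    # assumptions; actual AWS pricing is tiered and varies by volume.
    PB_IN_GB = 1024 * 1024              # gigabytes per petabyte

    archive_gb = 247 * PB_IN_GB         # NASA's 247 petabyte archive

    for rate in (0.05, 0.09):           # assumed low and high $/GB rates
        print(f"full egress at ${rate:.2f}/GB: ${archive_gb * rate:,.0f}")

That prints estimates north of ten million dollars, which is exactly the kind of number you want to discover in a prototype, not on an invoice.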

There are no guarantees in this game, but it is surely getting complicated.

Read: How can 1% of something equal nothing?

2. Vendor lock-in is not dead

With the receding of big old-world vendors like Oracle, many have forgotten the shark-like tactics they used on startups.

The model went something like this. Send in the big guns, nicely dressed, to get you on board. Finesse the sale. Offer deep discounts, and get the customer onto Oracle. After a year, maybe two, start squeezing. You'd be surprised how much blood comes out of diamonds.

These days we feel freer to port our applications between cloud vendors, even if almost everybody is on Amazon already. But this NASA story really highlights the great organizational cost of migrating to the cloud. You architect your application around the platform, you do cost planning around it, and so on. Once you're there, it's hard to unravel.

Related: Is Fred Wilson right about dealing in an honest, direct and transparent way?

3. New possible hacking vector

Since costs are tied to usage in the public cloud, this has implications for hacking. If bad actors want to cause you harm, they can now simply use your service more.

Don’t like company A? Write some bots to access them from obscure locations, and ramp up those egress costs. With all the complexity of the cloud, are most firms monitoring for this sort of thing? I don’t see it in my engagements.
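If you want a basic tripwire, CloudWatch can alarm on your estimated bill. A minimal boto3 sketch; the threshold and SNS topic ARN are placeholders, and you must first enable billing alerts in your account preferences:

    import boto3

    # Billing metrics only live in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-spend-guard",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                    # the metric updates every few hours
        EvaluationPeriods=1,
        Threshold=5000.0,                # placeholder: alert past $5,000
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )

It won't tell you who is hammering your endpoints, but it will tell you the meter is spinning.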

Something similar happened to me. I wrote about it in When Mailchimp fraudulently charged my credit card. It really happened. Do I think it was intentional? You'll have to read the article to get my 360-degree take on it.

Related: What mistakes did you make when starting as a consultant?



Viktor Farcic Interview excerpts


I recently did an interview with Viktor Farcic all about operations, DBAs & devops. The first set of excerpts is here: What does devops mean?


Continuing where I left off, I’ve included a few more highlights below. Enjoy!

1. Can I use a tool to migrate to the public cloud?

Viktor Farcic: I've seen quite a few of these tools that tell you: if you buy our tool, we're going to transfer whatever you have to the cloud. For example, Docker announced at the last DockerCon that they're going to put applications in containers without a single change and everything will work. What do you think about that?

Sean Hull: Salespeople often simplify things quite a bit in order to sell a product; in my experience, the devil is in the details. That's not to say an automation tool like that might not be valuable and useful. It might be a good first step to getting your application into the cloud, and an easier route than rebuilding everything one by one. But I doubt it's going to work magically off one script.
EC2 instances, for example, have different performance characteristics, not only in terms of disk I/O, memory, and CPU; on smaller instances, AWS actually throttles network access, so you might spin up an instance and it just might not behave well. It might take time to get right.

In fact, all sorts of things could happen. You might have written MySQL scripts that assume you have root access to the server, and when you rebuild that on RDS you get errors, because you don't have access to those resources on RDS. There's a lot to consider.
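To make that concrete, here's a hedged boto3 sketch comparing instance types before a migration; the point is that the "same" application lands on very different performance envelopes:

    import boto3

    ec2 = boto3.client("ec2")

    # Compare cpu, memory, and the advertised network envelope.
    resp = ec2.describe_instance_types(InstanceTypes=["t3.micro", "m5.large"])
    for itype in resp["InstanceTypes"]:
        print(
            itype["InstanceType"],
            itype["VCpuInfo"]["DefaultVCpus"], "vcpus,",
            itype["MemoryInfo"]["SizeInMiB"], "MiB,",
            "network:", itype["NetworkInfo"]["NetworkPerformance"],
        )

A t3.micro reports "up to" networking, burstable rather than guaranteed, which is exactly the kind of throttling that surprises teams after a lift-and-shift.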

Read: What happened when I offered advice outside my pay grade?

2. How do you adapt to change?

Viktor Farcic: I have the impression that the speed with which new things are coming is only increasing. How do you keep up with it, and how do companies you work with keep up with all that?

Sean Hull: I don't think they do keep up. I've gone to a lot of companies where they've never used serverless. None of their engineers know serverless at all. Lambda, web tasks, and Google Cloud Functions have been out for a while, but I think there are very few companies that are able to really take advantage of them. I wrote another blog post called Is Amazon Web Services Too Complex for Small Dev Teams? where I sort of implied that it is.
I do find a lot of companies want the advantage of on-demand computing, but they really don’t have the in-house expertise yet to really take advantage of all the things that Amazon can do and offer. That’s exactly why people aren’t up to speed on the technology, as it’s just changing so quickly. I’m not sure what the answer is. For me personally, there’s definitely a lot of stuff that I don’t know. I know I’m stronger in Python than I am with Node.js. Some companies have Node.js, and you can write Lambda functions in Java, Node.js, Python, and Go. So, I think Amazon’s investment in new technology allows the platform to evolve faster than a lot of companies are able to really take advantage of it.
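For what it's worth, the barrier to trying serverless is low; a complete Python Lambda function is just a handler:

    # A minimal Python Lambda handler. The event shape here is a
    # made-up example; real events depend on the trigger you wire up.
    def handler(event, context):
        name = event.get("name", "world")
        return {"statusCode": 200, "body": f"hello {name}"}

The hard part isn't the code, it's rethinking your architecture around it, which is exactly where teams fall behind.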
Read: What did Matt Ranney discover scaling Uber to 1000 microservices?

3. What is the future of Devops?

Viktor Farcic: I’m going to ask you a question now that I hate being asked, so you’re allowed not to answer. Where do you see the future, let’s say a year from now?

Sean Hull:
I see more fragmentation happening across the technology landscape, and I think that that is ultimately making things more fragile because, for example, with microservices, companies don’t think twice about having Ruby, Python, Node.js, and Java. They have 10 different stacks, so when you hire new people, either you have to ask them to learn all those stacks or you have to hire people with each of those individual areas of expertise. The same is true with all these different clouds with their own sets of features: there’s a fragmentation happening.
Let’s look at the iPhone as an example. Think about how complex application testing is for Android versus the iPhone. I mean, you have hundreds of different smartphones that run Android, all with different screen sizes, different hardware, different amounts of memory, and the underlying stuff. Some may even have some extra chips that others don’t have, so how do you test your application across all those different platforms?
When you have fragmentation like that, it means the applications end up not working as well. I think the same thing is happening across the technology spectrum today that happened 10 to 15 years ago, where for your database backend there was Oracle, SQL Server, MySQL, and Postgres. Maybe somebody who’s a DB2 enterprise customer uses DB2, but now there are hundreds of open source databases, graph databases, and DynamoDB versus Cassandra, and so on and so on. There’s no real deep expertise in any of those databases.
What ends up happening is you have cases like what happened with customers who were using MongoDB. They found out the hard way about all of the weird behaviors and performance problems it had, because there just weren’t people around with deep knowledge of what was happening behind the scenes, whereas in Oracle’s space, for example, there are career DBAs that are performance experts that specialize in Oracle internals, so you can hire somebody to solve particular problems in that space.
There aren’t, as far as I know, a lot of people with MongoDB internals expertise. You’d have to call MongoDB themselves; maybe they have a few engineers that they can send out, so what’s the future? I see a lot of fragmentation and complexity, and that makes the internet and internet applications more fragile, more brittle, and more prone to failure.

Related: Can humility help you in your career?



Should we right size instances in the public cloud?


I’m a big fan of Corey Quinn’s Last Week in AWS newsletter.

Recently he wrote a piece titled Right sizing your instances is nonsense.

Not to be outdone, blogger Joe at Sunshower.io wrote a counterpoint piece, Why right sizing instances is not nonsense.


So what's the verdict here? Is Corey wrong and Joe right, or the other way around?

I would argue it depends. Corey's piece emphasizes the big picture, essentially that technical changes can buy you more trouble than the money they save, while Joe digs into the weeds of the technical specifics.

1. Corey emphasizes the 300-foot view

Corey's article uses a broad brush, and in doing so some of his specifics are incorrect. For example, take the point about older versions of operating systems and hypervisors. In most cases your OS won't know it's running on a hypervisor at all, so this seems a very rare edge case indeed.

That said, his big-picture conclusion seems spot on. Changing instance sizes can be a huge risk if you don't do it regularly. With legacy apps, who knows how they might behave? Here you carry the same burden as replacing an instance outright: your AMI may not be ready, or you may have manual steps to rebuild the box.

Related: How can we keep cloud architectures simple

2. Joe gets down in the weeds of technical specifics

The first thing in Joe's post that caught my attention was his use of the console to change the instance size. You shouldn't be manually changing instances through the console in the first place. Whatever happened to everything as code, all the way down? Hehe…

Also, his points about cost savings seemed cherry-picked. Still, if you really can save that much money by changing instance sizes, it is well worth the cost of regression, integration & disaster recovery testing to make sure it will all work. Get on it!

Also: What hidden things does a deposit reveal?

3. Use infrastructure as code, test & retest it

IaC is really the way to go. But that still doesn't mean you're out of the woods. Be cautious when changing instance sizes!

With code comes testing. So change your Terraform code, then verify it works. Just because you have a variable for the instance size doesn't mean changing it won't break something.
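One way to verify is to make Terraform itself tell you what a change would do. A Python sketch, assuming your module exposes a variable named instance_type (that name is hypothetical):

    import subprocess

    # terraform plan -detailed-exitcode returns 0 for no changes,
    # 2 for pending changes, and 1 for errors, so CI can gate on it.
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode",
         "-var", "instance_type=m5.xlarge"],
        capture_output=True, text=True,
    )
    if result.returncode == 1:
        print("plan failed:", result.stderr)
    elif result.returncode == 2:
        print("changes pending, review before apply")
    else:
        print("no changes")

Run it against a staging workspace first, and read the plan output like a diff.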

Read: Can communication mixups sour an engagement?

4. Weigh the cost savings with the risk of breaking things

Changing an instance size and redeploying could break all manner of things. It’s possible you used a variable for the instance size in some places and hard coded it in others. Or made some weird reference in an autoscaling group.

It may be that the AMI you've built works on one instance type but not another. Or that your AMI is deployed in one region but not another. Or that your old instance size is available in us-east-2, but your new instance size is not yet available there. Yes, the console wouldn't have offered it, but your Terraform code didn't know any better.
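That last failure mode is cheap to check up front. A boto3 sketch that asks each region whether it actually offers the new type (m5.2xlarge here is just an example):

    import boto3

    # Check instance type availability region by region.
    for region in ("us-east-1", "us-east-2"):
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instance_type_offerings(
            LocationType="region",
            Filters=[{"Name": "instance-type", "Values": ["m5.2xlarge"]}],
        )
        print(region, "offers m5.2xlarge:", bool(resp["InstanceTypeOfferings"]))

Wire a check like this into CI and Terraform never gets the chance to find out the hard way.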

Check out: How I use 5 daily habits to stay on track

5. Put up some guardrails

Down in the comments, Joe suggests putting limits around instance size changes. That makes sense. After you've done your testing, you'll have an idea of what those limits should be: no instance with less than 30GB of memory, none larger than X, none in region Y, and so on.
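Terraform's variable validation can encode limits like these, but you can also enforce them in a pre-flight script. A hypothetical Python sketch of the 30GB floor above (the blocked region is a placeholder for "region Y"):

    import boto3

    MIN_MEMORY_GIB = 30
    BLOCKED_REGIONS = {"ap-southeast-1"}     # placeholder for "region Y"

    def allowed(instance_type: str, region: str) -> bool:
        # Refuse blocked regions outright, then check the memory floor.
        if region in BLOCKED_REGIONS:
            return False
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instance_types(InstanceTypes=[instance_type])
        mem_gib = resp["InstanceTypes"][0]["MemoryInfo"]["SizeInMiB"] / 1024
        return mem_gib >= MIN_MEMORY_GIB

    print(allowed("r5.xlarge", "us-east-1"))  # 32 GiB of memory, passes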

It'll require tweaking your Terraform infrastructure code, so it's not a free change. But it'll pay dividends if your cost savings run into the thousands.

Also: Can daily notes help you work better with clients?



How do we migrate our business to the public cloud?


The public cloud is no longer a bleeding-edge technology for trailblazers. It's mainstream now. As you think about making the move, you consider your customers and the SLAs they've come to expect.


It's no longer a question of if, but when to move to the cloud, how to get there, and how fast the transition should be.

Here are my thoughts on what to start thinking about.

1. Ramp up team, skills & paradigm thinking

Teams with experience in traditional datacenters have certain ways of architecting solutions and thinking about problems. For example, they may reach for NFS servers to host objects, whereas in the cloud you would use object storage such as S3.

S3 has all sorts of new features, like lifecycle policies, and famously redundant eleven nines of durability. But your applications may need to be retrofitted to work with it, and your devs may need to learn the new features and functionality.
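Lifecycle policies are a good example of a feature worth learning early, since they quietly cut storage costs. A minimal boto3 sketch (the bucket name and prefix are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Age objects into Glacier after 90 days, delete them after a year.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-app-archive",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }]
        },
    )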

What about networking? This changes a lot in the cloud, with VPCs, and virtual appliances like NATs and Gateways. And what about security groups?

Interacting with this new world of cloud resources requires new skillsets and new ways of thinking. So priority one is getting your engineering teams learning and upgrading their skills. I wrote a piece about this: How do I migrate my skills to the cloud?

Related: When you have to take the fall

2. Adapt to a new security model

With an old-style datacenter, you typically have a firewall, and everything gets blocked & controlled at that perimeter. The new world of cloud computing uses security groups. These can be applied at the network level, across your VPC, or at the server level. And of course you can have many security groups with overlapping jurisdictions. Here's how you set up a VPC with Terraform.

So understanding how things work in the public cloud is new and challenging. There are ingress and egress rules, ways to audit with VPC flow logs, and more.
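To give a feel for the building blocks, here's a hedged boto3 sketch of a server-level security group with a single ingress rule (the VPC id is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")

    # Create the group, then open https to the world.
    sg = ec2.create_security_group(
        GroupName="web-tier",
        Description="allow https from anywhere",
        VpcId="vpc-0abc1234",              # placeholder VPC id
    )
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "public https"}],
        }],
    )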

Again, though, it's one thing to have the features available; it's quite another to put them to proper use.

Related: When clients don’t pay

3. Adapt to fragile components & networks

While the public cloud collectively is extremely resilient, the individual components such as EC2 instances are decidedly not reliable. It’s expected that they can and will die frequently. It’s your job as the customer to build things in a self-healing way.

That means VPCs with multiple subnets spread across availability zones (multi-AZ), and redundant instances for everything. What's more, you front your servers with load balancers (classic or application), which are themselves redundant.

Whether you are building a containerized application deployed on ECS or a traditional auto-scaling webserver with a database backend, you'll need to plan for failure. And that means infrastructure that detects such failures and reacts to them without downtime for the end user.
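Health checks are the hinge that makes this self-healing. A sketch of an application load balancer target group (the name, path, and VPC id are placeholders):

    import boto3

    elbv2 = boto3.client("elbv2")

    # Instances failing this check get pulled from rotation, and an
    # auto scaling group can then replace them automatically.
    elbv2.create_target_group(
        Name="web-tg",
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0abc1234",              # placeholder VPC id
        HealthCheckProtocol="HTTP",
        HealthCheckPath="/healthz",
        HealthCheckIntervalSeconds=15,
        HealthyThresholdCount=2,
        UnhealthyThresholdCount=3,
    )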

Related: Why I ask for a deposit

4. Build infrastructure as code

You've heard about devops; now it's time to put it into practice. Building your complete stack in code is very possible with tools like Terraform. But you may hit trouble along the way. I wrote about it: I tried to write infra as code with Terraform and AWS and it didn't go as expected.

So there's a learning curve, both for your operations teams, who previously called Rackspace to get a new server provisioned, and for your business, which must learn what incurs an outage and the tricky, finicky sides of managing your public cloud through code.

Related: Why I ask for a deposit

5. Audit, log & monitor

As you automate more and more pieces, you may have less visibility into the overall scope of your deployments. How many servers am I running right now? How many S3 buckets? What about elastic IPs?

Since your automation can itself spin up new temporary environments, those resource counts change from moment to moment. Even a spike in user engagement or a sudden flash sale can change your cloud footprint in an instant.
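A snapshot is easy to script, and worth running on a schedule. A boto3 sketch answering exactly those three questions (for large accounts you'd paginate describe_instances):

    import boto3

    ec2 = boto3.client("ec2")
    s3 = boto3.client("s3")

    # Count running instances across all reservations.
    running = sum(
        len(r["Instances"])
        for r in ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )["Reservations"]
    )
    print("running instances:", running)
    print("s3 buckets:", len(s3.list_buckets()["Buckets"]))
    print("elastic ips:", len(ec2.describe_addresses()["Addresses"]))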

That's where heavy use of a logging stack such as ELK (Elasticsearch, Logstash and Kibana) can really help. Sure, AWS offers CloudWatch and CloudTrail, but again, you must put them to good use.

Related: Why I ask for a deposit
