I’ve been on the phone with a lot of companies lately. You might be surprised that some of the challenges firms struggle with in the cloud are repeated over and over.
Join 38,000 others and follow Sean Hull on twitter @hullsean.
I put together some of the most common ones I’ve heard, and my thoughts on the right solution.
1. How would you load test?
Here’s an interesting question. Do you talk about tools? How do you approach the problem?
The first thing I talk about is simulating real users. If your site normally has 1000 active users, how will it behave when it has 5000, 10,000, 100,000 or 1 million? We can simulate this by using a load testing tool, and monitoring the infrastructure and database during that test.
But how accurate are those tests? What do active users do? Log in to the site? Edit and change some data? Where do active users spend most of their time? Are there some areas of the site that are busier than others? What about some dark corner of the site or product that gets less use, but is also less tuned? Suddenly a few users decide they want that feature, and performance slides!
Real world usage patterns are unpredictable. There is as much art as science to this type of load testing.
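To make the idea concrete, here is a minimal sketch of ramping up simulated users with a weighted mix of actions. Everything in it is an assumption for illustration: the `TRAFFIC_MIX` weights are made up, and `fake_request` just sleeps where a real test would call your own site. A dedicated load testing tool does far more than this.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Hypothetical traffic mix: action name -> share of requests.
# Measure your own users' behavior rather than trusting these numbers.
TRAFFIC_MIX = {"browse": 0.7, "login": 0.2, "edit": 0.1}


def fake_request(action):
    """Stand-in for a real HTTP call; swap in a request against your own site."""
    time.sleep(0.001)


def simulate_user(requests_per_user):
    """One simulated user: pick actions by weight and time each request."""
    names = list(TRAFFIC_MIX)
    weights = list(TRAFFIC_MIX.values())
    latencies = []
    for _ in range(requests_per_user):
        action = random.choices(names, weights=weights)[0]
        start = time.perf_counter()
        fake_request(action)
        latencies.append(time.perf_counter() - start)
    return latencies


def run_load_test(users, requests_per_user=5):
    """Run `users` simulated users concurrently and summarize latency in seconds."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = pool.map(simulate_user, [requests_per_user] * users)
        latencies = sorted(t for user in per_user for t in user)
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "users": users,
        "requests": len(latencies),
        "median_s": median(latencies),
        "p95_s": latencies[p95_index],
    }


if __name__ == "__main__":
    # Step the user count up and watch how the latency numbers move.
    for users in (10, 50, 100):
        print(run_load_test(users))
```

The interesting part is not the harness but the mix. Shift more weight onto that rarely used, untuned feature and see whether the p95 number slides.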
2. Why is Amazon S3’s 99.999999999% promise *not* enough?
I’ve heard people say before that S3 is extremely reliable. But is it?
Amazon says S3 is designed for 11 nines of durability. What does this mean? Durability is confidence that a file, once saved, stays saved. That you will not lose it. It’s on storage, and that storage has redundant copies. Great. You can be confident you are very unlikely to ever lose a file.
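What about uptime? S3 Standard is designed for 99.99% availability, but the SLA only starts paying credits once uptime drops below 99.9%. Surprise! Over a 30-day month, 99.9% allows roughly 43 minutes of DOWNTIME. And if your product fails when S3 files are unavailable, guess what, your business can be down for the better part of an hour a month.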
That’s actually quite a *lot* of downtime.
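As a back-of-the-envelope check, here is a small sketch that converts an availability percentage into allowed downtime. The function name and the 30-day month are my own assumptions for the arithmetic.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for availability in (99.99, 99.9, 99.0):
        minutes = allowed_downtime_minutes(availability)
        print(f"{availability}% uptime allows about {minutes:.1f} minutes down per 30 days")
```

Each nine you give up multiplies the allowed downtime by ten: about 4.3 minutes a month at 99.99%, about 43 minutes at 99.9%.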
Solution: You’d better be doing cross-region replication. You have the tools and the cloud, now get to work!
3. Why is continuous integration not about tools?
I hear a lot of talk about continuous integration. I’ve even seen it as a line item on a todo list I was handed. Hmmm…
I asked the manager, “So it says here set up CI/CD. Are there already unit tests written? What about integration tests?” Turns out the team was not yet on board with writing tests. I gently explained that automated builds are not going to get very far without tests to run. 🙂
CI/CD requires the team to be on board. It’s first a cultural change in how you develop code.
It means more regular code check-ins. It means every engineer promises not to break the build.
It means writing enough tests for good code coverage.
4. What can VPC peering do for you?
Amazon has made VPC peering a lot easier. But what the heck is it?
In the world of cloud networking, everything is virtual. You have VPCs and inside those you have public and private subnets. What if you have multiple AWS accounts? VPC peering can allow you to connect backend private subnets, without going across the public internet at all.
As security becomes front & center for more businesses, this can be a huge win.
What’s more, it’s easier now because it is semi-managed by AWS.
5. Why go with a New York based resource?
As more and more small startups put together teams to build their MVP, the offshore market has never been hotter. And there are very talented engineers in faraway places, from Eastern Europe, to India and China. And South America too.
At ⅓ to ¼ the price, why hire US-based people? Well, one reason might be compliance, if you have sensitive data that must be handled by US nationals.
Why New York based? Well there is the value of being face-to-face and working side by side with teams. It may also ease the language barrier & communication. And timezone challenges sometimes make communication difficult.
And lastly, ownership. With resources that are focused solely on you, and for which you are a big customer, you’re likely to get more personalized, focused attention.
6. What are some common antipatterns in the cloud?
Antipatterns are interesting because you see them regularly, and yet they are the *wrong* way to solve a problem. Either they’re slower, or there is a better, more reliable way to solve it.
o Using EFS, Amazon’s NFS solution, instead of putting assets in S3.
It might help you avoid rewriting code, but in the end S3 is definitely the way to go.
o Hardcoded IPs in security group rules instead of naming a group.
Yes I’ve seen this. If you specify each webserver individually, what happens when you autoscale? Answer, the new nodes break! The solution is to put all the webservers in a group, and then add a security group rule allowing access from that group. Voila!
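Here is a rough sketch of the difference between the two kinds of rule. The dictionaries follow the `IpPermissions` shape that boto3’s `authorize_security_group_ingress` accepts. The helper names, group ID, and IP address are hypothetical, and nothing here actually calls AWS.

```python
def ingress_from_ip(ip_address, port, protocol="tcp"):
    """The antipattern: allow one hardcoded webserver IP."""
    return {
        "IpProtocol": protocol,
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": f"{ip_address}/32"}],
    }


def ingress_from_group(source_group_id, port, protocol="tcp"):
    """The fix: allow anything that belongs to the webserver security group."""
    return {
        "IpProtocol": protocol,
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(ingress_from_ip("10.0.1.15", 3306))
    print(ingress_from_group("sg-0123456789abcdef0", 3306))
    # With boto3 installed and credentials available, a rule could be applied
    # roughly like this (the database group ID is also hypothetical):
    #   ec2 = boto3.client("ec2")
    #   ec2.authorize_security_group_ingress(
    #       GroupId="sg-0fedcba9876543210",
    #       IpPermissions=[ingress_from_group("sg-0123456789abcdef0", 3306)],
    #   )
```

The group-based rule never mentions an address, so a freshly autoscaled node is covered the moment it joins the webserver group.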
o Passing credentials around instead of using AWS instance-level roles
Credentials are the bane of applications. You hardcode them and things break later. Or they create a vulnerability that you forget about. That’s why AWS invented roles. But did you know a server *itself* can have a role? That means that server, and any software running on it, has permissions to certain APIs within the AWS universe. You can change a server’s role or its underlying policies while the server is still running. No restart required!
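For a feel of what sits behind an instance role, here is a minimal sketch of the two policy documents involved: a trust policy that lets EC2 assume the role, and a permissions policy for what the server may do. The bucket name and helper names are hypothetical, and the read-only S3 permission is only an example.

```python
import json


def ec2_trust_policy():
    """Trust policy: lets the EC2 service assume the role on the server's behalf."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def read_only_bucket_policy(bucket_name):
    """Permissions policy: what software on the server is allowed to call."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
    # "example-assets-bucket" is a made-up name.
    print(json.dumps(read_only_bucket_policy("example-assets-bucket"), indent=2))
```

No access key appears anywhere. Code on the server that uses an AWS SDK picks up temporary credentials from the role on its own.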
o Implementing CI/CD as a task item
Don’t forget culture & process are the big hurdles. Installing a tool is easy. Getting everyone using it every day is the challenge!
o Not reducing and managing Docker image bloat
As you change your Docker images, you add layers. Even as you delete things, the total image size only grows! Seems counterintuitive. What’s more, when you do all that work with yum or apt-get, those packages stay lying around. One way to fix this is to install packages and then clean up, all in one command. Another way is to export and import a finished image.
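A toy model shows why deleting later does not help. This is only arithmetic, not Docker itself, and the sizes are invented: treat each build step as a layer whose size is whatever it added minus whatever it cleaned up within that same step.

```python
def layer_size(added_mb, cleaned_in_same_step_mb=0):
    """Size of one layer: files added minus files removed in the same step."""
    return added_mb - cleaned_in_same_step_mb


def image_size(layers):
    """An image is the sum of its layers; a later layer cannot shrink an earlier one."""
    return sum(layers)


# Hypothetical numbers: 200 MB installed, of which 150 MB is package cache.
separate_steps = [
    layer_size(200),  # step 1: install packages, cache left behind
    layer_size(0),    # step 2: delete the cache; step 1's layer is unchanged
]
single_step = [
    layer_size(200, cleaned_in_same_step_mb=150),  # install and clean up together
]

if __name__ == "__main__":
    print("separate steps:", image_size(separate_steps), "MB")
    print("single step:", image_size(single_step), "MB")
```

In this model the cleanup only counts when it happens in the same step as the install, which is the reason for doing both in one command.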
o SSH-ing into servers or giving devs kubectl
Old habits die hard! I was watching Kelsey Hightower’s keynote at KubeCon. He made some great points about Kubernetes. If you give all the devs kubectl, then it’s kind of like allowing everybody to SSH into the boxes. It’s not the way to do it!