One of the things that is exciting about the cloud is the reduced need for operations staff. There seem to be two drivers of this trend. One is devops, and all the automation that comes with it. As we formalize configurations, things become repeatable, and fewer people can manage greater armies of servers.
The second is by moving to a cloud hosting provider, we essentially outsource the operations to their team.
1. Pretty abstractions? still hardware buried somewhere
That’s right, beneath all the virtual EC2 instances & VPCs there is physical hardware. Huge datacenters sit in North Virginia, Oregon, Ireland, London and many other cities. Within them there are racks upon racks of servers. The hypervisor layer, the abstraction built on top of that, orchestrates everything.
Although we outsource the management of those datacenters to Amazon, there are still responsibilities we carry. Let’s dig into those more.
2. Full-stack dev – demand for generalists?
These days we see the demand for a full stack developer. That is someone who does not only front end dev, but also backend. In turn, they are often asked to wear the had of ops. Spinup EC2 instance, decide on the capacity & size, choose proper disk I/O, place it within the right subnet & vpc & then configure the security groups properly.
All of these tasks would previously been managed by a dedicated ops team, but now those responsibilities are being put on developers shoulders. In some cases, such as with microservices, devs also carry the on-call duties of their applications.
Lastly there is likely ops to handle automation. Devops will formalize configurations, into ansible playbooks or chef recipes, so they can be checked into version control. At this point infrastructure can even be unit tested.
3. Design, resiliency, instrumentation, debugging
In previous eras, ops teams would be heavily involved with design of applications & architecture to support that. Now that may be handed to devs, but it still needs to happen.
Furthermore resiliency is said to be the customers responsibility. In the pre-cloud days, hardware was more reliable. It had a slower failure rate. With virtual machines, they’re expected to fail, and all the components to make your applications resilient are given to you. But it’s your job to architect them together.
That means your applications need to be self-healing. Failures need to be detected, taken out of autoscaling groups, and replaced. All automatically. Code or not, that is certainly operations.
Check this: Which engineering roles are in top demand?
4. It’s amazon’s fault we’re down!
I’ve seen quite a few outages in the past year, from Dropbox to Airbnb, and DYN themselves. Ultimately these outages could be tied back to a failure with Amazon. But when your business customers are relying on your service, it is *YOUR* business that answers to it’s own SLA.
In the news we see many of these firms pointing the finger at Amazon, “hey it’s not our fault, our cloud provider went down!”. Ultimately your customers don’t care. They don’t want excuses. If using multiple regions in AWS is not sufficient, you’ll need to build your application to be multi-cloud.
5. It’s hard to outsource your expertise
Remember, while you outsource your operations to Amazon, you’re getting very professional management of those systems. However they will optimize for their many customers. Your particular problems are less of a concern.
6. Only you can thinking holistically about interdependancies
Your application more than likely uses a number of APIs to capture data, perhaps do single sign on or even a third party database like Firebase. It’s your responsibility to do integration testing. All that becomes more complex in the cloud.
7. How do services complicate things?
SaaS solutions are everywhere now. auth0, firebase and an infinite variety of third party apis complicate reliability, security, storage, performance, integration testing & debugging?
Security is a traditional responsibility held by the operations hat. Much of that becomes more complex in the cloud. With serverless applications for example you may use a few APIs, plus an authentication broker, and a backend database. As this list of services grows, the code you write may decrease. But testing & securing it all becomes much more complex.
With more services like this, the attack vector or surface area becomes greater. Each of those services, can and will have bugs. What if a zero day is found in the authentication broker, allowing a hacker to break into a broad cross section of applications across the internet? How do you discover this? What if your vendor hasn’t found out yet?
8. How does co-tenancy impact performance tuning?
Back to point #1 above, all these virtual servers sit on real physical servers. That affects customers in two ways. One you may be sharing the same host. That is if you use a very small vm, it may sit along side another customer with a small vm. If those eat up CPU cycles or network on that box, neighbors or co-tenants will suffer.
There are many other instance types where you get your own dedicated hardware. With those you have your own nic as well, so no competition. Except wait there’s network storage! That’s right all the machines in the AWS environment use EBS now, which is all co-tenant. So your data is sitting alongside other customers, and you are all fighting for usage of the same disk read heads.
One way to mitigate this is to configure specific provisioned IOPS for your servers. But that costs more. It’s normally reserved for database instances where disk I/O is really crucial.
Granted the NewRelics of the world will certainly help us with this process. But they’re not giving us a hypervisor or global view of those servers, network or storage. So we can’t see how the overall systems performance may be impacted.
9. Operations can be invisible
When security is done well, you don’t have breakins, you don’t have data stolen, everything just runs smoothly. Operations is like that too. When it is done well it can be invisible.
It can also be invisible in a different way. When you deploy your application on serverless, all the servers & autoscaling is completely abstracted away. So when you get some weird outage because the farm of servers is offline, or because you hit some account limit in the number of functions you can run at once, then it quickly comes into focus.
Beware of invisible operations, because it’s harder to see what to monitor, and know how to stay ahead of outages.
10. We can’t oursource true ownership
At the end of the day you can’t outsource ownership of your application or your business. The holistic view of your application in totality can only be understood by your engineers.
And that in the end is what operations is all about, no matter who’s wearing the hat!
Get my monthly newsletter for more thoughts on data, startups & innovation. Scalability. Automation. Amazon cloud.
Also published on Medium.