I think there are two kinds of software-producing organizations:
There's the small shops where you're running some kind of monolith generally open to the Internet, maybe you have a database hooked up to it. These shops do not need dedicated DevOps/SRE. Throw it into a container platform (e.g. AWS ECS/Fargate, GCP Cloud Run, fly.io, the market is broad enough that it's basically getting commoditized), hook up observability/alerting, maybe pay a consultant to review it and make sure you didn't do anything stupid. Then just pay the bill every month, and don't over-think it.
Then you have large shops: the ones where

* you're running at a scale where the cost premium of container platforms exceeds the salary of an engineer to move you off them;
* you have to figure out how to get the systems that different companies built pre-M&A to talk to each other;
* you have N development teams organizationally far away from the sales and legal teams signing SLAs, yet those teams need to be constrained by said SLAs;
* you have some system that was architected to handle X scale, the business has now sold 100X, and you have to figure out which band-aids to throw at the failing system while telling the devs they need to re-architect;
* you need to build your Alertmanager routing tree configuration dynamically, because YAML is garbage and the routing rules change based on whether or not SRE decided to return the pager, plus devs need to be able to self-service new services, plus new alerts need progressive rollout across the organization, so even Alertmanager config needs to be owned by an engineer (sketch below).
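To make that last point concrete, here's roughly the shape of "build the routing tree dynamically". A minimal sketch: the team registry, the `sre_owns_pager` flag, and the receiver names are all invented for illustration, and a real generator would also emit the matching `receivers:` section.

```python
# Sketch: render the Alertmanager routing tree from a team registry
# instead of hand-maintaining YAML. All names here are made up.
import yaml  # PyYAML

teams = [
    {"name": "payments", "sre_owns_pager": True},
    {"name": "search", "sre_owns_pager": False},  # SRE returned the pager
]

def route_for(team: dict) -> dict:
    # Page SRE only while SRE still holds the pager; otherwise page the
    # owning dev team directly.
    receiver = "sre-pager" if team["sre_owns_pager"] else f"{team['name']}-dev-pager"
    return {"matchers": [f"team = {team['name']}"], "receiver": receiver}

config = {
    "route": {
        "receiver": "catch-all",
        "routes": [route_for(t) for t in teams],
    }
}
print(yaml.safe_dump(config, sort_keys=False))
```

Dev self-service then reduces to adding a registry entry and regenerating, and progressive rollout reduces to gating which entries the generator emits.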
I really can't imagine LLMs replacing SREs in large shops. SREs debugging production outages to find a proximate "root" technical cause is a small fraction of the SRE function.
stackskipton makes a good point about authority. SRE works at Google because SREs can block launches and demand fixes. Without that organizational power, you're just an on-call engineer who also writes tooling.
The article's premise (AI makes code cheap, so operations becomes the differentiator) has some truth to it. But I'd frame it differently: the bottleneck was never really "writing code." It was understanding what to build and keeping it running. AI helps with one of those. Maybe.
Edit: Or maybe he's fully aware and just needs to push some books before it's too late.
If the agent swarm is collectively smarter and better than the SRE, they'll be replaced just like other types of workers. There is no domain that has special protection.
The thing about C-suite executives is that they usually have short tenures; the management levels below them, however, are often cozy in their bureaucracy, resist change, and try to outlast the new management.
I actually argue that AI will therefore impact these levels of management the most.
Think about it: if you were employed as a transformational CEO, would you risk trying to fight the existing managers, or just replace them with AI?
>I actually argue that AI will therefore impact these levels of management the most.
Not AI, but a bad economy and mass layoffs tend to wipe out management positions the most. As a decent IC, in a round of layoffs in a bad economy you'll always find somewhere to work if you're flexible on location and salary, because everyone still needs people who know how to actually build shit, but nobody needs to add more managers to their ranks to consume payroll and add no value.
AI will transform this.
A lot of large companies regularly lay off swathes of technical staff (or watch them leave) and rotate CEOs, but their middle management have jobs for life. As the Peter Principle states, they are promoted to their level of incompetence and stay there, because no CEO has time to replace them.
Disagree with the "jobs for life" part for management. Only managers who are there thanks to connections, nepotism, or cronyism stay for life, and only as long as those shielding them also stay in place. Those who got into or got promoted to management meritocratically don't have that protection and are the first to be let go.
At all the large MNCs I worked at, management got hired and fired mostly on their connections (or lack thereof) and less on what they actually did. Once they were let go, they had a near-impossible time finding another job without connections elsewhere.
I was an old-school SRE before the days of containerization and such. Today, we have one who is a YAML wizard, and I won't even pretend to begin to understand the entire architecture across all the moving pieces (kube, flux, helm, etc.).
That said, Claude has absolutely no problem not only answering questions about it, but also finding bugs and adding new features to it.
In short, I feel they're just as screwed as us devs.
Operational excellence was always part of the job, regardless of what fancy term described it, be it DevOps, SRE, or something else. The future of software engineering is software engineering, with emphasis on engineering.
Look at the 'Product Engineer' roles we are seeing spread through forward-thinking startups and scaleups.
That's the future of SWE, I think: SWEs take on more PM and design responsibilities as part of the existing role.
I agree. In many cases it's probably easier for a developer to become more of a product person than for a product person to become a dev. Even with LLMs you still need some technical skills and the ability to read code to handle technical tasks effectively.
Of course things might look different when the product is something that requires really deep domain knowledge.
As an SRE I can tell you AI can't do everything. I have done a little software development too, and even there AI can't do everything. What we are likely to see is operational engineering becoming the consolidated role between the two: knows enough about software development, knows enough about site reliability... blamo, operational engineer.
I knew what an SRE was and found the article somewhat interesting, with a slightly novel (if throwaway) and more realistic take on the "why need Salesforce when you can vibe-code your own Salesforce" convo.
But not defining what an SRE is feels like a glaring, almost suffocating, omission.
> And you definitely don't care how a payments network point of sale terminal and your bank talk to each other... Good software is invisible.
> ...
> Are you keeping up with security updates? Will you leak all my data? Do I trust you? Can I rely on you?
IMO, if the answers to those questions matter to you, then you damn well should care how it works. Because even if you aren't sufficiently technically minded to audit the system, having someone be able to describe it to you coherently is an important starting point in building that trust and having reason to believe that security and privacy will work as advertised.
As someone who works in an Ops role (SRE/DevOps/Sysadmin): SRE is something that only really works at Google, mainly because SREs there have the authority to reject code or demand fixes. To replicate that with AI, you'd need a prompt engineer who understands the code, and at that point they're back to being a developer.
As for the more dedicated Ops side, it's garbage in, garbage out. I've already had too many outages caused by AI slop being fed into production; declaring all developers to be SREs won't change the fact that AI can't currently program without experienced people tightly controlling it.
Most devs can't do SRE, in fact the best devs I've met know they can't do SRE (and vice versa). If I may get a bit philosophical, SRE must be conservative by nature and I feel that devs are often innovative by nature. Another argument is that they simply focus on different problems. One sets up an IDE and clicks play, has some ephemeral devcontainer environment that "just works", and the hard part is to craft the software. The other has the software ready and sometimes very few instructions on how to run it, + your typical production issues, security, scaling, etc. The brain of each gets wired differently over time to solve those very different issues effectively.
I don’t understand this take - if all engineers go on call, they learn real quick what happens when their coworkers are too innovative. It is a good feedback loop that teaches them not to make unreliable software.
SREs are great when the problem is “the network is down” or “kubernetes won’t run my pods”, but expecting a random engineer to know all the failure modes of software they didn’t build and don’t have context on never seems to work out well.
It's possible to do both, you just need to be cognizant of what you're doing in both positions.
A tricky part comes when you don't have both roles for something, like SRE-developed tools maintained by the people who wrote them, and you need to strike the balance yourself until/unless you wind up with that split. If you're not aware of both hats and juggling them intentionally, you can wind up with tools out of SRE that are worse than any SWE-only tool would ever be, because SREs sometimes think they won't make the same mistakes; all the same feature-focused pressures apply to SRE-written tools too...
There were several cheaper-than-programmers options for automating things, Robotic Process Automation being probably the best known, but it never got the expected traction.
Why (imo)? Senior leaders still like to say: I run a 500-headcount EMEA finance organization for Siemens; I am the Chief People Officer of Meta and I lead an org of 1000 smart HR pros. Most of their status is still tied to org headcount.
It only matters if any of those can promise reliability, and either put their own money where their mouth is or convince a bigger player to insure them (and actually get that player to pay up).
Ultimately hardware, software, QA, etc is all about delivering a system that produces certain outputs for certain inputs, with certain penalties if it doesn’t. If you can, great, if you can’t, good luck. Whether you achieve the “can” with human development or LLM is of little concern as long as you can pay out the penalties of “can’t”.
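To put toy numbers on that contract view (everything below is invented, not from any real SLA): the "can" is an uptime target, and the "can't" is priced by a service-credit schedule.

```python
# Hypothetical SLA arithmetic: a 99.9% monthly uptime promise with tiered
# service credits. All numbers are illustrative.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def downtime_budget_minutes(slo: float) -> float:
    """Minutes of downtime per month before the promise is broken."""
    return MINUTES_PER_MONTH * (1.0 - slo)

def service_credit(measured_uptime: float) -> float:
    """Fraction of the monthly bill refunded, per a made-up tier table."""
    if measured_uptime >= 0.999:
        return 0.0
    if measured_uptime >= 0.99:
        return 0.10
    return 0.25

print(downtime_budget_minutes(0.999))  # ~43.2 minutes
print(service_credit(0.995))           # 0.10, i.e. 10% of the bill
```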
AI will not get much better than what we have today, and what we have today is not enough to totally transform software engineering. It is a little easier to be a software engineer now, but that’s it. You can still fuck everything up.
What? Maybe OP's future. SWE is just going to absorb QA and maybe architects if the industry adopts AI more, but there are a lot of holdouts. There are plenty of projects out there that are 'boring' and will not bother.
Operational excellence will always be needed, but part of that is writing good code. If the slop machine has made bad decisions, it could be more efficient to rewrite using human expertise and deploy that.
> AI will not get much better than what we have today
Wow, where did this come from? Just off the top of my head, based on recent research, I'd expect at least the following this year or next:
* Continuous learning via an architectural change like Titans or TTT-E2E.
* Advancements in world models (many labs are focusing on them now).
* Longer-running agentic systems, with Gas Town being a recent proof of concept.
* Advances in computer and browser usage; tons of money is being poured into this, and RL with self-play is straightforward.
* AI integration into robotics, especially when coupled with world models.
My take (I'm an SRE) is that SRE should work pre-emptively to provide reproducible prod-like environments so that QA can test DEV code closer to real-life conditions. Most prod platforms I've seen are nowhere near that level of automation, which makes it really hard to detect or even reproduce production issues.
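A minimal sketch of the direction I mean, assuming the testcontainers package and a Postgres image pinned to whatever prod actually runs (both assumptions on my part):

```python
# Sketch: give QA a disposable, prod-like dependency instead of a mock.
# Assumes `pip install testcontainers[postgres] sqlalchemy`; pin the image
# tag to match production.
import sqlalchemy
from testcontainers.postgres import PostgresContainer

def test_select_one():
    with PostgresContainer("postgres:16") as pg:
        engine = sqlalchemy.create_engine(pg.get_connection_url())
        with engine.connect() as conn:
            # In a real setup you'd run prod's migrations here first, then
            # exercise the DEV code against this engine.
            assert conn.execute(sqlalchemy.text("SELECT 1")).scalar() == 1
```

The same idea scales up to compose-style stacks; the point is that QA hits real services, not hand-rolled fakes.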
And no, as an SRE I won't read DEV code, but I can help my team test it.
> And no, as an SRE I won't read DEV code, but I can help my team test it.
I mean, to each their own. Sometimes, if I catch a page and the rabbit hole leads to the devs' code, I look under the covers.
And sometimes it's a bug I can identify and fix pretty quickly. Sometimes faster than the dev team because I just saw another dev team make the same mistake a month prior.
You gotta know when to cut your losses and stop going down the rabbit hole though, that's true.
I agree with your nuance, but that's not my default mode; unless I know the language and the domain well, I am not going to write an MR. I'm going to read the stack trace to see if it's a conf issue, though.