I don't know if any of this is true, but as a user of Azure every day this would explain so much.
The Azure UI feels like a janky mess, barely held together. The documentation is obviously entirely written by AI and is constantly out of date or wrong. They offer such a huge volume of services that it's nearly impossible to figure out which service you actually want/need without consultants, and when you finally get the services up, who knows if they actually work as advertised.
I'm honestly shocked anything manages to stay working at all.
I remember being impressed with the Azure docs... until I spent a week implementing something, only to have it completely fail when deployed to the test environment because the Graph API did not work as documented. The beautiful docs were a complete lie.
These days I don't even bother looking at the docs when doing stuff with Azure.
The only good thing Microsoft azure ever did for me was provide a very easy way to exploit their free trial program in the early 2010s to crypto mine for free. It couldn’t do much, but it was straight up free real estate for CPU mining. $200 or 2 weeks per credit/debit card.
We migrated some services to AKS because the upper management thought it was a good deal to get so many credits, and now pods are randomly crashing and database nodes have random spikes in disk latency. What ran reliably on GCP became quite unpredictable.
Interesting!
We're using AKS with huge success so far, but lately our Pods are unresponsive and we get 503 Gateway Timeouts that we really can't trace down.
And don't get me started on Azure Blob Tables...
In our case, we spent too much engineering time just putting up with Azure, with no good ROI. It took some time for upper management to realize Azure is shit and cut the cost.
Exactly what I was thinking. But then again, from what I've seen, the people responsible for monitoring uptime are often much further removed from the C-suite in these "committed-spend" companies.
What are we reading here? These are extraordinary statements, made with apparent credibility. They sound reasonable. Is this a whistleblower, or an ex-employee with a grudge? It appears to be the former. Is it? They've put their name to some clear and worrying statements.
> On January 7, 2025… I sent a more concise executive summary to the CEO. … When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
Why is that customary? I have not come across it, and though I have seen situations of real concern in the past, I have little experience with US corporate norms. What is normal here for such a level of concern?
More to the point, why is this a public post and not a wrongful-termination court case?
Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
>Is Azure really this unreliable? There are concrete numbers in this blog. For those who use Azure, does it match your external experience?
IME, yes.
I'm currently working as an SRE supporting a large environment across AWS, Azure, and GCP. In terms of issues or incidents we deal with that are directly caused by cloud provider problems, I'd estimate that 80-90% come from Azure. And we're _really_ not doing anything that complicated in terms of cloud infrastructure; just VMs, load balancers, some blob storage, some k8s clusters.
Stuff on Azure just breaks constantly, and when it does break it's very obvious that Azure:
1. Does not know when they're having problems (it can take weeks/months for Azure to admit they had an outage that impacted us)
2. Does not know why they had problems (RCAs we're given are basically just "something broke")
3. Does not care that they had problems
Everyone I work with who interacts with Azure at all absolutely loathes it.

https://x.com/DaveManouchehri/status/2037001748489949388

Nobody seems to care.
In my experience Azure is full of consistency issues and race conditions. It's enough of an issue that I was talking about new OpenAI models becoming available via Bedrock on AWS and how convenient that was since I wouldn't have to deal with Azure and my colleague in enterprise architecture went on an unprompted rant about these exact issues. It's not the first time something like this has happened and I've experienced these issues first hand, so yes. I'd say reliability is a critical issue for Azure and it hasn't gotten better each time I've gone back to check.
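For anyone who hasn't fought these consistency issues directly: the defensive pattern that falls out of them is to treat every control-plane write as eventually consistent and poll the resource until it reports a terminal state, rather than trusting the initial 200/201. A minimal sketch of that pattern; the URL shape and the properties.provisioningState field follow the common ARM convention, but treat the details as illustrative rather than a specific SDK:

    import json
    import time
    import urllib.error
    import urllib.request

    def wait_until_provisioned(resource_url, token, timeout_s=300.0, interval_s=5.0):
        """Poll a just-written resource until it reaches a terminal state,
        instead of trusting the status code returned by the create call."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            req = urllib.request.Request(
                resource_url, headers={"Authorization": "Bearer " + token})
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    body = json.load(resp)
            except urllib.error.HTTPError as exc:
                if exc.code == 404:  # read-after-write lag: not visible yet
                    time.sleep(interval_s)
                    continue
                raise
            state = body.get("properties", {}).get("provisioningState")
            if state == "Succeeded":
                return body
            if state in ("Failed", "Canceled"):
                raise RuntimeError("provisioning ended in state " + repr(state))
            time.sleep(interval_s)  # still in flight; keep polling
        raise TimeoutError("resource not ready after " + str(timeout_s) + "s")

It is depressing that client code has to absorb this, but it turns "randomly fails" into "slow but eventually correct" for a large class of cases.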
Large orgs make decisions that prioritize short-term metrics over long-term quality all the time and nobody tracks whether those tradeoffs actually paid off. The decision to ship fast and fix later sounds reasonable in a meeting setting until articles like this surface and the reality comes through clearly.
I recall seeing some pretty damning reports from a security pentester who was able to escape from a container on Azure and found that the management controller for the service was years old, with known critical unpatched vulnerabilities. I've always been a bit sceptical of them since then.
The CEO is accountable to the board. If they are derelict in their obligations to the company, that's where you need to raise a stink so they can fix it.
Well, yeah, that’s what a board does, but I think the issue is whether it is customary to go to the board directly in this situation. The answer is a resounding NO. Very odd, but cool idea and approach.
Yeah, I thought that was extreme. An engineer going to the board of any corporation, let alone Microsoft, is not normal or customary IME. That could explain why they got no response.
This is insane. When you say Azure OpenAI, do you mean GitHub Copilot, Microsoft Copilot, hitting OpenAI's API, or some OpenAI LLM hosted on an Azure offering that you hit through Azure? This is some real wild west crap!
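For context, "Azure OpenAI" normally means the last option: OpenAI models deployed into your own Azure resource, reached through an Azure-specific endpoint and auth header rather than api.openai.com. A rough sketch of the difference; the endpoint shapes follow the public docs, but the resource name, deployment name, key placeholders, and api-version string are illustrative:

    import json
    import urllib.request

    def chat(url, headers, payload):
        req = urllib.request.Request(
            url, data=json.dumps(payload).encode(),
            headers={**headers, "Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)

    payload = {"messages": [{"role": "user", "content": "ping"}]}

    # OpenAI's own API: one global endpoint, Bearer auth, model named in the body.
    openai_resp = chat(
        "https://api.openai.com/v1/chat/completions",
        {"Authorization": "Bearer YOUR_OPENAI_KEY"},
        {**payload, "model": "gpt-4o-mini"})

    # Azure OpenAI: your resource's own endpoint, api-key auth, and the model
    # is a named *deployment* you created beforehand in that resource.
    azure_resp = chat(
        "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/"
        "YOUR-DEPLOYMENT/chat/completions?api-version=2024-02-01",
        {"api-key": "YOUR_AZURE_OPENAI_KEY"},
        payload)

GitHub Copilot and Microsoft Copilot are separate products layered on top of such deployments, which is exactly why the naming confuses everyone.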
The post is so dramatized, and so clearly written by someone with a grudge, that it really detracts from whatever point is being made, if there is one.
From another former Az eng now elsewhere still working on big systems, the post gets way way more boring when you realize that things like "Principle Group Manager" is just an M2 and Principal in general is L6 (maybe even L5) Google equivalent. Similarly Sev2 is hardly notable for anyone actually working on the foundational infra. There are certainly problems in Azure, but it's huge and rough edges are to be expected. It mostly marches on. IMO maturity is realizing this and working within the system to improve it rather than trying to lay out all the dirty laundry to an Internet audience that will undoubtedly lap it up and happily cry Microslop.
Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really? Azure is still chugging along apparently despite everything being mentioned. People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
> Last thing, the final part 6 comes off as really childish, risks to national security and sending letters to the board, really?
That struck me too. Maybe I've never worked high enough in an org (I'm unclear how highly ranked the author of the piece is), but I've never been in an org where going over your boss's boss's boss's boss's head and writing a letter to the board was likely to go well.
That said, I could easily believe both that Azure is an absolute mess and that the author of the piece was fired because of how he went about things.

I'm really struck that they have such Jr people in charge of key systems like that.
AWS and Google Cloud are both huge and are significantly better in UX/DX. My only experience with Azure was that it barely worked, provided very little in the way of information about why it didn't. I only have negative impressions of Azure whereas at least GC and AWS I can say my experiences are mixed.
Microsoft is the go to solution for every government agency, FEDRAMP / CMMC environments, etc.
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
This I'm more sympathetic to. I really don't think his approach of "here's what a rewrite would look like" was ever going to work and it makes me think that there's another side to this story. Thinking that the solution is a full reset is not necessarily wrong but it's a bit of a red flag.
At no point in the reading did I get the sense that he's suggesting anything radical. Where specifically does he propose a rewrite?
"The practical strategy I suggested was incremental improvement... This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale." [1]
> People come in all the time crying that everything is broken and needs to be scrapped and rewritten but it's hardly ever true.
Or… you’ve just normalised the deviation.
One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
After about three or four weeks everyone adapts, learns what they can and can’t criticise without fallout, and settles into the mud to wallow with everyone else that has become accustomed to the filth.
As an Azure user I can tell you that it’s blindingly obvious even from the outside that the engineering quality is rock bottom. Throwing features over the fence as fast as possible to catch up to AWS was clearly the only priority for over a decade and has resulted in a giant ball of mud that now they can’t change because published APIs and offered products must continue to have support for years. Those rushed decisions have painted Azure into a corner.
You may puff your chest out, and even take legitimate pride in building the second largest public cloud in the world, but please don’t fool yourself that the quality of this edifice is anything other than rickety and falling apart at the seams.
Remind me: can I use IPv6 safely yet? Does it still break Postgres in other networks? Can azcopy actually move files yet, like every other bulk copy tool ever made by man? Can I upgrade a VM in-place to a new SKU without deleting and recreating it to work around your internal Hyper-V cluster API limitations? Premium SSDv2 disks for boot disks… when? Etc…
You may list excuses for these quality gaps, but these kinds of things just weren’t an issue anywhere else I’ve worked as far back as twenty years ago! Heck, I built a natively “all IPv6” VMware ESXi cluster over a decade ago!
> One of the few reliable barometers of an organisation (or their products) is the wtf/day exclaimed by new hires.
Eh, I don't think this is exactly as reliable as you'd expect.
My previous job had a fairly straightforward codebase but fairly poor reliability for the few customers we had, and the WTF portions usually weren't the ones that caused downtime.
On the other hand, I'm currently working on a legacy system with daily WTFs from pretty much everyone, with a greater degree of complexity in a number of places, and yet we get fewer bug reports and at least an order of magnitude if not two more daily users.
With all of that said... I don't think I've used any of Microsoft's new software in years and thought to myself "this feels like it was well made."
Really. Apparently the Secretary of War agrees with him.

In fairness the SECWAR is hardly a computing expert.
But in this case the SECWAR has been properly advised. If anything, it's astonishing that a program in which China-based Microsoft engineers told U.S.-based Microsoft engineers specific commands to type ever made it off the proposal page inside Microsoft, accelerated time-to-market or not.
It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).
Some ideas "sell themselves", ideas like these do the opposite.
> It defeats the entire purpose of many of the NIST security controls that demand things like U.S.-cleared personnel for government networks, and Microsoft knew those were a thing because that was the whole point to the "digital escort" (a U.S. person who was supposed to vet the Chinese engineer's technical work despite apparently being not technical enough to have just done it themselves).

That is beyond bad. Proof of this?

IMHO the country should not capitulate to Trump's power grabs, even if Congress refuses to perform their oversight duties.
> The post is so dramatized and clearly written by someone with a grudge such that it really detracts from any point that is trying to be made, if there is any
I guessed that from the title on the main hn page. Glad to see it confirmed.
A previous colleague of mine has to work with Azure day to day, and everything explained in this article makes a lot of sense against the massive rants about the platform I get to hear from them.
12 years ago I had to choose whether to specialize in AWS, GCP, or Azure, and from my very brief foray with Azure I could see it was an absolute mess of broken, slow, click-ops methodology. This article confirms the suspicions I had at the time, and my colleague's experience.
> Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025—most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
This is what people should know when seeing massive layoffs due to AI.
I honestly thought this was one of the weaker points of the article.
The OpenAI deal almost certainly related purely to GPU capacity, which had little to do with the article. The layoffs would have happened regardless.
IMO, churn and generalization are the root cause. Engineers are thrown on projects for a year with little prior experience, leaving others to pick up the pieces, etc. There's no longer a sense of ownership, and I'm sure the recent wave of layoffs isn't helping with this.
The only time I used Azure was to set up Microsoft as an authentication provider. It put me through a never-ending loop of asking for a Government of India issued document that had already been submitted. Human support was non-existent. I decided never to use Azure in any product after that horrible experience.
If you cannot even get auth right, I shudder to think what the rest of the product will be like to deal with should issues arise. And:

"I also see I have 2 instances of Outlook, and neither of those are working." -Artemis II astronaut

That's 2 too many.

https://en.wikipedia.org/wiki/Theories_of_humor#Incongruity_...
> The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
> In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.

That is quite scary.
Why would an Azure customer need to query this service at all? I was not aware this service even existed, because I never needed anything like it. As far as I can tell, this service tells software running on the VM what SKU the VM is. But how is that useful? Could any Azure users share how they use IMDS? Thanks!
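From what the public docs describe, IMDS is mostly used for two things from inside a VM: self-discovery (size, region, tags, so nothing has to be hard-coded in config) and fetching managed-identity tokens so the workload never stores credentials; the latter is why it has to be reachable from every guest. A sketch of both calls against the documented link-local endpoint; the api-version strings are from memory and worth re-checking:

    import json
    import urllib.request

    IMDS = "http://169.254.169.254/metadata"

    def imds_get(path):
        # IMDS rejects requests without this header, which blocks accidental
        # forwarding of the link-local endpoint through proxies.
        req = urllib.request.Request(IMDS + path, headers={"Metadata": "true"})
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.load(resp)

    # 1. Self-discovery: what SKU/region is this VM? Useful for sizing worker
    #    pools or choosing region-local endpoints without extra configuration.
    instance = imds_get("/instance?api-version=2021-02-01")
    print(instance["compute"]["vmSize"], instance["compute"]["location"])

    # 2. Managed identity: exchange the VM's identity for an AAD token
    #    instead of shipping secrets onto the box.
    token = imds_get("/identity/oauth2/token"
                     "?api-version=2018-02-01&resource=https://storage.azure.com/")
    print(token["access_token"][:16], "...")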
This reads pretty bad, and I believe it was. I worked on (and was at least partly responsible for) systems that do the same thing he described. It took constant force of will, fighting, escalation, etc to hold the line and maintain some basic level of stability and engineering practice.
And I've worked other places that had problems similar to the core problems described, not quite as severe, and not at the same scale, but bad enough to doom them (IMO) to a death loop they won't recover from.
My most memorable anecdote from working in Azure is that they had two products named Purview and the internal MS people I talked to never figured out which one I was trying to use.
I had the misfortune of having to use Azure back in 2018 and was appalled at the lack of quality and the slowness. I was in GitHub forums, helping other customers suffering from a lack of basic functionality and incredible prices with abysmal performance. This article honestly explains a lot.
Google’s Cloud feels like the best engineered one, though the lack of proper human support there is worrying compared to AWS.

I just listened to the Longhorn story on Monday and heard the same thing.
I thought that about GCP until I used it more seriously and kept running into issues where they didn’t have some feature AWS had had for ages, and our Google engineers kept saying the answer was to run your own service in Kubernetes rather than use a platform service which did not give me confidence that they understood what the business proposition was.
GCP's support is abysmal. Our assigned customer support agent has changed 3 times in as many months. It's really a dice roll whether our quota increase requests are even acknowledged, or whether we can get clarification on undocumented system limitations.
Microsoft Azure has always been a clown show. I've found so many obvious bugs. The quality is not there and never will be. No serious companies rely on it. Use virtually any other vendor or host it yourself.
The personal account makes a lot of sense, although I could easily see why the OP was not successful. Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
The second thing is that this series of blog posts (whether true or not, but still believable) provides a good introduction to vibe coders. These are people who have not written a single line of code themselves and have not worked on any system at scale, yet believe that coding is somehow magically "solved" due to LLMs.
Writing the actual code itself (fully or partially) maybe yes. But understanding the complexity of the system and working with organisational structures that support it is a completely different ball game.
> Even if you are an excellent engineer, making people do things, accept ideas, and in general hear you requires a completely different skill altogether - basically being a good communicator.
I was thinking like this for a while, but now I think this expectation is majorly false for a senior individual contributor, especially one who can push out a detailed series of blog posts and has tried step-wise escalation.
Communication is a two-way street. Unlike the individual contributors, management is responsible for listening and responding to risk assessments by senior members, and for ensuring that technical competence and experienced people are retained in a tech company. If a leader doesn't want to keep an open ear, they do not belong there. If there is huge attrition of highly senior people from unfinished projects, you do not belong in leadership either. Both cases are mentioned in the article.
Unfortunately, our socioeconomic and political culture in the West has increasingly removed responsibility and liability from the leadership of companies. This results in people with lackluster technical, communication, and risk-assessment skills being promoted into leadership positions.
So outside of a couple of completely privately owned companies or exceptionally well organized NGOs, it will be increasingly difficult to find good leaders.
The truth is, only small companies build good stuff. Once a company becomes big enough, the main product that it originally started on is the only good thing that is worth buying from them - all new ventures are bound to be shit, because you are never going to convince people to break out of status quo work patterns that work for the rest of the company.
The only exception to this has been Google, which seems to isolate the individual sectors a lot more and let them have more autonomy, with less focus on revenue.
I did not get that impression at all. He mentioned quite a few conversations with partner level employees, technical fellow, principal managers.
The impression I got is he tried to fix things, but the mess is so widespread and decision makers are so comfortable in this mess that nobody wants to stick their necks out and fix things. I got strong NASA Challenger vibes when reading this story…
Well, part 3 at least explains something I've observed: the platform is incredibly unstable. The same calls, with the same parameters, will often randomly fail with HTTP 400 errors, only to succeed later (hopefully without involving support). That made provisioning with Terraform a nightmare.
I won't even dive too much into all the braindead decisions. Mixing SKUs often isn't allowed if some components are 'premium' and others are not, and not everything is compatible with all instances. In AWS, if I have any EBS volume I can attach it to any instance, even if it is not optimal. There's no faffing about "premium SKUs". You won't lose internet connectivity because you attached a private load balancer to an instance. Etc...
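One coping pattern for the transient failures described above is to wrap every control-plane call in bounded retry with jittered exponential backoff, and, against all normal practice, to treat 400 as potentially transient, because here it sometimes is. A minimal, hypothetical sketch, not tied to any specific SDK:

    import random
    import time

    # 400 is included deliberately: on a flaky control plane it is sometimes
    # a transient failure rather than a genuinely malformed request.
    TRANSIENT = {400, 408, 429, 500, 502, 503, 504}

    def call_with_backoff(request_fn, max_attempts=6, base_s=1.0, cap_s=30.0):
        """Retry request_fn (anything returning an object with .status_code)
        with full-jitter exponential backoff on transient-looking statuses."""
        for attempt in range(1, max_attempts + 1):
            resp = request_fn()
            if resp.status_code < 400 or resp.status_code not in TRANSIENT:
                return resp  # success, or an error not worth retrying
            if attempt == max_attempts:
                return resp  # retry budget exhausted; surface the failure
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)].
            time.sleep(random.uniform(0, min(cap_s, base_s * 2 ** attempt)))

Provisioning tools have their own retry layers, but anything calling the APIs directly ends up re-implementing something like this.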
At my company, I've told folks that are trying to estimate projects on Azure to take whatever time they spent on AWS or GCP and multiply by 5, and that's the Azure estimate. A POC may take a similar amount of time as any other cloud, but not all of the Azure footguns will show themselves until you scale up.
I was always very curious why people use Azure. Clunky, difficult to set up, and crazy prices. I know a person who was very happy with them because of the credits they gave him. I felt I probably didn't have a model that explains what is going on there, and it would be cool to know why people pay them versus the competition.
> Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
Rust really going for the node ecosystem's crown in package number bloat
Rust is nowhere close to Node in terms of package number bloat. Most Rust libraries are actually useful and nontrivial and the supply chain risk is not necessarily as high for the simple reason that many crates are split up into sub-crates.
For example, instead of having one library like "hashlib" that handles all different kinds of hashing algorithms, the most "official" Rust libraries are broken up into one for sha1, one for sha2, one for sha3, one for md5, one for the generic interfaces shared by all of them, etc... but all maintained by the same organization: https://github.com/rustcrypto/
Most crypto libraries do the same. Ripgrep split off aho-corasick and memchr, the regex crate has a separate pcre library, etc.
Maybe that bumps the numbers up if you need more than one algorithm, but predominantly it is still anti-bloat and has a purpose...

Start with tokio. Please vend one dependency, batteries included, and vendor in/internalize everything, thanks.
There is a difference between individual packages coming out of a single project (or even a single Cargo workspace) vs them coming out of completely different people.
The former isn't a problem, it is actually desirable to have good granularity for projects. The latter is a huge liability and the actual supply chain risk.
For example, the Tokio project maintains another popular library called Prost for Protobufs. I don't think having those as two separate libraries with their own sets of dependencies is a problem. As long as the Tokio developers' expertise and testing culture go into Prost, it is not a big deal to have multiple packages. Similarly, different components of Tokio itself can be different crates; as long as they are built and tested together, them being separate dependencies is GOOD.
Now, to use Prost with a gRPC server, I need a different project: tonic, which comes from a different vendor: Hyperium. This is an increased supply-chain risk that we need to vet. They use Prost. They also use the "h2" crate. Now I need to vet the code quality and the testing culture of multiple different organizations.
I have a firm belief that the actual People >>> code, tooling, companies, and even licensing. If a project doesn't have (or retain) visionary and experienced developers who can instill good culture, it will ship shit code. So vetting organizations >> vetting individual libraries.
Power Platform is of the same quality, I’d avoid it if possible.
I was a principal engineer in the Power Platform org and it always felt like a disorganized mess. Multiple reorganizations per year, changing priorities and service ownership.

Uh...yeah. I think we all realized that years ago.
> That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
> My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
> I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node
This is most corporates. I'm sure this was celebrated as a successful project, with congratulations to everyone, along with big bonuses, RSUs, raises, and promotions, mostly into other orgs to bring this kind of 'success' to other projects (or other companies). These people are mostly gone in less than 2 years. They continue to collect 'wins'.
The VPs are dumb as shit, but they need 'successful' projects that have fancy names that they can present to their exec team.
The 173 agents are to give wins to a large number of people and teams, all these people contributed to this successful project.
If it continues, there will be a lessons-learned PowerPoint, followed by 10x growth in headcount, promotions for everyone, and doubling down. 270 people can deliver a baby in 1 day, and all that.
> This group was now tasked with moving their inherited stack to the new Azure Boost accelerator environment, an effort Microsoft had publicly implied was well underway at Ignite conferences since 2023.
The goal is to attach your projects to something announced by the CEO and ride the career rocketship!
I just do not understand how Azure has the scale it does. You only need to log in and click around for a bit to see this is not a coherent system designed by competent people. Let alone try to actually build something on it.

Who are the customers? Who is buying this shit?
From my old experience in IT - people just default to Microsoft for everything. They don't want the hassle of learning anything else and assume better the devil you know. Glad I'm out of that world, but it's wild what people will put up with.
People and organizations that built things on top of Microsoft tech. Especially with a long history going back to NT times.
HN, YC, the startup environment, and academia are a Unix bubble. They all feed into each other, especially because Linux is gratis, which helped all of them deploy projects/products/papers cheaply. Unix systems traditionally lack much of the upper layers, so it is the responsibility of the company, persons, and developers to deal with the OS minutiae. You need sysadmins, devops, SREs. Those are common roles in this Unix bubble. The dependency chains here are usually flatter, since that keeps mid-term costs lower.
Other organizations, like governments and bigger orgs such as banks, prioritize having somebody else liable (i.e. somebody they can blame) and prefer not to hire technical competence into their orgs but to rely on other companies. This is where Microsoft gets a lot of clients. You buy a bunch of server licenses. Your Microsoft support person installs them and sets up IIS via the GUI. And then you just upload your code every now and then. The OS updates, the IIS server, etc. are all the responsibility of Microsoft and the middlemen companies. Minimal competence from the original org is required. There are multiple middlemen businesses who all give zero fucks about anything but whatever is immediately downstream from them. This is more usual in already publicly traded huge businesses. Moreover, the investors actually mandate certain things that only these kinds of layers of irresponsibility can deliver :) So you see this kind of switch happening towards IPOs.
Azure is Microsoft labeling things "cloud" and forcing the first paradigm over the second for Microsoft products. It got lots of support because shareholders liked it. I don't think the original NT design and Microsoft's business model were bad; they actually worked very well. However, shareholders gonna shareholder. So they pushed hard for Microsoft and their clients to move to the "cloud". Microsoft executives saw the huge profit and share-value potential of pushing Azure the brand, too. It was the AI of the 2010s, after all.
I've said it before and I'll say it again: I'm glad Rust has good package management, I really am. However, given that, it ends up forming a dependency-heavy culture. In situations like this it's hard to use dependencies because of the number of transitive dependencies some of them pull in. I really wish this would change. Of course, this is a social problem, so I don't expect a good answer to come of this...
Any complex system - and these cloud systems must be immensely complex - accumulates cruft and bloat and bugs until the entire thing starts to look like an old hotel that hasn't been renovated in 30 years.
It’s not inevitable. It absolutely happens without significant effort, but if you’ve been around the traps for long enough (in enough organisations), you get to see that the level of quality can vary widely. Avoiding the mud-pit does require a whole-org commitment, starting from senior leadership.
This story is more interesting, in my opinion, in how quickly things devolved and also how unwilling the more senior layers of the org were to address it. At a whole company level, the rot really sets in when you start to lose the key people that built and know the system. That seems to be what’s happening here, and it does not bode well for MS in the medium term.
Also, after this:

https://news.ycombinator.com/item?id=20341022

You continued to work at Microsoft and now there is this takedown?

I'm no friend of MS (to put it very mildly) but it seems to me your story is a bit inconsistent, as is the 7 year break between postings.

The comment comes from the input field on the post form. It was not clear it would show up as a comment. The old thread you refer to had little to do with Microsoft per se. Let me know if I can help with the inconsistencies you mention?
> Why do you speak about yourself in the third person?
When you submit a link to HN, there is an entry field for text in addition to the url.
It does not really describe what the text is used for. For links, the content of that field is simply added as the first comment.
Someone who is unfamiliar with the submission process may assume this field should describe what they are submitting, and not format it like a comment.
Then that text gets posted as the first comment and tons of people downvote it, jumping to the conclusion that the weird summary comment is from an AI, and not the submitter describing their own submission.
(I also assumed these comments were AI until someone else pointed this out)
I downvoted this comment for sounding like a summarizing LLM, not adding anything substantial beyond the title of the post, before realizing you were the poster and author.
What's your assessment of AWS and GCP? Do you think it's likely they suffer from some of the same issues (eg the manual access of what should be highly secure, private systems, the instability, the lack of security)?
As a former GCP engineer, no, the systems are not generally unstable or insecure.
There is definitely manual access of data - it requires what was termed “break glass” similar to the JIT mechanism described by the author. However, it wasn’t quite so loose; there were eventually a lot of restrictions on who could approve what, what access you got after approval, and how that was audited.
It was difficult to get access to the highest-sensitivity data; humans reviewed your request and would reject it if there wasn't a clear reason. And you could be 100% sure humans would review your session afterwards to look for bad behavior.
I once had to compile a large list of IP addresses that accessed a particular piece of data to fulfill a court order. It took me days of effort to get and maintain the elevated access necessary to do this.
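For readers who haven't seen a break-glass system: the essential shape is that elevation is a first-class, time-boxed object that cannot exist without a recorded justification, an independent approver, and an audit trail that outlives the session. A purely hypothetical sketch of that shape, not Google's actual implementation:

    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    @dataclass
    class BreakGlassGrant:
        """Time-boxed elevated access: no justification or independent
        approver means no grant; every action is recorded for review."""
        requester: str
        resource: str
        justification: str   # reviewed by a human, not matched by a regex
        approver: str        # must be someone other than the requester
        granted_at: datetime
        ttl: timedelta = timedelta(hours=4)
        audit_log: list = field(default_factory=list)

        def __post_init__(self):
            if self.approver == self.requester:
                raise ValueError("self-approval is not allowed")
            if not self.justification.strip():
                raise ValueError("a justification is required")

        def active(self, now):
            return now < self.granted_at + self.ttl  # expires automatically

        def record(self, action, now):
            # The log outlives the session so humans can audit it afterwards.
            self.audit_log.append((now.isoformat(), self.requester, action))

Every detail above is invented; the point is only that approval, expiry, and auditing are structural rather than optional.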
I have a lot of respect for GCP as an engineering artifact, but a significantly less rosy opinion of GCP as an organization and bureaucratic entity. The amount of wasted effort expended on engaging with and navigating the bureaucracy is truly mind-boggling, and is the reason why a tiny feature that took a day to code could take months to release.
His writing style is fairly over the top (he is Swiss, and I have seen this before, though not most of the time), but most of the technical content seems true to me.
Microsoft should have promoted this guy instead of laying him off.
Did Microsoft really lose OpenAI as a customer?
It didn’t get any better.