When the Cloud Goes Dark: Keeping the Lights On 

In this week’s newsletter: a deep dive into the recent AWS and Microsoft cloud outages, what caused them, their ripple effects on businesses, and how (theoretically) to keep the lights on next time. As usual I also attempt to cover key Microsoft product updates (Azure, M365, Copilot) and a brief recap on our fortnightly (or bi-weekly, if you prefer) podcast, Cloudy with a Chance of Insights, which this week also had a slightly more personal feel than usual.  

Deep Dive: Outages at AWS and Microsoft – Causes & Impact 

AWS’s “Daylong” Outage – What Happened? 

In the early hours of 20th October, AWS experienced a severe incident starting in its US-EAST-1 (Northern Virginia) region. AWS first reported “DNS problems” with DynamoDB at 3:11 am ET, indicating that many services couldn’t reach this core database. By around 5 am ET, over 70 AWS services were impacted, and user reports of broken websites spiked into the millions. AWS’s own status dashboard noted an “operational issue” affecting multiple systems, and engineers were “working on multiple parallel paths to accelerate recovery.” Despite some early signs of improvement, issues persisted through the morning. AWS eventually acknowledged that the root cause was a fault in an internal network subsystem (specifically, the EC2 load balancer health check system), which in turn caused the DynamoDB DNS failures that rippled across AWS’s cloud.

AWS engineers took drastic steps to stabilise things: they even throttled new EC2 instance launches (essentially pausing certain customer workloads) to prevent further strain while fixing the core problem. By late afternoon, recovery was well underway, and by 6 pm ET AWS announced all services “returned to normal operations.” In total, the outage and subsequent knock-on issues lasted for around 15 hours. 

What was the Impact? 

The disruption was global. Thanks to AWS’s ubiquity, a who’s who of the internet went down. Users found Disney+, Reddit, Snapchat, the New York Times, Ring doorbell cams, the McDonald’s app, Venmo, Coinbase, and even Slack either completely unresponsive or severely degraded. British government sites like GOV.UK and HMRC were knocked offline. Major airlines (United, Delta) had their booking systems and apps malfunctioning; students couldn’t submit homework because Canvas (a learning platform) was down; gamers saw Roblox, Fortnite, and even some PlayStation network features fail. Crucially, Amazon’s own operations were impacted too: its retail site had intermittent issues, and internal tools at Amazon warehouses stopped working. One report tallied over 1,000 services affected, with an astonishing 6.5 million outage reports logged by users worldwide. This was easily one of the most disruptive cloud incidents in recent memory, and it dramatically demonstrated how much of our digital world relies on AWS behind the scenes.

AWS has promised a detailed post-mortem (which many of their biggest customers will no doubt pore over), and industry experts are already calling the outage “inevitable” given AWS’s scale and complexity. Still, “inevitable” doesn’t soften the blow for the businesses that effectively lost a day of revenue and customer trust. 

Microsoft’s Azure Outage – A 30% Capacity Crash 

Just 11 days before AWS’s issues, Microsoft had its own cloud hiccup on 9th October. Starting at about 07:40 UTC that day, the Azure Portal (the web interface for managing Azure services) became intermittently unavailable. Administrators across Europe, the Middle East and Africa found that portal pages wouldn’t load, or they encountered bizarre errors. Microsoft quickly identified a significant capacity loss (~30%) in Azure Front Door (AFD), Azure’s global content delivery and load balancing network. In plainer terms, nearly a third of the servers that route traffic for Azure’s front-end went down, primarily affecting those EMEA regions.

Root Cause

Surprisingly (and perhaps embarrassingly for Microsoft), the culprit was a crash in some underlying Kubernetes instances that support Azure’s edge networking. In an official update, Microsoft confirmed the outage was “due to a dependency on some underlying Kubernetes instances that crashed.” Importantly, they also noted no new code deployment triggered it – this wasn’t a case of a buggy update; it was the platform itself failing unexpectedly. Still, losing one-third of capacity because a few K8s nodes crashed is, as The Register quipped, “less than ideal.” A robust design should have self-healed or redistributed load automatically – apparently that didn’t happen fully, which raises questions about Azure’s resiliency in this part of its architecture. 
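To see why a ~30% capacity loss is so hard to absorb, here is a back-of-the-envelope sketch. It assumes traffic is spread evenly across the surviving nodes; the utilisation figures are purely illustrative and are not Microsoft’s numbers, and the function names are our own.

```python
def utilisation_after_loss(utilisation_before: float, capacity_lost: float) -> float:
    """Average utilisation of the surviving nodes once the same load is
    spread evenly across them. Both arguments are fractions (0.30 = 30%)."""
    if not 0 <= capacity_lost < 1:
        raise ValueError("capacity_lost must be in the range [0, 1)")
    return utilisation_before / (1 - capacity_lost)


def max_safe_utilisation(capacity_lost: float) -> float:
    """Highest pre-failure utilisation that still fits on the survivors."""
    if not 0 <= capacity_lost < 1:
        raise ValueError("capacity_lost must be in the range [0, 1)")
    return 1 - capacity_lost


if __name__ == "__main__":
    # Illustrative pre-failure utilisation levels (assumptions, not real data).
    for before in (0.50, 0.65, 0.75):
        after = utilisation_after_loss(before, 0.30)
        status = "OK" if after <= 1.0 else "OVERLOADED"
        print(f"{before:.0%} before -> {after:.0%} after ({status})")
```

Under these assumptions, a fleet needs to run at or below roughly 70% utilisation to ride out a 30% loss without shedding traffic, which is why spare headroom matters as much as automatic rebalancing.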

What was the Impact? 

This Azure incident, while serious, was more contained than the AWS one. It mainly affected management and admin services rather than consumer apps. The Azure Portal was the biggest victim – many admins couldn’t access their cloud resources through the usual web interface for several hours. Related admin hubs like Microsoft Entra ID (Azure AD) and some Microsoft 365 admin centers, which also use AFD, were similarly unreliable. End-user applications hosted on Azure saw mixed effects: some web apps that rely on Front Door had issues (like users getting timeouts or certificate errors), but core Azure services running in the backend kept running. In essence, Azure’s control plane was hit, not the data plane. Still, for IT teams in Europe that morning, it was a pain – imagine not being able to manage or troubleshoot your cloud services during a workday. 

Microsoft’s status updates recommended using alternative methods (like CLI commands or PowerShell) to manage resources while the portal was down. On social media, a few people humorously vented: one user was trying to cancel an Xbox Game Pass subscription but couldn’t because the account portal was down; others trying to check on cloud workloads got “connectivity lost” messages. Not life-or-death situations, but frustrating. The good news is that Microsoft’s engineers resolved the issue relatively quickly. Essentially, they hit the big red restart button on the affected Kubernetes clusters and rebalanced the traffic. Services began recovering incrementally; by late morning, the Azure Portal was largely working again for most, and Microsoft even executed a failover of the M365 admin portal to bypass Front Door temporarily. By early afternoon on 9th Oct, Microsoft declared the incident fully mitigated. They did not, however, give a detailed explanation of why those K8s instances crashed or why the automatic failover didn’t kick in – at least not publicly. (We may have to wait for a post-incident report for those juicy details.) 

Outages at a Glance 

To put these two incidents side by side, here’s a quick comparison of the AWS and Azure outages:

| Cloud Incident | Date (Duration) | Cause | Notable Impacts (Scope) |
|---|---|---|---|
| AWS (US-East-1) | 20 Oct 2025 (≈15 hours) | Faulty internal network subsystem for load balancer health checks ⇒ DNS failures in core DB service (DynamoDB). | 1,000+ services affected globally. Major websites/apps down (Disney+, Reddit, Snapchat, Ring, Coinbase, Delta, etc.). Even Amazon’s own services (Amazon.com, warehouse systems) were disrupted. Essentially a large chunk of the internet went offline for much of the day. |
| Microsoft Azure | 9 Oct 2025 (few hours, AM) | Crash of underlying Kubernetes instances supporting Azure Front Door (causing ~30% capacity loss). No bad update – a platform failure. | Azure Portal and related admin portals down across EMEA. Some customer web apps using Front Door saw errors (timeouts, TLS issues). Impact mostly limited to management interfaces rather than end-user services. Significant inconvenience for admins, but not a consumer-facing meltdown. |

As the table shows, the AWS outage was far more extensive in impact and duration. Azure’s was serious but chiefly an administrative headache and region-specific. However, both incidents underscore the vulnerabilities of cloud infrastructure – even the best engineered systems have weak points, and when they fail, the consequences can be far-reaching. 

Why Do These Outages Matter? (Lessons for Business) 

When two cloud giants stumble so close together, it’s a stark reminder for anyone relying on cloud services (i.e., almost every organisation): outages happen, and you need to plan for them.

In short, no cloud is immune to failure. The digital economy’s backbone rests on systems that, while highly reliable, are not unbreakable. For businesses, an hour of downtime can mean millions in lost revenue or a dent in customer confidence. So, these events reinforce the importance of resilience: ensuring your organisation can weather a cloud outage with minimal damage. That brings us to the next topic – how to keep things running when the cloud (or a part of it) goes dark. 

Building Resilience: How to Keep the Lights On 

What can organisations do to mitigate the impact of cloud outages? Here are some strategies for resilience and business continuity in a cloud-centric world: 

  • Multi-Region Deployments (within one cloud): If you’re all-in on a single cloud provider, at least avoid being all-in on a single region. Deploy critical applications across multiple regions of that provider. For example, a workload might primarily run in AWS’s Dublin region but fail over to AWS London if needed. This protects against an outage confined to one data centre or region (like Azure’s issue in EMEA). However, be aware that some failures can be cross-regional (e.g., if they involve global services or control planes). Still, multi-region architecture is a fundamental first step for higher availability. Just be sure to test your failover process regularly – you don’t want the first failover to be during a real incident. 
  • Hybrid Cloud Setup: Don’t put all your eggs in the public cloud basket. Many enterprises maintain some on-premises or private cloud infrastructure as a backup for critical workloads. In a hybrid model, if your cloud provider has an outage, you can temporarily shift certain operations back to your own data centre. For instance, a bank might run its core transaction system on Azure normally, but have an on-premises system that can kick in if Azure goes down. Hybrid setups can ensure at least skeleton operations continue during a public cloud blip. The downside is cost and complexity – essentially running parallel environments – but for truly mission-critical systems, the investment can be worth it. Think of it as having a generator when the power grid fails. 
  • Multi-Cloud Strategy: This is the ultimate form of redundancy: using two or more cloud providers so that if one goes down, the other can handle your workloads. The appeal is obvious – you’re never totally at the mercy of one vendor. In fact, the idea got a PR boost this month: after Microsoft’s recent outages, Google actively pitched a business continuity plan where Google Workspace could run alongside Microsoft 365 to keep companies going if Microsoft has an incident. Multi-cloud can mean actively dividing your services across providers (e.g., your app runs simultaneously in AWS and Azure and users switch to the one that’s up) or having a hot-standby (Cloud B is idle until Cloud A fails). Many organisations love the concept of multi-cloud resiliency, but it’s challenging to implement. You need to design applications to be cloud-agnostic, synchronise data across clouds, and double up on expertise (AWS and Azure have different quirks, for example). It also tends to be pricier, since redundancy is expensive by nature. A practical compromise is to go multi-cloud for only your most critical customer-facing services, or to use one cloud as primary and another as a disaster recovery target. The key is to balance risk vs. cost: if an hour of downtime costs you more than running two clouds, then multi-cloud might be a smart move. 
  • Business Continuity Planning & Drills: Technology aside, having a solid business continuity plan (BCP) is vital. This means defining procedures for various outage scenarios – not just technical steps to recover systems, but also communication plans (who informs customers and how), manual workarounds for staff (can your sales team take orders on paper if your cloud-based system is down?), and so on. A plan is just paper unless practiced, so regular drills or simulations are recommended. Some companies do “game days” where they intentionally simulate a cloud provider outage to see how their teams and systems respond. These drills often reveal gaps (perhaps a dependency you overlooked, or confusion in roles) that can be fixed before a real incident. When done well, a prepared team can turn what would be a panicky blackout into a more controlled, even routine failover event. The motto here is “hope for the best, plan for the worst.” We’ve seen in these outages that those who had emergency playbooks (and perhaps alternate systems ready) fared much better than those scrambling on the fly. 
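The failover idea behind these strategies, and the “game day” drill that tests it, can be sketched in a few lines. This is a minimal illustration rather than a production design: the hostnames are hypothetical, the health check is injected as a plain function so the drill needs no network access, and a real setup would more likely rely on DNS-level or load-balancer failover than on application code.

```python
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical endpoints in priority order: primary region, secondary
# region, then an on-premises fallback. None of these hosts are real.
ENDPOINTS = [
    "https://app.eu-west-1.example.com",   # e.g. Dublin (primary)
    "https://app.eu-west-2.example.com",   # e.g. London (secondary)
    "https://app.onprem.example.com",      # on-prem fallback
]


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its
    health check, or None if every endpoint is down."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that errors out counts as unhealthy.
            continue
    return None


def simulate_outage(down: Iterable[str]) -> Callable[[str], bool]:
    """Build a fake health check for a drill: endpoints in `down` fail."""
    down_set = set(down)
    return lambda endpoint: endpoint not in down_set


if __name__ == "__main__":
    # Normal day: the primary region serves traffic.
    print(pick_endpoint(ENDPOINTS, simulate_outage([])))
    # Game day: pretend the primary region is dark.
    print(pick_endpoint(ENDPOINTS, simulate_outage([ENDPOINTS[0]])))
    # Worst case: everything is down, so the caller must handle None.
    print(pick_endpoint(ENDPOINTS, simulate_outage(ENDPOINTS)))
```

Because the health check is swappable, the same selection logic can run against real probes in production and against simulated outages in a drill, which is the point of rehearsing failover before you need it.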

To summarise the approaches, here’s a quick comparison: 

| Strategy | What It Involves | Pros | Cons & Challenges |
|---|---|---|---|
| Multi-Region (Single Cloud) | Deploying your application across multiple regions of one cloud provider (e.g., split traffic between AWS N. Virginia and AWS Oregon). | Protects against a datacentre or regional outage; Data stays within same provider (ease of integration). | May not help with provider-wide issues (e.g., if AWS’s control plane fails globally); Requires architecture to handle data replication and latency between regions; slightly higher cost to run in multiple regions continuously. |
| Hybrid Cloud | Using a mix of public cloud and private/on-premises systems. Keep critical functions on-prem as backup. | Full control over backup environment; Not relying solely on cloud; Can use existing infrastructure investments; Data sovereignty assured on-prem. | Complex to maintain two environments; Data consistency between cloud and on-prem needs effort; Fallback may be limited in capacity compared to cloud scale; Typically slower to scale on-prem if needed. |
| Multi-Cloud | Utilizing two or more different cloud providers for the same application or service, in active-active or active-passive modes. | No single point of failure on a provider; Leverage best-of-breed services from each; Competitive leverage (you aren’t locked into one vendor); Can achieve truly high uptime if done right. | Highest complexity – different APIs, skill sets, and infrastructures to manage; Application design must be cloud-neutral (or use multi-cloud abstractions) which can limit using unique features; Significant duplication of data and efforts; Higher costs (paying two providers, data egress charges to sync between clouds, etc.). |

There’s no one-size-fits-all answer; often the right approach is a blend. For example, a company might be primarily on AWS (single cloud) but run its database in multi-AZ mode (zone-level redundancy within a region, a step short of full multi-region) and take a daily offline export that could be loaded into an on-prem system (hybrid DR) if absolutely necessary. The common theme is avoiding total reliance on any one component. Given the recent outages, many businesses are re-evaluating their architectures. Even Google is in on the discussion – its pitch for Workspace alongside Microsoft 365 was a not-so-subtle nod at multi-cloud thinking. At minimum, every organisation should review its failure modes: what actually happens if provider X goes down? What’s our backup plan? It’s better to workshop those scenarios now than to learn the hard way later.
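One way to workshop the risk-versus-cost question is to put rough numbers on it. The sketch below is a simple break-even calculation and nothing more: every figure in it is a placeholder assumption (the 15 hours borrows the approximate length of the AWS outage, the monetary amounts are invented), and it ignores softer costs such as lost customer trust.

```python
def break_even_cost_per_hour(extra_annual_spend: float,
                             hours_avoided_per_year: float) -> float:
    """Hourly downtime cost at which the extra redundancy spend pays for itself."""
    if hours_avoided_per_year <= 0:
        raise ValueError("hours_avoided_per_year must be positive")
    return extra_annual_spend / hours_avoided_per_year


def redundancy_pays_off(cost_per_hour: float,
                        hours_avoided_per_year: float,
                        extra_annual_spend: float) -> bool:
    """True if the downtime cost avoided exceeds the extra yearly spend."""
    return cost_per_hour * hours_avoided_per_year > extra_annual_spend


if __name__ == "__main__":
    # Placeholder assumptions: replace these with your own estimates.
    extra_spend = 250_000.0   # extra yearly cost of a second cloud or region
    hours_avoided = 15.0      # downtime hours per year the redundancy would avoid

    threshold = break_even_cost_per_hour(extra_spend, hours_avoided)
    print(f"Break-even downtime cost: {threshold:,.0f} per hour")

    for cost_per_hour in (5_000.0, 20_000.0):
        worth_it = redundancy_pays_off(cost_per_hour, hours_avoided, extra_spend)
        print(f"Downtime at {cost_per_hour:,.0f}/hour -> redundancy pays off: {worth_it}")
```

If your estimated hourly cost of downtime sits well above the break-even figure, the more expensive strategies in the table deserve a serious look; if it sits well below, multi-region within one provider may be enough.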


Microsoft Cloud Updates – Oct 2025 Round-Up 

Alright, on to some brighter news: despite the turbulence, the tech world keeps innovating. Here are a few Microsoft cloud updates from the past fortnight that our enterprise readers might find interesting: 

  • Azure AI & Automation: Microsoft’s AI push continues. Azure AI Foundry (the AI development suite on Azure) has introduced support for next-gen models, including GPT-5 Codex for advanced code generation and a new GPT “realtime” model for ultra-fast inferencing. They’ve also added nifty browser-automation tools to make building intelligent web workflows easier. Meanwhile, AKS Automatic (an autopilot mode for Azure Kubernetes Service) has reached General Availability. This essentially lets Azure handle a lot of the “plumbing” of K8s for you – scaling, patching, node management – so developers can spend more time on apps and less on cluster babysitting. Given the Azure outage above was K8s-related, one hopes AKS Automatic has some extra resilience baked in! 
  • Multi-Cloud & Data: The previously unlikely Microsoft-Oracle friendship is thriving. Oracle Database@Azure – which lets you run Oracle’s database services natively on Azure – now spans 23 Azure regions (up from 12 at launch). This expanded regional availability means customers can deploy Oracle workloads closer to home in more parts of the world, improving performance and offering better failover options. Essentially, Microsoft and Oracle are jointly addressing enterprise demand for multi-cloud solutions: you get Oracle’s tech but on Azure’s global infrastructure, which appeals to companies trying to simplify their cloud estate. In a related note, new data integration features were announced that link Oracle DB with Microsoft Fabric (Azure’s data analytics platform) for real-time analytics without complex ETL. It’s clear Microsoft sees value in playing nice with other ecosystem giants when it helps customers’ cloud migrations. 
  • Microsoft 365 & Copilot Updates: On the M365 front, the big buzz is AI integration. Microsoft’s new AI assistant, Copilot, is getting even smarter and more widespread. A recent under-the-hood upgrade has boosted Copilot’s core AI model to GPT-5, meaning it can produce more accurate and contextually nuanced results when helping you draft emails, create PowerPoints, analyse data, etc. Perhaps more noticeable for users, Microsoft announced it will start automatically installing the Microsoft 365 Copilot app on Windows 11 devices that have Microsoft 365 apps. This rollout began in early October and will continue through mid-November. In practice, that means if you’re a Microsoft 365 subscriber, you’ll suddenly see a Copilot icon appear on your taskbar or Start menu one day, ready to assist you. (Administrators can opt out if they wish, but Microsoft is being pretty assertive in pushing Copilot to everyone.) The goal is to integrate AI assistance seamlessly into daily workflows. Early testers report huge time savings on tasks like summarising long documents or generating first drafts of proposals via a simple prompt to Copilot. We’re watching this closely: it could be transformative, but it also raises questions about user training and trust in AI outputs, which many organisations are now grappling with.
  • Azure and Beyond: A quick miscellany – Microsoft Ignite, the big annual tech conference, kicks off on 18th November. Expect major announcements around AI, cloud services, and perhaps some security features (given the spotlight on reliability lately, maybe something about new failover capabilities?). Azure’s partnership with VMware is evolving: new licensing changes now allow customers to bring their own VMware licenses to Azure VMware Solution (potentially lowering costs for those hybrid setups). And on the developer tooling side, GitHub (owned by Microsoft) is integrating further with Azure DevOps, making it easier to manage CI/CD pipelines for AI applications – a nod to the trend of AI-driven software development. We’ll cover the Ignite highlights in our next edition, but it’s safe to say AI and cloud scalability will remain front and centre. 

Cloudy with a Chance of Insights | EP22

This week, Richard kicks things off with a personal update, before launching into a philosophical ramble about why he’s doing a podcast at all. Spoiler: it’s not for the likes, but if you fancy pressing that button, it does help others find us 🙂

David then dives into Microsoft’s TypeChat (think: making AI output less of a fever dream and more of a structured answer), DeepSpeed (the unsung hero keeping Copilot’s LLMs from collapsing under their own weight), and MarkItDown (finally, a tool to turn your document chaos into Markdown sanity).

Cyrus brings his usual blend of technical depth and existential musings, covering the big shift (Microsoft Entra as the new source of authority for identity), passwordless guidance, Intune’s new security baselines, and Purview Insider Risk’s new network-based detection for cloud apps and GenAI.

Along the way, we reminisce about Commodore 64s, DOS pranks, and the joys of wiring up PC speakers (or not), and we chew over the state of search engines, the rise of GenAI, and why none of us are likely to become TikTok influencers anytime soon.

YouTube: https://www.youtube.com/@CloudyWithaChanceofInsights
Spotify: https://spoti.fi/3D5jBLs
Apple: https://apple.co/49kBxxL

