Postmortem Library

Optus: Fatal impact of firewall upgrade on 000 Calls

On 18 Sep 2025, an Optus firewall upgrade fault disrupted the emergency call path for Triple Zero (000), blocking roughly 600 calls over ~13–14 hours across several regions; while investigations continue, multiple deaths were reported during the outage window.

Company & product overview

Optus is Australia’s second-largest telecom provider and a critical carrier for routing emergency Triple Zero (000) calls. On 18 September 2025, a network change at Optus disrupted the special call path used to carry 000 calls, while normal voice calls largely continued to work. 

What happened?

During a routine firewall upgrade, a technical failure prevented some 000 calls from being carried on the Optus network. The impact spanned South Australia (SA), Western Australia (WA), and the Northern Territory (NT), along with a small number of calls from New South Wales (NSW) areas near the SA border that were routed via SA towers. Optus later confirmed that approximately 600 emergency calls were affected. Critically, authorities and media reported at least three, and later four, deaths linked to the outage period; coronial inquiries will determine causality.

Timeline

  • Start (approx.): Thursday, 18 Sep 2025 ~12:30 am local time, during a firewall upgrade. 
  • Detection and escalation: Initial testing reportedly showed normal calls connecting, and national call-volume monitors did not alarm. Visibility grew only as reports of failed 000 calls emerged, with customers and agencies raising concerns throughout the day; reports indicate those warnings were missed and escalation procedures were not followed promptly. A formal regulatory investigation commenced in the following days.
  • Resolution (approx.): Thursday, 18 Sep 2025, ~1:30 pm local time. Total window ~13–14 hours of degraded 000 carry capability. 
  • TTD (time to detect): Unclear and under investigation; publicly available accounts suggest hours passed before the incident was fully understood and escalated.
  • TTR (time to resolve): ~13–14 hours from change start to restoration. 
  • Who was affected: Approximately 600 attempted emergency calls were impacted. 

Impact on human life

Multiple fatalities were reported during the outage window, including cases in SA and WA. Officials and media initially reported at least three deaths, later four, with coronial investigations to confirm causation. Regardless of those findings, the event demonstrates that emergency-service outages translate directly into loss of life, not just financial or reputational damage.

How the company responded

Optus launched an internal investigation, issued public updates, and conducted welfare checks on households with failed 000 attempts after service restoration. Leadership publicly apologised and acknowledged the severity of the failure. Optus has stated it is enhancing monitoring and controls specific to emergency call routing. Regulatory probes by the Australian Communications and Media Authority (ACMA) are in progress.

Communication during the outage

Communications from Optus and notifications to governments and regulators drew criticism for timeliness and completeness. The ACMA expressed serious concern and opened a fresh compliance investigation. Government leaders at state and federal levels publicly condemned the failure and signalled additional oversight (e.g., a proposed Triple Zero Custodian with new powers and real-time outage reporting).

Key learnings for other organizations

  • Classify life-critical services. Identify workflows where failure can harm people (health, safety, security). Give them stricter change control, independent approvals, and a rehearsed, time-boxed rollback.
  • Prove changes with end-to-end tests. Run synthetic transactions on the actual critical path before and after each change, per region/user segment. Gate deployment on pass/fail “go/no-go” checks; a probe-gate sketch follows this list.
  • Design for graceful degradation. Build and continuously verify fallbacks (alternate providers, paths, features). Alarm on any diversion failure, not just total outages.
  • Monitor what matters, not just volumes. Create dedicated SLOs for critical paths (success rate, time to connect and to complete, abandonment and ring-no-answer patterns); a sliding-window SLO check is sketched after this list.
  • Escalate early, even with partial data. Define SLAs to notify internal leadership, frontline teams, regulators, partners, and affected customers as soon as credible impact is detected. Use pre-approved templates and contact lists.
  • Protect people during impact. Have a “welfare-check” or “at-risk user” playbook: quickly extract failed/abandoned attempts, triage by risk, and hand off to the right responders via secure channels (see the triage sketch below).
  • Map dependencies and single points of failure. Keep an up-to-date service map (infrastructure, vendors, authentication, DNS, payments, messaging). Add circuit breakers, bulkheads, and rate limits to contain the blast radius; a circuit-breaker sketch follows below.
  • Instrument auditability. Log every step on critical paths with enough detail for rapid triage and for compliance reviews; retain evidence linked to change tickets and runbooks (an audit-logging sketch follows below).
  • Commit to post-incident accountability. Run blameless root cause analysis with clear owners, deadlines, and verification of fixes. Track regulatory and contractual obligations at the board level.
  • Keep humans at the center. In status pages, IVRs, and apps, offer practical alternatives (backup channels, emergency contacts, offline instructions) and update frequently – clarity and empathy reduce harm.
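
To make the end-to-end testing point concrete, here is a minimal Python sketch of a go/no-go gate that runs synthetic probes over the emergency call path, per region, before and after a change. The place_test_call hook, the regions, and the thresholds are illustrative assumptions for this sketch, not Optus’s actual tooling.

```python
# Minimal sketch: a go/no-go gate driven by per-region synthetic probes of the
# emergency call path. Everything here (regions, thresholds, the test hook) is
# an assumption for illustration.
import sys
from dataclasses import dataclass

REGIONS = ["SA", "WA", "NT", "NSW-border"]
PROBES_PER_REGION = 5
REQUIRED_SUCCESS_RATE = 1.0   # emergency path: any failure blocks the rollout

@dataclass
class ProbeResult:
    region: str
    connected: bool
    setup_seconds: float

def place_test_call(region: str) -> ProbeResult:
    """Hypothetical hook: dial a designated test number over the emergency
    call path in `region`. Replace with the carrier's real test harness;
    here we simulate a successful probe so the sketch runs end to end."""
    return ProbeResult(region=region, connected=True, setup_seconds=2.5)

def gate(change_id: str) -> bool:
    """Return True only if every region passes its synthetic probes."""
    for region in REGIONS:
        successes = 0
        for _ in range(PROBES_PER_REGION):
            result = place_test_call(region)
            if result.connected and result.setup_seconds < 10.0:
                successes += 1
        rate = successes / PROBES_PER_REGION
        print(f"{change_id} {region}: {rate:.0%} probe success")
        if rate < REQUIRED_SUCCESS_RATE:
            return False  # no-go: halt the rollout and trigger rollback
    return True

if __name__ == "__main__":
    sys.exit(0 if gate("FW-UPGRADE-2025-09-18") else 1)
```

Run before the change to baseline, and again immediately after; a non-zero exit code is the rollback trigger.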
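
The “monitor what matters” item can be expressed as a sliding-window SLO check on the critical path itself rather than on overall call volume, which is exactly the signal that stayed quiet in this incident. The field names, window, and thresholds below are assumptions for illustration.

```python
# Minimal sketch: evaluate emergency-call success and abandonment over a
# sliding window, independent of total call volume.
from collections import deque
from dataclasses import dataclass
from time import time

WINDOW_SECONDS = 300          # evaluate the last 5 minutes
MIN_SUCCESS_RATE = 0.99       # assumed SLO: 99% of attempts must connect
MAX_ABANDON_RATE = 0.02

@dataclass
class CallAttempt:
    timestamp: float
    connected: bool
    abandoned: bool

class EmergencyPathSLO:
    def __init__(self) -> None:
        self.attempts: deque[CallAttempt] = deque()

    def record(self, attempt: CallAttempt) -> None:
        self.attempts.append(attempt)
        self._trim(attempt.timestamp)

    def _trim(self, now: float) -> None:
        while self.attempts and self.attempts[0].timestamp < now - WINDOW_SECONDS:
            self.attempts.popleft()

    def evaluate(self, now: float | None = None) -> list[str]:
        """Return a list of SLO breaches for the current window."""
        self._trim(now if now is not None else time())
        total = len(self.attempts)
        if total == 0:
            # Silence is itself suspicious on an always-on path.
            return ["no emergency-call attempts observed in window"]
        breaches = []
        success_rate = sum(a.connected for a in self.attempts) / total
        abandon_rate = sum(a.abandoned for a in self.attempts) / total
        if success_rate < MIN_SUCCESS_RATE:
            breaches.append(f"success rate {success_rate:.1%} below SLO")
        if abandon_rate > MAX_ABANDON_RATE:
            breaches.append(f"abandon rate {abandon_rate:.1%} above SLO")
        return breaches
```

Any non-empty list from evaluate() would page the on-call team, even if national call volumes look normal.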
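
For the welfare-check playbook, one concrete step is pulling failed or abandoned emergency-call attempts from call records, ranking them by simple risk signals, and producing a callback list for responders. The record fields and risk rules below are assumptions for illustration, not Optus’s actual process.

```python
# Minimal sketch: build a risk-ranked welfare-check list from failed 000
# attempts. Fields and scoring are illustrative assumptions.
import csv
from dataclasses import dataclass

@dataclass
class FailedAttempt:
    caller: str              # phone number of the failed 000 attempt
    region: str
    attempts: int            # how many times this caller retried
    reached_000_later: bool  # did a later attempt get through?

def risk_score(a: FailedAttempt) -> int:
    """Higher score = check on this caller first."""
    score = a.attempts                  # repeated retries suggest urgency
    if not a.reached_000_later:
        score += 10                     # never got through at all
    return score

def build_callback_list(attempts: list[FailedAttempt]) -> list[FailedAttempt]:
    return sorted(attempts, key=risk_score, reverse=True)

def export_for_responders(attempts: list[FailedAttempt], path: str) -> None:
    """Hand off via a file here; in practice this would be a secure channel."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["caller", "region", "attempts", "reached_000_later"])
        for a in build_callback_list(attempts):
            writer.writerow([a.caller, a.region, a.attempts, a.reached_000_later])
```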
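
For the dependency-mapping item, a circuit breaker is one standard way to contain the blast radius of a failing dependency: fail fast, divert to a fallback, and retry only after a cool-down. The thresholds and reset policy below are illustrative.

```python
# Minimal sketch: a circuit breaker that fails fast after repeated errors and
# allows a single trial call after a cool-down period.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast, use fallback")
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0               # success closes the circuit again
        return result
```

The caller catches the fast failure and switches to the alternate provider or path, which is exactly the diversion that should itself be alarmed on if it fails.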
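
For the auditability item, structured, timestamped records linked to the change ticket make post-incident triage and compliance review far faster. The schema and field names below are a hypothetical example, not a known Optus format.

```python
# Minimal sketch: one structured audit record per critical-path step, linked
# to the change ticket that was active at the time.
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("emergency_path.audit")
audit.setLevel(logging.INFO)
audit.addHandler(logging.StreamHandler())  # in production: a durable, append-only sink

def audit_event(step: str, outcome: str, *, change_ticket: str,
                call_ref: str | None = None, **detail) -> None:
    """Emit one structured, timestamped record per critical-path step."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "step": step,                 # e.g. "route_lookup", "handover"
        "outcome": outcome,           # "ok" | "failed" | "diverted"
        "change_ticket": change_ticket,
        "call_ref": call_ref,
        **detail,
    }
    audit.info(json.dumps(record))

# Example usage during a change window (values are hypothetical):
# audit_event("route_lookup", "failed", change_ticket="CHG-12345",
#             call_ref="000-attempt-7781", region="SA", reason="no route")
```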

Summary

On 18 Sep 2025, an Optus firewall upgrade fault disrupted the emergency call path for Triple Zero (000), blocking roughly 600 calls over ~13–14 hours across several regions; while investigations continue, multiple deaths were reported during the outage window. The failure stemmed from a technical defect specific to the emergency-routing path, which is distinct from normal voice service, and monitoring did not surface the problem early enough for rapid mitigation. Optus apologised, initiated welfare checks, and cooperated with investigations as governments signalled stronger oversight. This was not only a costly operational breakdown but a human tragedy.
