
Salesforce outage June 2025: unsanctioned OS update disrupted networking

An automated OS update broke network routes across Heroku hosts, disrupting many Salesforce clouds and services.

Company

Salesforce provides multi‑cloud CRM, commerce, analytics, and platform services, including Core Salesforce, Service Cloud, Commerce Cloud, Marketing Cloud, Tableau, MuleSoft, Revenue Cloud, and Heroku.

What happened during the Salesforce service disruption?

Salesforce detected widespread login failures and degraded functionality across several clouds.

An automated operating‑system update, run by a process that should have been disabled in production, restarted networking services on Heroku Private Space hosts, but the required routes were not reapplied.

The resulting loss of outbound connectivity caused cascading failures in Private Space apps, parts of Common Runtime, databases, MFA flows, and internal tooling (including the Heroku status site). 

Mitigation steps disabled further updates, restored routes, and recycled hosts; all services were declared fully recovered on June 11, 2025, at 05:50 UTC.
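
The postmortem doesn't describe the remediation tooling, but the "restored routes" step implies a host-level check that compares the live routing table against an expected manifest and reapplies anything missing. A minimal Python sketch of that idea, assuming iproute2 is available and using a purely illustrative EXPECTED_ROUTES manifest, might look like this:

    import subprocess

    # Hypothetical manifest of the routes each host should carry; the
    # destinations, gateway, and device below are illustrative only.
    EXPECTED_ROUTES = [
        {"dest": "10.20.0.0/16", "via": "10.20.0.1", "dev": "eth0"},
        {"dest": "172.30.0.0/16", "via": "10.20.0.1", "dev": "eth0"},
    ]

    def current_routes() -> str:
        """Return the host's routing table as printed by `ip route show`."""
        return subprocess.run(
            ["ip", "route", "show"], capture_output=True, text=True, check=True
        ).stdout

    def restore_missing_routes() -> list[str]:
        """Reapply any expected route that is absent from the routing table."""
        table = current_routes()
        restored = []
        for route in EXPECTED_ROUTES:
            if route["dest"] not in table:
                subprocess.run(
                    ["ip", "route", "replace", route["dest"],
                     "via", route["via"], "dev", route["dev"]],
                    check=True,
                )
                restored.append(route["dest"])
        return restored

    if __name__ == "__main__":
        print("restored:", restore_missing_routes())

In practice such a script would need root privileges and a trustworthy source for the manifest; it is shown only to make the failure mode concrete.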

Timeline

When did the Salesforce incident start?

The incident began at 06:03 UTC on June 10, when automated monitoring detected rising login failures and API time‑outs across several Salesforce clouds.

How was the Salesforce incident detected and escalated?

By 06:47 UTC, engineers confirmed that key internal response tools, including the Heroku status site, were also failing, indicating a platform‑wide issue.

A cross‑service war‑room opened at 07:00 UTC. Packet tracing at 10:24 UTC showed containers could not reach their hosts, narrowing the fault to host‑level routing.

At 11:54 UTC engineers identified missing network routes on unhealthy hosts, and by 13:42 UTC they traced the trigger to an automated operating‑system update that had restarted networking services without reapplying routes.

When was the Salesforce incident resolved?

Mitigation began at 15:03 UTC when the update token was revoked and host images were rebuilt with updates disabled. Rolling restarts from 15:03 UTC to 19:18 UTC sharply reduced error rates. 

The Heroku Dashboard was fully restored at 21:54 UTC, and a fleet‑wide host recycle was completed at 05:50 UTC on June 11, at which point all services were declared fully recovered.

TTD: 44 minutes (06:03 → 06:47 UTC)

TTR: 23 hours 47 minutes (06:03 UTC Jun 10 → 05:50 UTC Jun 11)

How did Salesforce respond to the disruption?

Salesforce’s technology team brought together engineers from Heroku, networking, and other cloud groups, plus an upstream vendor. 

Their immediate tasks were to disable the automated updates, restore missing network routes, rebuild host images, revoke the update token, and orchestrate rolling restarts across the fleet. 
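
The write‑up doesn't name the orchestration tooling behind those rolling restarts. A minimal sketch of the batching pattern, with hypothetical restart_host and is_healthy helpers standing in for whatever fleet API was actually used, could look like this:

    import time

    # Hypothetical fleet inventory; in practice this would come from a
    # service-discovery or fleet-management API.
    HOSTS = [f"host-{i:03d}" for i in range(120)]
    BATCH_SIZE = 10          # recycle a small slice of the fleet at a time
    HEALTH_CHECK_DELAY = 30  # seconds to wait before probing a restarted batch

    def restart_host(host: str) -> None:
        """Hypothetical stand-in for whatever actually recycles a host."""
        print(f"recycling {host}")

    def is_healthy(host: str) -> bool:
        """Hypothetical probe; a real check might verify routes and app status."""
        return True

    def rolling_restart(hosts: list[str]) -> None:
        """Recycle hosts in batches, halting if a batch fails its health check."""
        for start in range(0, len(hosts), BATCH_SIZE):
            batch = hosts[start:start + BATCH_SIZE]
            for host in batch:
                restart_host(host)
            time.sleep(HEALTH_CHECK_DELAY)
            unhealthy = [h for h in batch if not is_healthy(h)]
            if unhealthy:
                raise RuntimeError(f"halting rollout; unhealthy hosts: {unhealthy}")

    if __name__ == "__main__":
        rolling_restart(HOSTS)

Halting on an unhealthy batch is the design point: it limits the blast radius if a rebuilt image or a restored route turns out to be wrong.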

While the main status site was unavailable, progress updates were posted through the herokustatus account on X until normal communications were restored.

Who was affected by the Salesforce service disruption, and how bad was it?

Most severely affected:

  • Private Space applications lost outbound connectivity, disrupting customer apps and dashboards.
  • Approximately 9% of Postgres databases failed over automatically; some High Availability failovers were paused.

Partial failures:

  • 1% of Common Runtime apps experienced networking issues.
  • MFA‑based logins for Marketing Cloud, MuleSoft, Tableau, and Commerce Cloud intermittently failed.
  • Order placement (OCI/SOM) and Service Cloud messaging were delayed.

How did Salesforce communicate during the outage?

Salesforce issued frequent updates via its status page until that site was impacted, then shifted to the Heroku status X account and later email notifications once services stabilised. Updates were factual and apologetic, outlining mitigation progress and recovery milestones. 

What patterns did the Salesforce service disruption reveal?

  • Environment‑control gaps: Production hosts allowed unsanctioned OS updates.
  • Network routing single‑point failure: Missing routes caused complete connectivity loss for affected hosts.
  • Shared infrastructure dependency: Status page and email systems ran on the same impacted platform, delaying comms.
  • Manual remediation burden: Lack of automated fleet‑wide restart tooling prolonged mitigation.

Planned actions include stricter immutability for host images, isolated operational tooling, enhanced monitoring for network route health, and fleet‑wide remediation automation.
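
The report lists these follow‑ups without implementation detail. As one illustration of the environment‑control item, a fleet audit could verify that automatic‑update units are masked on production hosts; the unit names below are common Linux examples and an assumption, not a statement of what Heroku runs, and a route‑health monitor would look much like the route check sketched earlier:

    import subprocess

    # Common automatic-update units on Debian- and RHEL-family hosts; which of
    # these (if any) applies to the affected fleet is assumed for illustration.
    UPDATE_UNITS = [
        "unattended-upgrades.service",
        "apt-daily-upgrade.timer",
        "dnf-automatic.timer",
    ]

    def unit_state(unit: str) -> str:
        """Return systemd's enablement state for a unit, e.g. 'enabled' or 'masked'."""
        result = subprocess.run(
            ["systemctl", "is-enabled", unit], capture_output=True, text=True
        )
        return result.stdout.strip() or "not-found"

    def audit_update_units() -> dict[str, str]:
        """Flag any automatic-update unit that is neither masked nor absent."""
        offenders = {}
        for unit in UPDATE_UNITS:
            state = unit_state(unit)
            if state not in ("masked", "not-found"):
                offenders[unit] = state
        return offenders

    if __name__ == "__main__":
        offenders = audit_update_units()
        # A real audit would feed a compliance dashboard or block deploys;
        # printing keeps the sketch self-contained.
        print(offenders or "all automatic-update units masked or absent")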

Quick summary

On June 10–11, 2025, an automated OS update on Heroku hosts removed crucial network routes, disrupting multiple Salesforce clouds for almost 24 hours. 

Disabling updates, revoking tokens, and recycling hosts restored services, highlighting the need for strict update controls, redundant comms infrastructure, and automated recovery tooling.
