Postmortem Library

Google Cloud outage, June 2025: quota policy change disrupts services worldwide

Google Cloud experienced a global outage caused by an internal quota policy change, disrupting key services worldwide. Discover what happened, which services were impacted, how Google responded, the remediation steps taken, and the lessons learned for improving future reliability.

Company

Google Cloud provides global cloud computing infrastructure used by enterprises, governments, and developers. Its platform supports compute, storage, databases, analytics, and AI services across regions. Key offerings include Compute Engine, BigQuery, Cloud Run, and Vertex AI, which serve millions of workloads daily.

What happened during the Google Cloud outage?

Google Cloud experienced a global outage lasting approximately 3 hours. A faulty policy change in its quota enforcement system, Service Control, introduced malformed metadata that triggered crash loops in binaries responsible for validating API traffic.
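
The root cause points to a familiar safeguard: configuration consumed by a critical binary should be parsed defensively, so a bad record is rejected rather than fatal. Below is a minimal sketch of that idea in Python, using hypothetical field names rather than Google's actual Service Control schema.

```python
# Illustrative defensive parsing of replicated policy metadata; field names and
# structures are hypothetical, not Google's actual Service Control schema.
from dataclasses import dataclass


@dataclass
class QuotaPolicy:
    service: str
    limit: int


def load_policies(raw_policies):
    """Parse replicated policy records, skipping malformed entries.

    Dropping a bad record (and alerting on it) keeps the enforcement binary
    serving traffic; dereferencing a missing field would crash-loop it.
    """
    valid = []
    for raw in raw_policies:
        try:
            valid.append(QuotaPolicy(service=raw["service"], limit=int(raw["limit"])))
        except (KeyError, TypeError, ValueError):
            print(f"dropping malformed policy record: {raw!r}")
            continue
    return valid


# One malformed record (missing 'limit') is skipped instead of being fatal.
policies = load_policies([
    {"service": "bigquery.googleapis.com", "limit": 100},
    {"service": "run.googleapis.com"},  # malformed: no limit field
])
```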

The bug propagated quickly due to global replication. Hundreds of Google Cloud and Google Workspace services, including Compute Engine, Cloud Storage, BigQuery, and Gmail, saw elevated 503 errors and degraded access. Most regions stabilised within 2 hours, while us-central1 recovered later due to infrastructure strain from retry traffic.

Timeline

When did the Google Cloud incident start?

The issue began at 10:49 PDT on June 12 when malformed quota data was deployed. Crash loops in regional binaries followed within seconds, returning 503 errors for many services.

How was the Google Cloud incident detected and escalated?

SRE teams began triage at 10:53 PDT. The root cause was narrowed to Service Control’s policy processing within 10 minutes. A global mitigation switch was deployed at 11:30 PDT, restoring most APIs.
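
The global mitigation switch is, in effect, a kill switch: a flag that lets operators bypass the failing code path and fail open while a proper fix is prepared. Here is a rough illustration of the pattern, using an invented environment flag rather than Google's internal mechanism.

```python
# Illustrative kill-switch pattern; the flag and function names are invented
# for this sketch and are not Google's internal mechanism.
import os


def enforce_quota(request):
    """Placeholder for the policy evaluation path that was crash-looping."""
    pass


def quota_check_enabled():
    """Read the mitigation switch; flipping it disables the faulty code path."""
    return os.environ.get("DISABLE_QUOTA_POLICY_CHECK", "0") != "1"


def handle_request(request):
    """Return an HTTP status code, failing open when the switch is flipped."""
    if quota_check_enabled():
        try:
            enforce_quota(request)
        except Exception:
            return 503  # the failure mode users saw during the incident
    # With the switch flipped, requests skip quota enforcement entirely,
    # trading temporary over-admission for availability.
    return 200


print(handle_request({"method": "GET", "path": "/v1/resource"}))
```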

When was the Google Cloud outage resolved?

us-central1 took longer to recover due to retry floods and throttling delays. By 13:49 PDT, API traffic had stabilised across all regions.
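
Retry floods of this kind are usually tamed with exponential backoff and jitter on the client side, so recovering services are not immediately overwhelmed. A minimal sketch, with illustrative parameter values rather than Google's recommended settings:

```python
# Minimal client-side retry with exponential backoff and full jitter; the
# parameter values are illustrative, not Google's recommended settings.
import random
import time


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `call` on failure, spreading retries out to avoid a thundering herd."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random duration up to the exponential cap.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```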

MTTD: ~4 minutes

MTTR: ~3 hours

How did Google Cloud respond to the outage?

Google’s SRE team responded quickly after user-facing errors were detected. Triage was initiated within minutes, and mitigation began using an internal override mechanism. A rollback plan was in place and executed globally. Google's team maintained ownership throughout the incident, with region-based coordination and recovery sequencing helping stabilise services quickly.

Who was affected by the Google Cloud outage?

The outage affected core APIs across multiple Google Cloud regions. Impacted services included:

Most severely affected:

  • Compute Engine, Cloud Storage, BigQuery, IAM: API errors and degraded performance.
  • Cloud Run, Cloud Functions, Vertex AI: Deployment delays and intermittent access.
  • Google Workspace (Gmail, Drive, Meet): Slower access and failed background tasks.

Partial failures:

  • Streaming and IaaS services remained largely online but experienced control plane disruption.
  • Dashboards and quota-enforced services failed to load or deploy.

The issue affected developers, operations teams, and enterprise users relying on quota-based APIs and metadata replication.

How did Google Cloud communicate during the outage?

Google published its first public update nearly an hour into the incident, after stabilising internal monitoring systems. Once the communication infrastructure recovered, status updates resumed on the Google Cloud Dashboard, followed by a detailed root cause analysis the next day. While messaging was clear after restoration, initial delays limited early visibility, especially for customers without direct API-level insight.

This underscores the importance of decoupled monitoring and communication paths during high-blast-radius incidents. As Google’s SRE playbook highlights, strong incident communication requires not just accuracy but also timing, tone, and empathy. Google met many of these standards, though earlier transparency could have reduced uncertainty for indirectly affected users.

What patterns did the Google Cloud outage reveal?

The incident reflected common risks seen in large-scale cloud environments:

  • Global propagation of invalid config: replicated metadata spread the bug system-wide (see the staged-rollout sketch after this list).
  • Single point of failure in control plane binaries: one system impacted many APIs.
  • Crash loops without backoff logic: uncontrolled retries made recovery harder.
  • Tight coupling between monitoring and infrastructure: status visibility for users was delayed.
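
The first of these patterns is commonly countered with staged, canaried config propagation instead of instant global replication. A simplified sketch of that approach, with placeholder region names and health checks:

```python
# Sketch of staged config propagation: push to one region, let it bake, check
# health, then fan out. Region names and the health check are placeholders.
import time

REGIONS = ["us-central1", "europe-west1", "asia-east1"]


def push_config(region, config):
    print(f"pushing config to {region}: {config}")


def healthy(region):
    """Placeholder canary check, e.g. error rate below a threshold after the push."""
    return True


def staged_rollout(config, bake_time_s=1.0):
    """Roll out region by region, halting on the first unhealthy canary."""
    for region in REGIONS:
        push_config(region, config)
        time.sleep(bake_time_s)  # let the change bake before continuing
        if not healthy(region):
            raise RuntimeError(f"rollout halted: {region} unhealthy after push")


staged_rollout({"policy_version": 42})
```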

Quick summary

On June 12, 2025, Google Cloud experienced a 3-hour global outage caused by a malformed policy in its Service Control system. The failure disrupted key APIs and products across cloud and workspace services. Google responded quickly with a global rollback, but gaps in monitoring and status infrastructure delayed communication with users. The incident highlights common challenges in control plane design, config propagation, and crash loop containment.
