1Password: Sign-in outage blocks logins

On Aug 5, 2025, 1Password customers couldn’t sign in for a little over an hour. This overview covers what happened, how it was handled publicly, and the playbook updates you can adopt now to protect your primary user journey.

Company & product

1Password is a cross-platform password manager used by individuals and enterprises to store and autofill credentials, passkeys, and secrets across devices and browsers.

What happened

On August 5, 2025, 1Password experienced an incident preventing customers from signing in. Users saw errors like “Can’t sign in. The request took too long.” 1Password mitigated and then resolved the issue the same day. 1Password did not publish a root cause on the incident page.

Timeline

Start: Tue, Aug 5, 2025 16:46 EDT (20:46 UTC / 22:46 CEST).
Resolution: Tue, Aug 5, 2025 17:59 EDT (21:59 UTC / 23:59 CEST).
TTD (time to detect): Not stated. The public status page put the incident start at 16:46 EDT, while third-party monitors indicated user reports began ~11 minutes earlier (around 16:35 EDT).
TTR (time to resolve): 1h 13m.
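
The durations above can be checked with a short calculation. This is a minimal sketch that uses the UTC timestamps from the timeline. The `first_user_reports` value is an assumption derived from the approximate "~11 minutes earlier" figure, and `format_duration` is a helper name introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Timestamps from the timeline, expressed in UTC.
incident_start = datetime(2025, 8, 5, 20, 46, tzinfo=timezone.utc)
incident_resolved = datetime(2025, 8, 5, 21, 59, tzinfo=timezone.utc)

# Assumption: user reports began ~11 minutes before the posted start,
# based on the approximate third-party figure (not an official number).
first_user_reports = incident_start - timedelta(minutes=11)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as 'Xh Ym', dropping leftover seconds."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


ttr = incident_resolved - incident_start
impact_from_first_reports = incident_resolved - first_user_reports

print("TTR:", format_duration(ttr))  # TTR: 1h 13m
print("Impact from first reports:", format_duration(impact_from_first_reports))
```

Measured from the first user reports rather than the posted start, the user-visible impact comes out at about 1h 24m, which is why the gap between first reports and first status update is worth tracking separately from TTR.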

How 1Password responded

1Password confirmed the issue after an initial investigation window, then “rolled out changes to mitigate” and moved the incident to monitoring. The service recovered and was marked resolved at 17:59 EDT. For continuity, customers were advised that they could access items offline in the app (if permitted by admins), with the caveat that changes wouldn’t sync until recovery. IsDown mirrored this update.

How 1Password communicated

Channels: Status page carried the investigation → identified → monitoring → resolved sequence with clear, plain-language updates and a practical workaround (offline access).
Cadence: Multiple updates were posted across the ~1-hour window, ending with a resolved notice.

Key learnings for other teams

Protect your primary user journey (auth) with canary checks. Run synthetic sign-ins per region and tenant type; alert on elevated auth latency and error spikes to cut TTD.
Design for offline resilience. If your client apps can safely operate read-only offline, document and pre-approve this path so support can share it immediately (as 1Password did).
Stage mitigations behind feature flags. Being able to “roll out changes to mitigate” quickly implies preflighted toggles and safe rollback—make this standard.
Own the comms narrative. Include a short impact summary (scope, percentage of failures, regions), known workarounds, and next update time to set expectations.
Capture auth dependencies. Map third-party/infra dependencies (IdP, network edges, DBs). Pre-define degraded modes (rate limits, circuit breakers) to hold the line during partial failures.
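
The canary-check recommendation can be sketched as a small evaluator that takes one window of synthetic sign-in results and decides whether to alert. This is an illustrative sketch, not 1Password's or ilert's implementation: `ProbeResult`, `evaluate_window`, and the thresholds (5% error rate, 2000 ms p95) are all assumptions chosen for the example. In practice you would keep one window per region and tenant type and feed the returned reasons into your alerting tool.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """Outcome of one synthetic sign-in attempt (hypothetical shape)."""
    region: str
    ok: bool
    latency_ms: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate_window(
    results: List[ProbeResult],
    max_error_rate: float = 0.05,   # assumed threshold
    max_p95_ms: float = 2000.0,     # assumed threshold
) -> List[str]:
    """Return alert reasons for one window of probes; empty means healthy."""
    if not results:
        # Missing data is itself a signal: the canary may be down.
        return ["no probe results in window"]
    reasons = []
    error_rate = sum(1 for r in results if not r.ok) / len(results)
    if error_rate > max_error_rate:
        reasons.append(
            f"error rate {error_rate:.0%} exceeds {max_error_rate:.0%}"
        )
    p95 = percentile([r.latency_ms for r in results], 95)
    if p95 > max_p95_ms:
        reasons.append(f"p95 latency {p95:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return reasons


# Example: 20 healthy probes, then a window where 5 of 20 time out.
healthy = [ProbeResult("us-east", True, 300.0) for _ in range(20)]
degraded = healthy[:15] + [ProbeResult("us-east", False, 30000.0)] * 5

print(evaluate_window(healthy))
print(evaluate_window(degraded))
```

Alerting on both error rate and tail latency matters for an incident like this one, where the user-facing symptom was a timeout: latency would likely cross its threshold before hard errors dominate.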

How ilert can help

Reliable escalation policies: Layered on-call schedules and service-based routing notify the right responder fast, with automatic handoffs if there’s no acknowledgement. Fail-safe fallbacks across voice, SMS, push, and chat ensure no alert is dropped.
AI-assisted incident communications: ilert drafts clear status page updates and stakeholder summaries in seconds, ensuring a consistent tone across all channels.
Reports for better post-incident learnings: Out-of-the-box dashboards track MTTA/MTTR, alerts, and escalation effectiveness so you can see what’s working and what isn’t. Trend and SLO/SLA impact views prioritize the fixes that matter most.
