Sourcegraph Cloud infrastructure outage

Incident Report for Sourcegraph

Postmortem

Summary

On June 12, 2025, Sourcegraph hosted services (the “service”) experienced a site-wide outage due to an upstream provider incident. During the outage, some users were unable to access the service or experienced transient errors. The service was restored after the Cloud provider mitigated the underlying issue.

The affected services included:

  • Amp
  • Cody Free/Pro
  • Cody Gateway
  • Sourcegraph Cloud
  • Sourcegraph Workspaces

Timeline

  • 2025-06-12 11:08 AM PDT We created an internal incident after we were notified of an anomaly reported by our monitoring alert policies.
  • 2025-06-12 11:16 AM PDT We concluded that the outage was caused by the Cloud provider and communicated this through our public status pages at https://sourcegraphstatus.com/ and https://ampcodestatus.com/. We continued to monitor for updates and assessed the health of individual services to ensure they could be safely restored later.
  • 2025-06-12 6:40 PM PDT All services were restored and validated. We closed the incident.
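
For readers comparing these local times with the UTC timestamps on the status page, the arithmetic can be sketched as follows. This is a minimal illustration, assuming PDT is a fixed UTC-7 offset; the variable names are ours and are not part of any Sourcegraph tooling.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is sufficient for a single-day incident.
PDT = timezone(timedelta(hours=-7), "PDT")

# Timestamps copied from the timeline above.
incident_opened = datetime(2025, 6, 12, 11, 8, tzinfo=PDT)
status_page_updated = datetime(2025, 6, 12, 11, 16, tzinfo=PDT)
incident_closed = datetime(2025, 6, 12, 18, 40, tzinfo=PDT)

# Time from opening the internal incident to the public communication.
time_to_communicate = status_page_updated - incident_opened

# Time from opening the internal incident to closing it.
incident_duration = incident_closed - incident_opened

# The same instants expressed in UTC.
opened_utc = incident_opened.astimezone(timezone.utc)
closed_utc = incident_closed.astimezone(timezone.utc)

print("opened (UTC):", opened_utc.isoformat())
print("closed (UTC):", closed_utc.isoformat())
print("time to communicate:", time_to_communicate)
print("incident duration:", incident_duration)
```

Note that the close time falls on the following calendar day in UTC.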

Root cause

GCP incident

Our service utilized multiple GCP services to host the Sourcegraph application, e.g., Cloud SQL, Google Kubernetes Engine, Cloud Run, and Cloud Storage. Service Control, one of GCP’s internal services, was on the critical path for almost all public and internal API requests, and was responsible for authentication, authorization, and quota enforcement. During the incident, Service Control was down due to an application issue, which affected all downstream GCP services.

Our service relies on several GCP APIs to maintain basic functionality. For example, we use GCP Identity and Access Management (IAM) to permit workloads to access GCP-hosted datastores, such as Cloud SQL and object storage. As these API endpoints were all affected by the Service Control outage, our service became inaccessible shortly after the Cloud provider outage began.
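
The way a single failure on the critical path spreads to everything downstream can be sketched with a small dependency graph. The graph below is a hypothetical simplification built only from the description above; the component names and edges are illustrative assumptions, not our actual architecture or GCP's internal topology.

```python
from collections import deque

# Hypothetical, simplified dependency map: each component lists what it needs.
DEPENDS_ON = {
    "Sourcegraph application": ["Cloud SQL access", "Cloud Storage access"],
    "Cloud SQL access": ["GCP IAM API"],
    "Cloud Storage access": ["GCP IAM API"],
    "GCP IAM API": ["Service Control"],
    "Service Control": [],
}


def affected_by(failed, depends_on):
    """Return every component that transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first search outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


impacted = affected_by("Service Control", DEPENDS_ON)
print(sorted(impacted))
```

In this toy model, a failure of the single root dependency marks every other component as impacted, which mirrors why the service became inaccessible once Service Control went down.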

We confirmed all services were recovered at 2025-06-12 6:40 PM PDT.

Cloudflare incident

In addition to the GCP incident, one of our services, Sourcegraph Workspaces, was affected by a Cloudflare outage. The service relied on Cloudflare Workers as a centralized router for all user requests. Between 2025-06-12 10:52 AM PDT and 2025-06-12 1:28 PM PDT, Cloudflare Workers experienced downtime during which almost all user requests were failing.

Our service remained inaccessible after Cloudflare Workers was restored due to the GCP incident above.
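
Because the two provider incidents overlapped, the combined window for Sourcegraph Workspaces can be estimated by merging the two intervals. This is a rough sketch under one stated assumption: the start of GCP impact is not given in this report, so the time we opened our internal incident (11:08 AM PDT) is used as a stand-in. All times are PDT on 2025-06-12, and the helper names are illustrative.

```python
from datetime import datetime, timedelta


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into disjoint windows."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def at(hour, minute):
    """Naive local (PDT) time on the day of the incident."""
    return datetime(2025, 6, 12, hour, minute)


# Cloudflare Workers downtime as stated above.
cloudflare = (at(10, 52), at(13, 28))
# ASSUMPTION: GCP impact approximated by internal incident open/close times.
gcp = (at(11, 8), at(18, 40))

windows = merge_intervals([cloudflare, gcp])
total = sum((end - start for start, end in windows), timedelta())

print("Cloudflare window:", cloudflare[1] - cloudflare[0])
print("combined window:", total)
```

Under that assumption the two incidents form one continuous window, which is consistent with Workspaces remaining inaccessible after Cloudflare Workers recovered.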

What are we doing about it?

There are no follow-up actions for this incident. Our team has previously run tabletop exercises for this scenario, in which GCP recovery may take multiple days. We remain susceptible to worldwide provider outages like this one, and when they occur we will bring our services back as soon as we can.

Posted Jun 17, 2025 - 17:04 UTC

Resolved

This incident has been resolved.
Posted Jun 12, 2025 - 22:34 UTC

Update

We're seeing improvement and service recovery; we will keep monitoring.
Posted Jun 12, 2025 - 20:45 UTC

Update

We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
Posted Jun 12, 2025 - 18:16 UTC

Identified

Our upstream service provider is experiencing an outage and services are affected.
Posted Jun 12, 2025 - 18:14 UTC
This incident affected: Sourcegraph Cloud.