On Thursday, a significant outage affected Google Cloud services, traced to an issue in its API management platform. The incident not only disrupted Google's own offerings but also had cascading effects on numerous third-party platforms that rely on Google Cloud infrastructure.
The outage began at approximately 10:49 ET, and some services were not fully restored until around 3:49 ET, impacting millions of users globally. Key services affected included Gmail, Google Calendar, Google Chat, Google Cloud Search, Google Docs, Google Drive, Google Meet, Google Tasks, Google Voice, Google Lens, Discover, and Voice Search.
The repercussions extended beyond Google’s direct services, significantly affecting third-party applications such as Spotify, Discord, Snapchat, NPM, Firebase Studio, and select Cloudflare services that utilized the Workers KV key-value store.
In a statement addressing the incident, Google expressed regret for the disruption, acknowledging the trust placed in Google Cloud by businesses of all sizes. The company emphasized its commitment to improving its reliability and service quality.
While Google is preparing a comprehensive incident report, it has disclosed that external API requests returned elevated 503 error responses during the outage. The underlying issue was traced to the Google Cloud API management platform, which encountered invalid data, a problem that went undetected for too long due to inadequate testing and error-handling mechanisms.
According to the company, “The issue stemmed from an invalid automated quota update to our API management system, which was disseminated globally, leading to rejected external API requests. As part of our recovery efforts, we bypassed the problematic quota check, enabling recovery in most regions within two hours.” However, the quota policy database in the us-central1 region became overwhelmed, leading to extended recovery times in that area, and some products experienced moderate residual effects, including backlogs that took up to an hour to clear after mitigation.
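Google has not published implementation details, but the failure mode it describes, an invalid quota policy that a globally distributed check evaluates and rejects with 503s until the check is bypassed, can be illustrated with a minimal, hypothetical sketch. All names and types below are assumptions for illustration, not Google's code:

```typescript
// Illustrative sketch only: a simplified API-gateway quota check that "fails
// closed" when quota metadata is invalid, which is one way a bad policy update
// can turn into widespread 503 responses.

type QuotaPolicy = {
  requestsPerMinute?: number; // a missing or non-positive value is treated as invalid
};

type GatewayResponse = { status: number; body: string };

// Flag a responder might flip to bypass the failing check during recovery.
let bypassQuotaCheck = false;

function handleRequest(policy: QuotaPolicy | null, usedThisMinute: number): GatewayResponse {
  if (!bypassQuotaCheck) {
    const limit = policy?.requestsPerMinute;
    if (limit === undefined || limit <= 0) {
      // Fail closed: an invalid or unreadable policy rejects the request outright.
      return { status: 503, body: "Service Unavailable: quota policy could not be evaluated" };
    }
    if (usedThisMinute >= limit) {
      return { status: 429, body: "Too Many Requests" };
    }
  }
  return { status: 200, body: "OK" };
}

// A corrupt policy propagated everywhere means every caller sees 503s...
console.log(handleRequest({ requestsPerMinute: 0 }, 1)); // status 503

// ...until the problematic check is bypassed as a mitigation, restoring traffic.
bypassQuotaCheck = true;
console.log(handleRequest({ requestsPerMinute: 0 }, 1)); // status 200
```

In this sketch, bypassing the check trades quota enforcement for availability, which mirrors the trade-off Google describes making during recovery.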
Impact on Cloudflare Services
Following the restoration of its services, Cloudflare confirmed in a post-mortem that the outage was not the result of a security breach and that no data was compromised. The company attributed the disruption to failures in the underlying storage infrastructure supporting its Workers KV service, a key component of many Cloudflare features, including configuration, authentication, and asset delivery.
Part of the affected infrastructure is supported by a third-party cloud provider, which experienced an outage that directly impacted the availability of Cloudflare’s KV service. While Cloudflare did not disclose the name of the third-party provider, a spokesperson clarified that only those services relying on Google Cloud were impacted.
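Cloudflare has not published the code behind these features, but a minimal, hypothetical sketch of a Worker reading configuration from Workers KV shows why an outage in KV's backing store cascades: every feature gated on that read degrades at once. The CONFIG_KV binding and key name below are illustrative assumptions, and the ambient types come from @cloudflare/workers-types:

```typescript
// Hypothetical example of a Cloudflare Worker that depends on Workers KV.
// If the central store behind KV is unavailable, the read fails and the
// Worker can only degrade gracefully or return an error.

export interface Env {
  CONFIG_KV: KVNamespace; // KV binding configured in wrangler.toml (illustrative name)
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    let config: string | null = null;
    try {
      // This read goes to Workers KV; a backing-store outage surfaces here.
      config = await env.CONFIG_KV.get("site-config");
    } catch {
      // Surface the dependency failure explicitly rather than crashing.
      return new Response("Configuration store unavailable", { status: 503 });
    }

    if (config === null) {
      return new Response("Configuration not found", { status: 500 });
    }
    return new Response(config, { headers: { "content-type": "application/json" } });
  },
};
```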
In light of this incident, Cloudflare announced plans to migrate Workers KV's central store to its own R2 object storage, aiming to reduce its dependency on external providers and mitigate the risk of future disruptions.