Date: Jan 16, 2025

Incident Summary

In the early morning UTC hours, our load balancing service encountered two significant traffic spikes, a routine occurrence for our systems. These spikes took place between 3:00 AM and 4:15 AM UTC and later between 6:05 AM and 7:20 AM UTC. Normally, our rate-limiting service scales up and down quickly in response to global traffic patterns. However, during these periods of increased traffic, approximately 5–10% of requests in one of our regions, europe-west4, began to fail or experienced increased latency. Under typical circumstances, our load balancer would identify the misbehaving region and route traffic to other healthy regions, but this mechanism did not function as expected. As a result, some requests to Zuplo APIs either failed outright or took longer than normal to process.

What Happened?

The issue began with two routine traffic spikes during which our cloud autoscaling service did not provision resources in the europe-west4 region quickly enough to handle the load. This failure to scale led to increased latencies of up to five seconds and some dropped requests. Compounding the problem, our load balancer did not remove the failing region from rotation: although some traffic was diverted, the system continued to send requests to the struggling region because its health checks were still passing. Finally, the deployed API gateways that use the rate-limiting service were not configured with a timeout on those calls. As a result, when the rate-limiting service in europe-west4 became slow or unresponsive, user requests were left waiting for its responses, which exacerbated the impact of the incident.
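To illustrate the timeout point, here is a minimal TypeScript sketch of bounding a gateway's call to a rate-limiting service with a hard deadline so a slow region cannot stall end-user requests. The endpoint URL, the 250 ms budget, and the fail-open fallback are assumptions made for this example, not our production configuration.

```typescript
// Minimal sketch: call the rate-limiting service with a hard deadline.
// URL, timeout budget, and fail-open policy are illustrative assumptions.
async function checkRateLimit(key: string): Promise<{ allowed: boolean }> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 250); // assumed 250 ms budget

  try {
    const res = await fetch("https://ratelimit.example.internal/check", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ key }),
      // Without a signal, the gateway waits as long as the upstream does,
      // which is how slow rate-limit lookups became slow user requests.
      signal: controller.signal,
    });
    return (await res.json()) as { allowed: boolean };
  } catch {
    // Fail open: if the rate limiter is slow or unreachable, let the request
    // through rather than passing that latency on to the end user.
    return { allowed: true };
  } finally {
    clearTimeout(timer);
  }
}
```

Whether to fail open (allow the request) or fail closed (reject it) when the rate limiter cannot answer in time is a policy choice; the sketch fails open to keep user-facing latency bounded.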

Why Did This Happen?

The incident stemmed from three key issues:

  1. Autoscaling Behavior: Our autoscaling configuration did not perform as expected during the traffic spikes. Upon investigation with Google Support, we learned that the autoscaler behaved within Google’s documented norms but outside our assumptions of how it would handle rapid changes in traffic. This led to insufficient scaling in the europe-west4 region.
  2. Load Balancer Health Checks: The load balancer failed to take the struggling region out of rotation because the health checks it relied on were still passing. Despite the increased error rates, the system interpreted the region as “healthy” and continued to route traffic there (a sketch of a health check that reflects these error and latency signals follows this list).
  3. Lack of Timeout on Rate-Limiting Requests: API gateways using the rate-limiting service were not set up to time out these requests. When the rate-limiting service became slow, the gateways waited indefinitely for responses, causing elevated latency for end-user requests.
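To make the second issue concrete, the TypeScript sketch below shows a health endpoint that reports the region as degraded when recent error rates or latency are elevated, rather than only when the process is down, giving the load balancer a signal it can act on. The thresholds, window size, port, and `/healthz` path are assumptions for illustration; this is not our actual health-check implementation.

```typescript
import { createServer } from "node:http";

// Minimal sketch of a "deep" health check: report unhealthy when the region
// is degraded, not only when the process is down.
const WINDOW_MS = 60_000;         // consider the last minute of lookups
const MAX_ERROR_RATE = 0.05;      // 5% errors marks the region degraded
const MAX_P95_LATENCY_MS = 1_000; // 1 s p95 marks the region degraded

type Sample = { at: number; ok: boolean; latencyMs: number };
const samples: Sample[] = [];

// The request-handling path records the outcome of each rate-limit lookup.
// (A real implementation would also prune samples older than the window.)
export function recordSample(ok: boolean, latencyMs: number): void {
  samples.push({ at: Date.now(), ok, latencyMs });
}

createServer((req, res) => {
  if (req.url !== "/healthz") {
    res.writeHead(404);
    res.end();
    return;
  }

  const cutoff = Date.now() - WINDOW_MS;
  const recent = samples.filter((s) => s.at >= cutoff);
  const errors = recent.filter((s) => !s.ok).length;
  const errorRate = recent.length === 0 ? 0 : errors / recent.length;
  const latencies = recent.map((s) => s.latencyMs).sort((a, b) => a - b);
  const p95 = latencies.length === 0 ? 0 : latencies[Math.floor(latencies.length * 0.95)];

  // Returning 503 here is what lets the load balancer pull the region out of
  // rotation while error rates or latency remain elevated.
  const healthy = errorRate <= MAX_ERROR_RATE && p95 <= MAX_P95_LATENCY_MS;
  res.writeHead(healthy ? 200 : 503, { "content-type": "text/plain" });
  res.end(healthy ? "ok" : "degraded");
}).listen(8080);
```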

Resolution and Recovery

During the incident, once we identified the failing region and saw the elevated error rate and latency, we manually scaled up capacity in that region. This temporary fix addressed the immediate cause of the problem.

What We Are Doing to Prevent Future Issues

This incident highlighted areas where our infrastructure and configurations can be made more robust, in particular the autoscaling behavior in the affected region, the signals our load balancer health checks rely on, and timeouts on gateway calls to the rate-limiting service.

Conclusion

We sincerely apologize for the inconvenience this incident caused. We take these disruptions seriously and are committed to improving our systems to ensure reliability and performance. We remain focused on enhancing the resilience of our platform and will continue to provide transparent communication about our progress. If you have any questions or concerns, our support team is available to assist you.