Date: Jan 16, 2025
In the early morning UTC hours, our load balancing service encountered two significant traffic spikes; spikes like these are a routine occurrence for our systems. The first took place between 3:00 AM and 4:15 AM UTC and the second between 6:05 AM and 7:20 AM UTC. Normally, our rate-limiting service scales up and down quickly in response to global traffic patterns. During these periods of increased traffic, however, approximately 5–10% of requests in one of our regions, europe-west4, began to fail or experienced increased latency. Under typical circumstances, our load balancer would identify the misbehaving region and route traffic to other healthy regions, but this mechanism did not function as expected. As a result, some requests to Zuplo APIs either failed outright or took longer than normal to process.
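For readers less familiar with this kind of setup, region-level failover generally depends on a health endpoint that the load balancer polls in each region. The following is a minimal sketch of such an endpoint in TypeScript; it is an illustration under assumptions, not our actual implementation, and the URL and latency budget are made up for the example.

```typescript
// Illustrative health endpoint for a single region. RATE_LIMITER_PING_URL and
// the 500 ms budget are assumptions for this sketch, not real configuration.
const RATE_LIMITER_PING_URL = "https://rate-limiter.internal/ping";
const LATENCY_BUDGET_MS = 500;

export async function handleHealthCheck(): Promise<Response> {
  try {
    // Probe the regional rate-limiting backend with a hard latency budget so
    // a slow (not just unreachable) dependency also marks the region unhealthy.
    const res = await fetch(RATE_LIMITER_PING_URL, {
      signal: AbortSignal.timeout(LATENCY_BUDGET_MS),
    });
    return res.ok
      ? new Response("ok", { status: 200 })
      : new Response("rate limiter degraded", { status: 503 });
  } catch {
    // Timed out or unreachable: report unhealthy so the load balancer
    // drains traffic away from this region.
    return new Response("rate limiter unreachable", { status: 503 });
  }
}
```

A health check that only verifies the process is alive will keep passing while a downstream dependency is slow, which is consistent with the behavior described below.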
The incident stemmed from three key issues:

1. Autoscaling did not keep up. During the two routine traffic spikes, our cloud autoscaling service did not provision resources in the europe-west4 region quickly enough to handle the load. This failure to scale led to increased latencies, reaching up to five seconds, and some dropped requests.
2. The failing region stayed in rotation. Our load balancer did not remove the struggling region. Although some traffic was diverted, the system continued to send requests to europe-west4 because its health checks were still passing.
3. No timeout on calls to the rate-limiting service. Deployed API gateways using the rate-limiting service were not configured with a timeout. When the rate-limiting service in europe-west4 became slow or unresponsive, user requests were delayed while waiting for responses, exacerbating the impact of the incident (see the sketch after this list).
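To make the third issue concrete, here is a minimal sketch of bounding how long a gateway waits on the rate-limiting service. The function name, URL, request shape, 250 ms budget, and the fail-open fallback are all assumptions for illustration; this is not our gateway's actual code.

```typescript
// Sketch: bound how long a gateway request can wait on the rate-limiting
// service. Names, the request shape, and the timeout value are illustrative.
const RATE_LIMIT_TIMEOUT_MS = 250;

async function isRequestAllowed(rateLimitUrl: string, key: string): Promise<boolean> {
  try {
    const res = await fetch(rateLimitUrl, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ key }),
      // Without a signal like this, a slow rate limiter stalls the user's
      // request indefinitely, which is what amplified this incident.
      signal: AbortSignal.timeout(RATE_LIMIT_TIMEOUT_MS),
    });
    if (!res.ok) return true; // fail open on backend errors
    const body = (await res.json()) as { allowed: boolean };
    return body.allowed;
  } catch {
    // Timed out or unreachable: fail open so user traffic is not blocked by
    // a degraded rate limiter. Failing closed is the stricter alternative.
    return true;
  }
}
```

Whether to fail open or fail closed when the rate limiter cannot answer in time is a policy choice; the point of the sketch is simply that the wait is bounded.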
During the incident, once we identified the failing region and observed the elevated error rate and latency, we scaled up the region manually. This temporary fix addressed the immediate cause of the problem.
This incident highlighted areas where our infrastructure and configuration can be made more robust, in particular autoscaling responsiveness, health checks that reflect the true state of a region, and timeouts on calls to the rate-limiting service.
We sincerely apologize for the inconvenience this incident caused. We take these disruptions seriously and are committed to improving our systems to ensure reliability and performance. We remain focused on enhancing the resilience of our platform and will continue to provide transparent communication about our progress. If you have any questions or concerns, our support team is available to assist you.