NetBird

Unstable peer connection to management service
Full outage
Incident Report: Management Service Disruption (April 20, 2026)

Date: April 20, 2026

Status: Resolved

Prepared by: NetBird Team

Summary

On April 20, we experienced a disruption affecting the NetBird Management Service. The issue originated from an unexpected spike in signaling traffic, which first caused degraded performance in a single region and later resulted in a broader service interruption.

The incident has been fully resolved, and we have already implemented several improvements, with additional safeguards currently in progress.

Impact
  • Partial disruption (US region): ~2 hours 15 minutes of degraded service performance (approximately 17:13–19:30 UTC)

  • Full service disruption (global): ~1 hour immediately following the partial disruption (approximately 19:30–20:30 UTC)

  • Recovery: ~1.5 hours of degraded service performance during recovery (approximately 20:30–22:00 UTC)

  • Existing peer-to-peer connections: Remained stable and continued operating as expected

During this period, users may have experienced:

  • Delays when establishing new connections

  • Intermittent disconnects during client reconnection attempts

  • Increased latency when interacting with the management API

Importantly, already established relayed and peer-to-peer tunnels were not affected and continued to pass traffic normally.

What Happened

The incident was triggered by a rapid increase in signaling traffic related to peer-to-peer (P2P) connection negotiation. NetBird relies on a signaling layer to coordinate direct connections between peers, including exchanging connection metadata and assisting in NAT traversal.

During this event, the volume of signaling requests rose well beyond typical operating levels, driven by a combination of new connection attempts and repeated reconnection cycles.

This resulted in:

  • A surge in P2P negotiation requests across the signaling layer

  • Rapid reconnect loops from clients attempting to re-establish connectivity

  • Elevated load across the distributed load balancer infrastructure and management components

While the platform is designed to scale dynamically, the rate of increase in traffic outpaced the system’s ability to adjust in real time. This mismatch led to increased connection churn and cascading pressure on critical coordination paths, ultimately impacting overall system responsiveness.
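The mismatch between traffic growth and scaling speed can be illustrated with a toy model. The Python sketch below is not NetBird's autoscaler: the function `simulate_backlog` and its parameters (`scale_step`, `scale_delay`) are hypothetical names chosen for illustration. It assumes capacity is added a fixed number of ticks after overload is observed, and shows how a slower reaction lets the backlog of unserved requests climb higher before capacity catches up.

```python
def simulate_backlog(arrivals, capacity, scale_step, scale_delay):
    """Toy model: return the backlog of unserved requests after each tick.

    arrivals    -- requests arriving per tick
    capacity    -- requests the system can serve per tick at the start
    scale_step  -- capacity added by each scale-up (hypothetical parameter)
    scale_delay -- ticks between observing overload and the new capacity
                   becoming available (hypothetical parameter, must be >= 1)
    """
    if scale_delay < 1:
        raise ValueError("scale_delay must be at least 1 tick")

    backlog = 0
    pending = {}   # tick -> capacity that becomes available at that tick
    history = []
    for tick, arriving in enumerate(arrivals):
        capacity += pending.pop(tick, 0)              # scale-ups land here
        backlog = max(0, backlog + arriving - capacity)
        if backlog > 0:                               # overload observed
            pending[tick + scale_delay] = scale_step  # request more capacity
        history.append(backlog)
    return history


# A sudden jump from 10 to 40 requests per tick against a capacity of 20.
spike = [10, 10, 40, 40, 40, 40, 40, 40]
slow = simulate_backlog(spike, capacity=20, scale_step=10, scale_delay=2)
fast = simulate_backlog(spike, capacity=20, scale_step=10, scale_delay=1)
print("slow scaling:", slow)
print("fast scaling:", fast)
```

In this model the only difference between the two runs is the reaction delay, yet the slower run accumulates a noticeably larger peak backlog. Real systems add further effects the model ignores, such as clients retrying unserved requests, which increases arrivals further.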

Resolution

Once the underlying cause was identified, the team took targeted actions to stabilize the platform and reduce the load on critical components.

Mitigation steps included:

  • Introducing rate limiting to smooth out bursts in signaling traffic

  • Scaling infrastructure both vertically and horizontally to absorb increased demand

  • Applying configuration adjustments to improve traffic distribution and system behavior under load

Stabilization was achieved shortly after these changes were applied, and service performance returned to normal levels.
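As an illustration of the rate limiting mentioned among the mitigation steps, the following is a minimal token-bucket sketch in Python. It is an assumption about the general technique, not the limiter NetBird deployed; the class name `TokenBucket` and the `rate` and `capacity` values are hypothetical. A token bucket admits short bursts up to `capacity` and then holds sustained traffic to `rate` requests per second.

```python
class TokenBucket:
    """Token-bucket limiter (illustrative sketch, not NetBird's implementation).

    rate     -- tokens added per second (sustained request rate)
    capacity -- maximum tokens held (largest burst admitted at once)
    """

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start full so an initial burst is allowed
        self.last = now          # timestamp of the last refill

    def allow(self, now, cost=1.0):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Five signaling requests arrive at t=0, three more one second later.
bucket = TokenBucket(rate=2.0, capacity=3.0)
burst = [bucket.allow(now=0.0) for _ in range(5)]
later = [bucket.allow(now=1.0) for _ in range(3)]
print(burst, later)
```

The clock is passed in explicitly so the behavior is deterministic and easy to test; a production limiter would typically read a monotonic clock and keep one bucket per client or per account.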

Identifying the root cause required deeper investigation than expected. Initial mitigation efforts focused on symptoms rather than the primary driver, which extended the time to full recovery and highlighted the need for improved observability and faster correlation of system-wide signals. Once the issue was accurately identified, corrective actions were implemented quickly.

What We’re Improving

We are using this incident to further strengthen the platform. The following improvements are already underway:

  • Improved handling of rapid reconnect scenarios to prevent cascading load during high connection churn

  • Faster scaling response to sudden traffic increases to better match rapid changes in demand

  • Additional safeguards around signaling and coordination paths to ensure stability under extreme load patterns

  • Enhanced monitoring and system observability to accelerate detection and root cause analysis

  • More adaptive traffic shaping mechanisms to smooth bursty connection patterns

  • Optimization of high-load API paths to improve resilience under stress

  • Refined default configurations to better support large-scale environments

  • Ongoing validation through load and resilience testing
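One common way to handle rapid reconnect scenarios is exponential backoff with jitter on the client side. The Python sketch below is an assumption about that general technique, not a description of the NetBird client; the function `backoff_delays` and the `base` and `cap` values are hypothetical. Each retry waits a random time between zero and an exponentially growing ceiling, so clients that disconnected together do not all reconnect at the same moment.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return a wait time in seconds for each reconnect attempt.

    Uses exponential backoff with "full jitter": attempt n waits a random
    time in [0, min(cap, base * 2**n)]. All parameter values here are
    illustrative assumptions.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # grows 1, 2, 4, ... up to cap
        delays.append(rng.uniform(0.0, ceiling))   # randomize to spread clients
    return delays


# A seeded generator makes the schedule reproducible for testing.
schedule = backoff_delays(8, base=1.0, cap=30.0, rng=random.Random(42))
for attempt, delay in enumerate(schedule):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

The cap bounds the worst-case wait for a single client, while the randomization spreads the aggregate reconnect load over time instead of concentrating it in synchronized waves.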

Moving Forward

Reliability and transparency are core priorities for us. While this incident did not impact active network traffic, it highlighted areas where we can improve the robustness of our control plane under extreme conditions.

We are confident that the improvements already implemented, along with those in progress, significantly reduce the likelihood of similar incidents in the future.

If you experienced issues or have further questions, our support team is available to assist you.