How to troubleshoot intermittent WAN link failures between sites due to flapping routes or MTU issues.
When sites intermittently lose connectivity, root causes often involve routing instability or MTU mismatches. This guide outlines a practical, layered approach to identify, quantify, and resolve flapping routes and MTU-related WAN disruptions without causing service downtime.
August 11, 2025
Intermittent WAN failures between sites can seem elusive, yet most cases reveal a pattern once you step back and observe the network behavior over time. Start by gathering baseline metrics from your edge devices, including route advertisements, interface statistics, and MTU settings. Look for bursts of route churn, flaps, or sudden increases in retransmissions that coincide with outages. Centralized logging and netflow-like data can help correlate events across multiple devices. Document the timing of outages and the affected prefixes to determine whether the problem is localized to a single link, a regional peering issue, or a broader routing instability. A disciplined data trail makes the diagnosis tractable.
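To make that correlation concrete, a short script can bucket BGP neighbor state-change messages from an exported syslog file into time windows and rank the noisiest windows against your outage timeline. This is a minimal sketch with assumptions baked in: the log file name, the message pattern, and the five-minute window are placeholders to adapt to your platform's log format.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed syslog export; the path and message format are placeholders.
LOG_FILE = "wan-edge-syslog.log"
# Many platforms log adjacency changes with a line mentioning
# "neighbor <ip> Up/Down"; adjust the pattern to your devices.
FLAP_RE = re.compile(r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}).*neighbor (?P<peer>\S+) (Up|Down)")

def bucket(ts: datetime, minutes: int = 5) -> datetime:
    """Round a timestamp down to the start of its N-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0)

churn = Counter()          # (window, peer) -> number of state changes
with open(LOG_FILE) as fh:
    for line in fh:
        m = FLAP_RE.match(line)
        if not m:
            continue
        # Classic syslog lines omit the year; assume the current one.
        ts = datetime.strptime(f"2025 {m['ts']}", "%Y %b %d %H:%M:%S")
        churn[(bucket(ts), m["peer"])] += 1

# Print the noisiest windows first so they can be compared with outage reports.
for (window, peer), count in churn.most_common(20):
    print(f"{window:%Y-%m-%d %H:%M}  {peer:<16}  {count} state changes")
```

If the busiest windows line up with the documented outage times, you have a strong pointer toward routing instability rather than an underlay fault.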
Once you have a data-backed view of the outages, segment the investigation into three domains: routing stability, tunnel and encapsulation health, and MTU consistency. In routing, focus on BGP or IGP convergence events, route dampening behavior, and any policy changes that could trigger rapid withdrawals and re-announcements. For tunnels and encapsulation, inspect GRE/IPsec or MPLS/VPN paths for instability, including signs such as misordered packets or occasional drops that may indicate hardware limitations. Finally, MTU requires both end-to-end and path MTU discovery checks. By dividing the problem space, you avoid chasing random symptoms and instead confirm the root cause before applying fixes.
MTU and packet handling must align across the network.
A robust first step is to stabilize routing behavior and verify transport paths through repeatable tests. Begin by enabling graceful restart features on internal routers and ensuring route dampening does not overly suppress legitimate changes. Monitor for flaps confined to specific prefixes, which can indicate an upstream provider or peering issue rather than a general network fault. Next, validate that the transport paths preserve packet order and timing, especially across WAN edges. Run controlled traffic tests that mimic real workloads, observing whether bursts of traffic coincide with route withdrawals or re-advertisements. If you see stable routing but continued hiccups, the problem likely lies beyond basic routing logic, in the underlay or MTU chain.
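A simple way to make those controlled tests repeatable is a probe loop that records loss and average round-trip time at fixed intervals, producing a CSV you can later line up against route withdrawal events. The sketch below assumes a Linux host with iputils ping; the target address, interval, and sample size are placeholders.

```python
import csv
import re
import subprocess
import time
from datetime import datetime, timezone

TARGET = "192.0.2.10"      # placeholder: far-end WAN address to probe
INTERVAL_S = 60            # one sample per minute
PINGS_PER_SAMPLE = 20

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= [\d.]+/([\d.]+)/")   # capture the average RTT

with open("wan_probe.csv", "a", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["timestamp_utc", "loss_pct", "avg_rtt_ms"])
    while True:
        out = subprocess.run(
            ["ping", "-c", str(PINGS_PER_SAMPLE), "-i", "0.2", TARGET],
            capture_output=True, text=True,
        ).stdout
        loss = LOSS_RE.search(out)
        rtt = RTT_RE.search(out)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(timespec="seconds"),
            loss.group(1) if loss else "100",
            rtt.group(1) if rtt else "",
        ])
        fh.flush()
        time.sleep(INTERVAL_S)
```

Run it from each site toward its peers during both quiet and peak hours; intervals with loss spikes that coincide with route churn point back at routing, while loss without churn points at the transport path.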
With routing stabilized, inspect the physical and virtual transport layers for anomalies. Check queue depths, interface errors, and error counters on every relevant link. A misbehaving interface can stall or intermittently throttle traffic, making routes appear unstable even though the problem is layer one or two. For tunnels, examine the encapsulation headers, tunnel MTU, and fragmentation behavior. If an MTU mismatch exists, packets may be dropped or fragmented in unpredictable ways, causing retransmissions that look like flaps. Use path MTU discovery where supported, supplemented with explicit MTU tuning, to align endpoints. The combined evidence from these checks helps confirm whether MTU is driving the instability.
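On Linux-based edges, one quick cross-check of layer one and two health is to sample the kernel's per-interface error counters twice and flag anything that increments between samples; on dedicated routers and switches the same idea applies via SNMP or the vendor CLI. A minimal sketch, with the counter list and sample gap as assumptions:

```python
import time
from pathlib import Path

# Standard error counters the Linux kernel exposes for every interface.
COUNTERS = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped", "rx_crc_errors"]

def snapshot() -> dict:
    """Read the error counters for every interface under /sys/class/net."""
    stats = {}
    for iface in Path("/sys/class/net").iterdir():
        stats[iface.name] = {}
        for counter in COUNTERS:
            path = iface / "statistics" / counter
            if path.exists():
                stats[iface.name][counter] = int(path.read_text())
    return stats

before = snapshot()
time.sleep(300)            # sample gap; shorten while actively testing
after = snapshot()

for iface, counters in after.items():
    for counter, value in counters.items():
        delta = value - before.get(iface, {}).get(counter, value)
        if delta > 0:
            print(f"{iface}: {counter} increased by {delta} in the sample window")
```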
The golden rule is consistent MTU and predictable routing behavior.
MTU issues often lurk beneath the surface, unnoticed until traffic patterns reveal them. Start by auditing the configured MTU on every device along the WAN path, including customer edge gear, routers, switches, and any VPN gateways. Look for inconsistencies that could produce fragmentation or dropped frames. Compare the MTU settings with the path MTU, using tools that probe the maximum transmissible unit without fragmentation. If you detect oversize packets entering a tunnel, reduce the MTU on the affected interfaces and enable don’t-fragment bits where possible. After adjusting MTU, re-test under both steady-state and bursty conditions to determine if flapping subsides and throughput improves.
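To compare the configured MTU against what the path actually carries, you can binary-search the largest non-fragmentable ping payload from a Linux host; the payload plus 28 bytes of IPv4 and ICMP headers approximates the path MTU. A sketch assuming iputils ping and an IPv4 path, with a placeholder target address:

```python
import subprocess

TARGET = "192.0.2.10"   # placeholder: remote site address across the WAN
HEADERS = 28            # 20-byte IPv4 header + 8-byte ICMP header

def ping_df(payload: int) -> bool:
    """Return True if a ping with the DF bit set and this payload succeeds."""
    result = subprocess.run(
        ["ping", "-c", "2", "-W", "2", "-M", "do", "-s", str(payload), TARGET],
        capture_output=True, text=True,
    )
    return result.returncode == 0

low, high = 500, 1472          # search between small and standard-Ethernet payloads
if not ping_df(low):
    raise SystemExit("Even small non-fragmentable pings fail; check reachability first")

while low < high:              # binary search for the largest payload that passes
    mid = (low + high + 1) // 2
    if ping_df(mid):
        low = mid
    else:
        high = mid - 1

print(f"Largest unfragmented payload: {low} bytes -> path MTU ~ {low + HEADERS}")
```

If the measured path MTU comes in below the tunnel or interface MTU, that gap is exactly where fragmentation or silent drops will occur under load.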
In addition to endpoint MTU values, consider how middle-mile devices handle fragmentation and reassembly. Some devices may impose stricter MTU for tunneled traffic than for regular IP transit, creating a bottleneck that becomes visible only during peak loads. Review firewall and NAT rules that could inadvertently modify or strip headers, changing the effective MTU and triggering fragmentation. Monitor for asymmetric paths where one direction traverses a smaller MTU than the return path, as this often leads to retransmission storms and route churn. Implement consistent MTU profiles across sites to minimize hidden discrepancies that provoke unpredictable behavior.
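One way to enforce consistent MTU profiles is to compare each site's configured values against a single intended profile and flag drift before it shows up as fragmentation under load. The inventory below is purely illustrative; in practice it would be populated from your configuration management or monitoring system.

```python
# Hypothetical per-site MTU inventory: interface role -> configured MTU.
INTENDED = {"wan_edge": 1500, "ipsec_tunnel": 1400, "lan_core": 9000}

SITES = {
    "site-a": {"wan_edge": 1500, "ipsec_tunnel": 1400, "lan_core": 9000},
    "site-b": {"wan_edge": 1500, "ipsec_tunnel": 1438, "lan_core": 9000},  # drifted
    "site-c": {"wan_edge": 1500, "ipsec_tunnel": 1400, "lan_core": 1500},  # drifted
}

for site, profile in SITES.items():
    for role, intended_mtu in INTENDED.items():
        actual = profile.get(role)
        if actual != intended_mtu:
            print(f"{site}: {role} MTU is {actual}, expected {intended_mtu}")
```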
Collaboration with providers accelerates root-cause identification.
When MTU alignment is confirmed, re-examine routing policies that may still provoke instability under load. Holding on to stale route dampening settings or aggressive withdrawal thresholds can create a cycle of flaps that masquerades as WAN outages. Adjust policies to require multiple confirmations before acting on a new route, and implement stable, incremental updates whenever possible. Consider tightening BGP best path selection to prefer consistent paths with proven performance, while avoiding overreactive path shifts during normal convergence. Document any policy changes and schedule follow-up tests to ensure that the revised rules reduce turbulence without compromising failover capabilities.
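The "multiple confirmations" idea can be illustrated independently of any vendor feature: in a monitoring or automation script, treat a next-hop change as real only after it has been observed for several consecutive polls. This is a conceptual sketch, not a router configuration; the confirmation count and addresses are arbitrary.

```python
from collections import defaultdict

CONFIRMATIONS = 3   # consecutive observations required before accepting a change

class RouteConfirmer:
    """Suppress reaction to a next-hop change until it persists for N polls."""
    def __init__(self, confirmations: int = CONFIRMATIONS):
        self.confirmations = confirmations
        self.accepted = {}                             # prefix -> accepted next hop
        self.pending = defaultdict(lambda: (None, 0))  # prefix -> (candidate, streak)

    def observe(self, prefix: str, next_hop: str) -> bool:
        """Record one poll; return True if the accepted route changed."""
        if self.accepted.get(prefix) == next_hop:
            self.pending[prefix] = (None, 0)           # stable, clear any candidate
            return False
        candidate, streak = self.pending[prefix]
        streak = streak + 1 if candidate == next_hop else 1
        self.pending[prefix] = (next_hop, streak)
        if streak >= self.confirmations:
            self.accepted[prefix] = next_hop           # change confirmed, act on it
            self.pending[prefix] = (None, 0)
            return True
        return False

confirmer = RouteConfirmer()
# A flapping sequence: only the third consecutive observation is accepted.
for hop in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.2"]:
    if confirmer.observe("203.0.113.0/24", hop):
        print(f"Accepted new next hop {hop} for 203.0.113.0/24")
```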
Another vital angle is examining peering and upstream infrastructure for hidden constraints. WAN instability can originate outside your control, such as at upstream routers, peering exchanges, or provider edge devices. Contact your carriers with gathered metrics showing the timing and duration of outages, the affected destinations, and the traffic volumes involved. Request confirmation of any maintenance windows, routing changes, or known issues on those links. Often, problems are transient and resolved quickly once providers adjust filters or re-balance capacity. A collaborative approach with clear data yields faster root-cause resolution and reduces the time you spend fishing for symptoms.
Plan, test, and deploy changes with a focus on safety and traceability.
Before escalating, reproduce the conditions that lead to outages in a controlled lab or staging environment if possible. Simulate the same traffic patterns, route flaps, and MTU variations to determine whether the observed issues occur under synthetic loads as well. This controlled experimentation helps separate genuine network faults from misconfigurations, misinterpretations, or timing-related glitches. Ensure that the lab environment mirrors the production topology as closely as possible, including routing tables, tunnel configurations, and MTU settings. By validating hypotheses in a safe space, you prevent unnecessary changes that could destabilize live services and gain confidence in the corrective actions you plan to deploy.
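If the lab runs on Linux hosts, MTU constraints and WAN-like impairments can be reproduced with iproute2 and netem before any production change is attempted. A sketch assuming root access and a placeholder interface name:

```python
import subprocess

IFACE = "eth1"          # placeholder: lab interface facing the emulated WAN

def run(cmd: list[str]) -> None:
    """Run a command, echoing it so the lab steps stay traceable."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Reduce the interface MTU to reproduce a constrained middle-mile segment.
run(["ip", "link", "set", "dev", IFACE, "mtu", "1400"])

# Add latency, jitter, and light loss with netem to mimic a stressed WAN path.
run(["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
     "delay", "30ms", "10ms", "loss", "0.5%"])

# ... run the same probes and traffic tests used in production here ...

# Roll the lab host back to a clean state afterwards.
run(["tc", "qdisc", "del", "dev", IFACE, "root"])
run(["ip", "link", "set", "dev", IFACE, "mtu", "1500"])
```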
When ready to implement changes, prioritize incremental, reversible steps. Start with non-disruptive tweaks such as adjusting MTU on suspect links, reinforcing MTU consistency, and tuning dampening thresholds in a cautious manner. Avoid sweeping reconfigurations that could trigger simultaneous outages across multiple sites. After each change, monitor the network for a full cycle of traffic, including peak hours, to confirm improvement without introducing new issues. Maintain a detailed changelog, including rationale, expected outcomes, and rollback procedures. A disciplined deployment strategy minimizes risk while delivering measurable reductions in flaps and outages.
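One lightweight way to keep that changelog disciplined is to record every change as a structured entry carrying its rationale, expected outcome, and rollback steps, so any step can be reversed quickly if monitoring regresses. The fields and file name below are only a suggested shape, not a prescribed format.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class ChangeRecord:
    """One incremental, reversible change applied to the WAN."""
    summary: str
    rationale: str
    expected_outcome: str
    rollback: str
    devices: list[str] = field(default_factory=list)
    applied_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat(timespec="seconds")
    )

entry = ChangeRecord(
    summary="Lower tunnel MTU on site-b WAN edge to 1400",
    rationale="Path MTU probe shows a 1400-byte path toward site-a",
    expected_outcome="Fragmentation-related retransmissions drop during peak hours",
    rollback="Restore MTU 1500 on the tunnel interface and re-run the probe",
    devices=["site-b-edge-1"],
)

# Append one JSON line per change so the log stays easy to diff and query.
with open("wan_changelog.jsonl", "a") as fh:
    fh.write(json.dumps(asdict(entry)) + "\n")
```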
Finally, build a long-term verification and maintenance plan that prevents recurrence. Establish a baseline of healthy routing stability metrics, MTU alignment, and transport path characteristics for each site. Set up alerting that notifies you of abnormal route churn, unusual error rates, or MTU non-conformance before users notice. Regularly review policy settings, hardware capabilities, and firmware versions to ensure they remain compatible with evolving traffic patterns. Train operations teams to recognize early signs of instability and to execute standardized diagnostic playbooks. A proactive posture reduces mean time to detect and resolve issues, keeping inter-site WANs reliable and predictable.
Integrate your insights into a repeatable playbook that teams can execute during future incidents. Include a clear decision tree: confirm routing stability, validate transport health, verify MTU alignment, and, only then, apply targeted fixes. Store diagnostic data, configurations, and test results in a centralized repository for future reference. Emphasize communication with stakeholders, providing status updates and expected timelines throughout the recovery process. With a documented methodology and practiced procedures, your organization becomes better prepared to handle intermittent WAN link failures caused by flapping routes or MTU issues, reducing downtime and preserving service levels.