Outbound "leaks" through Backup WAN

Brad_Jolly · January 31, 2022, 6:36am

I recently discovered that there seems to be an issue with the handling of traffic with backup WANs in my Balance 380X. I am using Firmware 8.1.3 build 5162 and all configuration was done from factory defaults on that firmware (i.e., not configured then upgraded).

My WAN connections are pretty simple:

Ethernet WAN (cable) as Always on (Priority 1)
Cellular WAN via a FlexModule (EXM-MINI-1GLTE-G) as Backup (Priority 2), set to Remain Connected in Standby.
All other WANs disabled.

I discovered that traffic was very frequently and seemingly randomly going through the cellular WAN while it was in Standby. It first came to light when I was working with ddclient (for a secondary dynamic DNS domain) and found that a web-based IP check would return the cellular WAN IP just about every other check, even though the cable WAN was up and the cellular WAN was in Standby. When I looked at Active Sessions, I found 10 assorted Outbound sessions (everything from BitTorrent to DNS to Google to HTTP/S) going through the cellular WAN.
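For anyone wanting to reproduce the "every other check" symptom without ddclient, here is a minimal sketch that repeats an IP check and counts which public address comes back. The helper names (`fetch_via_http`, `tally_public_ips`) are made up for illustration, the IP-echo URL is a placeholder you would have to supply, and the demo uses a fake fetcher with documentation addresses rather than real results.

```python
from collections import Counter
from typing import Callable
import urllib.request


def fetch_via_http(url: str) -> str:
    """Return the body of an IP-echo page (the URL is a placeholder you supply)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode().strip()


def tally_public_ips(fetch_ip: Callable[[], str], checks: int) -> Counter:
    """Call fetch_ip `checks` times and count how often each address appears."""
    return Counter(fetch_ip() for _ in range(checks))


if __name__ == "__main__":
    # Stand-in fetcher that alternates between two documentation addresses,
    # mimicking the alternating results described above (not real data).
    fake = iter(["203.0.113.10", "198.51.100.20"] * 3)
    print(tally_public_ips(lambda: next(fake), 6))
```

If more than one address shows up in the tally while only one WAN is supposed to be active, sessions are leaving through more than one connection. To run it against a real service, pass something like `lambda: fetch_via_http(your_url)` as the fetcher.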

Besides interfering with ddclient, it was apparently causing significant internet trouble, including web sites not loading properly (especially if they downloaded information from multiple subdomains/connections). Sites would hang until they timed out, requiring a “Refresh” or “Reload” to display the site after the timeout (asynchronous requests seemed to be particularly problematic). It also may be why I suddenly was running out of data on the cellular plan (haven’t tracked this enough to verify, but it’s likely the culprit).

The only Outbound Policies set were the default HTTPS_Persistence, an Enforced policy for accessing the cable modem GUI (specific to the modem IP), and Priority policies for VOIP/SIP (specific to the UDP ports used). So there was no Load Balancing or anything similar that would direct general traffic to the cellular WAN.

I found that if I put the Standby mode of cellular on “Disconnect,” it, of course, stopped the outbound traffic, resolved the ddclient IP issue, and made websites load properly and normally. However, that is not a very good solution since it results in a bad delay rolling over to the backup WAN as the cellular has to connect first.

With the cellular set to Remain Connected in Standby, I tested an Outbound Policy Priority rule, setting all traffic with cable as the highest priority, followed by cellular. The 10 outbound sessions on cellular disappeared right away and no cellular WAN outbound sessions have appeared since. Also, the ddclient IP issues and Internet/website loading problems have all disappeared.
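As a rough sketch of what I would expect a Priority rule like this to do (this is my mental model, not Peplink's actual implementation, and `pick_by_priority` is a made-up name): walk the WANs in priority order and use the first one that is healthy.

```python
from typing import Dict, List, Optional


def pick_by_priority(order: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first WAN in `order` that is currently healthy, else None."""
    for wan in order:
        if healthy.get(wan, False):
            return wan
    return None


if __name__ == "__main__":
    order = ["cable", "cellular"]  # highest priority first
    print(pick_by_priority(order, {"cable": True, "cellular": True}))
    print(pick_by_priority(order, {"cable": False, "cellular": True}))
```

Under this model the cellular WAN only ever carries traffic when the cable WAN is down, which matches what I saw after adding the rule.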

What I’m wondering: Is an Outbound Policy Priority rule supposed to be required for a proper backup WAN configuration, or is there a bug in the handling of outbound traffic vis-a-vis backup WANs? Or am I totally missing something and just misconfigured things from the outset?

While the Outbound Policy rule seems to have “fixed” the problem, it does not seem like that should be required for properly handling backup WAN configuration. If it is, what’s the point of setting backup WANs and Priorities? The Outbound Policy would seem to do that alone and there would be no need for designating backup WANs or priorities for them – you may as well keep all WANs as Always on (Priority 1) and handle traffic by Outbound Policy Priority.

Any help/input appreciated.

WillJones · January 31, 2022, 11:07am

It's probably not a problem with your rules, but it might be worth sharing a screen grab of them so people can sanity-check. Remember that they are processed top down and the first rule that matches the traffic applies, so if nothing matches, traffic could be falling through to the default rule, which uses the “lowest latency” algorithm by default.
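To illustrate the top-down, first-match behaviour, here is a minimal sketch. It is not how the router is implemented internally; the `Rule` type, the match conditions, the modem address 192.168.100.1 and SIP port 5060 are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over a session description
    algorithm: str


def select_rule(rules: List[Rule], session: dict, default: Rule) -> Rule:
    """Return the first rule (top down) that matches, else the default rule."""
    for rule in rules:
        if rule.matches(session):
            return rule
    return default


# Hypothetical rule set loosely modelled on the policies described in this thread.
DEFAULT = Rule("Default", lambda s: True, "lowest-latency")
RULES = [
    Rule("HTTPS_Persistence", lambda s: s["proto"] == "tcp" and s["dport"] == 443, "persistence"),
    Rule("Cable modem GUI", lambda s: s["dst"] == "192.168.100.1", "enforced"),  # assumed IP
    Rule("SIP", lambda s: s["proto"] == "udp" and s["dport"] == 5060, "priority"),  # assumed port
]

if __name__ == "__main__":
    dns = {"proto": "udp", "dport": 53, "dst": "8.8.8.8"}
    print(select_rule(RULES, dns, DEFAULT).name)  # nothing above matches DNS
```

In this sketch a DNS lookup or a plain HTTP request matches none of the listed rules, so it is handled by the default rule and whatever algorithm that rule uses.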

That does not, however, account for the traffic then being sent to a WAN in standby, which does sound incorrect to me too.

Are there any health check failures for your main WAN that might indicate a reason for some sessions to be sent to the other WAN?

The two configuration elements work in lockstep with each other. There are plenty of use cases where you may have multiple WANs configured as both active and standby, with different standby priorities, and outbound policy is then used on top to make more granular forwarding decisions.

I tend to configure my own “catch all” policy at the bottom of the outbound policy list just as belt and braces to make sure traffic is being forwarded as I expect it to, rather than relying on traffic hitting the default rule.

I would suggest that it might be worth opening a support ticket with Peplink to let them work through this with you to see if there is something happening internally that might be the cause.

Brad_Jolly · February 1, 2022, 12:56am

Thanks. I actually considered including a screenshot of the Outbound Policy and should have; that was me being lazy. Here is the current outbound policy. The “Enforce WAN Backup” rule is the one I added to resolve the issue (though I suspect it ends up obviating the need for the Restore SIP/Restore VOIP rules, which I had added earlier to handle issues with the VOIP device registering, and which may also have been caused by this same issue).

I wasn’t getting any health check failures, though that crossed my mind as well. The router seems pretty good at sending me emails when there’s any issue, even if it’s a failure of less than a minute, and I wasn’t getting emails about the cable being down. The Internet trouble had been going on for several weeks; it was only the ddclient IP problem that made me bother to look in this direction.

I hadn’t considered the Default rule – forgot it was there as it’s on “Auto.” That’s a good catch. I didn’t realize it was using “Lowest Latency.” If that’s the case, I could see that perhaps causing this. I do see some “spikes” in the cable latency – nothing like the cellular, but it is possible a high latency moment in the cable could cross a low latency moment in the cellular. I may have to mess with that. But, I agree, that shouldn’t necessarily be the way it is handled with a WAN in Standby.
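To convince myself the crossing idea is plausible, here is a small sketch. The latency samples are invented for illustration (not measurements from my links), and `pick_lowest_latency` is a made-up helper, not the router's actual algorithm.

```python
from typing import Dict


def pick_lowest_latency(latencies_ms: Dict[str, float]) -> str:
    """Return the WAN name with the smallest latency sample."""
    return min(latencies_ms, key=latencies_ms.get)


# Invented samples: cable is usually fast but spikes now and then,
# cellular is slower but steady.
cable = [12, 14, 95, 13, 11, 120, 12]
cellular = [60, 58, 62, 55, 61, 59, 57]

choices = [
    pick_lowest_latency({"cable": c, "cellular": l})
    for c, l in zip(cable, cellular)
]

if __name__ == "__main__":
    print(choices)
    print("sessions sent to cellular:", choices.count("cellular"))
```

With these made-up numbers, the two cable spikes are enough for a per-session lowest-latency choice to pick cellular twice out of seven, so occasional stray sessions would not be surprising if the default rule really ignores Standby status.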

I will probably open a ticket as well, as you suggest, since I’m also wondering if something under the hood is impacting things. I thought I’d see if anyone could find anything I was missing or doing wrong before overly bothering support.

Thanks again.