Balance 20 - Load Balancing Broken?

(Edit: The issue appears to be resolved - check the last post for details)

I’ve got the following router:
System Information

Router Name: Balance_ADEF
Model: Peplink Balance 20
Hardware Revision: 1
Serial Number: 1824-5820-ADEF
Firmware: 5.4.9 build 1732
Modem Support Version: 1011
Uptime: 0 day 0 hour 57 minutes
System Time: Mon Apr 27 14:32:31 EDT 2015

The issue I’m having is the router works fine for a day or so, then all the traffic gets routed to WAN2 and WAN1 gets almost no traffic. This results in all sorts of problems with throughput, streaming services not working, etc. as the customer needs both WAN connections in order to do their work.

When this happens, if I reboot the router, it works fine for a day or two, and then the issue recurs.

What should I be looking for in order to diagnose the issue?

Apparently the last reboot didn’t help - here’s a pic of the router’s traffic for the last little while:


Hi Tim,

Can you send a copy of the outbound policy rules that you have on Network // Outbound Policy?

Check the WAN speed settings on Network // WAN (Upload Bandwidth and Download Bandwidth). I once entered the upload speed in place of the download speed, and the Peplink reduced the traffic it sent on that WAN because of a least-used policy.

AG

Here’s the outbound rules:

Oddly enough, the router seems to be doing the load-balance thing now. I’m puzzled as I didn’t change anything.

also - no persistence rules or anything like that.
both WAN ports have been configured with the same upload/download speeds.

Hi Tim,

What devices are behind the Balance 20 (LAN side)? A group of users or a firewall?

LAN side is a bunch of PCs - windows, linux, and some mobile devices.

Hi Tim,

May I know whether both WAN links were up and running when this happened?

Is there any outbound rule above the Default Rule? Can you share a screenshot?

The router reported both WAN links were up and running when this happened.

There used to be some SSL persistent rules - I removed all of them when the problem started so there’s only the default rule now.

Hi Tim,

Is it possible to upgrade to the latest firmware version and then try again?

Please ensure your B20 is under warranty, since an unlock key is needed for a major version upgrade.

Not really - this router isn’t under warranty, so it’s running 5.4.9 build 1732, which is as far as it’s going. Considering it’s worked fine for years and this appears to be recent behavior, I’m thinking an upgrade shouldn’t be necessary.

What on earth? Now I’m getting a “you’ve been blocked” from Cloudflare??!

Reading the docs on load-balancing, the router allocates on the assigned bandwidth of each WAN connection. From what I’m seeing, I’m thinking that if both WAN connections have the same bandwidth, then the router locks all the traffic to one WAN connection regardless of the actual usage on both WAN lines. This can result in one WAN line being overloaded while the other WAN line is empty of traffic.
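For comparison, here is a minimal sketch of how a generic weighted balancer could split new connections according to the bandwidth assigned to each WAN. This is only an illustration and not Peplink's actual algorithm; the names `pick_wan` and `distribute` and the weight values are made up for the example.

```python
def pick_wan(active, weights):
    """Return the WAN with the fewest active connections relative to its weight."""
    return min(weights, key=lambda wan: active[wan] / weights[wan])


def distribute(n_connections, weights):
    """Assign n_connections one at a time and return the per-WAN counts."""
    active = {wan: 0 for wan in weights}
    for _ in range(n_connections):
        wan = pick_wan(active, weights)
        active[wan] += 1
    return active


if __name__ == "__main__":
    # Hypothetical weights: both WANs configured with the same bandwidth.
    print(distribute(100, {"WAN1": 10, "WAN2": 10}))
    # Hypothetical weights: WAN1 configured with twice the bandwidth of WAN2.
    print(distribute(90, {"WAN1": 20, "WAN2": 10}))
```

In a scheme like this, equal weights give an even split, so one saturated line next to an idle one would suggest something other than the weighting itself is at work.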

This is the only explanation I can come up with, and it really pisses me off to think I’ve been using this router for so long and have been getting effective use of only one WAN line’s bandwidth when I’ve been paying for 2 all these years.

The router has been set to Normal Application Compatibility and things seem to be doing better.

I’ll update this page after I’ve observed more behavior.

(And I really, really hate having Cloudflare stop me from posting because I tried to use quote marks in my post…)

Normal Application Compatibility hasn’t changed much - I got a video running, and then flipped through some FB pages in rapid order. One line is almost saturated, the other line has almost no traffic.

Now trying High Application Compatibility.

Tried a disconnect on the offending WAN line and then a reconnect. The router reported a number of userid/password failures before eventually connecting.

Called the ISP, and they saw an “old” connection hanging around. What might’ve been happening was the router making multiple attempts to connect and eventually getting a login, while the ISP’s system still had the “old” connection open for some reason.

Checking the log file shows no failed login attempts. There is a “no cable detected” error, although the router was connected to a modem the entire time.

I’ve put the router back to “load balanced” mode and it appears to be working fine now.

There needs to be more / better diagnostic information recorded and reported - particularly for failed login attempts.

Some services and sites depend on a persistent source IP. An HTTPS session must all go out the same WAN, because the server treats a changing IP within an HTTPS session as invalid. Major streaming services consist of many parallel HTTP connections that get replaced every x minutes, and all of these must come from the same IP, or the service views it as a user logon/session error. Same deal for Skype connections, and a lot more. Even simple web surfing with logon cookies could lead a site to decide it’s a duplicated logon and reject it.
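To make the persistence idea concrete, here is a rough sketch of pinning each LAN client to one WAN the first time it is seen, so that all of its later connections leave from the same public IP. It is a simplified illustration, not how the Balance implements persistence; the class name `PersistentBalancer` and the LAN addresses are hypothetical.

```python
from itertools import cycle


class PersistentBalancer:
    """Pin each source IP to the WAN chosen for its first connection."""

    def __init__(self, wans):
        self._next_wan = cycle(wans)  # round-robin for clients not seen before
        self._pinned = {}             # source IP -> WAN already chosen

    def route(self, src_ip):
        if src_ip not in self._pinned:
            self._pinned[src_ip] = next(self._next_wan)
        return self._pinned[src_ip]


if __name__ == "__main__":
    balancer = PersistentBalancer(["WAN1", "WAN2"])
    for client in ["192.168.1.10", "192.168.1.11", "192.168.1.10"]:
        print(client, "->", balancer.route(client))
```

The trade-off is visible even in this toy version: a single busy client stays on one line, so persistence improves compatibility at the cost of even balancing.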

Load sharing only works 100% for simple tasks. The complex apps and services do their own kind of load sharing by splitting up the session across multiple TCP connections. These servers will not accept those connections coming from different subnets.

You will need the HTTPS rule. You will need to add rules to converge streaming services onto one WAN IP. If each WAN and ISP has a CDN (Akamai, etc.), then each WAN will give a different IP for DNS lookups to sites they host, and you will need to add rules to direct those connections to the appropriate WAN. To get really good results, you need to add the entire ASN IP subnet table, for each WAN, to your Peplink.
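As a rough sketch of what such destination rules amount to, the snippet below matches a destination address against a list of subnets and returns the WAN it should be forced onto. The subnets are documentation placeholders rather than real CDN ranges, and `RULES` and `enforced_wan` are names invented for this example; on the router itself this would be configured as outbound policy rules, not code.

```python
import ipaddress

# Placeholder subnets (reserved documentation ranges), first match wins.
RULES = [
    (ipaddress.ip_network("203.0.113.0/24"), "WAN1"),
    (ipaddress.ip_network("198.51.100.0/24"), "WAN2"),
]


def enforced_wan(dst_ip, rules=RULES):
    """Return the WAN a destination is pinned to, or None to fall through
    to the default (load-balanced) rule."""
    addr = ipaddress.ip_address(dst_ip)
    for network, wan in rules:
        if addr in network:
            return wan
    return None


if __name__ == "__main__":
    for dst in ["203.0.113.7", "198.51.100.200", "192.0.2.1"]:
        print(dst, "->", enforced_wan(dst))
```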

It’s a complicated task to get true load sharing and success with complex services.

The problem I was seeing was that the router wasn’t sending any traffic to one of the WAN lines for any web requests - page browses, etc.

As for persistence rules - I’ve had them before, and for the services that need it, I’ll add them back in.