Balance 20: "WAN1 Disconnected (WAN Failed DNS test)" [RESOLVED]

Hardware:
Peplink Balance 20 with v6.3.2 firmware
WAN1: Motorola Surfboard SB6121, ISP: Comcast
WAN2: Motorola NVG510 in bridge mode, ISP: AT&T

Problem:
Upon installation of a brand spanking new Peplink Balance 20, WAN1 would generate the error “WAN1 Disconnected (WAN Failed DNS test)” around 10-12 times per day. WAN1 would usually reconnect within 2-10 minutes, but during this time WAN1 would be unavailable. Even when working, DNS resolution was quite slow. The NVG510 on WAN2 worked perfectly.

Attempted Solution:
I tried various fixes recommended on this forum and elsewhere, including changing the various Health Check options (frequency, timeout period, DNS servers, etc.), with no success. A Comcast technician came out and verified that the physical line was within specs. I tried connecting the SB6121 to WAN2 and the NVG510 to WAN1: the NVG510 continued to work perfectly on WAN1, and now the SB6121 on WAN2 disconnected. I tried changing the SB6121 power supply; still no change.

Finally, tech support at Peplink suggested that I put an unmanaged switch between the SB6121 and the Peplink to “rule out any hardware compatibility issues”. I used a TP-Link TL-SG105 5 port switch. Since then, the SB6121 has been operating normally with zero disconnections … so it seems like this workaround has solved the issue, although the solution of installing a redundant (and I don’t mean “redundant” in the good way) switch in-line seems pretty inelegant to me.

Question:
Any ideas as to the nature of the incompatibility? It seems odd to me that this is even possible nowadays. Can this be fixed via firmware, or is it really a hardware issue? Somehow the SB6121 and Balance 20 do not talk to each other reliably, but the TL-SG105 can talk to both without any problems. I am perplexed.

Usually this is an issue related to Ethernet handshaking between the two devices. Did you try locking the port speeds to 1000M full duplex?

I did not. Tonight I will try locking the WAN1 port speed to 1000M full duplex (and remove the switch) as you suggest. Will report back tomorrow.

Quick update, locked WAN1 port speed to 1000Mbps full duplex, so far one disconnection (~4 minutes) in 18 hours. Much, much better than before, but the inline switch had zero disconnections in 48 hours. My sample sizes are small, so the only conclusion I can draw right now is that both methods probably work, although one solution is more elegant than the other!

I am going to go with the port speed solution for now and will update in a week or so (unless errors start popping up frequently again).

Good deal.

Hmm, so perhaps this wasn’t really a compatibility issue after all. I had 2-3 “WAN1 Disconnected (WAN Failed DNS test)” errors within hours after I posted my update yesterday, but decided to continue monitoring to see whether the frequency would increase or abate. However, last night WAN1 disconnected again with the same error and had still not reconnected by the time I woke up (about 6 hours after the disconnection). So, I tried power cycling the modem; no joy. I re-installed the TL-SG105 switch between the SB6121 and the Peplink; still no joy. So then I took the drastic step of replacing the Motorola SB6121 modem with a new Zoom 5370 modem. I had Comcast provision the new modem and it connected right away. It’s only been a few hours, but I have my fingers crossed that the SB6121 was slowly failing and that it was the culprit all along.

I agree with you and suspect the cable modem itself failed. I bet you are good to go now…

… waiting for the update from you.

We have a similar setup. We use a Balance 20 with Comcast (20/3 Mbps) and ATT (100/20 Mbps semi-fiber w/fake Uverse). Using outbound rules, traffic is routed with a 9:3 ratio. ATT line saturation is rarely over 50%, and the health check is configured with DNS servers 8.8.8.8 and ATT’s own.

There are days when the Peplink disconnects the ATT line a couple of times due to the health check (usually in the morning hours, when our office is most active). The ATT line has its problems (occasional packet loss and increased latency), and ATT support does not make it easier. Still, ISP issues do not warrant such frequent line disconnects.

First, I would like Peplink support to give solid troubleshooting steps to determine whether the ISP is at fault or the Peplink is too quick on the trigger.

OB

OK, the update is that I do not have a solution. After a good couple of days with the Zoom 5370, I started to get the DNS Health Check errors again and WAN1 keeps disconnecting, sometimes for several minutes and in a couple of egregious cases, for a few hours. It usually happens a handful of times a day, seemingly at random intervals.

Despite having had a Comcast tech out already who declared that the physical line was fine, I am beginning to think that it still is a Comcast line issue:

  • it’s not the modem: I have tried the Motorola SB6121 and the Zoom 5370 and get the same problem with both
  • it’s not an Ethernet handshake issue: I have tried an unmanaged TP-Link TL-SG105 between the Balance 20 and each modem, and I have tried locking the port speed on the Balance 20 to 1000M full duplex with each modem, and in all four configurations the issue persists
  • it’s not the Balance 20: I tried using a TP-Link ER-5120 Dual WAN router in place of the Balance 20 and experienced the same issues (the router log showed multiple disconnections/connections, but it did not specify the reason)

So, I think it’s time to get the Comcast tech back out here again. Will update once that happens. I don’t know enough about what could be wrong to pin it on the ISP. ABC-Admin, it seems like you have an opinion on this?

bridgerider,

1- I have two devices with identical configuration and the Health Check disconnect happens on both.
2- ATT swapped their modem to a new one and it did not do anything.
3- I’ve contacted ATT support multiple times and did not get far with them. Except once they flashed settings from my modem under pretext of resetting it, which resulted in my longest outage (3-4 hours). Tech that showed up asked me to help him to recover it since his claim was “… I am not trained on this product”.
4- What DNSs are you using for Health Check ping? Comcast DNSs are not reliable, that’s why I’ve added 8.8.8.8
5- Is there correlation between HC disconnects and amount of consumed bandwidth?
6- To troubleshoot the ISP connection, I would run mtr (see “MTR (software)” on Wikipedia) outside of the Peplink to get a picture of packet loss and latency when plugged directly into the Comcast device.
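
For anyone without mtr handy, here is a rough sketch of the same idea in Python: send a DNS query to a resolver at regular intervals from a machine plugged directly into the modem, then tally loss and latency. It is much simpler than mtr (end-to-end only, no per-hop breakdown). The function names, the `example.com` lookup name and the 2-second timeout are my own assumptions for illustration, not anything mtr or Peplink uses.

```python
import socket
import statistics
import struct
import time


def build_dns_query(name, query_id=0x1234):
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe_once(server, name="example.com", timeout=2.0):
    """Return the round-trip time in ms, or None if no valid reply arrived."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return None
        if reply[:2] != query[:2]:  # the reply ID must match the query ID
            return None
        return (time.monotonic() - start) * 1000.0


def summarize(samples):
    """Summarize RTT samples in ms, where None marks a lost probe."""
    replies = [s for s in samples if s is not None]
    lost = len(samples) - len(replies)
    return {
        "sent": len(samples),
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "avg_ms": statistics.mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }
```

Calling `probe_once("8.8.8.8")` once a second for a few minutes and passing the collected list to `summarize` gives a loss percentage and latency figures that can be compared against the times the health check fails.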

  1. FWIW, when I switch modems from one WAN port to the other, the issue moves with the modem, so it’s likely not an AT&T problem and likely not a WAN port problem, hence my thinking that it is more likely Comcast related;
  2. Using 8.8.8.8 and 8.8.4.4 instead of the Comcast DNS servers, and have been since before this whole problem arose;
  3. That is a good question, but I don’t think so as I am getting multiple disconnects in the middle of the night when little to no bandwidth is being consumed;
  4. I will try MTR when I have a moment

It is starting to sound like we have similar symptoms but possibly different root causes. I am going to do one last check of all the coax connections at my location before I get a Comcast tech out here …

One more thing, I’ve increased Health Check Retries from 3 to 5


I tried that too, including PING versus DNS, longer timeout period, longer interval, more retries, etc … all to no avail unfortunately.

Just had 2x ATT disconnects this morning.

1- During one of them I briefly registered 100% CPU load on the Peplink. Why is there no historical data on CPU load?

2- Are you using InControl2? Again, it is in realm of non-scientific data, but I’ve closed my InControl2 real-time bandwidth monitor after 2x disconnects to calm down the situation.

Well… after an uneventful hour, I’ve opened InControl2 dashboard and clicked on couple links. Boooom!!! ATT line went off. After dealing that long with this issue my judgement might be questionable, but my perception tells me it was not a coincidence. I will try it again in an hour.

I am not using InControl2, but after re-tightening all of the coax connections leading to the cable modem, I have had zero disconnects in the last 21 hours, which is by no means a record, but at least a good start …

For the sake of others looking for CPU load and other performance data on PepLink, you can enable SNMP v2c on the device and then you can collect OID: deviceCpuLoad and other data

Hi,

The WAN health check feature is just a simple connection test tool that verifies internet connectivity over the WAN connection. A failed WAN health check means that traffic sent from the WAN interface did not get a reply. You can find the health check stability (consecutive count) on the support.cgi page.
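
To illustrate why the retry count matters, here is a minimal sketch of a consecutive-count health check. It assumes the WAN is marked down only after a number of probes fail in a row and marked up again after a number succeed in a row. The parameter names `retries` and `recovery` are hypothetical, and the actual algorithm inside the Balance is not described in this thread.

```python
def wan_states(probe_results, retries=3, recovery=3):
    """Yield "up" or "down" after each probe result.

    probe_results: iterable of booleans, True when the probe got a reply.
    retries: consecutive failures needed to mark the WAN down (assumed).
    recovery: consecutive successes needed to mark it up again (assumed).
    """
    state = "up"
    fails = oks = 0
    for ok in probe_results:
        if ok:
            oks += 1
            fails = 0  # any reply resets the failure streak
        else:
            fails += 1
            oks = 0  # any loss resets the success streak
        if state == "up" and fails >= retries:
            state = "down"
        elif state == "down" and oks >= recovery:
            state = "up"
        yield state
```

Under these assumptions, a burst of three lost probes takes the link down with `retries=3` but is ignored with `retries=5`, which is why raising the retry count hides short bursts of loss but does nothing for longer outages.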


Health check failure can be caused by the following:

1. Communication between Balance router and ISP router.

  • ISP Router/Modem Hang issue
  • Physical/Port Speeds (Auto Negotiation for interface/Port Speeds)

2. Communication between ISP router and Health check target.

  • ISP service down
  • ISP routing issue

3. Unreliable Health check target.

  • Make sure the health check target is not blocking the traffic.
  • Make sure the health check target is reliable.

For more information, please refer to the attached diagram:


**For WAN health check failure troubleshooting**, you usually need to isolate which of items 1, 2, and 3 above is causing the issue.

  • Make sure the physical connection is fine.
  • Make sure the interface/port speed is set explicitly on both end devices. This rules out an auto-negotiation issue.
  • Disable the WAN health check and monitor the internet connection status. If you still face internet connection issues with the health check disabled, this shows that the internet connection itself is unstable.
  • Put a host in between the WAN interface and the ISP modem for an isolation test.
  • Change to reliable health check targets/servers.
  • Others

Below is a sample test you can use to isolate item 1, the communication between the Balance router and the ISP router.


Thank You

sitloongs,

Thank you for sharing this with us, in particular the support.cgi page on the Peplink, which makes troubleshooting a bit easier. BTW, are all the parameters on the support.cgi page available via SNMP, specifically the Health Check history?

Another question: does the Balance 20 have a cap on NAT entries?

I think I am getting closer in my ghost hunting, and I will even share the ghost’s name with you once I am done.

We do have OID for WAN Health Check State - .1.3.6.1.4.1.23695.2.1.2.1.4
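
As a rough sketch of how that OID could be polled, the snippet below builds a net-snmp `snmpwalk` command for SNMP v2c and parses integer results. It assumes net-snmp is installed, that SNMP v2c is enabled on the device, and that the values come back as plain `INTEGER` without a MIB loaded. The community string `public`, the helper names, and the value type are assumptions, and the meaning of each integer value is not given in this thread.

```python
import re
import subprocess

HEALTH_CHECK_OID = ".1.3.6.1.4.1.23695.2.1.2.1.4"  # WAN Health Check State OID quoted above

# Matches numeric-OID output lines such as ".1.3.6 = INTEGER: 1".
LINE_RE = re.compile(r"^(\.[\d.]+) = INTEGER: (-?\d+)")


def snmpwalk_command(host, community="public", oid=HEALTH_CHECK_OID):
    """Build the argv for snmpwalk: SNMP v2c, numeric OID output (-On)."""
    return ["snmpwalk", "-v", "2c", "-c", community, "-On", host, oid]


def parse_integer_rows(output):
    """Map each returned OID to its integer value; other lines are skipped."""
    rows = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows[match.group(1)] = int(match.group(2))
    return rows


def poll(host, community="public"):
    """Run snmpwalk against the device and return {oid: value}."""
    result = subprocess.run(
        snmpwalk_command(host, community),
        capture_output=True, text=True, timeout=10, check=True,
    )
    return parse_integer_rows(result.stdout)
```

Running `poll` from cron or a small loop and logging the result with a timestamp would give the health check history that the device itself does not appear to keep.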

Are you referring to how many Port Forwarding rules can be defined? If so, we don’t cap this.