Inefficient health check in erratic WANs

Hi, we’re having an important issue and I’d really appreciate your help.

We have a Peplink 310 with three dedicated internet circuits (WANs). We’ve configured the Health Check Settings on all WANs to use DNS lookup, with both our own DNS servers and public DNS servers.

Health check parameters are set as follows:

Timeout: 1 second
Health Check Interval: 5 seconds
Health Retries: 5

In other words, we’re being as aggressive as possible (lowest possible timeout and lowest possible health check interval) in trying to detect whether a WAN is down, so that traffic can fail over to the other WANs as quickly as possible. We deal with live streaming, and keeping downtime to a minimum is critical to us.

However, daily usage has shown that when a particular WAN is not completely down (say it is suffering 40% packet loss), these Health Check Settings are apparently not enough to detect that, although it is “UP” (and responds to some DNS requests or pings), it cannot actually be considered “healthy”.

So I’m wondering if there is a way to make the “WAN health check” more effective than it is today. Apparently one ping every five seconds is too long an interval to detect “erratic” (but not completely unavailable) connectivity. Is there a way to run health checks every second? Should I be using a different, better method to check the health of the WANs?

Thanks in advance for your help and expertise.

Best regards,

Helder Conde

Please make sure that public DNS servers are used for WAN health checks. There could be routing confusion if you have configured private DNS servers to be on the WAN interface and they actually exist on the LAN network.

Hi. Thanks for your response.

We’re only using DNS servers with public IPs, outside our network.

Any other ideas?

Thanks!

Helder

It sounds like you want sensitive settings that take down WANs which are not completely unavailable but occasionally fail a health check.

You may consider decreasing “Health Check Retries” to 1. This is rather extreme: any single failed check (at the 5-second interval) would take the WAN down immediately.

You may also consider increasing “Recovery Retries”, so that an “erratic” WAN has a harder time producing the required number of consecutive successful checks to come back up.

Please note that extreme settings may not be optimal; you can test which parameters best fit your environment.
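
To put rough numbers on how these two settings interact with a lossy WAN, here is a small back-of-the-envelope sketch in Python. It rests on assumptions that are not confirmed anywhere in this thread: each health check fails independently with a fixed probability, the device needs that many consecutive failures to mark a WAN down, and it needs that many consecutive successes to bring it back up. The function name and the 40% figure are purely illustrative.

```python
def expected_probes_for_run(p_event: float, run_length: int) -> float:
    """Expected number of independent probes until `run_length` consecutive
    events occur, where each probe produces the event with probability p_event.

    Standard result for runs of Bernoulli trials:
        E = (p^-n - 1) / (1 - p)
    """
    if not 0.0 < p_event <= 1.0:
        raise ValueError("p_event must be in (0, 1]")
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    if p_event == 1.0:
        return float(run_length)
    return (p_event ** -run_length - 1.0) / (1.0 - p_event)


INTERVAL_S = 5   # health check interval from this thread
P_FAIL = 0.4     # assumed per-check failure probability (illustrative)

if __name__ == "__main__":
    for retries in (1, 5):
        probes = expected_probes_for_run(P_FAIL, retries)
        print(f"retries={retries}: ~{probes * INTERVAL_S:.0f} s until marked down")

    # Recovery needs consecutive *successes*, so use the success probability.
    probes = expected_probes_for_run(1.0 - P_FAIL, 5)
    print(f"recovery retries=5: ~{probes * INTERVAL_S:.0f} s until marked up again")
```

Under these assumptions, with 40% of checks failing, a retry count of 1 trips after about 12.5 seconds on average, while a retry count of 5 takes roughly 13 minutes. With recovery retries at 5, the same erratic WAN would come back up after about two and a half minutes on average. Real loss is usually bursty, and a DNS check can fail if either the request or the reply is lost, so treat these as order-of-magnitude figures only.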

Dear colleagues,

Thanks for your answers.

I noticed that the information I posted about my current Health Check Settings was not correct. In fact, “Health Retries” was already set to 1 (not 5 as I mentioned originally). Here are the correct settings actually in use on my Peplink 310 WANs:

Timeout: 1 second (lowest possible value)
Health Check Interval: 5 seconds (lowest possible value)
Health Retries: 1 (lowest possible value)
Recovery Retries: 5

So, as far as I understand, it is as “sensitive” as possible, but it still appears to be ineffective at detecting erratic behavior on WANs that are not completely unavailable.

Is there a way to make the health check intervals even lower? Are there any plans to support this in future firmware releases? I totally understand that missing a single ping/request should not mark a WAN as “failed”. That’s not what I intend. What I’d like is to be able to gather more data (more pings) over a shorter interval, in order to make a decision.

Or are there other detection methods (other than DNS lookup, which is the one I’m using) that are known to be more effective than the others?
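
The idea of gathering more probes and deciding on the overall loss rate, rather than on a single missed reply, could look something like the sketch below. This is only an illustration of the logic in Python; it is not a feature of the Peplink firmware, and the class name, window size and 20% threshold are all made-up values.

```python
from collections import deque


class LossWindow:
    """Keeps the last `size` probe results and flags the link as unhealthy
    when the fraction of failed probes exceeds `max_loss`.
    Hypothetical helper, not a Peplink API."""

    def __init__(self, size: int = 20, max_loss: float = 0.2):
        self.results = deque(maxlen=size)  # oldest results drop off automatically
        self.max_loss = max_loss

    def record(self, ok: bool) -> None:
        """Store the outcome of one probe (True = reply received)."""
        self.results.append(ok)

    def loss_rate(self) -> float:
        """Fraction of failed probes in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_unhealthy(self) -> bool:
        """Only decide once the window is full, to avoid reacting to one miss."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.loss_rate() > self.max_loss


if __name__ == "__main__":
    window = LossWindow(size=20, max_loss=0.2)
    # Simulated pattern: 2 failures out of every 5 probes (40% loss).
    for i in range(20):
        window.record(i % 5 >= 2)
    print(window.loss_rate(), window.is_unhealthy())
```

With one probe per second, a 20-probe window would reach a decision in about 20 seconds while still tolerating an occasional lost reply.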

Thanks in advance,

Helder Conde

Our system is not designed for such short health check intervals, and 5 seconds is currently our lowest value.

Rapid health checking is not a typical use case, but we will put it on our roadmap and evaluate whether it can help in situations like yours.

Dear colleagues at Peplink,

Thanks for your response. I do trust your judgment on this. I’ll wait for your analysis and look for it, if you see fit, in future firmware releases.

Thanks!

Helder Conde