Health Check Fails for Host 1 but Passes for Host 2 Event Logging

James_S · July 17, 2014, 11:56pm

When the health check fails on Host 1, it swaps over to Host 2. The Peplink knows this but does not record it in the event log, which makes troubleshooting very hard. Can you please adjust the event log so that it records when this happens?

Thank you!

Tim_S · July 18, 2014, 1:11am

WAN up/down events are automatically logged in the event log, so you should already see these.

Jarid_Petermann · July 21, 2014, 12:48am

If the WAN is set up in backup mode, you will see “WAN2 connected” in the event log. If the WAN connection is set to “Always On”, no such message is logged.

James_S · July 22, 2014, 6:14am

What I am looking for is this: when a Peplink runs a health check, it fails on Host 1, and the Peplink switches to checking Host 2, I would like the Event Log to be updated with that.

I know it runs a health check against Host 1 and, when that fails, moves on to Host 2, but nothing in the Event Log shows this. Logging it would help us troubleshoot cases where the internet is up but the Peplink cannot reach a particular server or location. If the health check is set to ping a specific IP address and the ping fails, the Peplink takes the connection offline, but that does not tell us whether the internet is down or up, only that the ping failed. If the log recorded that the check failed on Host 1 and had to swap to Host 2, we could say that this server went offline while the internet was still up.

The check itself is basically already in place; it just needs to be written to the event log.

I hope that makes sense.

Jarid_Petermann · July 22, 2014, 7:19am

If both connections are set to always on it will send out health checks at the specified times. If a health check fails on a particular WAN this will be reflected in the Event Log.

One WAN Connection on, Other Back-Up Priority (Event Log)
Jul 22 17:10:10 WAN: Cable One connected (186.54.26.5)
Jul 22 17:09:40 WAN: Midcontinent disconnected (WAN Failed DNS Lookup)

Both Connections set to always on (Event Log)
Jul 22 17:03:59 WAN: Cable One disconnected (WAN Failed DNS Lookup)
Jul 22 17:04:51 WAN: Cable One connected (186.54.26.5)

If this is not properly reflecting the down/up times of the WAN connection as it should, please open a support ticket:
http://cs.peplink.com/contact/support/

James_S · July 22, 2014, 7:32am

Maybe I am not making the issue clear enough.

WAN 1
Health check location 1
Health check location 2

Health check location 1 fails.
Health check location 2 passes.
WAN 1 stays up.

There is nothing in the log about this, or else I am missing it.

James_S · July 22, 2014, 7:34am

WAN 1
Health Check Location 1
Health Check Location 2

WAN 1 runs a health check on Health Check Location 1 and fails, so it then runs a health check on Location 2 and passes.
Nothing about this is recorded in the event log. WAN 1 stays up and working, but if there is an issue on Health Check Location 1, no one will ever know.

Tim_S · July 22, 2014, 7:41am

Your feature request is noted; let’s see if there is any more interest from other users. The whole idea behind having a second host is to confirm that the WAN is actually down before it is marked as failed.

Jarid_Petermann · July 22, 2014, 7:45am

Hello, I believe I understand you now. Yes, you are correct about how it works.

So let’s say you are using DNS Lookup or Ping as the health check method. There are 2 hosts:

WAN 1
DNS Host 1: 8.8.8.8
DNS Host 2: 8.8.4.4

or

Ping Host 1 : 192.168.1.1
Ping Host 2 : 192.168.1.2

The Health Checks will not cause a WAN to fail unless both DNS Hosts/Ping Hosts fail the Health Check.
The Event Log will not show a failure of DNS Host 1 or Ping Host 1 as long as DNS Host 2 or Ping Host 2 succeeds. The WAN is brought down, and this is recorded in the Event Log, only if both hosts fail.
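
As a rough illustration of the behaviour described above, here is a minimal Python sketch. It is not Peplink firmware code. The function name `evaluate_wan`, the `log_partial` flag, and the wording of the partial-failure log line are all hypothetical; only the "disconnected (WAN Failed DNS Lookup)" text is taken from the event log samples earlier in this thread.

```python
from typing import Dict, List, Tuple


def evaluate_wan(wan: str, host_ok: Dict[str, bool],
                 log_partial: bool = False) -> Tuple[bool, List[str]]:
    """Return (wan_up, event_log_lines) for one health check cycle.

    Assumed model: the WAN stays up as long as at least one host passes.
    log_partial models the feature requested in this thread; it is a
    hypothetical option, not an existing Peplink setting.
    """
    events: List[str] = []
    failed = [host for host, ok in host_ok.items() if not ok]
    wan_up = any(host_ok.values())

    if not wan_up:
        # Every host failed: the WAN is taken down and this is logged.
        events.append(f"WAN: {wan} disconnected (WAN Failed DNS Lookup)")
    elif failed and log_partial:
        # Hypothetical log line for a partial failure; the WAN stays up.
        events.append(
            f"WAN: {wan} health check failed for {', '.join(failed)} (WAN still up)"
        )
    return wan_up, events


# Host 1 fails, Host 2 passes: the WAN stays up and nothing is logged.
print(evaluate_wan("WAN 1", {"8.8.8.8": False, "8.8.4.4": True}))
# The same cycle with the requested (hypothetical) logging switched on.
print(evaluate_wan("WAN 1", {"8.8.8.8": False, "8.8.4.4": True}, log_partial=True))
```

With `log_partial` left off, the first call returns an empty log even though Host 1 failed, which is the gap this thread is about.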

James_S · July 22, 2014, 8:10am

Correct. So let’s say I am using ping to two different locations. If I can’t get to the first one but can get to the second, and it stays that way, then maybe the ISP is having an issue, or there is a routing issue, or any of a host of other possible issues. The Peplink is already doing the check, so how hard is it to add it to the log?

James_S · August 12, 2014, 10:46pm

I am hoping others can see how useful this would be in troubleshooting ISP issues, for example telling apart a routing issue, an issue with the server, and something else.

ricardog · September 13, 2014, 12:09pm

It might be useful for some scenarios but it could also bloat the log with irrelevant info. Most of the time, one of the test targets fails because the target itself is off-line, not your WAN.

I think you should instead use a separate service to monitor your test targets independently of your Peplink log.

adeebag · February 7, 2017, 12:39am

I would also find this level of reporting in the event log useful. In order to avoid it bloating the logs, there could be a “log all Health Check failures and immediate successes” setting to log all failures and the first success after it for Health Checks (instead of only reporting when the full range of checks for a given WAN result in a failure).

This is especially relevant if you suspect that your WAN connection is not 100% stable.
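
To make the suggested setting concrete, here is a minimal Python sketch of "log all failures and the first success after them". It is only an illustration of the idea. The function name, the tuple layout of the input, and the log line wording are all hypothetical and do not correspond to any existing Peplink option or log format.

```python
from typing import Iterable, List, Set, Tuple


def log_failures_and_first_success(
    results: Iterable[Tuple[str, str, bool]]
) -> List[str]:
    """Build log lines from (timestamp, host, ok) results in time order.

    Every failure is logged. A success is logged only when it is the
    first success after one or more failures for that host, so steady
    successes do not bloat the log.
    """
    lines: List[str] = []
    failing: Set[str] = set()
    for timestamp, host, ok in results:
        if not ok:
            failing.add(host)
            lines.append(f"{timestamp} Health check failed: {host}")
        elif host in failing:
            failing.discard(host)
            lines.append(f"{timestamp} Health check recovered: {host}")
    return lines


sample = [
    ("17:00:00", "Host 1", True),   # steady success, not logged
    ("17:00:05", "Host 1", False),  # failure, logged
    ("17:00:10", "Host 1", False),  # failure, logged
    ("17:00:15", "Host 1", True),   # first success after failure, logged
    ("17:00:20", "Host 1", True),   # steady success, not logged
]
for line in log_failures_and_first_success(sample):
    print(line)
```

Only the two failures and the first recovery are written out, so a healthy connection adds nothing to the log.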

In my case, each of my 3 WAN connections has the Health Check set to:

DNS Lookup
Host 1
Host 2
Include public DNS servers
Timeout: 2 second(s)
Health Check Interval: 5 second(s)
Health Check Retries: 3
Recovery Retries: 3

So for the Peplink to fully test the connection and identify it as a “WAN failed DNS test” in the event logs it would take (assuming full failure then full recovery rather than intermittent results within cycle):

5 secs for interval from last successful check
2 secs for timeout x 3 retries x 3 hosts (assuming it is only testing one public DNS host) = 18 secs to report as down
3 retries x 2 secs for timeout x 1 host (presumably) = 6 secs (minimum) to report as up again
total time the WAN is potentially offline for each occurrence = 29 sec
total time the event log shows that the connection was down = 6 sec (minimum)

29 seconds is long enough to disrupt many user sessions – especially VOIP, streaming, etc. And we are experiencing this frequently.

While we could reduce this time by setting the timeout to 1 sec, reducing the number of hosts, and reducing the recovery retries to 1, I am not sure if that would actually cause too many WAN state changes so that users end up suffering even more.
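
The estimate above can be reproduced with a few lines of Python, which also makes it easy to compare settings. This is a sketch of the same back-of-the-envelope model used in this post (full failure followed by full recovery), not a description of how the firmware actually schedules its checks. The function name is hypothetical, and the "tuned" example assumes 2 hosts, since the post does not say how many hosts would remain.

```python
def outage_estimate(interval_s, timeout_s, retries, hosts,
                    recovery_retries, recovery_hosts=1):
    """Estimate outage timing under the simple model in this post.

    detect_s:  time to exhaust every retry on every host
    recover_s: minimum time to pass the recovery retries
    total_s:   interval since last good check + detect_s + recover_s
    logged_s:  minimum downtime the event log would show (recover_s)
    """
    detect_s = timeout_s * retries * hosts
    recover_s = timeout_s * recovery_retries * recovery_hosts
    return {
        "detect_s": detect_s,
        "recover_s": recover_s,
        "total_s": interval_s + detect_s + recover_s,
        "logged_s": recover_s,
    }


# Settings from this post: 2 s timeout, 5 s interval, 3 retries, 3 hosts.
current = outage_estimate(interval_s=5, timeout_s=2, retries=3,
                          hosts=3, recovery_retries=3)
# Tuned alternative: 1 s timeout, 1 recovery retry, 2 hosts (assumed).
tuned = outage_estimate(interval_s=5, timeout_s=1, retries=3,
                        hosts=2, recovery_retries=1)
print(current)
print(tuned)
```

Under these assumptions the current settings give 29 seconds in total, and the tuned ones give 12 seconds, at the cost of the extra WAN state changes mentioned above.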

This also does not include scenarios where the Health Check failed for 3 retries on 2 hosts but where the final retry on the 3rd host succeeded – i.e. the connection was actually down but the event log does not show it.

Having the fuller logging in place would help us better diagnose unstable connections or even tweak our Health Check settings for the optimal balance.

Thanks.

[Peplink model: Balance One Core, Firmware: 7.0.0 build 2742]