Add packet loss trigger for speedfusion failover



Some time ago, we started using latency cutoff so that we could have WAN and cell both up, but have speedfusion use WAN unless it is down OR BAD. At that time I requested that, in addition to the latency cutoff, a packet loss trigger be added.
The response was “We do not need to do that because if you have packet loss you ALWAYS have very high latency”.
Well, that is not the case. I had seen this, but had not caught proof of it. Jonathan Pitts did so. He has a device where the latency was hovering around 100ms but packet loss was very high on the WAN, so it did not trigger moving the traffic to the cell path.

Here is more detail on the use case. Note that, as I discussed in some other posts, the general philosophy of Peplink is all about quality/latency/best possible data flow with little thought to cost. I live in the world where I pay for cellular data (and yes, I charge it through to my customers, but I still need to keep it as low as possible). The point is that I am trying to improve quality while controlling costs. Yes, if I set the latency cutoff to a few ms, quality is of course great, but it uses a ton of cellular data when it does not need to. I am fine as long as latency is under 200ms; no one notices unless it is above that.

We have B710s, and soon FusionHubs, in data centers.
Remote locations have a Pepwave BR1 or similar.
PepVPN/speedfusion to two of our data centers.
WAN and cell are both priority one.
Outbound policy makes some non-phone (VPN) traffic, such as POS terminals, prefer WAN and fail over to cell.
Other traffic, such as public Wi-Fi, is enforced to WAN.
Speedfusion profiles are set to WAN pri 1, cell pri 2. WAN has a latency cutoff of, say, 400ms. We need it to be that high so it is not overly sensitive.

The net effect is that all speedfusion traffic stays on WAN unless latency goes over 400ms, then it snaps to the cellular path. Without doing this (i.e. with cell only in pri 2 as standby), it only goes to cell when WAN is totally down, but not when WAN is just crap.
BUT we do see fairly frequent events like the above picture, where packet loss is high but latency is reasonable.

So - I am asking again for an additional trigger for packet loss.
Also, I was informed recently that the “suspension time after packet loss” is not doing what I thought. I thought this was how long to stay on the next priority path after the primary is clean. Evidently it is a hard timer, i.e. go to the pri 2 path, stay there for X ms, then go back no matter what the condition is.
IF that is the case, I am also requesting a more intelligent decision here. In English:
Switch to the pri 2 path (cell) if the primary has latency over 400ms or packet loss. Then keep testing the primary path while running the live speedfusion traffic over the secondary, and return to primary when it has been clean for X ms.
Remember - we are talking about the situation where the WAN is slow/crap but not down/failed.
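
To make the request concrete, here is a rough Python sketch of the decision logic described above. It is only an illustration, not anything from Peplink firmware: the class name, method names, the 5% loss threshold, and the 10-second clean-hold value are all my own placeholders (the post only fixes the 400ms latency cutoff and calls the hold time "X ms").

```python
class PathSelector:
    """Hypothetical sketch: choose WAN or cell for the tunnel.

    Fail over when WAN latency OR packet loss crosses a threshold, and
    fail back only after WAN has been continuously clean for a hold time.
    """

    def __init__(self, latency_cutoff_ms=400.0, loss_cutoff_pct=5.0,
                 clean_hold_ms=10_000.0):
        self.latency_cutoff_ms = latency_cutoff_ms  # from the post
        self.loss_cutoff_pct = loss_cutoff_pct      # assumed value
        self.clean_hold_ms = clean_hold_ms          # "X ms", assumed value
        self.active = "wan"
        self._clean_since_ms = None  # when WAN last became clean

    def wan_is_bad(self, latency_ms, loss_pct):
        # Either condition alone is enough to call the WAN bad.
        return (latency_ms > self.latency_cutoff_ms
                or loss_pct > self.loss_cutoff_pct)

    def update(self, now_ms, wan_latency_ms, wan_loss_pct):
        """Feed one WAN probe result; return the path to use."""
        if self.wan_is_bad(wan_latency_ms, wan_loss_pct):
            # WAN is slow or lossy: use cell and restart the clean timer.
            self.active = "cell"
            self._clean_since_ms = None
        elif self.active == "cell":
            # WAN looks clean, but keep probing it while traffic runs on cell.
            if self._clean_since_ms is None:
                self._clean_since_ms = now_ms
            if now_ms - self._clean_since_ms >= self.clean_hold_ms:
                self.active = "wan"
                self._clean_since_ms = None
        return self.active


selector = PathSelector()
# The case from this post: latency around 100ms but heavy loss on WAN.
print(selector.update(now_ms=0, wan_latency_ms=100, wan_loss_pct=20))
```

The difference from a hard timer is that any bad probe while on cell resets the clean timer, so traffic only returns to WAN after an unbroken clean period.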