Add packet loss trigger for speedfusion failover

jmpfas · March 2, 2019, 7:37pm

Some time ago we started using latency cutoff so that we could keep wan and cell both up while speedfusion uses wan unless it is down OR BAD. At that time I requested that a packet loss trigger be added in addition to the latency cutoff.
The response was “We do not need to do that because if you have packet loss you ALWAYS have very high latency”.
Well, that is not the case. I had seen this, but had not caught proof of it. Jonathan Pitts did so. He has a device where the latency was hovering around 100ms but packet loss was very high on the WAN, so it did not trigger moving the traffic to the cell path.

Here is more detail on the use case. As I discussed in some other posts, the general philosophy of Peplink is all about quality/latency/best possible data flow, with little thought to cost. I live in the world where I pay for cellular data (and yes, I charge it through to my customers, but I still need to keep it as low as possible). The point is that I am trying to improve quality while controlling costs. Yes, if I set latency cutoff to a few ms, quality is of course great, but it uses a ton of cellular data when it does not need to. I am fine as long as latency is under 200ms. No one notices unless it is above that.

We have B710s and soon fusionhubs in data centers
remote locations have pepwave BR1 or similar
PEPvpn/speedfusion to two of our data centers.
WAN and cell both priority one.
Outbound policy makes some non-phone (vpn) traffic such as POS terminals prefer wan and fail over to cell
Other traffic, such as public wi-fi, is enforced to WAN
Speedfusion profiles are set WAN pri 1, cell pri 2. WAN has a latency cutoff of, say, 400ms. We need it to be that high so it is not overly sensitive.

The net effect is that all speedfusion traffic stays on WAN unless latency goes over 400ms, then it snaps to the cellular path. Without doing this (i.e. with cell in pri 2 (standby)), it only goes to cell when WAN is totally down, but not when WAN is just crap.
BUT we do see fairly frequent events like the above picture, where packet loss is high but latency is reasonable.

So - I am asking again for an additional trigger for packet loss.
Also, I was informed recently that the “suspension time after packet loss” is not doing what I thought. I thought this was how long to stay on the next priority path after primary is clean…evidently it is a hard timer, i.e. go to the pri 2 path, stay there for X ms, then go back no matter what the condition is.
IF that is the case, I am also requesting a more intelligent decision here. In English:
Switch to the pri 2 path (cell) if primary has latency over 400ms or packet loss. Then keep testing the primary path while running the live speedfusion over secondary, and return to primary when it has been clean for X ms.
Remember - we are talking about the situation where the WAN is slow/crap but not down/failed.
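
To make the request concrete, here is a rough sketch of the decision logic described above. It is only an illustration, not how Peplink implements anything. The class name, the 5% loss threshold, and the 30 second hold value are assumptions made up for the example; only the 400ms cutoff comes from the post. It assumes something else keeps measuring the WAN path while traffic runs over cell and feeds the results in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathSelector:
    """Hypothetical model of the requested behaviour (not Peplink code)."""
    latency_cutoff_ms: float = 400.0   # cutoff used in the post
    loss_cutoff: float = 0.05          # ASSUMED loss ratio that counts as "bad"
    clean_hold_ms: int = 30_000        # the "X ms" from the post; value ASSUMED
    active: str = "wan"
    clean_since_ms: Optional[int] = None

    def wan_is_bad(self, latency_ms: float, loss_ratio: float) -> bool:
        # Either condition alone is enough to call the WAN bad.
        return latency_ms > self.latency_cutoff_ms or loss_ratio > self.loss_cutoff

    def update(self, now_ms: int, wan_latency_ms: float, wan_loss_ratio: float) -> str:
        """Feed one WAN measurement and return the path to use ("wan" or "cell")."""
        bad = self.wan_is_bad(wan_latency_ms, wan_loss_ratio)
        if self.active == "wan":
            if bad:
                self.active = "cell"
                self.clean_since_ms = None
        elif bad:
            # Still on cell and WAN is still bad: restart the clean timer.
            self.clean_since_ms = None
        else:
            if self.clean_since_ms is None:
                self.clean_since_ms = now_ms
            # Return only after WAN has been continuously clean for the hold time.
            if now_ms - self.clean_since_ms >= self.clean_hold_ms:
                self.active = "wan"
                self.clean_since_ms = None
        return self.active
```

The difference from a hard timer is that any bad sample while on cell resets the clock, so the tunnel only returns to WAN after an unbroken clean period.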

mantis2k · June 9, 2019, 10:01pm

+1. A packet loss failover condition is a fairly universal feature.

Jonathan_Pitts · June 11, 2019, 7:22pm

Peplink can we get a response on this?

TK_Liew · June 12, 2019, 8:45pm

This feature request makes sense and we have filed it. It is in the queue and under review by the engineering team.

Thanks.

Jonathan_Pitts · June 20, 2020, 5:20am

Any update on this? Can it get added to 8.1?

TK_Liew · June 21, 2020, 11:29pm

@Jonathan_Pitts, it is not included in 8.1.0. I have sent a note to the engineering team about your request.

Thanks.

jmpfas · June 22, 2020, 7:29am

Great. That will eliminate a fair number of service-affecting events.
Can we be sure to add event log entries for these transitions? Possibly an optional InControl email trigger? For example:
18:30:25 Speedfusion “Columbus” wan=>cellular due to latency 510ms
18:30:31 Speedfusion “Columbus” return to wan
18:40:30 Speedfusion “Trenton” Wan=>cellular due to packet loss 3 in 5 seconds

I do think that the setting for the trigger should be something like X packets lost in Y seconds. I mean…do we want to jump to cellular if one packet is lost? Probably not. But if it is one every ten seconds, probably yes.
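
A rough sketch of what an "X packets lost in Y seconds" trigger could look like, just to illustrate the idea. The class and method names are made up for this example, and the defaults of 3 losses in 5 seconds are borrowed from the sample log line above rather than from any real setting.

```python
from collections import deque


class LossWindowTrigger:
    """Hypothetical sliding-window trigger: trips at X losses within Y seconds."""

    def __init__(self, max_lost: int = 3, window_s: float = 5.0):
        self.max_lost = max_lost      # "X" in the post
        self.window_s = window_s      # "Y" in the post
        self._loss_times = deque()    # timestamps of recent lost packets

    def record_loss(self, now_s: float) -> bool:
        """Note one lost packet and report whether the trigger has tripped."""
        self._loss_times.append(now_s)
        return self.tripped(now_s)

    def tripped(self, now_s: float) -> bool:
        # Forget losses that have aged out of the window.
        while self._loss_times and now_s - self._loss_times[0] > self.window_s:
            self._loss_times.popleft()
        return len(self._loss_times) >= self.max_lost
```

With these defaults a single lost packet never trips the trigger, but three losses inside any five second span do. Setting it to something like 2 losses in 20 seconds would cover the slower "one every ten seconds" case.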

mystery · June 22, 2020, 10:46am

Subscribed and +1 on the feature request. This would be helpful for me too. I have some cases with low latency and packet loss.

Jonathan_Pitts · September 9, 2020, 2:23pm

@TK_Liew
@sitloongs

Can this be added to 8.1.1?