Speedfusion bandwidth bonding - does it actually work?

Hello forum,

I’ve been involved with a couple of projects recently where we’ve hit a massive barrier around bandwidth bonding using SpeedFusion, which, according to the marketing material, is one of the key reasons to go with Peplink!
I’ve had some discussions with support, but I don’t seem to get an answer that makes sense to either me or the customer involved, so I was wondering if you lot had any ideas or input!

So, in one example we have two separate ISPs, both over fibre, going into an HD4 MBX tested with 8.2 and 8.3 beta firmware.
WAN1, no SpeedFusion: speedtest 310Mbps down, 36Mbps up
WAN2, no SpeedFusion: speedtest 230Mbps down, 36Mbps up

We’re using SpeedFusion Connect Protect; the closest location is London, which of course has a 200Mbps limitation.

SFC tests from the MBX itself, default settings, no WAN smoothing or FEC - just pure SpeedFusion! We used the plain Bonding algorithm, which did better than Dynamic Weighted Bonding; we lost about 20Mbps or so with DWB.
Test 1 - Overall: 179.6431 Mbps (Peak ~190Mbps, with packet loss)
Test 2 - Overall: 172.7785 Mbps (Peak ~190Mbps, with packet loss)
Test 3 - Overall: 164.7171 Mbps (Peak ~190Mbps, with packet loss)
Test 4 - from speedtest.net direct: 133Mbps down, 34.5Mbps up
Not perfect - we were expecting to max out the 200Mbps given the WANs we have. Is this a realistic expectation?
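
For context, this is the rough maths behind that expectation - a back-of-the-envelope sketch where the ~10% tunnel overhead is just my assumption, not a figure Peplink publishes:

```python
# Back-of-the-envelope check on the bonded results above.
# The 10% tunnel/encapsulation overhead is an assumption for illustration only.
wan_down = [310, 230]        # Mbps, per-WAN speedtest with no SpeedFusion
sfcp_cap = 200               # Mbps, SpeedFusion Connect Protect plan limit
assumed_overhead = 0.10      # assumed tunnel overhead fraction

ceiling = min(sum(wan_down), sfcp_cap)       # the 200Mbps cap binds first
expected = ceiling * (1 - assumed_overhead)  # ~180 Mbps

measured = [179.6, 172.8, 164.7]             # the three SFC tests above
print(f"expected ~{expected:.0f} Mbps, measured average {sum(measured)/len(measured):.1f} Mbps")
```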

Then it gets worse, and I really can’t get an answer which is logical, to me at least!
So, we enabled WAN smoothing, set it to normal.
The result was 67Mbps down via speedtest.net, which is half the value from test 4 - that kind of makes sense, because it’s 2x the data. But what doesn’t make sense is that we’re so far off the capabilities of one WAN on its own.
I did a second test from the MBX itself: Overall: 40.4029 Mbps, 113 retrans / 414 KB cwnd - even worse!

The answer from support was as follows:

So, it is not an apples-to-apples comparison as you can see; the additional hop (in this case the SFC node) might lead to a different path to the speedtest server and give you a different result.
From the result you obtained, it looks like the WAN(s) have packet loss, which will impact the overall performance severely.
Overall: 40.4029 Mbps 113 retrans / 414 KB cwnd

The main thing in my head is: when doing a speedtest with no SF tech we get really good results, with SF we get bad results, yet the blame goes to the WAN? Bearing in mind these are fibre connections, not cellular or WiFi.
On top of this, this is a test environment - in the real example we’ll be bonding 2x Starlink terminals at 340Mbps each. Am I still only going to see 67Mbps throughput when I WAN smooth and bond?
I know there’s no use case mentioned here; this is purely about the technology.

Is there something I’m missing? Am I expecting too much? I really don’t know, but at the moment I can’t see the actual benefit of bonding these connections; they’re better off being separate…
Thanks for reading!

https://forum.peplink.com/t/Speedfusion-Cloud-Test/6230c024dee5cede84496221/1

Not sure that really answers it: “Your ceiling is always going to be the fastest WAN, on your network” - but we’re not even at the speed of the slower of the two, let alone the fastest, and it doesn’t explain the limited throughput we got with smoothing.

What is the CPU / RAM usage during the speed tests?

There are a few real-time monitoring graphs available to see which WAN is in use. What do the graphs show?

What does the spec sheet say for the expected SpeedFusion performance?

Out of curiosity, what performance do you see when the WANs are not bonded, at the same priority and performing 2 concurrent speed tests?

There are a few configuration surprises too, related to WAN priority, health checks, etc.

I didn’t monitor this, but I can check the next time the setup is online. As it’s an HD4 MBX, though, I don’t expect it to be maxing out! The only difference is that next time it’s online it’ll have two Starlinks, so it won’t be a direct comparison to the previous results, but I expect to see a similar drop-off.
Will provide graphs next time also.
Not sure what you mean about the spec sheet for SF performance? But the SFC plans are capped at 200Mbps (https://www.peplink.com/software/speedfusion-cloud/)
We didn’t run the speedtests concurrently, only separately with 1 WAN at a time, so I’ll have to check this also.
Thanks for your input.

Marketing are going to write what sells and sounds best. In practice we find SF bonding works reasonably well under most circumstances with a bit of tuning, but bear in mind that 1+1 does not equal 2: there are overheads at work here, and they become more obvious when you are dealing with fast WAN links (losing 20% on a couple of 10Mbps lines is one thing; losing 20% on a 300Mbps line is quite a lot more in absolute terms).
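
To put rough numbers on that (the overhead percentage is purely illustrative, not a measured figure):

```python
# Purely illustrative: the same relative overhead costs far more absolute
# bandwidth on a fast link than on a slow one.
assumed_overhead = 0.20                 # assume 20% lost to tunnel overheads
for wan_mbps in (10, 300):
    lost = wan_mbps * assumed_overhead
    print(f"{wan_mbps:>3} Mbps WAN -> ~{lost:.0f} Mbps consumed by overhead")
```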

And what are the capacities of these WANs meant to be? Guessing some sort of domestic 300/30 FTTP product?

Which leads to the next question to clarify - are you doing any kind of PPPoE on the MBX4? PPP on Peplink is entirely CPU-bound and has been seen to have relatively high overheads.

That 200Mbps limit is common to all SFCP servers. Another thing common to them is that the SF tunnels have encryption enabled and you cannot disable it; however, the MBX4 should be able to handle ~500Mbps of encrypted PepVPN traffic. When testing, what is the CPU loading on the MBX4?

That seems a little bit low; however, averaging 170-180ish and peaking at 190 is probably not a million miles off the mark.

When you are testing, what are the loss/latency statistics for the WANs? Use the detailed PepVPN graphs and run the tests for about 10 minutes.

What happens if you test with just a single WAN enabled for PepVPN? For example, I can happily max out 500Mbps of PepVPN on a 310-5G with a gigabit line plugged into it, but if I throw in a low-bandwidth Starlink and expect the bonding to handle such differences in latency/loss/capacity, I get far worse performance.

That is often the case, yes. By default DWB takes loss/latency on the links into account and will back off traffic to avoid saturating one path; outright bonding can often get better headline results, but at the expense of introducing higher latency and loss across the tunnel. Generally we use DWB by default these days, as it is better suited to our application and typical connectivity mix in terms of WANs (often ropey cellular and satellite services with varying latency and capacity).

WAN Smoothing and FEC will almost never result in higher observable throughput. They are designed for error correction and for providing greater redundancy for traffic sent over the PepVPN tunnel, at the expense of sending duplicate data and using more bandwidth. Generally, you also shouldn’t just be turning these on blindly; they are features to address specific issues where quality of connectivity is the paramount concern rather than outright capacity.

My experience of Starlink is that the service itself is highly variable in terms of quality and latency/loss, and PepVPN performance across the service is also highly variable. We never deploy just Starlink, and will often use DWB with some 4G/5G thrown into the mix to help smooth out the gaps in service.

No, you don’t state an actual use case, and that is rather important, as generally speaking you can optimise how the SF tunnel is configured. Despite what people would like to think, there is always an amount of per-location tuning required to get the best results from these boxes, so you may need to explain a little bit about what the end goal here is. We do a lot of live broadcast over PepVPN without issues, but then we also have specific requirements about what we need to deliver bandwidth-wise and can configure the equipment to give us the best chance of that.

I would expect your equipment (the MBX4 is a good, higher-end box) to be able to deliver 200Mbps reasonably consistently without much effort. Sometimes load balancing is much better than trying to bond - it depends on the use case.

You can also use outbound policies to direct specific traffic into the SF tunnel, so that you are only sending down that path the traffic that really needs the protection it offers.

One thing I would strongly suggest, though, is that you consider running your own FusionHub. You will get much better visibility into the traffic across the SF tunnel from both directions, as well as a few more controls to let you fine-tune the SF config. Even the cheapest VMs from Vultr and DO have been able to deliver around 1Gbps of PepVPN traffic for me with the right equipment on the far end, or if you have your own VM infrastructure and suitable connectivity you can host it there instead.

A SpeedFusion Solo licence can be obtained from your InControl2 and will allow your MBX4 to connect, so the costs here vs using SFCP could actually work out better for you if you have high traffic requirements.


Hi Will,
First of all thanks for the input.
On the overhead that’s fine and we can accept the losses are magnified on fast WANs, there’s always a trade off somewhere!
Capacities of the WANs are basically as you mentioned, but with no PPPoE - there’s a modem already at site and that’s just fed into the MBX WAN port in DHCP mode.
The device will be back online (albeit with a different setup, which I’ll mention further down) in the next few days, so I can monitor/test CPU and capture detailed graphs/outputs.

WAN Smoothing and FEC will almost never result in higher observable throughput. They are designed for error correction and for providing greater redundancy for traffic sent over the PepVPN tunnel, at the expense of sending duplicate data and using more bandwidth. Generally, you also shouldn’t just be turning these on blindly; they are features to address specific issues where quality of connectivity is the paramount concern rather than outright capacity.

One of the applications we have is a small amount of streaming going via SFC, but the rest just goes straight out of the WANs, set via the outbound policy depending on the network it’s coming from. On this though, I’m still slightly confused by the performance results - we had, say, ~170Mbps on average without smoothing, and with it around 40. I’d have expected roughly half of the 170 due to the two streams, but it seems to have been about a quarter.
Good to hear your experience of Starlink - we’ve had it mentioned a number of times, so it’s good to know this. The boat this is going on will have 2x Starlinks plus SIMs, and traffic from different networks (crew, VIPs etc.) will go out of different WANs depending on the importance/traffic type. It remains to be seen what sort of speeds we’ll get, though, and whether the 200Mbps cap will even be reached!
Your note on running our own FusionHub may well be a viable option here, though unfortunately I can’t do a direct comparison vs SFC at the moment as the device is currently being shipped. But as you mention, if we don’t have the 200Mbps cap it may well give us what we need here.

That actually also jogged my memory on another Starlink “thing” - as far as I am aware you cannot change the LAN-side subnet from the default, so make sure they are configured for bridge mode, although even then there is a chance you could end up with overlapping IP space on the WANs. Peplink does handle this, but it can lead to some “odd” behaviour if multiple WANs have the same subnet.

Yes, that is low, and I would agree it should be higher. Smoothing set to “normal” basically doubles the bandwidth requirement: to send 10Mbps you would need two WANs each sending 10Mbps, which appears at the hub as 20Mbps (also consider that your SFCP data usage is doubled!). So if you were reliably able to get 200Mbps out of the system, it would be reasonable to expect approximately 50% of that with smoothing enabled.
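
As a quick sketch of that arithmetic (simplified - the factor of two is the “normal” smoothing level described above, not an exact model of how SpeedFusion schedules packets):

```python
# Simplified sketch: smoothing at "normal" sends every packet twice, so the
# goodput you can expect is roughly the usable tunnel capacity divided by
# the duplication factor. Illustration only, not SpeedFusion's actual scheduler.
def expected_goodput(usable_tunnel_mbps: float, duplication_factor: int = 2) -> float:
    return usable_tunnel_mbps / duplication_factor

print(expected_goodput(170))   # ~85 Mbps expected with "normal" smoothing
# The measured ~40 Mbps is roughly half of that again, which is why it
# looks like something else (loss/retransmits, CPU, path) is also in play.
```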

That is a good approach and a good starting point.

Remember when building outbound policies that the first match applies, so ordering is important. We will also often just dead-end traffic (enforced WAN + drop if not up) for low-priority users, to protect the most important traffic in these circumstances.
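
To make the first-match point concrete, here is a toy sketch of how ordered rules get evaluated - the rule names, networks and actions are invented for the example; real policies live in the Peplink web UI, not in code:

```python
# Toy illustration of first-match outbound policy evaluation.
# Networks and actions below are made up for the example.
import ipaddress

# Ordered list of (source network, action) - ordering matters, first match wins.
policies = [
    ("10.0.10.0/24", "SpeedFusion tunnel"),              # critical streaming VLAN
    ("10.0.20.0/24", "Enforced WAN1 (drop if not up)"),  # low-priority users, dead-ended
    ("0.0.0.0/0",    "Load balance across WANs"),        # everything else
]

def route(src_ip: str) -> str:
    addr = ipaddress.ip_address(src_ip)
    for network, action in policies:
        if addr in ipaddress.ip_network(network):
            return action                                 # first match applies
    return "no match"

print(route("10.0.10.5"))    # SpeedFusion tunnel
print(route("10.0.20.7"))    # Enforced WAN1 (drop if not up)
print(route("192.168.1.9"))  # Load balance across WANs
```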

The next step would also be to look into sub-tunnels on the PepVPN side. We often do this where, for example, we have a tunnel with all the FEC/smoothing enabled and use that for the most critical traffic (video/voice), but also need “hitless failover” for other traffic without the smoothing/FEC overheads.

You could also play around with the priority ordering of the WANs in each SF bundle - we quite often do this where the primary good connection is used and the cellular stuff sits in lower priorities (maybe even different priorities depending on the relative performance of each WAN at that location).

If you were running your own hub, you could also use it as a bandwidth test server, which tests the end-to-end capacity of each WAN to the hub outside of the PepVPN tunnel. We often find this can really show up a provider in the mix with a bad traffic path to the hub.

In a similar vein, if CPU is not the constraint at the MBX end, you could bring up a second hub/SFCP tunnel and use that for less critical traffic, if it were just a case of spreading the load across multiple tunnels to work around some ceiling you may be hitting. (Again, with Starlink I’ve found that end-to-end testing shows great capacity, but PepVPN performance is far more variable, and on some cellular networks I’m quite convinced I’ve seen a lot of evidence of VPN traffic in general being deprioritised at times.)

PS - one other aspect that is often overlooked here is the video TX/RX chain. Protocols like SRT, which are designed for variable-quality / lossy links, are generally far more robust in these circumstances than straight RTMP. We typically use some Haivision Makito gear to get video from the remote location back to a main MCR/studio before actual TX, or services like Castr will let you send them SRT and redistribute it as RTMP into the various popular platforms.


Thanks again for the input, Will - most useful to hear about your experiences and be able to apply these tips!
I’ll come back with some more results once we have the device online again.