Latency worse with bonding active on the MAX HD4


#1

I have a Pepwave - Model: MAX HD4 Firmware: 7.0.1 build 2988
With bonding active the latency is over 200ms compared to 70ms with bonding off. The speed is also quite a lot slower with bonding active.

Can anyone recommend a fix for this?

Thanks
ITO


#2

What sort of links are you bonding on the HD4? What Peplink device are you connecting to with the HD4 for bonding, and how much bandwidth does that have?

Go to Status -> Speedfusion and click the graphing button, then run some bonding throughput tests and post a screenshot of the graph here so we can see what’s going on.

You might find you need to adjust the latency cut-off; see more info here:


#3

Good Evening Martin,

Many thanks for coming back to me on this.

I’m currently bonding three cellular links on the HD4. Please see the image of the graph below, taken after running two separate throughput tests.

I hope this adds a little detail, I look forward to hearing your thoughts.

Thanks
ITO


#4

Look at the link above about latency cut-off, and try a cut-off of 150ms. Your Cellular 2 connection is all over the place latency-wise, and the cut-off will stop the HD4 using it whenever its latency goes above 150ms, which will improve the overall performance and latency of the tunnel.
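As a rough illustration of what a latency cut-off does, here is a minimal sketch. It is a simplification and not Peplink's actual SpeedFusion logic: the function name and the per-link sample figures are made up for illustration, and only the 150ms threshold comes from the suggestion above.

```python
def links_within_cutoff(latency_ms, cutoff_ms=150):
    """Return the names of links whose latest latency is at or below the cut-off."""
    return [name for name, ms in latency_ms.items() if ms <= cutoff_ms]


# Hypothetical snapshot of per-link latency in milliseconds (not taken from the graph).
snapshot = {"Cellular 1": 75, "Cellular 2": 420, "Cellular 3": 80}

# Cellular 2 is over the cut-off, so it is left out until its latency recovers.
print(links_within_cutoff(snapshot))
```

The point is only that a link is skipped while it is above the threshold; the remaining links still carry their own latency, so a cut-off cannot bring the tunnel below the latency of the best link.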


#5

Hi Martin,

I set all three cut-off latencies to 150ms; however, all connections are still showing latency of 200-220ms when bonded. When bonded mode is deactivated I’m able to achieve 70-80ms latency on each connection individually.

Could it possibly be an issue relating to AES 256 Encryption being activated causing this?

Thanks
ITO


#6

How/where are you measuring the latency at 70-80ms when using load balancing?
What is the RTT between the HD4 and the remote peer device when you ping it from a device on the LAN of the HD4?

Latency will always be higher when bonding since you’re adding in more hops for the traffic:

HD4 LAN -> HD4 -> [SpeedFusion VPN] -> REMOTE PEPLINK -> Target System

Yes, AES encryption has a CPU overhead and as such can add latency. If you don’t need it, turn it off.


#7

Hi Martin,

Many thanks for coming back to me again on this.

Just to confirm when I quote latency at 70-80ms, this is on each individual interface without load balancing.

RTT between the HD4 and the remote peer device when pinged from a device on the LAN of the HD4 is between 205 and 220ms.

I tried the tests again with the Encryption off and the latency was not affected.

We’re currently using an American IP with our setup so that we can get access to American TV channels. Could this also be where we are experiencing the additional latency?

I hope this makes sense Martin, if you need any further information from me on this then please let me know.

Thanks
ITO


#8

Yes but latency to and from where?

So let’s talk about latency for a moment and where we can measure it (as much for the rest of the room as for us, but it’s still worth a review).

  • I am on a Windows PC connected to a Peplink Balance (RTT <1ms)
  • The Balance has a public IP address as it is connected to my ISP. The next hop from my Balance to the ISP is their ingress router. (RTT 7-8ms)
  • The Balance is connected via SF to my Fusionhub. If I ping its public WAN IP direct from my Windows PC (so not in the tunnel) I get RTT 21-23ms
  • I have an encrypted SF tunnel in place so I can ping its LAN IP over the tunnel: 23-24ms
  • Let’s ping something interesting direct via my ISP, www.bbc.co.uk: 25-26ms
  • Now I’ll ping www.bbc.co.uk from my Fusionhub: 5ms
  • Now I’ll add an outbound rule to my Balance and force bbc.co.uk traffic over the tunnel: 28-32ms
  • Now I’ll run a continuous ping to bbc.co.uk and also run a speedtest on my PC. When the download portion of the test runs I see 33-44ms latency. When the upload portion runs I see 85-91ms latency.

What have we learnt? If I ping bbc.co.uk from my PC I get 25-26ms; if I ping via SF I get 28-32ms, a difference of 3-6ms. That’s no big deal and expected, as there is now another hop for my traffic to go through, and the Fusionhub is hosted in a good datacenter with great connectivity.

When my WAN is saturated (with a speedtest) latency goes through the roof. It more than triples when the upload bandwidth is saturated…
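A minimal sketch of the arithmetic behind those comparisons, using the figures from the list above. The helper name is made up for illustration, pairing the low ends and the high ends of each range is a simplification, and it assumes the continuous ping during the speedtest was still routed over the tunnel, which the list does not state explicitly.

```python
def added_latency(baseline, measured):
    """Difference between two (low, high) RTT ranges in ms, pairing low with low and high with high."""
    return (measured[0] - baseline[0], measured[1] - baseline[1])


direct = (25, 26)         # ping bbc.co.uk direct via the ISP
tunnel = (28, 32)         # same ping forced over SpeedFusion
download_busy = (33, 44)  # same ping while the download test runs
upload_busy = (85, 91)    # same ping while the upload test runs

print(added_latency(direct, tunnel))         # (3, 6)   cost of the extra hop
print(added_latency(tunnel, download_busy))  # (5, 12)  cost of a busy downlink
print(added_latency(tunnel, upload_busy))    # (57, 59) cost of a saturated uplink
```

Seen this way, the tunnel itself adds a few milliseconds, while a saturated uplink adds tens of milliseconds.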

You’re using cellular links. By their very nature they have higher latency than a fiber connection (the radio access network adds scheduling and retransmission delay on top of the backhaul), and cell towers are surprisingly easy to saturate, both from an RF perspective and a backhaul bandwidth perspective.

This can be because the tower is very busy serving other subscribers or because it is a tower on the end of a daisy chain of towers connected using point to point microwave / RF links.

So what you’re seeing when you get 205-220ms of latency is the cumulative latency of all the hops between you and the end target server /service.

Let’s take another look at your graph above:

  1. When upload bandwidth is saturated the latency of Cellular 2 goes nuts (800ms).
  2. (and 3 + 4) Even when there is no traffic of note passing through the tunnel, Cellular 2 latency is all over the place.

So the Cellular 2 connection is likely to a tower/operator which is oversubscribed in one way or another.

But you’ve added the 150ms cut-off latency, so we can conclude that the overall cellular latency from the other links is not causing the perceived latency elevation (your observed 200ms+). So what is?

The only way to test this is to go step by step like I did above, testing from as many points along the routing path as you can to see where the latency is coming from.
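One way to make that step-by-step testing repeatable is to script it. The sketch below is only a starting point under stated assumptions: it shells out to the system `ping` with the `-c` count flag as found on Linux and macOS (Windows uses `-n` and prints a different summary), and it parses the `min/avg/max` summary line those versions print. The function names are made up for illustration.

```python
import re
import subprocess

# Matches the summary line printed by Linux and macOS ping, for example:
# "rtt min/avg/max/mdev = 21.1/22.4/23.9/0.9 ms"
_SUMMARY = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/")


def parse_avg_rtt(ping_output):
    """Return the average RTT in ms from ping's summary line, or None if it is absent."""
    match = _SUMMARY.search(ping_output)
    return float(match.group(2)) if match else None


def avg_rtt(host, count=5):
    """Ping a host with the system ping ('-c' flag, Linux/macOS) and return the average RTT in ms."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_avg_rtt(result.stdout)


def survey(targets):
    """Measure each (label, host) pair in order and return (label, avg_rtt_ms) pairs."""
    return [(label, avg_rtt(host)) for label, host in targets]
```

Call `survey` with your own list of measurement points in path order, for example the HD4's LAN IP, the remote peer's WAN IP, the remote peer's LAN IP over the tunnel, and the final target. A `None` result means no summary line was found, typically because the host did not answer.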

I’m not sure where you are based, but obviously you’re not in the States. So we’re likely looking at intercontinental traffic, and that brings a few extra gotchas.

I have a Fusionhub in a datacenter in LA. If I ping its WAN IP from my PC here it’s 202ms. If I ping the Fusionhub from another Fusionhub hosted in a UK datacenter it’s 144ms. If I ping the LAN IP of the LA Fusionhub over SpeedFusion I get 167-173ms.

So the question becomes - why is the latency lower over SpeedFusion compared to direct via my ISP?

It’s all down to relationships between the network operators, their infrastructure, and the firms that run the intercontinental links and their infrastructure.

We know latency rises as the number of links traversed increases and as the traffic over those links increases to the point of saturation. What is less obvious perhaps is that our ISPs will ultimately pay for the amount of bandwidth they use on the transatlantic links, and as such they manage the traffic across those links. They have DPI tools and load balancing methodologies (and other approaches to traffic management, I’m sure) that all add latency to the traffic. They also have different routing paths between countries and continents.

Here are the tracerts that show that best:
Via my SpeedFusion VPN:

  1. is my local router
  2. is my UK Fusionhub SF endpoint
  3. is a firewall appliance I use
  4. is the remote (LA) Fusionhub SF endpoint
  5. is the LAN IP of the FH.

So a total of 166-171ms latency (36-46ms between me and the UK datacenter) and, as expected, the bulk of that latency (125-130ms) is added when traffic hops across the intercontinental link between the Fusionhubs.
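The same breakdown can be done mechanically for any traceroute: subtract each point's cumulative RTT from the next to see which segment adds the most. A minimal sketch, using the midpoints of the two ranges quoted above (the function name and segment labels are made up for illustration):

```python
def segment_latency(cumulative):
    """Turn ordered (label, cumulative_rtt_ms) points into (label, added_ms) per segment."""
    segments = []
    previous = 0.0
    for label, rtt in cumulative:
        segments.append((label, rtt - previous))
        previous = rtt
    return segments


# Midpoints of the quoted ranges: 36-46ms to the UK datacenter, 166-171ms end to end.
path = [("me -> UK datacenter", 41.0), ("UK datacenter -> LA Fusionhub", 168.5)]

segments = segment_latency(path)
worst = max(segments, key=lambda s: s[1])
print(segments)
print("largest contributor:", worst[0])
```

With these midpoints the intercontinental segment comes out at 127.5ms, inside the 125-130ms quoted above.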

Now let’s look at the tracert between my router here and the WAN of the LA Fusionhub, so traffic passing direct over the respective ISPs’ infrastructure.

This time we have 17 hops:
  • 1-6 are routers within my ISP (22-23ms)
  • 7-13 are an ISP out of Stockholm that my ISP has a peering relationship with (196+ ms)
  • 14 is the datacenter ingress point in LA
  • 15-16 are the datacenter infrastructure routers
  • 17 is the WAN IP of the FH (hops 14-17 add very little additional latency)

We can see that traffic is routed from me in Exeter to London > Hamburg > Denmark > New York > LA.

So a total of 200-205ms latency, 27-31ms of which is between me and my ISP’s egress router in London. The rest (180ms ish) is between London and my datacenter in LA over the peering network that is operated by Telia.net.

What’s interesting is that if I run a tracert from my virtual firewall in the UK datacenter to the WAN IP of the LA Fusionhub, I see that the traffic also goes via Telia.net, but before it gets to the Telia.net infrastructure it passes through another ISP router in London operated by CenturyLink (formerly level3.net). If we look at the GeoIP data we see the route this time is: Exeter > London > Chicago > LA


And there we have the reason for the lower latency between the fusionhubs - distance and hops.

Traffic sent via my ISP travels via a longer route through Europe, while traffic sent between the Fusionhubs travels direct from London to Chicago and onwards to LA without the European detour.
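To get a feel for how much distance alone matters, here is a rough sketch that compares the two routes by great-circle distance and turns each into a lower bound on round-trip time. Everything in it is an approximation: the coordinates are rounded city-centre values, Copenhagen stands in for "Denmark", real cables do not follow great circles, and the figure of about 200km per millisecond is the usual rule of thumb for light in optical fiber. The names are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light in glass covers roughly 200 km per millisecond

# Approximate (latitude, longitude) per city; Copenhagen stands in for "Denmark".
CITIES = {
    "Exeter": (50.72, -3.53),
    "London": (51.51, -0.13),
    "Hamburg": (53.55, 9.99),
    "Copenhagen": (55.68, 12.57),
    "New York": (40.71, -74.01),
    "Chicago": (41.88, -87.63),
    "Los Angeles": (34.05, -118.24),
}


def great_circle_km(a, b):
    """Haversine distance in km between two named cities."""
    lat1, lon1 = map(radians, CITIES[a])
    lat2, lon2 = map(radians, CITIES[b])
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def route_km(stops):
    """Total great-circle length of a route given as an ordered list of city names."""
    return sum(great_circle_km(a, b) for a, b in zip(stops, stops[1:]))


def min_rtt_ms(km):
    """Lower bound on round-trip time over fiber for a one-way distance in km."""
    return 2 * km / FIBER_KM_PER_MS


via_europe = route_km(["Exeter", "London", "Hamburg", "Copenhagen", "New York", "Los Angeles"])
direct = route_km(["Exeter", "London", "Chicago", "Los Angeles"])

for name, km in (("via Europe", via_europe), ("via Chicago", direct)):
    print(f"{name}: about {km:.0f} km, RTT at least {min_rtt_ms(km):.0f} ms")
```

Distance only sets a floor. The measured figures are well above it because every router, queue and traffic management box on the way adds its own delay, which is where the number of hops comes in.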

Why? Because my datacenter operator has a commercial relationship with CenturyLink (who have another relationship with Telia), who provide a lower latency, more direct route to the same datacenter’s LA location.

That’s why, when we build international and intercontinental SpeedFusion SDNs for our customers here at Slingshot6, we never do that by directly peering devices hosted in different countries via their own in-country ISP networks. Instead we always build out FusionHubs as local points of presence in local datacenters in the host countries first, to take advantage of the higher quality, lower latency intercontinental links between those datacenters.