Instability of Speedfusion?

Scenario… would really appreciate some practical input/comments…

DATACENTRE:
6 x Fusionhub Solo on separate VLANS - common WAN IP - 100MBit fibre connection

CLIENT SITES:

  • 6 × Balance 20, each with Speedfusion Alliance pack, and WAN3 Activations.
  • Each has vdsl internet service with ISP ‘A’ on WAN1 of Balance20, modem in bridge mode
  • Each has vdsl internet service with ISP ‘B’ on WAN2 of Balance20, modem in bridge mode
  • Each has 4G/LTE internet service with ISP ‘C’ on WAN3 of Balance 20, 4G modem in bridge mode and services have public IP (not NAT)

SPEEDFUSION:
Each of the Speedfusion tunnels is set as a point-to-point tunnel from client to Datacentre.
Client 1 has speedfusion tunnel to fusionhub 1, data port set unique to 43015
Client 2 has speedfusion tunnel to fusionhub 2, data port set unique to 25015
Client 3 has speedfusion tunnel to fusionhub 3, data port set unique to 10015
Etcetera etcetera.
All tunnels hook up fine, all routing working perfectly, voip and RDP for each client working nicely. No issue.

But… if ONE of the client internet links goes down (say) at client 1, their whole speedfusion tunnel drops, renegotiates based on now just 2 of the 3 links being available, then speedfusion comes back up. This might take a minute. Later, when the faulted internet link returns online, speedfusion again drops for that client for a minute, renegotiates the tunnel based on 3 links… and the tunnel is back online.

In speedfusion config (using incontrol2), forward error control set to low, wan smoothing at medium, WAN1 & WAN2 set as priority-1, 4G set as priority 4, Link failure detection time - have experimented with various… currently set to ‘Extreme’.

Obviously, the whole goal here is unbreakable VPN by bonding 3x internet services per client… but the speedfusion tunnels are breaking every time just one of the links fails… and thus lost voice calls and dropped RDP sessions. The recovery time of these tunnels is a minute (compared to under 4s when ipsec was used).

These are not large client sites… typically 10x SIP handsets and 10x RDP clients… so throughput is low (typically less than 4MBps)

Any ideas??? Or is it simply that I’ve misunderstood, and Speedfusion will not keep a tunnel up without a drop when a single link fails? Config error? Incorrect expectations or shortfall of the alliance pack?

Edit: All on v8.01 firmware.

Thanks for reading, and input valued.

Regards,

Brett Kitchin

PS: Is there a ‘speedfusion tunnel availability’ report available, where I might be able to demonstrate tunnel availability versus individual link availability? I.e. to be able to show clients (graphically) how much better life really is with Speedfusion?

It sounds like your client end Speed Fusion is set up with WAN priority as sequential. This is the bottom of the speed fusion setup page. If you want all three WANs connected all the time through Speed Fusion, those need to be set up with the same priority, typically all priority 1. The behavior you describe sounds like they are set up as 1, 2, 3. With sequential priority the tunnel is not connected until needed - exactly what you have now.

If your goal is to have the data go primarily through a particular WAN within the Speed Fusion tunnel, instead of using different WAN priority that can be accomplished with an outbound policy rule. For example you can set tunnel 1 within the VPN as priority so it gets used as long as available. This might work better than allowing Speed Fusion to collectively join all three tunnels because your vdsl and 4G/LTE will have very different speed and latency from each other. SIP won’t like allowing Speed Fusion to spread everything across all three WAN tunnels. The ability to prioritize individual tunnels for specific traffic is a recent feature.

Hi there,

Thanks for the input.

WAN1 & WAN2 both set to equal priority (priority 1)
WAN3 is set to priority 4.

The tunnel will drop, even if I simply manually disconnect WAN3 !

As a further example of what we’re dealing with here, see this order of events. Note the time delays involved:

  1. 1:41pm… manual disconnect of WAN 3. By 1:43pm, speedfusion tunnel drops and status “starting”

  2. By 1:45pm I lose contact with the device on incontrol

  3. 1:46pm… incontrol lost the device entirely

  4. 1:55pm the device finally comes back online, and Speedfusion back “up”

  5. 1:57pm, device now visible again in in control with WAN3 correctly shown as disconnected.

  6. At 2:03pm I manually re-enable WAN3:

  7. And… 2:05pm it drops speedfusion again
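To put numbers on the delays in the list above, here is a rough sketch that just subtracts the clock times as posted. Several entries say "by", so these are approximate upper bounds rather than measured figures, and the helper names are made up for illustration.

```python
# Clock times copied from the numbered list above (all the same afternoon).
# Several entries say "by", so the differences are rough upper bounds.

def to_minutes(clock):
    """Convert a '1:41pm' style time to minutes since midnight."""
    hours, rest = clock.split(":")
    minutes, suffix = int(rest[:2]), rest[2:].lower()
    return (int(hours) % 12 + (12 if suffix == "pm" else 0)) * 60 + minutes


def elapsed(start, end):
    """Minutes between two same-day clock times."""
    return to_minutes(end) - to_minutes(start)


disconnect_to_drop = elapsed("1:41pm", "1:43pm")  # WAN3 unplugged -> tunnel "starting"
tunnel_outage = elapsed("1:43pm", "1:55pm")       # tunnel "starting" -> tunnel back "up"
reenable_to_drop = elapsed("2:03pm", "2:05pm")    # WAN3 re-enabled -> tunnel drops again

print(f"WAN3 disconnect to tunnel drop: {disconnect_to_drop} min")
print(f"Tunnel outage: {tunnel_outage} min")
print(f"WAN3 re-enable to next drop: {reenable_to_drop} min")
```

So on these timestamps the tunnel was down for roughly 12 minutes after losing a standby link, far longer than the "minute" seen in the earlier failures.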

What about the handshake ports - eg typically 32015 have you created unique port forwarding rules for those too?

Hi there Martin.
So glad you have responded.

I have set a custom data port for each of the 6x speedfusion profiles in incontrol2:

  1. 43015
  2. 25015
  3. 10015
    …etc

At the datacentre end, I have port forwarded

43015-43020 to fusionhub1,
25015-25020 to fusionhub2,
10015-10020 to fusionhub3
etc.

These are the only ports I have forwarded.
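As a sanity check on the forwarding described above, here is a small sketch that confirms each custom data port sits inside the range forwarded to its FusionHub, that no two ranges overlap, and whether the default 32015 port Martin asked about is covered by any rule. Only the three hubs listed explicitly are included, and the dictionary and function names are my own, not anything from InControl 2 or the firewall.

```python
# Data ports and forwarded ranges as posted; only the first three hubs are listed.
data_ports = {"fusionhub1": 43015, "fusionhub2": 25015, "fusionhub3": 10015}
forwards = {
    "fusionhub1": (43015, 43020),
    "fusionhub2": (25015, 25020),
    "fusionhub3": (10015, 10020),
}
DEFAULT_PORT = 32015  # the "typical" handshake port raised in the question above


def covered(port, port_range):
    """True if port falls inside the inclusive (low, high) range."""
    low, high = port_range
    return low <= port <= high


def overlapping_pairs(ranges):
    """Return pairs of hub names whose forwarded ranges share any port."""
    names = sorted(ranges)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a_low, a_high = ranges[first]
            b_low, b_high = ranges[second]
            if a_low <= b_high and b_low <= a_high:
                pairs.append((first, second))
    return pairs


data_ports_ok = all(covered(port, forwards[hub]) for hub, port in data_ports.items())
overlaps = overlapping_pairs(forwards)
default_port_forwarded = any(covered(DEFAULT_PORT, r) for r in forwards.values())

print("Every data port inside its own range:", data_ports_ok)
print("Overlapping ranges:", overlaps)
print("Port 32015 forwarded anywhere:", default_port_forwarded)
```

With the values as posted, the data ports line up and nothing overlaps, but 32015 is not forwarded to any of the hubs.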

The doco I read stated that we only need one port… and that can be set as a custom port per speedfusion tunnel - and have done this via in control.

Are you saying that a separate setting exists for customising the handshake port too? Let me know where I can do this?

In my mind, I’d have thought speedfusion would need a port per inbound link… which is why I opened up a range of 5 ports per fusionhub.

Greatly appreciate the input Martin, Thankyou :slight_smile:

Did you set the WAN priority on SF profile in the same way as in WAN settings? Can you post a screen from SF profile?

Sure thing…

Here’s the config:

Hello Brett, @E55Technologies
We recommend that you attempt this on one of your FusionHub instances. Add in the second inbound default Port # for SpeedFusion of TCP port 32015 to one of your FusionHub appliances (including the suitable firewall and routing) as per this forum thread. Then retest what happens when a WAN port fails.

There have been a lot of improvements in FusionHub, some of which are covered in last week’s webinar posted by @Cassy_Mak here:

Next check that there are not any automatic traffic filtering rules applied to your firewall for the ports you are using (such as limiting the # of simultaneous connections from external IPs into the port #).

Happy to Help,
Marcus :slight_smile:

Hi Marcus

Thanks for the comments.

Yes, the ‘tips n tricks’ article is one I’m familiar with, and I had already tried opening up 32015 and pointing it to a Fusionhub instance so that it had both the custom port and the ‘standard’ 32015 port. Unfortunately, it did not improve the situation. Retesting again just now, I can confirm that disconnecting any one of the three WANs results in a dropped speedfusion tunnel.

For the sake of clarity: the system has no trouble establishing the tunnel.

And, re v8.01 firmware… thanks for the links. I have been on v8.01 since release. As a test, I did try reverting to v8.00 to see if any of the layer-2 functional enhancements may have introduced a problem. I can confirm that regardless of firmware version, the issue remains.

Some other things that I’ve also tried before posting here:

I have also tried moving the link detection time back from extreme to default… but unfortunately this too seems to have made no difference.

I’ve also tried removing from incontrol and re-adding. Also tried deleting tunnel configuration entirely, removing from incontrol and setting manual tunnels.

I’m running out of things to try… and hesitant to log a support case about it without exhausting options via collective brains here in this forum.

Marcus, I do appreciate the input… and it’s kind of reassuring to know that the suggestions from someone as experienced as you have also been running through my head. If there are any other ideas out there, I’d be glad to give them a try.

Hi Brett,

My first suspicion would be when the wan goes down, the InControl detected IP changes, which triggers a config update on the FusionHub end of the tunnel. You can check this by looking at the device details for the FH device and seeing if it’s logging IC2 updates to match this event.

If so, I’d suggest you try moving one of your PepVPN tunnels to a ‘star’ topology, with the hub device being the FusionHub device. By default, IC2 will not include the endpoint IP into the configuration of hub devices to prevent config churn in cases like this, at the expense of the links only being establishable from the endpoint in towards the hub.

James,

You are spot-on.

It seems that whenever one of the links changes, the system sees the need to reinvent the wheel regarding the speedfusion configuration - starting with incontrol detected IP.

I like the suggestion to try the star configuration… a lot.

Will try this evening. Thankyou :slight_smile:

@JamesPep

Update as follows: I did notice that the FusionHub was detecting change in incontrol IP of the peer whenever a link was disconnected. Moving to star configuration seems to eliminate these entries in the Event Log. Having the Fusionhub initiate the connection to the Balance20 seemed a good strategy for this. Thankyou for that tip.

However, manual disconnect of the 4G/LTE connection on WAN3 still breaks everything, even though it is set as lower priority than the two active links on WAN1 & WAN2.

The situation seemed to improve if I have WAN1 & WAN2 in ‘NAT’ mode, with the WAN IPs of the Balance20 in the DMZ of the modems. If I (say) disconnect WAN2, the tunnel stays alive. Reconnecting too - stays alive. Previously, disconnecting any of the links produced a fail.

Observation: when the WANs are set in routed config (IP FORWARDING, PPPoE), stability of speedfusion seemed much worse. Presently, WAN1 & WAN2 are now in static/NAT’d config but I have left WAN3 as a routed config… albeit in priority-4.

Disconnecting WAN3 is catastrophic to the speedfusion tunnel even with WAN1 & WAN2 active in the tunnel and WAN3 only in standby. Seems odd? To summarise… disconnecting the standby link on “priority 4” destroys the tunnel comprised of two active links on “priority 1”.

Today, when implementing this, I factory reset the Balance20 and started from scratch… because when I first implemented, and disconnected WAN3 as a test, the device (although still active and routing internet packets and responding to ping) became offline in incontrol. Fusionhub end showed tunnels stuck on ‘Starting’.

I really do value your input on this.

-Brett

If you don’t mind, I’d like to ask you to open a support ticket for this, so I can dig a bit deeper into the cause.

https://ticket.peplink.com/ticket/new/public

Ok onto it, thanks @JamesPep

Ticket: 9110560

For any who may be interested… @JamesPep suggested the star topology, and getting the FusionHub to initiate the connection. This made some notable improvements, but as already posted here, not a full fix. @jamespep then got me to log a support ticket, and within it gave me this:

“When you move to a star topology, you fixed the IP address of the hub site that’s sent to the endpoints to (ip address removed). Unfortunately, it seems that that Suppress Endpoint IPs option (pepvpn management page, select your profile, click through to the profile options page, enable “show advanced settings”, 4th from the top) is disabled.
Once you enable this, you should stop having these issues. Another option to avoid this issue is to enable the DDNS service for the devices at both ends of the link. In this case, a standard P2P link should operate correctly, as the device config only contains the DNS name of the remote device, and is not subject to changing every time a device link goes on/off line.”
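For anyone trying to picture why this matters, here is a toy model of the config churn James describes. It is only my reading of his explanation, not InControl 2's actual logic: the function names, the config layout and the DNS name are all invented, and the addresses are documentation-range examples.

```python
# Toy model only: NOT InControl 2's real behaviour. Every name here is hypothetical.

def hub_config(peer_name, detected_ip, suppress_endpoint_ips=False, ddns_name=None):
    """Render what the hub would be told about one peer under each option."""
    config = {"peer": peer_name}
    if ddns_name is not None:
        config["remote"] = ddns_name      # stable name, survives link changes
    elif not suppress_endpoint_ips:
        config["remote"] = detected_ip    # changes whenever the detected IP changes
    return config


def config_pushes(ip_history, **options):
    """Count how many times the rendered config changes across the IP history."""
    pushes = 0
    previous = None
    for ip in ip_history:
        current = hub_config("client1", ip, **options)
        if previous is not None and current != previous:
            pushes += 1
        previous = current
    return pushes


# Detected IP flips when a WAN drops, then flips back when it returns.
ip_history = ["192.0.2.10", "198.51.100.20", "192.0.2.10"]

print("IP embedded in config:", config_pushes(ip_history))
print("Suppress Endpoint IPs:", config_pushes(ip_history, suppress_endpoint_ips=True))
print("DDNS name in config:", config_pushes(ip_history, ddns_name="client1.example.com"))
```

In this model the embedded-IP case produces a config change on every link event, while the suppressed and DDNS cases produce none, which matches the behaviour described in the quote.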

…I can confirm that after selecting the “Suppress Endpoint IPs” checkbox, it all started working the way it should.

Thanks to all that offered their input: @MartinLangmaid @mldowling @Wilink_PL @Don_Ferrario …and of course @JamesPep for the solution.

Glad I could help. If you have any further issues, let us know.
