IPSec VPN Dropping

I’ve got a MAX Transit Duo set up to establish an IPSec site-to-site VPN. The VPN establishes just fine, and traffic will pass for hours. Over time (I haven’t observed a specific pattern yet) the IPSec tunnel will stop working. The rest of the device works fine - WiFi is still up, cellular links are up, outbound traffic from behind the device works.

Nothing I do seems to cause it to attempt to re-establish the IPSec tunnel. The “Status” shows “Connecting” with the spinning wheel, but the IPSec Event Log just shows that the tunnel disconnected. It doesn’t show any connection attempts or other activity. Monitoring inbound traffic at the VPN hub shows no traffic coming from the MAX.

If I edit the settings on the MAX and disable the VPN tunnel, apply settings, and then re-enable the VPN tunnel and apply settings, I immediately see the traffic hit the VPN hub and the tunnel comes back up. Until the next time it happens.

I do not believe this is an issue on the VPN hub (other tunnels stay up when this occurs, and no connection attempts are even seen hitting the hub), and I do not believe it is related to the cellular links either: they stay established the whole time, and simply disabling and re-enabling the tunnel brings it right back up. I have also tried two different carriers as the primary link in case that was the issue, but it makes no difference. It seems like the IPSec process on the MAX just needs to be “kicked” to wake up again.

Any ideas?

I think I am seeing the same behavior on a Balance 305 running the 8.0.2 firmware - were you ever able to figure anything out? Ours seems to get in this state after exactly 16 hours.

I “fixed” it by reverting the P1 SA lifetime and P2 SA lifetime values to the Peplink defaults and then configuring the other side to match. They were set identically before, so I don’t know why this made a difference, but the tunnel remained stable for a couple of months after that. Since then I’ve replaced the far end with a Peplink router and it has been rock solid.

I believe there are just some inconsistencies/incompatibilities in the IPSec implementation and it does not appear to play nicely with non-default key lifetime values.
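
A failure at an exact interval like 16 hours does smell like a rekey boundary. If anyone wants to sanity-check their own setup, here is a trivial sketch; the lifetime values below are placeholders, not Peplink defaults, so substitute what is actually configured on both peers:

```python
# Does the observed failure interval land exactly on a rekey boundary?
# The lifetimes below are placeholders - use the P1/P2 SA lifetimes
# actually configured on both ends of the tunnel.
FAILURE_INTERVAL_S = 16 * 3600          # tunnel observed dying after exactly 16 h

lifetimes = {
    "P1 SA lifetime": 28800,            # example only: 8 h
    "P2 SA lifetime": 3600,             # example only: 1 h
}

for name, secs in lifetimes.items():
    multiple = FAILURE_INTERVAL_S / secs
    exact = FAILURE_INTERVAL_S % secs == 0
    flag = " <- exact multiple, suspicious" if exact else ""
    print(f"{name}: {secs}s -> {multiple:.2f} lifetimes{flag}")
```

If the failure time works out to a whole multiple of one of the lifetimes, the tunnel is most likely dying at that rekey rather than at random.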

The annoying part was that it would not try to re-establish the tunnel on its own. It required manual intervention every time – which was a problem since this particular setup is used for out-of-band access to remote equipment.
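
If the tunnel cannot be trusted to recover on its own, it at least helps to catch the drop the moment it happens so the disable/re-enable dance can be done quickly. A minimal watchdog sketch, assuming a Linux client behind the MAX and a placeholder far-side address of 10.0.0.1 that is only reachable through the tunnel:

```python
# Minimal tunnel watchdog sketch (assumes a Linux client behind the MAX).
# TARGET is a placeholder for a host that is only reachable through the
# IPSec tunnel; adjust it and the check interval for your network.
import subprocess
import time
from datetime import datetime

TARGET = "10.0.0.1"   # placeholder far-side address
INTERVAL = 60         # seconds between checks

def tunnel_up(host: str) -> bool:
    """Return True if a single ping to host succeeds within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

was_up = True
while True:
    up = tunnel_up(TARGET)
    if was_up and not up:
        print(f"{datetime.now().isoformat()} tunnel to {TARGET} appears DOWN")
    elif not was_up and up:
        print(f"{datetime.now().isoformat()} tunnel to {TARGET} is back UP")
    was_up = up
    time.sleep(INTERVAL)
```

This only detects and timestamps the outage; re-enabling the tunnel still has to be done on the Peplink itself.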


I have random VPN routing issues as well. Example network has a B1-Core and a B210, with LAN addresses 192.168.100.1 and 192.168.101.1 respectively. Devices on 192.168.101.0/24 can hit 192.168.100.1, but nothing further into 192.168.100.0/24; the same is true in the other direction. The correct /24 routes are being advertised across the tunnel, and no amount of tinkering with settings would fix it. Only a reboot of the B1-Core solves it. Remote users using OpenVPN also can’t route traffic when this happens.
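
For reference, the pattern is easy to demonstrate with a quick sweep from the 192.168.101.0/24 side when it happens. A rough sketch (Linux ping flags; the far-side host addresses are examples only, substitute machines you know are up):

```python
# Quick reachability sweep from a client on 192.168.101.0/24 (Linux ping
# flags). The far-side addresses are examples only; substitute hosts you
# know are up. If only the .1 answers, traffic reaches the B1-Core across
# the tunnel but goes no further into the LAN.
import subprocess

targets = ["192.168.100.1", "192.168.100.10", "192.168.100.20", "192.168.100.50"]

for host in targets:
    ok = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0
    print(f"{host}: {'reachable' if ok else 'no reply'}")
```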

Instead of rebooting, can you try disabling the VPN tunnel, applying settings, and then re-enabling the tunnel and applying settings again? That worked for me without a reboot… just curious if you’re seeing the same issue.

Are you running any non-default values for key lifetimes?

Issue happened again today. Tried dropping and restarting the tunnels. Didn’t solve it. Only a reboot of the B1-Core solved the issue.

If you are able to access 192.168.100.1 from 192.168.101.0/24, the routing should be in place. I would suggest opening a ticket so we can take a closer look.


Happened again today. Devices on the 192.168.102.0/24 end of the SF tunnel could only ping 192.168.100.1; nothing else inside 192.168.100.0/24 responded. I tried the suggestion above of disabling and re-enabling the tunnel and applying settings (to avoid a full reboot), but it didn’t work. Only a reboot of the B1-Core solved it. It’s hard to open a ticket and leave it sitting open, because as soon as this happens business grinds to a halt at the office on the 192.168.102.0/24 network and I am forced to reboot immediately.

@thebigbeav, may I know whether the Balance One Core (192.168.100.1) is able to ping all the active LAN devices (192.168.100.x) when the problem occurs?

May I know whether all the LAN devices on 192.168.100.0/24 are connected to a switch as below?

LAN devices —> Switch —> [LAN] Balance One Core

Yes, all devices in the 192.168.100.0/24 subnet can communicate fine with the B1C at 192.168.100.1 at all times; we have never noted an interruption there. All devices in the 192.168.100.0/24 subnet can also access the internet fine. Yes, there are two switches in the mix as well: an HPE JL386A and an HPE J9028B.

@thebigbeav, I suggest connecting a LAN client directly to a LAN port of the Balance One Core. This helps to isolate which part is causing the problem when it occurs again. You may proceed with the test below when the problem occurs:

  1. Start a Network Capture at http://[LAN IP]/cgi-bin/MANGA/support.cgi > Network Capture > Start.
  2. From a host on 192.168.101.x or 192.168.102.x, ping both the directly connected LAN client and an active LAN client behind the switch. 5-6 pings will do.
  3. Network Capture > Stop > Download the Network Capture.
  4. Open the Network Capture with Wireshark to confirm that the ICMP requests from 192.168.101.x or 192.168.102.x are sent out of the LAN interface and that the ICMP replies come back from the LAN clients (a scripted version of this check is sketched below).
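
If it is easier to script that last step than to eyeball the capture in Wireshark, here is a rough sketch using scapy (assumes scapy is installed; the capture filename is a placeholder for whatever the support page downloads):

```python
# Sketch for checking the downloaded capture without opening Wireshark.
# Assumes scapy is installed (pip install scapy); the capture filename
# below is a placeholder - adjust it to the file you downloaded.
from ipaddress import ip_address, ip_network
from scapy.all import rdpcap, IP, ICMP

REMOTE_NETS = [ip_network("192.168.101.0/24"), ip_network("192.168.102.0/24")]

packets = rdpcap("network_capture.pcap")
requests, replies = 0, 0

for pkt in packets:
    if IP in pkt and ICMP in pkt:
        src = ip_address(pkt[IP].src)
        dst = ip_address(pkt[IP].dst)
        icmp_type = pkt[ICMP].type
        # echo request (type 8) arriving from the remote subnets
        if icmp_type == 8 and any(src in net for net in REMOTE_NETS):
            requests += 1
        # echo reply (type 0) heading back to the remote subnets
        elif icmp_type == 0 and any(dst in net for net in REMOTE_NETS):
            replies += 1

print(f"ICMP echo requests from remote subnets seen on LAN: {requests}")
print(f"ICMP echo replies back to remote subnets seen on LAN: {replies}")
```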

You may contact your point of purchase if you need further help.

B1-Core choked again this morning. The usual rock-solid Peplink reliability is getting shaky. A reboot was the only thing that solved it. There was no time for a packet capture because the whole office was down.

I’m out of warranty so I don’t think the original vendor really wants to talk.

@thebigbeav, we need to isolate the problem as we do not yet know the root cause. That is the reason I provided the troubleshooting steps here. I know the situation is critical when the problem occurs, but there is no way to solve it if the root cause is unknown. The suggested troubleshooting steps should only take 5-10 minutes. You may reboot the Balance One Core or the switch after that.