Troubleshooting routing/packet loss issues on Metro-Ethernet WAN connections

dwang · January 30, 2019, 12:12pm

I am testing SpeedFusion in our environment with a Balance 710 as the hub and two 380s as the spokes, all running firmware 7.1.2 build 2574. The hub is our corporate HQ. The spokes are branches with 4 IP subnets each. Each Peplink has a LIVE Internet connection on WAN2, and an Ethernet subnet on WAN1 that connects all three Peplinks to simulate our private Metro-Ethernet WAN.
To test fault tolerance, the Peplink at each site is put in Drop-in mode with LAN Bypass, meaning LAN and WAN1 become just a physical conduit if the Peplink drops dead. So far things have largely worked. There is one nagging issue, however.

I noticed sporadic ICMP packet drops when pinging from a client PC at each remote site. The SpeedFusion status page on the 380s showed momentary packet loss of up to 30-40 pkt/s on the WAN1 link. I further noticed that the PepVPN on WAN1 would momentarily tear down and rebuild without leaving any messages in the Event Log. It would briefly display “Link failure, no data received” as it tore down the VPN on WAN1. I verified all cable connections and swapped out switches, to no avail.
At the same time, VPNs over WAN2 were rock solid and clients never missed a ping to Google’s DNS servers (Internet traffic is not routed through PepVPN unless WAN2 health check fails).

Based on my previous testing experience in my initial Peplink POC, I knew this had to have something to do with the way Peplink makes routing/path selection decisions. When a Peplink doesn’t know where to forward a packet, or puts packets on the wrong path, it creates issues like this. But it continued to happen even after I disconnected all WAN2 connections.

So I started tweaking the Health Check settings under each WAN interface, and later the Link Failure Detection Time under PepVPN Settings within the SpeedFusion setup. When I set Link Failure Detection Time to Recommended (approx. 15 secs), I finally got a stable network. The SpeedFusion VPN over WAN1 doesn’t tear down any more, and my average packet loss dropped to around 1% on the WAN1 links, measured by pinging. A throughput test on WAN1 with PepVPN without encryption can now get up to 80%-90% of the bandwidth I defined for WAN1.
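
For anyone who wants to reproduce a loss figure like the 1% above, here is a minimal sketch of deriving it from a ping summary line. It is an illustration, not the exact method used in this test: it assumes the summary format printed by Linux iputils ping (and the similar macOS wording), and the function name `loss_percent` is made up for this example.

```python
import re

# Assumed format: the summary line printed by Linux iputils ping, e.g.
# "100 packets transmitted, 99 received, 1% packet loss, time 99132ms".
# The optional "packets " also covers the macOS wording.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(ping_output: str) -> float:
    """Return packet loss in percent, computed from the ping summary line."""
    match = SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent = int(match.group(1))
    received = int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were sent")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    sample = "100 packets transmitted, 99 received, 1% packet loss, time 99132ms"
    print(loss_percent(sample))
```

Computing the percentage from the raw sent/received counts, rather than reading the rounded figure ping prints, keeps small loss rates visible on long runs.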

If this were a real-world test, I could live with the results. But the problem is that the WAN1 interfaces are connected back to back on an enterprise-quality access switch. There shouldn’t be any packet loss. Even the WAN2 connections, which are real-world Internet connections across various ISPs, do not have any packet drops the entire time.

Without a Cisco-like CLI I have no way of debugging. I have tried the SSH CLI. I have clicked on all the ? icons and exhausted all the hidden fields I could possibly try. And the user manual is practically useless for advanced troubleshooting.

Any insight into this will be tremendously appreciated, as I am way past my planned deployment date.

Ron_Case · January 30, 2019, 12:31pm

After logging into the Balance, type in this address to get the support.cgi page: http://<Peplink’s IP>/cgi-bin/MANGA/support.cgi

From there you can see the interface statistics, hope this helps.

dwang · January 30, 2019, 1:36pm

I happened to see that link in a different post and brought it up. Just haven’t spent time on it yet.
Thanks for the quick reply.

Any more insight into the inner workings of SpeedFusion and routing?

I came from a Cisco and Talari background. I am frustrated with the limited configuration and debugging options, and lack of in-depth documentation.

dwang · January 30, 2019, 2:03pm

Okay, the support.cgi page confirmed the issue: the WAN1 interfaces on all three Peplinks show over 2000 errors in the “Receive Dropped” column, and no other errors.
I see I can capture traffic for a short while on all connections.
But this is not going to help me find out why incoming packets are dropped on WAN1.
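
A raw “Receive Dropped” total says little on its own, so one rough way to judge it is to take two readings some time apart and turn the difference into a rate. The sketch below is illustrative only: the `Sample` fields are hypothetical names for values read by hand from the statistics page, and it assumes the received-packet counter does not already include the dropped packets, which may not hold for every driver.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One manual reading of an interface's counters (hypothetical field names)."""
    rx_packets: int  # packets received successfully
    rx_dropped: int  # packets counted under "Receive Dropped"


def drop_rate(before: Sample, after: Sample) -> float:
    """Percent of inbound packets dropped between two readings.

    Assumes rx_packets excludes dropped packets; if the counters on a given
    device include them, divide by the rx_packets delta alone instead.
    """
    dropped = after.rx_dropped - before.rx_dropped
    delivered = after.rx_packets - before.rx_packets
    total = dropped + delivered
    if total <= 0:
        return 0.0  # no traffic between readings, or the counters were reset
    return 100.0 * dropped / total


if __name__ == "__main__":
    first = Sample(rx_packets=1000, rx_dropped=0)
    second = Sample(rx_packets=1990, rx_dropped=10)
    print(drop_rate(first, second))
```

Comparing this rate with the loss seen by ping over the same interval would show whether the interface drops account for all of the missing packets.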

Also, I noticed the LAN interfaces on all three devices are in the “link: Down” status. Is this because they are in Drop-in mode with LAN Bypass? Otherwise, this may be a programming error.

Ron_Case · January 30, 2019, 2:11pm

This is unusual, something is not right. You can open a support ticket with us here so we can investigate the issue.

dwang · January 30, 2019, 2:29pm

Okay. Will give it a try. Thanks.

dwang · February 4, 2019, 3:15pm

@peplinkspecialist it’s been FIVE days since I submitted a ticket online at @Ron_Case’s suggestion, and I still have not had a single email acknowledgement from Peplink. What’s going on? I am really disappointed with the speed of the support I “get” from Peplink.

Did I make a bad choice abandoning my existing SD-WAN to move to Peplink?

I need to fix these issues (I actually have three) so that I can deploy them. I bought them in November in a trial purchase and thought they worked well in my test lab. Two months later I still can’t put them in production due to these issues.

Ron_Case · February 6, 2019, 10:15am

Please check your spam filter as I responded to this ticket on the same day. We have included your re-seller to help and I am confident we will get these issues fixed so you can deploy them successfully.

dwang · February 6, 2019, 11:02am

I checked our spam filter but couldn’t find any email from peplink.com. Could the ticket have come from a different domain?

I will check once more today.

Thank you for getting back to us!

I already have an online session with them set up for this afternoon.

Daniel Wang