Serious Streaming Issues (100% cpu)

Machtl · November 3, 2020, 2:48am

Hi,

I have a simple test setup here with a Max Transit Duo on:
1x WAN (Cable) 100/20Mbit
2x LTE-A 50/20Mbit

On the LAN side I connected a simple switch:

My problem right now is that when I run a speedtest on the 2nd PC (which results in 20/20Mbit), the live update of the router GUI on the Config/Control PC is totally blocked. It is as if the LAN connection were saturated. What is going on here?

I get the same result watching the GUI via the local LAN and via InControl Remote Web Login.

The router is peaking at 100% CPU usage. So I don’t get it: how do I get the maximum out of the bandwidth without maxing out the router at 100% CPU usage?

Best regards,
Martin

TK_Liew · November 4, 2020, 12:46am

Do you mean the Transit’s CPU load shows 100% and then its web UI hangs while the speed test result shows 20Mbps (up/down) on the Speedtest.net PC? If so, may I know what upload and download throughput is shown on the Dashboard of the Transit when the CPU load is 100%?

Machtl · November 4, 2020, 2:07am

Hi, I have to recheck the speedtest, but we have a serious issue here with the Max Transit Duo.

Running a YouTube streaming test (8Mbit/s stream from an encoder) to the FusionHub. WAN smoothing=normal, FEC=low, WAN is a cable modem with 100/20Mbit, 2x LTE-A are 50/20Mbit.

This “simple” test is maxing out the router so that the YouTube connection drops from time to time. We bought the Peplink solution to have a bulletproof setup for our remote streaming events, and this is currently going in a much worse direction. Here is a screenshot showing the GUI Dashboard and the SpeedFusion status side by side:

What is going on here?

I let it run for a few minutes, and now both of the LTE connections dropped out at the same time, again with 100% CPU usage:

The tunnel broke down with no traffic even though the WAN cable was still connected:

Here is a pic with the latency values:

We then ran it for a while on only the two LTE-A interfaces. After a few minutes we connected the local WAN via cable and also via Wi-Fi as WAN on 2.4GHz. There was a NAS backup running on the local WAN, so the latency of those two links was high:

This resulted in a complete loss of the YouTube stream.

Can someone please help us solve these issues? Otherwise the whole Peplink solution is useless for us.

Thank you very much!

Best regards,
Martin

WillJones · November 4, 2020, 2:39am

Hi @Machtl ,

If all you are doing is pushing a stream to YouTube you may consider turning off the PepVPN encryption, that will certainly buy some CPU cycles - I assume that the CPU at the hub is not maxed out at any point either?

I would also consider turning off the WAN smoothing and FEC to begin with: strip the config down to the basics and then identify whether one feature in particular is driving the CPU. The utilisation certainly seems relatively high for the volume of traffic, but all of those features combined will add an overhead.

I also notice from the graphs in your screenshot that you have very high levels of packet loss / out of order packets - this is possibly something to look into. If one of the WANs is significantly contributing to that, the FH / TST will be having to work much harder to retransmit / reassemble packets generated from smoothing or FEC.

We use the TST Duo a lot for the same use case and it has been working very well for us, but do bear in mind the unit is only rated for 60Mbps of PepVPN with encryption enabled, and adding smoothing + FEC will certainly drive that number lower and push the CPU much harder.
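To put rough numbers on that overhead, here is a minimal back-of-the-envelope sketch. The multipliers are assumptions for illustration only (WAN smoothing at normal roughly doubling the tunnel traffic, FEC at low adding about 13%), not figures confirmed in this thread, and `tunnel_load_mbps` is a hypothetical helper name. Check the Peplink documentation for your firmware before relying on them.

```python
def tunnel_load_mbps(stream_mbps, smoothing_factor=2.0, fec_overhead=0.133):
    """Estimate the traffic the router has to push through the tunnel.

    smoothing_factor and fec_overhead are ASSUMED values for illustration:
    smoothing duplicates packets across WANs, FEC adds parity packets on top.
    """
    return stream_mbps * smoothing_factor * (1.0 + fec_overhead)


if __name__ == "__main__":
    rated_mbps = 60.0  # encrypted PepVPN rating quoted above
    load = tunnel_load_mbps(8.0)  # the 8Mbit/s encoder stream from this thread
    print(f"estimated tunnel load: {load:.1f} Mbps "
          f"({load / rated_mbps:.0%} of the {rated_mbps:.0f} Mbps rating)")
```

Even under these assumptions the estimate stays well below the rating, which is why stripping features one at a time is a sensible way to find what is actually eating the CPU.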

Machtl · November 4, 2020, 3:07am

CPU at the FusionHub is chilling, that’s a dual Xeon system.

Well, most of the time we do not know how good the Internet connections are at the remote locations. The whole point of investing in Peplink was to use the locally provided Internet connection via Ethernet cable and/or Wi-Fi and also have two LTE connections in the router, to be sure we can stream in/out without any issues. We have had cases where a locally provided Internet connection was super fine at the beginning but broke down a few hours into an event. We need a solution to compensate for this. That’s why we bought the Peplink.

If the router is not capable of getting an 8Mbit/s stream out over 3 WAN connections with WAN smoothing=normal and FEC=low because of a CPU limitation, then this device is clearly falsely advertised and we have to have a serious talk with our distributor.

Turning WAN smoothing off results in worse link quality. FEC should help with the LTE connections, shouldn’t it?

So currently I am very confused about all this.

Erik_B · November 4, 2020, 3:18am

Hi Machtl,

Just a few thoughts after reading your post.
It can take quite a bit of time running PepVPN tests to find the right combination in settings for the best SpeedFusion performance.
Bonding WANs with different characteristics makes it even more difficult and doesn’t always mean you see increased performance.
On these occasions you could be better off using some WANs in priority 2.
Run several PepVPN tests with Remote connections enabled and use different combinations of local and remote WANs to find out what works best for your situation.

Using SIM cards from different cellular providers usually works better than using SIM cards from the same provider.

Don’t be alarmed by seeing the CPU go up to 100%, as long as it is brief. You see that quite regularly when a change is made.

We don’t use WAN smoothing and FEC at the same time.
You could try disabling WAN smoothing and configuring FEC to high instead.
You can find some more info on FEC in this forum post. It is worth watching the video.

And make sure you configure the upload and download speeds of each WAN connection (set them to 80% of the actual speeds).
The router uses those values when the WAN is used.
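As a quick illustration of the 80% rule, here is a small sketch using the nominal line speeds from the first post (100/20 for cable, 50/20 for each LTE-A link). The WAN labels and the `configured_speed` helper are made up for this example, and ideally you would feed in measured speeds rather than nominal ones.

```python
def configured_speed(actual_mbps, factor=0.8):
    """Value to enter in the WAN speed field: a fraction of the actual speed."""
    return round(actual_mbps * factor, 1)


# (download, upload) in Mbit/s, nominal values taken from the first post.
# The labels are hypothetical names for this example.
wans = {
    "Cable WAN": (100, 20),
    "LTE-A 1": (50, 20),
    "LTE-A 2": (50, 20),
}

for name, (down, up) in wans.items():
    print(f"{name}: download {configured_speed(down)} Mbps, "
          f"upload {configured_speed(up)} Mbps")
```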

Hope that helps!

Machtl · November 4, 2020, 3:29am

Hi,

I have been testing for days now. Turning off WAN smoothing and then having an issue with the WAN connection results in a complete drop of the stream. The problem is, when we arrive at a location, we of course do a speedtest. Last time we had an Ethernet WAN connection that was capable of 350/350Mbit/s.
After around 4 hours, something happened at the location that occupied the line, and we were not even able to get around 2.5Mbit/s out.

So, what numbers should we dial into the up/down speed fields?

I also did a test with WAN and LTE both present, where I limited the bandwidth of the WAN on the uplink router to only 500kbit/s or 1Mbit/s to see how the Peplink would handle this. The problem was that the latency was still low for the WAN connection, but the throughput was limited. Without WAN smoothing turned on, this resulted in a lost stream, even though 1/2 LTE connections were still present at the time.

Encryption is no issue for us; I started testing with an unencrypted PepVPN tunnel. We will of course continue our testing.

Best regards,
Martin

WillJones · November 4, 2020, 3:54am

What health check interval are you using with the PepVPN? When set to “extreme” I have been able to keep streams to YT Live and many other platforms going just fine if I pull a WAN cable or eject a SIM, with the Ethernet WAN set to 1st priority and the two LTE modems in 2nd priority - so not SF Bonding but SF Hot Failover.

I use Dummynet pipes to simulate various WAN links - a quick and dirty way to use this is to install pfSense on a little x86 box (or even in a VM) add some VLAN interfaces to it and break them out via a managed switch, you can specify random packet loss / latency values for each interface - I have this setup in my lab where I can simulate a dozen different WAN links with high latency / loss etc.

Machtl · November 4, 2020, 4:03am

Currently the PepVPN setting for Link Failure Detection Time is Recommended. I can test it with Extreme.
The thing is, another reason for using the Peplink solution is session persistence. When setting it to extreme and putting the LTEs in 2nd priority, that would kick the VPN tunnel and a switch to the FusionCloud as fallback would occur, right? In that case I would lose session-dependent connections when we do, for example, a GoToWebinar, so the primary goal is to keep the VPN tunnel alive as long as possible.
When does the Peplink router decide that the WAN link is down? Pulling the cable is the extreme version, yes, but what if the throughput breaks down to a few hundred kilobits/s while the link is still alive?
Is there some way to set a “minimum” bandwidth that a WAN link must provide in order to be included in the SpeedFusion tunnel?

WillJones · November 4, 2020, 4:20am

No, the sessions should still be persistent and the standby links are brought into active use.

Here is an example of a TST we have in use right now, where the venue has given us a 100/100 WAN and we are getting 80+Mbps up/down on both the LTE WANs - bonding these together actually hurt performance, so they are all just in failover order.

Testing this in rehearsals by pulling the WAN cable, the stream (1 RTMP feed to YT Live and 1 RTMP feed to a private platform) was unaffected, failover was <1 second and the session was persistent. We have 3 VPN hubs configured on this TST but they are only used when the primary hub is totally unavailable.

Yes it is the extreme failure scenario, I tend to be watching the bandwidth going in/out via each link from the PepVPN status when something is live and can manually disable the path if it looks to have gone bad.

We tweak the healthchecks on the WANs to be more aggressive, and actually ping two targets deep within our own network so we have a fair idea of reachability to the internet and our own infrastructure, which lets us trigger a WAN health failure on low-level packet loss. You do have to find values that will work though, as with settings that are too aggressive the links may thrash between healthy and unhealthy states.
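To illustrate why those values matter, here is a minimal sketch of a health check that needs several consecutive failed rounds before declaring a WAN unhealthy and several consecutive good rounds before recovering. This is not Peplink's implementation. The class name, thresholds and the rule that a round only fails when every target is unreachable are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class WanHealth:
    """Toy health tracker; all thresholds are hypothetical example values."""
    fail_after: int = 3      # consecutive failed rounds before going unhealthy
    recover_after: int = 5   # consecutive good rounds before going healthy again
    healthy: bool = True
    _fails: int = 0
    _oks: int = 0

    def update(self, target_results):
        """Feed one probe round: one boolean per ping target."""
        # Assumption: a round fails only if every target is unreachable.
        if any(target_results):
            self._oks += 1
            self._fails = 0
            if not self.healthy and self._oks >= self.recover_after:
                self.healthy = True
        else:
            self._fails += 1
            self._oks = 0
            if self.healthy and self._fails >= self.fail_after:
                self.healthy = False
        return self.healthy


if __name__ == "__main__":
    wan = WanHealth(fail_after=2, recover_after=3)
    rounds = [[True, False], [False, False], [False, False],
              [True, True], [True, True], [True, True]]
    for results in rounds:
        print(results, "->", "healthy" if wan.update(results) else "unhealthy")
```

Lower values for `fail_after` react faster to a bad link but make thrashing more likely, while a higher `recover_after` keeps a flapping link out of service for longer.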

As far as I know there is no feature in Peplink at the moment to say “if observed throughput is lower than X consider the WAN useless”, as they do not do any kind of active measurement of the links involved in the VPN. We actually returned to using Peplink recently from a different product which had this feature, and honestly it was often trigger-happy about declaring a link unusable, and all it really did was generate a huge amount of excess usage performing active measurements.

Perhaps in the future though Peplink could introduce some sort of hysteresis curve that would allow this to be done passively, but such methods also typically require a good knowledge of the historical performance of a given link - not ideal when you are using it in one location for a few hours at most.
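For what it is worth, the basic idea of a hysteresis gate can be sketched in a few lines. To be clear, this is a hypothetical illustration and not an existing Peplink feature: the class name and thresholds are invented, and it assumes the samples are taken while the link is actually being asked to carry traffic (an idle link would otherwise look "slow").

```python
class ThroughputGate:
    """Two-threshold (hysteresis) gate on observed throughput.

    The link is marked unusable when throughput drops below `low_kbps`
    and only marked usable again once it rises above `high_kbps`.
    """

    def __init__(self, low_kbps, high_kbps):
        if low_kbps >= high_kbps:
            raise ValueError("low_kbps must be below high_kbps")
        self.low_kbps = low_kbps
        self.high_kbps = high_kbps
        self.usable = True

    def observe(self, kbps):
        """Feed one throughput sample and return the current verdict."""
        if self.usable and kbps < self.low_kbps:
            self.usable = False
        elif not self.usable and kbps > self.high_kbps:
            self.usable = True
        return self.usable


if __name__ == "__main__":
    gate = ThroughputGate(low_kbps=500, high_kbps=1500)  # example thresholds
    for sample in [2000, 800, 400, 900, 1400, 1600]:
        print(sample, "kbit/s ->", "usable" if gate.observe(sample) else "unusable")
```

The gap between the two thresholds is what stops a link hovering around a single limit from flapping in and out of the tunnel.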

Machtl · November 4, 2020, 4:32am

Thank you for your long answer, I really appreciate it!

In your current scenario, your “WAN is gone” trigger depends on your configured health check method, right? Hmm… I think this will take some time to figure out if it does not directly depend on a bandwidth limit.

This is a more realistic scenario I am testing right now. The WAN connection (local Internet in the office) is totally occupied at the moment by the Synology NAS doing some backups to another NAS. So the latency is rising a lot, and it seems the router can handle this better than a low latency with limited throughput. I also disabled encryption, and it looks like this freed up a few percent of CPU. But I am currently testing again with WAN smoothing set to normal and FEC set to low.

We will continue testing… So, even when there is no active WAN in a PepVPN tunnel (at the moment of the failover), the link and the sessions are still persistent? I have to test this as well. And yes, of course, we will use two different LTE providers out in the field.

Thanks!

Best regards, Martin

P.S.: Do you have a go-through method for testing all the WAN connections? Did you make yourself a list of outbound policies for a config laptop to send the traffic to each WAN alone, and to do captive portal logins on a location’s Wi-Fi where needed?

WillJones · November 4, 2020, 5:08am

In that exact example yes, and the targets for the healthcheck are two IPs within our core network (if both are unable to answer, it is safe to say 2 different datacentres have gone dark for me!).

This is possibly a result of how I believe Peplink does their passive measurements of the WANs, but someone more knowledgeable can probably explain that better than myself.

Out of interest, did you try the dynamic weighted bonding option (DWB)? You can enable it for the tunnels by visiting the support.cgi page. I have had mixed results with it, but it has proven effective when links are very variable or less closely matched in latency / loss / capacity. There are some threads on here about it, so it is something else to look into perhaps.

Of course your mileage may vary from mine, but this is my experience: when using the extreme setting for the PepVPN healthcheck, the standby links are brought into use so quickly that no sessions drop or expire.

I tend to use the WAN analysis tester to a spare FusionHub we host in the same location as the production units, as for me this proves the end-to-end capacity of each link between the remote network and mine outside of the PepVPN. After that I use the PepVPN bandwidth test itself to verify VPN performance on site, and also good old iPerf, again to a server in our network, and public speedtest servers. We even host a Speedtest.net server in the same location as one of our FusionHubs, so I can benchmark the VPN with their tool without testing against some random (and often severely underprovisioned and overloaded) server on the public internet.

rich205 · November 4, 2020, 10:43am

Hi, Just following your thread.

We are streaming on MAX 700s and get locked out of the user interface when the router starts being used. Fortunately everything else seems to work OK, but we can’t log in until the actual traffic requirements slow down. Thing is, we’re only talking 20-30Mbps to lock the routers out.

MartinLangmaid · November 4, 2020, 10:59am

You said here that they were HW2 versions. The MAX 700 HW2 has a VPN limit of 25Mbps because of CPU / hardware restrictions - so that would make sense.

rich205 · November 4, 2020, 11:06am

Yes, I just saw the reply on the other thread, so it may not be the same issue causing a similar result.

joelbean · November 5, 2020, 4:46am

I do wish the CPUs would handle more bandwidth while managing tunnels. This is my major problem with the Max Transit. I have two LTE connections that can provide 100Mbps+ each, but when bonded, they give me a max of 65Mbps due to CPU restrictions. Unfortunate bottleneck.

WillJones · November 5, 2020, 5:04am

To be fair to Peplink this is one way they differentiate between a Max Transit, an MBX-4 and the SDX… higher performance models are available with more powerful guts (and the obvious increase in price).

Differentiating performance / capabilities of a product in this way is quite typical of all network equipment vendors.

joelbean · November 5, 2020, 5:16am

I understand your comment, however the Max Transit Duo is advertised to support 150 users and 400Mbps. This is at least a small/medium office expectation.

Fair enough, but when you look at the advanced features that are clearly intended to support a larger client base with enterprise-class requirements…

…all of a sudden you need to turn on Speedfusion and have the Max Transit Duo manage tunnels. Then the router drops to 65Mbps, best case.

This seems to suggest a different expectation and purpose for this particular router. Which is it, really?

Machtl · November 6, 2020, 11:19am

I can’t find this “dynamic weighted bonding option” on the Transit Duo. Can someone give me a hint where I can find it? Thx!

WillJones · November 6, 2020, 12:35pm

Log into your router either directly or via InControl, then in your browser’s address bar change “index.cgi” to “support.cgi”. It is hidden there.