Surf SOHO Mk3 with fw 8.3.0 - finally reliable

Hello,

I have a Surf SOHO Mk3. I have a love-hate relationship with it, but I think this weekend we found a way to get along, and perhaps to extend that relationship a little longer than expected.

SITUATION

  • FW 8.3.0.
  • Frequent lock-ups. Unable to reach the MANGA front-end or the SSH console. It was better on 8.1, but I needed the UDP relay, hence 8.3.0.
  • Router would frequently fail to establish the uplink on restart.
  • MANGA not responding, so unable to diagnose; rebooting twice or thrice would usually do the trick… the wife’s and kids’ frustration had to be managed too, understandably.

TL;DR
Disclaimer:

  • The following worked for me. Do not blindly apply this configuration, as it has impacts on functionality.
  • I expect most readers in this forum are somewhat versed in network tech and can RTFM.

1- head to MANGA/support.cgi - disable DPI (this greatly decreases QoS and Content Blocking capability, but at least the darn thing is stable), enable watchdog (why is it disabled?!)
2- head to MANGA/network.cgi → Network → [LAN] Network Settings → DNS Proxy Settings → Uncheck “Enable”, Uncheck “DNS Caching” (this may increase DNS latency, so if you have a cell backup WAN link, you might want to keep the proxy enabled, start with step 1 only, and see how it goes)
3- Apply changes. Wait. Cross your fingers it doesn’t crash due to CPU overload. If it does, reboot and try again.
4- Profit.

THE NOVEL

I’ve struggled a lot with my Surf SOHO Mk3 in the last few months. Going by comments published out there, the unit got bad press from many because of frequent crashes/network drops with recent firmware. During the pandemic, I had demoted my Surf SOHO Mk3 from main house/office router to access point, as it would frequently drop during Zoom calls. Before I went that way, I had done full diagnostics on wiring and NIC stability. It puzzled me because it had been remarkably stable for years! It started to be unstable around FW 7, maybe? Before that, I loved the unit and could live with its throughput limit. But then it became somewhat unbearable; luckily, we received a consumer router as a gift from a relative and it worked remarkably well, so the robust commercial-grade Surf SOHO was relegated to access point duty. How ironic.

As an access point, the network traffic on the Surf SOHO was lower and it would crash less frequently… until the load on the access point started to increase again. Same story as before. Last Saturday, I was thoroughly pissed at the situation (an accumulation of everyone’s frustration) and decided to get to the bottom of it. (Interlude: I’ve been a software developer since the end of the 90s, I’ve coded embedded systems and know a thing or two about debugging; this helps.)

I spent the whole Saturday rethinking the network topology/VLANs/subnets and whatnot to segregate the traffic: what was video streaming, which machines needed to be in the same subnet (Bonjour/mDNS fun), what was video conferencing… while trying to keep things somewhat simple. I managed to reduce the load on the SOHO again.

But the Surf SOHO remained flaky. Less flaky. But still flaky. I was looking for a replacement… but the Surf SOHO model isn’t ready… almost there… not yet. Part of me was hesitating, considering alternatives… But there MUST be a reason why it suddenly became flaky. Let’s dig into the stack.

I noticed MANGA would get unusually slow… to… not responding. The router would still respond to ping, but everything else would appear dead. Connecting over SSH would be impossible. I suspected the system was entering a memory-thrashing condition until an eventual lockup. Fancy firewall rules? The firewall rules are not over the top (about 10-15 entries?), so that can’t possibly be it… frankly, if my old DD-WRT router from 2010 could do it… ya know…
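
The symptom above (ping still answers while the web admin and SSH do not) can be checked from another machine on the LAN. A minimal sketch, assuming Python 3.9+ is available on that machine; the function names are made up for illustration, and the address and port numbers in the usage note are placeholders, not values taken from this setup:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host: str, ports: dict[str, int], timeout: float = 3.0) -> dict[str, bool]:
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}
```

For example, `probe("192.168.50.1", {"web admin": 443, "ssh": 22})` (placeholder address and ports) returns one True/False per service. Note that a successful TCP connect only shows the port is accepting connections; a hung service behind it could still fail to answer.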

So, how to decrease CPU usage otherwise? What else may consume memory? VPN, obviously… but I don’t have any VPN connection, so there’s not much to gain there beyond killing the VPN daemon… which we can’t do, since SSH is restricted. And yeah, it’s probably busybox-based like most routers, but the capability to just disable the VPN and/or SpeedFusion daemons isn’t accessible.

OK what else?

Anything with caching? DNS Proxy! Disable the damn thing, which was enabled. Oh wow, that helped, more than I expected. A few years ago, a page would pull from about 10 URL sources or so. Now? Facebook, Google, Microsoft, Apple URLs. Other ad services on top. CDN sites. The AWS API server URLs. All those cell phone apps trying to call home… all the Chrome tabs… the enhanced experience services in Windows and macOS… that’s a lot of entries, all things considered.
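
To get a rough feel for how many distinct names a DNS cache would have to hold, one can count the unique hostnames in a list of requested URLs (for instance, copied out of a browser's network tab). A minimal sketch; the function name and the sample URLs are made up for illustration:

```python
from urllib.parse import urlsplit


def distinct_hosts(urls):
    """Return the set of unique hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None when the string has no host part
        if host:
            hosts.add(host)
    return hosts


if __name__ == "__main__":
    sample = [
        "https://example.com/index.html",
        "https://example.com/style.css",
        "https://cdn.example.net/lib.js",
        "https://ads.example.org/pixel.gif",
    ]
    print(len(distinct_hosts(sample)))  # prints 3
```

Each distinct hostname is one potential cache entry, so the count grows with every extra ad service, CDN and background app on the network.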

So that was better, but not there yet. The throughput was bad, even after the VLAN segregation and QoS were done. Went to the support.cgi page to review the settings: what’s DPI? Why does my router have a Dots Per Inch setting? Oh, Deep Packet Inspection. Oh. Like a Suricata CPU-killer type of thing? Turned it off.

  • Throughput is back;
  • Stability is back;
  • I still have some QoS going on it seems;
  • MANGA has never been that fast over long periods;
  • I even reconfigured reboots from overnight to weekly!

Frankly, in retrospect, I don’t think the poor Surf SOHO has the CPU power to do DPI. It’s a great security feature, but it’s too modern and too CPU-intensive for a single-CPU system. I think it was a Bad Software Design Decision to allocate that much CPU budget to this (Go on, change my mind!!). The DNS Caching should carry a better warning on memory-constrained hardware, too. Better yet, delegate this to a separate system, say a container running on your NAS or something… if you can.

Hope this helps others.
If the system starts crashing again, I’ll report back, but so far so good, all systems are nominal.

hey,

I’m happy to report back that the router has been running in that configuration for over a week. Rock solid, like it used to be.

Hope this helps anyone.

Interesting write-up. Thanks.

In the interest of science, I could re-enable DPI and DNS caching alternately to further isolate the culprit, if that could be useful engineering-wise. Let me know.
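
Toggling the two settings alternately amounts to walking through every on/off combination and noting how the router behaves in each. A minimal sketch that only lists the combinations to try; the function name is made up, and nothing here talks to the router:

```python
from itertools import product


def settings_matrix(features):
    """Return every on/off combination of the given feature names."""
    return [
        dict(zip(features, states))
        for states in product([False, True], repeat=len(features))
    ]


if __name__ == "__main__":
    for combo in settings_matrix(["DPI", "DNS caching"]):
        print(combo)
```

With two features this gives four runs. The all-off run is the known-stable baseline, and the two single-feature runs are the ones that isolate each suspect.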

I did not find DPI on the support page of my Surf SOHO running firmware 8.2. Must be new in 8.3. So, maybe re-enable DNS caching and just that.

Right. On the other hand, I’ve had similar performance issues with a Synology router and its Threat Monitoring thing, which is a Synology-branded Suricata. The router started to behave again once I had disabled that process, which is seriously too heavy for such low-memory, low-CPU devices. That’s why, when I finally understood that DPI meant similar packet inspection, I had a good hunch that it was the root cause of the instabilities.

I’ll start with DNS caching and we’ll see from there.

Just to add to this interesting discussion… I too have been using a SOHO Mk3 for several years and have kept the firmware up to date, as it is now. There are certainly differences in how my SOHO is being used, and undoubtedly this is why we’ve had different experiences with the built-in IPS system, but I’ve been pleasantly surprised by how robust the router is with or without the intrusion detection subsystem running.

First, with regard to firewall rules, I’m fairly certain I’ve pushed the router to the limit, because if I go much further the UI slows down to an intolerable point. I’d guess I’ve got at least 100 active firewall rules (and another 100 or so inactive ones), just counting the standard IP/port blocking on the first page of firewall settings. On the second page, I’ve blocked pretty much every protocol that exists. I’ve blocked most types of protocol as groups, but for some I’ve had to break them out and block protocols one by one.

I’m not going to count how many protocols are being blocked, but I’d guess it’s some number above 200 or so. I really only allow a tiny few through… perhaps 10 or so. I also block perhaps 20 top-level domains. Admittedly, the SOHO is sweating, but it still runs with sufficiently high throughput to be useful to me (I’m completely content with the SOHO speeds - my first router was a 300 baud acoustic coupler I would use to connect to my friend’s BBS. I was so happy when 1200 baud came out because the text finally came in faster than I could read it - I simply don’t know what people do with the speeds they sell today - I think most of it sits figuratively collecting dust).

But I digress… the reason I’m taking the time to write this is because I wish to contrast my experiences with yours. There is some difference between how we are using the routers which leads to mine being so stable I’ve had to set a reboot cycle on it so it actually cleans itself out once in a while, while at the same time, yours has grown unstable. I don’t know what that is. I’m just adding this as food for thought.

Now, my setup is likely very different, and I’ll briefly describe it. I use the Peplink solely as a perimeter router which does nothing but block inbound traffic and keep track of a few rules regarding internal traffic, while also providing comparatively complex egress filtering. Almost all the firewall rules concern outgoing traffic (a lot of them are either external VPN addresses or whitelisted IPs for places like Microsoft, Amazon or CDNs like Akamai; I do this to keep my logs clean enough to be useful).

Behind the SOHO, I have a DMZ with a Cisco small business switch and then a Protectli router running OPNsense. That router is the polar opposite of my Peplink. It’s beefy (an i5 with 16 GB of RAM), has endless capacity for customization and additional capabilities, and it’s complex beyond belief. It handles most of the network DNS/DHCP/etc. tasks. However, I simply don’t trust it. It’s got far too big an attack surface and too many configurations to keep track of, so I keep the SOHO out front and will replace it with another Peplink router when the time comes. I can get my head entirely around the Peplink’s configurations, while I’m confident I will never get my head around the BSD-based *sense systems. You need to have a background in networking to do so, and mine is in software development.

So, other than providing almost unused DNS and DHCP services to the one or two devices in the DMZ, all the SOHO is doing is blocking incoming traffic, selectively blocking outgoing traffic, and keeping track of state. Yet, even though it has a limited role, it still has no problem keeping track of what is undoubtedly an over-complicated firewall while also successfully running the SOHO’s built-in IPS system.

It has been my impression that the Peplink’s IPS system is not as complex as something like Suricata or Snort (I too ran into performance issues on a Synology router I once had, but I sensed that was a comparatively complete Suricata rendition). On the SOHO, there’s no way to configure it other than to turn it on or off, and I don’t recall seeing any readings from it either. I just assumed it was a simple black-box setup that watched for a few specific low-hanging intrusion signatures and such. I wasn’t aware it had that much impact on performance.

As I said, I don’t have any answers or intuition to provide. Rather, I add this solely to highlight the fact that the problem lies somewhere between my admittedly simplistic use of the router and your much more complex use across what seems to be your entire network, with what I’m guessing are multiple IoT/phone/etc. devices constantly communicating with the router. I wonder if there are one or more specific devices causing the bulk of the problem… perhaps something being overly chatty (like me?). Because I don’t think the DPI activity alone is enough to cause a great deal of problems. With my (relatively) complex firewall needs and IPS turned on, I can still achieve nearly capacity speed on the router while almost never having to reboot, and it’s getting a little long in the tooth to boot.

Perhaps you might want to isolate individual devices attaching to the router to see if any of them are interacting with the SOHO in a manner causing it to overload to the point the IPS has to be turned off? I’m not sure, just offering observations. This isn’t really my area of expertise.

This is my first comment on the Peplink forum after years and years of reading these posts. Hopefully, my thoughts are at least marginally useful in further diagnosing this problem and perhaps getting your full protection back up and running. Every defense is a useful one in some way (well, hopefully).

Wow. You are really pushing the poor little SOHO to the limit! ;<) When you are ready for a B-One let me know – I’ll give you a good deal and you’ll like the experience even better! :<) :<)


Hello! Thank you for sharing your observations. You provided quite good paths of investigation. I, too, am a sw dev, and pf-based firewalls aren’t much my cup of tea. My network expertise lies more in BSD sockets than in packet routing and filtering 😀

I bet that your… comprehensive(?!) firewall rules greatly reduce the intake of packets to inspect. Reducing the dataset as early as possible remains to this day the most efficient optimization… This makes me (and, I am sure, a lot of readers of this thread) really curious about your filtering rules. Heck, many of these could perhaps be built into the product as rules with friendly names.

In my network, the Surf SOHO now sits as an AC repeater behind another router. There’s a fair amount of traffic going through, especially in the evening when streaming happens. I suspect that the router may become overwhelmed… now that I have root access through the CVE, I could simply leave top running and monitor with greater accuracy than through MANGA while Netflix is playing and so on.
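
As an alternative to watching top by eye, snapshots of the kernel counters can be captured periodically and turned into numbers afterwards. A minimal sketch, assuming the router's shell exposes the standard Linux /proc/stat and /proc/meminfo files (not confirmed anywhere in this thread) and that the captured text is parsed on another machine with Python; all function names are made up for illustration:

```python
def parse_cpu_line(line):
    """Return (total_ticks, idle_ticks) from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Fields: user nice system idle iowait irq softirq steal.
    # Guest fields are skipped because they are already counted in user/nice.
    values = [int(v) for v in parts[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
    return sum(values), idle


def cpu_busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' line snapshots."""
    total0, idle0 = parse_cpu_line(before)
    total1, idle1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 1.0 - (idle1 - idle0) / elapsed


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info
```

Feeding it two `cat /proc/stat` first lines taken a few seconds apart gives the average CPU load over that interval, and tracking `MemFree` (or `MemAvailable`, where the kernel reports it) over an evening of streaming would show whether memory really is being exhausted.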

I wonder if there could be some value in sharing even a partial list of the rules you made as an ‘awesome’ config (referring to the GitHub ‘awesome’ compilation series). I used to have the patience to build those, but I’m growing lazy with time.

PS: I do have a C64 with a wifi modem to replace the 300 (or 1200) baud VicModem. Amazing, the hardware tricks we can do these days!