SOHO: Lockups & Reboots

@Nielb

It would be appreciated if you could open a support ticket so the support team can check on the issue.

@Nielb thank you for taking the time to post here. The issue you had today is exactly like failure case #2 I posted here. WiFi is gone but existing outbound connections for wired devices somehow still work. New connections usually fail. I am glad I am not the only one to have run into this :smile:.
Also I forgot to mention earlier, but like you, when the device was in limbo, I noticed it was warmer than usual. Probably because it was busy looping somewhere in its code…
Maybe we get these issues more regularly because our SOHO is under more load?

@sitloongs you are great! Thanks a lot for working on this; I’ll try to help the best I can. So don’t hesitate to ask if you have any questions.
The problem with this kind of issue is that it takes a long time to reproduce; in our case 4.5 days more or less, but for @Nielb it seems more random. I would think the more devices you connect and the more traffic you create, the more likely it is to happen sooner rather than later. I am hoping it is not a low-level WiFi protocol bug that could be caused by interference or certain WiFi stacks on certain client devices.
I am crossing my fingers :crossed_fingers: that you run into the issue, or at least one of the three failure modes.
Finally, I still think it would be a good idea to rename this thread to something like: “SOHO: lockups and reboots” to get more attention and participation from the community.
Cheers!

Hi, I just want to add that I too have had this problem where the router simply locks up and I can’t even connect to it with an ethernet cable. The SSIDs remain broadcasting though… It’s happened maybe 5-10 times since upgrading to 7.1.1. Prior to this, it had been solid since owning it. I have a HW3 Surf SOHO.

I’ve not yet raised any tickets; I’ve been rebooting as and when required. I’d guess I’m also out of warranty now too, as I’ve had the router for what will be 2 years come Jan/Feb 2019…

I’ll be a bit disappointed if I’ve got a hardware problem after such a short length of time.

Have just renamed topic to reflect underlying issues. Just to recap on my scenario (and appreciate others’ experiences may differ):

  1. I was having lockup issues with previous firmware (7.1.0) hence the request to schedule reboots and then upgrade to 7.1.1 - I am still doing daily reboots and still on 7.1.1
  2. I’m running a Huawei HG 612 OpenReach box as “modem”, one wired connection to the SOHO + 3 WPA2 SSIDs - 2 x 2.4 GHz + 1 x 5 GHz - all on Auto and unlimited clients
  3. I have a Guest VLAN and an untagged VLAN

What seems to happen is that when a new device connects (total somewhere around 7-10) it sometimes - and only rarely (but twice the other day) - locks up. When I say locks up, I mean: WAN light stays solid, other lights stay on, no wireless and no wired connections work either - and obviously no Internet. It’s not predictable; sometimes new devices just connect and work. I’m not sure that this is a memory leak (unless a very subtle one), as I wouldn’t expect one new connection to trigger a lockup like that. The only way to recover is a hard reboot (I have to leave it for a minute or so) and then a restart. The other day it went down twice in short succession after working for a few weeks (albeit with daily reboots).

I don’t have an immediately repeatable test case, but if anyone can suggest things to try, then I will give them a go and disable the reboots (in case they were masking problems - although the 2 lockups in short order would suggest it’s not the amount of time the thing was running). I also genuinely have no idea whether this is software or a hardware fault.

Anyway, hope that’s some help in terms of sharing behaviour.

Thanks for sharing more about your issue @SOHO. It is interesting you were able to tie this to devices connecting to the WiFi network. In your case, this would lean towards a bug in the WiFi stack somewhere that would crash the router under certain conditions. On 7.1 when I had those issues randomly I was thinking of that kind of a bug, however in my case it became oddly regular (4d10h ±6h) with 7.1.1, with the exception of occasional random crashes before it reaches that mark (those are probably the random ones I was getting on 7.1 already). I still can’t understand why it would be so regular on 7.1.1 while it was random on 7.1. There might be multiple bugs we are after, one that was there already on 7.1 causing those random crashes, and a new one linked to resource exhaustion (maybe) on 7.1.1, which would explain why it is so regular given our usage patterns.
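For anyone who wants to check whether their own crashes fall into the same two groups, here is a minimal sketch that labels a crash by the uptime at which it happened, using the 4d10h ±6h window described above. The function name, the labels, and the sample uptimes are all made up for illustration; they are not measurements from this thread.

```python
from datetime import timedelta

# Window taken from the post above: 4d10h plus or minus 6h.
EXPECTED = timedelta(days=4, hours=10)
TOLERANCE = timedelta(hours=6)


def classify_uptime(uptime: timedelta) -> str:
    """Label a crash by how long the router had been up when it happened."""
    if abs(uptime - EXPECTED) <= TOLERANCE:
        return "regular"   # inside the 4d10h window
    if uptime < EXPECTED - TOLERANCE:
        return "random"    # crashed before reaching the window
    return "overdue"       # ran past the window before crashing


# Hypothetical uptimes, for illustration only.
samples = [
    timedelta(days=4, hours=8),
    timedelta(days=1, hours=3),
    timedelta(days=5),
]
for uptime in samples:
    print(uptime, classify_uptime(uptime))
```

If most entries come out as "regular" with an occasional "random", that would be consistent with the two-bug theory; a spread with no cluster would point the other way.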

Regarding things to try:
Since yesterday I have enabled the watchdog (see here) and disabled my daily reboot, which was there to prevent the router from reaching the 4d mark where it would crash and potentially require manual intervention. In theory, with the watchdog activated, it will not require manual intervention when it crashes, so I can just let it run. Since I turned on the watchdog, I have had one automatic reboot, which may have been one of those random crashes, or the watchdog might be a bit too nervous and may have triggered under load if the router was a bit late servicing the timer… not sure yet. I will continue monitoring and see if it can run for multiple days like before. In any case, I would rather have more reboots than hangs that require manual intervention…
Hopefully this gets fixed soon though because those reboots disrupt work in the office, and interrupt conference calls etc…

Finally, thank you for agreeing to change the thread topic name; it should help get the community’s attention. However, for some reason I don’t see the name change reflected on my side. Was the change saved?

Thanks sitloongs. I’ll open a ticket if it happens again. They’ve had us working crazy hours lately and unfortunately time is something I’m short on at the moment. Although I also like the idea of having a reliable router too:).

No problem peparn! The more information everyone can provide, the quicker the problem is found and fixed. Your case #2 sounds like what I’ve seen. Wifi is gone but outbound connections for wired devices somehow still work. This is half and half here.

I should have stated the Peplink is plugged directly into the back of an AT&T router. When I had the AT&T router restarted, the wired devices on the VLANs shared with the wireless devices could no longer re-establish connections with their cloud component after the AT&T router came back up. But the Raspberry Pis, which are on a separate VLAN, were able to re-establish outbound connections to their cloud component.

Maybe it’s coincidence, but the issues impacted the clients/systems on VLANs that had a WiFi component to them. I’ll have to keep a log of how long I go between the freeze ups. I believe I updated to firmware 7.1.1 back in August (I need to go look, I have it written down), and have experienced this only three times, so definitely not as frequently as you, but still more than I’d like.

Thanks for the update on the watchdog component too. I was debating turning that on but have not done so yet.

@Nielb

Please confirm your SOHO hardware revision.

Please share the ticket number with me once you have opened the support ticket. I really need your help to allow us to investigate the issue from the device.

@peparn (SOHO MK3)
My device is still running fine with your configuration; uptime is around 1 day 8 hours. I have active users using the device. Still monitoring.

@SOHO (SOHO MK3)
I will contact you via your previous ticket. It looks like there are multiple behaviors, and we need to check from the device.

@brill

Would you please confirm the hardware version? Is it a SOHO MK3?

Please open a support ticket so the support team can check.
https://contact.peplink.com/secure/create-support-ticket.html

@SOHO, the topic title change is now visible, thanks. I guess it did not take effect immediately.

Yes, it is MK3. I’ll get a ticket opened this weekend

Hi sitloongs,

The SOHO I have is a “Pepwave Surf SOHO MK3”, product code “SUS-SOHO”, hardware revision “1”.

The firmware currently running is 7.1.1 build 1342. Before that, I was running 7.0.3s031 build 1282 (I believe this was a special release for me to resolve an issue I was having when the SSID key was a 64 character hex key).

I will try to get a ticket opened this weekend to help with investigating the issue.

Thanks.

An update on my side: I am still getting the same crashes, as expected, every 4d10h ±6h now that I have turned off daily reboots. However, since I have enabled the watchdog timer, I have never had to take action to correct these issues, the router restarts by itself as expected. It is too early to conclude that the watchdog is able to catch all the failure modes I reported earlier, given that it only went through a single 4d cycle, but it is encouraging. I will post further updates after it has gone through a few more cycles. Auto recovering is the most important thing. Fixing the problem to avoid spurious network interruptions is next.

@sitloongs I am guessing your setup that clones our configuration is still happily running without any problems? Hopefully it will eventually show the issue if you run it long enough. Again, I am hoping these crashes are not caused by WiFi interference patterns or certain device WiFi stacks; these kinds of issues are almost impossible to identify/replicate when they happen. Let’s keep our fingers crossed that you will run into it :slight_smile:. Perhaps adding more devices (such as phones/tablets/VoIP devices) to your test network and having those devices change during the day (like on a regular office network) would help replicate the issue by making it more likely to happen?

@peparn

:sweat_smile::sweat_smile::sweat_smile: Yes, I am still trying to reproduce the issue. Just for your info, the issue has also been escalated to another team to investigate.

Thanks for the update @sitloongs, I am keeping my fingers crossed :crossed_fingers: that the efforts will eventually yield success. And it is great to hear that the issue is getting some internal exposure. Thanks as usual for being thorough! :smile:

Me too. Surf SOHO hardware revision 2. Firmware 7.1.1 Build 3102

I was on a WiFi connected computer remotely controlling an Ethernet connected Windows PC on the LAN. The first problem was DNS failures on the Ethernet connected computer. A connection to an existing website on that machine continued to work, but new websites would not load. This seemed typical of the initial problem: existing connections were OK, but new ones failed. The same pattern held on a third computer on the LAN. Some apps continued to run fine after the initial problem but others failed.

The remote control connection continued to work after the first problems.
Then new Wi-Fi connections to the SSIDs failed.

Then new Ethernet connections to the LAN failed. The Ethernet connected Windows PC that was being remotely controlled was connected to a smart switch. A new computer connected to the same switch could not get on the LAN.

From the Ethernet connected PC, access to the router by HTTPS and IP address failed. I don’t allow HTTP access.

From the Ethernet connected PC, I ran an “arp -a” command. It saw the router and the smart switch that the PC was connected to.

From the Ethernet connected PC, I connected to the smart switch. This worked fine. The switch said all three connections were alive and well and running at gigabit speeds. The three are: to the router, from the Windows PC and a third computer that never did get an IP address from the router. The switch showed packets coming and going on each of these three connections with no Ethernet transmission errors.

From the Ethernet connected PC, I tried to ping the router and it timed out. Normally pings to the router from the LAN side work fine. Then another arp -a showed that the router was gone.

The wifi light on the router was initially blinking when the problems started, then for a while it was dark, then it was back on.

The router had rebooted itself.
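In case it helps others capture the same sequence with timestamps, here is a rough sketch of a probe that checks two of the things tested by hand above: whether the router's HTTPS admin port accepts a new connection, and whether a new DNS lookup succeeds. The router address, the test hostname, and all function names are assumptions for illustration. It also assumes the machine uses the router for DNS, and a local DNS cache could hide a failure.

```python
import socket
from datetime import datetime

ROUTER_IP = "192.168.50.1"   # assumption: replace with your router's LAN address
TEST_NAME = "example.com"    # assumption: any public hostname will do


def tcp_reachable(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dns_resolves(name: str) -> bool:
    """Return True if the system resolver can look up the name."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except OSError:
        return False


def summarize(router_ok: bool, dns_ok: bool) -> str:
    """Turn the two probe results into a short state label."""
    if router_ok and dns_ok:
        return "healthy"
    if router_ok:
        return "router reachable, DNS failing"
    if dns_ok:
        return "router admin unreachable, DNS working"
    return "router unreachable, DNS failing"


def probe_once() -> str:
    """Run both probes and return one timestamped log line."""
    state = summarize(tcp_reachable(ROUTER_IP), dns_resolves(TEST_NAME))
    return f"{datetime.now().isoformat(timespec='seconds')} {state}"
```

Calling `probe_once()` every minute or so from a wired machine and appending the result to a file would give a timeline to attach to a support ticket, showing which symptom appeared first.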

I have a diagnostic report file, from just after the reboot, if anyone wants it.

There is nothing interesting in the Event log:
Oct 16 16:17:10 System: Time synchronization successful
Dec 31 19:02:47 System: Wi-Fi AP Normal Mode
Dec 31 19:02:27 WAN: timewarner connected ()
Dec 31 19:01:27 System: Started up (7.1.1 build 3102)
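For those keeping track of reboots, a small sketch like the one below can pull the "Started up" entries out of a pasted Event log. It assumes every line follows the same "date time source: message" layout as the four lines above; the regular expressions and function names are my own, not anything provided by the router. The entries carry no year, and the boot entries here are stamped Dec 31 while the later time sync entry is stamped Oct 16, so the sketch keeps timestamps as raw strings instead of parsing them into dates.

```python
import re

# Layout assumed from the excerpt above: "Mon DD HH:MM:SS Source: message".
LINE_RE = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ([^:]+): (.*)$")
STARTUP_RE = re.compile(r"Started up \((.+)\)")


def parse_line(line):
    """Split one log line into stamp, source and message; None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, source, message = match.groups()
    return {"stamp": stamp, "source": source, "message": message}


def find_startups(lines):
    """Return (stamp, firmware) for every 'Started up' entry found."""
    startups = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        match = STARTUP_RE.search(entry["message"])
        if match:
            startups.append((entry["stamp"], match.group(1)))
    return startups


# The four lines quoted in the post above.
LOG = """\
Oct 16 16:17:10 System: Time synchronization successful
Dec 31 19:02:47 System: Wi-Fi AP Normal Mode
Dec 31 19:02:27 WAN: timewarner connected ()
Dec 31 19:01:27 System: Started up (7.1.1 build 3102)
""".splitlines()

print(find_startups(LOG))  # [('Dec 31 19:01:27', '7.1.1 build 3102')]
```

Because the boot entries are logged before the clock is synchronized, the time of the reboot is better estimated from the next "Time synchronization successful" line than from the "Started up" stamp itself.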

@Michael234

Would you be able to open a support ticket so we can check further? We need more help from you. As of now, we only have access to one device, belonging to one of the forum users here. We are trying our best to check on the reported issue, and we are still contacting other forum users here so that we can confirm the issue.

Will do. A problem that only happens rarely is the worst to debug.

:+1::+1::+1::+1: Appreciate the help :blush::blush::blush:

More updates on my side: it has now been 10 days since I enabled the watchdog timer and, so far, I have had reboots but never had to intervene manually. Another 10 days like this and it will be the longest stretch without having to manually reboot the router.
At this point I am convinced there are 2 different issues at play in my case: one causing my crashes/reboots every 4d10h ±6h, and another with the same symptoms that happens randomly. I am still working with support on this, and the first issue may have been understood (fix under test). The second issue is still a mystery at this point.