SOHO: Lockups & Reboots

No problem peparn! The more information everyone can provide, the quicker the problem is found and fixed. Your case #2 sounds like what I’ve seen: WiFi is gone, but outbound connections for wired devices somehow still work. This is half and half here.

I should have stated that the Peplink is plugged directly into the back of an AT&T router. When I had the AT&T router restarted, the wired connections on the VLANs shared with the wireless devices could no longer re-establish connections with their cloud component after the AT&T router came back up. But the Raspberry Pis, which are on a separate VLAN, were able to re-establish outbound connections to their cloud component.

Maybe it’s coincidence, but the issues impacted the clients/systems on VLANs that had a WiFi component to them. I’ll have to keep a log of how long I go between the freeze ups. I believe I updated to firmware 7.1.1 back in August (I need to go look, I have it written down), and have experienced this only three times, so definitely not as frequently as you, but still more than I’d like.

Thanks for the update on the watchdog component too. I was debating turning that on but have not done so yet.

@Nielb

Please confirm your SOHO hardware revision.

Please share the ticket number once you have opened the support ticket. I really need your help so that we can investigate the issue from the device.

@peparn (SOHO MK3)
My device is still running fine with your configuration. Uptime is around 1 day 8 hours. I have active users using the device. Still monitoring.

@SOHO (SOHO MK3)
I will contact you via your previous ticket. It looks like there are multiple behaviors, and we need to check from the device.

@brill

Would you please confirm the hardware version? SOHO MK3?

Please open a support ticket for the support team to check.
https://contact.peplink.com/secure/create-support-ticket.html

@SOHO, the topic title change is now visible, thanks. I guess it did not take effect immediately.

Yes, it is MK3. I’ll get a ticket opened this weekend.

Hi sitloongs,

The SOHO I have is a “Pepwave Surf SOHO MK3”, product code “SUS-SOHO”, hardware revision “1”.

The firmware currently running is 7.1.1 build 1342. Before that, I was running 7.0.3s031 build 1282 (I believe this was a special release for me to resolve an issue I was having when the SSID key was a 64-character hex key).

I will try to get a ticket opened this weekend to help with investigating the issue.

Thanks.

An update on my side: I am still getting the same crashes, as expected, every 4d10h ±6h now that I have turned off daily reboots. However, since I enabled the watchdog timer, I have never had to take action to correct these issues; the router restarts by itself as expected. It is too early to conclude that the watchdog is able to catch all the failure modes I reported earlier, given that it has only gone through a single 4d cycle, but it is encouraging. I will post further updates after it has gone through a few more cycles. Auto recovering is the most important thing. Fixing the problem to avoid spurious network interruptions is next.
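
For anyone who wants to check whether their own reboots follow a regular cycle like the 4d10h ±6h mentioned above, here is a minimal Python sketch. It assumes you keep your own list of reboot times; the timestamps below are made up for illustration and are not taken from any router in this thread. The function names are hypothetical as well.

```python
from datetime import datetime, timedelta
from statistics import mean

def uptime_intervals(reboots):
    """Return the gaps between consecutive reboot times, oldest first."""
    ordered = sorted(reboots)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def summarize(intervals):
    """Return (average gap, largest deviation from that average)."""
    seconds = [gap.total_seconds() for gap in intervals]
    avg = mean(seconds)
    spread = max(abs(s - avg) for s in seconds)
    return timedelta(seconds=avg), timedelta(seconds=spread)

# Hypothetical reboot times, for illustration only.
reboots = [
    datetime(2020, 1, 1, 8, 0),
    datetime(2020, 1, 5, 16, 0),
    datetime(2020, 1, 10, 6, 0),
]
average, spread = summarize(uptime_intervals(reboots))
print(f"average uptime {average}, spread +/- {spread}")
```

A small spread relative to the average points to something that builds up over time, while a large spread looks more like a random trigger.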

@sitloongs I am guessing your setup that clones our configuration is still happily running without any problems? Hopefully it will eventually show the issue if you run it long enough. Again, I am hoping these crashes are not caused by WiFi interference patterns or certain device WiFi stacks; these kinds of issues are almost impossible to identify or replicate when they happen. Let’s keep our fingers crossed that you will run into it :slight_smile:. Perhaps adding more devices (such as phones, tablets, and VoIP devices) to your test network and having those devices change during the day (like on a regular office network) would help replicate the issue by making it more likely to happen?

@peparn

:sweat_smile::sweat_smile::sweat_smile: Yes, I am still trying to reproduce the issue. Just for your info, the issue has also been escalated to another team to investigate.

Thanks for the update @sitloongs, I am keeping my fingers crossed :crossed_fingers: that the efforts will eventually yield success. And it is great to hear that the issue is getting some internal exposure. Thanks as usual for being thorough! :smile:

Me too. Surf SOHO hardware revision 2. Firmware 7.1.1 Build 3102.

I was on a WiFi connected computer remotely controlling an Ethernet connected Windows PC on the LAN. The first problem was DNS failures on the Ethernet connected computer. A connection to an existing website on that machine continued to work, but new websites would not load. This seemed typical of the initial problem: existing connections were OK, but new ones failed. This pattern was also true on a third computer on the LAN. Some apps continued to run fine after the initial problem but others failed.

The remote control connection continued to work after the first problems.
Then new Wi-Fi connections to the SSIDs failed.

Then new Ethernet connections to the LAN failed. The Ethernet connected Windows PC that was being remotely controlled was connected to a smart switch. A new computer connected to the same switch could not get on the LAN.

From the Ethernet connected PC, access to the router by HTTPS and IP address failed. I don’t allow HTTP access.

From the Ethernet connected PC, I ran an “arp -a” command. It saw the router and the smart switch that the PC was connected to.

From the Ethernet connected PC, I connected to the smart switch. This worked fine. The switch said all three connections were alive and well and running at gigabit speeds. The three are: to the router, from the Windows PC, and a third computer that never did get an IP address from the router. The switch showed packets coming and going on each of these 3 connections with no Ethernet transmission errors.

From the Ethernet connected PC, I tried to ping the router and it timed out. Normally, pings to the router from the LAN side work fine. Then another arp -a showed that the router was gone.

The wifi light on the router was initially blinking when the problems started, then for a while it was dark, then it was back on.

The router had rebooted itself.
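
The order in which things failed above (DNS first, then new connections, then HTTPS to the router) can be captured automatically for the next occurrence. The following is a minimal Python sketch, not a Peplink tool. The router address `192.168.50.1`, the test name `example.com`, and all function names are assumptions made for illustration. It leaves out ping because ICMP usually needs elevated privileges; a new TCP connection to the router's HTTPS port is used instead.

```python
import socket
import time
from datetime import datetime

def dns_ok(name):
    """True if the system resolver can still resolve `name`."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except OSError:  # covers socket.gaierror
        return False

def tcp_ok(host, port, timeout=3.0):
    """True if a NEW TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

def probe(checks):
    """Run each named check once and return one timestamped log line."""
    stamp = datetime.now().isoformat(timespec="seconds")
    results = ", ".join(
        f"{name}={'ok' if check() else 'FAIL'}" for name, check in checks.items()
    )
    return f"{stamp} {results}"

def watch(checks, log_path, interval=60):
    """Append one probe line to log_path every `interval` seconds, forever."""
    while True:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(probe(checks) + "\n")
        time.sleep(interval)

# Hypothetical targets: 192.168.50.1 stands in for the router's LAN address.
CHECKS = {
    "dns": lambda: dns_ok("example.com"),
    "router_https": lambda: tcp_ok("192.168.50.1", 443),
}
```

Running `watch(CHECKS, "probe.log")` on a wired machine would leave a local record of which check failed first and when, even if the router's own event log is lost in the reboot.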

I have a diagnostic report file, from just after the reboot, if anyone wants it.

There is nothing interesting in the Event log:
Oct 16 16:17:10 System: Time synchronization successful
Dec 31 19:02:47 System: Wi-Fi AP Normal Mode
Dec 31 19:02:27 WAN: timewarner connected ()
Dec 31 19:01:27 System: Started up (7.1.1 build 3102)
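
When comparing several diagnostic reports, the startup entries are the useful part of a log like this one. Below is a minimal Python sketch that pulls them out. It assumes every line keeps the `Mon DD HH:MM:SS Source: message` shape seen in the four lines above; the real log format may have variations not shown here. Note that the Dec 31 timestamps are presumably the router's clock before the time synchronization, so they should not be read as real dates.

```python
import re

# Assumed line shape, based only on the excerpt above.
LINE = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) ([^:]+): (.*)$")
STARTED = re.compile(r"Started up \((.+)\)")

def find_startups(log_text):
    """Return (timestamp, firmware) for every System 'Started up' entry."""
    startups = []
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip lines that do not fit the assumed shape
        stamp, source, message = match.groups()
        started = STARTED.search(message)
        if source == "System" and started:
            startups.append((stamp, started.group(1)))
    return startups

log = """\
Oct 16 16:17:10 System: Time synchronization successful
Dec 31 19:02:47 System: Wi-Fi AP Normal Mode
Dec 31 19:02:27 WAN: timewarner connected ()
Dec 31 19:01:27 System: Started up (7.1.1 build 3102)
"""
print(find_startups(log))
```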

@Michael234

Would you be able to open a support ticket for us to check further? We need more help from you. For now, we only have access to 1 device, from 1 of the forum users here. We are trying our best to check on the reported issue. We are still contacting other forum users here so that we can confirm the issue.

Will do. A problem that only happens rarely is the worst to debug.

:+1::+1::+1::+1: Appreciate the help :blush::blush::blush:

More updates on my side: it has now been 10 days since I enabled the watchdog timer and, so far, I have had reboots but have never had to manually intervene. Another 10 days like this and it will be the longest stretch without having to manually reboot the router.
At this point I am convinced there are 2 different issues at play in my case: one causing my crash/reboots every 4d10h ±6h, and another with the same symptoms that happens randomly. I am still working with support on this, and the first issue may have been understood (fix under test). The second issue is still a mystery at this point.

May I ask Peplink and other experts in this forum: do the SOHO and Balance routers have the ability to dump memory for analysis by Peplink? Microsoft Windows has that capability. It is certainly used to deal with challenging problems. My company’s software also does.

If a watchdog timer can reboot the router, then the watchdog timer should also be capable of dumping memory for analysis. In my opinion, it should always be collecting a minimum amount of significant information (ours does): call stacks for all threads (is there a thread deadlock?), status of ports, the status of buffers, significant recent activity (for example, status changes in ports), etc.

Having developed memory dumping and “crash” dumps (the minimum amount of significant data) for our software, I know implementation can be tricky. This is because the dump software must be careful not to use any parts of the code which may not be reliable. For example, perhaps writing to the event log might not be working anymore, necessitating another approach such as saving the information somewhere to be finally written as part of the restart.

I assume that having a memory dump, full or abbreviated, triggered by the watchdog timer and available for analysis would help solve this and other problems.
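
To make the "minimum amount of significant data" idea concrete, here is a minimal sketch in Python. It only illustrates the concept for an ordinary Python process; it says nothing about how Peplink firmware or its watchdog actually works, and the function names are hypothetical. It gathers a call stack for every live thread and writes it with plain file I/O, so it does not depend on the application's normal logging path still being healthy.

```python
import json
import sys
import threading
import time
import traceback

def collect_snapshot():
    """Gather a call stack for every live thread in this process."""
    names = {thread.ident: thread.name for thread in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        label = f"{names.get(ident, 'unknown')} ({ident})"
        stacks[label] = traceback.format_stack(frame)
    return {"taken_at": time.time(), "stacks": stacks}

def write_snapshot(path):
    """Save the snapshot to `path` so it can be inspected after a restart."""
    snapshot = collect_snapshot()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2)
    return snapshot
```

A watchdog would call something like `write_snapshot("crash.json")` just before forcing the restart. For Python programs specifically, the standard library's `faulthandler.dump_traceback_later` offers a ready-made variant that dumps all thread tracebacks after a timeout.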

@peparn

Thank you for the update. We are working on the random crash as reported. So far we have not seen this in the lab. :sweat_smile::sweat_smile::sweat_smile:

Peplink Surf SOHO froze again this morning. I’ll open a ticket here shortly and let you know the number.

Thanks,
Niel

Ticket opened: Ticket #788371

@Nielb

Yes, working on your ticket

An update: Still making progress with support. They have identified one of the two issues: the one that was causing our regular crash/reboots. This was caused by a memory leak and they have a fix for that one.
The random crashes, however, are still elusive as they can happen in the first couple hours/days after a reboot, or much later in some instances. They are looking into it. I will post here if this gets solved.

I also have a ticket open: #183123. This ticket resulted in a new SOHO being shipped, but unfortunately I am having the same issues as before, plus an additional issue the team is aware of.

Issue 1 - SOHO disconnects from WAN, outbound connections drop, as does the AP. Cannot connect to a LAN port to interrogate or intervene. System hang requires a power cycle. I was running 7.1.1 until this issue started to occur with greater frequency and looked like a memory leak. I was given a patch, 7.1.1 s064 build 1362, which seemed to work until today. See more below.

Issue 2 - I have been experiencing intermittent WiFi connects/disconnects across the two bands, 2.4 GHz and 5 GHz. WiFi connections will degrade and browser pages will not load. This may last 30 to 60 minutes and then it clears up.

I implemented an InControl2 daily reboot at 0800 and that seemed to prevent the disconnect from WAN issue but did not seem to resolve the intermittent WiFi disconnect/connect issue. This issue would not occur daily but every few days at irregular times. The latter was more of a nuisance but not a show stopper.

Unfortunately, today, Oct 31, the router disconnected from the WAN again around 1700 even though the daily InControl2 software reboot had occurred at 0800. When the router disconnects from the WAN you cannot access it via InControl2 or via a LAN port. You have to do a power cycle. This is problematic because all my IoT devices go offline, as do all of my personal WiFi devices. This means my Arlo base station, cameras and SimpliSafe camera go offline, which is not good from a security perspective.

I am now considering connecting the router to a timer that will perform a daily power cycle at 0800 since the software reboot did not prevent the problem from occurring. I want to test the power cycling to see if that prevents the router hang and WAN disconnect issue.

I was able to grab a diagnostic report yesterday and send it to the team after the WiFi connect/disconnect incident occurred. It did not involve a reboot of any sort so the diag report should be helpful.

The 3GStore and Peplink teams are aware of the issues. I am hoping we can get this resolved soon because it has been problematic for several months.

Bob

@PeteTrax, your issue #1 after getting 7.1.1 s064 looks a lot like the random crash we are hunting down in our case. After the fix for the memory leak (s064) that @sitloongs provided, which solved one of the reboot/crash cases we were having regularly, we have been getting more of the random lockups you describe (compared to the 7.1.1 release, or 7.1.0 for that matter), often in the first 3 days after the last reboot, and sometimes, like now, less often (router up for 5+ days now). In our case, though, we have enabled the watchdog, so those lockups/disconnects turn into automatic reboots, which is convenient as the router self-recovers. See above in the thread for a link to instructions to turn on the watchdog.