SOHO: Lockups & Reboots


#22

No problem, peparn! The more information everyone can provide, the quicker the problem is found and fixed. Your case #2 sounds like what I’ve seen. WiFi is gone, but outbound connections for wired devices somehow still work. This is half and half here.

I should have stated that the Peplink is plugged directly into the back of an AT&T router. When I had the AT&T router restarted, the wired devices on the same VLANs as the wireless devices could no longer re-establish connections with their cloud component after the AT&T router came back up. But the Raspberry Pis, which are on a separate VLAN, were able to re-establish outbound connections to their cloud component.

Maybe it’s coincidence, but the issues impacted the clients/systems on VLANs that had a WiFi component to them. I’ll have to keep a log of how long I go between the freeze ups. I believe I updated to firmware 7.1.1 back in August (I need to go look, I have it written down), and have experienced this only three times, so definitely not as frequently as you, but still more than I’d like.

Thanks for the update on the watchdog component too. I was debating turning that on but have not done so yet.


#23

@Nielb

Do confirm your SOHO hardware revision.

Please share the ticket number once you have opened the support ticket. I really need your help so that we can investigate the issue from the device.

@peparn (SOHO MK3)
My device is still running fine with your configuration; uptime is around 1 day 8 hours. I have active users using the device. Still monitoring.

@SOHO (SOHO MK3)
I will contact you via your previous ticket. It looks like there are multiple behaviors, and we need to check from the device.

@brill

Would you please confirm the hardware version? SOHO MK3?

Please open a support ticket for the support team to check.
https://contact.peplink.com/secure/create-support-ticket.html


#24

@SOHO, the topic title change is now visible, thanks. I guess it did not take effect immediately.


#25

Yes, it is MK3. I’ll get a ticket opened this weekend.


#26

Hi sitloongs,

The SOHO I have is a “Pepwave Surf SOHO MK3”, product code “SUS-SOHO”, hardware revision “1”.

The firmware currently running is 7.1.1 build 1342. Before that, I was running 7.0.3s031 build 1282 (I believe this was a special release for me to resolve an issue I was having when the SSID key was a 64 character hex key).

I will try to get a ticket opened this weekend to help with investigating the issue.

Thanks.


#27

An update on my side: I am still getting the same crashes, as expected, every 4d10h ±6h now that I have turned off daily reboots. However, since I have enabled the watchdog timer, I have never had to take action to correct these issues, the router restarts by itself as expected. It is too early to conclude that the watchdog is able to catch all the failure modes I reported earlier, given that it only went through a single 4d cycle, but it is encouraging. I will post further updates after it has gone through a few more cycles. Auto recovering is the most important thing. Fixing the problem to avoid spurious network interruptions is next.

@sitloongs I am guessing your setup that clones our configuration is still happily running without any problems? Hopefully it will eventually show the issue if you run it long enough. Again, I am hoping these crashes are not caused by WiFi interference patterns or certain device WiFi stacks; these kinds of issues are almost impossible to identify or replicate when they happen. Let’s keep our fingers crossed that you will run into it :slight_smile:. Perhaps adding more devices (such as phones, tablets, and VoIP devices) to your test network, and having those devices change during the day (like on a regular office network), would help replicate the issue by making it more likely to happen?


#28

@peparn

:sweat_smile::sweat_smile::sweat_smile: Yes, I am still trying to reproduce the issue. Just for your information, the issue has also been escalated to another team to investigate.


#29

Thanks for the update @sitloongs, I am keeping my fingers crossed :crossed_fingers: that the efforts will eventually yield success. And it is great to hear that the issue is getting some internal exposure. Thanks as usual for being thorough! :smile:


#30

Me too. Surf SOHO hardware revision 2. Firmware 7.1.1 Build 3102

I was on a WiFi connected computer remotely controlling an Ethernet connected Windows PC on the LAN. First problem was DNS failures on the Ethernet connected computer. A connection to an existing website on that machine continued to work, but new websites would not load. This seemed typical of the initial problem: existing connections were OK, but new ones failed. This pattern was also true on a third computer on the LAN. Some apps continued to run fine after the initial problem but others failed.

The remote control connection continued to work after the first problems.
Then new Wi-Fi connections to the SSIDs failed.

Then new Ethernet connections to the LAN failed. The Ethernet connected Windows PC that was being remotely controlled was connected to a smart switch. A new computer connected to the same switch could not get on the LAN.

From the Ethernet connected PC, access to the router by HTTPS and IP address failed. I don’t allow HTTP access.

From the Ethernet connected PC, I ran an “arp -a” command. The arp command saw the router and the smart switch that the PC was connected to.

From the Ethernet connected PC, I connected to the smart switch. This worked fine. The switch said all three connections were alive and well and running at gigabit speeds. The three are: to the router, from the Windows PC, and a third computer that never did get an IP address from the router. The switch showed packets coming and going on each of these 3 connections with no Ethernet transmission errors.

From the Ethernet connected PC, I tried to ping the router and it timed out. Normally, pings to the router from the LAN side work fine. Then another arp -a showed that the router was gone.

The wifi light on the router was initially blinking when the problems started, then for a while it was dark, then it was back on.

The router had rebooted itself.

I have a diagnostic report file, from just after the reboot, if anyone wants it.

There is nothing interesting in the Event log:
Oct 16 16:17:10 System: Time synchronization successful
Dec 31 19:02:47 System: Wi-Fi AP Normal Mode
Dec 31 19:02:27 WAN: timewarner connected ()
Dec 31 19:01:27 System: Started up (7.1.1 build 3102)


#31

@Michael234

Would you be able to open a support ticket so we can check further? We need more help from you. For now, we only have access to 1 device, belonging to 1 of the forum users here. We are trying our best to check on the reported issue, and we are still contacting other forum users here so that we can confirm the issue.


#32

Will do. A problem that only happens rarely is the worst to debug.


#33

:+1::+1::+1::+1: Appreciated the help :blush::blush::blush:


#34

More updates on my side: it has now been 10 days since I enabled the watchdog timer and, so far, I have had reboots but never had to intervene manually. Another 10 days like this and it will be the longest stretch without having to manually reboot the router.
At this point I am convinced there are 2 different issues at play in my case. One causing my crash/reboots every 4d10h ±6h, and another one with the same symptoms but that happens randomly. Still working with support on this, and the first issue may have been understood (fix under test). The second issue is still a mystery at this point.


#35

May I ask Peplink and other experts in this forum: do the SOHO and Balance routers have the ability to dump memory for analysis by Peplink? Microsoft Windows has that capability. It is certainly used to deal with challenging problems. My company’s software also does.

If a watchdog timer can reboot the router, then the watchdog timer should also be capable of dumping memory for analysis. In my opinion, it should always be collecting a minimum amount of significant information (ours does): call stacks for all threads (is there a thread deadlock?), status of ports, the status of buffers, significant recent activity (for example, status changes in ports), etc.

Having developed memory dumping and “crash” dumps (the minimum amount of significant data) for our software, I know implementation can be tricky. This is because the dump software must be careful not to use any parts of the code which may not be reliable. For example, perhaps writing to the event log might not be working anymore, necessitating another approach such as saving the information somewhere to be finally written as part of the restart.

I assume having a memory dump triggered by the watchdog timer to be analyzed, full or abbreviated, would help solve this and other problems.
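
To make the "save it somewhere, write it out as part of the restart" idea concrete, here is a minimal sketch in Python. It is only an illustration of the general pattern, not how Peplink firmware or any particular product does it. Every name in it (`build_record`, `save_crash_record`, `flush_pending_record`, the JSON format, the file name) is hypothetical, and a real watchdog handler on a router would be written at a much lower level than this.

```python
import json
import os
import tempfile
import time


def build_record(reason, port_status, recent_events, keep=20):
    """Collect a small, fixed set of facts (hypothetical fields)."""
    return {
        "time": time.time(),
        "reason": reason,
        "ports": dict(port_status),
        # Keep only the most recent events so the record stays small.
        "recent": list(recent_events)[-keep:],
    }


def save_crash_record(path, record):
    """Called just before the forced restart.

    Avoids the event log entirely: writes to a temp file, syncs it to
    storage, then renames it, so a half-written record is never seen.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def flush_pending_record(path, event_log):
    """Called early during startup, once logging is known to work again.

    Returns True if a saved record was found and moved into the log.
    """
    if not os.path.exists(path):
        return False
    with open(path) as f:
        record = json.load(f)
    event_log.append(record)
    os.remove(path)
    return True


if __name__ == "__main__":
    # Illustrative run using a throwaway directory and made-up values.
    path = os.path.join(tempfile.mkdtemp(), "crash.json")
    record = build_record("watchdog timeout", {"wan": "up"}, ["port status change"])
    save_crash_record(path, record)   # before the restart
    log = []
    flush_pending_record(path, log)   # after the restart
    print(log[0]["reason"])
```

The point of splitting it into two steps is that the part that runs while the system is in a bad state does as little as possible, and anything that depends on normal services (like the event log) waits until after the restart.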