[Incident Update] PepVPN / SpeedFusion and InControl (MARS) issues


#101

Almost all our devices are up again through IC2. The exceptions are two Balance 380 HW5s running FW 6.3.4 build 3613 (the latest for that HW model). Their SpeedFusion connections did not re-establish themselves, and remote administration is unavailable (“502 Bad Gateway”). We were able to revive one of them by cycling the power to force a reboot; it came back up with full functionality.

Unfortunately, the other one is not on a networked power supply so the power cycle option is not available. It’ll be a while until somebody can get to it.

Any suggestions?

(Fortunately it is not a mission critical unit, but the fact that IC2 could effectively kill it like this without a revival path is troubling.)


#102

@zegor_mjol

Would you please PM me the serial number of the one remaining device that is still having the issue? We will check on it immediately.


#103

Done.

I appreciate the fact that you pay such close attention to the well-being of even individual cases.

Z


#104

@zegor_mjol

We have checked the IC2 device logs and confirmed that the issue is not related to the known issue reported in this post. I will PM you the details.


#105

I have a 305 with two PepVPN connections that still say Starting. What am I missing?


#106

The check and the reply are appreciated.


#107

@micromarc

Would you PM me the devices’ serial numbers? We will check on this.


#109

@sitloongs I just sent an email to the support address with the info. Ticket #788979.


#110

@micromarc

Thank you. That is appreciated, as this is the proper way to work on the issue. Checking on this.


#111

Hi Marc,

I’ve checked your devices. It looks like the site ID for one of them wasn’t updated during our emergency scan & restore last night. It has been reset to its former value and your tunnels are back up.


#112

Great! Very much appreciated.


#113

#114

I have a couple of tunnels that didn’t return. Will send the details to support.


#115

Please have someone at Peplink post a summary of exactly what went wrong during this incident, along with some assurance as to why it will not happen again. We all understand that STUFF happens, but we need to explain to our customers WHY this happened and assure them that it hopefully will not happen again. Peplink is Great! Let’s keep it that way. :slight_smile:


#116

We are working on that summary. Please give us a day or two to collect accurate information. We will be transparent with our customers; it is one of our core beliefs. Thanks.


#117

This issue has caused massive brand damage for us and, despite being a long-time Peplink partner, we currently have zero certainty that this type of issue will not occur again. The fact that Peplink’s own systems proactively altered the configuration of devices that were explicitly not under InControl configuration control is a massive breach of trust. Coupled with that, there was no logging to show exactly which parameters were changed, meaning we had to look for a needle in a haystack to bring our customers back online. We did manage to find it before you posted your own details, but the customer damage was already done. It is clear to us that there is no stress testing of updates before they are deployed, and we are now left having to ensure none of our devices communicate with InControl for the foreseeable future. I have no option but to recommend we stop deploying Peplink devices to customer sites.


#118

@sandor, your comments are well noted, and I have read them over a couple of times. We know this event is unacceptable to the impacted customers. We know the damage has been done. We know it is very difficult to regain lost trust. The frustration is beyond what can be described in words.

But what can we do? We must learn from the mistakes, get them fixed, improve our systems, serve our customers better in the future, and move on.

In a short while, we are going to post the first part of our incident report, covering what happened to the system and what we fixed on the day.

The second part, which our team is still working on, will address improvements to the systems intended to prevent, with certainty, this type of issue from occurring again. By posting both parts publicly, we can receive peer review and scrutiny from the entire community.


#119

The original post has been updated to provide some initial findings. More information to come.


#120

As part of our internal discussions regarding this incident, a feature has been suggested: an option in upcoming firmware that would allow users to set a device’s InControl access to read-only.

I have created this as a feature request for customers and partners to discuss separately from this post, and would appreciate any feedback on this idea in this forum post:


#121

We have posted a second update with our action plan in another post - [Incident Update] IC2/Mars - Actions and Improvements