[Incident Update] PepVPN / SpeedFusion and InControl (MARS) issues

zegor_mjol · November 12, 2018, 5:13pm

Almost all our devices are up again through IC2. The exceptions are two Balance 380 HW5s running FW 6.3.4 build 3613 (the latest for that HW model). Their SpeedFusion connections did not re-establish and remote administration is unavailable (“502 Bad Gateway”). We were able to revive one of them by cycling the power to force a reboot, after which it came back with full functionality.

Unfortunately, the other one is not on a networked power supply so the power cycle option is not available. It’ll be a while until somebody can get to it.

Any suggestions?

(Fortunately it is not a mission critical unit, but the fact that IC2 could effectively kill it like this without a revival path is troubling.)

sitloongs · November 12, 2018, 5:16pm

@zegor_mjol

Would you please PM me the serial number of the one remaining device that still has the issue? We will check on it immediately.

zegor_mjol · November 12, 2018, 5:43pm

Done.

I appreciate the fact that you pay such close attention to the well-being of even individual cases.

Z

sitloongs · November 12, 2018, 6:15pm

@zegor_mjol

We have checked the IC2 device logs and confirmed that the issue is not related to the known issue reported in this post. I will PM you the details.

micromarc · November 12, 2018, 6:26pm

I have a 305 with two PepVPN connections that still say Starting. What am I missing?

zegor_mjol · November 12, 2018, 6:36pm

The check and the reply are appreciated.

sitloongs · November 12, 2018, 6:39pm

@micromarc

Would you PM me the devices’ serial numbers? We will check on this.

micromarc · November 12, 2018, 6:41pm

@sitloongs I just sent an email to the support address with the info. Ticket #788979.

sitloongs · November 12, 2018, 6:43pm

@micromarc

Thank you, that is appreciated, as a support ticket is the proper way to work on the issue. Checking on this.

JamesPep · November 12, 2018, 7:16pm

Hi Marc,

I’ve checked your devices. It looks like the site ID for one of them wasn’t updated during our emergency scan & restore last night. It’s been reset to its former value and your tunnels are back up.

micromarc · November 12, 2018, 7:31pm

Great! Very much appreciated.

cyclops · November 13, 2018, 7:20am

I have a couple tunnels that didn’t return. Will send to support.

tmaVoIP · November 13, 2018, 1:39pm

Please have someone at Peplink post a summary of exactly what went wrong during this issue, along with some assurance as to why it will not happen again. We all understand that STUFF happens, but we need to explain to our customers WHY this happened and assure them that it hopefully will not happen again. Peplink is Great! Let’s keep it that way.

Keith · November 13, 2018, 1:47pm

We are working on that summary. Please give us a day or two to collect accurate information. We will provide transparency to our customers. It is one of our core beliefs. Thanks.

sandor · November 14, 2018, 1:07am

This issue has caused massive brand damage for us and despite being a long time Peplink partner, we currently have zero certainty that this type of issue will not occur again. The fact that Peplink’s own systems proactively altered the configuration of devices that were explicitly not being configuration controlled by InControl is a massive breach of trust. Coupled with that, there was no logging to show which exact parameters were changed, meaning we had to look for a needle in a haystack to bring our customers back online. We did manage to find it, before you posted your own details, but the customer damage was already done. It is clear to us there is no stress testing of updates before being deployed and we are now left having to ensure none of our devices communicate with InControl for the foreseeable future. I have no option but to recommend we stop deploying Peplink devices to customer sites.

Keith · November 14, 2018, 6:11pm

@sandor, your comments are well noted and I have read them a couple of times over. We know this event is unacceptable to the impacted customers. We know the damage has been done. We know it’s very difficult to regain the lost trust. The frustration is beyond what can be described in words.

But what can we do? We must learn from the mistakes, get them fixed, improve our systems, service our customers better in future and move on.

In a short while, we are going to post the first part of our incident report - for what has happened to the system and what we fixed on the day.

The second part (which our team is still working on) will address improvements to the systems to prevent, with certainty, this type of issue from occurring again. By posting both parts publicly, we can receive peer review and scrutiny from the entire community.

Travis · November 14, 2018, 10:01pm

The original post has been updated to provide some initial findings. More information to come.

James_Webster · November 15, 2018, 10:39am

As part of our internal discussions regarding this incident, a feature has been suggested that would be added in upcoming firmware, allowing users to set a device’s InControl access to read-only.

I have created this as a feature request for customers and partners to discuss separately from this post, and would appreciate any feedback on the idea in this forum post:

https://goo.gl/qVymFC

Travis · November 16, 2018, 11:24am

We have posted a second update with our action plan in another post - [Incident Update] IC2/Mars - Actions and Improvements