[Incident Update] PepVPN / SpeedFusion and InControl (MARS) issues

Steve.Taylor · November 12, 2018, 3:14am

November 12, 2018 InControl 2 Incident Update - 11/14/2018

We want to give all customers affected by Monday’s issues an update on what went wrong. We will follow up with a detailed plan to resolve these issues and prevent them from happening again.

What Issues Were Seen:

Accounts on the mars server cluster were inaccessible to users
IC2 mistakenly reconfigured VPN profiles it was not actively managing
Some PepVPN/SpeedFusion tunnels were taken offline

What Action Has Been Taken So Far:

Additional resources were added to the mars server cluster to allow it to keep up with the increased demand
The upgrade to version 2.8 was rolled back to the previous version 2.7.3
- Version 2.7.3 does not include the problematic features causing the above issues
An audit of the upgrade failure has been started and preliminary root causes have been identified

Root Cause of PepVPN Issues:

A new feature of InControl 2.8 lets PepVPN profiles created locally on devices coexist with profiles created in IC2. As a result, IC2 now needs to be aware of all profiles, including ones created locally on the device. A problem in the implementation of this hybrid profile mechanism caused the existing locally configured PepVPN IDs on some routers to be updated to IC2’s naming convention format.

In Monday’s implementation of this feature, IC2 performed up to three steps:

1. Retrieve the device’s latest configuration, read the Local ID from the configuration file, and write that value to the database.
2. Read the Local ID from the database, build a configuration, and compare it with the device’s configuration.
3. If they are different, push the new configuration to the device.

A problem occurred: steps 1 and 2 got out of sync. Configurations were built before the device’s original configuration could be read, and the locally configured SiteID could not be written to the IC2 database. This caused several routers with only locally configured tunnels to incorrectly receive an IC2-generated VPN profile.
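The ordering problem described above can be illustrated with a minimal Python sketch. This is not IC2's actual code: the `Device` class, the function names, the in-memory `db` dictionary, and the `ic2-<serial>` naming format are all hypothetical, and the fallback to a generated ID when the database has no entry is an assumption inferred from the outcome described in this report.

```python
from dataclasses import dataclass


@dataclass
class Device:
    serial: str
    local_id: str  # PepVPN Local ID currently configured on the router


def ic2_generated_id(serial: str) -> str:
    # Hypothetical stand-in for IC2's naming convention (real format not given).
    return f"ic2-{serial}"


def step1_read_device(device: Device, db: dict) -> None:
    """Step 1: read the Local ID from the device and write it to the database."""
    db[device.serial] = device.local_id


def step2_build_config(device: Device, db: dict) -> str:
    """Step 2: build the expected Local ID from the database.

    Assumption: with no stored Local ID, a generated ID is used instead.
    """
    return db.get(device.serial) or ic2_generated_id(device.serial)


def step3_push_if_different(device: Device, built_id: str) -> bool:
    """Step 3: push the built configuration only if it differs."""
    if built_id != device.local_id:
        device.local_id = built_id
        return True
    return False


def sync_in_order(device: Device, db: dict) -> bool:
    step1_read_device(device, db)
    built = step2_build_config(device, db)
    return step3_push_if_different(device, built)


def sync_out_of_order(device: Device, db: dict) -> bool:
    # Step 2 runs before step 1 has written anything to the database,
    # so the build never sees the locally configured ID.
    built = step2_build_config(device, db)
    return step3_push_if_different(device, built)


if __name__ == "__main__":
    ok = Device(serial="A1", local_id="HQ-Site")
    print(sync_in_order(ok, {}), ok.local_id)        # False HQ-Site

    bad = Device(serial="B2", local_id="Branch-7")
    print(sync_out_of_order(bad, {}), bad.local_id)  # True ic2-B2
```

In the in-order case the built configuration matches the device and nothing is pushed. In the out-of-order case the locally configured ID is replaced by a generated one, which breaks any peer that still expects the old ID as its remote ID.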

Root Cause of IC2 Account Unavailability:

The update to version 2.8 created an extreme memory load. This caused the IC2 cluster to become unusable until the system could restart and clear resources. Efforts to add resources to accommodate the load were not effective, and the system ultimately had to be restored to the previous version, 2.7.3.

What’s Next:

We will postpone the release of InControl 2.8
Additional Earth instances will be created to allow customers to migrate to this more conservative upgrade schedule
We will populate a beta planet with real-world deployments and configurations
- We will invite partners to create networks inside of this planet for better testing and feedback
We will provide a second report to detail our strategy to prevent this from happening again
We will communicate a clear rollout schedule to customers on all IC2 planets once 2.8 has passed our updated live testing

Steve.Taylor · November 12, 2018, 3:55am

The engineers are still working on the issue - we will post here once we have an update.

Thanks,

Steve

Steve.Taylor · November 12, 2018, 4:46am

Brief update - the engineers are still working on this issue - we will keep you informed here.

Thanks,

Steve

jflanigan · November 12, 2018, 5:47am

We have down customers…do you have an ETA?

This is unacceptable. How can you roll out untested patches automatically to our customers without notifying us? Why is it that I had to do a search on your forums to find out that the problem is with you and not something we should have spent our own time troubleshooting?

Steve.Taylor · November 12, 2018, 5:54am

Hi @jflanigan,

Unfortunately I don’t have an ETA at the moment, but the engineers are working on this - they want to ensure that any fix doesn’t have any other implications, so please bear with us. There is a work-around posted above, which should bring your PepVPN / SpeedFusion tunnels back on-line, whilst we work on a permanent fix.

We are also working on the performance issue, which is causing problems for users trying to log in to InControl or, once logged in, being forced off. We don’t have an ETA for that fix at the moment either.

Thanks,

Steve

jflanigan · November 12, 2018, 5:58am

Is the Local Device ID the Profile name of the PepVPN connection or the Router Name?

jflanigan · November 12, 2018, 6:03am

We are receiving this message in InControl as well.

Steve.Taylor · November 12, 2018, 6:09am

Hi @jflanigan,

Yes, that is the IC2 performance issue - you should be able to keep refreshing and it will let you in.

It is the Local Device ID for the Profile Name that needs to be checked / amended.

Thanks,

Steve

itg_PAM · November 12, 2018, 6:10am

I changed back the Local Device ID for PepVPN and all 3 units reverted to default again. Would removing InControl management until the issue is resolved help keep the IDs from being changed while you fix the issue?

sw00t · November 12, 2018, 6:11am

My Monday in a nutshell:

Rob_White · November 12, 2018, 6:14am

We fix it and everything breaks again. Should we just break the InControl connection? We have nested tunnels and they keep breaking after we fix it!

itg_PAM · November 12, 2018, 6:16am

Same here Rob. I have disabled incontrol mgmt on 1 box so far to test.

sw00t · November 12, 2018, 6:26am

Changing the remote ID in all of the actual PepVPN profiles is the way to go. If you change the name of the Local Device ID back to what it was, it seems to revert.

Julien37 · November 12, 2018, 6:54am

Unacceptable, down twice in a day!!! I agree with jflanigan: How can you roll out untested patches automatically to our customers without notifying us?

JTfromIT · November 12, 2018, 6:58am

I agree with the others. This is absolutely unacceptable!

network_admins · November 12, 2018, 7:00am

Any chance you could post a screenshot of what to change and where? I only have two of these in use and haven’t had to touch them in over a year. I don’t want to make the situation worse.

Thanks.

JTfromIT · November 12, 2018, 7:03am

Go to your PepVPN/SpeedFusion settings and check the local ID. Make sure it matches the “remote ID” on the other device.

Walter · November 12, 2018, 7:05am

Will it make any difference to move the Organization to the Earth InControl Environment and what is the process to get this done?

Heriberto_Garcia · November 12, 2018, 7:15am

Hi Steve, is disconnecting devices from IC2 an option?

We are aware that not all customers can do this (not for mobile units).

Your prompt response is appreciated

Heriberto