[Incident Update] PepVPN / SpeedFusion and InControl (MARS) issues


#1

November 12, 2018 InControl 2 Incident Update - 11/14/2018

We want to give all customers impacted by Monday’s issues an update on what went wrong. We will follow up with a detailed plan to resolve these issues and prevent them from happening again.

What Issues Were Seen:

  • Accounts on the mars server cluster were inaccessible to users
  • IC2 mistakenly reconfigured VPN profiles it was not actively managing
  • Some PepVPN/SpeedFusion tunnels were taken offline

What Action Has Been Taken So Far:

  • Additional resources were added to the mars server cluster to allow it to keep up with the increased demand
  • The upgrade to version 2.8 was rolled back to the previous version 2.7.3
    • Version 2.7.3 does not include the problematic features causing the above issues
  • An audit of the upgrade failure has been started and preliminary root causes have been identified

Root Cause of PepVPN Issues:

A new feature of InControl 2.8 lets PepVPN profiles created locally on devices and in IC2 coexist. As a result, IC2 now needs to be aware of all profiles, including ones created locally on the device. A problem in the implementation of this hybrid profile mechanism caused the existing locally configured PepVPN IDs on some routers to be updated to IC2’s naming convention format.

In Monday’s implementation of this feature, IC2 does three things:

  1. Retrieves the device’s latest configuration, reads the Local ID from the configuration file, and writes that value to the database.
  2. Reads the Local ID from the database, builds a configuration, and compares it with the device’s configuration.
  3. If they are different, IC2 will push the new configuration to the device.

A problem occurred: steps 1 and 2 got out of sync. Configurations were built before the device’s original configuration could be read, and the locally configured SiteID could not be written to the IC2 database. This caused several routers with only locally configured tunnels to incorrectly receive an IC2-generated VPN profile.
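
The ordering problem described above can be illustrated with a minimal sketch. This is not IC2's actual code: the `Device` and `Controller` classes, the in-memory `db` dictionary, and the `IC2-<serial>` fallback name are all assumptions made purely for illustration (the real naming convention is not given in this update).

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    serial: str
    local_id: str  # PepVPN Local ID as configured on the device itself


@dataclass
class Controller:
    # Hypothetical stand-in for the IC2 database: serial -> stored Local ID
    db: dict = field(default_factory=dict)

    def read_device(self, dev: Device) -> None:
        """Step 1: read the Local ID from the device and store it."""
        self.db[dev.serial] = dev.local_id

    def build_local_id(self, dev: Device) -> str:
        """Step 2: build the Local ID from the database.

        If nothing has been stored yet, fall back to a generated name.
        The "IC2-<serial>" format is an assumption for this sketch.
        """
        return self.db.get(dev.serial, f"IC2-{dev.serial}")

    def sync(self, dev: Device) -> bool:
        """Step 3: push the built value if it differs; return True if pushed."""
        built = self.build_local_id(dev)
        if built != dev.local_id:
            dev.local_id = built  # the device's local setting is overwritten
            return True
        return False


# Intended order: step 1 completes before steps 2 and 3.
ok = Device("1234", "HQ-Router")
controller = Controller()
controller.read_device(ok)
pushed_ok = controller.sync(ok)  # nothing to push, Local ID is preserved

# Out-of-sync order: steps 2 and 3 run before step 1 has stored anything.
bad = Device("5678", "Branch-Router")
controller_2 = Controller()
pushed_bad = controller_2.sync(bad)  # generated name replaces the local one
```

In the second case the database has no record of the locally configured value, so the generated name wins and is pushed to the device, which is the behavior reported in this incident.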

Root Cause of IC2 Account Unavailability:

The update to version 2.8 created an extreme memory load. This caused the IC2 cluster to become unusable until the system could restart and clear resources. Efforts to add resources to accommodate the load were not effective, and the system ultimately had to be restored to the previous version, 2.7.3.

What’s Next:

  • We will postpone the release of InControl 2.8
  • Additional Earth instances will be created to allow customers to migrate to this more conservative upgrade schedule
  • We will populate a beta planet with real-world deployments and configurations
    • We will invite partners to create networks inside of this planet for better testing and feedback
  • We will provide a second report to detail our strategy to prevent this from happening again
  • We will communicate a clear rollout schedule to customers on all IC2 planets once 2.8 has passed our updated live testing

All tunnels dropped after incontrol applied config
#2

The engineers are still working on the issue - we will post here once we have an update.

Thanks,

Steve


#3

Brief update - the engineers are still working on this issue - we will keep you informed here.

Thanks,

Steve


#4

We have down customers…do you have an ETA?

This is unacceptable. How can you roll out untested patches automatically to our customers without notifying us? Why is it that I had to do a search on your forums to find out that the problem is with you and not something we should have spent our own time troubleshooting?


#5

Hi @jflanigan,

Unfortunately I don’t have an ETA at the moment, but the engineers are working on this - they want to ensure that any fix doesn’t have any other implications, so please bear with us. There is a work-around posted above, which should bring your PepVPN / SpeedFusion tunnels back on-line, whilst we work on a permanent fix.

We are also working on the Performance issue, which is causing problems for users trying to login to InControl, or once logged in, being forced off. Again, we are working on this, but at the moment, we don’t have an ETA for the fix.

Thanks,

Steve


#6

Is the Local Device ID the Profile name of the PepVPN connection or the Router Name?


#7

image

We are receiving this message in InControl as well.


#8

Hi @jflanigan,

Yes, that is the IC2 performance issue - you should be able to keep refreshing and it will let you in.

It is the Local Device ID for the Profile Name that needs to be checked / amended.

Thanks,

Steve


#9

I changed back the Local Device ID for PepVPN and all 3 units reverted back to default again. Would removing incontrol management until the issue is resolved help keep the IDs from being changed while you fix the issue?


#10

My monday in a nutshell:


#11

We fix it and everything breaks again. Should we just break the in-control connection? We have nested tunnels and they keep breaking after we fix it!


#12

Same here Rob. I have disabled incontrol mgmt on 1 box so far to test.


#13

Changing the Remote ID in all of the actual PepVPN profiles is the way to go. If you change the Local Device ID back to what it was, it seems to revert.


#14

#15

Unacceptable, down twice in a day !!! I agree with jflanigan : How can you roll out untested patches automatically to our customers without notifying us?


#16

I agree with the others. This is absolutely unacceptable!


#17

Any chance you could post a screenshot of what to change and where? I only have two of these in use and haven’t had to touch them in over a year. I don’t want to make the situation worse.

Thanks.


#18

Go to your PepVPN/SpeedFusion settings and check the Local ID. Make sure it matches the “Remote ID” configured on the other device.
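
If you have more than a couple of units, the same check can be done offline once you have noted down the IDs. The sketch below is only an illustration: the function name, the dictionary layout, and the sample IDs are made up, and it does not talk to the routers or to InControl.

```python
from typing import Optional


def find_mismatches(
    local_ids: dict,
    profiles: dict,
) -> list:
    """Return (device, peer, expected Remote ID, peer's actual Local ID) tuples.

    local_ids: device name -> the Local ID currently set on that device
    profiles:  device name -> {peer name: Remote ID entered in the profile}
    """
    mismatches = []
    for device, peers in profiles.items():
        for peer, expected_remote_id in peers.items():
            actual: Optional[str] = local_ids.get(peer)
            if actual != expected_remote_id:
                mismatches.append((device, peer, expected_remote_id, actual))
    return mismatches


# Hypothetical example: the branch unit's Local ID was changed underneath it.
local_ids = {"hq": "HQ-Router", "branch": "IC2-5678"}
profiles = {
    "hq": {"branch": "Branch-Router"},
    "branch": {"hq": "HQ-Router"},
}
mismatches = find_mismatches(local_ids, profiles)
for device, peer, expected, actual in mismatches:
    print(f"{device}: profile expects {peer} = {expected!r}, but {peer} has {actual!r}")
```

Each reported pair is a tunnel where either the Remote ID in the profile or the Local ID on the peer needs to be changed so the two agree.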


#19

Will it make any difference to move the Organization to the Earth InControl Environment and what is the process to get this done?


#20

Hi Steve, is disconnecting devices from IC2 an option?

  • We are aware that not all customers can do this (it is not an option for mobile units).

Your prompt response is appreciated

Heriberto