November 12, 2018 InControl 2 Incident Update - 11/14/2018
We want to give all customers impacted by Monday’s issues an update on what went wrong. A further update will follow with a detailed plan to resolve these issues and prevent them from happening again.
What Issues Were Seen:
Accounts on the mars server cluster were inaccessible to users
IC2 mistakenly reconfigured VPN profiles it was not actively managing
Some PepVPN/SpeedFusion tunnels were taken offline
What Action Has Been Taken So Far:
Additional resources were added to the mars server cluster to allow it to keep up with the increased demand
The upgrade to version 2.8 was rolled back to the previous version 2.7.3
Version 2.7.3 does not include the problematic features causing the above issues
An audit of the upgrade failure has been started and preliminary root causes have been identified
Root Cause of PepVPN Issues:
A new feature of InControl 2.8 lets PepVPN profiles created locally on devices and in IC2 coexist. As a result, IC2 now needs to be aware of all profiles, including ones created locally on the device. A problem in the implementation of this hybrid profile mechanism caused the existing, locally configured PepVPN IDs on some routers to be updated to IC2’s naming convention.
In Monday’s implementation of this feature, IC2 performed the following steps:
1. Retrieve the device’s latest configuration, read the Local ID from the configuration file, and write that value to the database.
2. Read the Local ID from the database, build a configuration, and compare it with the device’s configuration.
3. If they differ, push the new configuration to the device.
Steps 1 and 2 got out of sync: configurations were built before the device’s original configuration could be read, and the locally configured SiteID could not be written to the IC2 database. This caused several routers with only locally configured tunnels to incorrectly receive an IC2-generated VPN profile.
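The ordering problem can be illustrated with a minimal Python sketch. This is not IC2's actual code: the `Device` class, the function names, the `IC2-<serial>` naming convention, and the fallback to an IC2-style ID when the database has no entry are all assumptions made for illustration only. It shows how building a configuration before the device's Local ID has been recorded can overwrite a locally configured ID, and how running the steps in order leaves it alone.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Device:
    serial: str
    local_id: str  # PepVPN Local ID as configured on the device itself


def ic2_default_id(serial: str) -> str:
    # Hypothetical IC2 naming convention, assumed for this sketch.
    return f"IC2-{serial}"


def build_local_id(db: Dict[str, str], serial: str) -> str:
    # Step 2: read the Local ID from the database and build a configuration.
    # Assumption: with no stored value, an IC2-style ID is generated instead.
    return db.get(serial, ic2_default_id(serial))


def push_if_different(device: Device, desired: str) -> None:
    # Step 3: push the built configuration only if it differs.
    if desired != device.local_id:
        device.local_id = desired


def sync_out_of_order(db: Dict[str, str], device: Device) -> str:
    # Steps 2 and 3 run before step 1 has written the device's ID.
    push_if_different(device, build_local_id(db, device.serial))
    return device.local_id


def sync_in_order(db: Dict[str, str], device: Device) -> str:
    # Step 1: record the device's own Local ID first.
    db[device.serial] = device.local_id
    push_if_different(device, build_local_id(db, device.serial))
    return device.local_id


if __name__ == "__main__":
    print(sync_out_of_order({}, Device("1234", "HQ-Router")))
    print(sync_in_order({}, Device("1234", "HQ-Router")))
```

Under these assumptions, the out-of-order run replaces the locally configured ID with an IC2-style one, while the in-order run finds no difference and pushes nothing.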
Root Cause of IC2 Account Unavailability:
The update to version 2.8 created an extreme memory load. This caused the IC2 cluster to become unusable until the system could restart and clear resources. Efforts to add resources to accommodate the load were not effective, and the system ultimately had to be restored to the previous version 2.7.3.
What’s Next:
We will postpone the release of InControl 2.8
Additional Earth instances will be created to allow customers to migrate to this more conservative upgrade schedule
We will populate a beta planet with real-world deployments and configurations
We will invite partners to create networks inside of this planet for better testing and feedback
We will provide a second report to detail our strategy to prevent this from happening again
We will communicate a clear rollout schedule to customers on all IC2 planets once 2.8 has passed our updated live testing
This is unacceptable. How can you roll out untested patches automatically to our customers without notifying us? Why is it that I had to do a search on your forums to find out that the problem is with you and not something we should have spent our own time troubleshooting?
Unfortunately I don’t have an ETA at the moment, but the engineers are working on this - they want to ensure that any fix doesn’t have any other implications, so please bear with us. There is a work-around posted above, which should bring your PepVPN / SpeedFusion tunnels back on-line, whilst we work on a permanent fix.
We are also working on the performance issue, which is causing problems for users trying to log in to InControl or, once logged in, being forced off. Again, we are working on this, but at the moment we don’t have an ETA for the fix.
I changed the Local Device ID for PepVPN back, and all 3 units reverted to the default again. Would removing InControl management until the issue is resolved help keep the IDs from being changed while you fix the issue?
Changing the remote ID in all of the actual PepVPN profiles is the way to go. If you change the Local Device ID back to what it was, it seems to revert.
Unacceptable, down twice in a day!!! I agree with jflanigan: how can you roll out untested patches automatically to our customers without notifying us?
Any chance you could post a screenshot of what to change and where? I only have two of these in use and haven’t had to touch them in over a year. I don’t want to make the situation worse.