November 12, 2018 InControl 2 Incident Update - 11/14/2018
We want to give all customers impacted by Monday’s issues an update on what went wrong. A further update will follow with a detailed plan to resolve these issues and prevent them from happening again.
What Issues Were Seen:
- Accounts on the mars server cluster were inaccessible to users
- IC2 mistakenly reconfigured VPN profiles it was not actively managing
- Some PepVPN/SpeedFusion tunnels were taken offline
What Action Has Been Taken So Far:
- Additional resources were added to the mars server cluster to allow it to keep up with the increased demand
- The upgrade to version 2.8 was rolled back to the previous version 2.7.3
- Version 2.7.3 does not include the problematic features causing the above issues
- An audit of the upgrade failure has been started and preliminary root causes have been identified
Root Cause of PepVPN Issues:
A new feature of InControl 2.8 lets PepVPN profiles created locally on devices coexist with profiles created in IC2. As a result, IC2 now needs to be aware of all profiles, including ones created locally on the device. A problem in the implementation of this hybrid profile mechanism caused the existing locally configured PepVPN IDs on some routers to be updated to IC2’s naming convention.
In Monday’s implementation of this feature, IC2 performed three steps:
- Retrieve the device’s latest configuration, read the Local ID from the configuration file, and write that value to the database.
- Read the Local ID from the database, build a configuration, and compare it with the device’s configuration.
- If the two configurations differ, push the new configuration to the device.
The problem: steps 1 and 2 got out of sync. Configurations were built before the device’s original configuration could be read, which meant the locally configured SiteID had not been written to the IC2 database when the comparison ran. As a result, several routers with only locally configured tunnels incorrectly received an IC2-generated VPN profile.
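The ordering problem can be illustrated with a minimal sketch. This is not IC2's actual code: the `Device` and `Controller` classes, the method names, and the `generated-` naming fallback are all hypothetical, chosen only to show how building a configuration before the device has been read can overwrite a locally configured ID, and how refusing to build without a stored value avoids it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Device:
    """Hypothetical stand-in for a router and its on-device PepVPN ID."""
    serial: str
    local_id: str


@dataclass
class Controller:
    """Hypothetical stand-in for IC2; an illustration, not the real system."""
    db: Dict[str, str] = field(default_factory=dict)  # serial -> Local ID

    def read_device(self, device: Device) -> None:
        # Step 1: read the Local ID from the device and store it.
        self.db[device.serial] = device.local_id

    def build_unguarded(self, device: Device) -> str:
        # Step 2 (assumed failure mode): if step 1 has not run yet,
        # fall back to a controller-generated name.
        return self.db.get(device.serial, f"generated-{device.serial}")

    def build_guarded(self, device: Device) -> Optional[str]:
        # Step 2 (one possible guard): build nothing until step 1 has run.
        return self.db.get(device.serial)

    def push_if_different(self, device: Device, wanted: Optional[str]) -> bool:
        # Step 3: push only when a built value exists and differs.
        if wanted is None or wanted == device.local_id:
            return False
        device.local_id = wanted
        return True


def demo() -> Dict[str, object]:
    # Out of order, unguarded: step 2 runs before step 1.
    router_a = Device("A1", "branch-office")
    ic2_a = Controller()
    overwritten = ic2_a.push_if_different(router_a, ic2_a.build_unguarded(router_a))

    # Out of order, guarded: nothing is pushed until the device has been read.
    router_b = Device("B2", "branch-office")
    ic2_b = Controller()
    early_push = ic2_b.push_if_different(router_b, ic2_b.build_guarded(router_b))
    ic2_b.read_device(router_b)
    late_push = ic2_b.push_if_different(router_b, ic2_b.build_guarded(router_b))

    return {
        "unguarded_pushed": overwritten,
        "unguarded_id": router_a.local_id,
        "guarded_pushed": early_push or late_push,
        "guarded_id": router_b.local_id,
    }


if __name__ == "__main__":
    print(demo())
```

In the unguarded case the router's locally configured ID is replaced; in the guarded case it is left alone regardless of which step runs first. Other safeguards, such as serializing the two steps per device, would work as well.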
Root Cause of IC2 Account Unavailability:
The update to version 2.8 created an extreme memory load. This caused the IC2 cluster to become unusable until the system could restart and free resources. Efforts to add resources to accommodate the load were not effective, and the system ultimately had to be restored to the previous version, 2.7.3.
What Happens Next:
- We will postpone the release of InControl 2.8
- Additional Earth instances will be created to allow customers to migrate to its more conservative upgrade schedule
- We will populate a beta planet with real-world deployments and configurations
- We will invite partners to create networks inside of this planet for better testing and feedback
- We will provide a second report to detail our strategy to prevent this from happening again
- We will communicate a clear rollout schedule to customers on all IC2 planets once 2.8 has passed our updated live testing