InControl Lost connection to devices

thoerr · April 11, 2017, 1:14pm

InControl lost connection to all 5 of our routers sometime between 14:18 and 14:24 CST. It stopped pulling bandwidth reports during that window. The routers still show as connected, but we cannot access any device via remote web admin. We also tried adding a Balance 380, but it never shows as coming online. PepVPN/SpeedFusion configurations are also not being pushed to the routers when created. Is there a problem with your InControl network?

ish_your1point.com · April 11, 2017, 1:36pm

Similar issue.
I added a device over an hour ago and it is still not online in IC2. I am able to remote into the device via WiFi, and it is connected to cellular. Everything else is working correctly.

thoerr · April 11, 2017, 7:51pm

This resolved itself. All is back to normal.

Michael · April 11, 2017, 9:34pm

The “mars” system started to behave abnormally at 19:25 GMT+0. The problem was identified and resolved at 23:20 GMT+0. Some reporting and GPS data from that period may be lost. We are sorry for any inconvenience caused.

Here is a detailed explanation of the incident. Some background information first. The InControl system is divided into multiple subsystems. An organization can reside on only one of them. Most customers’ organizations reside on the subsystems “earth” and “mars”. You can find out which subsystem your organization resides on by looking at the host name in the URL.
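For anyone who wants to check this in a script, here is a minimal sketch. It assumes the subsystem name is the first label of the host name, which the post above does not state explicitly, and the URL used is a made-up placeholder rather than a real InControl address.

```python
from urllib.parse import urlparse


def subsystem_from_url(url: str) -> str:
    """Return the first label of the URL's host name.

    Assumption: the subsystem name (e.g. "earth" or "mars") is the
    first dot-separated label of the host name.
    """
    host = urlparse(url).hostname or ""
    return host.split(".")[0]


# Hypothetical URL, for illustration only.
print(subsystem_from_url("https://mars.example.com/o/abc123"))
```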

The “mars” subsystem was upgraded to 2.4.0 on Monday at 02:00 GMT+0. One of the back-end changes was to use a Redis memory cache for device communication. Live and reporting data are also buffered in it.

At 16:47 GMT+0, one of the server processes abnormally stopped reading and processing raw reporting data from the Redis cache. A memory queue for buffering reporting data started to pile up, and the cache’s free memory dropped continuously. At 19:25, the cache’s memory was finally used up. As the cache is used for live communication with devices, the mars system started to operate abnormally. The problem was escalated to the development team at 21:15. By 23:20, the development team had identified the cause and fixed the problem.

The following preventive measures will be applied within 24 hours to prevent the same problem from happening again:

A monitor on all Redis cache queues will be added, so that we can quickly identify any queue that grows abnormally.
A health check on live data communication will be added, so that we can identify any live communication failures or delays.
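
As a rough illustration of what these two checks could look like, here is a minimal sketch. It is not the actual InControl implementation. The function names, thresholds, and device IDs are all hypothetical. It assumes queue lengths are sampled periodically (for a Redis list, the LLEN command returns the length) and that a last-seen timestamp is tracked per device.

```python
from typing import Dict, List, Optional, Sequence


def queue_alert(lengths: Sequence[int], max_length: int = 10_000,
                max_rising_samples: int = 5) -> Optional[str]:
    """Classify a series of queue-length samples (oldest first).

    Returns "over_limit" if the newest sample exceeds max_length,
    "growing" if the newest samples rose max_rising_samples times
    in a row, and None otherwise. Thresholds are illustrative.
    """
    if not lengths:
        return None
    if lengths[-1] > max_length:
        return "over_limit"
    # Count consecutive increases, walking back from the newest sample.
    rises = 0
    for i in range(len(lengths) - 1, 0, -1):
        if lengths[i] > lengths[i - 1]:
            rises += 1
        else:
            break
    return "growing" if rises >= max_rising_samples else None


def stale_devices(last_seen: Dict[str, float], now: float,
                  max_silence: float = 120.0) -> List[str]:
    """Return device IDs not heard from within max_silence seconds."""
    return sorted(dev for dev, ts in last_seen.items()
                  if now - ts > max_silence)


# A consumer that stops reading shows up as a steadily rising queue.
print(queue_alert([100, 180, 260, 400, 650, 900]))
print(stale_devices({"router-a": 1000.0, "router-b": 1290.0}, now=1300.0))
```

Checking for a rising trend as well as an absolute limit matters here: in the incident above, the queue started growing at 16:47 but memory only ran out at 19:25, so a trend check would have fired well before the cache was exhausted.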