Unable to access remote web admin on 1400 devices and none show GPS


#1

I now have 1400 devices that I am unable to access remotely via the remote web admin.
None of the devices are showing a GPS location.
None of the devices are showing connected clients.


#2

I have lost connection to over 100 devices. They show as online, but I have no control over them at all.


#3

It now shows the devices as online, but when we try to manage one of them it shows as disconnected even though it is still listed as online.


#4

The “mars” system started to behave abnormally at 19:25 GMT+0. The problem was identified and resolved by 23:20 GMT+0. Some reporting and GPS data from that period may be lost. We are sorry for any inconvenience caused.

Here is a detailed explanation of the incident, starting with some background. The InControl system is divided into multiple subsystems, and an organization resides on exactly one of them. Most customers’ organizations reside on the subsystems “earth” and “mars”. You can find out which subsystem your organization resides on by looking at the host name in the URL.
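For anyone who wants to check this from a script, here is a minimal sketch. It assumes the subsystem name is the first label of the host name; that layout and the example URL are assumptions for illustration only, not confirmed details of InControl.

```python
from urllib.parse import urlparse


def subsystem_from_url(url: str) -> str:
    """Return the first label of the URL's host name.

    Assumption: the subsystem name ("earth", "mars", ...) is the
    left-most label of the host name. Adjust if your URL differs.
    """
    host = urlparse(url).hostname or ""
    return host.split(".")[0]


# Hypothetical URL, used only to illustrate the idea.
print(subsystem_from_url("https://mars.example.com/o/12345"))
```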

The “mars” subsystem was upgraded to 2.4.0 on Monday at 02:00 GMT+0. One of the back-end changes was to use a Redis in-memory cache for device communication. Live and reporting data are also buffered in it.

At 16:47 GMT+0, one of the server processes abnormally stopped reading and processing raw reporting data from the Redis cache. An in-memory queue that buffers reporting data started to pile up, and the cache’s free memory dropped continuously. At 19:25, the cache’s memory was finally used up. Because the cache is also used for live communication with devices, the “mars” system started to operate abnormally. The problem was escalated to the development team at 21:15. By 23:20, the development team had identified the cause and fixed the problem.
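To put the timeline in numbers, the short sketch below computes the intervals between the reported times. It assumes all four timestamps fall on the same day; the dictionary keys are labels made up for this example.

```python
from datetime import datetime

# Times copied from the incident report (GMT+0).
# Assumption: all of them are on the same calendar day.
TIMELINE = {
    "consumer_stalled": "16:47",
    "cache_memory_exhausted": "19:25",
    "escalated_to_development": "21:15",
    "resolved": "23:20",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Derive the three intervals that matter for this incident."""
    return {
        # Time during which the queue was growing but devices still worked.
        "warning_window_min": minutes_between(
            timeline["consumer_stalled"], timeline["cache_memory_exhausted"]
        ),
        # Time from visible impact until the developers were involved.
        "time_to_escalation_min": minutes_between(
            timeline["cache_memory_exhausted"], timeline["escalated_to_development"]
        ),
        # Time from visible impact until the fix.
        "impact_duration_min": minutes_between(
            timeline["cache_memory_exhausted"], timeline["resolved"]
        ),
    }


print(summarize(TIMELINE))
```

The first interval is the one a queue monitor could have used: the backlog grew for well over two hours before the cache ran out of memory.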

The following preventive measures will be applied within 24 hours to prevent the same problem from happening again:

  1. A monitor will be added on all Redis cache queues, so that we can quickly identify any queue that grows abnormally.
  2. A health check will be added on live data communication, so that we can identify any live communication failure or delay.
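As a rough illustration of what these two measures could look like, here is a minimal sketch. It is not the actual InControl implementation. The class name, function name, thresholds, and the idea of sampling queue lengths are all assumptions. The length reader is passed in as a plain callable so the example runs without a Redis server; if the queues are Redis lists, the callable could wrap the `LLEN` command.

```python
from collections import deque
from typing import Callable, Deque


class QueueGrowthMonitor:
    """Flags a queue whose length keeps rising or exceeds a hard limit.

    Hypothetical helper: read_length is any callable that returns the
    current queue length (for a Redis list this could wrap LLEN).
    """

    def __init__(self, read_length: Callable[[], int],
                 max_length: int, window: int = 5) -> None:
        self.read_length = read_length
        self.max_length = max_length
        self.window = window
        self.samples: Deque[int] = deque(maxlen=window)

    def check(self) -> str:
        """Take one sample and return 'ok', 'growing', or 'over_limit'."""
        length = self.read_length()
        self.samples.append(length)
        if length > self.max_length:
            return "over_limit"
        recent = list(self.samples)
        window_full = len(recent) == self.window
        # Strictly rising across the whole window suggests a stalled consumer.
        rising = all(a < b for a, b in zip(recent, recent[1:]))
        return "growing" if window_full and rising else "ok"


def live_link_healthy(last_reply_at: float, now: float,
                      max_delay_s: float = 30.0) -> bool:
    """Health check: True if the last device reply is recent enough.

    Both times are seconds on the same clock; 30 s is an arbitrary default.
    """
    return (now - last_reply_at) <= max_delay_s


# Usage with simulated queue lengths instead of a live cache.
lengths = iter([10, 20, 30, 40])
monitor = QueueGrowthMonitor(lambda: next(lengths), max_length=35, window=3)
for _ in range(4):
    print(monitor.check())
```

Checking for a sustained rise as well as a hard limit means an alert can fire while free memory is still shrinking, rather than only once the cache is nearly full.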