The issue was resolved and service resumed at around 00:30 UTC on 24 May 2023. We can confirm that the affected devices came back online gradually, and all devices were back online at about 01:45 UTC on 24 May 2023.
If any IC2 users still find their devices offline in IC2 and suspect the issue is related to the IC2 services, feel free to open a support ticket for the support team to check.
At approximately 00:04 UTC on 24 May 2023, a maintenance fix was deployed to all “planets”. The deployment of the fix was supposed to be transparent to our users and to have no negative impact on the system. However, as the deployments were made on all planets at approximately the same time, they caused all devices on all planets to send their report data to the system at the same time. The volume of incoming report data overloaded the system’s device communication cluster. The cluster became unresponsive to devices. Devices began to initiate re-authentication with the system. However, the cluster was too busy to process all the device re-authentication requests in time. As a result, the system began to incorrectly treat devices as offline.
At 00:30, the system stopped requesting devices to send their report data. The cluster load started to decrease. The cluster started to process the re-authentication requests from the devices. Devices gradually came back online. At about 01:23, more resources were added to the cluster to speed up re-authentication. At around 01:45, all devices were back online.
To prevent the same problem from occurring again:
Fixes will not be applied to more than one planet at a time.
More spare resources have been added to the device communication cluster so that it can cope with an increase in load.
If an abnormally high number of authentications is detected, the system will stop identifying devices as offline to avoid potential false alarms.
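The last measure above could be sketched roughly as follows. This is only an illustration and not the actual InControl implementation: the class name `OfflineGuard`, the sliding-window approach, and all thresholds are assumptions made for the example.

```python
from collections import deque


class OfflineGuard:
    """Hypothetical guard: suspends offline marking while the number of
    re-authentication requests in a sliding window is abnormally high."""

    def __init__(self, max_auths_per_window: int, window_seconds: float):
        self.max_auths = max_auths_per_window  # assumed threshold
        self.window = window_seconds           # assumed window length
        self._auth_times = deque()             # timestamps of recent auth requests

    def _trim(self, now: float) -> None:
        # Drop timestamps that have fallen out of the sliding window.
        while self._auth_times and now - self._auth_times[0] > self.window:
            self._auth_times.popleft()

    def record_auth(self, now: float) -> None:
        # Call this once for every device authentication request.
        self._auth_times.append(now)
        self._trim(now)

    def auth_storm(self, now: float) -> bool:
        # True when more requests than the threshold arrived within the window.
        self._trim(now)
        return len(self._auth_times) > self.max_auths

    def should_mark_offline(self, last_seen: float, now: float, timeout: float) -> bool:
        # During an authentication storm the silence is more likely caused by
        # an overloaded system than by real outages, so do not mark offline.
        if self.auth_storm(now):
            return False
        return now - last_seen > timeout
```

With a guard like this, a device that has been silent for longer than `timeout` is reported offline only when the system itself is not being flooded with authentication requests.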
Date: 2023-10-11
Time:
Entire IC2: since 08:23 UTC
Part of the Mars planet: since 07:35 UTC
Issue #1: The IC2 live queries and operations are not working.
Issue #2: Users are reporting that devices are randomly appearing offline and online.
Progress:
The InControl and Engineering teams are working to resolve the issue now. Most of the planets are recovering, while some of the Mars users are still affected.
Impact of Issue #1: Users might find that the RWA and Captive Portal services are affected.
Impact of Issue #2: Devices are randomly appearing offline and online at the moment.
Next update:
We will post the latest status here as soon as the issue is resolved.
Please set this forum post to “Watching” to receive notifications.
We apologize for any inconvenience caused.
[Update #1] Issue #1 was resolved at around 09:15 UTC.
Date: 2023-11-11
Time: 05:36 UTC
System: One of the Mars subsystems, called “mars3”.
Issues: Some device-reported data were not processed. Device and group status might be out of date.
Update: The issue was resolved on 2023-11-12 at 23:40 UTC. It was due to a database connection pool being exhausted. However, the issue was not identified promptly.
Issue avoidance: A monitor on database connection pool errors has been implemented. When the same error occurs, the pool will be reset automatically. Peplink engineers will be notified at the same time.
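A monitor of this kind might look roughly like the sketch below. It is an illustration only: the `PoolWatchdog` class, the `reset_pool` and `notify` callbacks, and the error threshold are hypothetical names and values, not details of the actual system.

```python
from typing import Callable


class PoolWatchdog:
    """Hypothetical monitor: resets the connection pool and notifies
    engineers once enough pool errors have been observed."""

    def __init__(
        self,
        reset_pool: Callable[[], None],   # assumed hook that recreates the pool
        notify: Callable[[str], None],    # assumed hook that alerts engineers
        error_threshold: int = 1,         # assumed number of errors before acting
    ):
        self._reset_pool = reset_pool
        self._notify = notify
        self._threshold = error_threshold
        self._errors = 0

    def on_error(self, message: str) -> bool:
        """Record one pool error. Returns True if a reset was triggered."""
        self._errors += 1
        if self._errors < self._threshold:
            return False
        self._errors = 0
        self._reset_pool()  # reset first so that service recovers quickly
        self._notify(f"connection pool reset after error: {message}")
        return True
```

The application would call `on_error` wherever it catches a connection pool error. Resetting before notifying is a design choice in this sketch, so that recovery does not wait on the alerting path.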
We apologize for any inconvenience caused.
Please set this forum post to “Watching” to receive notifications.
Date: 2024-01-02
Time: 06:20 UTC
System: Entire InControl system.
Issue: A lot of devices have been falsely marked as offline.
Update:
A system component was generating a high CPU load on an in-memory database. The database was overloaded.
The component stopped generating the load at about 07:12 UTC. Devices gradually started to appear online from then on. The system had fully recovered by 07:48 UTC.
Please set this forum post to “Watching” to receive notifications.
Issue:
When users visit their organization, the error message “This organization requires users to enable two factor authentication.” is shown even though the users have been two-factor authenticated during sign-in.
Impact:
For organizations that require their users to be two-factor authenticated, their users were unable to open the organization. Organizations with the option disabled were not affected.
Another extremely useful tool would be a secondary site like status.incontrol2.peplink.com where known issues and planned interruptions could be shared with users. This would help all those who aren’t part of the forums to know if/when there are any incidents and that Peplink is working on them, keeping network managers from worrying and channel support teams from creating tickets for known issues.