InControl issue tracking page

Michael · February 1, 2023, 12:32pm

When there are any major events or issues with InControl, a post will be created under this topic. Feel free to follow this topic by clicking the bell icon on the right-hand side and selecting “Watching”.

Michael · February 1, 2023, 12:47pm

On 2023-02-01 between 03:50 and 05:40 UTC, InControl was inaccessible to newly signed-in user sessions. Affected users might have seen a “Server Error” message, and API calls might have received an error return code.

The issue occurred because a housekeeping process was not enabled. A database table for keeping track of user sessions had filled up with too many records. It caused session checks to time out and fail. After the table was cleaned up, the services resumed.

The housekeeping process has now been enabled, which should prevent the issue from recurring. We apologize for any inconvenience.
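
For readers wondering what such a housekeeping process might look like, here is a minimal sketch. It is not InControl's actual implementation: the `user_sessions` table, the `expires_at` column, and the batch size are all assumptions made for illustration, and SQLite stands in for whatever database is really used. Deleting in small batches keeps each transaction short so that session checks are not blocked for long.

```python
import sqlite3
import time


def purge_expired_sessions(conn, now=None, batch_size=1000):
    """Delete expired session rows in small batches; return how many were removed.

    Table and column names (user_sessions, expires_at) are hypothetical.
    """
    now = time.time() if now is None else now
    removed = 0
    while True:
        cur = conn.execute(
            "DELETE FROM user_sessions WHERE rowid IN ("
            "SELECT rowid FROM user_sessions WHERE expires_at < ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            break
        removed += cur.rowcount
    return removed
```

A job like this would typically run on a schedule (for example, every few minutes) so that the table never grows large enough to slow down lookups.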

Michael · February 4, 2023, 10:08am

On 2023-02-04 between 02:00 and 09:00 UTC, some devices were occasionally misidentified as offline for varying periods of time. For devices that were shown as offline for more than two hours, client usage figures from before that time will be unavailable.

Cause: An internal service redirects devices to their corresponding sub-systems. Because the service’s database connection pool was not large enough, the service was occasionally unable to redirect devices to the corresponding sub-system when they reported online to InControl. As a result, the sub-systems treated those devices as offline.

A monitor for the database connection pool has been implemented so that the same issue can be avoided in the future. We apologize for any inconvenience.

The Peplink InControl team
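
As an illustration of the kind of connection pool monitor mentioned above, here is a small sketch. The `PoolStats` fields and the 80% warning threshold are assumptions for the example, not details of the real service. The idea is simply to raise an alert before the pool is exhausted, and to treat any caller waiting for a connection as critical.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    in_use: int    # connections currently checked out
    max_size: int  # configured pool limit
    waiting: int   # callers blocked waiting for a free connection


def check_pool(stats, warn_ratio=0.8):
    """Classify one pool snapshot as 'ok', 'warn', or 'critical'."""
    if stats.max_size <= 0:
        raise ValueError("max_size must be positive")
    # A full pool or any waiting caller means requests are already being delayed.
    if stats.waiting > 0 or stats.in_use >= stats.max_size:
        return "critical"
    if stats.in_use / stats.max_size >= warn_ratio:
        return "warn"
    return "ok"
```

For example, `check_pool(PoolStats(in_use=40, max_size=50, waiting=0))` returns `"warn"`, which would give operators time to enlarge the pool before devices start failing to be redirected.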

Michael · February 7, 2023, 3:55pm

On 2023-02-07 between 06:40 and 09:09 UTC, a fraction of the organizations that reside on the Mars system experienced slow responses. Live status information and usage reports were not updated, and configuration changes could not be pushed to devices in a timely manner.

Cause: The database system serving those organizations hit a software limit, which significantly degraded the sub-system’s performance. After the database system’s capacity was increased and resources were relocated, performance gradually returned to normal.

sitloongs · February 27, 2023, 6:50am

We’re experiencing a service interruption with IC2 services that started at 02:13 UTC on 27/2/2023. Our team is currently working to identify the issue and restore the service.

Impact: Devices are incorrectly shown as offline in IC2 on all IC2 planets.

Next update: We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience.

Michael · February 27, 2023, 8:05am

The previously mentioned issue has been resolved. All devices are back online now.

On 2023-02-27 between 02:13 and 07:40 UTC, the system incorrectly identified online devices as offline. All branches were affected. The issue was due to an unexpected memory database restart, which triggered an overwhelming number of devices to report online in a short period of time. The system was unable to serve all online requests within 3 minutes, so the devices that were not served in time were incorrectly treated as offline. After the system capacity was increased, the system started to catch up. All online requests were finally processed at 07:40 UTC.

We will investigate the root cause of the memory database restart. We will also allocate enough system capacity to handle a similar outage. We are sorry for any inconvenience caused.

The Peplink InControl team.
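
To make the 3-minute window described above more concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which every device reports at the same instant and requests are served first-in, first-out at a constant rate. The function names and all figures in the usage note are hypothetical; only the 3-minute (180-second) deadline comes from the post.

```python
import math


def devices_missing_deadline(devices, requests_per_second, deadline_s=180):
    """Devices not served before the deadline if all of them report at t=0."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    served_in_time = int(requests_per_second * deadline_s)
    return max(0, devices - served_in_time)


def min_rate_to_meet_deadline(devices, deadline_s=180):
    """Smallest whole requests-per-second rate that serves every device in time."""
    return math.ceil(devices / deadline_s)
```

With made-up figures of 100,000 devices and a capacity of 200 requests per second, `devices_missing_deadline(100_000, 200)` gives 64,000 devices that would be treated as offline, while `min_rate_to_meet_deadline(100_000)` gives 556 requests per second as the capacity needed to serve everyone within the window.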

sitloongs · March 27, 2023, 2:06am

We’re experiencing a service interruption with Mars IC2 services.

AWS engineers are fixing a database issue, and Mars IC2 is under maintenance now.

Impact: Service is interrupted for Mars, and some users may experience access issues.

Next update: We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience.

sitloongs · March 27, 2023, 2:57am

The issue has been resolved and services have resumed as normal.

Mars planet users who still have issues accessing their ORG/Group can feel free to open a support ticket for the support team to check.

https://ticket.peplink.com/ticket/new/public

Jonathan_Pitts · May 24, 2023, 12:47am

Major issue, please update us here.

Ben_Koehler_West_Net · May 24, 2023, 12:51am

Peplink Engineering has been notified and is actively working on the issue.

Keith · May 24, 2023, 1:05am

Just a quick update.

The Peplink team has been working on the issue. IC2 is showing devices as down, but the devices should operate as normal with their current config.

The issue has been identified and is being fixed.

sitloongs · May 24, 2023, 1:06am

We’re experiencing a service interruption with the IC2 services.

Date: 24/5/2023
Time: Starting from 00:10 UTC

The IC2 Cloud Engineering team is working to resolve the issue now.

Impact:
Device updates, reporting, online status, captive portal, and SIM pool will be affected, as devices will be shown as offline in IC2.

Next update: We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience.

Reynaldo_Galgao · May 24, 2023, 1:10am

Thank you for the update. Please keep us posted.

sitloongs · May 24, 2023, 2:03am

The issue was resolved and service resumed at around 00:30 UTC on 24/5/2023. We can confirm that the affected devices came back online gradually, and all devices were back online at about 01:45 UTC on 24/5/2023.

IC2 users who still find their devices offline in IC2 and suspect that the issue is related to the IC2 services can feel free to open a support ticket for the support team to check.

https://ticket.peplink.com/ticket/new/public

Jonathan_Pitts · May 24, 2023, 2:24am

What was the root cause?

Michael · May 25, 2023, 7:46pm

At approximately 00:04 UTC on 24 May 2023, a maintenance fix was deployed to all “planets”. The deployment of the fix was supposed to be transparent to our users and have no negative impact on the system. However, as the deployments were made on all planets at approximately the same time, they caused all devices on all planets to send their report data to the system at the same time. The volume of incoming report data overloaded the system’s device communication cluster. The cluster became unresponsive to devices. Devices began to initiate re-authentication with the system. However, the cluster was too busy to process all the device re-authentication requests in time. As a result, the system began to incorrectly treat devices as offline.

At 00:30, the system stopped requesting devices to send their report data. The cluster load started to decrease. The cluster started to process the re-authentication requests from the devices. Devices gradually and slowly came back online. At about 01:23, more resources were added to the cluster to speed up re-authentication. At around 01:45, all devices were back online.

To prevent the same problem from occurring again:

Fixes will not be applied to more than one planet at a time.
More spare resources have been added to the communications cluster so that it can cope with an increase in load.
If an abnormally high number of authentications is detected, the system will stop identifying devices as offline to avoid potential false alarms.
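
Two of the measures above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual InControl logic: the function names, the baseline figure, and the `surge_factor` threshold are hypothetical, and `deploy` and `is_healthy` are placeholder callbacks supplied by the caller.

```python
def should_mark_offline(auths_last_minute, baseline_per_minute, surge_factor=5.0):
    """Return False while authentication volume looks abnormally high.

    During a surge, missing heartbeats are more likely caused by an
    overloaded cluster than by devices that are really offline.
    """
    return auths_last_minute <= baseline_per_minute * surge_factor


def rollout_one_at_a_time(planets, deploy, is_healthy):
    """Deploy to planets sequentially and stop at the first unhealthy one.

    Returns the list of planets that received the deployment.
    """
    deployed = []
    for planet in planets:
        deploy(planet)
        deployed.append(planet)
        if not is_healthy(planet):
            break  # do not touch the remaining planets
    return deployed
```

Rolling out planet by planet limits the size of any reporting burst to one planet’s devices, and the surge check avoids sending false offline alerts while the cluster is still catching up.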

sitloongs · August 22, 2023, 4:52am

Date: 2023-08-22
Time: since 01:50 UTC

Issue:
One of the three Mars sub-systems hit a bug in the Amazon Aurora service and is encountering performance issues.

Progress:
The InControl Engineering team and the Amazon engineering team are working out a solution to resolve the issue now.

Impact:
Device updates, reporting, online status, and configuration changes are slow or delayed for some organizations on Mars.

Next update: We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience.

Michael · August 22, 2023, 1:27pm

At 12:00 UTC on 2023-08-22, the performance issue with a Mars sub-system was resolved. The system’s services have been completely restored.

We are sorry for any inconvenience.

WeiMing · September 25, 2023, 3:25am

Date: 2023-09-25
Time: since 01:30 UTC

Issue:
The IC2 messaging servers are unstable at the moment, so users might notice intermittent online and offline alerts.

Progress:
The InControl and Engineering teams are working to resolve the issue as soon as possible.

Impact:
So far, users have reported receiving false device-offline email alerts, and RWA is impacted.

Next update: We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience caused.

[Update]
Issue resolved at 07:30 UTC

WeiMing · October 11, 2023, 9:09am

Date: 2023-10-11
Time:
Entire IC2 = since 08:23 UTC
Part of the Mars planet = since 07:35 UTC

Issue #1: The IC2 live queries and operations are not working.
Issue #2: Users are reporting that devices are randomly appearing offline and online.

Progress:
The InControl and Engineering teams are working to resolve the issue now. Most of the planets are recovering, while some Mars users are still affected.

Impact of Issue #1: Users might find that the RWA and Captive Portal services are affected.
Impact of Issue #2: Devices are randomly appearing offline and online at the moment.

Next update:
We will post the latest status here as soon as the issue is resolved.

Please set this topic to “Watching” to receive notifications.

We apologize for any inconvenience caused.

[Update #1] Issue #1 was resolved at around 09:15 UTC.

[Update #2] Issue #2 was resolved at 10:50 UTC.