InControl issue tracking page

sitloongs · October 9, 2024, 6:56am

Date: 2024-10-09
Time: Since 06:20 UTC

Issue: Devices on IC2 are falsely appearing offline

The IC2 Cloud Engineering team are looking at the issue.

We apologize for any inconvenience caused.

David_Jones · October 9, 2024, 7:00am

Thanks for the update. I was panicking for a bit there, but I could also see tunnels up on routers/appliances marked as offline in IC2!

Michael · October 9, 2024, 8:03am

The issue has been identified and resolved. Devices have been coming back online gradually since 07:22 UTC. The process will take about another 60 minutes.

The root cause was that one of the memory cache servers was intermittently unresponsive, yet it did not trigger any monitoring alarms. We will ensure that the same issue can be identified and fixed promptly in the future.
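One common way to catch a server that is only intermittently unresponsive is to alarm on the failure ratio over a sliding window of probes rather than on a run of consecutive failures. The sketch below is an illustration only, not a description of IC2's monitoring: the class name, window size, and threshold are all assumptions.

```python
from collections import deque


class IntermittentFailureDetector:
    """Alarm when too many of the last `window` probes failed (hypothetical helper)."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.2):
        # Keep only the most recent `window` probe results.
        self.results = deque(maxlen=window)
        self.max_failure_ratio = max_failure_ratio

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if an alarm should fire."""
        self.results.append(ok)
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_ratio


# Assumed probe pattern: every third probe fails, so there are never
# two consecutive failures, but 30% of probes fail overall.
detector = IntermittentFailureDetector(window=20, max_failure_ratio=0.2)
alarms = [detector.record(ok) for ok in [True, True, False] * 10]
print("alarm fired:", any(alarms))
```

A check that only fires after several consecutive failures would stay silent on this probe pattern, which is the kind of gap the post describes.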

WeiMing · October 10, 2024, 2:58am

Date: 2024-10-10
Time: Around 02:26 UTC

Issue: Devices on IC2 are falsely appearing offline

The IC2 Cloud Engineering team is looking at the issue.

Our apologies for any inconvenience caused.

WeiMing · October 10, 2024, 5:38am

[Update]
The issue is resolved. Devices have been back online since 04:09 UTC.

Michael · August 15, 2025, 12:47pm

InControl 2 Outage – August 14, 2025

Incident Overview

Date & Time of Outage:
Start: 2025-08-14 14:46 UTC
End: 2025-08-14 19:00 UTC
Duration: Approx. 4 hours 14 minutes
Affected Service:
InControl 2 (Cloud-based device management platform)
Impact Summary:
Online devices appeared offline. The website was occasionally unresponsive or returned errors. The captive portal service was unavailable. The API service was unavailable.

Detection

How was the issue detected?
The internal monitoring system showed that a significant number of devices had gone offline within a short period of time.
Time of Detection:
2025-08-14 14:51 UTC
Initial Severity Assessment:
Critical

Timeline of Events

Time (UTC)	Event Description
14:51	Issue detected
14:59	Engineering team engaged
15:40	Started increasing resources to cope with the surge in devices coming online and in API requests. Devices started to reconnect to IC2 gradually.
16:28	API service restored.
17:04	Finished increasing resources.
17:36	All devices reconnected to IC2.
17:42	Mars system: All devices appear online. All services restored completely.
19:00	Earth system: All devices appear online. All services restored completely.
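The intervals between the milestones above can be worked out directly from the posted timestamps. A minimal sketch: the helper names and interval labels are illustrative, and the times are the ones given in this report.

```python
from datetime import datetime

DAY = "2025-08-14"  # date of the outage, per the report


def at(hhmm: str) -> datetime:
    """Parse an HH:MM (UTC) time on the day of the outage."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM times on the same day."""
    return int((at(end) - at(start)).total_seconds() // 60)


intervals = {
    "outage start to detection": minutes_between("14:46", "14:51"),
    "detection to engineering engaged": minutes_between("14:51", "14:59"),
    "outage start to API restored": minutes_between("14:46", "16:28"),
    "outage start to all devices reconnected": minutes_between("14:46", "17:36"),
    "total outage": minutes_between("14:46", "19:00"),
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes // 60}h {minutes % 60:02d}m")
```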

Root Cause

Primary Cause:
The device communication server cluster was near its capacity.
Contributing Factors:
Between 14:41 and 14:44 UTC, 15 times more than the usual number of devices reported online within a short period of time.
Result: The surge in devices coming online overloaded the cluster, and device connections started to drop. As the dropped devices attempted to reconnect, they generated even more load on the cluster. Eventually, the cluster collapsed and disconnected most of the devices.
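The feedback loop described above (overload, dropped connections, reconnect attempts, more overload) can be illustrated with a toy queue model. This is only a sketch under invented numbers and says nothing about IC2's real architecture or capacity: the function name, the per-tick capacity of 100, the baseline of 90, and the 20% drop rate under overload are all assumptions. Only the 15x surge factor and the 43% capacity increase are taken from the report.

```python
from typing import Optional


def ticks_to_recover(capacity: int, baseline: int, surge_factor: int,
                     surge_ticks: int, drop_percent: int,
                     max_ticks: int = 1000) -> Optional[int]:
    """Return the first tick after the surge with no backlog, or None if
    the backlog never clears within `max_ticks` (hypothetical toy model)."""
    backlog = 0
    for tick in range(max_ticks):
        # New connection attempts: surge for the first few ticks, then normal.
        arrivals = baseline * surge_factor if tick < surge_ticks else baseline
        demand = arrivals + backlog
        served = min(demand, capacity)
        backlog = demand - served  # unserved devices retry on the next tick
        if backlog > 0:
            # Assumption: while overloaded, a share of the connections that
            # were served is dropped again and joins the retry backlog.
            backlog += served * drop_percent // 100
        elif tick >= surge_ticks:
            return tick
    return None


# Cluster "near its capacity": 90 of 100 units used in normal operation.
near_capacity = ticks_to_recover(capacity=100, baseline=90, surge_factor=15,
                                 surge_ticks=3, drop_percent=20)
# Same load with 43% more capacity.
with_headroom = ticks_to_recover(capacity=143, baseline=90, surge_factor=15,
                                 surge_ticks=3, drop_percent=20)
print("near capacity:", near_capacity)
print("with headroom:", with_headroom)
```

In this model the near-capacity cluster never clears its backlog, because normal arrivals plus reconnects exceed what it can serve each tick, while the larger cluster drains the backlog and recovers.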

Resolution

Immediate Fix Applied:
- Increased the device communication cluster’s capacity by 43%.
- Doubled the API cluster’s capacity.
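To put the two capacity changes in perspective: for an unchanged load, utilization falls in inverse proportion to capacity. The sketch below assumes a hypothetical 95% utilization before the fix, since the report only says the cluster was "near its capacity"; the function and variable names are illustrative.

```python
def utilization_after_scaling(utilization_before: float,
                              capacity_increase: float) -> float:
    """Utilization for the same load after capacity grows by
    `capacity_increase` (0.43 means +43%, 1.0 means doubled)."""
    return utilization_before / (1.0 + capacity_increase)


ASSUMED_UTILIZATION = 0.95  # hypothetical; not a figure from the report

device_cluster = utilization_after_scaling(ASSUMED_UTILIZATION, 0.43)
api_cluster = utilization_after_scaling(ASSUMED_UTILIZATION, 1.00)

print(f"device communication cluster: {device_cluster:.1%}")
print(f"API cluster: {api_cluster:.1%}")
```

Under that assumption, the device communication cluster would drop to roughly two thirds utilization and the API cluster to a little under half for the same load.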

Preventive Actions

Short-Term Mitigations:
Keep the existing cluster’s capacity at the current level.
Long-Term Fixes:
Leave more capacity headroom to cope with service load spikes.
Reduce the system scale-up and scale-down time.
Timeline for Fixes:
To be completed in Q3 2025.

Communication

Peplink will set up a status portal to report the current and past status of Peplink services. To be completed in Q4 2025.