The issue has been identified and resolved. Devices have been coming back online gradually since 07:22 UTC. Full recovery is expected to take about another 60 minutes.
The root cause was that one of the memory cache servers was intermittently unresponsive, yet this did not trigger any monitoring alarms. We will ensure that the same issue can be identified and fixed promptly in the future.
Impact Summary:
Online devices appeared offline. The website was occasionally unresponsive or returned errors. The captive portal service was unavailable. The API service was unavailable.
Detection
How was the issue detected?
The internal monitoring system showed that a significant number of devices went offline within a short period of time.
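The kind of check described above can be sketched as a sliding-window detector. This is a minimal illustration only, not the monitoring system actually in use: the class name `OfflineSpikeDetector`, the window size, and the drop threshold are all hypothetical.

```python
from collections import deque


class OfflineSpikeDetector:
    """Flags a sharp drop in the online-device count within a short window.

    Hypothetical sketch: the window length and threshold are illustrative
    assumptions, not values taken from the incident report.
    """

    def __init__(self, window: int, max_drop_fraction: float) -> None:
        # Keep only the most recent `window` samples of the online count.
        self.samples = deque(maxlen=window)
        self.max_drop_fraction = max_drop_fraction

    def observe(self, online_count: int) -> bool:
        """Record a sample; return True if it is a spike of devices going offline."""
        self.samples.append(online_count)
        peak = max(self.samples)
        if peak == 0:
            return False
        # Fraction of devices lost relative to the recent peak.
        drop = (peak - online_count) / peak
        return drop >= self.max_drop_fraction


detector = OfflineSpikeDetector(window=5, max_drop_fraction=0.2)
alerts = [detector.observe(count) for count in (1000, 990, 700)]
```

Comparing against the recent peak rather than the previous sample lets the check catch a drop that is spread over several consecutive samples.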
Time of Detection:
2025-08-14 14:51 UTC
Initial Severity Assessment:
Critical
Timeline of Events
Time (UTC)  Event Description
14:51       Issue detected.
14:59       Engineering team engaged.
15:40       Began increasing resources to cope with the surge in device reconnections and API requests. Devices started to reconnect to IC2 gradually.
16:28       API service restored.
17:04       Finished increasing resources.
17:36       All devices reconnected to IC2.
17:42       Mars system: All devices appear online. All services restored completely.
19:00       Earth system: All devices appear online. All services restored completely.
Root Cause
Primary Cause:
The device communication server cluster was near its capacity.
Contributing Factors:
Between 14:41 and 14:44, 15 times the usual number of devices reported online within a short period of time.
Result: The surge of devices coming online overloaded the cluster, and device connections started to drop. As the dropped devices attempted to reconnect, they generated even higher load on the cluster. Finally, the cluster collapsed and disconnected most of the devices.
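The feedback loop described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of the actual cluster: the function name, the device counts, the capacity figures, and the assumption that a connection attempt costs five times as much as keeping a connection alive are all hypothetical.

```python
def simulate_reconnect_storm(total_devices, capacity, surge_size,
                             connect_cost=5.0, steps=20):
    """Toy model of a connection cluster hit by a reconnect surge.

    Assumptions (all hypothetical): each connected device costs 1 unit of
    work per step, each connection attempt costs `connect_cost` units, and
    an overloaded cluster serves only the fraction capacity / load of all
    devices, dropping the rest.
    Returns the number of connected devices after each step.
    """
    connected = float(total_devices - surge_size)
    pending = float(surge_size)  # devices trying to connect
    history = []
    for _ in range(steps):
        load = connected + pending * connect_cost
        if load <= capacity:
            # Enough capacity: every pending device connects.
            connected, pending = connected + pending, 0.0
        else:
            # Overload: only a fraction of devices is served. Dropped
            # devices become pending and retry on the next step.
            served_fraction = capacity / load
            connected = (connected + pending) * served_fraction
            pending = total_devices - connected
        history.append(connected)
    return history


small_surge = simulate_reconnect_storm(1000, 1200, surge_size=40)
large_surge = simulate_reconnect_storm(1000, 1200, surge_size=100)
more_capacity = simulate_reconnect_storm(1000, 1716, surge_size=100)
```

In this toy model the small surge is absorbed, while the larger one pushes the cluster past a tipping point where retries keep the load above capacity and most devices stay disconnected. With more capacity, the same larger surge is absorbed.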
Resolution
Immediate Fix Applied:
Increased the device communication cluster’s capacity by 43%.
Doubled the API cluster’s capacity.
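As a rough illustration of what these increases mean for headroom, the arithmetic can be sketched as below. The pre-incident utilization values are assumptions chosen for the example (the report only says the cluster was near its capacity), and the function names are hypothetical.

```python
def utilization_after_scaling(utilization_before, capacity_increase):
    """Utilization after capacity grows by `capacity_increase`, load unchanged.

    `capacity_increase` is a fraction: 0.43 means +43%, 1.0 means doubled.
    """
    if capacity_increase <= -1.0:
        raise ValueError("capacity_increase must be greater than -1")
    return utilization_before / (1.0 + capacity_increase)


def headroom(utilization):
    """Fraction of capacity left unused at the given utilization."""
    return 1.0 - utilization


# Assumed starting points, for illustration only.
device_cluster = utilization_after_scaling(1.0, 0.43)  # +43% capacity
api_cluster = utilization_after_scaling(0.9, 1.0)      # capacity doubled

print(f"device cluster: {device_cluster:.0%} used, {headroom(device_cluster):.0%} headroom")
print(f"API cluster:    {api_cluster:.0%} used, {headroom(api_cluster):.0%} headroom")
```

Under these assumptions, a cluster running at full capacity drops to roughly 70% utilization after a 43% increase, and doubling capacity halves utilization at the same load.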
Preventive Actions
Short-Term Mitigations:
Keep the existing clusters at their current, increased capacity.
Long-Term Fixes:
Leave more capacity headroom to cope with service load spikes.
Reduce the system scale-up and scale-down time.
Timeline for Fixes:
To be completed in Q3, 2025.
Communication
Peplink will set up a status portal to report the current and past status of Peplink services. To be completed in Q4, 2025.