InControl issue tracking page

Date: 2024-10-09
Time: Since 06:20 UTC

Issue: Devices on IC2 are falsely appearing offline

The IC2 Cloud Engineering team is looking at the issue.

We apologize for any inconvenience caused.

Thanks for the update. I was panicking for a bit there, but I could also see tunnels up on routers/appliances marked as offline in IC2!

The issue has been identified and resolved. Devices have been coming back online gradually since 07:22 UTC. The process is expected to take about another 60 minutes.

The root cause was that one of the memory cache servers was intermittently unresponsive, but this did not trigger any monitoring alarms. We will ensure that the same issue can be identified and fixed promptly in the future.
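To illustrate why an intermittently unresponsive server can slip past an alarm, here is a minimal Python sketch. It is not a description of the IC2 monitoring setup, which the post does not detail. It compares a check that fires only on consecutive failures with one that tracks the failure ratio over a sliding window. The class name, window size, threshold, and probe pattern are all hypothetical.

```python
from collections import deque


class WindowedFailureAlarm:
    """Fire when the failure ratio over the last `window` probes reaches
    `threshold`. Both values are hypothetical, chosen for illustration."""

    def __init__(self, window: int = 20, threshold: float = 0.2) -> None:
        self.results = deque(maxlen=window)  # True means the probe succeeded
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Store one probe result and return True if the alarm should fire."""
        self.results.append(ok)
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples yet
        failures = sum(1 for r in self.results if not r)
        return failures / len(self.results) >= self.threshold


def consecutive_failure_alarm(probes, needed: int = 3) -> bool:
    """Simpler check: fire only after `needed` failed probes in a row."""
    streak = 0
    for ok in probes:
        streak = 0 if ok else streak + 1
        if streak >= needed:
            return True
    return False


# Hypothetical intermittent fault: every third probe fails.
probes = [i % 3 != 0 for i in range(30)]

alarm = WindowedFailureAlarm(window=20, threshold=0.2)
windowed_fired = any([alarm.record(ok) for ok in probes])
consecutive_fired = consecutive_failure_alarm(probes, needed=3)

print("windowed alarm fired:", windowed_fired)
print("consecutive-failure alarm fired:", consecutive_fired)
```

In this toy pattern a third of the probes fail, yet no three fail in a row, so only the windowed check fires.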

Date: 2024-10-10
Time: Around 02:26 UTC

Issue: Devices on IC2 are falsely appearing offline

The IC2 Cloud Engineering team is looking at the issue.

Our apologies for any inconvenience caused.

[Update]
The issue has been resolved, and the devices were back online as of 04:09 UTC.

InControl 2 Outage – August 14, 2025

  1. Incident Overview
  • Date & Time of Outage:
    Start: 2025-08-14 14:46 UTC
    End: 2025-08-14 19:00 UTC
    Duration: Approx. 4 hours 14 minutes
  • Affected Service:
    InControl 2 (Cloud-based device management platform)
  • Impact Summary:
    Online devices appeared offline. The website was occasionally unresponsive or returned errors. The captive portal service was unavailable. The API service was unavailable.

  2. Detection
  • How was the issue detected?
    Internal system monitoring showed that a significant number of devices had gone offline within a short period of time.
  • Time of Detection:
    2025-08-14 14:51 UTC
  • Initial Severity Assessment:
    Critical

  3. Timeline of Events
Time (UTC)  Event Description
14:51       Issue detected.
14:59       Engineering team engaged.
15:40       Began increasing resources to cope with the surge in devices coming online and in API requests. Devices started to reconnect to IC2 gradually.
16:28       API service restored.
17:04       Finished increasing resources.
17:36       All devices reconnected to IC2.
17:42       Mars system: All devices appear online. All services restored completely.
19:00       Earth system: All devices appear online. All services restored completely.
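The intervals implied by the timestamps in this report can be checked with a short Python sketch. The helper names and the dictionary layout are illustrative only. The timestamps are copied from the report and assumed to fall on the same UTC day.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Render a minute count as e.g. '4h 14m'."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Timestamps taken from the report (all UTC, 2025-08-14).
events = {
    "outage start": "2025-08-14 14:46",
    "issue detected": "2025-08-14 14:51",
    "engineering engaged": "2025-08-14 14:59",
    "API service restored": "2025-08-14 16:28",
    "Mars system restored": "2025-08-14 17:42",
    "Earth system restored": "2025-08-14 19:00",
}

start = events["outage start"]
elapsed = {name: minutes_between(start, ts) for name, ts in events.items()}

for name, minutes in elapsed.items():
    print(f"{name:<22} +{as_hours_minutes(minutes)}")
```

The last line of output corresponds to the reported outage duration of roughly 4 hours 14 minutes.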

  4. Root Cause
  • Primary Cause:
    The device communication server cluster was near its capacity.
  • Contributing Factors:
    Between 14:41 and 14:44, 15 times the usual number of devices reported online within a short period of time.
  • Result: The surge of devices coming online overloaded the cluster, and device connections started to drop. As those devices attempted to reconnect, they generated even higher load on the cluster. Eventually the cluster collapsed and disconnected most of the devices.
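The reconnect feedback loop described above is often damped on the client side with randomized ("jittered") exponential backoff. The report does not say how devices schedule their reconnect attempts, so the Python sketch below is a general illustration, not Peplink's behaviour or fix. The function names, base delay, cap, and device count are assumptions.

```python
import random
from collections import Counter


def fixed_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Plain exponential backoff: every client waits the same time."""
    return min(cap, base * 2 ** attempt)


def jittered_delay(attempt: int, rng: random.Random,
                   base: float = 1.0, cap: float = 60.0) -> float:
    """Backoff with full jitter: wait a random time up to the fixed delay."""
    return rng.uniform(0.0, fixed_delay(attempt, base, cap))


def peak_retries_per_second(delays) -> int:
    """Largest number of retries that land in the same one-second bucket."""
    buckets = Counter(int(d) for d in delays)
    return max(buckets.values())


rng = random.Random(2025)  # seeded so the run is repeatable
devices = 1000             # hypothetical number of dropped devices
attempt = 3                # fourth retry: fixed delay is 8 seconds

fixed = [fixed_delay(attempt) for _ in range(devices)]
jittered = [jittered_delay(attempt, rng) for _ in range(devices)]

peak_fixed = peak_retries_per_second(fixed)
peak_jittered = peak_retries_per_second(jittered)

print("peak retries/second without jitter:", peak_fixed)
print("peak retries/second with jitter:   ", peak_jittered)
```

Without jitter all 1000 hypothetical devices retry in the same second. With jitter the same retries are spread across the 8-second interval, which lowers the peak load that the server sees.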

  5. Resolution
  • Immediate Fix Applied:
    • Increased the device communication cluster’s capacity by 43%.
    • Doubled the API cluster’s capacity.
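As a rough worked example of what these capacity changes mean, the Python sketch below computes utilization at an unchanged load after each increase. The starting utilization of 95% is an assumption standing in for "near its capacity"; the report gives no actual utilization figure for either cluster.

```python
def utilization_after_scaling(utilization_before: float,
                              capacity_increase: float) -> float:
    """Utilization at the same load once capacity grows by the given fraction."""
    return utilization_before / (1.0 + capacity_increase)


# Hypothetical starting point: the report only says "near its capacity".
assumed_utilization = 0.95

# +43% on the device communication cluster, +100% (doubled) on the API cluster.
device_cluster_after = utilization_after_scaling(assumed_utilization, 0.43)
api_cluster_after = utilization_after_scaling(assumed_utilization, 1.00)

print(f"device communication cluster: {device_cluster_after:.1%}")
print(f"API cluster:                  {api_cluster_after:.1%}")
```

Under that assumption, the same load would occupy about two thirds of the enlarged device communication cluster and just under half of the doubled API cluster.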

  6. Preventive Actions
  • Short-Term Mitigations:
    Keep the existing cluster’s capacity at the current level.
  • Long-Term Fixes:
    Leave more capacity headroom to cope with service load spikes.
    Reduce the system scale-up and scale-down time.
  • Timeline for Fixes:
    To be completed in Q3 2025.

  7. Communication
  • Peplink will set up a status portal to report the current and past status of Peplink services. To be completed in Q4 2025.