
ExtendlyOS Service Disruption on 1/3/2026

Jan 03 at 01:38pm EST
Affected services
ExtendlyOS Data Graph
ExtendlyOS Support AI v2
ExtendlyOS Features Service
GHL Sync Orchestrator Service
GHL Sync Worker Service
KB Indexing Orchestrator Service
KB Indexing Worker Service
News & Announcements Orchestrator Service
News & Announcements Worker Service

Resolved
Jan 03 at 01:38pm EST

Incident Post-Mortem: Service Outage (2026-01-03)

Summary

Incident ID: EXT-2026-01-03-01

Date / Time (UTC): 2026-01-03 05:45–07:25

Duration: ~100 minutes

Severity: High

Services Affected: ExtendlyOS Backend Services, Background Workers

At approximately 12:45 AM EST (05:45 UTC), multiple ExtendlyOS backend services became unavailable. Monitoring confirmed the outage was caused by failures in the production infrastructure control plane and database layer.


Impact

  • Production operations relying on ExtendlyOS backend services were affected.
  • API requests failed or timed out.
  • Background jobs did not execute.
  • Full production outage during the incident window.

Timeline (EST)

Time   Event
00:45  First signal detected: background worker services missed heartbeats.
00:52  Proxy services began returning errors.
01:08  Multiple services were unable to connect to the server.
01:10  Widespread background service failures.
01:59  Full connectivity loss for API and support services.
02:36  Intermittent connectivity during partial recovery.
03:30  System stabilized; full service restored.

Root Cause

A routine operating system update on infrastructure nodes introduced a version incompatibility with the container orchestration platform. This caused a loss of coordination between services. Consequently, the database cluster experienced replication issues requiring a rebuild of specific nodes.
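For illustration, a compatibility gate of the kind called for in the corrective measures below could look like the following sketch. It is a minimal, hypothetical example written for this write-up: the node inventory, the supported version range, and the candidate_agent_version lookup are assumptions, not details of the actual ExtendlyOS infrastructure.

    # Hypothetical preflight check: block an OS update rollout when it would install an
    # orchestration agent outside the version range the control plane supports. The node
    # inventory, supported range, and version lookup below are illustrative assumptions.

    SUPPORTED_MIN = (1, 27, 0)   # assumed lowest agent version the control plane supports
    SUPPORTED_MAX = (1, 29, 99)  # assumed highest agent version the control plane supports

    NODES = ["node-a", "node-b", "node-c"]  # placeholder node inventory


    def parse_version(text: str) -> tuple[int, int, int]:
        """Turn a dotted version string such as '1.28.4' into a comparable tuple."""
        major, minor, patch = (int(part) for part in text.strip().split(".")[:3])
        return (major, minor, patch)


    def candidate_agent_version(node: str) -> str:
        """Placeholder: resolve the agent version the pending OS update would install
        on this node (package-manager dry run, image manifest, inventory DB, ...)."""
        return "1.30.1"  # fixed example value; outside the range, so the check blocks


    def preflight_ok(nodes: list[str]) -> bool:
        """Return True only if every node would stay inside the supported range."""
        ok = True
        for node in nodes:
            version = parse_version(candidate_agent_version(node))
            if not (SUPPORTED_MIN <= version <= SUPPORTED_MAX):
                print(f"{node}: candidate agent {version} outside supported range; blocking update")
                ok = False
        return ok


    if __name__ == "__main__":
        if not preflight_ok(NODES):
            raise SystemExit(1)  # non-zero exit halts the rollout pipeline

Run as a gating step before an update is applied, a check like this would have flagged the version skew rather than letting the nodes upgrade implicitly.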


Resolution

  • Upgraded infrastructure proxy and orchestration agents to compatible versions.
  • Repaired the database cluster by rebuilding affected replicas.
  • Verified data integrity and service connectivity (a connectivity-check sketch follows this list).
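
As a generic illustration of the final verification step, a simple post-recovery probe of each service's health endpoint might look like the sketch below. The service names, endpoint URLs, and timeout are placeholders, not the actual ExtendlyOS checks.

    # Hypothetical post-recovery connectivity check: probe each service's health endpoint
    # and report any that still fail to answer. The URLs below are placeholders, not the
    # actual ExtendlyOS endpoints.
    from urllib.request import urlopen

    HEALTH_ENDPOINTS = {
        "data-graph": "https://example.internal/data-graph/health",
        "features-service": "https://example.internal/features/health",
        "ghl-sync-orchestrator": "https://example.internal/ghl-sync/health",
    }


    def unhealthy_services(endpoints: dict[str, str], timeout: float = 5.0) -> list[str]:
        """Return the names of services whose health endpoint did not answer with HTTP 200."""
        failing = []
        for name, url in endpoints.items():
            try:
                with urlopen(url, timeout=timeout) as response:
                    if response.status != 200:
                        failing.append(name)
            except OSError:
                # Covers DNS failures, refused connections, and timeouts.
                failing.append(name)
        return failing


    if __name__ == "__main__":
        still_down = unhealthy_services(HEALTH_ENDPOINTS)
        if still_down:
            raise SystemExit("still unhealthy: " + ", ".join(still_down))
        print("all probed services answered their health checks")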

Corrective & Preventative Measures

  • Version Locking: Implemented stricter version pinning for all infrastructure components to prevent implicit upgrades.
  • Enhanced Monitoring: Added specific alerts for database replication lag to detect similar issues earlier (see the probe sketch after this list).
  • Process Improvements: Updated change management procedures to include stricter compatibility checks for infrastructure updates.
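
For the replication-lag alerting item, a probe along the following lines could feed the new alert. It is a hedged sketch: the replica names, the threshold, and the lag lookup are assumptions, and this post-mortem does not state which database engine is in use, so the PostgreSQL query mentioned in the comment is only one possibility.

    # Hypothetical replication-lag probe that could back the new alert. Replica names,
    # the threshold, and the lag lookup are illustrative assumptions; the database
    # engine is not named in this post-mortem.
    import time

    LAG_THRESHOLD_SECONDS = 30.0                   # assumed alerting threshold
    REPLICAS = ["db-replica-1", "db-replica-2"]    # placeholder replica inventory


    def replication_lag_seconds(replica: str) -> float:
        """Placeholder: in practice this would query the replica itself, e.g. for
        PostgreSQL streaming replication:
            SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));
        A fixed return value keeps the sketch self-contained."""
        return 2.5


    def emit_alert(message: str) -> None:
        """Placeholder: in practice this would page on-call through the alerting system."""
        print(f"[ALERT {time.strftime('%H:%M:%S')}] {message}")


    def check_replication(replicas: list[str]) -> None:
        """Alert on any replica whose lag exceeds the threshold."""
        for replica in replicas:
            lag = replication_lag_seconds(replica)
            if lag > LAG_THRESHOLD_SECONDS:
                emit_alert(f"{replica} lag {lag:.1f}s exceeds {LAG_THRESHOLD_SECONDS:.0f}s")


    if __name__ == "__main__":
        check_replication(REPLICAS)

Scheduled at a short interval, a check of this shape would have surfaced the replication issues well before the connectivity loss seen in this incident.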