The inevitability of network outages in a distributed system is captured by the CAP theorem, conjectured by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002: of consistency, availability and partition tolerance, a system can guarantee at most two, which in practice means that when a network partition occurs it must sacrifice either consistency or availability. Because partitions are unavoidable in the field, IoT devices typically choose availability plus partition tolerance (AP) and lean towards an eventually consistent data model.

Edge computing vs cloud-only

A wide spectrum of designs lies between the two extremes:

  • Cloud-only: the device handles only sensing and actuation; every decision happens in the cloud. Easy to build, but useless when the network is down.
  • Edge-first / fog computing: critical decisions are made on the device; the cloud is used for analytics and long-term optimisation. NIST SP 500-325 (Fog Computing Conceptual Model) formalises this layer.
  • Hybrid / hierarchical: local control rules on the device, heavy model inference in the cloud, with a gateway acting as cache in between.

For applications that demand physical continuity — irrigation, industrial drives, energy management — an edge-first approach is in practice mandatory.
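The edge-first split can be sketched as a control loop in which the safety-critical rule always runs locally and cloud advice is consulted only opportunistically. This is an illustrative sketch, not a specific framework API; the struct fields, thresholds and function names are assumptions for an irrigation-style controller.

```c
#include <stdbool.h>

/* Hypothetical state for an edge-first irrigation controller. */
typedef struct {
    float soil_moisture;          /* normalised 0.0 .. 1.0 */
    bool  cloud_online;           /* is the uplink currently up? */
    bool  cloud_wants_irrigation; /* last advice received; may be stale */
} edge_state_t;

/* The local rule decides on every cycle; cloud advice only refines the
 * decision when the link is up. With the link down the device still
 * behaves sensibly on its own. */
bool should_irrigate(const edge_state_t *s) {
    if (s->soil_moisture < 0.20f)   /* hard local threshold: always act */
        return true;
    if (s->cloud_online)            /* optional cloud-side optimisation */
        return s->cloud_wants_irrigation;
    return false;                   /* offline and not critical: hold */
}
```

The key property is that the first branch never depends on connectivity: losing the cloud degrades optimisation quality, not physical safety.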

Store-and-forward telemetry

To prevent data loss when the link drops, the device writes telemetry to local persistent storage and forwards it in order once connectivity returns. Common data structures:

  • Ring (circular) buffer: fixed size; the oldest record is overwritten by the newest. The canonical choice in embedded systems.
  • Append-only log (write-ahead log, WAL): records are only ever appended, never rewritten in place, so a torn write at the tail can be detected and discarded, preserving consistency even across power loss.
  • FIFO queue: preserves message order; pairs naturally with persistent sessions and delivery guarantees such as MQTT QoS 1/2.

Tagging each record with a monotonic timestamp and, ideally, an idempotency key prevents the double-processing problem on replay.
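The ring buffer plus monotonic sequence number can be sketched as follows. The names and layout are illustrative, not a specific library's API; a real device would flush the records to flash or EEPROM, whereas here they live in RAM for clarity. The sequence number doubles as the idempotency key mentioned above.

```c
#include <stdint.h>

#define RING_CAPACITY 8

typedef struct {
    uint32_t seq;        /* monotonic idempotency key, never reused */
    uint32_t timestamp;  /* seconds since epoch, from the RTC */
    float    value;
} record_t;

typedef struct {
    record_t slots[RING_CAPACITY];
    uint32_t next_seq;   /* keeps growing even after overwrites */
    uint32_t count;      /* number of valid records (<= capacity) */
} ring_t;

/* Overwrite the oldest slot when full: classic circular-buffer policy. */
void ring_push(ring_t *r, uint32_t timestamp, float value) {
    record_t *slot = &r->slots[r->next_seq % RING_CAPACITY];
    slot->seq = r->next_seq++;
    slot->timestamp = timestamp;
    slot->value = value;
    if (r->count < RING_CAPACITY) r->count++;
}

/* Oldest surviving sequence number: on overflow the oldest records are
 * silently dropped, which the uplink can detect as a gap in seq. */
uint32_t ring_oldest_seq(const ring_t *r) {
    return r->next_seq - r->count;
}
```

Because seq never repeats, the backend can deduplicate replayed records and also detect how many were lost to overwrites while the device was offline.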

Time consistency: RTC and NTP

Timestamp accuracy is critical on an offline device. Hardware RTC (Real-Time Clock) chips deliver typical accuracies in the ±2 ppm to ±5 ppm range (e.g. the DS3231, a temperature-compensated part, at ±2 ppm; the crystal-driven MCP7940 closer to ±5 ppm depending on the crystal), corresponding to roughly one to three minutes of drift per year. A backup battery (CR2032 or supercapacitor) keeps the clock alive across power loss. When connectivity returns, drift is corrected by NTP (RFC 5905) or SNTP; in critical systems GNSS-based time synchronisation is used.
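The ppm figures translate directly into a worst-case drift bound, which tells you how long the device can stay offline before its timestamps exceed an application tolerance. A minimal helper, with illustrative names:

```c
#include <stdint.h>

/* Worst-case clock drift: 1 ppm = 1e-6 s of error per elapsed second.
 * 64-bit intermediate avoids overflow for year-scale intervals. */
int32_t max_drift_seconds(int32_t ppm, int32_t elapsed_seconds) {
    return (int32_t)(((int64_t)ppm * elapsed_seconds) / 1000000);
}
```

Over one year (31,536,000 s), ±2 ppm bounds the drift at about 63 s and ±5 ppm at about 157 s, which is where the "one to three minutes per year" rule of thumb comes from.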

Watchdog and recovery

To recover from software failure (infinite loop, memory corruption) without an on-site visit, embedded systems use a watchdog timer. A hardware watchdog resets the MCU if it is not "kicked" periodically. Complementary techniques:

  • Brown-out detection: a clean reset when the supply voltage falls below a threshold.
  • Battery-backed RAM / FRAM: retention of state across resets.
  • Persistent state journaling: atomic writes for information such as which valve is currently open or the latest counter values.
  • Crash-safe filesystems: power-loss-tolerant filesystems such as LittleFS (designed for power-fail resilience; SPIFFS offers weaker guarantees here).
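The watchdog kick/timeout contract described above can be illustrated with a small simulation. Real MCUs expose this via hardware registers or a vendor SDK feed/refresh call; the types and function names below are made up for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated hardware watchdog: wdt_tick() stands in for the free-running
 * hardware counter, wdt_kick() for the periodic refresh from the main
 * loop, and `fired` for the MCU reset that fires on timeout. */
typedef struct {
    uint32_t timeout_ticks;
    uint32_t counter;
    bool     fired;
} wdt_t;

void wdt_kick(wdt_t *w) { w->counter = 0; }

void wdt_tick(wdt_t *w) {
    if (++w->counter >= w->timeout_ticks)
        w->fired = true;  /* on real hardware: MCU reset */
}
```

A healthy main loop kicks before the counter reaches the timeout; a loop stuck in memory corruption or an infinite loop stops kicking, the counter runs out, and the reset recovers the device without a site visit.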

Eventual consistency

The accepted patterns for reconciling decisions made offline with the cloud are last-write-wins (LWW) or operational transforms / CRDTs. In embedded systems the typical choice is a monotonic counter plus timestamp rather than vector clocks.
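A counter-plus-timestamp last-write-wins merge can be sketched in a few lines. This is an illustrative reconciliation rule, not a specific CRDT library; the struct and field names are assumptions.

```c
#include <stdint.h>

/* Each replica stamps its state with (counter, timestamp). The merge
 * keeps the entry with the higher counter and falls back to the
 * timestamp only on a counter tie. */
typedef struct {
    uint32_t counter;    /* monotonic per-device update counter */
    uint32_t timestamp;  /* RTC seconds, corrected by NTP when online */
    int32_t  value;
} lww_entry_t;

lww_entry_t lww_merge(lww_entry_t local, lww_entry_t remote) {
    if (remote.counter != local.counter)
        return (remote.counter > local.counter) ? remote : local;
    return (remote.timestamp > local.timestamp) ? remote : local;
}
```

Using the counter as the primary key makes the merge robust against RTC drift: a device whose clock ran slow while offline still wins if it genuinely made the later update.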