Chunked OTA Over LoRa on an ESP32-C3

ESP32-C3 sensor node on the bench — LoRa module, antenna, and battery pack wired up.

COVER·ESP32-C3 NODE · SX127x · 18650 PACK·BENCH, BEFORE IT WENT UNDER A BRIDGE

The Constraint

A sensor node sits under a bridge. A soil-moisture probe. A thermometer bolted to a turbine. Somewhere a firmware engineer is regretting a hardcoded threshold. The device runs on a battery, talks over LoRa, and is never going to see a USB cable again.

You still have to update it.

LoRa is not a friendly medium for this. At SF7 / 125 kHz, a frame carries at most 255 bytes; after framing and header overhead you’re handing the firmware about 200 usable bytes per transmission. There is no retransmit built in, no flow control, no guarantee anything you send arrives, and the European duty-cycle rule caps your airtime at 1%, which works out to 36 seconds of transmission per hour.

We built a chunked OTA protocol for an ESP32-C3 node that accepts these constraints instead of pretending they don’t exist. It is maybe 700 lines of firmware. It survives a reboot halfway through an update. It is deliberately boring.

The Packet

Every chunk on the wire is a JSON header, a pipe character, and a binary payload:

FIG 01·CHUNK PACKET LAYOUT·HEADER + PIPE + BINARY FIRMWARE

Why JSON, for this? Because the alternative was a custom binary schema, and a custom binary schema is a thing you get wrong four times before it works. JSON is legible in a serial console, every value prints in the log, and the header overhead — about 40 bytes — is not the part of the packet we were trying to optimise. Firmware is big; this header is a fixed tax.

The payload is binary, straight from the firmware image. The LoRa library enables a hardware CRC on the frame itself, so we don’t add a per-chunk checksum. If the frame is corrupt, LoRa drops it and we time out, which is the same recovery path as if it had been lost in the air. Two failure modes collapsed into one is two fewer to test.
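The header-pipe-payload layout can be split on the receiver with a single scan for the first pipe byte. What follows is a minimal sketch, assuming the JSON header never contains a `|` itself. `ChunkView` and `splitChunk` are hypothetical names, not taken from the project's firmware, and parsing the JSON text is left to whatever library the node already uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// View into one received frame: the JSON header text plus a pointer to the
// binary payload. Hypothetical type, for illustration only.
struct ChunkView {
    bool           ok = false;         // false if no pipe separator was found
    std::string    header;             // JSON text before the first '|'
    const uint8_t* payload = nullptr;  // points into the caller's buffer (no copy)
    size_t         payloadLen = 0;
};

// Split "<json>|<binary>" at the FIRST '|'. The payload may itself contain '|'
// bytes, so everything after the first separator is treated as opaque binary.
inline ChunkView splitChunk(const uint8_t* frame, size_t len) {
    assert(frame != nullptr || len == 0);
    ChunkView out;
    const void* sep = (len > 0) ? std::memchr(frame, '|', len) : nullptr;
    if (sep == nullptr) return out;    // malformed frame: no separator
    const size_t headerLen =
        static_cast<size_t>(static_cast<const uint8_t*>(sep) - frame);
    out.header.assign(reinterpret_cast<const char*>(frame), headerLen);
    out.payload    = frame + headerLen + 1;
    out.payloadLen = len - headerLen - 1;
    out.ok         = true;
    return out;
}
```

The payload pointer aliases the receive buffer, so it has to be consumed (for example, handed to the flash writer) before the next frame overwrites it.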

Flashing While Receiving

The ESP32-C3 has two OTA partitions, ota_0 and ota_1. The running image is on one, the incoming update is written to the other, and on Update.end() the bootloader is told to boot from the new one on the next restart. If the restart never happens, the old image stays active. This is standard; it’s also the reason we don’t have to invent anything clever about rollback.

What’s slightly less standard is that we stream bytes to flash as they arrive. Nothing is buffered in RAM beyond one packet. The chunk comes in, the header parses, the binary portion gets handed to Update.write(), and the flash controller takes it from there:

C++ · receive loop, sketched

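// Called on every successfully-received LoRa chunk.
if (hdr.index != expectedChunk) {
    requestChunk(expectedChunk);     // duplicate or out of order: ask for the one we wanted
    return;
}

// Straight to the spare partition. write() returns the number of bytes accepted.
if (Update.write(payload, payloadLen) != payloadLen) {
    Update.abort();                  // flash write failed: drop this session
    otaInProgress = false;
    return;
}
receivedSize  += payloadLen;
expectedChunk += 1;

if (hdr.last || expectedChunk >= totalChunks) {
    // evenIfRemaining=true finalises even if fewer bytes arrived than begin() was told.
    if (Update.end(/*evenIfRemaining=*/true)) {
        otaInProgress = false;       // don't try to resume after the restart
        ESP.restart();
    }
} else {
    requestChunk(expectedChunk);     // implicit ACK: asking for the next chunk
}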

We don’t accumulate the whole image in RAM first. You cannot rely on that; this is a chip with ~400 kB of RAM, much of it already spoken for at runtime, and firmware images can be bigger than what is left. The staging happens in flash, at the offset the Update API maintains for us, and Update.end(true) is the one line that flips the boot partition.

Surviving a Reboot Mid-Update

The thing about a battery-powered sensor node is that sometimes the battery droops. Sometimes a stray reset happens. Sometimes the watchdog trips at an inconvenient moment. None of these events should mean "start the update over from chunk zero over a 200-byte-per-packet radio link."

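The fix is small and important. Every piece of state that matters to the update (otaInProgress, expectedChunk, receivedSize, totalSize, totalChunks) lives in RTC-retained memory, which survives deep sleep. Whether it also survives an unexpected reboot depends on the attribute: ESP-IDF documents RTC_DATA_ATTR as keeping its value across a deep-sleep / wake cycle, and RTC_NOINIT_ATTR as keeping it across a restart as well. A full power loss clears either: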

C++ · RTC-persisted state

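// RTC_DATA_ATTR is documented to keep its value across deep sleep. To also keep it
// across a watchdog or software reset, ESP-IDF provides RTC_NOINIT_ATTR (no
// initialisers; validate the contents on a cold boot).
RTC_DATA_ATTR bool     otaInProgress    = false;
RTC_DATA_ATTR uint16_t expectedChunk    = 0;
RTC_DATA_ATTR uint32_t receivedSize     = 0;
RTC_DATA_ATTR uint32_t totalSize        = 0;
RTC_DATA_ATTR uint16_t totalChunks      = 0;

void setup() {
    if (otaInProgress) {
        // Re-open the spare partition. Check the result, and confirm that your core's
        // Update resumes at receivedSize rather than starting again at offset 0.
        if (Update.begin(totalSize)) {
            requestChunk(expectedChunk);   // ask for the next one we needed
        } else {
            otaInProgress = false;         // could not re-open: wait for a fresh offer
        }
    }
}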

When the node wakes — whether from a scheduled deep-sleep cycle or a crash — the first thing setup() does is ask: were we in the middle of something? If yes, re-open the OTA partition, resume from the chunk we hadn’t acknowledged, and carry on. The gateway was going to get a retransmit request anyway; it doesn’t care whether it came from a device that had timed out or one that had rebooted.
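Retained memory can hold stale or garbage values after a cold boot, so the resume path benefits from a sanity check before the state is trusted. What follows is a minimal sketch, assuming the state is wrapped in a struct guarded by a magic number and an FNV-1a checksum. `OtaState`, `seal`, and `isValid` are hypothetical names, not part of the firmware described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the resume state described above.
struct OtaState {
    uint32_t magic;          // kMagic when written by seal()
    uint32_t receivedSize;
    uint32_t totalSize;
    uint16_t expectedChunk;
    uint16_t totalChunks;
    uint32_t checksum;       // FNV-1a over the fields above
};

constexpr uint32_t kMagic = 0x4F544131u;  // arbitrary tag: "OTA1" in ASCII

// Fold one 32-bit value into a running FNV-1a hash, byte by byte.
inline uint32_t fnv1a(uint32_t hash, uint32_t value) {
    for (int i = 0; i < 4; ++i) {
        hash ^= (value >> (8 * i)) & 0xFFu;
        hash *= 16777619u;               // FNV-1a 32-bit prime
    }
    return hash;
}

// Hash field by field so struct padding never enters the checksum.
inline uint32_t computeChecksum(const OtaState& s) {
    uint32_t h = 2166136261u;            // FNV-1a 32-bit offset basis
    h = fnv1a(h, s.magic);
    h = fnv1a(h, s.receivedSize);
    h = fnv1a(h, s.totalSize);
    h = fnv1a(h, s.expectedChunk);
    h = fnv1a(h, s.totalChunks);
    return h;
}

// Call after every change to the state, before the next possible reset.
inline void seal(OtaState& s) {
    assert(s.expectedChunk <= s.totalChunks);  // catches logic errors in debug builds
    s.magic    = kMagic;
    s.checksum = computeChecksum(s);
}

// True only if the state looks like something seal() wrote and is self-consistent.
inline bool isValid(const OtaState& s) {
    return s.magic == kMagic
        && s.checksum == computeChecksum(s)
        && s.totalChunks > 0
        && s.expectedChunk <= s.totalChunks
        && s.receivedSize <= s.totalSize;
}
```

With a guard like this, setup() would resume only when isValid() returns true, and would otherwise fall back to waiting for a fresh update offer.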

“An OTA that can’t survive a reset is an OTA that works in the lab and fails on the roof.”— lesson from a prior project, priced in flash writes

Stop-and-Wait, Unapologetically

We considered three recovery schemes and chose the simplest.

We did not implement forward error correction. FEC is powerful when your link has a known BER and your chunks are small enough that redundancy bytes are cheap; neither was true for us. We did not implement selective repeat, with its bitmap of missing chunks and partial retransmissions, because the book-keeping overhead on the gateway was going to outweigh the airtime savings for firmware of this size. What we implemented is stop-and-wait: the receiver acknowledges chunks implicitly by asking for the next one; if that chunk doesn’t arrive, the receiver waits briefly and asks again.

The specific behaviour: after receiving chunk N and requesting chunk N+1, the node starts a 4-second timer. If chunk N+1 hasn’t arrived by then, the node emits a retransmit request for N+1. Ten retries per chunk, then the whole update aborts and the node sleeps. Worst case, a chunk that needs the full retry budget costs 40 seconds of wall-clock time, most of it spent listening rather than transmitting.
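The timer-and-retry behaviour can be expressed as a small state machine that is fed the current time, which makes it testable off the device. This is a sketch under assumptions, not the project's code: `ChunkTimer` and `OtaAction` are hypothetical names, and on the node the timestamps would come from millis().

```cpp
#include <cassert>
#include <cstdint>

enum class OtaAction { Wait, Request, Abort };

// Receiver-side stop-and-wait timer. Time is passed in by the caller.
class ChunkTimer {
public:
    ChunkTimer(uint32_t timeoutMs, uint8_t maxRetries)
        : timeoutMs_(timeoutMs), maxRetries_(maxRetries) {
        assert(timeoutMs > 0);
    }

    // Call when a request for the next chunk has just been sent.
    void arm(uint32_t nowMs) { deadlineMs_ = nowMs + timeoutMs_; }

    // Call when the expected chunk arrives: clears the retry count and
    // restarts the timer for the following chunk.
    void onChunk(uint32_t nowMs) {
        retries_ = 0;
        arm(nowMs);
    }

    // Call periodically. Returns Request when the caller should re-send its
    // request, Abort once the retry budget is spent, Wait otherwise.
    OtaAction poll(uint32_t nowMs) {
        // Signed difference keeps the comparison correct across timer wrap.
        if (static_cast<int32_t>(nowMs - deadlineMs_) < 0) return OtaAction::Wait;
        if (retries_ >= maxRetries_) return OtaAction::Abort;
        ++retries_;
        arm(nowMs);
        return OtaAction::Request;
    }

    uint8_t retries() const { return retries_; }

private:
    uint32_t timeoutMs_;
    uint8_t  maxRetries_;
    uint8_t  retries_ = 0;
    uint32_t deadlineMs_ = 0;
};
```

Constructed as `ChunkTimer(4000, 10)`, it issues up to ten re-requests for a chunk before reporting Abort, at which point the caller would drop the update and go back to sleep.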

This is slow. A 200-kilobyte firmware is ~1,000 chunks; at one chunk per second, a clean transfer is on the order of 15-20 minutes. With retries and duty-cycle respect, budget an hour. If your operational model can’t absorb an hour, LoRa is not the right radio for your OTA.

Duty Cycle, and Not Pretending Otherwise

The EU 868 MHz band, in most sub-bands, caps a device at 1% duty cycle; that is, averaged over an hour, you may transmit for no more than 36 seconds. In normal operation, this node is comfortably below that: it wakes every 300 seconds, sends one telemetry packet, and sleeps. At ~100 ms per packet, that is roughly 0.03% duty cycle.

During an update the arithmetic flips. If each chunk takes ~100 ms on air and you’re sending ten per minute, that is one second of airtime per minute, or 60 seconds per hour, which is already past the 36-second cap. The naive implementation, blasting chunks back to back, burns the whole hourly budget in well under a minute.
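The per-packet airtime itself can be estimated rather than guessed. The sketch below follows the time-on-air formula from the Semtech SX1276/77/78/79 datasheet. `loraTimeOnAirMs` is a hypothetical helper, and its defaults (coding rate 4/5, 8-symbol preamble, explicit header, CRC on) are assumptions rather than this project's confirmed radio settings.

```cpp
#include <cassert>
#include <cstdint>

// LoRa time on air in milliseconds, per the Semtech SX127x datasheet formula.
inline double loraTimeOnAirMs(int payloadBytes,
                              int sf,                         // spreading factor, 7..12
                              double bandwidthHz = 125000.0,
                              int codingRate = 1,             // 1 = 4/5 ... 4 = 4/8
                              int preambleSymbols = 8,
                              bool explicitHeader = true,
                              bool crcOn = true,
                              bool lowDataRateOptimize = false) {
    assert(sf >= 7 && sf <= 12);
    assert(payloadBytes >= 0 && payloadBytes <= 255);
    assert(codingRate >= 1 && codingRate <= 4);

    // One symbol lasts 2^SF / BW seconds.
    const double symbolMs = static_cast<double>(1 << sf) / bandwidthHz * 1000.0;

    const int de = lowDataRateOptimize ? 1 : 0;
    const int numerator = 8 * payloadBytes - 4 * sf + 28
                        + (crcOn ? 16 : 0) - (explicitHeader ? 0 : 20);
    const int denominator = 4 * (sf - 2 * de);
    // Integer ceiling division; a non-positive numerator adds no extra symbols.
    const int blocks =
        numerator > 0 ? (numerator + denominator - 1) / denominator : 0;
    const int payloadSymbols = 8 + blocks * (codingRate + 4);

    // The preamble adds 4.25 symbols on top of the programmed length.
    return (preambleSymbols + 4.25 + payloadSymbols) * symbolMs;
}
```

Under those assumptions, at SF7 / 125 kHz a 40-byte packet comes out at about 82 ms and a 240-byte packet at about 379 ms, so a ~100 ms figure fits a short request better than a full chunk. It is worth plugging in your own packet sizes before budgeting airtime.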

Our pacing is the chunk-timeout doing double duty. A 4-second interval between packets, enforced by the state machine waiting for the previous chunk to be acknowledged, keeps on-air time at roughly 2.5% while a transfer is running (100 ms every 4 seconds). The hourly average stays under 1% only as long as the update’s total airtime fits inside that hour’s 36-second budget, so a long transfer has to be spread out. It’s not an elegant solution; it’s a side-effect. It works.

If you were deploying this at scale you would add an explicit duty-cycle meter — LoRaWAN stacks track airtime directly — and a backoff that pauses rather than retries if the budget is spent. We did not. For a single-node deployment we were comfortable with the implicit margin.
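An explicit meter of that kind does not need much state. The following is a minimal sketch, assuming a rolling one-hour window approximated with 60 one-minute buckets and a monotonic millisecond clock. `DutyCycleMeter` is a hypothetical class, not something the project or any LoRa library ships.

```cpp
#include <cassert>
#include <cstdint>

// Rolling one-hour airtime budget, tracked in 60 one-minute buckets.
// The window is therefore accurate to within one minute.
class DutyCycleMeter {
public:
    explicit DutyCycleMeter(uint32_t budgetMsPerHour = 36000)  // 1% of an hour
        : budgetMs_(budgetMsPerHour) {
        assert(budgetMsPerHour > 0);
    }

    // True if a transmission of airtimeMs still fits in the current window.
    bool canTransmit(uint64_t nowMs, uint32_t airtimeMs) {
        expire(nowMs);
        return usedMs_ + airtimeMs <= budgetMs_;
    }

    // Account for a transmission that has just been made.
    void record(uint64_t nowMs, uint32_t airtimeMs) {
        expire(nowMs);
        buckets_[(nowMs / 60000) % 60] += airtimeMs;
        usedMs_ += airtimeMs;
    }

    uint32_t usedMs() const { return usedMs_; }

private:
    // Drop buckets belonging to minutes that have left the one-hour window.
    void expire(uint64_t nowMs) {
        const uint64_t minute = nowMs / 60000;
        if (minute - lastMinute_ >= 60) {   // idle for an hour or more
            for (uint32_t& b : buckets_) b = 0;
            usedMs_ = 0;
            lastMinute_ = minute;
            return;
        }
        while (lastMinute_ < minute) {
            ++lastMinute_;
            uint32_t& b = buckets_[lastMinute_ % 60];
            usedMs_ -= b;                   // that minute is now over an hour old
            b = 0;
        }
    }

    uint32_t budgetMs_;
    uint32_t usedMs_ = 0;
    uint64_t lastMinute_ = 0;
    uint32_t buckets_[60] = {};
};
```

The pause-rather-than-retry backoff then becomes a check before each request: if canTransmit() is false, the node sleeps until the window has room instead of spending a retry.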

Parameters

Six constants define the whole protocol. Change these and you’re tuning the trade between latency, retry budget, and airtime:

TBL 01 · Constants and their consequences
Name	Default	What it changes
`LORA_SPREADING_FACTOR`	7	SF7 = fastest, shortest range. SF12 = roughly 5× range, 32× longer symbols (about 20× the airtime).
`LORA_BANDWIDTH`	125 kHz	Narrower bandwidth = better sensitivity, longer airtime.
`LORA_TX_POWER`	17 dBm	Watch your local regs before nudging this.
`OTA_CHUNK_TIMEOUT`	4 000 ms	Stop-and-wait interval. Also, implicit duty-cycle throttle.
`OTA_MAX_RETRIES`	10	Per-chunk retry budget before the whole update aborts.
`SLEEP_DURATION_SECONDS`	300	How often the node wakes to check for updates.

A useful rule we wrote on the whiteboard: if you’re tempted to reduce OTA_CHUNK_TIMEOUT to make updates faster, compute the resulting duty cycle first. Nine times out of ten you can’t actually go faster; you can only get closer to breaking the regulation.
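The whiteboard rule reduces to two one-line calculations. The helpers below are a hypothetical sketch; the ~100 ms airtime used in the usage note is the working figure from earlier in this note, not a measured constant.

```cpp
#include <cassert>
#include <cstdint>

// Duty cycle, in percent, of one transmission of airtimeMs every intervalMs.
inline double dutyCyclePercent(uint32_t airtimeMs, uint32_t intervalMs) {
    assert(intervalMs > 0);
    return 100.0 * static_cast<double>(airtimeMs) / static_cast<double>(intervalMs);
}

// Smallest interval (ms) between transmissions that keeps the sustained duty
// cycle at or below limitPercent.
inline uint32_t minIntervalMs(uint32_t airtimeMs, double limitPercent) {
    assert(limitPercent > 0.0);
    const double raw = static_cast<double>(airtimeMs) * 100.0 / limitPercent;
    const uint32_t whole = static_cast<uint32_t>(raw);
    return (static_cast<double>(whole) < raw) ? whole + 1 : whole;  // round up
}
```

With a 100 ms packet, `dutyCyclePercent(100, 4000)` gives 2.5 and `minIntervalMs(100, 1.0)` gives 10000: a sustained 1% needs ten seconds between packets, so shortening a 4-second timeout only moves further from the limit.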

What We’d Change

Three things we’d put in before the next deployment.

1. A cryptographic signature.

Right now the bootloader’s built-in header checks are all that stands between a receiver and a malicious firmware image. For a node on someone else’s roof, that isn’t enough. An Ed25519 signature over the image, verified against a public key baked into the firmware image, costs a handful of kilobytes and rules out an entire category of attack. This is the first thing we’d add.¹

2. A proper watchdog dance during flash writes.

Flash writes on the ESP32-C3 can block for tens of milliseconds, occasionally longer if a sector erase is triggered. The task watchdog is on by default. A long write that stalls a watched task can trip it; we haven’t seen it in practice, but we’ve seen enough similar incidents on other chips to respect the pattern. Explicit esp_task_wdt_reset() calls on the write path are cheap insurance.

3. A manifest step before transfer.

Currently the node asks "got anything for me?" and the gateway says "yes, 200 kB, 1,000 chunks, go." A manifest — version, size, SHA-256, signature, rollout window — lets the node refuse an update it already has, verify integrity before reboot, and log exactly which image it’s running. Twelve bytes of header become a few hundred bytes of schema; in exchange you can actually run a fleet.
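As a rough illustration of the manifest step, the pre-transfer checks that need no cryptography can be written as one pure function. Everything here is an assumption: `OtaManifest`, its field layout, `ManifestVerdict`, and `checkManifest` are hypothetical, the node is assumed to have a usable clock for the rollout window, and verifying the SHA-256 and the signature is deliberately left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical manifest; field names and sizes are illustrative only.
struct OtaManifest {
    uint32_t version;                    // monotonically increasing build number
    uint32_t imageSize;                  // bytes
    std::array<uint8_t, 32> sha256;      // digest of the full image
    std::array<uint8_t, 64> signature;   // e.g. Ed25519 over the fields above
    uint32_t notBefore;                  // rollout window start, Unix seconds
    uint32_t notAfter;                   // rollout window end, Unix seconds
};

enum class ManifestVerdict { Accept, AlreadyInstalled, TooLarge, OutsideWindow };

// Cheap checks to run before any chunk is requested. Digest and signature
// verification would follow separately and are not shown here.
inline ManifestVerdict checkManifest(const OtaManifest& m,
                                     uint32_t runningVersion,
                                     uint32_t partitionSize,
                                     uint32_t nowUnix) {
    assert(partitionSize > 0);
    if (m.version <= runningVersion) return ManifestVerdict::AlreadyInstalled;
    if (m.imageSize == 0 || m.imageSize > partitionSize) return ManifestVerdict::TooLarge;
    if (nowUnix < m.notBefore || nowUnix > m.notAfter) return ManifestVerdict::OutsideWindow;
    return ManifestVerdict::Accept;
}
```

Refusing an image the node already runs, before a single chunk is requested, is the check that saves the most airtime.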

None of this is novel. All of it would have been more work than the thing is currently worth. That’s the whole story of a first OTA system: ship the embarrassingly simple one, watch it not catch fire, then add the features the second one needs.²

Footnotes

¹ ESP-IDF has a secure-boot mode that does a lot of this for you, but we intentionally targeted the Arduino framework to keep the project approachable. Bolting signature verification on top of Update is a few dozen lines; the cost is more in key-management discipline than in code.
² We have watched people try to ship "the second one" as their first system. They are always, without exception, still trying to ship it the following year.
