Error Handling & Rollback Mechanisms

In production-grade migration automation, failure is not an edge case — it is an expected operational state. When orchestrating Automated Data Migration from Relational & JSON Sources, the distinction between a resilient pipeline and a corrupted graph topology hinges on how transaction boundaries, retry semantics, and compensating logic are engineered at the driver level. Neo4j’s ACID guarantees provide a robust foundation, but platform teams must layer explicit error isolation, idempotent execution, and state-aware rollback on top of them to preserve integrity during high-throughput loads. This page defines the decision rules, implementation patterns, and failure-mode catalogue that make a migration recoverable by construction rather than by manual reconciliation.

Prerequisite Concepts

Error handling is the last line of defence, not the first. It presumes that upstream stages have already produced clean, deterministic work. Before implementing the patterns below, the reader should be comfortable with:

The overall pipeline contract. Rollback logic only makes sense inside the end-to-end flow defined by the parent Automated Data Migration from Relational & JSON Sources reference — extract, map, validate, chunk, load, verify.
Chunk lifecycle and commit boundaries. Recovery points align with transaction boundaries, so you must first understand how work is partitioned in Batch Processing & Chunking Workflows. A rollback strategy is only as granular as its chunking.
Idempotent writes. Driver-managed retries re-execute the whole transaction function, so every write must be an idempotent MERGE backed by a uniqueness constraint. Retry safety is a property of the Cypher, not the try/except.
A stable target schema. Compensating logic assumes the labels, relationship types, and constraints it cleans up are fixed. If the model is still moving, stabilise it first against Neo4j Graph Schema Design & Architecture.

Conceptual Model: Failure Classification & Recovery Paths

Every unit of work exits through exactly one of three doors: commit, retry, or quarantine. The driver classifies the underlying exception; the pipeline decides what to do with the classification. Transient faults loop back through backoff; permanent faults are serialized to a durable dead-letter queue with enough context to reconstruct the failure; only a clean result commits its side effects.

The diagram below traces how a chunk moves through retry, backoff, and quarantine paths.

Each chunk carries a state in a durable ledger, and only a committed chunk advances the external watermark. The state diagram below shows the three transitions out of IN_FLIGHT and how a resumed run replays the watermark to skip work that already landed.
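
In code, those transitions could be sketched roughly as follows. This is a minimal in-memory illustration, not a durable ledger: the `ChunkLedger` class, its method names, and every state name other than IN_FLIGHT are assumptions made for the example, as is the choice to advance the watermark only across a contiguous run of committed offsets.

```python
from enum import Enum


class ChunkState(Enum):
    IN_FLIGHT = "in_flight"
    COMMITTED = "committed"
    RETRYING = "retrying"
    QUARANTINED = "quarantined"


# The three exits from IN_FLIGHT, plus re-dispatch after a backoff sleep.
ALLOWED = {
    ChunkState.IN_FLIGHT: {ChunkState.COMMITTED, ChunkState.RETRYING,
                           ChunkState.QUARANTINED},
    ChunkState.RETRYING: {ChunkState.IN_FLIGHT},
    ChunkState.COMMITTED: set(),      # terminal
    ChunkState.QUARANTINED: set(),    # terminal
}


class ChunkLedger:
    """In-memory stand-in for a durable ledger (hypothetical)."""

    def __init__(self):
        self.states = {}       # offset -> ChunkState
        self.watermark = -1    # highest contiguous committed offset

    def dispatch(self, offset):
        self.states[offset] = ChunkState.IN_FLIGHT

    def transition(self, offset, new_state):
        current = self.states[offset]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.name} -> {new_state.name}")
        self.states[offset] = new_state
        # Only committed chunks move the watermark, and only without gaps.
        while self.states.get(self.watermark + 1) is ChunkState.COMMITTED:
            self.watermark += 1
```

A resumed run would skip every offset at or below `watermark` and re-dispatch the rest; a quarantined chunk leaves a gap that holds the watermark in place until it is resolved.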

Design Rules: Classifying & Routing Failures

Resilient pipelines encode a fixed policy for each failure category rather than reacting ad hoc. The matrix below is the decision table every handler in the pipeline should implement identically.

Failure category	Representative exceptions	Retryable?	Routing action
Transient infrastructure	`ServiceUnavailable`, `SessionExpired`	Yes (driver-managed)	Exponential backoff + jitter, then retry the transaction function
Lock contention	`TransientError` (`LockAcquisitionTimeout`, deadlock detected)	Yes (bounded)	Retry with reduced concurrency / smaller chunk
Constraint violation	`ConstraintValidationFailed`	No	Quarantine the offending records; never retry blindly
Malformed / contract failure	schema-validation errors caught pre-flight	No	Reject at the ingestion layer, before any transaction opens
Client / query error	`CypherSyntaxError`, `ClientError`	No	Fail fast — this is a code defect, not a data defect

Four rules follow directly from this table:

Isolate side effects. Publishing to a dead-letter queue, incrementing a Prometheus counter, or advancing a checkpoint ledger must happen outside the transaction function and only after the driver returns a successful result. A side effect committed inside a retried transaction fires multiple times.
Never retry a constraint violation. A ConstraintValidationFailed is deterministic — the same records will fail again. Retrying wastes lock time and masks a data-quality problem that belongs in quarantine.
Quarantine before rollback. Malformed records should be routed to a durable queue with an explicit error code, not allowed to abort a database-level transaction. Reserve rollback for genuine partial-write recovery.
Bound every retry. Unbounded retries against a saturated database amplify the outage. Cap attempts, then dead-letter.
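
One way to keep every handler consistent with the table is to route on the server status code (the `exc.code` string the driver exposes on server errors) through a single function. The sketch below is illustrative: the `Route` enum and `route_for` name are assumptions, and connection-level faults such as `ServiceUnavailable` carry no server status code, so they would need a separate branch on exception type.

```python
from enum import Enum


class Route(Enum):
    RETRY = "retry"
    QUARANTINE = "quarantine"
    FAIL_FAST = "fail_fast"


def route_for(status_code: str) -> Route:
    """Map a Neo4j status code to a routing action.

    Status codes have the shape Neo.<Classification>.<Category>.<Title>.
    """
    parts = status_code.split(".")
    well_formed = len(parts) == 4 and parts[0] == "Neo"
    classification = parts[1] if well_formed else ""

    if classification == "TransientError":
        return Route.RETRY            # the caller still bounds the attempts
    if status_code == "Neo.ClientError.Schema.ConstraintValidationFailed":
        return Route.QUARANTINE       # deterministic data defect
    return Route.FAIL_FAST            # code defect or unknown: never retry
```

Defaulting unknown codes to `FAIL_FAST` is a deliberate choice in this sketch: an unrecognised failure is treated as a defect to investigate rather than something to retry.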

Step-by-Step Implementation

Step 1 — Transaction boundaries and driver-level retry

The Neo4j Python driver (neo4j 5.x) enforces transactional safety through managed execution blocks: session.execute_write() and session.execute_read(). These abstractions classify exceptions into transient, client, or database-level categories and route recoverable failures through an internal exponential-backoff loop. When a transient error triggers a retry, the entire transaction function re-executes from scratch — which is exactly why the write must be idempotent and why side effects must live outside it.

python

import hashlib
import json

import structlog
from neo4j import GraphDatabase
from neo4j.exceptions import TransientError, ClientError, ServiceUnavailable
from prometheus_client import Counter

logger = structlog.get_logger()
RETRY_METRIC = Counter("neo4j_migration_retries_total", "Total driver-level retries")

def load_batch(tx, chunk: list[dict], entity_type: str):
    # The transaction function may run MORE THAN ONCE on transient retry.
    # It must therefore be pure and idempotent: parameterised, MERGE-based,
    # and free of any external side effects.
    #
    # str.format injects only the (validated) label name; literal Cypher map
    # braces are escaped as {{ }} so they survive .format() untouched.
    query = """
    UNWIND $records AS r
    MERGE (n:`{entity}` {{id: r.id}})
    ON CREATE SET n += r.properties
    RETURN count(n) AS upserted
    """.format(entity=entity_type)

    return tx.run(query, records=chunk).single()["upserted"]

def execute_with_isolation(session, chunk, entity_type):
    try:
        upserted = session.execute_write(load_batch, chunk, entity_type)
        # Side effects commit ONLY after the driver returns a committed result.
        logger.info("batch_committed", entity=entity_type, upserted=upserted)
        return upserted
    except ServiceUnavailable as exc:
        # Driver already exhausted its managed retries: infrastructure is down.
        RETRY_METRIC.inc()
        logger.warning("transient_exhausted", error=str(exc))
        raise
    except ClientError as exc:
        # Non-retryable: syntax error, constraint violation, auth failure.
        # Use a SHA-256 digest rather than built-in hash(): hash() on strings
        # is salted per process, so it cannot correlate logs across runs.
        payload_hash = hashlib.sha256(
            json.dumps(chunk, sort_keys=True, default=str).encode()
        ).hexdigest()
        logger.error("client_error", error=str(exc), code=exc.code,
                     payload_hash=payload_hash)
        raise

The entity_type label must come from a bounded allow-list, never straight from source data. Because the label is interpolated into the query text rather than passed as a parameter, an unvalidated value is an injection vector; unbounded dynamic labels also fragment planner statistics and are a documented property graph anti-pattern. Confine variance to a validated node label taxonomy. For the authoritative lifecycle rules, consult the Neo4j Python Manual: Transactions.
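
A minimal sketch of such an allow-list check, meant to run before the label ever reaches `str.format`. The `ALLOWED_LABELS` set and the `validated_label` helper are hypothetical; only `Customer` appears elsewhere on this page, and the other labels are placeholders.

```python
# Hypothetical taxonomy: replace with the labels your target schema defines.
ALLOWED_LABELS = frozenset({"Customer", "Order", "Product"})


def validated_label(entity_type: str) -> str:
    """Return entity_type unchanged if it is allow-listed, otherwise raise."""
    if entity_type not in ALLOWED_LABELS:
        raise ValueError(f"label {entity_type!r} is not in the allow-list")
    return entity_type
```

Calling `validated_label(entity_type)` at the top of the loader turns an unexpected label into an immediate, non-retryable failure instead of a malformed query.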

Step 2 — Custom retry with backoff and jitter

The driver’s managed retry covers transient infrastructure faults, but application-level policy (lock contention that warrants a smaller chunk, dead-lettering after N attempts) needs an explicit wrapper. Full jitter prevents the thundering-herd effect where every parallel worker retries in lockstep.

python

import random
import time
from neo4j.exceptions import TransientError

def run_chunk_with_policy(session, chunk, entity_type, *, max_attempts=5, base=0.5, cap=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return execute_with_isolation(session, chunk, entity_type)
        except TransientError as exc:
            if attempt == max_attempts:
                # Bounded retries exhausted — hand off to compensating logic.
                quarantine(chunk, entity_type, reason="transient_exhausted", detail=str(exc))
                raise
            # Full jitter: sleep in [0, min(cap, base * 2**attempt)).
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
            RETRY_METRIC.inc()
            logger.warning("chunk_retry", attempt=attempt, delay=round(delay, 3))
            time.sleep(delay)
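
To see what the defaults above imply, the following sketch computes the upper bound of the jitter window per attempt. The helper name `backoff_ceiling` is an assumption for illustration; the formula is the one used in the wrapper.

```python
def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound of the full-jitter sleep window for a 1-indexed attempt."""
    return min(cap, base * (2 ** attempt))


ceilings = [backoff_ceiling(a) for a in range(1, 8)]
print(ceilings)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]

# With max_attempts=5 only attempts 1-4 sleep (attempt 5 quarantines and raises),
# so the worst-case total sleep per chunk is 1 + 2 + 4 + 8 seconds.
worst_case = sum(ceilings[:4])
print(worst_case)  # 15.0
```

Because each sleep is drawn uniformly from zero to the ceiling, the expected total is about half the worst case, which is worth factoring into any per-chunk timeout.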

Step 3 — Compensating logic and quarantine

When a chunk fails permanently, deterministic recovery means: freeze forward progress, serialize the failed payload with reconstructable context, and record the offset so a rerun resumes cleanly. Because each write is idempotent, a partially applied chunk needs no destructive “undo” — replaying it converges to the same graph state. That is the practical substitute for a distributed rollback.

python

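import hashlib
import json
from datetime import datetime, timezone

from prometheus_client import Gauge

# Metric name matches the quarantine-depth signal tracked on the dashboard.
QUARANTINE_DEPTH = Gauge("neo4j_migration_quarantine_queue_depth",
                         "Chunks currently held in the dead-letter queue")
# dead_letter_queue is assumed to be a client for a durable queue that exposes
# publish(record); it is not defined here, so wire in your own.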

def quarantine(chunk, entity_type, *, reason, detail):
    # Durable, reconstructable failure record — never a swallowed exception.
    record = {
        "entity_type": entity_type,
        "reason": reason,
        "detail": detail,
        "offset_ids": [r["id"] for r in chunk],
        "payload_sha256": hashlib.sha256(
            json.dumps(chunk, sort_keys=True, default=str).encode()
        ).hexdigest(),
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    dead_letter_queue.publish(record)          # durable side effect (outside any tx)
    QUARANTINE_DEPTH.inc()
    logger.error("chunk_quarantined", **record)

Constraint-first design and the compensating-transaction strategy behind this are covered in depth in Implementing idempotent migration scripts for Neo4j.
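
Before a quarantined chunk is replayed, it can help to confirm that the re-extracted payload is the one that was quarantined. The sketch below assumes the record layout produced by `quarantine` above; the function names are hypothetical.

```python
import hashlib
import json


def chunk_digest(chunk):
    # Same canonical serialisation as the quarantine record:
    # sorted keys, with str() as the fallback for non-JSON types.
    return hashlib.sha256(
        json.dumps(chunk, sort_keys=True, default=str).encode()
    ).hexdigest()


def safe_to_replay(record, chunk):
    """True only if the chunk matches the quarantined ids and payload digest."""
    return (
        record["offset_ids"] == [r["id"] for r in chunk]
        and record["payload_sha256"] == chunk_digest(chunk)
    )
```

A mismatch means the source changed between the failure and the rerun, which is a reason to re-validate the chunk rather than replay it blindly.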

Step 4 — State-aware checkpoint rollback

High-throughput initial loads must not be monolithic. Partition the dataset into fixed windows (typically 5,000–10,000 records per chunk) and track progress in an external checkpoint ledger so a failed run resumes from the last committed watermark instead of the beginning. When a chunk fails, the pipeline should:

Halt subsequent chunk dispatch (fail closed, not open).
Serialize the failed payload to the durable dead-letter queue.
Emit a migration_chunk_failed metric with offset metadata.
Resume from the last successful checkpoint once compensating logic has cleared any partial state.

python

def resume_from_checkpoint(driver, chunks, entity_type, ledger):
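    # e.g. 42; the ledger must return -1 (not None) when nothing is committed yet,
    # otherwise the comparison below raises TypeError on a first run.
    watermark = ledger.last_committed_offset(entity_type)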
    with driver.session() as session:
        for offset, chunk in enumerate(chunks):
            if offset <= watermark:
                continue                                   # already committed — skip
            run_chunk_with_policy(session, chunk, entity_type)
            ledger.commit_offset(entity_type, offset)      # side effect AFTER success
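
The `ledger` object is left abstract above. As one possible shape, here is a small SQLite-backed sketch exposing the same two methods. The class name, table layout, and the choice of SQLite are all assumptions for illustration; any durable store with an atomic upsert would serve.

```python
import sqlite3


class SqliteLedger:
    """Hypothetical checkpoint ledger backed by a local SQLite database."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoint ("
            "entity_type TEXT PRIMARY KEY, last_offset INTEGER NOT NULL)"
        )
        self.conn.commit()

    def last_committed_offset(self, entity_type):
        row = self.conn.execute(
            "SELECT last_offset FROM checkpoint WHERE entity_type = ?",
            (entity_type,),
        ).fetchone()
        return row[0] if row else -1    # -1 means nothing committed yet

    def commit_offset(self, entity_type, offset):
        # Upsert in one transaction, and never move the watermark backwards.
        with self.conn:
            self.conn.execute(
                "INSERT INTO checkpoint (entity_type, last_offset) VALUES (?, ?) "
                "ON CONFLICT(entity_type) DO UPDATE SET "
                "last_offset = MAX(last_offset, excluded.last_offset)",
                (entity_type, offset),
            )
```

Returning -1 for an unseen entity type keeps the `offset <= watermark` comparison valid on a first run, and the `MAX` guard means a late or duplicated commit cannot rewind progress.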

This checkpoint model is what lets platform teams validate graph topology against the source-of-truth dataset before flipping traffic during a legacy cutover, which narrows the window in which dual writes can diverge. See the Neo4j Operations Manual: Backup & Restore for snapshot automation that anchors those recovery points.

Constraint & Validation Layer

Retries are only safe when the underlying Cypher is genuinely idempotent, and idempotency depends on schema-level uniqueness. Without a uniqueness constraint, a MERGE under concurrent load can create duplicate nodes because two transactions each find no match and both insert. Create the backing constraint first:

cypher

// Neo4j 5.x — the constraint that makes MERGE-on-id idempotent under retry.
CREATE CONSTRAINT customer_id_unique IF NOT EXISTS
FOR (c:Customer) REQUIRE c.id IS UNIQUE;
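
When the load covers several labels, the same statement can be generated for each one. The sketch below only builds the Cypher text; the helper name and the `<label>_<property>_unique` naming scheme are assumptions that mirror the `customer_id_unique` example. Identifiers are checked against a conservative pattern because they are interpolated into the statement.

```python
import re

_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def uniqueness_constraint(label: str, prop: str = "id") -> str:
    """Build a CREATE CONSTRAINT statement for one label (Neo4j 5.x syntax)."""
    for name in (label, prop):
        if not _IDENT_RE.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


print(uniqueness_constraint("Customer"))
```

Running the generated statements once, before the first chunk is dispatched, keeps constraint creation out of the load path.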

With the constraint in place, a retried MERGE (:Customer {id: $id}) either matches the existing node or fails with a ConstraintValidationFailed that the handler routes to quarantine — never a silent duplicate. Two further validation gates keep failures out of the transaction path entirely:

Structural mapping failures. Most transformation errors are structural, not network-related: mismatched foreign keys, null-constraint violations, or type-coercion errors. Catch these where the mapping is defined — Relational Schema Mapping Strategies — and route malformed rows to quarantine at the ingestion layer instead of aborting a database transaction.
Document-shape failures. Hierarchical omissions and array-to-relationship coercion errors surface during JSON Document Flattening & Graph Conversion. A pre-flight gate that enforces JSON Schema compliance, checks relationship cardinality, and verifies property type casting drastically reduces runtime ConstraintValidationFailed exceptions.

Both gates belong to the broader contract-enforcement discipline detailed in Data Validation & Integrity Checks; error handling and validation are two halves of the same resilience budget.
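
A pre-flight gate of this kind can be quite small. The sketch below assumes the `{id, properties}` record shape used by the loader; the function name and the error codes are hypothetical, and a real gate would add JSON Schema and cardinality checks on top.

```python
def preflight(records):
    """Split records into (valid, rejected) before any transaction opens."""
    valid, rejected, seen_ids = [], [], set()
    for r in records:
        if r.get("id") is None:
            rejected.append({"record": r, "code": "MISSING_FIELD", "detail": "id"})
        elif not isinstance(r.get("properties"), dict):
            rejected.append({"record": r, "code": "BAD_TYPE", "detail": "properties"})
        elif r["id"] in seen_ids:
            # A duplicate key inside one chunk would otherwise collapse silently
            # into a single node under MERGE.
            rejected.append({"record": r, "code": "DUPLICATE_ID", "detail": r["id"]})
        else:
            seen_ids.add(r["id"])
            valid.append(r)
    return valid, rejected
```

Only `valid` goes on to the loader; `rejected` carries an explicit code per record, so it can be published to quarantine without ever touching the database.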

Performance & Scale Considerations

Error-handling design directly shapes throughput, because every retry, lock wait, and rollback consumes capacity the happy path also needs.

Chunk size versus retry cost. Larger chunks amortise round-trip overhead but raise the cost of a single failure — a 50,000-record transaction that aborts on record 49,999 wastes the whole window and holds locks longer. Keep chunks in the 5,000–10,000 range so a retry re-executes a bounded unit. This trades directly against the tuning covered in Initial Load Performance Tuning.
Lock contention scales with concurrency, not data volume. Parallel writers touching overlapping nodes (shared reference data, dense hub nodes) drive LockAcquisitionTimeout. When telemetry shows sustained deadlock or lock-timeout rates, reduce write concurrency or partition chunks by key range so workers touch disjoint regions.
Constraint checks add write-time cost but save reconciliation cost. A uniqueness constraint validates on every MERGE, adding index lookups; the alternative — post-load dedup scans across the whole graph — is far more expensive and non-deterministic. Pay the write-time cost.
Dead-letter depth is a leading indicator. A growing neo4j_migration_quarantine_queue_depth predicts a systemic data-quality or schema-drift problem long before record counts visibly diverge. Alert on its slope, not just its absolute value.
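
As an illustration of keeping workers on disjoint regions, the sketch below buckets records by a stable hash of their id. Note that this is hash partitioning, swapped in for the key-range partitioning mentioned above because it needs no knowledge of the key distribution; the function name is an assumption.

```python
import zlib
from collections import defaultdict


def partition_by_key(records, workers):
    """Assign each record to exactly one worker bucket by a stable hash of its id."""
    buckets = defaultdict(list)
    for r in records:
        # crc32 is stable across processes, unlike built-in hash() on strings.
        bucket = zlib.crc32(str(r["id"]).encode()) % workers
        buckets[bucket].append(r)
    return dict(buckets)
```

This guarantees only that no two workers MERGE the same keyed node. Relationships into shared hub nodes can still contend, so dense hubs may need to be loaded by a single worker.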

Observability that drives rollback decisions

Resilience is invisible without observability. Embed structured logging at every transaction boundary — payload hashes, chunk offsets, constraint violations, retry counts — and track these signals on a migration dashboard:

neo4j_migration_records_processed_total
neo4j_migration_constraint_violations_total
neo4j_migration_retry_latency_seconds
neo4j_migration_quarantine_queue_depth

A sustained spike in LockAcquisitionTimeout or deadlock-related errors is the signal to dynamically reduce chunk size, widen jitter, or switch to asynchronous write batching — a feedback loop that protects integrity without stalling the load.
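
A feedback loop of this kind might look like the following sketch. The threshold, floor, and growth step are placeholder values chosen for illustration, not tuned recommendations; only the 10,000-record ceiling comes from the chunk range discussed earlier.

```python
def next_chunk_size(current, lock_error_rate, *, floor=500, ceiling=10_000,
                    threshold=0.05):
    """Halve the chunk size under lock pressure, otherwise grow it by 10%."""
    if lock_error_rate > threshold:
        proposed = current // 2               # back off quickly
    else:
        proposed = current + current // 10    # recover slowly
    return max(floor, min(ceiling, proposed))
```

Shrinking fast and growing slowly keeps the loop from oscillating: one quiet interval after a contention spike does not immediately restore the chunk size that caused it.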

Known Pitfalls

Side effects inside the transaction function. Symptom: duplicate dead-letter records, inflated retry metrics, or a checkpoint that advances even though the transaction rolled back. Root cause: the side effect ran on every retry attempt because it lives inside the function passed to execute_write. Fix: move all publishing, metric emission, and ledger commits to after the managed call returns, as in Step 1.
Retrying a deterministic constraint violation. Symptom: a chunk burns all its attempts and holds locks, yet fails identically each time. Root cause: ConstraintValidationFailed was caught by a generic transient handler. Fix: branch on exception type. Route constraint failures straight to quarantine, fail fast on other ClientError cases such as syntax errors, and reserve backoff for TransientError/ServiceUnavailable.
Swallowed errors and non-idempotent replay. Symptom: record counts drift between source and graph; duplicates appear after a restart. Root cause: exceptions were caught-and-continued, hiding partial writes, while non-idempotent CREATE statements duplicated data on replay. Fix: dead-letter every failure, make every write a constraint-backed MERGE, and gate reruns on committed watermarks. Where the model itself changes over time, coordinate through disciplined schema evolution and versioning so a replayed migration never fights an out-of-band schema change.
Oversized transactions masquerading as durability. Symptom: OutOfMemoryError, long GC pauses, or bloated transaction logs during large loads. Root cause: an attempt to make the load “atomic” by committing an entire table in one transaction, which also makes any failure catastrophic to recover. Fix: chunk to bounded windows and, for server-side batching, use CALL { ... } IN TRANSACTIONS; recoverability comes from small committed units plus checkpoints, not from one giant transaction.
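
For the server-side batching mentioned in the last pitfall, a minimal sketch could look like this. The `Customer` label and the 5,000-row batch size are taken from examples elsewhere on this page; the function name is hypothetical.

```python
BATCHED_MERGE = """
UNWIND $records AS r
CALL {
  WITH r
  MERGE (n:Customer {id: r.id})
  ON CREATE SET n += r.properties
} IN TRANSACTIONS OF 5000 ROWS
"""


def load_server_side(session, records):
    # CALL { ... } IN TRANSACTIONS is only valid in an implicit (auto-commit)
    # transaction, so it goes through session.run(), not execute_write().
    # The driver does not retry auto-commit queries, so any retry policy
    # has to wrap this call explicitly.
    result = session.run(BATCHED_MERGE, records=records)
    return result.consume()
```

If the query fails partway, the inner batches that already committed stay committed. That is acceptable here only because the MERGE is idempotent, so rerunning the same input converges to the same graph.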

Conclusion

Error handling in graph migration pipelines is not about preventing failures — it is about containing them, recovering deterministically, and preserving auditability. By enforcing strict transaction boundaries, designing idempotent Cypher, quarantining malformed data before it reaches a transaction, and resuming from state-aware checkpoints, platform teams execute high-volume loads with confidence. Combined with automated backups, telemetry-driven tuning, and rigorous cutover validation, these mechanisms turn migration from a high-risk event into a repeatable, observable engineering workflow.

Automated Data Migration from Relational & JSON Sources — the parent reference this resilience layer plugs into.
Batch Processing & Chunking Workflows — the transaction boundaries every rollback aligns to.
Data Validation & Integrity Checks — contract enforcement that keeps failures out of the transaction path.
Initial Load Performance Tuning — chunk-size and concurrency trade-offs that govern retry cost.
Implementing idempotent migration scripts for Neo4j — the constraint-first replay pattern in full.
