Data Validation & Integrity Checks

Production-grade graph migrations must guarantee that what lands in Neo4j is exactly what the source promised — no orphaned edges, no silently coerced property types, no duplicate business keys. This is harder than in row-based ETL because graph topology adds referential surface area: a single missing constraint can let a whole subgraph of dangling relationships accumulate before anyone notices. The engineering task this guide addresses is where to place validation so that corruption is caught at the earliest possible boundary, and how to make each check deterministic, cheap, and safe to rerun. Within an automated migration pipeline, validation is not a post-load reconciliation afterthought bolted onto the end — it is embedded into three gates: before write sessions open, inside each chunk’s transaction, and across the whole graph after cutover.

Prerequisite concepts

Before implementing the checks below, the reader should already have these in place:

The parent architecture from Automated Data Migration from Relational & JSON Sources — validation gates attach to its extract → map → load stages.
A settled source-to-graph mapping from Relational Schema Mapping Strategies, so you know which foreign keys must become typed relationships and which columns become properties.
The flattening contract from JSON Document Flattening & Graph Conversion for any semi-structured payloads, since nested-array boundaries are a primary source of integrity defects.
Neo4j 5.x with the neo4j Python driver v5+, and the target constraints already declared (uniqueness, node key, relationship key).

Validation also assumes a disciplined schema underneath it. Checks are only as trustworthy as the node label taxonomy and relationship cardinality and direction they enforce; if labels are dynamic or edges bidirectionally duplicated, a “passing” count check can still describe a broken graph.

Conceptual model: the three validation gates

Every load is bracketed by three checkpoints. A failure at any gate stops progress before corruption can propagate downstream.

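The pre-load gate runs before any write session opens, the in-loop gate runs inside each chunk’s transaction, and the post-load gate runs across the whole graph after cutover. Corruption is caught at whichever gate first has enough information to detect it, and a failed chunk or load never proceeds to the next stage.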

Design rules / decision matrix

Use the following rules to decide which check belongs at which gate. The guiding principle is that the cheapest check that can catch a class of defect should run at the earliest gate that has enough information to run it.

Integrity risk	Gate	Mechanism	Failure action
Missing uniqueness / node-key constraint	Pre-load	`SHOW CONSTRAINTS` metadata diff	Fail fast, do not open write sessions
Missing supporting index	Pre-load	`SHOW INDEXES` + `state = ONLINE` check	Fail fast or block until ONLINE
Duplicate business key in a chunk	In-loop	`UNWIND` + aggregation before `MERGE`	Rollback chunk, route to dead-letter queue
`null` in a required property	In-loop	Pre-flight validation query	Rollback chunk, emit error code
Type coercion (string where temporal expected)	In-loop	Cast-and-compare in validation query	Rollback chunk, quarantine payload
Orphaned / dangling relationship	Post-load	Global `MATCH` for zero-degree endpoints	Reconciliation failed, alert
Source vs. graph count drift	Post-load	Aggregate count comparison	Reconciliation failed, alert
Cardinality violation (fan-out > policy)	Post-load	Degree aggregation per node type	Reconciliation failed, alert

Two rules override the table. First, never run a check inside a write transaction that could run just as accurately before it: every extra query keeps the transaction, and the locks and memory it holds, open longer. Second, never rely on application code to enforce what a database constraint can enforce declaratively; a uniqueness constraint is enforced atomically with the write, whereas application-side deduplication is not.

Step-by-step implementation

Step 1 — Verify the target schema before any write session

Before materializing nodes, confirm the target contract. Run metadata discovery inside a read-only transaction, materialize the records while the transaction is still open, and diff against the set of constraints the mapping requires. If any structural guarantee is missing, fail before acquiring a single write session.

python

from neo4j import GraphDatabase
import logging

def verify_target_schema(uri: str, auth: tuple, expected_constraints: list[str]) -> bool:
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session(database="neo4j") as session:
            # Materialize records INSIDE the transaction function; the Result
            # cursor is invalid once execute_read closes the managed transaction.
            active_constraints = session.execute_read(
                lambda tx: {record["name"] for record in tx.run("SHOW CONSTRAINTS")}
            )

        missing = [c for c in expected_constraints if c not in active_constraints]
        if missing:
            logging.error(f"Schema validation failed. Missing constraints: {missing}")
            return False
        logging.info("All expected constraints verified. Proceeding to load phase.")
        return True

The constraints this check expects should be created idempotently, so re-running the migration never errors on an already-present constraint:

cypher

// Neo4j 5.x — idempotent DDL; safe to run on every pipeline start
CREATE CONSTRAINT entity_id_unique IF NOT EXISTS
FOR (n:Entity) REQUIRE n.id IS UNIQUE;

CREATE CONSTRAINT order_key IF NOT EXISTS
FOR (o:Order) REQUIRE (o.tenant, o.order_no) IS NODE KEY;
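
The decision matrix also lists a supporting-index check at the pre-load gate. A minimal sketch of that check follows, assuming the pipeline keeps a list of index names it expects; the function names and the example index names (order_ts_range, customer_email) are hypothetical, and fetch_index_rows assumes an open neo4j driver session is passed in.

```python
def index_problems(index_rows, expected_indexes):
    """Return {index_name: problem} for expected indexes that are missing or not ONLINE."""
    states = {row["name"]: row["state"] for row in index_rows}
    problems = {}
    for name in expected_indexes:
        state = states.get(name)
        if state is None:
            problems[name] = "MISSING"
        elif state != "ONLINE":
            problems[name] = state  # for example POPULATING or FAILED
    return problems


def fetch_index_rows(session):
    """Read index metadata; `session` is assumed to be an open neo4j driver session."""
    # Materialize inside the callback, for the same reason as in verify_target_schema.
    return session.execute_read(
        lambda tx: [record.data() for record in tx.run("SHOW INDEXES YIELD name, state")]
    )


# Hand-written rows standing in for the database response:
rows = [
    {"name": "entity_id_unique", "state": "ONLINE"},
    {"name": "order_ts_range", "state": "POPULATING"},
]
problems = index_problems(rows, ["entity_id_unique", "order_ts_range", "customer_email"])
print(problems)
```

Keeping the comparison in a pure function means the fail-fast rule can be unit tested without a running database; whether a POPULATING index should block or fail the run is a policy choice left to the caller.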

Step 2 — Validate each chunk inside its own transaction

During execution, validation shifts from static schema checks to dynamic data reconciliation. Each chunk of the batch-processing workflow is wrapped in a managed write transaction with a pre-flight validation step. The driver’s session.execute_write() retries the transaction function automatically on transient errors, while connection pooling is handled by the driver object underneath it. Validation runs synchronously in the same transaction context, so a detected violation rolls back the exact chunk that produced it.

python

def process_chunk(session, chunk: list[dict], validation_query: str):
    def validate_and_load(tx):
        # Pre-flight validation: detect missing or duplicate keys,
        # null-in-required, or type mismatches for just the rows in this chunk.
        # The validation query is expected to read its input from $rows.
        rows = [{**r["attributes"], "id": r["id"]} for r in chunk]
        violations = tx.run(validation_query, rows=rows).data()

        if violations:
            # Raising inside the transaction function triggers automatic rollback.
            raise ValueError(f"Data integrity threshold breached: {violations}")

        # Safe to execute MERGE operations only after validation passes.
        result = tx.run("""
            UNWIND $batch AS row
            MERGE (n:Entity {id: row.id})
            ON CREATE SET n += row.attrs, n.created_at = datetime()
            ON MATCH  SET n += row.attrs, n.updated_at = datetime()
            RETURN count(n) AS upserted
        """, batch=[{"id": r["id"], "attrs": r["attributes"]} for r in chunk])
        return result.single()["upserted"]

    return session.execute_write(validate_and_load)

The MERGE keys on id, which is backed by the uniqueness constraint from Step 1. This makes the load idempotent: a replayed chunk updates existing nodes rather than duplicating them, so a partial failure followed by a full rerun converges to the same graph.
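
One thing the MERGE cannot surface on its own is a business key that appears twice inside the same chunk: the second row simply matches the node the first row created and overwrites its properties. Below is a minimal sketch of an in-memory check that could run before the chunk is sent, assuming each row carries its key under "id" as in process_chunk. The name find_duplicate_keys is illustrative and not part of the driver.

```python
from collections import Counter


def find_duplicate_keys(chunk, key="id"):
    """Return the business keys that occur more than once in a chunk, with counts."""
    counts = Counter(row[key] for row in chunk)
    return {k: n for k, n in counts.items() if n > 1}


chunk = [
    {"id": "a-1", "attributes": {"name": "first"}},
    {"id": "a-2", "attributes": {"name": "second"}},
    {"id": "a-1", "attributes": {"name": "conflicting"}},
]
duplicates = find_duplicate_keys(chunk)
if duplicates:
    # In a pipeline this is where the chunk would be routed to the dead-letter queue.
    print(f"duplicate business keys in chunk: {duplicates}")
```

Running this on the client keeps the check out of the write transaction entirely, in line with the rule that a check should run at the earliest gate that has enough information.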

Step 3 — Reconcile counts and topology after the load completes

Once all chunks commit, integrity verification scales from per-chunk to graph-wide. Compare source row counts against target node and edge cardinalities, confirm relationship directionality, and detect dangling pointers. Reconciliation queries should be index-backed and parameterized with the expected values computed at extract time.

cypher

// Reconcile target node count against the expected source row count
MATCH (t:TargetEntity)
WITH count(t) AS target_count
RETURN
    target_count,
    $expected_count AS expected_count,
    CASE WHEN target_count = $expected_count
         THEN 'RECONCILIATION_PASSED'
         ELSE 'RECONCILIATION_FAILED'
    END AS status;
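
When several labels are reconciled at once, the comparison itself can live in the pipeline code. The following is a sketch under the assumption that expected counts were recorded per label at extract time and that the actual counts have already been read back from the graph; reconcile_counts and the sample numbers are illustrative only.

```python
def reconcile_counts(expected, actual):
    """Compare per-label expected counts with observed counts.

    Returns a list of (label, expected, actual) tuples for every label that drifted.
    """
    drift = []
    for label in sorted(set(expected) | set(actual)):
        want = expected.get(label, 0)
        got = actual.get(label, 0)
        if want != got:
            drift.append((label, want, got))
    return drift


expected = {"Customer": 1200, "Order": 8450}  # computed at extract time (sample values)
actual = {"Customer": 1200, "Order": 8447}    # read back from the graph (sample values)
drift = reconcile_counts(expected, actual)
status = "RECONCILIATION_FAILED" if drift else "RECONCILIATION_PASSED"
print(status, drift)
```

Returning the drifting labels rather than a single boolean gives the alert enough detail to point at the stage that lost or duplicated rows.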

Constraint & validation layer

Constraints and ingestion-side checks are complementary, not redundant. The constraint is the invariant the database will never let you violate; the ingestion check is the early, informative signal that lets you route a bad payload to remediation instead of hitting a hard database error mid-transaction.

The pre-flight validation query referenced in Step 2 is where ingestion-side invariants live. A representative implementation checks three failure classes in one round trip: missing keys, missing required properties, and unparseable temporal strings. Duplicate keys within a chunk need a separate aggregation, as listed in the decision matrix.

cypher

// $rows is the list of row maps in the current chunk.
// Returns one row per violating input row with a machine-readable reason.
UNWIND $rows AS row
WITH row
WHERE row.id IS NULL
   OR row.name IS NULL
   OR NOT row.event_ts =~ '\\d{4}-\\d{2}-\\d{2}T.*'
RETURN row.id AS id,
       CASE
         WHEN row.id IS NULL   THEN 'MISSING_KEY'
         WHEN row.name IS NULL THEN 'MISSING_REQUIRED'
         ELSE 'BAD_TEMPORAL'
       END AS reason;

Enforcing correct property types at ingestion matters because native temporal and numeric types keep properties eligible for range and composite indexes. If a timestamp lands as a string, downstream range queries silently fall back to scans — a class of defect covered in graph data type selection. Persisting the check as a constraint plus an ingestion filter means both the fast feedback and the hard guarantee are present.
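
One way to keep timestamps native is to parse them on the client before the chunk is sent, since the neo4j Python driver maps Python datetime values to Cypher temporal types. The sketch below assumes ISO 8601 strings in a field named event_ts, as in the validation query; coerce_event_ts is a hypothetical helper, and datetime.fromisoformat accepts a somewhat different set of strings than the regex above (for example, a date without a time part).

```python
from datetime import datetime


def coerce_event_ts(row, field="event_ts"):
    """Return (clean_row, None) on success or (None, 'BAD_TEMPORAL') on failure."""
    raw = row.get(field)
    if not isinstance(raw, str):
        return None, "BAD_TEMPORAL"
    # Older Python versions reject a trailing 'Z', so normalize it to an offset.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        return None, "BAD_TEMPORAL"
    # Copy the row so the original payload stays intact for quarantine or replay.
    return {**row, field: parsed}, None


good, good_err = coerce_event_ts({"id": 1, "event_ts": "2024-03-01T12:00:00Z"})
bad, bad_err = coerce_event_ts({"id": 2, "event_ts": "yesterday"})
print(good, good_err)
print(bad, bad_err)
```

Rows that fail coercion can be quarantined with the same BAD_TEMPORAL reason code the validation query emits, so both paths report in one vocabulary.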

Defence in depth: a defect is stopped at the outermost layer that has enough information to catch it, and whatever slips through still hits the constraint — the one guard the write can never bypass.

Performance & scale considerations

Validation overhead must be balanced against initial load performance tuning objectives. Excessive pre-flight queries or unparameterized MERGE operations add latency and inflate the transaction log. The cost model is straightforward: a validation query that seeks an index costs microseconds per row, while the same query forced into a full label scan costs milliseconds and grows linearly with the graph.

Confirm index-backed lookups. Run EXPLAIN and PROFILE on every validation and MERGE query and require index-seek leaves (NodeIndexSeek, or NodeUniqueIndexSeek when a uniqueness constraint backs the key), never NodeByLabelScan, on the properties you key on.
Batch parameters with UNWIND. Send one parameterized statement per chunk rather than one statement per row; this collapses round trips and lets the planner reuse a single compiled plan.
Size chunks against the transaction log. Larger chunks amortize round-trip cost but raise rollback expense and memory pressure. The trade-offs are the same ones weighed in the batch-processing workflow; keep post-validation chunk commits in the low thousands of nodes and profile from there.
Tune the connection pool. Set max_connection_pool_size and connection_acquisition_timeout to match cluster topology so validation reads and write commits do not starve each other.
Emit structured metrics. Track validation pass/fail rate, rollback frequency, and per-chunk duration through driver logging and OpenTelemetry so a rising rollback rate is visible before it becomes a stalled migration.
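
The metrics named in the last item can be collected with very little code. This is a sketch only: LoadMetrics and run_chunk are hypothetical names, the in-memory counters stand in for whatever OpenTelemetry or logging backend the pipeline actually uses, and the ValueError handling assumes the chunk loader signals a validation failure the way process_chunk does.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoadMetrics:
    passed: int = 0
    rolled_back: int = 0
    durations: list = field(default_factory=list)  # seconds per chunk

    @property
    def rollback_rate(self):
        total = self.passed + self.rolled_back
        return self.rolled_back / total if total else 0.0


def run_chunk(metrics, load_chunk, chunk):
    """Run one chunk, recording its outcome and wall-clock duration."""
    started = time.perf_counter()
    try:
        result = load_chunk(chunk)
        metrics.passed += 1
        return result
    except ValueError:
        # Validation rejected the chunk and the transaction was rolled back.
        metrics.rolled_back += 1
        return None
    finally:
        metrics.durations.append(time.perf_counter() - started)


metrics = LoadMetrics()
run_chunk(metrics, lambda chunk: len(chunk), [{"id": 1}, {"id": 2}])
print(f"rollback rate: {metrics.rollback_rate:.0%}")
```

Only validation failures are swallowed here; any other exception still propagates after its duration is recorded, so infrastructure errors are not mistaken for bad data.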

Cardinality is the scaling variable that most often surprises teams: a reconciliation query that aggregates node degree is cheap on a well-partitioned graph and expensive on a hub node with millions of relationships. Where fan-out is inherent, align reconciliation with the graph partitioning strategy so degree checks run per partition rather than across the whole graph.

Known pitfalls

Pitfall 1 — Reading a Result cursor after the transaction closes

The most common driver-level bug is returning the Result object from an execute_read/execute_write callback and iterating it afterward. Once the managed transaction closes, the result is out of scope and any attempt to read it raises ResultConsumedError. Always materialize inside the callback, as in Step 1. Root cause: the managed transaction API scopes the result lifetime to the callback; anything you need must be turned into a plain Python value before the callback returns.
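
The two callback shapes side by side, as a sketch; broken_read and safe_read are illustrative names, and tx is assumed to be the managed transaction object the driver passes into the callback.

```python
def broken_read(tx):
    # Returns the lazy result object; it is out of scope once the callback returns,
    # so iterating it in the caller fails.
    return tx.run("SHOW CONSTRAINTS")


def safe_read(tx):
    # Builds a plain set of names while the transaction is still open.
    return {record["name"] for record in tx.run("SHOW CONSTRAINTS")}
```

With a real session, session.execute_read(safe_read) returns an ordinary Python set that remains usable after the transaction has closed.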

Pitfall 2 — Count checks that pass over a corrupt graph

A source-vs-target count match proves only that the right number of nodes exist, not that they are correctly connected. Orphaned relationships and missing edges pass a naive count reconciliation. Add an explicit topology check for zero-degree endpoints where the model forbids them:

cypher

// Detect dangling endpoints: Orders that should always be placed by a Customer
MATCH (o:Order)
WHERE NOT (o)<-[:PLACED]-(:Customer)
RETURN count(o) AS orphaned_orders;

This class of defect frequently traces back to modelling choices catalogued in property graph anti-patterns — most often mirror edges or dynamic labels that make the endpoint query miss real orphans.

Pitfall 3 — Manual rollback that leaks partial state

Teams reaching for explicit session.begin_transaction() and manual tx.rollback() sometimes commit half a chunk when an exception fires between statements. In driver 5.x, prefer execute_write(): an unhandled exception raised inside the callback rolls the whole transaction back automatically and, for transient errors, retries it. Reserve manual transaction objects for cases that genuinely need mid-transaction branching, and always wrap them in try/except with an explicit rollback. Failed chunks should be serialized with their original payload, error code, and stack trace so they can be replayed deterministically after remediation — the discipline detailed in Error Handling & Rollback Mechanisms.
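
A minimal sketch of what such a serialized failure record might look like, using only the standard library. The field names and the dead_letter_record helper are assumptions for illustration; the actual dead-letter format is whatever the pipeline's error-handling layer defines.

```python
import json
import traceback
from datetime import datetime, timezone


def dead_letter_record(chunk, error, error_code):
    """Serialize a failed chunk so it can be replayed after remediation."""
    return json.dumps(
        {
            "failed_at": datetime.now(timezone.utc).isoformat(),
            "error_code": error_code,
            "error": str(error),
            "stack_trace": "".join(
                traceback.format_exception(type(error), error, error.__traceback__)
            ),
            "payload": chunk,  # original rows, unmodified
        },
        default=str,  # fall back to str() for values json cannot encode
    )


chunk = [{"id": "a-1", "attributes": {"name": None}}]
try:
    raise ValueError("Data integrity threshold breached")
except ValueError as exc:
    line = dead_letter_record(chunk, exc, "MISSING_REQUIRED")
print(line)
```

Storing the untouched payload next to the error code is what makes the replay deterministic: the same rows can be fed back through process_chunk once the underlying defect is fixed.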

Pitfall 4 — Validating against a schema that has drifted

A validation suite is only correct while it matches the live contract. When a source column is deprecated or a property changes type, checks that still assert the old shape either reject valid data or wave through invalid data. Run automated drift detection on a schedule and post-cutover, comparing source metadata against Neo4j constraint definitions — the technique in automating schema drift detection between source and graph. Keep validation and schema in lockstep with the versioning approach from schema evolution and versioning, so a contract change updates both the migration and the checks that guard it.
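
Step 1 compares constraint names only, so a constraint that keeps its name but changes its definition would pass. A sketch of a definition-level diff follows, assuming the rows come from SHOW CONSTRAINTS YIELD name, type, labelsOrTypes, properties and that the expected definitions are kept alongside the mapping. The constraint_drift name and the sample values are illustrative; the exact type strings should be checked against the target Neo4j version.

```python
def constraint_drift(expected, live_rows):
    """Diff expected constraint definitions against rows from SHOW CONSTRAINTS.

    `expected` maps name -> (type, labels tuple, properties tuple).
    """
    live = {
        row["name"]: (row["type"], tuple(row["labelsOrTypes"]), tuple(row["properties"]))
        for row in live_rows
    }
    missing = sorted(set(expected) - set(live))
    unexpected = sorted(set(live) - set(expected))
    changed = sorted(
        name for name in set(expected) & set(live) if expected[name] != live[name]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


expected = {
    "entity_id_unique": ("UNIQUENESS", ("Entity",), ("id",)),
    "order_key": ("NODE_KEY", ("Order",), ("tenant", "order_no")),
}
live_rows = [  # sample rows standing in for the database response
    {"name": "entity_id_unique", "type": "UNIQUENESS",
     "labelsOrTypes": ["Entity"], "properties": ["id"]},
    {"name": "order_key", "type": "NODE_KEY",
     "labelsOrTypes": ["Order"], "properties": ["tenant"]},
]
report = constraint_drift(expected, live_rows)
print(report)
```

An empty report in all three buckets is the signal that the checks and the live contract still agree; anything else should block the next scheduled load until the mapping or the constraint is updated.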

Automated Data Migration from Relational & JSON Sources — the parent reference this validation layer plugs into.
Error Handling & Rollback Mechanisms — where failed chunks go after a validation gate rejects them.
Batch Processing & Chunking Workflows — the loop that in-transaction validation runs inside.
Initial Load Performance Tuning — balancing validation overhead against throughput.
Automating schema drift detection between source and graph — keeping checks aligned with an evolving source contract.
