Automating schema drift detection between source and graph

You are running a continuous migration into Neo4j and you need a pre-flight check that fails the pipeline the moment the source schema stops matching the graph model it feeds. When relational DDL evolves, an API contract mutates, or a JSON payload gains an unversioned nesting layer, ingestion jobs either fail silently, create orphaned nodes, or violate uniqueness constraints — and the corruption is only noticed after thousands of transactions have committed. This page shows how to materialize the live graph schema with the Python driver, diff it against a declared source contract, and turn any divergence into a hard gate that halts the load. It is the automated, scheduled version of the drift check embedded in data validation and integrity checks.

Prerequisites

Neo4j 5.x with the db.schema.nodeTypeProperties() and db.schema.relTypeProperties() procedures available (they ship by default).
The neo4j Python driver 5.8 or later, where the managed execute_query() API is stable (it was introduced as experimental in 5.5), along with context-manager sessions.
A settled source-to-graph mapping from relational schema mapping strategies, so you know which columns become properties and which foreign keys become typed relationships.
A flattening contract from JSON document flattening and graph conversion for any semi-structured payloads — nested-array boundaries are the most common source of undetected drift.
Read access to the source catalog (information_schema for a relational store, or a serialized JSON Schema) to build the expected contract.
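As a rough illustration of the last prerequisite, the sketch below turns rows shaped like an information_schema.columns query result into an expected contract. The function name contract_from_columns, the table_to_label mapping, and the SQL_TO_GRAPH_TYPE table are hypothetical, and the SQL-to-graph type names are assumptions to adjust for your source.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from SQL type names to the type names the schema
# procedures report; extend it for the types your source actually uses.
SQL_TO_GRAPH_TYPE: Dict[str, str] = {
    "integer": "Long",
    "bigint": "Long",
    "character varying": "String",
    "text": "String",
    "boolean": "Boolean",
    "double precision": "Double",
    "date": "Date",
}


def contract_from_columns(
    rows: Iterable[Tuple[str, str, str]],
    table_to_label: Dict[str, str],
) -> Dict[str, Dict[str, Dict[str, List[str]]]]:
    """Turn (table_name, column_name, data_type) rows into a source contract."""
    contract: Dict[str, Dict[str, Dict[str, List[str]]]] = {"nodes": {}, "relationships": {}}
    for table, column, data_type in rows:
        label = table_to_label.get(table)
        if label is None:
            continue  # table is not mapped to a node label
        graph_type = SQL_TO_GRAPH_TYPE.get(data_type.lower())
        if graph_type is None:
            raise ValueError(f"unmapped SQL type {data_type!r} on {table}.{column}")
        # Match the key format the schema procedure returns, e.g. ":`Customer`".
        key = f":`{label}`"
        contract["nodes"].setdefault(key, {})[column] = [graph_type]
    return contract


# Hypothetical catalog rows; in practice these come from the source catalog.
rows = [
    ("customers", "id", "bigint"),
    ("customers", "email", "character varying"),
    ("audit_log", "entry", "text"),  # unmapped table, skipped
]
contract = contract_from_columns(rows, {"customers": "Customer"})
```

Raising on an unmapped SQL type is a deliberate choice in this sketch: a type the mapping has never seen is itself a form of drift.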

How drift reaches the graph

Three reproducible patterns account for almost all breakage, and each maps to a specific failure once it hits ingestion:

Type promotion or demotion in a relational source. A column moves from INT to BIGINT or VARCHAR, causing precision truncation or a coercion failure during property assignment. Pre-compiled MERGE statements that assume a fixed Cypher type break.
JSON structural mutation. A nested object flattens into an array, or a required key disappears. Extraction paths that rely on static JSONPath either drop a relationship or build a malformed intermediate node.
Cardinality shift in a foreign key. A one-to-many becomes many-to-many when a junction table or composite key is added. Without detection, the load silently creates duplicate edges or trips a uniqueness constraint — the exact class of defect covered under relationship cardinality and directionality.

Root cause is almost always an asynchronous DDL deploy against a pipeline that assumes static source metadata. A post-load audit such as MATCH (n) WHERE n.property IS NULL only surfaces the damage after the batch processing and chunking workflow has already committed. The fix is to compare schemas before any write session opens.
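One way to catch an asynchronous DDL deploy on the source side is to diff two snapshots of column metadata. The following is a minimal sketch under that assumption; classify_source_changes and the change codes it emits are hypothetical names, not part of any library.

```python
from typing import Dict, List

ColumnTypes = Dict[str, str]  # column name -> declared source type


def classify_source_changes(previous: ColumnTypes, current: ColumnTypes) -> List[str]:
    """Compare two snapshots of one table's columns and label each change."""
    changes: List[str] = []
    for column in sorted(previous.keys() | current.keys()):
        if column not in current:
            changes.append(f"COLUMN_REMOVED: {column}")
        elif column not in previous:
            changes.append(f"COLUMN_ADDED: {column}")
        elif previous[column] != current[column]:
            changes.append(
                f"TYPE_CHANGED: {column} {previous[column]} -> {current[column]}"
            )
    return changes


# Hypothetical snapshots taken before and after a DDL deploy.
before = {"id": "INT", "email": "VARCHAR", "region": "VARCHAR"}
after = {"id": "BIGINT", "email": "VARCHAR", "signup_ts": "TIMESTAMP"}
changes = classify_source_changes(before, after)
```

A non-empty result is a signal to regenerate the contract before the next load rather than a graph-side violation in its own right.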

Core implementation

The detector has two halves: introspect the live graph into a diffable dictionary, then compare that dictionary against the declared source contract. Introspection uses the built-in schema procedures rather than hand-written label and property scans. Because those procedures derive the schema from stored data, time them on a production-sized graph before committing to a run on every pipeline invocation.

python

from neo4j import GraphDatabase
from typing import Dict, Any, List
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("drift")


def extract_graph_schema(uri: str, auth: tuple) -> Dict[str, Any]:
    """Materialize live node and relationship metadata into a diffable structure.

    Uses the built-in schema procedures, which derive types from stored data;
    measure their runtime on a large graph before running on every pipeline start.
    """
    schema: Dict[str, Any] = {"nodes": {}, "relationships": {}}

    with GraphDatabase.driver(uri, auth=auth) as driver:
        # execute_query() returns an EagerResult; rows live on .records.
        nodes = driver.execute_query(
            "CALL db.schema.nodeTypeProperties() "
            "YIELD nodeType, propertyName, propertyTypes RETURN *"
        )
        for r in nodes.records:
            # nodeType arrives as e.g. ":`Customer`"; keep it verbatim as the key.
            schema["nodes"].setdefault(r["nodeType"], {})[r["propertyName"]] = r["propertyTypes"]

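        # The relationship procedure is db.schema.relTypeProperties(); it yields
        # relType (e.g. ":`PLACED`"), and propertyName can be null.
        rels = driver.execute_query(
            "CALL db.schema.relTypeProperties() "
            "YIELD relType, propertyName, propertyTypes RETURN *"
        )
        for r in rels.records:
            props = schema["relationships"].setdefault(r["relType"], {})
            if r["propertyName"] is not None:  # skip placeholder rows of property-less types
                props[r["propertyName"]] = r["propertyTypes"]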

    return schema


def detect_drift(source_contract: Dict[str, Any], graph_schema: Dict[str, Any]) -> List[str]:
    """Diff the declared source contract against the live graph and list violations."""
    violations: List[str] = []

    for label, props in source_contract.get("nodes", {}).items():
        if label not in graph_schema["nodes"]:
            violations.append(f"MISSING_NODE_LABEL: {label}")
            continue
        live = graph_schema["nodes"][label]
        for prop, expected_types in props.items():
            if prop not in live:
                violations.append(f"MISSING_PROPERTY: {label}.{prop}")
            elif not any(t in live[prop] for t in expected_types):
                # Type promotion/demotion: the property exists but the stored
                # type no longer intersects what the source contract promises.
                violations.append(
                    f"TYPE_MISMATCH: {label}.{prop} expected {expected_types}, found {live[prop]}"
                )

    for r_type in source_contract.get("relationships", {}):
        if r_type not in graph_schema["relationships"]:
            violations.append(f"MISSING_RELATIONSHIP: {r_type}")

    return violations

The source_contract is a plain dictionary you generate from the source catalog or a declared schema model — for example a dataclass or a pydantic BaseModel serialized with model_dump(). Keeping the contract as data rather than code means it can be versioned alongside the migration and diffed in review. For the authoritative behaviour of the schema procedures, consult the Neo4j Cypher Manual.
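As an illustration of a contract kept as data, the sketch below derives the node half from a dataclass and serializes it for version control. The Customer model, the node_contract helper, the PLACED relationship, and the Python-to-graph type mapping are all hypothetical; the keys assume the ":`Label`" format that the introspection step stores verbatim.

```python
import json
from dataclasses import dataclass, fields
from typing import Dict, List

# Assumed mapping from Python annotations to the type names reported by
# the schema procedures.
PY_TO_GRAPH_TYPE = {int: "Long", str: "String", float: "Double", bool: "Boolean"}


@dataclass
class Customer:  # hypothetical source model
    id: int
    email: str
    lifetime_value: float


def node_contract(model: type) -> Dict[str, Dict[str, List[str]]]:
    """Build one node entry keyed the way the live schema is keyed."""
    key = f":`{model.__name__}`"
    props = {f.name: [PY_TO_GRAPH_TYPE[f.type]] for f in fields(model)}
    return {key: props}


source_contract = {
    "nodes": node_contract(Customer),
    "relationships": {":`PLACED`": {}},  # hypothetical relationship type
}

# Sorted keys keep the serialized contract stable, so review diffs stay small.
serialized = json.dumps(source_contract, indent=2, sort_keys=True)
```

Each property maps to a list of accepted types, which is the shape the type comparison in detect_drift iterates over.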

Gating the pipeline

Drift detection is only useful if a positive result stops the load. Wrap the two functions in a gate that runs inside a read-only scope before any write session opens, and raise on breach so the pipeline’s error path — not a half-written graph — takes over.

python

class SchemaDriftError(RuntimeError):
    """Raised when live graph schema diverges from the source contract."""


def gate(uri, auth, source_contract, *, tolerated: int = 0) -> None:
    live = extract_graph_schema(uri, auth)
    violations = detect_drift(source_contract, live)
    for v in violations:
        log.warning("drift: %s", v)
    if len(violations) > tolerated:
        raise SchemaDriftError(f"{len(violations)} drift violation(s); halting ingestion")

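In the control flow, the gate sits ahead of the load: introspect the live graph, diff it against the contract, and open a write session only when the violation count is within the tolerated threshold. Otherwise SchemaDriftError propagates and no write is attempted.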
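The ordering can be sketched with stand-in callables in place of the real introspection and load steps. Here run_pipeline, fake_load, and the lambda checks are hypothetical; the exception class is redefined only so that the block runs on its own.

```python
from typing import Callable, List


class SchemaDriftError(RuntimeError):
    """Raised when the pre-flight check reports drift."""


def run_pipeline(
    check: Callable[[], List[str]],
    load: Callable[[], int],
    *,
    tolerated: int = 0,
) -> int:
    violations = check()  # pre-flight step: introspect and diff, no writes
    if len(violations) > tolerated:
        raise SchemaDriftError(f"{len(violations)} drift violation(s); load skipped")
    return load()  # the write path is reached only after the gate passes


calls: List[str] = []


def fake_load() -> int:
    calls.append("load")
    return 42  # stand-in for a count of written rows


ok = run_pipeline(lambda: [], fake_load)
try:
    run_pipeline(lambda: ["MISSING_PROPERTY: Customer.email"], fake_load)
    halted = False
except SchemaDriftError:
    halted = True
```

In the second call the load function is never invoked, which is the property the gate exists to guarantee.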

Validation and verification

Confirm the detector actually catches drift before you trust it in production. Two checks are enough:

Positive control. Point the gate at a contract that deliberately renames one label, then assert that detect_drift returns exactly one MISSING_NODE_LABEL entry. A detector that returns an empty list here is silently passing everything.
Live spot-check with Cypher. Run the introspection query directly to see what the driver sees:

cypher

CALL db.schema.nodeTypeProperties()
YIELD nodeType, propertyName, propertyTypes
RETURN nodeType, collect([propertyName, propertyTypes]) AS shape
ORDER BY nodeType;

Compare the returned shape against the source contract by eye for a handful of core labels. When the property names and types line up, the automated gate is reporting the same reality you can query manually. Schedule the gate to run post-cutover as well, so a source change that lands after go-live is caught within one scheduling interval rather than at the next incident.
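The positive control can be automated along these lines. This is a sketch that takes the detector as a parameter, so it assumes a function with the same signature and message format as detect_drift, and a baseline contract that already passes against the live schema; positive_control is a hypothetical helper name.

```python
import copy
from typing import Any, Callable, Dict, List

Detector = Callable[[Dict[str, Any], Dict[str, Any]], List[str]]


def positive_control(detect: Detector, contract: Dict[str, Any], live: Dict[str, Any]) -> None:
    """Rename one contract label and assert the detector reports exactly that."""
    broken = copy.deepcopy(contract)  # leave the real contract untouched
    label = next(iter(broken["nodes"]))
    renamed = label + "_RENAMED"
    broken["nodes"][renamed] = broken["nodes"].pop(label)

    violations = detect(broken, live)
    assert violations == [f"MISSING_NODE_LABEL: {renamed}"], violations
```

Calling positive_control(detect_drift, source_contract, live) in a test suite fails loudly if the detector ever starts returning an empty list for a known-bad contract.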

Edge cases and gotchas

Empty labels report no properties. db.schema.nodeTypeProperties() only returns rows for labels that currently hold at least one node. On a freshly created but unpopulated graph, a real label looks MISSING. Guard the gate with a mode flag so greenfield loads run it in --validate-only (dry-run) instead of failing on structure that has not been populated yet — the pattern used during initial load performance tuning.
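A mode flag of this kind might look like the following sketch. The --validate-only flag name comes from the text above; parse_mode and should_halt are hypothetical helpers, and the rule that validate-only never halts is an assumption about how you want dry runs to behave.

```python
import argparse
from typing import List, Sequence


def parse_mode(argv: Sequence[str]) -> bool:
    """Return True when the run was started with --validate-only."""
    parser = argparse.ArgumentParser(description="schema drift gate")
    parser.add_argument(
        "--validate-only",
        action="store_true",
        help="report drift but never fail the run",
    )
    return parser.parse_args(argv).validate_only


def should_halt(violations: List[str], *, validate_only: bool, tolerated: int = 0) -> bool:
    """Decide whether the gate stops the pipeline."""
    if validate_only:
        return False  # dry run: violations are logged by the caller, not enforced
    return len(violations) > tolerated
```

On a greenfield load, every label in the contract shows up as missing, so the dry run still produces a useful report of what the first load is expected to create.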

Property types are reported as a list. A property that has held more than one type across nodes returns multiple entries (for example ["Long", "String"]). Treating propertyTypes as a scalar produces a false TYPE_MISMATCH; the any(... in live[prop]) intersection above avoids that. Genuinely mixed types, though, are a defect in their own right and usually trace back to a modelling mistake listed in the property graph anti-patterns catalogue.
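Because the intersection check passes as long as one type matches, mixed types can slip through unnoticed. A separate report along these lines may help; mixed_type_properties and the MIXED_TYPES code are hypothetical, and the input is assumed to have the shape extract_graph_schema returns.

```python
from typing import Any, Dict, List


def mixed_type_properties(graph_schema: Dict[str, Any]) -> List[str]:
    """List every property whose stored values span more than one type."""
    findings: List[str] = []
    for section in ("nodes", "relationships"):
        for owner, props in graph_schema.get(section, {}).items():
            for prop, types in props.items():
                # types may be None or empty for entries without properties
                if types and len(types) > 1:
                    findings.append(
                        f"MIXED_TYPES: {owner}.{prop} holds {sorted(types)}"
                    )
    return findings


# Hypothetical live schema in the same shape as the introspection output.
live = {
    "nodes": {":`Customer`": {"id": ["Long", "String"], "email": ["String"]}},
    "relationships": {":`PLACED`": {}},
}
findings = mixed_type_properties(live)
```

Whether a mixed-type finding should count toward the gate's violation total is a policy decision; logging it separately keeps the two concerns apart.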

A drifted validator waves bad data through. The detector is only correct while its contract matches the intended shape. When a column is deprecated or a property changes type on purpose, regenerate the contract in the same change that alters the source, and keep both in lockstep with the versioning approach from schema evolution and versioning. A stale contract either rejects valid data or, worse, passes invalid data — the failure this whole page exists to prevent. Failed loads should be serialized with payload and error code for deterministic replay, exactly as error handling and rollback mechanisms prescribes.
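Serializing a failed load for replay could be as simple as one JSON line per rejected chunk. The sketch below assumes that format; record_failure, read_failures, and the field names error_code and payload are hypothetical and should follow whatever your rollback tooling expects.

```python
import json
from typing import Any, Dict, Iterator, TextIO


def record_failure(sink: TextIO, payload: Dict[str, Any], error_code: str) -> None:
    """Append one failed chunk as a JSON line so it can be replayed later."""
    line = json.dumps({"error_code": error_code, "payload": payload}, sort_keys=True)
    sink.write(line + "\n")


def read_failures(source: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield recorded failures in the order they were written."""
    for line in source:
        if line.strip():
            yield json.loads(line)
```

Writing in arrival order and reading back in the same order is what makes the replay deterministic in this sketch.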

Parent context

This task is one gate within the broader discipline of data validation and integrity checks, which places drift detection alongside pre-load schema verification, in-transaction chunk validation, and post-load reconciliation across an automated migration pipeline.

Data Validation & Integrity Checks — the parent guide this gate plugs into.
Schema Evolution & Versioning — keep the contract and the graph model in lockstep as the source changes.
Error Handling & Rollback Mechanisms — where a chunk goes after the gate rejects it.