JSON Document Flattening & Graph Conversion

Migrating semi-structured JSON payloads into a Neo4j property graph is a schema-alignment problem, not a serialization problem. A document store hands you hierarchical nesting, polymorphic shapes, and variable-length collections that no relational extractor ever produces — and none of it can be ingested verbatim without collapsing meaning. The engineering task this guide addresses is how to translate an arbitrary document tree into an explicit, query-optimized graph: which paths become nodes, which become properties, which become typed relationships, and how to stream the result through Neo4j 5.x so that a rerun lands exactly the same graph. Within an automated migration pipeline, flattening sits between extraction and load, and its output contract — a flat, keyed record stream — is what every downstream validation and batching stage depends on.

Prerequisite concepts

Before implementing the flattening pipeline below, the reader should already have these in place:

The parent architecture from Automated Data Migration from Relational & JSON Sources — flattening is the map stage between extract and load.
A settled target model. Flattening decisions are downstream of a node label taxonomy and a relationship cardinality and direction policy — you cannot decide whether an embedded object becomes a node until you know which labels are stable domain types.
Familiarity with Relational Schema Mapping Strategies, since JSON flattening reuses the same entity-vs-property-vs-edge decision, only against nested paths rather than foreign keys.
Neo4j 5.x with the neo4j Python driver v5+, plus json and collections from the standard library. Target uniqueness constraints should already be declared so the ingest MERGE is idempotent.

Conceptual model: from document tree to graph topology

Flattening maps document paths onto graph topology. Embedded objects that represent distinct domain concepts are promoted to independent labeled nodes; primitive key-value pairs at the same depth become properties on the nearest owning node; homogeneous arrays of objects become collections of nodes joined by typed relationships. The diagram below shows how a nested customer document decomposes into nodes and relationships.

The mechanical rule that drives this decomposition is the same one used when mapping relational foreign keys to explicit relationship types (:PLACED, :LOCATED_AT, :CONTAINS): implicit structure in the source becomes explicit structure in the graph. To keep the translation deterministic and auditable, maintain a path-to-topology registry that records, for every meaningful JSONPath, exactly what it becomes.

Design rules: path-to-topology decision matrix

Every path in a source document falls into one of a small number of classes. Classifying it up front — and recording the decision in the registry — is what makes a flattening pipeline reproducible across schema variants.

JSON construct	Graph target	Rule	Key / merge behaviour
Object representing a domain entity (`customer`, `address`)	Labeled node	Promote to its own node when it carries a stable business key	`MERGE` on the business key
Primitive scalar (`name`, `zip`, `total`)	Node property	Attach to the nearest owning entity node	`SET` on create / on match
Array of homogeneous objects (`orders[]`)	Node collection + typed relationship	One node per element, joined to the parent by a directed edge	`MERGE` each element, `MERGE` the edge
Array of scalars (`tags[]`)	List-valued property	Store as a native list unless elements need their own identity	`SET n.tags = $tags`
Heterogeneous / polymorphic array	Type-disambiguated nodes	Branch on a discriminator field, route each element to its label	`MERGE` per resolved label
Deeply nested wrapper with no identity (`meta.envelope`)	Flatten through	Collapse into the descendant path; do not create a node	none

Two rules override the table. First, an embedded object only earns node status if it has a stable identity — a business key you can MERGE on; otherwise it is a property bag and belongs on its parent. Promoting keyless objects to nodes is a fast route to the duplication and fan-out problems catalogued in property graph anti-patterns. Second, the registry, not the code, is the source of truth: every path the flattener emits should trace back to a registry row naming its target label, relationship type, property projection, and uniqueness constraint.

Step-by-step implementation

Step 1 — Flatten iteratively, never recursively

Recursive parsers fail in production on deeply nested payloads: unbounded call stacks and unpredictable memory make them a denial-of-service risk against your own migration. Use an explicit queue-based (breadth-first) traversal so heap usage is bounded and the emitted record order is deterministic. Python’s standard json module (Neo4j and the standard library reference) plus collections.deque is all that is required.

python

import json
from collections import deque

def flatten_json_iterative(doc, root_path=""):
    """BFS flattening of a JSON document into a flat list of path/value pairs.

    Iterative traversal keeps heap usage bounded and record order stable,
    which is what makes the downstream MERGE idempotent across reruns.
    """
    queue = deque([(doc, root_path)])
    records = []

    while queue:
        current, path = queue.popleft()
        if isinstance(current, dict):
            for k, v in current.items():
                new_path = f"{path}.{k}" if path else k
                if isinstance(v, (dict, list)):
                    queue.append((v, new_path))          # descend into structure
                else:
                    records.append({"path": new_path, "value": v})
        elif isinstance(current, list):
            for idx, item in enumerate(current):
                item_path = f"{path}[{idx}]"
                if isinstance(item, (dict, list)):
                    queue.append((item, item_path))
                else:
                    # Scalar array elements are recorded positionally;
                    # otherwise they would be silently dropped.
                    records.append({"path": item_path, "value": item})

    return records

The flat records this yields are an intermediate representation, not the final shape sent to Cypher. The registry from the previous section is what turns generic path/value pairs into keyed entity records ready for MERGE.

Step 2 — Resolve arrays that represent relationships

Scalar arrays become list properties, but arrays of objects represent graph edges and need cardinality-aware extraction: one node per element, one relationship per element, and deduplication so a replayed document does not fan out duplicate edges. This unwinding logic is involved enough to warrant its own treatment — the positional keying, orphan handling, and edge-deduplication patterns are covered in handling nested JSON arrays during graph ingestion. The output of that stage is a list of typed edge records — (parent_key, rel_type, child_key, child_props) — that Step 3 ingests alongside the entity records.

Step 3 — Ingest with UNWIND and idempotent MERGE

Once flattened and grouped by target label, records stream to Neo4j through the official Python driver 5.x. Unbatched, single-row MERGE calls exhaust the connection pool and trigger transaction timeouts; the correct pattern is one parameterized statement per chunk that UNWINDs a list of rows inside a managed write transaction. Keying every MERGE on a stable business key makes the load an idempotent MERGE — a replayed chunk updates existing nodes instead of duplicating them.

python

from neo4j import GraphDatabase, exceptions
import logging

def ingest_chunk(tx, chunk):
    # One compiled plan, one round trip, N rows. Keys (customer_id, and the
    # address composite) are backed by the constraints created in Step 4.
    query = """
    UNWIND $records AS r
    MERGE (c:Customer {customer_id: r.customer_id})
      ON CREATE SET c.name = r.name, c.created_at = datetime()
      ON MATCH  SET c.name = r.name, c.updated_at = datetime()
    MERGE (a:Address {zip: r.zip, street: r.street})
    MERGE (c)-[:LOCATED_AT]->(a)
    """
    tx.run(query, records=chunk)

def stream_to_neo4j(driver, flattened_records, chunk_size=10_000):
    for i in range(0, len(flattened_records), chunk_size):
        batch = flattened_records[i:i + chunk_size]
        try:
            with driver.session(database="neo4j") as session:
                # execute_write retries transient errors with backoff and
                # rolls the whole chunk back on any unhandled exception.
                session.execute_write(ingest_chunk, batch)
            logging.info("Committed chunk %d", i // chunk_size + 1)
        except exceptions.Neo4jError as exc:
            logging.error("Transaction failed at chunk %d: %s", i // chunk_size + 1, exc)
            raise

execute_write handles network partitions, leader elections, and transient errors with exponential backoff automatically. Note that constraint violations (ClientError) are deliberately not retried — a retry would fail identically — so the caller must catch and route them to remediation rather than looping.

Constraint & validation layer

The MERGE keys above are only trustworthy if the database enforces their uniqueness declaratively. Application-side deduplication is not atomic with the write; a UNIQUE or NODE KEY constraint is. Create the target constraints idempotently so re-running the migration never errors on an already-present constraint:

cypher

// Neo4j 5.x — idempotent DDL; safe to run on every pipeline start
CREATE CONSTRAINT customer_id_unique IF NOT EXISTS
FOR (c:Customer) REQUIRE c.customer_id IS UNIQUE;

CREATE CONSTRAINT address_key IF NOT EXISTS
FOR (a:Address) REQUIRE (a.zip, a.street) IS NODE KEY;

Validation splits across the flattening boundary. Pre-ingestion, verify schema conformity, required-field presence, and relationship cardinality limits with a JSON Schema or Pydantic contract — nested-array boundaries are the primary source of integrity defects, so validate them before a write session ever opens. Post-commit, run topology checks that a count comparison cannot catch, such as customers left without the mandatory :LOCATED_AT edge:

cypher

// Detect entities that lost a mandatory relationship during flattening
MATCH (c:Customer)
WHERE NOT (c)-[:LOCATED_AT]->(:Address)
RETURN count(c) AS orphaned_customers;

This is the ingestion-side half of the broader gate model in Data Validation & Integrity Checks; flattening is responsible for feeding it clean, typed records so its checks describe a real graph rather than a coincidentally-correct node count.

Performance & scale considerations

Flattened JSON rarely fits into a single transaction without degrading throughput or triggering an OutOfMemoryError on the server, so chunking is not optional. The trade-offs are the same ones weighed in the batch-processing workflow, applied to document-shaped input.

Confirm index-backed merges. Run EXPLAIN and PROFILE on the ingest query and require NodeUniqueIndexSeek leaves on every MERGE key — never NodeByLabelScan. A MERGE that scans turns the load quadratic as the graph grows.
Size chunks against the transaction log. For high-volume loads, 5,000–20,000 records per chunk balances round-trip amortization against heap pressure and rollback cost. Start in that range and profile from there.
Apply backpressure. Track in-flight request counters around each execute_write and shrink chunk size dynamically when commit latency climbs, rather than letting the server queue saturate.
Stage indexes around the bulk load. Drop non-essential secondary indexes with DROP INDEX <name> IF EXISTS before a large initial load and recreate them afterward, so writes do not pay per-row index-maintenance cost mid-migration. Keep uniqueness constraints in place — they are what protects idempotency. This staging is part of initial load performance tuning.
Watch fan-out. Documents with large embedded arrays produce hub nodes whose relationship degree explodes; align ingestion with the graph partitioning strategy so degree-sensitive work runs per partition.

Native property types matter here too: a timestamp flattened as a string stays ineligible for range indexes and silently forces later queries into scans, a defect rooted in graph data type selection. Cast temporal and numeric scalars during flattening, not after.

Known pitfalls

Pitfall 1 — Recursive flattening on adversarial nesting

A recursive flattener works on shallow test fixtures and then blows the stack on a real payload with hundreds of nesting levels — or exhausts memory long before that. The failure is intermittent and load-dependent, which makes it hard to reproduce. Root cause: call-stack depth is proportional to document depth and is unbounded. Fix: the explicit deque traversal in Step 1, which caps memory at the working-set size and never recurses.

Pitfall 2 — Silently dropping scalar array elements

A naive flattener that only descends into dict and list values, but records scalars only when they are dict fields, will drop the elements of a scalar array ("tags": ["a", "b"]) entirely — no error, just missing data discovered weeks later. Fix: record scalar array elements positionally, as the else branch in Step 1 does, and decide explicitly in the registry whether they become a list property or their own nodes.

Pitfall 3 — Non-idempotent ingest that duplicates on replay

If the ingest MERGE keys on a property that is not backed by a uniqueness constraint — or worse, uses CREATE — a partial failure followed by a rerun produces duplicate nodes and fan-out edges. The count reconciliation then “passes” while the graph is corrupt. Fix: every MERGE key must have a matching UNIQUE/NODE KEY constraint (Step 4), and the pipeline must never fall back to CREATE for entities that can be replayed. The deterministic-replay discipline is detailed in implementing idempotent migration scripts for Neo4j.

Pitfall 4 — Polymorphic arrays flattened to a single label

Heterogeneous collections — an events[] array mixing login, purchase, and refund shapes — flattened under one label produce sparse nodes with mutually exclusive property sets, an anti-pattern that wrecks index selectivity. Fix: branch on the discriminator field during flattening and route each element to its correct label, as the decision matrix prescribes. Keep the branching logic in lockstep with the source contract using the versioning approach from schema evolution and versioning, so a new event type updates both the flattener and its target labels.

Automated Data Migration from Relational & JSON Sources — the parent reference this conversion stage plugs into.
Handling nested JSON arrays during graph ingestion — cardinality-aware unwinding and edge deduplication for object arrays.
Relational Schema Mapping Strategies — the same entity-vs-property-vs-edge decision applied to foreign keys.
Batch Processing & Chunking Workflows — the loop that streams flattened chunks to Neo4j.
Data Validation & Integrity Checks — the gates that consume the clean records flattening produces.

JSON Document Flattening & Graph Conversion

Explore this section