Handling nested JSON arrays during graph ingestion

You have a stream of JSON documents — API payloads, a document-store export, or a relational json/jsonb column dump — where each record carries one or more nested arrays (users[].transactions[], orders[].line_items[]), and you need every array element to land as its own node with an explicit edge back to its parent. The trap is that the obvious approach — stacking UNWIND clauses in a single query — multiplies rows into a Cartesian product, blows past the transaction heap, and leaves the graph half-written when a batch fails. This page shows the exact pattern that ingests arbitrarily nested arrays with bounded memory, deterministic throughput, and idempotent re-runs: subquery-isolated traversal, database-managed transaction chunks, and a streaming Python driver loop. It is the array-specific companion to the broader JSON document flattening work, which decides which paths become nodes; here we focus on ingesting the collections cleanly once that mapping exists.

Prerequisites

Neo4j 5.x with the neo4j Python driver 5.x installed (pip install "neo4j>=5,<6").
A stable business key on every parent and child object (parentId, childId) — a natural key drawn from your node label taxonomy, never the internal element id, which is not stable across a re-run.
Uniqueness constraints created before the first load so MERGE is index-backed rather than a full scan (DDL shown in Validation & verification).
Records already shaped by relational schema mapping or JSON flattening, so each payload exposes its arrays under known keys.
A verified pre-load snapshot taken with neo4j-admin database dump for point-in-time recovery.

Why stacked UNWIND clauses explode

Neo4j’s query planner evaluates sequential UNWIND clauses by multiplying row counts. A users array of 100 elements, each holding a transactions array of 50, produces 5,000 intermediate rows before a single MERGE executes. That intermediate result set is materialized in the transaction’s heap, so the failure surfaces as a TransactionMemoryLimitExceededException or as long garbage-collection pauses that stall the whole cluster. The secondary failure is non-deterministic anchoring: when the same identifier appears across chunks, an unguarded MERGE inside a multiplied result can create phantom nodes or duplicate edges. Both problems disappear once each array is expanded inside its own subquery scope so the multiplied rows never reach the outer query.
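To make the arithmetic concrete, here is a minimal Python sketch that counts the rows a two-level stacked UNWIND would carry into the write clauses. The function name `stacked_unwind_rows` and the sample data are illustrative assumptions, not part of any Neo4j API.

```python
from typing import Dict, List


def stacked_unwind_rows(payloads: List[Dict], outer_key: str, inner_key: str) -> int:
    """Count rows left after UNWIND on outer_key followed by UNWIND on inner_key."""
    rows = 0
    for payload in payloads:
        # A missing or null array contributes zero rows, as UNWIND does.
        for item in payload.get(outer_key) or []:
            rows += len(item.get(inner_key) or [])
    return rows


# Hypothetical sample: one document with 100 users, each holding 50 transactions.
sample = [{"users": [{"transactions": [{}] * 50} for _ in range(100)]}]
print(stacked_unwind_rows(sample, "users", "transactions"))  # 5000
```

Running the same count over a real chunk before loading gives a rough idea of how many intermediate rows a single query would have to hold.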

[Diagram: a single nested array expanded into parent-to-child relationships]

Core implementation

The fix has two parts. First, isolate each array’s UNWIND inside a CALL { ... } subquery so its expanded rows stay local. Second, wrap that subquery in CALL { ... } IN TRANSACTIONS OF n ROWS, which decouples the commit boundary from the logical batch size — the database commits every n outer rows regardless of how many array elements each row fans out to. Because IN TRANSACTIONS is an auto-commit construct, it is issued through session.run and must not be nested inside an explicit execute_write transaction.

cypher

UNWIND $chunk AS payload
CALL {
  WITH payload
  // Expand the nested array inside its own scope so the multiplied
  // child rows never leak into the outer UNWIND's row count.
  UNWIND payload.children AS child
  MERGE (p:Parent {id: payload.parentId})     // index-backed upsert of the parent
  MERGE (c:Child  {id: child.childId})         // index-backed upsert of the child
  MERGE (p)-[:HAS_CHILD]->(c)                  // idempotent edge
  WITH c, child.properties AS props
  WHERE props IS NOT NULL
  SET c += props                               // map-merge scalar properties only
} IN TRANSACTIONS OF 500 ROWS                  // DB-managed commit every 500 payloads

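On the Python side, send fixed-size chunks from a generator so that no single query carries the whole dataset as one parameter list, and run each chunk through an auto-commit session.run. The example reads its input with json.load, which does hold the parsed file in RAM; for inputs larger than memory, feed the same loop from a streaming reader instead: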

python

import json
from neo4j import GraphDatabase
from typing import Iterator, Dict, List

def chunk_generator(data: List[Dict], batch_size: int = 500) -> Iterator[List[Dict]]:
    # Yield fixed-size slices instead of building one giant parameter list.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

CYPHER = """
UNWIND $chunk AS payload
CALL {
    WITH payload
    UNWIND payload.children AS child
    MERGE (p:Parent {id: payload.parentId})
    MERGE (c:Child  {id: child.childId})
    MERGE (p)-[:HAS_CHILD]->(c)
    WITH c, child.properties AS props
    WHERE props IS NOT NULL
    SET c += props
} IN TRANSACTIONS OF 500 ROWS
"""

def ingest_nested_arrays(uri: str, auth: tuple, payload_path: str) -> None:
    # json.load parses the whole file into memory; only the per-query
    # parameter list is bounded by the chunk size.
    with open(payload_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    with GraphDatabase.driver(uri, auth=auth) as driver:
        # session.run executes as an auto-commit (implicit) transaction, which is
        # required because CALL { ... } IN TRANSACTIONS cannot run inside an explicit tx.
        with driver.session() as session:
            for chunk in chunk_generator(records, batch_size=500):
                # consume() drains the result so any server-side error is raised here.
                session.run(CYPHER, chunk=chunk).consume()
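
The loader above reads the whole file with json.load before slicing it. If the input is available as JSON Lines (one record per line, which is an assumption about your export format), a sketch like the following keeps only one batch in memory at a time. The names `iter_json_lines` and `chunked` are illustrative.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_json_lines(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def chunked(records: Iterable[Dict], batch_size: int = 500) -> Iterator[List[Dict]]:
    """Group any iterable into lists of at most batch_size items, lazily."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Inside the session loop, `chunked(iter_json_lines(payload_path), 500)` could stand in for `chunk_generator(records, batch_size=500)` without changing the Cypher.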

For arrays nested more than one level deep — users[].orders[].line_items[] — chain subqueries instead of stacking UNWIND clauses. Each CALL { ... } returns only the minimal anchor node needed by the next layer, keeping every intermediate row count linear rather than multiplicative. This is the same discipline that governs any batch-processing workflow on this pipeline: the outer chunk size sets the memory ceiling, and the subquery boundary keeps fan-out contained.
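
As a rough sketch of that chaining for users[].orders[].line_items[], the query below nests two ordinary CALL subqueries inside the single transactional one. The labels, relationship types, and key names (User, Order, LineItem, PLACED, CONTAINS, userId, orderId, lineItemId) are assumptions chosen for illustration; substitute the ones from your own mapping and verify the plan with PROFILE before relying on it.

```python
# Hypothetical labels and keys; only the outermost CALL carries IN TRANSACTIONS.
CHAINED_CYPHER = """
UNWIND $chunk AS payload
CALL {
    WITH payload
    MERGE (u:User {id: payload.userId})
    WITH u, payload
    // Level 1: expand orders and hand back only the anchor node plus its source map.
    CALL {
        WITH u, payload
        UNWIND payload.orders AS orderDoc
        MERGE (o:Order {id: orderDoc.orderId})
        MERGE (u)-[:PLACED]->(o)
        RETURN o, orderDoc
    }
    // Level 2: expand line items against the order anchor returned above.
    CALL {
        WITH o, orderDoc
        UNWIND orderDoc.line_items AS item
        MERGE (li:LineItem {id: item.lineItemId})
        MERGE (o)-[:CONTAINS]->(li)
    }
} IN TRANSACTIONS OF 500 ROWS
"""
```

The string is passed to session.run exactly like the single-level query, with each chunk element shaped as a user document carrying an orders array.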

Validation & verification

Create the constraints first — they are what makes MERGE an index-backed upsert and what guarantees safe convergence on re-run:

cypher

CREATE CONSTRAINT parent_id IF NOT EXISTS
  FOR (p:Parent) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT child_id IF NOT EXISTS
  FOR (c:Child)  REQUIRE c.id IS UNIQUE;

After a load, confirm the fan-out landed by counting children per parent and checking for orphaned nodes:

cypher

// Every child should have at least one incoming HAS_CHILD edge.
MATCH (c:Child)
WHERE NOT ( ()-[:HAS_CHILD]->(c) )
RETURN count(c) AS orphaned_children;   // expect 0

// Row-count sanity: parents and edges materialized.
MATCH (p:Parent)-[r:HAS_CHILD]->(:Child)
RETURN count(DISTINCT p) AS parents, count(r) AS edges;

Run the ingest a second time against the same input: because every write is a MERGE behind a uniqueness constraint, the counts above must not change. If they do, a key is missing or unstable — fix the mapping upstream, not the load. To confirm the subquery is doing its job, prefix a single-chunk run with PROFILE; the db hits on the CALL subquery should scale with the array elements in that chunk, not with the Cartesian product of all arrays.

Edge cases & gotchas

Empty or missing arrays. If payload.children is null, absent, or empty, UNWIND yields zero rows, so the MERGE clauses after it never run and the parent is silently never created. If parents must exist regardless of their array contents, MERGE the parent before the UNWIND, then expand UNWIND coalesce(payload.children, []). Keep that MERGE inside the subquery: CALL { ... } IN TRANSACTIONS is not supported after a write clause in the outer query.
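
One way to guarantee the parent regardless of array contents is to MERGE it ahead of the UNWIND. The sketch below is a variant of the core query under that assumption; it keeps the parent MERGE inside the subquery so the write commits with its batch. The constant name `CYPHER_KEEP_PARENTS` is illustrative.

```python
# Variant of the core query: the parent is upserted before the array is expanded,
# so a null, missing, or empty children array still leaves a Parent node behind.
CYPHER_KEEP_PARENTS = """
UNWIND $chunk AS payload
CALL {
    WITH payload
    MERGE (p:Parent {id: payload.parentId})
    WITH p, payload
    UNWIND coalesce(payload.children, []) AS child
    MERGE (c:Child {id: child.childId})
    MERGE (p)-[:HAS_CHILD]->(c)
    WITH c, child.properties AS props
    WHERE props IS NOT NULL
    SET c += props
} IN TRANSACTIONS OF 500 ROWS
"""
```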

Non-scalar array elements. SET c += props fails if props contains nested objects or lists of objects: Neo4j property values must be scalars or homogeneous arrays of scalars. Flatten or JSON-encode those values during JSON document flattening before they reach this loop; do not try to store a sub-object as a property.
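
A minimal sketch of the JSON-encoding option, assuming the values come straight from parsed JSON (strings, numbers, booleans, null, lists, dicts). The helper name `sanitize_props` is hypothetical, and whether to encode or flatten a given key remains a mapping decision.

```python
import json
from typing import Any, Dict

_SCALARS = (str, int, float, bool)


def _is_scalar_list(value: Any) -> bool:
    """True for a list whose items are scalars of one single type."""
    return (
        isinstance(value, list)
        and all(isinstance(v, _SCALARS) for v in value)
        and len({type(v) for v in value}) <= 1
    )


def sanitize_props(props: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of props that is safe to pass to SET c += props."""
    clean: Dict[str, Any] = {}
    for key, value in props.items():
        if value is None or isinstance(value, _SCALARS) or _is_scalar_list(value):
            clean[key] = value  # storable as-is (a null removes the key in SET +=)
        else:
            # Nested objects and mixed or nested lists become a JSON string.
            clean[key] = json.dumps(value, sort_keys=True)
    return clean
```

Applying it to each child's properties map while building the chunk keeps the Cypher unchanged.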

A poison chunk aborting the whole load. With IN TRANSACTIONS OF 500 ROWS, a failing batch rolls back on the server while previously committed batches persist. Re-sending a chunk duplicates nothing because every write is an idempotent MERGE, but you still need to capture the bad chunk so that it does not abort the rest of the load:

python

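from neo4j.exceptions import ClientError

# Fragment from the chunk loop; log_dead_letter is a placeholder for your own sink.
try:
    # consume() forces execution so the failure is raised inside this try block.
    session.run(CYPHER, chunk=chunk).consume()
except ClientError as e:
    # ClientError covers ConstraintError and other non-retryable statement errors.
    # The server already rolled back the failing batch; route the chunk to a
    # dead-letter queue instead of aborting the remaining chunks.
    log_dead_letter(chunk, reason=str(e))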
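
The dead-letter sink is left undefined above. As one possible shape, the sketch below wraps the loop in a helper that appends failed chunks to a JSON Lines file and keeps going. The name `run_with_dead_letter`, its parameters, and the file format are assumptions for illustration, not driver features.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple, Type


def run_with_dead_letter(
    run_chunk: Callable[[List[Dict]], None],
    chunks: Iterable[List[Dict]],
    dead_letter_path: str,
    catch: Tuple[Type[BaseException], ...] = (Exception,),
) -> int:
    """Run every chunk; append failing chunks to a JSONL file and return their count."""
    failed = 0
    with open(dead_letter_path, "a", encoding="utf-8") as dlq:
        for index, chunk in enumerate(chunks):
            try:
                run_chunk(chunk)
            except catch as exc:
                # Record enough context to replay the chunk after it is fixed.
                record = {"chunk_index": index, "reason": str(exc), "chunk": chunk}
                dlq.write(json.dumps(record) + "\n")
                failed += 1
    return failed
```

With the driver, `run_chunk` could be `lambda c: session.run(CYPHER, chunk=c).consume()` and `catch` the exception classes you treat as non-retryable, for example `(neo4j.exceptions.ClientError,)`.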

Full transactional recovery — resumable offsets, delta re-sync, and cleanly restarting a partially applied migration — is covered under error handling and rollback mechanisms.

Parent context

This task is one step within JSON Document Flattening & Graph Conversion, the stage that turns hierarchical documents into an explicit node-edge graph inside the wider automated migration pipeline.

Related

Batch Processing & Chunking Workflows — sizing chunks and managing commit boundaries across the whole load.
Resolving Duplicate Nodes During Parallel Batch Loads — keeping array ingestion idempotent when workers run concurrently.
Data Validation & Integrity Checks — pre- and post-load structural verification for ingested arrays.
Node Label Taxonomy Design — choosing the stable labels and keys that make array MERGEs safe.

For authoritative syntax references, consult the Neo4j Cypher Manual on Subqueries in Transactions and the official Python Driver Documentation.
