Batch Processing & Chunking Workflows

Production-grade graph ingestion requires deterministic throughput, bounded transaction footprints, and strict memory isolation. When executing large-scale Automated Data Migration from Relational & JSON Sources, monolithic LOAD CSV operations or unbounded MERGE loops rapidly saturate the JVM heap, trigger transaction log bloat, and stall connection pools. The engineering task this page addresses is precise: how to partition an arbitrarily large source dataset into transactionally isolated windows so that ingestion is resumable, memory-bounded, and horizontally scalable without sacrificing Neo4j’s ACID guarantees. The patterns below use the official Neo4j Python driver 5.x and cover chunk sizing, parallel execution boundaries, parameterized Cypher, idempotency, validation gates, and production-safe commit strategies.

Prerequisite Concepts

Before implementing a chunked pipeline, the reader should already be comfortable with the following upstream material:

Source-to-graph mapping. Chunking transports records that have already been shaped by Relational Schema Mapping Strategies — foreign keys resolved to typed relationships, business keys made deterministic. Chunking does not fix a bad mapping; it faithfully scales whatever mapping it is given.
Nested-payload normalization. Hierarchical documents must pass through JSON Document Flattening & Graph Conversion so that each chunk carries flat, UNWIND-compatible rows rather than arbitrarily deep objects.
Idempotent writes. Every chunk statement must be safely re-runnable. The idempotent MERGE discipline is what makes retry and resume safe; without it, a re-driven chunk duplicates data.
A stable node identity. Uniqueness must be anchored on a business key defined by your node label taxonomy, not on an internal element id, because element ids are not stable across a re-run.

This workflow is one branch of the parent guide, Automated Data Migration from Relational & JSON Sources; the sibling stages (validation, error handling, load tuning) are linked throughout and in the Related block below.

Conceptual Model

A chunking pipeline is a streaming loop: a source iterator is drained in fixed-size windows, and each window is committed inside its own managed transaction. The invariant is that batch N commits or fails as a unit, so a crash leaves batches 1…N-1 durably persisted and batch N cleanly rolled back. The diagram below illustrates how each chunk maps to a single committed transaction.

Batch N commits or rolls back as a unit; a crash leaves batches 1…N−1 durable and batch N cleanly reverted.
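The resumability side of this invariant can be exercised without a database. The following is a minimal sketch, not part of the original pipeline: `write_chunk` is a hypothetical callable standing in for the per-chunk Neo4j transaction, and the JSON checkpoint file is an assumed bookkeeping mechanism. It also assumes the source can be replayed in the same order on every run.

```python
import json
import os
from itertools import islice
from typing import Callable, Iterable, List


def load_checkpoint(path: str) -> int:
    """Return the index of the last committed chunk, or -1 if nothing was committed."""
    if not os.path.exists(path):
        return -1
    with open(path) as fh:
        return json.load(fh)["last_committed"]


def run_resumable(
    source: Iterable,
    size: int,
    write_chunk: Callable[[List], None],  # hypothetical stand-in for one transaction
    checkpoint_path: str,
) -> int:
    """Write every chunk after the checkpoint; return how many chunks this run wrote."""
    it = iter(source)
    done = load_checkpoint(checkpoint_path)
    index, written = -1, 0
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return written
        index += 1
        if index <= done:
            continue  # committed by an earlier run, skip it
        write_chunk(chunk)  # raises on failure, so the checkpoint is not advanced
        with open(checkpoint_path, "w") as fh:
            json.dump({"last_committed": index}, fh)
        written += 1
```

The checkpoint is written after the commit, not atomically with it, so a crash between the two replays one chunk on resume. That replay is only harmless when the chunk statement is an idempotent MERGE.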

Design Rules & Chunk-Size Decision Matrix

Neo4j transactions are atomic, but their resource consumption grows with operation count. A single transaction executing hundreds of thousands of CREATE or MERGE operations accumulates transaction state in memory, holds locks on every node and relationship it touches until commit, and risks a transaction timeout (status code Neo.ClientError.Transaction.TransactionTimedOut) or heap exhaustion. Conversely, chunks that are too small pay the fixed cost of a network round trip and a commit for only a handful of records. The following rules bound the trade space.

Rule	Guidance	Rationale
Baseline chunk size	5,000–25,000 records per transaction	Balances per-transaction round-trip latency against undo-log and lock accumulation. Start at 10,000 and tune.
Property-heavy rows	Bias toward the low end (≈5,000)	Wide `SET n += row.properties` maps inflate per-record heap; fewer records keeps the transaction resident set small.
Relationship-only chunks	Bias toward the high end (≈25,000)	Edge creation touches less property state per operation, so larger windows amortize better.
Active-traffic cutover	Halve the steady-state size	Smaller windows shorten lock hold time and reduce contention with live queries during phased handoff.
Heap alignment	Keep peak chunk footprint well under `server.memory.heap.max_size` (`dbms.memory.heap.max_size` before Neo4j 5)	A chunk whose working set approaches the heap ceiling triggers stop-the-world GC pauses mid-transaction.
One chunk = one transaction	Never span a chunk across commits	Preserves the resumability invariant; a partial multi-commit chunk cannot be cleanly replayed.

The intuition is simple: total wall-clock cost is approximately (records / chunk_size) * per_commit_overhead + records * per_record_cost, while peak memory grows with chunk_size. The second term does not depend on chunk size, so chunk size is the single knob that trades commit overhead (the first term) against peak memory and lock hold time. Tune it empirically against your cluster’s heap and network profile rather than guessing.
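To make the trade-off concrete, here is a small sketch of that cost model. The function name and the sample figures (50 ms per commit, 20 µs per record, 10 million records) are illustrative assumptions, not measurements. The commit count is rounded up, because a partial final chunk still costs one commit.

```python
import math


def estimate_seconds(
    records: int,
    chunk_size: int,
    per_commit_overhead_s: float,
    per_record_cost_s: float,
) -> float:
    """Approximate wall-clock cost: commits * overhead + records * per-record cost."""
    commits = math.ceil(records / chunk_size)
    return commits * per_commit_overhead_s + records * per_record_cost_s


if __name__ == "__main__":
    # Assumed figures for illustration only; measure your own cluster.
    for size in (1_000, 10_000, 100_000):
        cost = estimate_seconds(10_000_000, size, 0.05, 20e-6)
        print(f"chunk_size={size:>7}: ~{cost:.0f}s")
```

Under these assumed numbers the per-record term is fixed at roughly 200 seconds, so most of the gain comes from moving away from very small chunks. Past that point, larger chunks mainly buy memory pressure.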

Step-by-Step Implementation

1. Stream the source without materializing it

In Python, materializing an entire dataset before ingestion defeats the purpose of streaming. Use generator-based pagination to yield fixed-size windows directly into driver sessions. The itertools.islice approach partitions an iterator without intermediate list allocation and works on every Python 3.x release.

python

from itertools import islice
from typing import Iterator

def chunk_iter(source: Iterator, size: int) -> Iterator:
    """Yield successive fixed-size lists from an iterator without full materialization."""
    it = iter(source)
    # The two-argument iter() calls the lambda until it returns the sentinel [].
    return iter(lambda: list(islice(it, size)), [])

Note on itertools.batched: itertools.batched was added in Python 3.12. The islice-based pattern above works on all Python 3.x releases and is preferred for broader compatibility.
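Where a pipeline runs on mixed interpreter versions, the two approaches can be combined. This is a minimal sketch under that assumption. `chunks` is a hypothetical name, and the tuples that `itertools.batched` yields are converted to lists so the output matches the islice version.

```python
import itertools
from itertools import islice
from typing import Iterable, Iterator, List


def chunks(source: Iterable, size: int) -> Iterator[List]:
    """Yield fixed-size lists, using itertools.batched when the interpreter has it."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batched = getattr(itertools, "batched", None)  # present on Python 3.12+
    if batched is not None:
        for batch in batched(source, size):
            yield list(batch)  # batched yields tuples
        return
    it = iter(source)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Because this is a generator function, the size check runs on the first iteration, not when the function is called.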

2. Drive one managed transaction per chunk

Each chunk executes inside a dedicated session.execute_write transaction, ensuring that a failure in batch N leaves batches 1…N-1 intact. The Cypher uses a single UNWIND over the whole chunk so the query planner compiles one plan and reuses it for every row — never issue one MERGE statement per record.

python

from neo4j import GraphDatabase
import logging
from typing import Iterator, Dict, Any

def ingest_chunked(
    uri: str,
    auth: tuple[str, str],
    source_iterator: Iterator[Dict[str, Any]],
    chunk_size: int = 10000,
) -> None:
    driver = GraphDatabase.driver(
        uri,
        auth=auth,
        max_connection_pool_size=50,
        connection_acquisition_timeout=30.0,
    )

    cypher = """
    UNWIND $records AS row
    MERGE (n:Entity {id: row.id})
    SET n += row.properties
    RETURN count(n) AS processed
    """

    try:
        with driver.session(database="neo4j") as session:
            for chunk in chunk_iter(source_iterator, chunk_size):
                try:
                    # Consume the result INSIDE the transaction function: the cursor
                    # is invalid once execute_write commits the managed transaction.
                    processed = session.execute_write(
                        lambda tx, c=chunk: tx.run(cypher, records=c).single()["processed"]
                    )
                    logging.info("Committed chunk: %s records", processed)
                except Exception as e:
                    logging.error("Chunk ingestion failed: %s", e)
                    raise
    finally:
        # Release pooled connections even when a chunk fails and the error propagates.
        driver.close()

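Two details are load-bearing. First, the result is consumed with .single() inside the transaction function: the managed transaction commits when the function returns, after which the cursor is dead. Second, the c=chunk default argument binds the current chunk object when the lambda is defined. Here execute_write runs synchronously inside the loop body, so a plain closure would also see the right chunk, but the explicit binding keeps the callable correct if it is ever deferred, for example by submitting it to a thread pool, where late binding over the loop variable could hand a task a later chunk.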

3. Parameterize, never string-build

The $records parameter is passed as a native Python list of dicts. This keeps the Cypher static so the plan cache is hit on every chunk, and it eliminates Cypher injection. Interpolating values into the query string forces a recompile per chunk and destroys throughput.
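The statement in step 2 reads `row.id` and `row.properties`, so each item in the list needs that shape. Below is a minimal sketch of a shaping helper. `to_record`, `build_params`, and the `key_field` default are hypothetical names for illustration, and the input is assumed to be a flat dict per source row.

```python
from typing import Any, Dict, Iterable, List

# The query text is a constant; only the parameter payload changes per chunk.
CYPHER = """
UNWIND $records AS row
MERGE (n:Entity {id: row.id})
SET n += row.properties
RETURN count(n) AS processed
"""


def to_record(row: Dict[str, Any], key_field: str = "id") -> Dict[str, Any]:
    """Split a flat source row into the business key and a property map."""
    properties = {k: v for k, v in row.items() if k != key_field}
    return {"id": row[key_field], "properties": properties}  # KeyError if the key is absent


def build_params(rows: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Build the parameter dict passed alongside the static query."""
    return {"records": [to_record(r) for r in rows]}
```

With this shape, a call such as `tx.run(CYPHER, **build_params(chunk))` sends the same query text for every chunk, and values never touch the query string.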

Constraint & Validation Layer

Chunking is only safe when the database enforces identity. Create a uniqueness constraint on the business key before the load begins; the backing index also makes each MERGE an index seek instead of a full label scan, which is the difference between a fast load and a quadratic one.

cypher

CREATE CONSTRAINT entity_id_unique IF NOT EXISTS
FOR (n:Entity) REQUIRE n.id IS UNIQUE;

Chunk boundaries create natural validation gates. Before committing, each window should be checked against source expectations; after committing, lightweight aggregation confirms the graph matches the source.

python

# Pre-flight: reject a chunk with missing identities before it ever hits Neo4j.
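def validate_chunk(chunk: list[dict]) -> None:
    # Treat None and empty strings as missing, but accept falsy keys such as 0.
    missing = [i for i, row in enumerate(chunk) if row.get("id") in (None, "")]
    if missing:
        raise ValueError(
            f"Chunk contains {len(missing)} rows without a business key "
            f"(first offsets: {missing[:5]})"
        )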

# Post-commit reconciliation: source row count vs. graph node count.
reconcile = "MATCH (n:Entity) RETURN count(n) AS graph_count"

Deeper structural validation — comparing property types, relationship cardinality, and source checksums against the graph — belongs to the sibling stage Data Validation & Integrity Checks, which runs the same gate logic as a first-class reconciliation pass rather than an inline assertion.
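One subtlety of the count-based reconciliation is that MERGE collapses repeated keys, so the graph should match the number of distinct business keys, not the raw row count. The sketch below shows that comparison. The helper names are hypothetical, and it assumes the keys are hashable and fit in memory; for larger sources the distinct count would come from the source system instead.

```python
from typing import Any, Dict, Iterable


def expected_node_count(rows: Iterable[Dict[str, Any]], key_field: str = "id") -> int:
    """Count distinct business keys, since MERGE writes one node per key."""
    return len({row[key_field] for row in rows})


def reconcile(expected: int, graph_count: int) -> Dict[str, Any]:
    """Compare the expected node count with the count the graph reports."""
    drift = graph_count - expected
    if drift == 0:
        status = "ok"
    elif drift > 0:
        status = "excess"   # more nodes than keys: possible duplicates or stale data
    else:
        status = "missing"  # fewer nodes than keys: possible dropped chunks
    return {"expected": expected, "graph": graph_count, "drift": drift, "status": status}
```

Here `graph_count` would be the value returned by the reconciliation query shown earlier.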

Parallel Execution & Idempotency Guarantees

Horizontal scaling reduces wall-clock time but introduces concurrency hazards. Multiple workers targeting overlapping business keys trigger lock contention, deadlock detection, or duplicate node proliferation. Neo4j’s constraint engine enforces uniqueness, but high-contention MERGE operations degrade to serialized execution under heavy parallel load because each worker must wait on the same index lock.

The production-safe approach combines deterministic partitioning with application-level sharding. Hash the primary business key and route each record to a worker by the hash value modulo the worker count, so identical keys are always processed by the same worker and cross-process MERGE contention on that key disappears.

Routing is a pure function of the key, so a record always lands in the same lane — cross-process MERGE contention disappears and re-runs upsert rather than duplicate.

python

import hashlib

def worker_for(biz_key: str, num_workers: int) -> int:
    """Deterministically map a business key to exactly one worker lane."""
    digest = hashlib.sha256(str(biz_key).encode()).hexdigest()
    # Same key -> same worker, every run, across every process.
    return int(digest, 16) % num_workers

Because the routing is a pure function of the key, the scheme is idempotent by construction: re-running the whole migration sends every record back to the same lane, and the underlying MERGE upserts rather than duplicates. When key-based partitioning is not feasible — for example when workers consume an unpartitioned queue — you must fall back to the remediation patterns in Resolving duplicate nodes during parallel batch loads, which cover staging nodes, two-phase commit, and constraint-aware upsert logic.
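As an illustration of the routing in use, the sketch below groups records into per-worker lanes. `partition` and `key_field` are hypothetical names, and `worker_for` is repeated so the example runs on its own. It materializes the lanes in memory for clarity; a streaming pipeline would more likely push each record onto a per-worker queue.

```python
import hashlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def worker_for(biz_key: Any, num_workers: int) -> int:
    """Deterministically map a business key to exactly one worker lane."""
    digest = hashlib.sha256(str(biz_key).encode()).hexdigest()
    return int(digest, 16) % num_workers


def partition(
    records: Iterable[Dict[str, Any]], num_workers: int, key_field: str = "id"
) -> Dict[int, List[Dict[str, Any]]]:
    """Group records by lane so each worker only ever sees its own keys."""
    lanes: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        lanes[worker_for(record[key_field], num_workers)].append(record)
    return dict(lanes)
```

A cryptographic digest is used here because Python's built-in `hash()` for strings is randomized per process, which would break the "same key, same lane" property across workers.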

Performance & Scale Considerations

Throughput is governed by three interacting budgets: the driver connection pool, the server heap, and the index selectivity of your MERGE predicate.

Index selectivity first. An unconstrained MERGE (n:Entity {id: …}) performs a label scan whose cost grows with the node count already loaded, making the load quadratic. The uniqueness constraint above converts it to an index seek — this single change often matters more than chunk size.
Right-size the pool. Set max_connection_pool_size to at least the number of concurrent worker sessions plus headroom; a pool starved below the worker count serializes the very parallelism you built. Pair it with connection_acquisition_timeout so a starved pool fails fast instead of hanging.
Suspend nonessential indexing during bulk load. For a cold initial load, drop or defer secondary indexes and rebuild them afterward, verifying SHOW INDEXES reports ONLINE before reopening the graph to queries. The full regimen — heap sizing, apoc.periodic.iterate alternatives, and index rebuild ordering — is the subject of Initial Load Performance Tuning.
Align chunk size to heap. Chunk peak footprint should sit well under server.memory.heap.max_size (named dbms.memory.heap.max_size before Neo4j 5); a chunk that approaches the ceiling provokes GC pauses that surface as sporadic transaction timeouts (Neo.ClientError.Transaction.TransactionTimedOut) even though the query itself is sound.
Instrument every chunk. Record per-transaction timing, record count, and retry attempts, and export them via OpenTelemetry or Prometheus. Chunk-level telemetry is what turns a mysterious “the load got slow” into “chunks after row 4M spill the page cache,” which is an actionable finding.
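The instrumentation point can be sketched as a thin wrapper around the chunk write. Everything here is an assumption for illustration: `ChunkMetric`, `timed_write`, and the plain list used as a sink are hypothetical, and a real pipeline would forward the same fields to its OpenTelemetry or Prometheus exporter.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChunkMetric:
    index: int      # position of the chunk in the load
    records: int    # number of rows in the chunk
    seconds: float  # wall-clock time spent in the write call
    ok: bool        # False when the write raised


def timed_write(
    index: int,
    chunk: List,
    write_chunk: Callable[[List], None],  # hypothetical stand-in for the transaction
    sink: List[ChunkMetric],
) -> None:
    """Run one chunk write and record its timing whether it succeeds or fails."""
    start = time.perf_counter()
    ok = False
    try:
        write_chunk(chunk)
        ok = True
    finally:
        # The metric is recorded on both paths; a failure still propagates.
        sink.append(ChunkMetric(index, len(chunk), time.perf_counter() - start, ok))
```

Recording failed chunks as well as successful ones keeps the timeline complete, which is what makes a slowdown traceable to a specific row range.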

Known Pitfalls

1. Consuming the result after the transaction commits. Returning the raw Result cursor from execute_write and reading it afterward raises ResultConsumedError, because the managed transaction has already closed the stream. Always materialize what you need (.single(), .data()) inside the transaction function, as in step 2.

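2. Blind retries without idempotency. Managed transactions such as execute_write re-run the transaction function on transient errors, and a resumed pipeline re-drives whole chunks. Replaying a CREATE-based statement whose earlier attempt actually committed re-creates its nodes. The fix is to make the chunk statement an upsert keyed on the constrained business key: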

python

# CREATE is NOT safe to retry; MERGE on a constrained key is.
BAD  = "UNWIND $records AS row CREATE (n:Entity {id: row.id}) SET n += row.properties"
GOOD = "UNWIND $records AS row MERGE (n:Entity {id: row.id}) SET n += row.properties"

Never rely on implicit session retries alone; wrap writes so a replayed chunk converges to the same graph state. This is the same idempotency contract described in Error Handling & Rollback Mechanisms.

3. Oversized chunks masquerading as a “slow database.” A 200,000-record transaction that times out is not a Neo4j capacity problem; it is a chunk-size problem. Transaction state and the lock set grow with operation count, so the symptom (a Neo.ClientError.Transaction.TransactionTimedOut error or GC thrash) is fixed by lowering chunk_size, not by raising the timeout. Raising the timeout only lengthens the lock hold and widens the blast radius.

4. Partial commits after a network split. If the client loses the connection after the server has committed, the client cannot tell whether the commit succeeded (the Python driver reports this case as IncompleteCommit), and a naive resume re-drives the chunk. With a uniqueness constraint plus MERGE this is harmless, but without one it duplicates the entire window. Before proceeding after any mid-load network failure, run a post-commit reconciliation query to detect drift, and consult the official transaction lifecycle management guidance for retry configuration.

Phased Cutover Using Chunked Pipelines

Chunking workflows naturally support phased cutovers. Run the initial load in shadow mode, validate parity against source counts, then execute incremental delta syncs using timestamp-based filters. Once the graph reaches steady state, route application traffic to Neo4j and schedule snapshot routines. Note that neo4j-admin database dump requires the database to be stopped; online snapshots use neo4j-admin database backup, an Enterprise Edition feature. The final cutover reuses the same chunked pipeline with reduced batch sizes to minimize lock contention during active traffic. With strict transaction boundaries, deterministic sharding, and chunk-level observability, teams can migrate legacy datasets with predictable memory consumption, verifiable parity with the source, and a controlled operational handoff.
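The timestamp-based delta step can be sketched as a watermark filter. The `updated_at` field name and the `select_delta` helper are assumptions for illustration, not part of the original pipeline. The comparison is deliberately inclusive, so rows that share the watermark timestamp are resent, which is safe only because the chunk statement is an idempotent MERGE.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple


def select_delta(
    rows: Iterable[Dict[str, Any]],
    watermark: datetime,
    ts_field: str = "updated_at",  # assumed column name
) -> Tuple[List[Dict[str, Any]], datetime]:
    """Return rows changed at or after the watermark, plus the next watermark."""
    delta = [row for row in rows if row[ts_field] >= watermark]
    # With no changed rows the watermark stays where it was.
    new_watermark = max((row[ts_field] for row in delta), default=watermark)
    return delta, new_watermark
```

The selected rows can then be fed through the same chunked pipeline as the initial load, and the returned watermark persisted only after those chunks commit.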

Up: Automated Data Migration from Relational & JSON Sources — the parent guide this workflow belongs to.
Relational Schema Mapping Strategies — shapes the records that chunking transports.
JSON Document Flattening & Graph Conversion — normalizes nested payloads into UNWIND-ready rows.
Initial Load Performance Tuning — heap, index, and cache tuning for cold bulk loads.
Resolving duplicate nodes during parallel batch loads — remediation when partitioning cannot prevent contention.
