Resolving duplicate nodes during parallel batch loads

You are compressing a migration window by fanning ingestion across a pool of worker threads or processes, and the resulting graph now contains two, three, or more Customer nodes that should have been a single entity. This page solves exactly that: how to stop concurrent MERGE statements from racing each other into duplicate CREATEs during a batch-processing workflow, and how to safely collapse the duplicates that a previous unsafe load already left behind. The fix has three moving parts — an index-backed uniqueness constraint that exists before the first worker starts, deterministic routing so identical keys never reach two workers, and idempotent batch statements that converge on re-run — plus a validated apoc.refactor.mergeNodes cleanup for the mess you inherited.

Prerequisites

Neo4j 5.x with the neo4j Python driver 5.x installed (pip install "neo4j>=5,<6").
APOC Core available on the server (required for the apoc.periodic.iterate + apoc.refactor.mergeNodes cleanup below).
A stable business key per node — an email, order_id, or natural key drawn from your node label taxonomy, never the internal element id, which is not stable across a re-run.
Permission to run DDL (CREATE CONSTRAINT) and a verified pre-load snapshot taken with neo4j-admin database dump.
Records already shaped by Relational Schema Mapping Strategies or JSON Document Flattening & Graph Conversion, so every row carries a populated business key.

Why parallel MERGE produces duplicates

Neo4j’s default isolation level (READ_COMMITTED) means each statement sees only committed data at the moment it runs. When two workers execute MERGE (c:Customer {email: $e}) against the same email at the same time and no uniqueness constraint exists, each MERGE performs a label scan, finds no committed match (the sibling’s write is still in flight), and falls through to CREATE. Both transactions then commit, and the graph holds two nodes for one identity. The hazard compounds when the driver silently retries a transient timeout without an idempotency guard, re-running CREATE logic for a row a sibling already committed.

There are two independent defenses, and a robust load uses both: a database-layer constraint that serializes writes on the same key, and application-layer routing that guarantees the same key is never handed to two workers in the first place.
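The race described above can be reproduced without a database. The sketch below is an in-memory model only, not Neo4j itself: a plain list stands in for the graph, a `threading.Barrier` forces the unlucky interleaving in which every worker reads before any worker writes, and a `threading.Lock` stands in for the serialization that a uniqueness constraint provides. All function names are hypothetical.

```python
import threading


def _run_workers(merge, workers: int) -> None:
    """Start `workers` threads running `merge` and wait for all of them."""
    threads = [threading.Thread(target=merge) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def unconstrained_merge(key: str, workers: int = 2) -> int:
    """Check-then-create with no lock; returns how many nodes were created."""
    store: list[dict] = []
    barrier = threading.Barrier(workers)

    def merge() -> None:
        # The MATCH half: look for an existing node with this key.
        found = any(node["email"] == key for node in store)
        # Force the race: every worker finishes reading before any writes.
        barrier.wait()
        if not found:
            store.append({"email": key})  # the CREATE half

    _run_workers(merge, workers)
    return len(store)


def constrained_merge(key: str, workers: int = 2) -> int:
    """Same logic, but read and write happen under one lock."""
    store: list[dict] = []
    key_lock = threading.Lock()  # stand-in for the constraint's serialization

    def merge() -> None:
        with key_lock:
            if not any(node["email"] == key for node in store):
                store.append({"email": key})

    _run_workers(merge, workers)
    return len(store)


print(unconstrained_merge("a@example.com"))  # 2: one node per racing worker
print(constrained_merge("a@example.com"))    # 1: the second worker sees the first
```

The model is deliberately pessimistic: in a real load the interleaving is a matter of timing, which is why the duplicates tend to appear only under load.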

Core implementation

The whole prevention strategy fits in one place: create the constraint, route deterministically, then run each worker’s chunk as an idempotent UNWIND … MERGE. The comments below mark the three load-bearing decisions.

python

import hashlib
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

# 1) CONSTRAINT FIRST. This must be ONLINE before any worker starts. The backing
#    RANGE index makes MERGE take an index-backed write lock, so concurrent MERGEs
#    on the same email serialize instead of racing into duplicate CREATEs.
CONSTRAINT = """
CREATE CONSTRAINT customer_email_unique IF NOT EXISTS
FOR (c:Customer) REQUIRE c.email IS UNIQUE
"""

# 2) DETERMINISTIC ROUTING. A pure function of the key: the same email always maps
#    to the same worker, on every run and across every process. Identical keys can
#    therefore never reach two workers, so cross-worker MERGE contention disappears.
def worker_for(business_key: str, num_workers: int) -> int:
    digest = hashlib.sha256(str(business_key).encode()).hexdigest()
    return int(digest, 16) % num_workers

# 3) IDEMPOTENT BATCH. One UNWIND per chunk so the planner compiles one reusable
#    plan; MERGE (not CREATE) so a retried or replayed chunk converges instead of
#    duplicating. Each worker opens its OWN session — Session is NOT thread-safe.
BATCH = """
UNWIND $records AS row
MERGE (c:Customer {email: row.email})
SET c += row.properties
"""

def load_shard(driver, records, chunk_size=10_000):
    # A fresh session per worker; the driver instance itself IS thread-safe.
    with driver.session(database="neo4j") as session:
        # Chunks of one lane run one after another on this worker, so a key
        # never has two concurrent writers and no transaction grows unbounded.
        for i in range(0, len(records), chunk_size):
            chunk = records[i:i + chunk_size]
            session.execute_write(lambda tx, c=chunk: tx.run(BATCH, records=c).consume())

def run_parallel_load(uri, auth, all_records, num_workers=8, chunk_size=10_000):
    with GraphDatabase.driver(uri, auth=auth,
                              max_connection_pool_size=num_workers + 4) as driver:
        # Create + wait for the constraint's index to come ONLINE before loading.
        with driver.session(database="neo4j") as s:
            s.run(CONSTRAINT).consume()
            s.run("CALL db.awaitIndexes(300)").consume()

        # Route every record to exactly one worker lane by its business key.
        lanes: dict[int, list] = {w: [] for w in range(num_workers)}
        for r in all_records:
            lanes[worker_for(r["email"], num_workers)].append(r)

        # Fan out one task per lane. Submitting each chunk as its own task would
        # let two chunks of the same lane run concurrently and undo the routing.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            futures = [pool.submit(load_shard, driver, records, chunk_size)
                       for records in lanes.values() if records]
            for f in futures:
                f.result()  # re-raise any worker exception on the main thread

Two implementation details are easy to get wrong. First, the db.awaitIndexes call is not optional: CREATE CONSTRAINT can return before its backing index is ONLINE, and a MERGE issued against a still-populating index falls back to a label scan and can still race. Second, each worker must open its own session inside load_shard. The driver documents Session as not thread-safe, so sharing one across the ThreadPoolExecutor can interleave requests on a single Bolt connection and corrupt the stream.

Retries are safe because two properties work together. worker_for is deterministic, so replaying the entire migration sends every record back to the same lane, and the batch statement is a MERGE on the constrained key, so the replay upserts rather than duplicates. This is the same idempotency contract described in Error Handling & Rollback Mechanisms and detailed in implementing idempotent migration scripts.
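If you would rather gate the load by polling than by a single blocking db.awaitIndexes call, the loop can be sketched as follows. This is a minimal sketch under assumptions: `wait_for_indexes` is a hypothetical helper, `session` is any object whose `run()` yields rows addressable by column name (as a driver session does), and the state names ONLINE, POPULATING, and FAILED are the ones SHOW INDEXES reports in Neo4j 5.

```python
import time

# Lists every index that is not yet usable by the planner.
PENDING_INDEXES = (
    "SHOW INDEXES YIELD name, state "
    "WHERE state <> 'ONLINE' RETURN name, state"
)


def wait_for_indexes(session, timeout_s: float = 300.0, poll_s: float = 2.0,
                     sleep=time.sleep, clock=time.monotonic) -> None:
    """Block until no index is pending; raise on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        pending = [(row["name"], row["state"]) for row in session.run(PENDING_INDEXES)]
        if not pending:
            return  # every index is ONLINE; safe to release the workers
        failed = [name for name, state in pending if state == "FAILED"]
        if failed:
            # A FAILED index never comes online by waiting; stop immediately.
            raise RuntimeError(f"index population failed: {failed}")
        if clock() >= deadline:
            raise TimeoutError(f"indexes still not ONLINE: {pending}")
        sleep(poll_s)
```

Failing fast on a FAILED index is a design choice in this sketch: waiting out the full timeout on an index that cannot recover only delays the diagnosis.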

Hash partitioning routes disjoint keys to dedicated workers before they reach the constraint-backed graph.

Routing (a pure function of the key) keeps identical keys off separate workers; the uniqueness constraint is the database-layer backstop that serializes any write that still collides.
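The routing guarantee is cheap to check before a load. The sketch below repeats the article's worker_for so that it runs on its own; `partition` and `lane_overlaps` are hypothetical helpers, and the sample emails are made up. It also shows why the built-in `hash()` is a poor substitute: Python salts string hashes per process unless PYTHONHASHSEED is fixed, so lane assignments would change between runs.

```python
import hashlib
from collections import defaultdict


def worker_for(business_key: str, num_workers: int) -> int:
    """Stable lane index derived from a SHA-256 digest of the key."""
    digest = hashlib.sha256(str(business_key).encode()).hexdigest()
    return int(digest, 16) % num_workers


def partition(records: list[dict], num_workers: int) -> dict[int, list[dict]]:
    """Group records into lanes; every record lands in exactly one lane."""
    lanes: dict[int, list[dict]] = defaultdict(list)
    for record in records:
        lanes[worker_for(record["email"], num_workers)].append(record)
    return lanes


def lane_overlaps(lanes: dict[int, list[dict]]) -> set[str]:
    """Return the keys that appear in more than one lane (should be empty)."""
    seen: dict[str, int] = {}
    overlaps: set[str] = set()
    for lane, records in lanes.items():
        for record in records:
            key = record["email"]
            if seen.setdefault(key, lane) != lane:
                overlaps.add(key)
    return overlaps


records = [{"email": f"user{i}@example.com"} for i in range(200)]
records.append({"email": "user7@example.com"})  # a repeated key in the source
lanes = partition(records, num_workers=8)
print(lane_overlaps(lanes))  # set(): the repeated key shares a lane with its twin
```

A repeated key in the source is not a problem for routing: both copies land in the same lane, where the MERGE turns the second one into an update.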

Validation & verification

Never trust a parallel load blind — confirm both that the constraint is live and that no duplicates survived before you retire the source system.

First, prove the constraint is index-backed. A null ownedIndex or an absent row means MERGE was scanning, not seeking:

cypher

SHOW CONSTRAINTS YIELD name, type, labelsOrTypes, properties, ownedIndex
WHERE type = 'UNIQUENESS';

Second, quantify any surviving duplicates by business key. A healthy load returns zero rows:

cypher

MATCH (n:Customer)
WITH n.email AS identifier, count(*) AS occurrences
WHERE occurrences > 1
RETURN identifier, occurrences
ORDER BY occurrences DESC
LIMIT 50;
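To make the duplicate scan a hard gate in the pipeline rather than a manual step, it can be wrapped in a small helper. This is a sketch under assumptions: `find_duplicates` and `assert_no_duplicates` are hypothetical names, and `session` is any object whose `run(query, **params)` yields rows addressable by column name, as a driver session does.

```python
# The same duplicate scan as above, with the row limit passed as a parameter.
DUPLICATE_CHECK = """
MATCH (n:Customer)
WITH n.email AS identifier, count(*) AS occurrences
WHERE occurrences > 1
RETURN identifier, occurrences
ORDER BY occurrences DESC
LIMIT $limit
"""


def find_duplicates(session, limit: int = 50) -> list[tuple]:
    """Return (identifier, occurrences) pairs for keys held by several nodes."""
    result = session.run(DUPLICATE_CHECK, limit=limit)
    return [(row["identifier"], row["occurrences"]) for row in result]


def assert_no_duplicates(session) -> None:
    """Raise if the load left any business key on more than one node."""
    duplicates = find_duplicates(session)
    if duplicates:
        raise AssertionError(f"duplicate business keys survive: {duplicates[:5]}")
```

Calling `assert_no_duplicates` right after `run_parallel_load` returns turns a silent data-quality problem into a failed job.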

Third, run EXPLAIN over your ingestion query and confirm the plan shows a NodeUniqueIndexSeek (or NodeIndexSeek) rather than a NodeByLabelScan with a Filter step — a scan is the fingerprint of a missing or not-yet-online constraint.
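The plan inspection can also be automated. The sketch below walks a plan tree and reports whether it contains an index seek and no label scan. It assumes the plan is a nested dict with "operatorType" and "children" keys, which is how the Python driver appears to expose it on the result summary; verify that shape against your driver version. The function names are hypothetical.

```python
def operator_types(plan: dict) -> list[str]:
    """Flatten a plan tree into a list of operator names, root first."""
    operators = [plan.get("operatorType", "")]
    for child in plan.get("children", []):
        operators.extend(operator_types(child))
    return operators


def merge_uses_index_seek(plan: dict) -> bool:
    """True when the plan seeks an index and never scans the whole label."""
    operators = operator_types(plan)
    has_seek = any("IndexSeek" in op for op in operators)
    has_scan = any(op.startswith("NodeByLabelScan") for op in operators)
    return has_seek and not has_scan


# Assumed usage with a live session (not executed here):
#   summary = session.run("EXPLAIN " + BATCH, records=[]).consume()
#   if not merge_uses_index_seek(summary.plan):
#       raise RuntimeError("MERGE is scanning; is the constraint ONLINE?")
```

Matching on the substring "IndexSeek" rather than an exact name keeps the check tolerant of suffixes such as "(Locking)" that the planner appends to operator names.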

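If duplicates did slip through from an earlier unsafe run, collapse them before you create the uniqueness constraint, because CREATE CONSTRAINT fails while duplicate values for the constrained property still exist. The APOC cleanup below is idempotent and batched so that the cleanup itself cannot blow the heap: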

cypher

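// Skip nodes without an email: grouping on a null key would otherwise
// collect every email-less Customer into one group and merge them all.
CALL apoc.periodic.iterate(
  "MATCH (n:Customer)
   WHERE n.email IS NOT NULL
   WITH n.email AS email, collect(n) AS nodes
   WHERE size(nodes) > 1
   RETURN nodes",
  "CALL apoc.refactor.mergeNodes(nodes, {properties: 'combine', mergeRels: true})
   YIELD node RETURN count(*) AS merged",
  {batchSize: 5000, parallel: false}
);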

Keep parallel: false here. Merging nodes that share relationships concurrently reintroduces exactly the contention you are trying to remove. Deeper structural reconciliation (property-type parity and source checksums) belongs to Data Validation & Integrity Checks.
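One practical caveat when driving this cleanup from Python: apoc.periodic.iterate reports failed batches in its result row instead of raising, so a caller that ignores the row can miss a partial cleanup. The sketch below is a hedged illustration; `run_cleanup` is a hypothetical helper, `session` is assumed to behave like a driver session, and the column names (batches, total, failedBatches, errorMessages) are the ones the APOC documentation lists for this procedure.

```python
def run_cleanup(session, cleanup_cypher: str) -> dict:
    """Run an apoc.periodic.iterate statement and fail loudly on bad batches."""
    row = session.run(cleanup_cypher).single()
    if row is None:
        raise RuntimeError("cleanup returned no summary row")
    stats = {
        "batches": row["batches"],
        "total": row["total"],
        "failedBatches": row["failedBatches"],
        "errorMessages": row["errorMessages"],
    }
    if stats["failedBatches"]:
        # Some duplicate groups were not merged; surface the server's messages.
        raise RuntimeError(
            f"{stats['failedBatches']} cleanup batch(es) failed: {stats['errorMessages']}"
        )
    return stats
```

Because the cleanup only touches keys that still have more than one node, re-running it after fixing the cause of a failed batch picks up where it left off.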

mergeNodes repairs a key that has already split as follows: three Customer nodes sharing one email, each holding a fraction of the relationships, collapse into a single survivor that keeps every edge (mergeRels: true) and the union of properties (properties: 'combine').

One survivor node inherits every relationship and the union of properties from the duplicates; the batched, non-parallel iterate keeps the collapse from re-contending on shared edges.

Edge cases & gotchas

1. The constraint index is not ONLINE when the load starts. CREATE CONSTRAINT completes its DDL before its backing index finishes populating on a non-empty database. Any MERGE in that window scans and can still duplicate. Always gate the load on CALL db.awaitIndexes(300) (or poll SHOW INDEXES YIELD state WHERE state <> 'ONLINE' until it returns empty) before releasing the workers.

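2. Composite or multi-label identity that a single-property constraint misses. If a Customer is unique only by (tenant_id, email), a uniqueness constraint on email alone rejects legitimate rows that reuse an email across tenants, and it can still permit duplicates within a tenant if you MERGE on a key the constraint does not cover. Declare the real identity as a node key so the constraint matches your MERGE predicate exactly. Node key constraints require Enterprise Edition; on Community Edition, a composite uniqueness constraint (REQUIRE (c.tenant_id, c.email) IS UNIQUE) provides the same duplicate protection without the property-existence guarantee: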

cypher

CREATE CONSTRAINT customer_tenant_email IF NOT EXISTS
FOR (c:Customer) REQUIRE (c.tenant_id, c.email) IS NODE KEY;

3. Blind driver retries around a CREATE. The driver's managed transaction functions retry transient errors. If the first attempt committed before the failure surfaced to the client, a retried CREATE can add a second node for a row that already exists. The corrective is to keep every write an upsert on the constrained key, and never to let a CREATE sit in a retryable transaction function:

python

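# BAD: a replayed chunk adds a second node per row (or fails on the constraint).
BAD  = "UNWIND $records AS row CREATE (c:Customer {email: row.email}) SET c += row.properties"
# GOOD: a replayed chunk matches the existing node and only updates its properties.
GOOD = "UNWIND $records AS row MERGE (c:Customer {email: row.email}) SET c += row.properties"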

Modeling MERGE on a mutable property instead of a stable key is itself a property graph anti-pattern; anchor identity on a value that never changes across re-runs.
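The difference between the two statements under replay can be shown with a small in-memory model. This is an illustration of the idempotency property only, not of Neo4j's behavior in detail: `apply_create` and `apply_merge` are hypothetical stand-ins for the BAD and GOOD statements, and the sample rows are made up.

```python
def apply_create(store: list[dict], rows: list[dict]) -> None:
    """Model of CREATE: every row always appends a new node."""
    for row in rows:
        store.append({"email": row["email"], **row["properties"]})


def apply_merge(store: dict[str, dict], rows: list[dict]) -> None:
    """Model of MERGE on email: match the node by key, then update it."""
    for row in rows:
        node = store.setdefault(row["email"], {"email": row["email"]})
        node.update(row["properties"])


chunk = [
    {"email": "a@example.com", "properties": {"name": "Ada"}},
    {"email": "b@example.com", "properties": {"name": "Bo"}},
]

created: list[dict] = []
merged: dict[str, dict] = {}
for _attempt in range(2):  # the second pass models a retried chunk
    apply_create(created, chunk)
    apply_merge(merged, chunk)

print(len(created))  # 4: the retry doubled every node
print(len(merged))   # 2: the retry converged on the same two nodes
```

The model also makes the stable-key requirement concrete: if the value used as the dictionary key changed between attempts, the second pass would miss the existing entry and create a new one.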

Parent context

This task is one remediation within the broader Batch Processing & Chunking Workflows stage of Automated Data Migration from Relational & JSON Sources — reach for it when deterministic partitioning alone cannot keep identical keys off separate workers.

Up: Batch Processing & Chunking Workflows — the chunked, sharded ingestion pipeline this fix plugs into.

Related

Implementing idempotent migration scripts for Neo4j — the upsert discipline that makes retries and replays safe.
Data Validation & Integrity Checks — post-load reconciliation that catches drift a duplicate scan would miss.
Initial Load Performance Tuning — heap, index, and pool sizing once duplicates are eliminated at the ingestion layer.
