Relational Schema Mapping Strategies

Transitioning a normalized relational database into Neo4j is a schema design decision, not a data copy. Mechanical table-to-node conversions fail at scale: they preserve foreign-key columns as properties, leave junction tables as anemic nodes, and produce a graph that traverses no faster than the joins it replaced. The engineering task this guide addresses is how to translate relational structures — tables, foreign keys, junction tables, nullable columns, composite keys — into labeled nodes and typed relationships deterministically, so that a rerun of the migration produces exactly the same graph and referential integrity survives the crossing. These mapping decisions are the first stage of any automated migration pipeline: every downstream chunking, validation, and cutover step inherits the topology chosen here.

Prerequisite concepts

Before applying the rules below, the reader should already have the following in place:

The parent architecture from Automated Data Migration from Relational & JSON Sources — schema mapping is its extract → map stage, and every later stage assumes the topology it produces.
A settled node label taxonomy, so each source table maps to a stable domain label rather than an ad-hoc one invented at load time.
A relationship cardinality and direction policy, since a foreign key carries direction and multiplicity that must survive translation.
Neo4j 5.x with the neo4j Python driver v5+, plus read access to the source information_schema for constraint discovery.

If the export also carries semi-structured payloads, the normalization contract from JSON Document Flattening & Graph Conversion must be settled first — nested arrays and object hierarchies have to be decomposed into rows or subgraphs before topology materialization begins.

Conceptual model: tables and foreign keys become nodes and edges

Relational models encode associations three ways: foreign keys, junction tables, and nullable columns. In a property graph each becomes an explicit structure: business entities map to labeled nodes, referential links map to typed directed relationships, and scalar attributes map to node or edge properties. The baseline rule is deterministic and has no exceptions on the happy path: a foreign key is never a property.

The diagram below shows how two relational tables and their foreign key translate into labeled nodes and a typed relationship.

Design rules / decision matrix

Every relational construct has one canonical target. The table below is the mapping contract; the guiding principle is that the source’s intent (identity, association, attribute) determines the graph structure, not the source’s physical layout.

Relational construct	Graph target	Rule
Table with a business identity	Labeled node	One label per stable domain type; key on a business key, not the surrogate row id
One-to-many foreign key	Directed relationship	Edge points from the “one” to the “many” (`:PLACED`), FK column dropped
Many-to-many junction table	Direct relationship with properties	Junction row becomes the edge; its non-key columns become edge properties
Composite primary key	Deterministic business key or `NODE KEY`	Flatten to `tenant_id::order_id`, or declare a composite `NODE KEY` constraint
Polymorphic association (`entity_type` + `entity_id`)	Explicit typed relationships	Replace the discriminator with concrete types (`:AUTHORED_ARTICLE`, `:COMMENTED_POST`)
Self-referencing foreign key	Directed edge with depth metadata	`:MANAGES {level: 1}`; enforce acyclicity where the domain requires it
Nullable foreign key	Optional relationship (absence, not `null`)	Create the edge only when the value is present; never store a `null` endpoint
Lookup / enum table	Node property or small shared node	Low-cardinality enums become properties; shared reference data becomes a `MERGE`d node

Two rules override the table. First, resolve every structural edge case before ingestion — a composite key discovered mid-load forces a schema change under traffic. Second, model the junction table’s payload on the relationship, not on a synthetic node: an order_items row with quantity and unit_price becomes (:Order)-[:CONTAINS {quantity, unit_price}]->(:Product), keeping the traversal one hop instead of two.
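Resolving edge cases before ingestion can be sketched as a pre-pass over the extracted rows. The example below is a minimal illustration under stated assumptions, not a definitive implementation: the `REL_TYPE_BY_ENTITY` mapping, the `entity_type` values, and the `split_by_relationship` helper are hypothetical names, and only the two relationship types come from the decision matrix. It applies the polymorphic and nullable-foreign-key rules together: rows are grouped by concrete relationship type, rows with a null endpoint produce no edge, and unmapped discriminators are set aside for remediation.

```python
from collections import defaultdict

# Hypothetical discriminator values; the relationship types are the ones
# named in the decision matrix.
REL_TYPE_BY_ENTITY = {
    "article": "AUTHORED_ARTICLE",
    "post": "COMMENTED_POST",
}


def split_by_relationship(rows):
    """Group polymorphic rows by the concrete relationship type they map to.

    Returns (grouped, rejected): grouped maps a relationship type to its rows,
    rejected holds rows whose entity_type has no mapping.
    """
    grouped = defaultdict(list)
    rejected = []
    for row in rows:
        if row.get("entity_id") is None:
            # Nullable FK: absence of an edge, never a null endpoint.
            continue
        rel_type = REL_TYPE_BY_ENTITY.get(row.get("entity_type"))
        if rel_type is None:
            rejected.append(row)
        else:
            grouped[rel_type].append(row)
    return dict(grouped), rejected
```

Each group can then be loaded by its own statement with a literal relationship type, which keeps the discriminator column out of the graph entirely.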

The hardest construct to picture is the junction table. The diagram below traces where each of its columns lands: the two foreign keys become the edge’s endpoints, and the remaining columns become properties on that single edge — no intermediate node survives.

Step-by-step implementation

Step 1 — Discover source constraints from `information_schema`

Do not hand-transcribe the foreign keys. Query the source catalog so the mapping is generated from ground truth and stays in sync when the schema changes. Parsing information_schema metadata (described in the PostgreSQL documentation) yields the constraint set that drives idempotent Cypher generation.

python

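# Discover FK constraints so relationship generation is metadata-driven, not hand-coded.
# The joins include the constraint schema because constraint names can repeat across schemas.
# A multi-column FK returns every pairing of its columns, so group rows by
# constraint_name before mapping composite keys.
FK_DISCOVERY = """
SELECT
    tc.constraint_name   AS constraint_name,
    tc.table_name        AS child_table,
    kcu.column_name      AS fk_column,
    ccu.table_name       AS parent_table,
    ccu.column_name      AS parent_key
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
     ON tc.constraint_name = kcu.constraint_name
    AND tc.constraint_schema = kcu.constraint_schema
JOIN information_schema.constraint_column_usage ccu
     ON tc.constraint_name = ccu.constraint_name
    AND tc.constraint_schema = ccu.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY';
"""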

Each returned row is one relationship type in the target graph: child_table and parent_table name the two node labels, and fk_column is the join key that becomes the edge — never a property.
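As a rough sketch of how a discovered row could drive Cypher generation, the helper below renders one single-column foreign key as a parameterized edge-building statement. The function name `fk_to_cypher` and its three lookup arguments (`labels`, `child_keys`, `rel_types`) are assumptions made for illustration; the discovery query does not return the child table's own key or a relationship name, so both have to be supplied by the mapping contract.

```python
def fk_to_cypher(fk, labels, child_keys, rel_types):
    """Render one FK discovery row as an UNWIND/MATCH/MERGE statement.

    fk         -- dict with child_table, fk_column, parent_table, parent_key
    labels     -- table name -> node label (hypothetical mapping)
    child_keys -- child table name -> its business key property (hypothetical)
    rel_types  -- (child_table, fk_column) -> relationship type (hypothetical)

    Identifiers are interpolated into the text, so they must come from the
    trusted catalog, never from row data. Row values travel as $rows.
    """
    parent_label = labels[fk["parent_table"]]
    child_label = labels[fk["child_table"]]
    parent_key = fk["parent_key"]
    fk_column = fk["fk_column"]
    child_key = child_keys[fk["child_table"]]
    rel_type = rel_types[(fk["child_table"], fk_column)]
    return "\n".join([
        "UNWIND $rows AS row",
        f"MATCH (p:{parent_label} {{{parent_key}: row.{fk_column}}})",
        f"MATCH (c:{child_label} {{{child_key}: row.{child_key}}})",
        # Edge points from the "one" side (parent) to the "many" side (child).
        f"MERGE (p)-[:{rel_type}]->(c)",
    ])
```

For an `orders.customer_id` key referencing `customers.customer_id`, with labels `Order` and `Customer` and the type `PLACED`, this produces a statement that matches both endpoints by key and merges a `:PLACED` edge between them, with no foreign-key property written.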

Step 2 — Create constraints before any write

Uniqueness constraints on business keys are a correctness and performance prerequisite: they prevent duplicate nodes and make every MERGE an index seek. Create them idempotently so the migration is safe to rerun.

cypher

// Neo4j 5.x — idempotent DDL; safe to run on every pipeline start
CREATE CONSTRAINT customer_id_uniq IF NOT EXISTS
FOR (c:Customer) REQUIRE c.customer_id IS UNIQUE;

CREATE CONSTRAINT order_key IF NOT EXISTS
FOR (o:Order) REQUIRE (o.tenant_id, o.order_id) IS NODE KEY;
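When many tables are mapped, the DDL above can be generated rather than written by hand. The following is a minimal sketch, assuming a hypothetical `constraint_ddl` helper and a generated naming scheme (`<label>_<columns>_key`) that is not part of the original pipeline. A single key column yields a uniqueness constraint and several columns yield a `NODE KEY`; note that `NODE KEY` constraints require Neo4j Enterprise Edition.

```python
def constraint_ddl(label, key_columns):
    """Build an idempotent Neo4j 5.x constraint statement for one label.

    label       -- node label, e.g. "Order"
    key_columns -- list of business key properties, in order
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Hypothetical naming scheme, chosen only so names are deterministic.
    name = f"{label.lower()}_{'_'.join(key_columns)}_key"
    prefix = f"CREATE CONSTRAINT {name} IF NOT EXISTS FOR (n:{label}) REQUIRE "
    if len(key_columns) == 1:
        return prefix + f"n.{key_columns[0]} IS UNIQUE"
    columns = ", ".join(f"n.{column}" for column in key_columns)
    return prefix + f"({columns}) IS NODE KEY"
```

Running each generated statement at pipeline start keeps the constraint set in step with the mapping, since `IF NOT EXISTS` makes a repeat run a no-op.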

Step 3 — Load nodes and edges in one parameterized transaction

Production migrations cannot tolerate unbounded transaction scopes or per-row round trips. The neo4j Python driver 5.x provides managed transactions through session.execute_write(), which opens the transaction, commits it when the callback returns, and retries the callback on transient errors; connection pooling is handled by the driver object itself. Because a retry reruns the whole callback, the callback must be idempotent. Wrap mapping logic in a parameterized UNWIND so each chunk is a single compiled statement.

python

from neo4j import GraphDatabase
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("neo4j_migration")

def map_customer_orders(tx, batch):
    # One statement per chunk: UNWIND expands the batch server-side,
    # MERGE keys on the constrained business keys, and the FK becomes
    # the :PLACED edge rather than a stored property.
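    query = """
    UNWIND $batch AS row
    MERGE (c:Customer {customer_id: row.cust_id})
      SET c.name = row.cust_name, c.region = row.region
    // Merge on the full NODE KEY from Step 2 (tenant_id, order_id).
    MERGE (o:Order {tenant_id: row.tenant_id, order_id: row.order_id})
      SET o.total = row.total, o.status = row.status,
          o.created_at = datetime(row.created_ts)
    // Merge the bare pattern, then SET, so a changed quantity updates
    // the existing edge instead of creating a second one.
    MERGE (c)-[r:PLACED]->(o)
      SET r.quantity = row.qty
    """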
    # consume() drains the result and returns a ResultSummary with counters.
    summary = tx.run(query, batch=batch).consume()
    logger.info(f"Created {summary.counters.nodes_created} nodes in batch")

def execute_migration(driver, batches):
    with driver.session(database="neo4j") as session:
        for batch in batches:
            try:
                # execute_write manages the transaction lifecycle and retries
                # transient failures; a raised exception rolls the chunk back.
                session.execute_write(map_customer_orders, batch)
            except Exception as e:
                logger.error(f"Transaction failed for batch: {e}")
                raise

Because the node MERGE clauses key on constrained properties, the load is idempotent: a replayed chunk updates existing nodes instead of duplicating them, so a partial failure followed by a full rerun converges to the same graph.
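The `batches` argument of `execute_migration` is left open above. One way to produce it, sketched here with a hypothetical `chunked` helper, is a generator that slices any row iterable into fixed-size lists so the source never has to be materialized in full.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With this in place, a call such as `execute_migration(driver, chunked(source_rows, 5000))` streams one chunk per managed transaction; the chunk size of 5000 is only a placeholder to be tuned against the guidance in the performance section.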

Step 4 — Choose the right loader for the data volume

Driver-managed batching is the correct tool for incremental syncs and complex transformation logic. For very large single statements executed against a live database, CALL { ... } IN TRANSACTIONS OF 1000 ROWS splits the work into server-side sub-transactions. Issue it via session.run, not inside execute_write: the clause is only permitted in an implicit (auto-commit) transaction, and execute_write opens an explicit one. For cold loads exceeding roughly 100M records, neither path is appropriate; use neo4j-admin database import offline, then bring the validated database online. The full extraction and chunk-sizing discipline is covered in Batch Processing & Chunking Workflows.

The narrow mechanics of turning a specific PostgreSQL foreign key into a relationship — constraint generation, index alignment, and automated FK-to-edge translation — are worked end to end in migrating PostgreSQL foreign keys to Neo4j relationships automatically.

Constraint & validation layer

Constraints and ingestion-side checks are complementary. The constraint is the invariant the database will never let you violate; the ingestion check is the early, informative signal that routes a bad payload to remediation instead of hitting a hard database error mid-transaction.

Validation runs at three boundaries. Pre-load, verify primary-key uniqueness and referential integrity in the source and confirm the target constraints from Step 2 exist and are ONLINE. In-transaction, each session.execute_write() call is atomic per chunk — a violation rolls the exact chunk back with no partial writes, and the driver retries transient errors while propagating ClientError (constraint violations) to the caller. Post-load, compare source row counts against Neo4j node and relationship aggregates and probe for dangling endpoints a count check would miss.

cypher

// Post-load topology check: every Order must be placed by a Customer.
// A count match alone would not catch an orphaned Order.
MATCH (o:Order)
WHERE NOT (o)<-[:PLACED]-(:Customer)
RETURN count(o) AS orphaned_orders;

For critical financial or identity datasets, add checksum-based reconciliation on top of count comparison. The full three-gate method — including in-flight pre-flight queries and reconciliation status codes — is detailed in Data Validation & Integrity Checks.
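Checksum-based reconciliation can be sketched as follows. This is an illustrative outline, not the method from Data Validation & Integrity Checks: the `row_checksum` and `reconcile` helpers are hypothetical, and it assumes both sides have already been exported as dictionaries with values normalized to the same types (for example, `100` and `100.0` would hash differently).

```python
import hashlib
import json


def row_checksum(row, fields):
    """Hash the selected fields of one row in a fixed, canonical order."""
    canonical = json.dumps(
        [row.get(field) for field in fields],
        default=str,               # render dates, decimals, etc. as text
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows, graph_rows, key, fields):
    """Compare two row sets by business key and per-row checksum."""
    source = {row[key]: row_checksum(row, fields) for row in source_rows}
    graph = {row[key]: row_checksum(row, fields) for row in graph_rows}
    shared = source.keys() & graph.keys()
    return {
        "missing_in_graph": sorted(source.keys() - graph.keys()),
        "unexpected_in_graph": sorted(graph.keys() - source.keys()),
        "mismatched": sorted(k for k in shared if source[k] != graph[k]),
    }
```

A count comparison would pass whenever the number of missing keys equals the number of unexpected ones; the keyed report separates those cases and also flags rows whose values drifted while their counts stayed equal.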

Performance & scale considerations

Mapping choices decide throughput long before tuning does. The scaling variables specific to relational-to-graph translation are cardinality, index selectivity, and chunk size.

Constraints turn MERGE into a seek. Every MERGE on an unconstrained property is a NodeByLabelScan that degrades linearly with the graph. Run PROFILE on the load query and require NodeUniqueIndexSeek leaves on the keys you merge.
Batch with UNWIND, never per row. One parameterized statement per chunk collapses round trips and lets the planner reuse a single compiled plan; per-row MERGE calls multiply latency by the row count.
Size chunks against the transaction log and RAM. Chunk size trades round-trip amortization against rollback cost and heap pressure — typically 1,000–10,000 records per batch depending on relationship density. Denser edges mean smaller chunks. This is the same trade-off weighed in the batch-processing workflow.
Paginate the source by key, not by offset. Cursor-based extraction with ORDER BY primary_key guarantees deterministic, resumable pagination. Avoid LIMIT/OFFSET on large tables — it re-scans a growing prefix on every page. Prefer keyset pagination or database-native change data capture for incremental loads.
Watch junction-table fan-out. A many-to-many table collapsed into edges can create hub nodes with millions of relationships. Where that fan-out is inherent, align the load with a graph partitioning strategy so degree-sensitive queries stay bounded.
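The keyset pagination recommended above can be illustrated with a short sketch. It uses the standard-library `sqlite3` module purely as a stand-in for the real source database, and the `orders` table, its columns, and the `keyset_pages` helper are assumptions for the example; the same `WHERE key > last_seen ORDER BY key LIMIT n` shape applies to PostgreSQL with that driver's own placeholder syntax.

```python
import sqlite3


def keyset_pages(conn, page_size, start_after=0):
    """Yield pages of (order_id, total) rows ordered by primary key.

    Each page seeks past the last key seen instead of using OFFSET, so the
    cost per page stays flat and a crashed run can resume from start_after.
    """
    last_id = start_after
    while True:
        rows = conn.execute(
            "SELECT order_id, total FROM orders "
            "WHERE order_id > ? ORDER BY order_id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # checkpoint: highest key delivered so far
```

Persisting `last_id` after each committed chunk gives the pipeline a resume point: restarting with `start_after` set to the stored value continues from the first unprocessed row.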

Property typing also has a downstream performance cost: persist a timestamp with datetime() rather than as a string, or downstream range queries silently fall back to scans — a class of defect covered in graph data type selection.

Known pitfalls

Pitfall 1 — Foreign keys persisted as node properties

The most common mapping defect is a MERGE that writes SET o.customer_id = row.cust_id and never creates the edge. The graph then looks populated but traverses like the relational source, forcing every join back into a property lookup. Root cause: treating the FK column as an attribute rather than an association. Fix: drop the FK column from the property set and materialize it as a relationship, exactly as Step 3 does with :PLACED. Audit for the anti-pattern catalogued in property graph anti-patterns by scanning for id-shaped properties on nodes that should be endpoints.
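The audit for id-shaped properties can be approximated by comparing the property names found on each label with the foreign-key columns discovered in Step 1. The sketch below assumes a hypothetical `fk_shaped_properties` helper and two plain dictionaries as input; how the property names are collected from the graph is left open. Columns that are legitimately part of a node's own key (such as a `tenant_id` inside a composite key) should be removed from the foreign-key set before the check.

```python
def fk_shaped_properties(node_properties, fk_columns_by_label):
    """Report properties that duplicate a source foreign-key column.

    node_properties     -- label -> iterable of property names seen in the graph
    fk_columns_by_label -- label -> FK column names of the table behind that label
    """
    findings = {}
    for label, properties in node_properties.items():
        fk_columns = fk_columns_by_label.get(label, ())
        leaked = sorted(set(properties) & set(fk_columns))
        if leaked:
            findings[label] = leaked
    return findings
```

An empty result means no label carries a property named after one of its table's foreign keys; any entry points at a column that should have been materialized as a relationship instead.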

Pitfall 2 — Junction table imported as a node

Modeling order_items as an (:OrderItem) node connected to both Order and Product doubles the hop count of the most common query and buries the quantity/unit_price payload one level too deep. Root cause: mirroring the relational physical layout instead of the association it represents. Fix: collapse the junction row into a single edge with properties.

cypher

// Correct: the junction row becomes one edge carrying its own columns.
UNWIND $rows AS row
MATCH (o:Order   {order_id:   row.order_id})
MATCH (p:Product {product_id: row.product_id})
MERGE (o)-[r:CONTAINS]->(p)
  SET r.quantity = row.quantity, r.unit_price = row.unit_price;

Pitfall 3 — Non-deterministic keys break idempotency

Keying MERGE on a source surrogate row id, an auto-increment, or a value that differs between reruns creates a fresh node on every pass and silently duplicates the graph. Root cause: the merge key is not stable across executions. Fix: key on a business-meaningful value (or a deterministic composite like tenant_id::order_id) and back it with the UNIQUE/NODE KEY constraint from Step 2 so a replay converges rather than diverges.
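A deterministic composite key of the `tenant_id::order_id` form can be built with a small helper. This is a sketch under assumptions: the `business_key` name, the whitespace trimming, and the rejection of parts containing the separator are illustrative choices, not rules stated by the source schema.

```python
def business_key(*parts, sep="::"):
    """Join key parts into one stable string such as 'acme::1042'.

    Rejects null or empty parts and parts containing the separator, since
    either would let two different source keys collapse into one graph key.
    """
    if not parts:
        raise ValueError("at least one key part is required")
    normalized = []
    for part in parts:
        if part is None:
            raise ValueError("composite key parts must not be null")
        text = str(part).strip()
        if not text or sep in text:
            raise ValueError(f"invalid key part: {part!r}")
        normalized.append(text)
    return sep.join(normalized)
```

Because the output depends only on the input values, two runs over the same source rows produce the same keys, which is the property MERGE needs for a replay to converge.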

Pitfall 4 — Cutover without a validated snapshot

Decommissioning the source before proving fidelity is unrecoverable. Root cause: treating migration as a one-time dump rather than an observable pipeline. Fix: take a known-good neo4j-admin database dump before any legacy decommission, run initial load performance tuning and index/constraint creation before the bulk load, then run identical analytical queries against both systems in parallel until reconciliation thresholds are met. Only then route traffic exclusively to Neo4j and archive the source. As the mapping itself changes over time, keep it aligned with the schema evolution and versioning approach so a source contract change updates the migration and its checks together.

Automated Data Migration from Relational & JSON Sources — the parent reference this mapping stage plugs into.
Migrating PostgreSQL foreign keys to Neo4j relationships automatically — the FK-to-edge mechanics worked end to end.
JSON Document Flattening & Graph Conversion — the sibling stage for semi-structured payloads.
Batch Processing & Chunking Workflows — how the mapped load is chunked and paginated.
Data Validation & Integrity Checks — the three-gate validation the mapped topology is checked against.
