Graph Data Type Selection

Selecting the correct property type in Neo4j is a production engineering contract, not an abstract modeling exercise. The type you write on a property decides whether the query planner can seek an index or is forced into a scan, how many bytes the record consumes in the page cache, and whether a replayed migration converges or drifts. This guide addresses one concrete decision that recurs on every property in the schema: which Neo4j 5.x type each value should carry, and how to enforce that choice from the Python driver all the way down to the storage engine. Within a disciplined Neo4j graph schema design and architecture practice, type selection is the lowest-level interface between application state and Neo4j’s storage mechanics — get it wrong and you pay for it on every traversal, forever. Platform teams and Python engineers who fix deterministic type mapping at ingestion prevent schema drift, keep composite indexes eligible, and hold latency predictable under load.

Prerequisite concepts

Before applying the type rules below, the reader should already have these in place:

The parent reference, Neo4j Graph Schema Design & Architecture — type selection sits underneath every other schema decision it describes.
A settled node label taxonomy, since type constraints and composite indexes are declared per label and must apply consistently across label boundaries.
The relationship cardinality and directionality policy for any edge properties, because relationship-property types have to stay identical across every traversal path that reads them.
Neo4j 5.9 or later and the neo4j Python driver v5+, so property-type constraints (REQUIRE n.prop IS :: <TYPE>, an Enterprise Edition feature) and the driver’s temporal classes are available.

Type selection also assumes you treat the schema as versioned infrastructure. A property’s type is not free to change on a whim; a change is a migration, governed by the same discipline as schema evolution and versioning.

Conceptual model: fixed- vs variable-width storage

Neo4j’s property graph model is backed by a strongly typed storage layer. The engine natively stores UTF-8 strings, 64-bit signed integers, 64-bit IEEE 754 floats, booleans, temporal primitives (Date, Time, DateTime, LocalDateTime, LocalTime, Duration), spatial Point values, and homogeneous Lists of those primitives. Map is a Cypher value type that can be passed as a parameter or returned from a query, but it cannot be stored as a property value. Each primitive carries distinct storage characteristics that directly influence I/O patterns.

Fixed-width types (Integer, Float, Boolean, and temporal values, which serialize to fixed-size encodings) align predictably with record boundaries, which keeps range scans and equality predicates efficient. Variable-width types (String and List) consume dynamic storage; when a value exceeds the inline record budget it spills into the dynamic string/array store, which increases fragmentation risk and reduces page cache hit ratios when overused on high-cardinality properties.

The two storage classes route through distinct execution paths.

The practical consequence: a value that could be an Integer or a temporal type but is stored as a String (an epoch written as "1712345678", a date written as "2025-04-12") forfeits range-index eligibility and forces string comparison semantics. The planner can no longer reason about ordering, so WHERE n.ts > $cutoff degrades from an index seek to a full scan plus per-row coercion.
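The ordering failure is easy to reproduce outside the database. The following is a minimal sketch in plain Python, using made-up sample epochs, that applies the same cutoff once with string comparison and once with integer comparison.

```python
# Illustrative sample values: the same instants as text and as integers.
epochs_as_text = ["999999999", "1712345678", "1500000000"]
epochs_as_int = [int(v) for v in epochs_as_text]

cutoff_text = "1600000000"
cutoff_int = 1600000000

# String comparison is lexicographic: "999999999" sorts after "1600000000"
# because '9' > '1', even though that instant is decades earlier.
after_cutoff_text = [v for v in epochs_as_text if v > cutoff_text]

# Integer comparison follows numeric order, which is what a range predicate needs.
after_cutoff_int = [v for v in epochs_as_int if v > cutoff_int]

print(after_cutoff_text)  # ['999999999', '1712345678']
print(after_cutoff_int)   # [1712345678]
```

The string filter returns a row that should have been excluded, which is the kind of silent wrong answer a native numeric or temporal type rules out.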

Design rules: choosing the right primitive

Use the following decision matrix to map a domain value to a Neo4j type. The guiding principle is that the most specific native type that faithfully represents the value is always correct — never widen to String for convenience.

Domain value	Correct Neo4j type	Never use	Why
Surrogate / business key	`Integer` or `String`	`Float`	Floats lose precision on large keys; pick one and constrain it
Money / exact decimal	`Integer` (minor units)	`Float`	IEEE-754 cannot represent `0.10` exactly; store cents as integers
Measurement / ratio	`Float`	`String`	Preserves range-scan eligibility for thresholds
Timestamp with offset	`DateTime`	`String`, epoch `Integer`	Keeps ordering, timezone, and temporal functions
Wall-clock, no zone	`LocalDateTime`	`DateTime`	Avoids inventing a false offset at ingestion
Calendar day	`Date`	`String`	Native comparison and `duration.between` support
Elapsed interval	`Duration`	`Integer` seconds	Retains calendar-aware arithmetic
Lat/long or geometry	`Point`	two `Float` props	Enables spatial index and `point.distance`
Small closed set / flag	`Boolean` or `String`	`Integer` codes	Readable predicates, no magic-number lookups
Bounded homogeneous set	`List` of one primitive	delimited `String`	Avoids `split()` scans; keep lists short

Two rules override the table. First, keep collections homogeneous and small: a List mixing types, or one that grows unbounded per node, defeats indexing and inflates the dynamic store — a shape catalogued in property graph anti-patterns. Second, never encode structure inside a String (CSV, JSON blobs, delimited keys) when the structure is queryable; if you must query into it, it belongs as typed properties or a related node, not as text you split() at read time.
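As an illustration of the matrix, the sketch below converts a raw, all-strings record into the most specific native Python values before it reaches the driver. The field names, the `|` delimiter, and the helper names are hypothetical; the standard library types used here are ones the driver maps to Integer, Date, List, and Boolean.

```python
from datetime import date
from decimal import Decimal

def to_minor_units(amount: str) -> int:
    """Convert a decimal money string such as '19.99' to integer cents."""
    cents = Decimal(amount) * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {amount!r}")
    return int(cents)

def normalize_order(raw: dict) -> dict:
    """Map a raw record to native types, one matrix row per field."""
    return {
        "id": int(raw["id"]),                                 # key -> Integer
        "amount_minor": to_minor_units(raw["amount"]),        # money -> Integer (minor units)
        "order_date": date.fromisoformat(raw["order_date"]),  # calendar day -> Date
        "tags": [t for t in raw["tags"].split("|") if t],     # delimited String -> List of String
        "is_gift": raw["is_gift"] == "true",                  # flag -> Boolean
    }

order = normalize_order({
    "id": "1042",
    "amount": "19.99",
    "order_date": "2025-04-12",
    "tags": "gift|express",
    "is_gift": "true",
})
print(order["amount_minor"], order["order_date"], order["tags"])
```

Parsing money through Decimal rather than float keeps the conversion exact, so the cents value never inherits binary rounding error.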

Step-by-step implementation

Step 1 — Declare property-type constraints alongside the label

Type intent should be enforced by the database, not merely documented. Neo4j 5.x supports property-type constraints, so declare them idempotently as part of the label’s DDL. This makes a wrong-typed write fail at commit rather than surface later as a silent scan.

cypher

// Neo4j 5.x — idempotent DDL; safe to run on every pipeline start
CREATE CONSTRAINT event_id_unique IF NOT EXISTS
FOR (e:Event) REQUIRE e.id IS UNIQUE;

// Property-type constraints pin the storage type of each property
CREATE CONSTRAINT event_occurred_at_type IF NOT EXISTS
FOR (e:Event) REQUIRE e.occurred_at IS :: ZONED DATETIME;

CREATE CONSTRAINT event_amount_type IF NOT EXISTS
FOR (e:Event) REQUIRE e.amount_minor IS :: INTEGER;
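One way to keep these statements in step with the application is to hold the type contract as data and render the DDL from it. This is a sketch only: the `EVENT_TYPES` mapping, the constraint naming scheme, and the helper name are assumptions, not part of any Neo4j API.

```python
# Hypothetical type contract for the Event label: property name -> Cypher type.
EVENT_TYPES = {
    "occurred_at": "ZONED DATETIME",
    "amount_minor": "INTEGER",
    "is_processed": "BOOLEAN",
}

def type_constraint_ddl(label: str, prop: str, cypher_type: str) -> str:
    """Render one idempotent property-type constraint statement."""
    name = f"{label.lower()}_{prop}_type"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS :: {cypher_type}"
    )

statements = [type_constraint_ddl("Event", p, t) for p, t in EVENT_TYPES.items()]
for stmt in statements:
    print(stmt)
```

Schema identifiers cannot be sent as query parameters, so string formatting is unavoidable here. The labels, property names, and types should therefore come only from trusted code such as this mapping, never from request data.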

Step 2 — Map Python values to Neo4j types deterministically at ingestion

The neo4j Python driver 5.x serializes standard primitives over the Bolt wire protocol automatically: Python int maps to Neo4j Integer, float to Float, bool to Boolean, and list/dict to List/Map. Temporal types are where correctness is lost, so map them explicitly rather than leaning on the standard library defaults.

For timezone-aware timestamps, pass a neo4j.time.DateTime (or a datetime.datetime with tzinfo set) — the driver serializes it to Neo4j DateTime.
For local (timezone-naïve) timestamps, pass a neo4j.time.LocalDateTime (or a naïve datetime.datetime) — the driver serializes it to Neo4j LocalDateTime.

Mixing timezone-aware and naïve standard-library datetime objects on the same property splits it across two Neo4j types (DateTime and LocalDateTime), and converting driver temporal values back to datetime.datetime truncates nanoseconds to microseconds. For authoritative guidance on Python temporal handling, consult the official Python datetime documentation.

python

from datetime import timezone

from neo4j import GraphDatabase
from neo4j.time import DateTime

uri = "bolt://localhost:7687"

def ingest_event(tx, event_id: int, event_ts: DateTime, amount_minor: int, payload: dict):
    # Every parameter is a pre-typed native value; nothing is stringified.
    # A Map cannot be stored as a property value, so the queried key is
    # promoted to its own typed property instead of writing $payload whole.
    query = """
    MERGE (e:Event {id: $event_id})
    SET e.occurred_at  = $event_ts,        // Neo4j DateTime (zoned)
        e.amount_minor = $amount_minor,    // Neo4j Integer (money in cents)
        e.source       = $payload.source,  // Neo4j String, read from the Map parameter
        e.is_processed = false             // Neo4j Boolean
    """
    tx.run(query, event_id=event_id, event_ts=event_ts,
           amount_minor=amount_minor, payload=payload)

# Placeholder credentials; load real ones from configuration.
with GraphDatabase.driver(uri, auth=("neo4j", "password")) as driver:
    with driver.session(database="neo4j") as session:
        ts = DateTime.now(timezone.utc)  # UTC-aware
        session.execute_write(ingest_event, 1042, ts, 4999, {"source": "api_gateway"})

Step 3 — Parameterize everything, never interpolate

Interpolating numeric or temporal values into the Cypher query string discards their types, defeats query-plan caching, and introduces injection risk in migration pipelines. Always pass pre-validated, strongly-typed parameters through session.execute_write() or session.execute_read(). The driver serializes parameters into the binary Bolt protocol, preserving type fidelity and enabling server-side query-plan reuse. An f-string that pastes a value into the query text loses the type the moment it becomes text; the server may then have to plan each distinct query string separately, so the plan cache never warms.
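The difference shows up in the query text itself. The sketch below, which needs no database, builds the same update both ways for three hypothetical events and counts how many distinct query strings each approach produces.

```python
def interpolated(event_id: int, amount_minor: int) -> str:
    # Anti-pattern: values are pasted into the query text, so every call
    # yields a different string for the server to parse and plan.
    return f"MATCH (e:Event {{id: {event_id}}}) SET e.amount_minor = {amount_minor}"

PARAMETERIZED = "MATCH (e:Event {id: $event_id}) SET e.amount_minor = $amount_minor"

def parameterized(event_id: int, amount_minor: int):
    # Preferred: constant query text plus a dict of typed values.
    return PARAMETERIZED, {"event_id": event_id, "amount_minor": amount_minor}

calls = [(1042, 4999), (1043, 1250), (1044, 300)]
distinct_interpolated = {interpolated(i, a) for i, a in calls}
distinct_parameterized = {parameterized(i, a)[0] for i, a in calls}

print(len(distinct_interpolated), len(distinct_parameterized))  # 3 1
```

With the driver, the returned pair would be passed as `tx.run(query, params)`, so the integers travel over Bolt as integers rather than as digits inside a string.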

Step 4 — Cast and validate before the value reaches the driver

For migrations from relational or document stores, add a validation middleware layer that casts incoming payloads to Neo4j-compatible types before the driver ever sees them. Pydantic models or dataclasses with custom validators reject malformed records early, preventing partial commits and orphaned nodes. This is the same boundary described in the batch-processing workflow: validate, cast, then load.

python

from pydantic import AwareDatetime, BaseModel, field_validator

class EventIn(BaseModel):
    id: int
    amount_minor: int            # cents, never a float
    # Pydantic v2 cannot build a schema for neo4j.time.DateTime, so validate
    # a timezone-aware datetime instead; the driver writes it as DateTime.
    occurred_at: AwareDatetime   # rejects naive input

    @field_validator("amount_minor")
    @classmethod
    def non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("amount_minor must be >= 0")
        return v

Constraint & validation layer

Constraints and ingestion-side checks are complementary. The property-type constraint is the invariant Neo4j will never let you violate; the ingestion cast is the early, informative signal that routes a bad payload to remediation instead of hitting a hard database error mid-transaction.

Property-type constraints (REQUIRE n.prop IS :: <TYPE>) reject a mistyped write at commit time. Pair them with uniqueness or node-key constraints so the MERGE key is both unique and correctly typed:

cypher

// The MERGE key and the type contract, declared together
CREATE CONSTRAINT event_id_unique IF NOT EXISTS
FOR (e:Event) REQUIRE e.id IS UNIQUE;

CREATE CONSTRAINT event_ts_type IF NOT EXISTS
FOR (e:Event) REQUIRE e.occurred_at IS :: ZONED DATETIME;

On the ingestion side, a single pre-flight query can flag values that would violate the type contract before the write transaction opens — for instance, epoch integers or ISO strings that a lax upstream let through where a temporal is expected:

cypher

// $rows is the chunk. Return one row per value that is not a zoned DateTime.
// A null value satisfies IS :: <TYPE>, so missing timestamps are not flagged here.
UNWIND $rows AS row
WITH row
WHERE row.occurred_at IS NOT :: ZONED DATETIME
RETURN row.id AS id, 'NON_TEMPORAL_TIMESTAMP' AS reason;

Enforcing correct types at ingestion matters because native temporal, numeric, and spatial values stay eligible for range, composite, and point indexes. If a timestamp lands as a String, every downstream range query silently falls back to a scan — the exact failure class the data validation and integrity checks gates are built to catch.
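The same pre-flight idea can run in the application before any round trip. The sketch below partitions a chunk into loadable rows and rejects; the function name and the `NAIVE_TIMESTAMP` reason code are hypothetical additions, and the row shape is assumed to match the Event examples above.

```python
from datetime import datetime, timezone

def partition_rows(rows):
    """Split a chunk into rows safe to load and rejects tagged with a reason."""
    accepted, rejected = [], []
    for row in rows:
        ts = row.get("occurred_at")
        if not isinstance(ts, datetime):
            # Epoch integers, ISO strings, or missing values.
            rejected.append({"id": row.get("id"), "reason": "NON_TEMPORAL_TIMESTAMP"})
        elif ts.utcoffset() is None:
            # A naive datetime would be written as LocalDateTime.
            rejected.append({"id": row.get("id"), "reason": "NAIVE_TIMESTAMP"})
        else:
            accepted.append(row)
    return accepted, rejected

rows = [
    {"id": 1, "occurred_at": datetime(2025, 4, 12, 9, 0, tzinfo=timezone.utc)},
    {"id": 2, "occurred_at": 1712345678},
    {"id": 3, "occurred_at": "2025-04-12T09:00:00Z"},
    {"id": 4, "occurred_at": datetime(2025, 4, 12, 9, 0)},
]
accepted, rejected = partition_rows(rows)
print(len(accepted), [r["reason"] for r in rejected])
```

Rejected rows can be routed to a remediation queue with their reason code, while the accepted rows proceed to the write transaction.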

Performance & scale considerations

Type selection is the cheapest lever you have on read performance, because it is paid once at write and refunded on every read. The cost model is concrete: a range predicate on a native temporal or numeric property backed by a range index does work proportional to the number of matching rows; the same predicate over a String-typed timestamp forces a full label scan plus per-row coercion and grows linearly with node count.

Confirm index-backed lookups. Run EXPLAIN and PROFILE on the queries that filter on typed properties and require NodeIndexSeek (or NodeIndexSeekByRange) leaves — never NodeByLabelScan followed by a Filter.
Keep fixed-width properties inline. Integer, Float, Boolean, and temporals fit inline with the record instead of spilling to the dynamic store, so reading them costs no extra page access. Reserve String and List for values that genuinely are text or sets.
Bound and shrink variable-width values. A high-cardinality String property, or a List that grows per node, pushes records into the dynamic store, lowers cache density, and slows every traversal that touches the node — even queries that never read that property.
Batch with UNWIND. Send one parameterized statement per chunk of typed rows rather than one per row; the planner compiles a single reusable plan and round trips collapse. Chunk sizing follows the same trade-offs as initial load performance tuning.
Type relationship properties consistently. A property read across many traversal paths must carry an identical type on every edge, or index-backed relationship lookups silently split into typed and coerced branches. Where fan-out is high, align the decision with the graph partitioning strategy so hot edges stay index-eligible.
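The batching rule above can be sketched as a chunking helper plus one constant UNWIND statement. The names `chunked`, `load_events`, and the `send` callable are hypothetical; `send` stands in for whatever executes a statement, so the sketch runs without a database.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

# One constant statement per chunk; the row values travel as a typed parameter.
UPSERT_EVENTS = """
UNWIND $rows AS row
MERGE (e:Event {id: row.id})
SET e.amount_minor = row.amount_minor
"""

def chunked(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def load_events(send: Callable[[str, list], None], rows: Iterable[dict], size: int = 1000) -> int:
    """Send one UNWIND statement per chunk and return how many were sent."""
    sent = 0
    for chunk in chunked(rows, size):
        send(UPSERT_EVENTS, chunk)
        sent += 1
    return sent

# Stand-in for the driver call: record only the size of each chunk.
batches = []
count = load_events(lambda query, chunk: batches.append(len(chunk)),
                    ({"id": i, "amount_minor": 100} for i in range(2500)))
print(count, batches)  # 3 [1000, 1000, 500]
```

With the driver, `send` could wrap something like `session.execute_write(lambda tx: tx.run(query, rows=chunk).consume())`, giving one transaction per chunk. The chunk size of 1000 is a placeholder to tune, not a recommendation.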

Type consistency is the scaling variable that most often surprises teams: mixed types on the same property across nodes fragment the index and skew the planner’s statistics, so cardinality estimates degrade and plan choice becomes unstable long before any single query looks slow.

Known pitfalls

Pitfall 1 — Storing money as a Float

Float cannot represent most decimal fractions exactly, so 0.10 + 0.20 is not 0.30, and summed balances drift by cents. Root cause: IEEE-754 binary floating point. Store monetary values as Integer minor units (cents) and format for display in the application. Backfill existing data during a maintenance window and add a property-type constraint so the mistake cannot recur:

cypher

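// One-off backfill. NOT NULL matters: null also satisfies IS :: FLOAT, so a
// re-run would otherwise match migrated nodes and null out amount_minor.
MATCH (o:Order) WHERE o.amount IS :: FLOAT NOT NULL
SET o.amount_minor = toInteger(round(o.amount * 100))
REMOVE o.amount;
// Then pin the type
CREATE CONSTRAINT order_amount_minor_type IF NOT EXISTS
FOR (o:Order) REQUIRE o.amount_minor IS :: INTEGER;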
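The drift is reproducible in plain Python, whose float is the same IEEE-754 double. The sketch below shows the summation error, shows why a backfill should round rather than truncate, and formats integer cents for display; the sample amounts and the `format_minor` helper are illustrative.

```python
# Three ten-cent charges, first as floats and then as integer cents.
float_total = sum([0.10, 0.10, 0.10])
cents_total = sum([10, 10, 10])

print(float_total == 0.30)  # False
print(cents_total == 30)    # True

# Converting an existing float amount: truncation loses a cent, rounding does not.
truncated = int(0.29 * 100)
rounded = round(0.29 * 100)
print(truncated, rounded)   # 28 29

def format_minor(cents: int) -> str:
    """Render non-negative integer cents for display."""
    return f"{cents // 100}.{cents % 100:02d}"

print(format_minor(4999))   # 49.99
```

The truncation case is why a float-to-cents conversion should round the scaled value before casting it to an integer.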

Pitfall 2 — Naïve datetimes silently becoming LocalDateTime

Passing a timezone-naïve datetime.datetime when the schema expects a zoned DateTime serializes to LocalDateTime, dropping the offset. Without a type constraint the write succeeds, and later range comparisons between LocalDateTime and zoned DateTime values evaluate to null rather than true or false, so the affected rows silently drop out of filtered results. Root cause: the driver maps naïve Python datetimes to LocalDateTime by construction. Normalize to UTC-aware values at the ingestion boundary (Step 2/Step 4) and enforce the zoned type with a constraint so a naïve write is rejected at commit.
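A minimal normalization helper for the ingestion boundary might look like the sketch below. The name `to_utc` and the policy of rejecting naive input unless the caller states the source offset are assumptions; other policies are possible.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

def to_utc(value: datetime, assume: Optional[tzinfo] = None) -> datetime:
    """Return a UTC-aware datetime, refusing to guess the zone of naive input."""
    if value.utcoffset() is None:
        if assume is None:
            raise ValueError("naive datetime rejected; state the source zone explicitly")
        # The caller has declared which zone the wall-clock value belongs to.
        value = value.replace(tzinfo=assume)
    return value.astimezone(timezone.utc)

cest = timezone(timedelta(hours=2))  # illustrative fixed offset
local_reading = datetime(2025, 4, 12, 9, 0)
print(to_utc(local_reading, assume=cest))  # 2025-04-12 07:00:00+00:00
```

Raising on naive input by default keeps the helper from inventing an offset, which is the failure this pitfall describes.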

Pitfall 3 — Encoding structure inside a String

Storing a delimited key ("tenant:1042:order"), a CSV list, or a JSON blob in a String property forces split() or apoc-style parsing at read time, which no index can accelerate. Root cause: queryable structure was flattened into opaque text. Promote each queried component to its own typed property, or model it as a related node — the modelling correction detailed in property graph anti-patterns.
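Promotion is easiest at ingestion, before the value is ever written. The sketch below splits the example key into typed components; the `<scope>:<id>:<kind>` layout and the field names are guesses about what such a key encodes, made only for illustration.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EntityKey:
    scope: str
    entity_id: int
    kind: str

def parse_key(raw: str) -> EntityKey:
    """Split an assumed '<scope>:<id>:<kind>' key into typed components."""
    parts = raw.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected '<scope>:<id>:<kind>', got {raw!r}")
    scope, entity_id, kind = parts
    # int() raises ValueError if the id component is not numeric.
    return EntityKey(scope=scope, entity_id=int(entity_id), kind=kind)

props = asdict(parse_key("tenant:1042:order"))
print(props)  # {'scope': 'tenant', 'entity_id': 1042, 'kind': 'order'}
```

Each resulting key can then be written as its own typed property, so an equality or range predicate on `entity_id` is index-eligible without any parsing at read time.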

Pitfall 4 — Changing a property’s type in place under live traffic

Rewriting a property from String to Integer (or Integer epoch to DateTime) across a large label while readers are live leaves the index straddling two types and the planner’s statistics stale, so queries flip between seeks and scans unpredictably. Root cause: a type change is a schema migration, not an update. Run it as a versioned, dual-typed transition: write the new-typed property alongside the old, backfill in batches, cut readers over, then drop the old property and its index. Govern the whole transition with the approach in schema evolution and versioning.
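The dual-typed phase can be sketched as follows for an Integer epoch moving to a zoned DateTime. The property names `ts_epoch` and `occurred_at`, the helper name, and the batch size are hypothetical. The Cypher string is a sketch of a batched backfill and would need to run in an auto-commit transaction (for example via `session.run`), since `CALL { ... } IN TRANSACTIONS` is not allowed inside an explicit transaction.

```python
from datetime import datetime, timezone

def dual_typed_props(ts_epoch: int) -> dict:
    """During the transition, writers set the legacy and the new property together."""
    return {
        "ts_epoch": ts_epoch,  # legacy Integer, dropped after reader cutover
        "occurred_at": datetime.fromtimestamp(ts_epoch, tz=timezone.utc),  # new zoned value
    }

# Batched backfill for rows written before dual-writing began (assumed names).
BACKFILL = """
MATCH (e:Event)
WHERE e.ts_epoch IS NOT NULL AND e.occurred_at IS NULL
CALL {
  WITH e
  SET e.occurred_at = datetime({epochSeconds: e.ts_epoch})
} IN TRANSACTIONS OF 10000 ROWS
"""

props = dual_typed_props(1712345678)
print(props["occurred_at"].isoformat())  # 2024-04-05T19:34:38+00:00
```

Because the backfill only touches nodes where the new property is still missing, it can be re-run safely after an interruption.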

For comprehensive Cypher type semantics and index behavior, reference the official Neo4j Cypher Manual: Values and Types.

Related pages

Neo4j Graph Schema Design & Architecture — the parent reference this type layer sits beneath.
Node Label Taxonomy Design — the labels that type constraints and composite indexes attach to.
Relationship Cardinality & Directionality — keeping edge-property types identical across traversal paths.
Property Graph Anti-Patterns — the encode-structure-in-a-string and unbounded-list mistakes to avoid.
Schema Evolution & Versioning — running a property-type change as a safe, versioned migration.