Chapter 13 of 20
Migration Strategy
How to add a graph database alongside your existing relational systems without ripping anything out. Covers the sidecar pattern, CDC pipelines from PostgreSQL to Neo4j, phased migration timelines, dual-write vs CDC vs batch sync trade-offs, team structure, rollback strategies, and a complete CDC pipeline implementation with Debezium.
Overview
Nobody is asking you to rip out Oracle. Here is how to add graphs incrementally instead.
01. Why Nobody Rips and Replaces
Every year, someone proposes ripping out the relational database and replacing it with something new. Every year, that proposal dies. It should. Your relational database has 15 years of battle-tested stored procedures, 200 reports built on top of it, an operations team that knows how to back it up and restore it at 3am, and a vendor support contract that someone in procurement spent 6 months negotiating.
Graph databases do not replace relational databases — they augment them. You add a graph alongside your existing systems, feed it the data that benefits most from graph representation, and query the graph for questions that relational databases answer poorly or not at all.
Think of it like this: You do not tear down your house to add a workshop. You build the workshop in the backyard and run power from the main panel. The house keeps doing what the house does. The workshop handles what the house was never designed for — and they share electricity.
02. The Sidecar Pattern
The sidecar pattern is the foundational architecture for incremental graph adoption. Your relational database remains the system of record. The graph database runs alongside it as a read-optimized projection of specific, relationship-heavy data.
┌─────────────────────────────────────────────────┐
│ Applications │
│ │
│ ┌────────────┐ ┌────────────────┐ │
│ │ CRUD Ops │ │ Graph Queries │ │
│ │ Reports │ │ Traversals │ │
│ │ Transactions│ │ Path Analysis │ │
│ └─────┬──────┘ └───────┬────────┘ │
│ │ │ │
│ ┌─────▼──────┐ Sync ┌─────▼────────┐ │
│ │ PostgreSQL │ ────────> │ Neo4j │ │
│ │ (Primary) │ │ (Sidecar) │ │
│ └────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────┘
What Stays in the Relational Database
- Transactional writes (orders, payments, user accounts)
- Financial reporting and audit trails
- Data subject to strict ACID compliance requirements
- Any table that existing applications depend on
What Moves to the Graph
- Relationship-heavy data (organizational hierarchies, product dependencies)
- Data you query by traversal ("find all downstream systems affected by this failure")
- Knowledge graph projections from documents
- Any query that currently requires 4+ JOINs and runs slowly
The Rule of Thumb
If your question starts with "find all things connected to X," that belongs in the graph. If your question starts with "sum the revenue for Q3," that stays in the relational database.
03. What to Migrate First
Not all data benefits equally from graph representation. Start with data that has the highest relationship density and the most painful SQL queries.
Scoring Your Data for Graph Suitability
| Factor | Score 1 (Low) | Score 5 (High) |
|---|---|---|
| Relationship density | Each row references 1-2 other tables | Each row references 5+ other tables with recursive refs |
| Query complexity | Simple SELECTs with 1-2 JOINs | Recursive CTEs, 4+ JOINs, self-joins |
| Traversal depth | Flat lookups | "Find everything connected within 3 hops" |
| Schema variability | Fixed schema, rarely changes | New entity types and relationship types monthly |
| Business pain | Queries run fine | Queries time out, reports are late, users complain |
Score each data domain on these 5 factors. A total score of 20 or above means migrate this first. A total score below 10 means leave it in the relational database for now.
Common High-Scoring Candidates
| Data Domain | Typical Score | Why |
|---|---|---|
| Organizational hierarchy | 22 | Deep recursive queries, frequent reorgs |
| IT service dependencies | 24 | Many-to-many, blast radius queries |
| Product component trees | 20 | Deep nesting, variant tracking |
| Fraud transaction networks | 25 | Multi-hop traversals, pattern matching |
| Regulatory compliance mapping | 21 | Cross-references, traceability requirements |
| Customer relationship networks | 19 | Many relationship types, path queries |
04. Sync Strategies: Dual-Write vs CDC vs Batch
There are three ways to keep the graph in sync with the relational database. Each has trade-offs that affect your architecture choices.
Comparison Table
| Factor | Dual-Write | CDC (Change Data Capture) | Batch Sync |
|---|---|---|---|
| Latency | Near-zero (writes to both) | Seconds to low minutes | Minutes to hours |
| Consistency | Risk of partial failure | Eventually consistent | Stale between batches |
| Complexity | High (app code changes) | Medium (infrastructure) | Low (scheduled job) |
| Failure modes | One DB writes, other fails | Pipeline lag, connector crash | Job fails, data stale |
| Code changes | Every write path modified | None (reads DB log) | Minimal (add sync job) |
| Recommended for | Tiny data, tight coupling | Most production use cases | Initial load, low-change data |
| Operational burden | High (distributed transactions) | Medium (manage Kafka/Debezium) | Low (cron job) |
The Short Answer
Use CDC for production sync. Use batch for the initial data load. Avoid dual-write unless you have a specific reason for it and are prepared to handle distributed transaction failures.
Think of it like this: Dual-write is like asking someone to write a letter and simultaneously type it — they will eventually make a mistake and the two copies will diverge. CDC is like a copy machine that watches everything you write and automatically makes a copy — it can fall behind, but it never forgets. Batch is like photocopying everything at the end of the day — simple, but you are always working with yesterday's information.
05. CDC Pipeline: PostgreSQL to Neo4j
CDC (Change Data Capture) reads the database transaction log (the write-ahead log, or WAL, in PostgreSQL) and streams every INSERT, UPDATE, and DELETE as an event. Debezium is the standard open-source CDC connector. It reads the WAL, publishes events to Kafka, and a consumer writes those events to Neo4j.
Architecture
PostgreSQL WAL → Debezium Connector → Kafka Topics → Consumer → Neo4j
Each table gets its own Kafka topic. The consumer reads from those topics, transforms each relational row into a graph operation, and executes it against Neo4j.
Step 1: Configure PostgreSQL for Logical Replication
-- postgresql.conf changes (or ALTER SYSTEM)
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
-- Create a replication user
CREATE ROLE cdc_user WITH REPLICATION LOGIN PASSWORD 'secure_password';
-- Grant access to the tables you want to capture
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
-- Create a publication for the tables you want to sync
CREATE PUBLICATION graph_sync FOR TABLE
employees, departments, projects,
employee_projects, department_hierarchy;
Step 2: Debezium Connector Configuration
{
"name": "pg-graph-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres-host",
"database.port": "5432",
"database.user": "cdc_user",
"database.password": "secure_password",
"database.dbname": "enterprise_db",
"database.server.name": "enterprise",
"plugin.name": "pgoutput",
"publication.name": "graph_sync",
"slot.name": "graph_sync_slot",
"table.include.list": "public.employees,public.departments,public.projects,public.employee_projects,public.department_hierarchy",
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "enterprise.public.(.*)",
"transforms.route.replacement": "graph.sync.$1"
}
}
Step 3: The Consumer — Transforming Rows to Graph Operations
import json
import logging
from dataclasses import dataclass
from confluent_kafka import Consumer, KafkaError
from neo4j import GraphDatabase
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cdc_consumer")
@dataclass
class CDCEvent:
"""Represents a parsed Debezium CDC event."""
operation: str # "c" create, "u" update, "d" delete
table: str
before: dict | None # Previous row state (for updates/deletes)
after: dict | None # New row state (for creates/updates)
timestamp_ms: int
def parse_debezium_event(message_value: bytes) -> CDCEvent:
"""Parse a raw Debezium message into a CDCEvent."""
payload = json.loads(message_value)["payload"]
source = payload.get("source", {})
return CDCEvent(
operation=payload["op"],
table=source.get("table", "unknown"),
before=payload.get("before"),
after=payload.get("after"),
timestamp_ms=payload.get("ts_ms", 0)
)
# ── Mapping: relational table → graph operation ──────────────
TABLE_HANDLERS = {}
def handles(table_name: str):
"""Decorator to register a handler for a specific table."""
def decorator(func):
TABLE_HANDLERS[table_name] = func
return func
return decorator
@handles("employees")
def handle_employee(event: CDCEvent, session):
"""Sync employee rows to Person nodes."""
if event.operation in ("c", "u"):
row = event.after
session.run("""
MERGE (p:Person {employee_id: $emp_id})
SET p.name = $name,
p.title = $title,
p.email = $email,
p.status = $status,
p.updated = datetime()
""", emp_id=row["id"], name=row["name"],
title=row["title"], email=row["email"],
status=row.get("status", "active"))
# If department_id is set, link to department
if row.get("department_id"):
session.run("""
MATCH (p:Person {employee_id: $emp_id})
MATCH (d:Department {dept_id: $dept_id})
MERGE (p)-[:WORKS_IN]->(d)
""", emp_id=row["id"], dept_id=row["department_id"])
# If manager_id is set, link to manager
if row.get("manager_id"):
session.run("""
MATCH (p:Person {employee_id: $emp_id})
MATCH (m:Person {employee_id: $mgr_id})
MERGE (p)-[:REPORTS_TO]->(m)
""", emp_id=row["id"], mgr_id=row["manager_id"])
elif event.operation == "d":
row = event.before
session.run("""
MATCH (p:Person {employee_id: $emp_id})
SET p.status = 'deleted', p.deleted = datetime()
""", emp_id=row["id"])
@handles("departments")
def handle_department(event: CDCEvent, session):
"""Sync department rows to Department nodes."""
if event.operation in ("c", "u"):
row = event.after
session.run("""
MERGE (d:Department {dept_id: $dept_id})
SET d.name = $name,
d.code = $code,
d.updated = datetime()
""", dept_id=row["id"], name=row["name"],
code=row.get("code", ""))
elif event.operation == "d":
row = event.before
session.run("""
MATCH (d:Department {dept_id: $dept_id})
SET d.status = 'deleted', d.deleted = datetime()
""", dept_id=row["id"])
@handles("employee_projects")
def handle_employee_project(event: CDCEvent, session):
"""Sync the join table to WORKS_ON relationships."""
if event.operation in ("c", "u"):
row = event.after
session.run("""
MATCH (p:Person {employee_id: $emp_id})
MATCH (proj:Project {project_id: $proj_id})
MERGE (p)-[r:WORKS_ON]->(proj)
SET r.role = $role,
r.start_date = $start,
r.updated = datetime()
""", emp_id=row["employee_id"],
proj_id=row["project_id"],
role=row.get("role", "member"),
start=row.get("start_date"))
elif event.operation == "d":
row = event.before
session.run("""
MATCH (p:Person {employee_id: $emp_id})
-[r:WORKS_ON]->
(proj:Project {project_id: $proj_id})
DELETE r
""", emp_id=row["employee_id"],
proj_id=row["project_id"])
# ── Main consumer loop ───────────────────────────────────────
def run_consumer(
kafka_bootstrap: str,
neo4j_uri: str,
neo4j_user: str,
neo4j_password: str
):
"""Main CDC consumer loop."""
consumer = Consumer({
"bootstrap.servers": kafka_bootstrap,
"group.id": "graph-sync-consumer",
"auto.offset.reset": "earliest",
"enable.auto.commit": False
})
topics = [f"graph.sync.{t}" for t in TABLE_HANDLERS]
consumer.subscribe(topics)
driver = GraphDatabase.driver(
neo4j_uri, auth=(neo4j_user, neo4j_password)
)
logger.info("CDC consumer started. Listening on %s", topics)
processed = 0
errors = 0
try:
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
logger.error("Kafka error: %s", msg.error())
errors += 1
continue
try:
event = parse_debezium_event(msg.value())
handler = TABLE_HANDLERS.get(event.table)
if handler:
with driver.session() as session:
handler(event, session)
processed += 1
if processed % 1000 == 0:
logger.info(
"Processed %d events (%d errors)",
processed, errors
)
consumer.commit(msg)
except Exception as e:
logger.error(
"Failed to process event from %s: %s",
msg.topic(), e
)
errors += 1
# In production: send to dead letter queue
# dlq_producer.produce("graph.sync.dlq", msg.value())
except KeyboardInterrupt:
logger.info("Shutting down. Processed %d events.", processed)
finally:
consumer.close()
driver.close()
if __name__ == "__main__":
run_consumer(
kafka_bootstrap="localhost:9092",
neo4j_uri="bolt://localhost:7687",
neo4j_user="neo4j",
neo4j_password="password"
)
Key Design Decisions in This Pipeline
-
Soft deletes in the graph. When a row is deleted from PostgreSQL, the node in Neo4j is not deleted. It is marked as deleted. This preserves historical relationships and avoids orphaning connected nodes.
-
MERGE, not CREATE. Every write uses MERGE (upsert) so the pipeline is idempotent. If Kafka redelivers a message, the result is the same.
-
One handler per table. The decorator pattern keeps the mapping clean. When you add a new table, you write one function and register it.
-
Manual commit. The Kafka offset is committed only after the Neo4j write succeeds. If the consumer crashes mid-write, it re-reads and re-processes the event on restart.
06. Phased Migration Timeline
A phased approach reduces risk and builds organizational confidence. Do not try to migrate everything at once.
Phase
Proof of Concept (Weeks 1-4)
Goal: Demonstrate graph value with one high-scoring data domain.
| Task | Owner | Duration |
|---|---|---|
| Score data domains for graph suitability | Data Engineer + BA | 3 days |
| Set up Neo4j in a dev environment | Developer | 2 days |
| Batch-load the highest-scoring data domain | Data Engineer | 1 week |
| Build 3-5 graph queries that answer painful business questions | Developer + BA | 1 week |
| Present results to stakeholders | Team Lead | 1 day |
Exit criteria: Stakeholders see concrete value. At least one query answers a question that was previously impossible or required hours of manual work.
Phase
Pipeline (Weeks 5-10)
Goal: Automated sync from PostgreSQL to Neo4j via CDC.
| Task | Owner | Duration |
|---|---|---|
| Set up Kafka and Debezium in staging | DevOps / Data Engineer | 1 week |
| Implement CDC consumer for Phase 1 tables | Developer | 2 weeks |
| Data consistency validation (compare PG vs Neo4j) | QA | 1 week |
| Build monitoring and alerting | DevOps | 1 week |
| Load testing and performance tuning | Developer + DBA | 1 week |
Exit criteria: CDC pipeline runs in staging for 2 weeks with zero data inconsistencies. Monitoring dashboard shows pipeline health.
Phase
Production (Weeks 11-16)
Goal: CDC pipeline running in production. Applications querying the graph.
| Task | Owner | Duration |
|---|---|---|
| Deploy Neo4j to production environment | DevOps | 1 week |
| Deploy CDC pipeline to production | DevOps + Data Engineer | 1 week |
| Migrate first application to read from graph | Developer | 2 weeks |
| Runbook creation and on-call training | DevOps + Developer | 1 week |
| Rollback drill (simulate failures, practice recovery) | Whole team | 1 day |
Exit criteria: One production application reads from the graph. On-call team can handle common failure scenarios without escalation.
Phase
Expansion (Weeks 17-24)
Goal: Additional data domains in the graph. More applications consuming graph data.
| Task | Owner | Duration |
|---|---|---|
| Add 2-3 more data domains to CDC pipeline | Data Engineer + Developer | 3 weeks |
| Build graph-powered features (recommendations, impact analysis) | Developer | 3 weeks |
| Knowledge graph layer (document extraction, Chapter 7) | Data Engineer + Developer | 2 weeks |
| Performance optimization based on production metrics | DBA + Developer | Ongoing |
Exit criteria: Graph serves multiple use cases. Team operates independently without external consulting.
07. Team Structure: Who Does What
Graph migration is a team effort. Here is who does what.
| Role | Phase 1 Responsibilities | Phase 2-4 Responsibilities |
|---|---|---|
| DBA | Assess relational schema for graph suitability, configure WAL for CDC | Monitor replication lag, tune PostgreSQL for CDC load, manage replication slots |
| Data Engineer | Design graph data model, batch-load initial data | Build and maintain CDC pipeline, add new table handlers, data quality checks |
| Developer | Write Cypher queries, build graph API layer | Integrate graph queries into applications, build new features on graph data |
| BA / PM | Identify high-value use cases, score data domains | Define requirements for new graph features, validate business value |
| QA | Validate batch-loaded data accuracy | Build data consistency tests, regression tests for graph queries (Chapter 14) |
| DevOps | Set up Neo4j dev environment | Deploy and monitor Kafka, Debezium, Neo4j in production, alerting, backups |
The Common Mistake
Teams often assign graph migration to one person. That person becomes a single point of failure and burns out. The minimum viable team is three people: one data engineer (pipeline), one developer (queries and application integration), and one DevOps person (infrastructure). Everyone else contributes part-time.
08. Risk Mitigation
Rollback Strategy
Every phase needs a rollback plan. Here is the framework:
| Risk | Detection | Rollback | Prevention |
|---|---|---|---|
| CDC pipeline drops events | Comparison query shows count mismatch between PG and Neo4j | Re-run batch sync for affected tables | Dead letter queue, offset tracking, monitoring |
| Neo4j goes down | Health check fails, application timeout | Applications fall back to relational queries | Read replicas, automated failover |
| Data inconsistency | Consistency check job finds mismatches | Flag affected nodes, re-sync from PG | Nightly consistency validation job |
| Performance degradation | Query latency exceeds baseline | Disable graph queries, route to relational | Load testing before launch, query optimization |
| Schema change in PG | CDC consumer throws parsing errors | Pause pipeline, update handler, replay | Schema change notifications from DBA, versioned handlers |
The Fallback Rule
Because the graph is a sidecar and not the system of record, rollback is always the same: stop reading from the graph and go back to reading from the relational database. Your application code should have a feature flag for every graph-powered query path:
from functools import wraps
# Feature flags — toggle graph queries without deployment
GRAPH_FEATURES = {
"org_hierarchy_traversal": True,
"impact_analysis": True,
"dependency_lookup": False, # Not yet in production
}
def with_graph_fallback(feature_name: str, relational_fallback):
"""Decorator: try graph query, fall back to relational."""
def decorator(graph_func):
@wraps(graph_func)
def wrapper(*args, **kwargs):
if not GRAPH_FEATURES.get(feature_name, False):
return relational_fallback(*args, **kwargs)
try:
return graph_func(*args, **kwargs)
except Exception as e:
logger.warning(
"Graph query failed for %s, falling back: %s",
feature_name, e
)
return relational_fallback(*args, **kwargs)
return wrapper
return decorator
# Usage:
def get_org_hierarchy_sql(person_id: int) -> list[dict]:
"""Fallback: recursive CTE in PostgreSQL."""
# ... existing SQL query ...
pass
@with_graph_fallback(
"org_hierarchy_traversal",
relational_fallback=get_org_hierarchy_sql
)
def get_org_hierarchy(person_id: int) -> list[dict]:
"""Primary: graph traversal in Neo4j."""
with neo4j_driver.session() as session:
result = session.run("""
MATCH path = (p:Person {employee_id: $pid})
-[:REPORTS_TO*]->(m:Person)
RETURN [n IN nodes(path) | n.name] AS chain
""", pid=person_id)
return [dict(r) for r in result]
Data Consistency Checks
Run these nightly as a scheduled job:
def validate_sync_consistency(pg_conn, neo4j_session):
"""Compare counts and spot-check records between PG and Neo4j."""
checks = []
# 1. Count comparison
pg_count = pg_conn.execute(
"SELECT COUNT(*) FROM employees WHERE status = 'active'"
).fetchone()[0]
neo4j_count = neo4j_session.run(
"MATCH (p:Person) WHERE p.status <> 'deleted' RETURN count(p)"
).single()[0]
count_match = abs(pg_count - neo4j_count) <= 5 # Allow small lag
checks.append({
"check": "employee_count",
"pg": pg_count,
"neo4j": neo4j_count,
"pass": count_match
})
# 2. Spot-check: pick 10 random employees, verify fields match
sample = pg_conn.execute(
"SELECT id, name, title, email FROM employees "
"ORDER BY RANDOM() LIMIT 10"
).fetchall()
mismatches = 0
for emp in sample:
neo4j_record = neo4j_session.run(
"MATCH (p:Person {employee_id: $eid}) "
"RETURN p.name AS name, p.title AS title, p.email AS email",
eid=emp.id
).single()
if neo4j_record is None:
mismatches += 1
elif (neo4j_record["name"] != emp.name
or neo4j_record["title"] != emp.title):
mismatches += 1
checks.append({
"check": "spot_check_10",
"mismatches": mismatches,
"pass": mismatches == 0
})
# 3. Relationship count check
pg_rels = pg_conn.execute(
"SELECT COUNT(*) FROM employee_projects"
).fetchone()[0]
neo4j_rels = neo4j_session.run(
"MATCH ()-[r:WORKS_ON]->() RETURN count(r)"
).single()[0]
rel_match = abs(pg_rels - neo4j_rels) <= 5
checks.append({
"check": "works_on_count",
"pg": pg_rels,
"neo4j": neo4j_rels,
"pass": rel_match
})
return checks
09. Chapter Checklist
Before you move on, verify:
- You have identified which data domains score highest for graph suitability
- You understand the sidecar pattern — graph augments, does not replace
- You can explain why CDC beats dual-write for production sync
- You have a phased plan with clear exit criteria for each phase
- Every graph query path has a relational fallback with a feature flag
- Nightly consistency checks are in place before you go live
- Your rollback plan does not require a deployment. A feature flag flip is sufficient.