Chapter 15 of 20

Monitoring and Operations

How to keep a graph database running in production. Covers the three things that will break at 3am, key metrics dashboards, alerting rules, backup and disaster recovery, scaling patterns including read replicas and connection pooling, cost management, and a runbook template for common operational issues.

13 min read

Overview

It is in production. Three things will wake you up at 3am if you do not monitor them.

01. The Three Things

After running graph databases in production across multiple organizations, three categories of failure account for the vast majority of 3am pages:

Query latency spikes. A query that ran in 200ms starts taking 15 seconds, and users see timeouts. The cause is usually an unindexed property lookup, a Cartesian product in a Cypher query, or a traversal that hit a supernode: a single node with 100,000 or more relationships.
Graph size growth. The graph was designed for 1 million nodes. It now has 50 million because someone added a data source without considering cardinality. Memory is exhausted, queries slow to a crawl, and the page cache starts evicting.
Extraction pipeline failures. The CDC pipeline from Chapter 13 stops processing events. The graph falls hours behind the relational database. Applications serve stale data, and nobody notices until a business user reports wrong results.

Think of it like this: A graph database in production is like a highway system. Latency spikes are traffic jams — something is blocking a lane. Growth issues are running out of road capacity. Pipeline failures are on-ramps that have been closed — new data cannot get onto the highway.

02. Key Metrics Dashboard

Every graph database deployment needs a dashboard with these metrics. Whether you use Grafana, Datadog, or CloudWatch, track all of them from day one.

Query Metrics

Metric	What It Tells You	Collection Method
Query latency p50	Median response time	Neo4j query log + Prometheus
Query latency p95	Tail latency experienced by 1 in 20 users	Neo4j query log + Prometheus
Query latency p99	Worst-case for 1 in 100 users	Neo4j query log + Prometheus
Slow query count	Queries exceeding threshold (e.g., > 1s)	Neo4j query log
Query throughput	Queries per second	Neo4j metrics endpoint
Active transactions	Currently executing queries	`dbms.listTransactions()`

Resource Metrics

Metric	What It Tells You	Collection Method
Heap usage	JVM memory pressure	JMX / Prometheus
Page cache hit ratio	% of reads served from memory vs disk	Neo4j metrics
Page cache evictions	Data being pushed out of memory	Neo4j metrics
Disk usage	Total store size and growth rate	OS metrics
CPU utilization	Processing capacity headroom	OS metrics
GC pause time	JVM garbage collection impact	JMX

Graph-Specific Metrics

Metric	What It Tells You	Collection Method
Node count by label	Graph size and composition	Periodic Cypher query
Relationship count by type	Relationship density	Periodic Cypher query
Max node degree	Supernode detection	Periodic Cypher query
Connection pool utilization	Driver connection pressure	Neo4j driver metrics
Cluster replication lag	How far behind read replicas are	Neo4j cluster metrics

Pipeline Metrics

Metric	What It Tells You	Collection Method
CDC consumer lag	Events waiting to be processed	Kafka consumer group metrics
Events processed/sec	Pipeline throughput	Consumer application metrics
Dead letter queue depth	Failed events needing attention	DLQ topic metrics
Last successful sync	How fresh the graph data is	Consumer heartbeat

Collecting Graph-Specific Metrics

import time
import logging
from dataclasses import dataclass, field
from neo4j import GraphDatabase

logger = logging.getLogger("graph_metrics")


@dataclass
class GraphMetrics:
    """Snapshot of graph database metrics."""
    timestamp: float = field(default_factory=time.time)
    node_counts: dict = field(default_factory=dict)
    rel_counts: dict = field(default_factory=dict)
    total_nodes: int = 0
    total_relationships: int = 0
    max_degree: int = 0
    max_degree_node: str = ""
    store_size_mb: float = 0.0


def collect_graph_metrics(session) -> GraphMetrics:
    """Collect key graph metrics for dashboard."""
    metrics = GraphMetrics()

    # Node counts by label
    result = session.run("""
        CALL db.labels() YIELD label
        CALL {
            WITH label
            MATCH (n) WHERE label IN labels(n)
            RETURN count(n) AS cnt
        }
        RETURN label, cnt ORDER BY cnt DESC
    """)
    for record in result:
        metrics.node_counts[record["label"]] = record["cnt"]
        metrics.total_nodes += record["cnt"]

    # Relationship counts by type
    result = session.run("""
        CALL db.relationshipTypes() YIELD relationshipType AS type
        CALL {
            WITH type
            MATCH ()-[r]->() WHERE type(r) = type
            RETURN count(r) AS cnt
        }
        RETURN type, cnt ORDER BY cnt DESC
    """)
    for record in result:
        metrics.rel_counts[record["type"]] = record["cnt"]
        metrics.total_relationships += record["cnt"]

    # Supernode detection: find the node with most relationships
    result = session.run("""
        MATCH (n)
        WITH n, size([(n)--() | 1]) AS degree
        ORDER BY degree DESC
        LIMIT 1
        RETURN coalesce(n.name, n.id, toString(id(n))) AS node_id,
               labels(n) AS labels, degree
    """)
    record = result.single()
    if record:
        metrics.max_degree = record["degree"]
        metrics.max_degree_node = (
            f"{record['labels'][0]}:{record['node_id']}"
        )

    # Store size
    result = session.run("""
        CALL dbms.queryJmx('org.neo4j:*,name=Store sizes')
        YIELD attributes
        RETURN attributes
    """)
    # Note: exact JMX query depends on Neo4j version

    return metrics


def format_metrics_for_prometheus(metrics: GraphMetrics) -> str:
    """Format metrics as Prometheus text exposition."""
    lines = []
    lines.append(
        f"neo4j_total_nodes {metrics.total_nodes}"
    )
    lines.append(
        f"neo4j_total_relationships {metrics.total_relationships}"
    )
    lines.append(
        f"neo4j_max_node_degree {metrics.max_degree}"
    )

    for label, count in metrics.node_counts.items():
        lines.append(
            f'neo4j_node_count{{label="{label}"}} {count}'
        )

    for rel_type, count in metrics.rel_counts.items():
        lines.append(
            f'neo4j_rel_count{{type="{rel_type}"}} {count}'
        )

    return "\n".join(lines)

03. Alerting Rules: What to Alert On vs What to Log

Not everything is an alert. Alert fatigue kills on-call teams faster than actual incidents. Use this framework to separate signal from noise:

Alert Hierarchy

Level	Action	Example	Channel
Page (P1)	Wake someone up. Production is broken or about to break.	Query p99 > 10s for 5 min, CDC lag > 1 hour, disk > 95%	PagerDuty / phone
Urgent (P2)	Fix during business hours today.	Query p95 > 2s, cache hit ratio < 80%, DLQ depth > 100	Slack alert channel
Warning (P3)	Review this week. Trend is concerning.	Node count growth > 20%/week, GC pauses > 500ms	Slack ops channel
Info	Log only. Useful for debugging, not actionable now.	Individual slow queries, connection pool fluctuations	Logs / dashboard

Specific Thresholds

Metric	Warning	Urgent	Page
Query latency p95	> 500ms	> 2s	> 5s for 5 min
Query latency p99	> 2s	> 5s	> 10s for 5 min
Page cache hit ratio	< 90%	< 80%	< 60%
Heap utilization	> 70%	> 85%	> 95%
Disk utilization	> 70%	> 85%	> 95%
CDC consumer lag (events)	> 10,000	> 50,000	> 200,000
CDC consumer lag (time)	> 5 min	> 15 min	> 1 hour
Dead letter queue depth	> 10	> 100	> 1,000
Max node degree	> 10,000	> 50,000	> 200,000
Active transactions	> 50	> 100	> 200

Alerting Implementation

from dataclasses import dataclass
from enum import Enum

class AlertLevel(Enum):
    INFO = "info"
    WARNING = "warning"
    URGENT = "urgent"
    PAGE = "page"


@dataclass
class AlertRule:
    name: str
    metric: str
    warning_threshold: float
    urgent_threshold: float
    page_threshold: float
    sustained_minutes: int = 0    # 0 = instant alert
    description: str = ""


ALERT_RULES = [
    AlertRule(
        name="query_latency_p95",
        metric="neo4j_query_latency_p95_ms",
        warning_threshold=500,
        urgent_threshold=2000,
        page_threshold=5000,
        sustained_minutes=5,
        description="Query latency p95 exceeds threshold"
    ),
    AlertRule(
        name="cache_hit_ratio",
        metric="neo4j_page_cache_hit_ratio",
        warning_threshold=90,   # Alert when BELOW threshold
        urgent_threshold=80,
        page_threshold=60,
        description="Page cache hit ratio below threshold"
    ),
    AlertRule(
        name="cdc_consumer_lag",
        metric="kafka_consumer_lag_seconds",
        warning_threshold=300,
        urgent_threshold=900,
        page_threshold=3600,
        sustained_minutes=2,
        description="CDC pipeline falling behind"
    ),
    AlertRule(
        name="disk_utilization",
        metric="disk_utilization_percent",
        warning_threshold=70,
        urgent_threshold=85,
        page_threshold=95,
        description="Disk space running low"
    ),
    AlertRule(
        name="supernode_degree",
        metric="neo4j_max_node_degree",
        warning_threshold=10_000,
        urgent_threshold=50_000,
        page_threshold=200_000,
        description="Supernode detected — high relationship count"
    ),
]


def evaluate_alert(rule: AlertRule, current_value: float) -> AlertLevel:
    """Evaluate a metric against alert thresholds."""
    # For "lower is worse" metrics like cache hit ratio
    if rule.name in ("cache_hit_ratio",):
        if current_value < rule.page_threshold:
            return AlertLevel.PAGE
        if current_value < rule.urgent_threshold:
            return AlertLevel.URGENT
        if current_value < rule.warning_threshold:
            return AlertLevel.WARNING
        return AlertLevel.INFO

    # For "higher is worse" metrics
    if current_value >= rule.page_threshold:
        return AlertLevel.PAGE
    if current_value >= rule.urgent_threshold:
        return AlertLevel.URGENT
    if current_value >= rule.warning_threshold:
        return AlertLevel.WARNING
    return AlertLevel.INFO

04. Backup and Disaster Recovery

Graph databases need backup strategies just as relational databases do. The approach depends on whether you are running self-managed Neo4j or a cloud-managed service.

Backup Strategy

Method	When to Use	RPO	RTO	Complexity
neo4j-admin dump	Small to medium databases (< 50GB)	Point-in-time	15-60 min	Low
neo4j-admin backup	Enterprise, online backup while running	Near-zero (with tx logs)	10-30 min	Medium
Cloud snapshots	Cloud-managed (AuraDB, VM snapshots)	Snapshot interval	5-15 min	Low
CDC replay	Rebuild from source relational DB	Depends on CDC lag	Hours	High

Backup Script

#!/bin/bash
# neo4j-backup.sh — daily backup with rotation

set -euo pipefail

BACKUP_DIR="/backups/neo4j"
RETENTION_DAYS=14
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_NAME="neo4j_backup_${TIMESTAMP}"

echo "Starting backup: ${BACKUP_NAME}"

# Online backup (Enterprise Edition)
neo4j-admin database backup \
    --to-path="${BACKUP_DIR}" \
    --type=full \
    neo4j

# Compress
tar -czf "${BACKUP_DIR}/${BACKUP_NAME}.tar.gz" \
    -C "${BACKUP_DIR}" "${BACKUP_NAME}"

# Upload to cloud storage
aws s3 cp "${BACKUP_DIR}/${BACKUP_NAME}.tar.gz" \
    "s3://company-backups/neo4j/${BACKUP_NAME}.tar.gz"

# Clean up local backup directory
rm -rf "${BACKUP_DIR}/${BACKUP_NAME}"

# Rotate old backups
find "${BACKUP_DIR}" -name "*.tar.gz" \
    -mtime +${RETENTION_DAYS} -delete

echo "Backup complete: ${BACKUP_NAME}.tar.gz"

Disaster Recovery Plan

Scenario	Detection	Recovery Steps	Expected Time
Corrupted database	Startup failure, query errors	Restore from latest backup	30-60 min
Accidental data deletion	User report, monitoring	Restore backup, replay CDC from timestamp	1-2 hours
Hardware failure	Node unreachable	Failover to read replica, promote to primary	5-15 min
Cloud region outage	Health checks fail	Switch to cross-region replica	15-30 min
Complete data loss	Everything is gone	Rebuild from relational source via batch sync	2-8 hours

The last row is the ultimate safety net of the sidecar pattern. Because the relational database is the system of record, you can always rebuild the graph from scratch. It takes time, but you never truly lose data.

05. Scaling Patterns

Read Replicas

For read-heavy workloads, which graph databases almost always are, add read replicas:

                    ┌──────────────────┐
                    │   Application    │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │  Load Balancer   │
                    └──┬─────┬─────┬───┘
                       │     │     │
            ┌──────────▼┐ ┌─▼──────▼──┐
            │  Primary  │ │  Replica   │
            │  (writes) │ │  (reads)   │
            └───────────┘ └────────────┘

from neo4j import GraphDatabase, READ_ACCESS, WRITE_ACCESS


def create_driver_with_routing(
    primary_uri: str,
    user: str,
    password: str
) -> GraphDatabase.driver:
    """Create a driver that routes reads to replicas."""
    # Use neo4j:// protocol for automatic routing
    driver = GraphDatabase.driver(
        primary_uri.replace("bolt://", "neo4j://"),
        auth=(user, password),
        max_connection_pool_size=100,
        connection_acquisition_timeout=30
    )
    return driver


def read_query(driver, cypher: str, params: dict = None):
    """Execute a read query — routed to a replica."""
    with driver.session(
        default_access_mode=READ_ACCESS
    ) as session:
        result = session.run(cypher, params or {})
        return [dict(r) for r in result]


def write_query(driver, cypher: str, params: dict = None):
    """Execute a write query — always goes to primary."""
    with driver.session(
        default_access_mode=WRITE_ACCESS
    ) as session:
        result = session.run(cypher, params or {})
        return result.consume()

Connection Pooling

Connection pool exhaustion is a common production issue. Configure these settings before you go live:

Setting	Default	Recommended	Why
`max_connection_pool_size`	100	50-200	Match your concurrency level
`connection_acquisition_timeout`	60s	30s	Fail fast, don't queue forever
`max_connection_lifetime`	1 hour	30 min	Prevent stale connections
`connection_timeout`	30s	10s	Detect unreachable server quickly
`keep_alive`	True	True	Detect broken connections

When to Consider Sharding

Most graph databases, including Neo4j, do not natively shard like relational databases. Before you consider sharding, exhaust these options:

Index everything you query by. Most latency problems are missing indexes, not capacity problems.
Add read replicas. If reads are the bottleneck, add replicas.
Optimize queries. Profile with EXPLAIN and PROFILE in Cypher. Fix Cartesian products and unbounded variable-length paths.
Increase page cache. If the cache hit ratio is below 95%, give Neo4j more RAM.
Separate workloads. Run analytics queries on a replica and operational queries on the primary.

If you have exhausted all of these options and still need more capacity, consider:

Domain-based partitioning: Separate graphs per tenant or per business domain, with cross-graph federation at the application layer.
Fabric (Neo4j 5+): Query across multiple Neo4j databases as if they were one.
Purpose-built distributed graph databases: TigerGraph, Amazon Neptune, or JanusGraph for truly massive scale.

06. Cost Management

Cloud graph database costs can surprise teams that do not watch them. Here is what to monitor:

Cost Driver	How It Grows	How to Control It
Instance size	You picked a bigger machine	Right-size based on actual memory needs
Storage	Graph grows from new data sources	Monitor node/relationship count growth rate
Read replicas	You added replicas for read scaling	Only add when p95 latency requires it
Network transfer	Queries returning large result sets	Limit results in Cypher (`LIMIT`), paginate
Backup storage	Daily backups accumulate	Set retention policy, compress, tier to cold storage

Monthly Cost Estimation Template

Instance:     db.r6g.xlarge (4 vCPU, 32GB RAM)     $X/month
Storage:      100GB SSD                              $Y/month
Replicas:     1 read replica (same size)             $X/month
Backups:      14-day retention, ~50GB compressed     $Z/month
Network:      Estimated 500GB transfer               $W/month
──────────────────────────────────────────────────────────────
Total estimated:                                     $TOTAL/month

Track actual against estimated cost monthly. Set a budget alert at 80% of your estimate.

07. Runbook: Common Issues and Fixes

This runbook covers the 8 most common operational issues. Put it in your team's wiki where on-call engineers can find it at 3am.

Issue 1: Query Latency Spike

Symptoms: p95 latency jumps from 200ms to 5 seconds or more. Users report timeouts.

Diagnosis:

-- Find the slowest currently running queries
CALL dbms.listQueries() YIELD queryId, query, elapsedTimeMillis
WHERE elapsedTimeMillis > 5000
RETURN queryId, query, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC

Common causes and fixes:

Missing index: CREATE INDEX FOR (n:Label) ON (n.property)
Cartesian product: Add relationship patterns to connect disconnected MATCH clauses.
Supernode hit: Add degree filter WHERE size((n)--()) < 10000
Unbounded path: Change [:REL*] to [:REL*..5] with an explicit depth limit.

Issue 2: CDC Pipeline Stopped

Symptoms: Kafka consumer lag growing, graph data is stale.

Diagnosis:

# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
    --group graph-sync-consumer --describe

# Check connector status
curl -s http://localhost:8083/connectors/pg-graph-connector/status

Common causes and fixes:

Consumer crashed: Restart consumer, check dead letter queue for failed events.
Connector lost: Restart Debezium connector, verify PostgreSQL replication slot exists.
Schema change: Update CDC handler for new or changed columns, replay from last committed offset.
Kafka disk full: Increase retention, add brokers, or reduce topic retention time.

Issue 3: Out of Memory

Symptoms: GC pauses over 1 second, heap usage above 95%, queries timing out.

Diagnosis:

# Check JVM heap usage
curl -s http://localhost:2004/metrics | grep heap

# Check for memory-hungry queries
CALL dbms.listQueries() YIELD query, allocatedBytes
ORDER BY allocatedBytes DESC LIMIT 5

Common causes and fixes:

Page cache too small: Increase server.memory.pagecache.size.
Heap too small: Increase server.memory.heap.max_size.
Query returning too much data: Add LIMIT, filter earlier in the query.
Graph outgrew instance: Upgrade instance size or add read replicas.

Issue 4: Data Inconsistency Between PostgreSQL and Neo4j

Symptoms: Application shows different data depending on which database it queries.

Diagnosis:

# Run the consistency check from Chapter 13
results = validate_sync_consistency(pg_conn, neo4j_session)
for check in results:
    if not check["pass"]:
        print(f"FAIL: {check}")

Common causes and fixes:

CDC lag: Wait for consumer to catch up and monitor lag.
Dropped events: Check dead letter queue, re-sync affected tables.
Handler bug: Fix the CDC handler, replay events from before the bug.
Schema drift: PostgreSQL column renamed but handler not updated.

Issue 5: Supernode Causing Slow Queries

Symptoms: Specific traversal queries suddenly slow after a data load.

Diagnosis:

-- Find nodes with highest degree
MATCH (n)
WITH n, size([(n)--() | 1]) AS degree
WHERE degree > 1000
RETURN labels(n) AS labels, n.name AS name, degree
ORDER BY degree DESC
LIMIT 20

Common causes and fixes:

Data modeling issue: Break the supernode into subgroups (by time period or category, for example).
Query issue: Filter relationships by type or property before traversing from the supernode.
Legitimate high-degree node: Add a degree check in the query and handle high-degree nodes separately.

Issue 6: Backup Failure

Symptoms: Backup job exited with error. No recent backup available.

Diagnosis: Check backup job logs, verify disk space on backup target.

Fixes:

Disk full: Free space or increase the backup volume.
Database locked: Ensure backup runs during a low-traffic period.
Permission error: Verify the backup user has correct filesystem permissions.
Network error (cloud backup): Retry and verify cloud credentials.

Issue 7: Read Replica Lag

Symptoms: Read replica returns stale data, replication lag metrics climbing.

Fixes:

Network bottleneck: Check bandwidth between primary and replica.
Replica overloaded: Reduce query load or add another replica.
Large transaction on primary: Wait for it to complete and replicate.
Restart the replica if lag does not recover within a reasonable window.

Issue 8: Connection Pool Exhaustion

Symptoms: "Unable to acquire connection" errors in application logs.

Fixes:

Increase pool size if current size is too small for the workload.
Fix connection leaks: ensure every session is closed (use with blocks).
Reduce query time to free connections faster.
Add a connection acquisition timeout to fail fast rather than queue indefinitely.

08. Chapter Checklist

Before you move on, verify:

You are monitoring query latency (p50/p95/p99), graph size, and pipeline health
Alert thresholds are set and you have tested that alerts fire correctly
Backup runs daily and you have tested restore at least once
You have a disaster recovery plan that includes "rebuild from relational source"
Connection pool is sized correctly and you monitor pool utilization
The on-call team has a runbook they can find at 3am
Cost monitoring is in place with budget alerts
You have identified any supernodes and have queries that handle them gracefully

← Back to Knowledge Graphs for Enterprise AI