n8n Monitoring Setup
Comprehensive monitoring and observability are essential for production n8n deployments. This guide covers all monitoring options including Prometheus metrics, logging, health checks, and alerting.
Production Monitoring: Proper monitoring is crucial for production deployments. It helps you identify issues early, track performance, and ensure system reliability.
Monitoring Overview
Monitoring Strategy: Implement monitoring at multiple levels (application, infrastructure, and business metrics) for comprehensive observability.
Available Metrics
- Application Metrics: Workflow executions, API calls, performance
- Queue Metrics: Job processing, queue depths, worker utilization
- System Metrics: CPU, memory, disk usage
- Custom Metrics: Business-specific KPIs
Monitoring Components
- Prometheus: Metrics collection and storage
- ServiceMonitor: Kubernetes-native monitoring
- Logging: Structured logging with configurable levels
- Health Checks: Liveness and readiness probes
- Alerting: Prometheus alerting rules
Prometheus Metrics Configuration
Prometheus Operator: This guide assumes you have Prometheus Operator installed. If not, you'll need to set up Prometheus separately.
Basic ServiceMonitor Setup
serviceMonitor:
enabled: true
interval: 30s
timeout: 10s
labels:
release: prometheus
include:
defaultMetrics: true
cacheMetrics: false
messageEventBusMetrics: false
workflowIdLabel: false
nodeTypeLabel: false
credentialTypeLabel: false
apiEndpoints: false
queueMetrics: false
Advanced ServiceMonitor Configuration
serviceMonitor:
enabled: true
namespace: monitoring
interval: 15s
timeout: 10s
labels:
release: prometheus
team: platform
targetLabels:
- app.kubernetes.io/name
- app.kubernetes.io/instance
metricRelabelings:
- sourceLabels: [prometheus_replica]
regex: (.*)
targetLabel: another_prometheus_replica
action: replace
include:
defaultMetrics: true
cacheMetrics: true
messageEventBusMetrics: true
workflowIdLabel: true
nodeTypeLabel: true
credentialTypeLabel: true
apiEndpoints: true
apiPathLabel: true
apiMethodLabel: true
apiStatusCodeLabel: true
queueMetrics: true
Metric Volume: Be cautious when enabling all metrics in high-traffic environments, as doing so can significantly increase Prometheus storage requirements and query latency.
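If you do enable the per-workflow or per-endpoint labels but need to cap series growth, the noisiest labels can be dropped at scrape time. A minimal sketch using the ServiceMonitor's metricRelabelings; the label name workflow_id is an assumption, so confirm it against your actual /metrics output first:

serviceMonitor:
  enabled: true
  include:
    defaultMetrics: true
    workflowIdLabel: true
  metricRelabelings:
    # Drop the per-workflow label to keep cardinality bounded
    # (label name assumed; verify against the metrics n8n actually exposes)
    - regex: workflow_id
      action: labeldrop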
Available Metrics
Default Metrics
- n8n_execution_total - Total workflow executions
- n8n_execution_duration_seconds - Execution duration histogram
- n8n_execution_failed_total - Failed executions
- n8n_workflow_total - Total workflows
- n8n_credential_total - Total credentials
- n8n_node_total - Total nodes
Queue Metrics (Queue Mode)
- n8n_queue_bull_queue_waiting - Jobs waiting in queue
- n8n_queue_bull_queue_active - Active jobs
- n8n_queue_bull_queue_completed - Completed jobs
- n8n_queue_bull_queue_failed - Failed jobs
- n8n_queue_bull_queue_delayed - Delayed jobs
API Metrics
- n8n_api_requests_total - Total API requests
- n8n_api_request_duration_seconds - API request duration
- n8n_api_requests_failed_total - Failed API requests
Queue Mode Endpoint Metrics
- MCP Endpoints: Monitor /mcp/ and /mcp-test/ endpoint performance
- Form Endpoints: Monitor /form/, /form-test/, and /form-waiting/ endpoint performance
- Webhook Endpoints: Monitor /webhook/ and /webhook-test/ endpoint performance
Endpoint Monitoring: In queue mode, different endpoints are handled by different node types. Monitor webhook nodes for MCP and Form endpoint performance.
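As a sketch of how to act on these endpoint metrics, the rule below alerts on slow webhook traffic using the API duration histogram. It assumes apiEndpoints and apiPathLabel are enabled and that the resulting label is named path; verify the label name against your /metrics output before using it.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-endpoint-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: n8n-endpoints
      rules:
        - alert: N8NSlowWebhookEndpoints
          # 95th percentile latency of /webhook* requests over 5 minutes
          expr: |
            histogram_quantile(0.95,
              sum by (le) (rate(n8n_api_request_duration_seconds_bucket{path=~"/webhook.*"}[5m]))
            ) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Slow n8n webhook endpoints"
            description: "95th percentile webhook latency is above 5 seconds"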
Cache Metrics
- n8n_cache_hits_total - Cache hits
- n8n_cache_misses_total - Cache misses
- n8n_cache_size_bytes - Cache size
Prometheus Queries
Query Optimization: Use appropriate time ranges and aggregation functions to optimize query performance in Prometheus.
Basic Queries
# Execution rate (executions per second)
rate(n8n_execution_total[5m])
# Average execution duration
rate(n8n_execution_duration_seconds_sum[5m]) / rate(n8n_execution_duration_seconds_count[5m])
# Error rate
rate(n8n_execution_failed_total[5m])
# Queue depth (queue mode)
n8n_queue_bull_queue_waiting
# API request rate
rate(n8n_api_requests_total[5m])
Advanced Queries
# Success rate
(
rate(n8n_execution_total[5m]) - rate(n8n_execution_failed_total[5m])
) / rate(n8n_execution_total[5m]) * 100
# 95th percentile execution time
histogram_quantile(0.95, rate(n8n_execution_duration_seconds_bucket[5m]))
# Queue utilization (queue mode)
n8n_queue_bull_queue_active / (n8n_queue_bull_queue_waiting + n8n_queue_bull_queue_active) * 100
# Cache hit ratio
rate(n8n_cache_hits_total[5m]) / (rate(n8n_cache_hits_total[5m]) + rate(n8n_cache_misses_total[5m])) * 100
Logging Configuration
Log Management: Configure appropriate log levels and outputs based on your environment. Use structured logging for better log analysis.
Basic Logging
log:
level: info
output:
- console
scopes: []
Advanced Logging
log:
level: info
output:
- console
- file
scopes:
- concurrency
- external-secrets
- license
- multi-main-setup
- pubsub
- redis
- scaling
- waiting-executions
file:
location: "logs/n8n.log"
maxsize: 16
maxcount: "100"
Structured Logging
log:
level: info
output:
- console
scopes:
- concurrency
- redis
- scaling
Log Aggregation
# Configure log forwarding to external systems
main:
extraContainers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
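If you prefer to ship the file output configured above rather than node-level container logs, a lightweight sidecar can stream the log file to stdout for your cluster's log agent to pick up. A minimal sketch; it assumes the chart mounts the n8n data volume as n8n-data and that log.file.location is resolved relative to /home/node/.n8n:

main:
  extraContainers:
    - name: log-tailer
      image: busybox:1.35
      # Follow the n8n log file and emit it on stdout
      command: ["/bin/sh", "-c", "tail -n +1 -F /home/node/.n8n/logs/n8n.log"]
      volumeMounts:
        - name: n8n-data
          mountPath: /home/node/.n8n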
Health Checks
Basic Health Checks
main:
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /healthz/readiness
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Advanced Health Checks
main:
livenessProbe:
httpGet:
path: /healthz
port: http
httpHeaders:
- name: X-Custom-Header
value: health-check
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
successThreshold: 1
readinessProbe:
httpGet:
path: /healthz/readiness
port: http
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
startupProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30
Queue Mode Health Checks
worker:
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /healthz/readiness
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
exec:
command: ["/bin/sh", "-c", "ps aux | grep '[n]8n'"]
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 30
webhook:
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /healthz/readiness
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
Persistence Monitoring
Persistence Monitoring: Monitor persistent volumes, PVC status, and storage usage for each node type independently. This monitoring is configured separately from pod-level settings such as hostAliases.
Volume Health Checks
# Monitor PVC status and binding
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-persistence-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n-persistence
rules:
- alert: N8NPVCNotBound
expr: kube_persistentvolumeclaim_status_phase{namespace="n8n", phase="Bound"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "n8n PVC not bound"
description: "PersistentVolumeClaim {{ $labels.persistentvolumeclaim }} is not bound"
- alert: N8NVolumeFull
expr: kubelet_volume_stats_used_bytes{namespace="n8n"} / kubelet_volume_stats_capacity_bytes{namespace="n8n"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "n8n volume nearly full"
description: "Volume {{ $labels.persistentvolumeclaim }} is {{ $value | humanizePercentage }} full"
StatefulSet vs Deployment Monitoring
Workload Type Monitoring
# Monitor workload type selection
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-workload-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n-workload
rules:
- alert: N8NUnexpectedWorkloadType
expr: |
(
sum(kube_statefulset_status_replicas_ready{namespace="n8n"}) > 0
and
sum(kube_deployment_status_replicas_ready{namespace="n8n"}) > 0
)
for: 1m
labels:
severity: warning
annotations:
summary: "Unexpected workload type detected"
description: "Both StatefulSet and Deployment are running for n8n"
Persistence Configuration Monitoring
Access Mode Validation
# Monitor persistence access modes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-persistence-config-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n-persistence-config
rules:
- alert: N8NInvalidAccessMode
expr: |
(
kube_persistentvolumeclaim_access_mode{namespace="n8n", access_mode="ReadWriteOnce"} == 1
and on(namespace)
kube_horizontalpodautoscaler_status_current_replicas{namespace="n8n"} > 1
)
for: 5m
labels:
severity: warning
annotations:
summary: "Invalid persistence access mode for autoscaling"
description: "ReadWriteOnce persistence with autoscaling may cause issues"
Storage Class Monitoring
Storage Class Availability
# Monitor storage class availability
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-storage-class-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n-storage-class
rules:
- alert: N8NStorageClassNotFound
expr: |
(
kube_persistentvolumeclaim_info{namespace="n8n"}
unless on(storageclass)
kube_storageclass_info
)
for: 5m
labels:
severity: critical
annotations:
summary: "Storage class not found"
description: "Storage class for n8n PVC is not available"
Volume Permission Monitoring
Permission Health Checks
# Monitor volume permissions
main:
extraContainers:
- name: permission-monitor
image: busybox:1.35
command:
- /bin/sh
- -c
- |
while true; do
if [ ! -w /home/node/.n8n ]; then
echo "Volume not writable"
exit 1
fi
if [ "$(stat -c %u /home/node/.n8n)" != "1000" ]; then
echo "Wrong ownership"
exit 1
fi
sleep 60
done
volumeMounts:
- name: n8n-data
mountPath: /home/node/.n8n
Monitoring Best Practices
Comprehensive Monitoring Setup
# Complete monitoring configuration
serviceMonitor:
enabled: true
include:
defaultMetrics: true
cacheMetrics: true
queueMetrics: true
interval: 15s
timeout: 5s
# Persistence monitoring
main:
extraContainers:
- name: volume-monitor
image: busybox:1.35
command:
- /bin/sh
- -c
- |
while true; do
# Check volume space
df -h /home/node/.n8n
# Check permissions
ls -la /home/node/.n8n
# Check host aliases
cat /etc/hosts | grep -E "(internal-api|database|redis)"
sleep 300
done
volumeMounts:
- name: n8n-data
mountPath: /home/node/.n8n
Monitoring Strategy: Combine Kubernetes-native monitoring with custom health checks to ensure comprehensive coverage of persistence features.
Resource Overhead: Custom monitoring containers add resource overhead. Monitor their impact and adjust resource limits accordingly.
Alerting Rules
Basic Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n
rules:
- alert: N8NHighErrorRate
expr: rate(n8n_execution_failed_total[5m]) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "High n8n execution error rate"
description: "n8n is experiencing a high error rate of {{ $value }} errors per second"
- alert: N8NHighExecutionTime
expr: histogram_quantile(0.95, rate(n8n_execution_duration_seconds_bucket[5m])) > 300
for: 5m
labels:
severity: warning
annotations:
summary: "High n8n execution time"
description: "95th percentile execution time is {{ $value }} seconds"
- alert: N8NQueueDepth
expr: n8n_queue_bull_queue_waiting > 100
for: 2m
labels:
severity: warning
annotations:
summary: "High n8n queue depth"
description: "Queue depth is {{ $value }} jobs"
Advanced Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: n8n-advanced-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: n8n.advanced
rules:
- alert: N8NLowSuccessRate
expr: (
rate(n8n_execution_total[5m]) - rate(n8n_execution_failed_total[5m])
) / rate(n8n_execution_total[5m]) < 0.95
for: 5m
labels:
severity: critical
annotations:
summary: "Low n8n success rate"
description: "Success rate is {{ $value | humanizePercentage }}"
- alert: N8NHighMemoryUsage
expr: (container_memory_usage_bytes{container="n8n"} / container_spec_memory_limit_bytes{container="n8n"}) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High n8n memory usage"
description: "Memory usage is {{ $value | humanizePercentage }}"
- alert: N8NHighCPUUsage
expr: rate(container_cpu_usage_seconds_total{container="n8n"}[5m]) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "High n8n CPU usage"
description: "CPU usage is {{ $value | humanizePercentage }}"
- alert: N8NQueueStuck
expr: n8n_queue_bull_queue_active > 0 and rate(n8n_queue_bull_queue_completed[5m]) == 0
for: 10m
labels:
severity: critical
annotations:
summary: "n8n queue appears stuck"
description: "Active jobs: {{ $value }}, no completions in 5m"
- alert: N8NHighAPILatency
expr: histogram_quantile(0.95, rate(n8n_api_request_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "High n8n API latency"
description: "95th percentile API latency is {{ $value }} seconds"
Grafana Dashboards
Dashboard Templates: These dashboard examples are based on actual n8n metrics and can be imported directly into Grafana. Customize them based on your specific monitoring needs.
Comprehensive n8n Dashboard
Grafana dashboard JSON file
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"description": "n8n prometheus client basic metrics",
"editable": true,
"fiscalYearStartMonth": 0,
"gnetId": 11159,
"graphTooltip": 0,
"iteration": 1750529070188,
"links": [],
"liveNow": false,
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 7,
"w": 9,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 6,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "irate(n8n_process_cpu_user_seconds_total{instance=~\"$instance\"}[2m]) * 100",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "User CPU - {{instance}}",
"refId": "A"
},
{
"exemplar": true,
"expr": "irate(n8n_process_cpu_system_seconds_total{instance=~\"$instance\"}[2m]) * 100",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Sys CPU - {{instance}}",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Process CPU Usage",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "percent",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 7,
"w": 8,
"x": 9,
"y": 0
},
"hiddenSeries": false,
"id": 8,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_nodejs_eventloop_lag_seconds{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{role}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Event Loop Lag",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"datasource": "prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"decimals": 0,
"mappings": [],
"noValue": "0",
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 3,
"x": 17,
"y": 0
},
"id": 14,
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"calcs": [
"last"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "value"
},
"pluginVersion": "8.2.7",
"targets": [
{
"exemplar": true,
"expr": "sum(increase(n8n_scaling_mode_queue_jobs_completed[1w]))",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Last Week Completed Jobs",
"type": "stat"
},
{
"cacheTimeout": null,
"datasource": "prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [
{
"options": {
"match": "null",
"result": {
"text": "N/A"
}
},
"type": "special"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 3,
"w": 4,
"x": 20,
"y": 0
},
"id": 2,
"interval": "",
"links": [],
"maxDataPoints": 100,
"options": {
"colorMode": "none",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "name"
},
"pluginVersion": "8.2.7",
"targets": [
{
"exemplar": true,
"expr": "sum(n8n_nodejs_version_info{instance=~\"$instance\"}) by (version)",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{version}}",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Node.js Version",
"type": "stat"
},
{
"cacheTimeout": null,
"datasource": "prometheus",
"fieldConfig": {
"defaults": {
"color": {
"fixedColor": "#F2495C",
"mode": "fixed"
},
"mappings": [
{
"options": {
"match": "null",
"result": {
"text": "N/A"
}
},
"type": "special"
}
],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "none"
},
"overrides": []
},
"gridPos": {
"h": 4,
"w": 4,
"x": 20,
"y": 3
},
"id": 4,
"interval": null,
"links": [],
"maxDataPoints": 100,
"options": {
"colorMode": "none",
"graphMode": "none",
"justifyMode": "auto",
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"text": {},
"textMode": "name"
},
"pluginVersion": "8.2.7",
"targets": [
{
"exemplar": true,
"expr": "sum(n8n_version_info{instance=~\"$instance\"}) by (version)",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{version}}",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "n8n version",
"type": "stat"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 7,
"w": 16,
"x": 0,
"y": 7
},
"hiddenSeries": false,
"id": 7,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"rightSide": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_process_resident_memory_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Process Memory - {{role}}",
"refId": "A"
},
{
"exemplar": true,
"expr": "n8n_nodejs_heap_size_total_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Heap Total - {{role}}",
"refId": "B"
},
{
"exemplar": true,
"expr": "n8n_nodejs_heap_size_used_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Heap Used - {{role}}",
"refId": "C"
},
{
"exemplar": true,
"expr": "n8n_nodejs_external_memory_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "External Memory - {{role}}",
"refId": "D"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Process Memory Usage",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 7,
"w": 8,
"x": 16,
"y": 7
},
"hiddenSeries": false,
"id": 9,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_nodejs_active_handles_total{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Active Handler - {{role}}",
"refId": "A"
},
{
"exemplar": true,
"expr": "n8n_nodejs_active_requests_total{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Active Request - {{role}}",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Active Handlers/Requests Total",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 14
},
"hiddenSeries": false,
"id": 10,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_nodejs_heap_space_size_total_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Heap Total - {{role}} - {{space}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Heap Total Detail",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 14
},
"hiddenSeries": false,
"id": 11,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_nodejs_heap_space_size_used_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Heap Used - {{role}} - {{space}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Heap Used Detail",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 14
},
"hiddenSeries": false,
"id": 12,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"paceLength": 10,
"percentage": false,
"pluginVersion": "8.2.7",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"exemplar": true,
"expr": "n8n_nodejs_heap_space_size_available_bytes{instance=~\"$instance\"}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "Heap Used - {{role}} - {{space}}",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Heap Available Detail",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "30s",
"schemaVersion": 32,
"style": "dark",
"tags": [
"n8n"
],
"templating": {
"list": [
{
"allValue": null,
"current": {
"selected": true,
"text": [
"All"
],
"value": [
"$__all"
]
},
"datasource": "prometheus",
"definition": "label_values(n8n_nodejs_version_info, instance)",
"description": null,
"error": null,
"hide": 0,
"includeAll": true,
"label": "instance",
"multi": true,
"name": "instance",
"options": [],
"query": {
"query": "label_values(n8n_nodejs_version_info, instance)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "N8N Application Dashboard",
"uid": "PTSqcpJWk",
"version": 1
}
Enhanced Prometheus Queries
Query Optimization: These queries are optimized for production use and include proper rate calculations and aggregations.
System Performance Queries
# CPU Usage (per instance)
irate(n8n_process_cpu_user_seconds_total{instance=~"$instance"}[2m]) * 100
# Memory Usage (per instance)
n8n_process_resident_memory_bytes{instance=~"$instance"}
# Event Loop Lag (critical for performance)
n8n_nodejs_eventloop_lag_seconds{instance=~"$instance"}
# Active Handles and Requests
n8n_nodejs_active_handles_total{instance=~"$instance"}
n8n_nodejs_active_requests_total{instance=~"$instance"}
Workflow Execution Queries
# Execution Rate (per instance)
rate(n8n_execution_total{instance=~"$instance"}[5m])
# Success Rate (per instance)
(
rate(n8n_execution_total{instance=~"$instance"}[5m]) -
rate(n8n_execution_failed_total{instance=~"$instance"}[5m])
) / rate(n8n_execution_total{instance=~"$instance"}[5m]) * 100
# Execution Duration Percentiles
histogram_quantile(0.50, rate(n8n_execution_duration_seconds_bucket{instance=~"$instance"}[5m]))
histogram_quantile(0.95, rate(n8n_execution_duration_seconds_bucket{instance=~"$instance"}[5m]))
histogram_quantile(0.99, rate(n8n_execution_duration_seconds_bucket{instance=~"$instance"}[5m]))
# Error Rate
rate(n8n_execution_failed_total{instance=~"$instance"}[5m])
Queue Mode Queries
# Queue Depth by Status
n8n_queue_bull_queue_waiting{instance=~"$instance"}
n8n_queue_bull_queue_active{instance=~"$instance"}
n8n_queue_bull_queue_delayed{instance=~"$instance"}
# Queue Processing Rate
rate(n8n_queue_bull_queue_completed{instance=~"$instance"}[5m])
rate(n8n_queue_bull_queue_failed{instance=~"$instance"}[5m])
# Worker Utilization
n8n_queue_bull_queue_active{role="worker"} /
(n8n_queue_bull_queue_waiting{role="worker"} + n8n_queue_bull_queue_active{role="worker"}) * 100
# Queue Stuck Detection
n8n_queue_bull_queue_active{instance=~"$instance"} > 0 and
rate(n8n_queue_bull_queue_completed{instance=~"$instance"}[5m]) == 0
API Performance Queries
# API Request Rate
rate(n8n_api_requests_total{instance=~"$instance"}[5m])
# API Response Time Percentiles
histogram_quantile(0.50, rate(n8n_api_request_duration_seconds_bucket{instance=~"$instance"}[5m]))
histogram_quantile(0.95, rate(n8n_api_request_duration_seconds_bucket{instance=~"$instance"}[5m]))
# API Error Rate
rate(n8n_api_requests_failed_total{instance=~"$instance"}[5m])
Memory and Heap Queries
# Heap Memory Usage
n8n_nodejs_heap_size_used_bytes{instance=~"$instance"}
n8n_nodejs_heap_size_total_bytes{instance=~"$instance"}
# Heap Space Details
n8n_nodejs_heap_space_size_used_bytes{instance=~"$instance"}
n8n_nodejs_heap_space_size_available_bytes{instance=~"$instance"}
# External Memory
n8n_nodejs_external_memory_bytes{instance=~"$instance"}
Advanced Analytics Queries
# Execution Trends (hourly)
increase(n8n_execution_total{instance=~"$instance"}[1h])
# Success Rate Trends (hourly)
(
increase(n8n_execution_total{instance=~"$instance"}[1h]) -
increase(n8n_execution_failed_total{instance=~"$instance"}[1h])
) / increase(n8n_execution_total{instance=~"$instance"}[1h]) * 100
# Queue Processing Efficiency
rate(n8n_queue_bull_queue_completed{instance=~"$instance"}[5m]) /
(n8n_queue_bull_queue_waiting{instance=~"$instance"} + n8n_queue_bull_queue_active{instance=~"$instance"})
# Resource Utilization Score
(
irate(n8n_process_cpu_user_seconds_total{instance=~"$instance"}[2m]) * 100 +
(n8n_process_resident_memory_bytes{instance=~"$instance"} /
n8n_nodejs_heap_size_total_bytes{instance=~"$instance"}) * 100
) / 2
Query Performance: Use appropriate time ranges and consider using recording rules for complex queries that are frequently executed.
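A minimal recording-rules sketch for the heavier expressions above, so dashboards and alerts can read precomputed series instead of re-evaluating histogram quantiles on every refresh. The rule names are arbitrary; the metric names are the ones used throughout this guide:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: n8n.recording
      interval: 1m
      rules:
        # 95th percentile execution duration, 5m window
        - record: n8n:execution_duration_seconds:p95_5m
          expr: histogram_quantile(0.95, rate(n8n_execution_duration_seconds_bucket[5m]))
        # Fraction of successful executions, 5m window
        - record: n8n:execution_success_ratio_5m
          expr: |
            (
              rate(n8n_execution_total[5m]) - rate(n8n_execution_failed_total[5m])
            ) / rate(n8n_execution_total[5m])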
Sentry Integration
Basic Sentry Setup
sentry:
enabled: true
backendDsn: "https://your-sentry-dsn@sentry.io/project"
frontendDsn: "https://your-sentry-dsn@sentry.io/project"
Advanced Sentry Configuration
sentry:
enabled: true
backendDsn: "https://your-sentry-dsn@sentry.io/project"
frontendDsn: "https://your-sentry-dsn@sentry.io/project"
externalTaskRunnersDsn: "https://your-sentry-dsn@sentry.io/project"
Database Monitoring
PostgreSQL Monitoring
# Enable PostgreSQL metrics
postgresql:
enabled: true
metrics:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
labels:
release: prometheus
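To alert on database availability itself, the postgres_exporter bundled with most PostgreSQL charts exposes a pg_up gauge. A minimal sketch; verify the metric and labels your exporter actually emits:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-postgresql-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: n8n-postgresql
      rules:
        - alert: N8NPostgreSQLDown
          expr: pg_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "n8n PostgreSQL is down"
            description: "The PostgreSQL exporter reports the database as unreachable"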
Redis Monitoring (Queue Mode)
# Enable Redis metrics
redis:
enabled: true
metrics:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
labels:
release: prometheus
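The same pattern works for the queue-mode Redis: redis_exporter exposes a redis_up gauge, so a basic availability alert looks like the sketch below (again, confirm the metric name against your exporter):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: n8n-redis-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: n8n-redis
      rules:
        - alert: N8NRedisDown
          expr: redis_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "n8n Redis is down"
            description: "The Redis exporter reports the queue backend as unreachable"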
Storage Monitoring
Filesystem Storage Monitoring
# Monitor filesystem usage
main:
extraContainers:
- name: storage-monitor
image: busybox
command:
- /bin/sh
- -c
- |
while true; do
df -h /data | tail -1 | awk '{print $5}' | sed 's/%//' > /tmp/disk-usage
sleep 60
done
volumeMounts:
- name: n8n-binary-data
mountPath: /data
Pod Affinity and Anti-Affinity
Monitoring Affinity: Proper affinity configuration can improve monitoring performance and reliability by ensuring pods are distributed optimally across your cluster.
Deprecation Notice: The top-level affinity field is deprecated. Use the specific affinity configurations under the main, worker, and webhook blocks instead.
Affinity for Monitoring Optimization
Spread Monitoring Pods
# Spread pods to avoid monitoring bottlenecks
main:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- n8n
topologyKey: kubernetes.io/hostname
worker:
mode: queue
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- n8n
- key: app.kubernetes.io/component
operator: In
values:
- worker
topologyKey: kubernetes.io/hostname
Zone Distribution for High Availability
# Distribute pods across availability zones for monitoring resilience
main:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- n8n
topologyKey: topology.kubernetes.io/zone
webhook:
mode: queue
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- n8n
- key: app.kubernetes.io/component
operator: In
values:
- webhook
topologyKey: topology.kubernetes.io/zone
Node Affinity for Monitoring Nodes
# Place pods on nodes with monitoring capabilities
main:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: monitoring-enabled
operator: In
values:
- "true"
worker:
mode: queue
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: monitoring-enabled
operator: In
values:
- "true"
Monitoring Benefits: Proper affinity configuration ensures that monitoring data collection is distributed across your cluster, preventing bottlenecks and improving overall monitoring performance.
Troubleshooting
Common Monitoring Issues
ServiceMonitor Not Scraping
# Check ServiceMonitor status
kubectl get servicemonitor -n monitoring
# Check Prometheus targets (port-forward, then open http://localhost:9090/targets)
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Check n8n metrics endpoint
kubectl exec -it <n8n-pod> -- curl -s http://localhost:5678/metrics
High Memory Usage
# Check memory usage
kubectl top pods -l app.kubernetes.io/name=n8n
# Check memory limits
kubectl describe pod <n8n-pod>
# Check for memory leaks
kubectl logs <n8n-pod> | grep -i memory
High CPU Usage
# Check CPU usage
kubectl top pods -l app.kubernetes.io/name=n8n
# Check for CPU-intensive operations
kubectl logs <n8n-pod> | grep -i cpu
# Check execution metrics
kubectl exec -it <n8n-pod> -- curl -s http://localhost:5678/metrics | grep execution
Performance Optimization
Metrics Collection Optimization
serviceMonitor:
enabled: true
interval: 60s # Increase interval for high-volume deployments
timeout: 30s
include:
defaultMetrics: true
cacheMetrics: false # Disable if not needed
messageEventBusMetrics: false
queueMetrics: true
Logging Optimization
log:
level: warn # Reduce log level in production
output:
- console
scopes:
- redis
- scaling
file:
maxsize: 32 # Increase file size
maxcount: "50" # Reduce file count
Best Practices
Monitoring Strategy
- Start with basic metrics and expand gradually
- Use appropriate alert thresholds
- Monitor both application and infrastructure metrics
- Set up dashboards for different user roles
- Review and tune alerts regularly
Performance
- Use appropriate scrape intervals
- Filter metrics to reduce cardinality
- Optimize Prometheus queries
- Use recording rules for complex queries
- Monitor monitoring system performance
Reliability
- Set up monitoring for the monitoring system
- Use multiple alerting channels
- Test alerting rules regularly
- Document alert procedures
- Set up escalation policies
Security
- Secure Prometheus endpoints
- Use RBAC for monitoring access
- Encrypt sensitive metrics
- Audit monitoring access
- Apply security updates regularly
Next Steps
- Usage Guide - Quick start and basic deployment
- Configuration Guide - Detailed configuration options
- Database Setup - PostgreSQL and external database configuration
- Queue Mode Setup - Distributed execution with Redis
- Storage Configuration - Binary data storage options
- Troubleshooting - Common issues and solutions