grafana-dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
- risk: unknown
- source: community
- date added: 2026-02-27
Grafana Dashboards
Do not use this skill when
- The task is unrelated to Grafana dashboards
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open resources/implementation-playbook.md.
Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
Use this skill when
- You are visualizing Prometheus metrics
- You are creating custom dashboards
- You are implementing SLO dashboards
- You are monitoring infrastructure
- You are tracking business KPIs
Dashboard Design Principles
1. Hierarchy of Information
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers)      │
├─────────────────────────────────────┤
│ Key Trends (Time Series)            │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps)  │
└─────────────────────────────────────┘
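The hierarchy above maps naturally onto Grafana's 24-column grid. The following is a minimal sketch, not a definitive layout tool: `layout_tiers` and the panel titles in the usage example are hypothetical names introduced for illustration, and each tier is assumed to share the full width equally.

```python
GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def layout_tiers(tiers):
    """Assign gridPos to panels tier by tier, from top to bottom.

    `tiers` is a list of (panel_titles, height) pairs. Each tier becomes one
    row whose panels split the grid width equally.
    """
    panels, y = [], 0
    for titles, height in tiers:
        if not titles:
            continue  # skip empty tiers instead of dividing by zero
        width = GRID_WIDTH // len(titles)
        for i, title in enumerate(titles):
            panels.append({
                "title": title,
                "gridPos": {"x": i * width, "y": y, "w": width, "h": height},
            })
        y += height
    return panels


if __name__ == "__main__":
    tiers = [
        (["Requests/s", "Error %", "P95", "Uptime"], 4),  # big numbers
        (["Request Rate", "Latency"], 8),                 # trends
        (["Slowest Endpoints"], 8),                       # details
    ]
    for panel in layout_tiers(tiers):
        print(panel)
```

Keeping the top tier short (here 4 grid units high) leaves the critical numbers visible without scrolling.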
2. RED Method (Services)
- Rate - Requests per second
- Errors - Error rate
- Duration - Latency/response time
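As an illustration, the three RED signals can be generated as PromQL strings. This sketch assumes the metric names used elsewhere in this document (`http_requests_total`, `http_request_duration_seconds_bucket`) and a `status` label carrying the HTTP status code; `red_queries` is a hypothetical helper, not part of any Grafana or Prometheus library.

```python
def red_queries(window="5m", group_by="service"):
    """Build one PromQL expression per RED signal.

    Assumes a request counter with a `status` label and a classic latency
    histogram; adjust the metric names to match your instrumentation.
    """
    rate = f"sum(rate(http_requests_total[{window}])) by ({group_by})"
    errors = (
        f'sum(rate(http_requests_total{{status=~"5.."}}[{window}])) by ({group_by})'
        f" / sum(rate(http_requests_total[{window}])) by ({group_by})"
    )
    duration = (
        "histogram_quantile(0.95, "
        f"sum(rate(http_request_duration_seconds_bucket[{window}])) by (le, {group_by}))"
    )
    return {"rate": rate, "errors": errors, "duration": duration}


if __name__ == "__main__":
    for signal, expr in red_queries().items():
        print(f"{signal}: {expr}")
```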
3. USE Method (Resources)
- Utilization - % time resource is busy
- Saturation - Queue length/wait time
- Errors - Error count
Dashboard Structure
API Monitoring Dashboard
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}
Reference: See assets/api-dashboard.json
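The P95 Latency panel depends on `histogram_quantile`, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The sketch below mirrors that idea for classic cumulative buckets under simplifying assumptions: it ignores NaN handling, non-monotonic buckets, and native histograms, and the bucket values in the usage example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, and counts are cumulative, as in
    a Prometheus classic histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # In the +Inf bucket (or an empty one) there is nothing to
                # interpolate, so report the lower edge.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated, its accuracy is bounded by the bucket boundaries chosen at instrumentation time.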
Panel Types
1. Stat Panel (Single Value)
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}
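To make the threshold steps concrete: in absolute mode a value takes the color of the highest step it has reached. The helper below is a hypothetical sketch of that rule, using the steps from the stat panel above; it assumes the lowest step acts as the base color for anything below it.

```python
def threshold_color(value, steps):
    """Return the color of the highest step whose value is <= `value`."""
    ordered = sorted(steps, key=lambda step: step["value"])
    color = ordered[0]["color"]  # assumption: the lowest step is the base color
    for step in ordered:
        if value >= step["value"]:
            color = step["color"]
    return color


if __name__ == "__main__":
    steps = [
        {"value": 0, "color": "green"},
        {"value": 80, "color": "yellow"},
        {"value": 90, "color": "red"},
    ]
    for sample in (42, 85, 95):
        print(sample, threshold_color(sample, steps))
```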
2. Time Series Graph
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}
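The CPU expression reads as "100 minus the average idle percentage across cores". A worked version of that arithmetic, assuming two raw samples of the idle counter per core and ignoring the extrapolation that Prometheus's `rate()` applies at window edges (`cpu_usage_percent` and the sample values are illustrative):

```python
def cpu_usage_percent(idle_start, idle_end, interval_seconds):
    """Busy percentage averaged over all cores of one instance.

    `idle_start` and `idle_end` hold per-core samples of the idle counter
    (in seconds), taken `interval_seconds` apart.
    """
    idle_rates = [
        (end - start) / interval_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Two cores over a 5-minute window: one idle 270 s, the other 150 s.
    print(cpu_usage_percent([1000, 2000], [1270, 2150], 300))
```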
3. Table Panel
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
4. Heatmap
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
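Prometheus histogram buckets are cumulative: each `le` bucket also counts everything in the buckets below it. A heatmap needs the value that falls into each bucket alone, which, as far as I understand, is the conversion the `"format": "heatmap"` target option asks the data source to perform. The sketch below shows that subtraction on made-up values; `per_bucket_counts` is a hypothetical name.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative `le` bucket values into per-bucket values.

    `cumulative` maps the `le` label (a string such as "0.5" or "+Inf") to
    the cumulative value for that bucket.
    """
    ordered = sorted(cumulative.items(), key=lambda item: float(item[0]))
    counts, previous = {}, 0
    for le, value in ordered:
        counts[le] = value - previous
        previous = value
    return counts


if __name__ == "__main__":
    print(per_bucket_counts({"0.1": 50, "0.5": 90, "1": 98, "+Inf": 100}))
```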
Variables
Query Variables
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
Use Variables in Queries
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
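The query uses `=` for the single-value `$namespace` and `=~` for the multi-value `$service` because a multi-value selection is expanded into a regex alternation. The following is a rough simulation of that substitution, not Grafana's actual implementation: `interpolate` is a hypothetical helper, and Python's `re.escape` only approximates the escaping Grafana applies.

```python
import re


def interpolate(query, variables):
    """Substitute $name placeholders; list values become a regex alternation."""
    def render(match):
        value = variables[match.group(1)]
        if isinstance(value, (list, tuple)):
            return "(" + "|".join(re.escape(item) for item in value) + ")"
        return value

    return re.sub(r"\$(\w+)", render, query)


if __name__ == "__main__":
    template = (
        'sum(rate(http_requests_total{namespace="$namespace", '
        'service=~"$service"}[5m]))'
    )
    print(interpolate(template, {"namespace": "prod", "service": ["checkout", "cart"]}))
```

With an exact-match `=` the expanded `(checkout|cart)` would be compared literally and match nothing, which is why multi-value variables need `=~`.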
Alerts in Dashboards
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}
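The interplay of `frequency`, `for`, and the `gt` evaluator is easier to see as a small state machine. This is a simplified model, not Grafana's alerting engine: each input is assumed to be the already-reduced `avg` of query A over the last 5 minutes, one per evaluation, and the exact boundary at which pending becomes alerting (here, once the breach has lasted the full `for` period) is an assumption of the sketch.

```python
def alert_states(evaluations, threshold=5.0, for_evals=5):
    """Map per-evaluation averages to "ok", "pending", or "alerting".

    With "frequency": "1m" and "for": "5m", `for_evals=5` means the breach
    must persist for five further evaluations after it is first seen.
    """
    states, breaching = [], 0
    for average in evaluations:
        if average > threshold:          # evaluator: {"type": "gt", "params": [5]}
            breaching += 1
            states.append("alerting" if breaching > for_evals else "pending")
        else:
            breaching = 0                # any healthy evaluation resets the timer
            states.append("ok")
    return states


if __name__ == "__main__":
    print(alert_states([2, 6, 7, 8, 9, 6, 6, 3]))
```

The `for` period trades detection speed for fewer notifications caused by short spikes.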
Dashboard Provisioning
dashboards.yml:
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
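Since the provider rescans the directory every `updateIntervalSeconds`, a malformed file is picked up almost immediately. A pre-deployment check along these lines can catch the most basic problems first. It is a minimal sketch: `check_dashboard_dir` is a hypothetical helper, and it only verifies that each file parses as JSON and carries a top-level `title`, which is far less than Grafana itself validates.

```python
import json
from pathlib import Path


def check_dashboard_dir(path):
    """Return {filename: problem} for JSON files that fail basic checks."""
    problems = {}
    for file in sorted(Path(path).glob("*.json")):
        try:
            dashboard = json.loads(file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[file.name] = f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(dashboard, dict) or not dashboard.get("title"):
            problems[file.name] = "missing dashboard title"
    return problems


if __name__ == "__main__":
    # Path taken from the provider's `options.path` above.
    for name, problem in check_dashboard_dir("/etc/grafana/dashboards").items():
        print(f"{name}: {problem}")
```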
Common Dashboard Patterns
Infrastructure Dashboard
Key Panels:
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status
Reference: See assets/infrastructure-dashboard.json
Database Dashboard
Key Panels:
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries
Reference: See assets/database-dashboard.json
Application Dashboard
Key Panels:
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length
Best Practices
- Start with templates (Grafana community dashboards)
- Use consistent naming for panels and variables
- Group related metrics in rows
- Set appropriate time ranges (default: Last 6 hours)
- Use variables for flexibility
- Add panel descriptions for context
- Configure units correctly
- Set meaningful thresholds for colors
- Use consistent colors across dashboards
- Test with different time ranges
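Several of these practices can be checked mechanically before a dashboard is committed. The sketch below covers three of them (titles, descriptions, units). `lint_panels` is a hypothetical helper, and it assumes units live under `fieldConfig.defaults.unit`, which applies to newer panel types and not to the older `graph` panel's `yaxes`.

```python
def lint_panels(dashboard):
    """Return (panel_title, finding) pairs for a dashboard dict."""
    findings = []
    for panel in dashboard.get("panels", []):
        title = panel.get("title") or "<untitled>"
        if not panel.get("title"):
            findings.append((title, "missing title"))
        if not panel.get("description"):
            findings.append((title, "missing description"))
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            findings.append((title, "no unit configured"))
    return findings


if __name__ == "__main__":
    example = {
        "panels": [
            {
                "title": "Request Rate",
                "description": "Requests per second by service",
                "fieldConfig": {"defaults": {"unit": "reqps"}},
            },
            {"title": "P95 Latency"},
        ]
    }
    for title, finding in lint_panels(example):
        print(f"{title}: {finding}")
```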
Dashboard as Code
Terraform Provisioning
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
Ansible Provisioning
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
Reference Files
- assets/api-dashboard.json - API monitoring dashboard
- assets/infrastructure-dashboard.json - Infrastructure dashboard
- assets/database-dashboard.json - Database monitoring dashboard
- references/dashboard-design.md - Dashboard design guide
Related Skills
- prometheus-configuration - For metric collection
- slo-implementation - For SLO dashboards