gridpilot.gg/docs/OBSERVABILITY.md
2025-12-31 19:55:43 +01:00

GridPilot — Observability & Data Separation Design
Purpose
This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.
Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability
Core Principle
Domain data and infrastructure data must never share the same storage, lifecycle, or access path.
They serve different purposes, have different risk profiles, and must be handled independently.
Data Categories
1. Domain (Business) Data
Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts
Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups
Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory
Access
• Application backend
• Custom Admin UI (primary control surface)
2. Infrastructure / Observability Data
Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals
Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic
Storage
• Dedicated observability stack
• Completely separate from domain database
Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic
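The storage rule for the two categories can be sanity-checked at startup. The sketch below is only an illustration: `StorageConfig`, `assert_separated`, and both URLs are hypothetical names and values that do not come from this document. It covers only the "same storage target" case, not lifecycle or access path.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class StorageConfig:
    """Connection settings for one storage backend (hypothetical shape)."""
    name: str
    url: str


def storage_identity(config: StorageConfig) -> Tuple[str, Optional[int], str]:
    """Return (host, port, path) so two configs can be compared."""
    parsed = urlparse(config.url)
    return (parsed.hostname or "", parsed.port, parsed.path)


def assert_separated(domain: StorageConfig, observability: StorageConfig) -> None:
    """Raise ValueError if both configs point at the same host, port, and path."""
    if storage_identity(domain) == storage_identity(observability):
        raise ValueError(
            f"{domain.name} and {observability.name} share the same storage target"
        )


# Illustrative URLs only; real hosts and ports are not specified in this document.
DOMAIN_DB = StorageConfig("domain", "postgresql://db.internal:5432/gridpilot")
LOG_STORE = StorageConfig("observability", "http://loki.internal:3100/loki/api/v1/push")

assert_separated(DOMAIN_DB, LOG_STORE)  # passes: the targets differ
```

Running a check like this once at boot is cheap, and it turns an accidental copy-paste of the domain connection string into a hard failure instead of a silent violation.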
Observability Architecture (Self-Hosted)
GridPilot uses a single, consolidated, self-hosted observability stack.
Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis
Operators access all components exclusively through Grafana.
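On the application side, feeding a schema-less log store mostly means emitting one self-describing line per event. A minimal sketch follows, assuming the backend writes JSON lines to stdout and a separate log shipper forwards them to Loki. This document does not specify the shipping mechanism, and `JsonLineFormatter` and `build_logger` are hypothetical names.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include the traceback only when the record carries exception info.
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout and nowhere else."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Because the application writes only to stdout, it needs no credentials for, or knowledge of, the observability stack, which keeps the access paths separate.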
Responsibility Split
Custom Admin (GridPilot)
Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views
Never handles:
• raw logs
• metrics
• system traces
Observability Stack (Grafana)
Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts
Never handles:
• business decisions
• user-visible data
• domain state
Logging & Metrics Policy
What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events
What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam
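The "not logged" rules are easier to keep when they are enforced in code rather than by convention. The following is a rough sketch using the standard `logging.Filter` hook. The field names in `SENSITIVE_KEYS` and the email pattern are assumptions for illustration, not an exhaustive definition of personal data.

```python
import logging
import re

# Hypothetical field names; adjust to whatever the application actually attaches.
SENSITIVE_KEYS = {"email", "password", "token", "iban"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


class RedactingFilter(logging.Filter):
    """Mask sensitive extras and email addresses before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Mask values attached through the `extra=` argument of logging calls.
        for key in SENSITIVE_KEYS:
            if hasattr(record, key):
                setattr(record, key, "[redacted]")
        # Resolve %-style args first so values passed as arguments are scanned too.
        message = record.getMessage()
        record.msg = EMAIL_PATTERN.sub("[redacted-email]", message)
        record.args = None
        return True  # keep the record; only its content changes
```

Attaching the filter to every handler (`handler.addFilter(RedactingFilter())`) and keeping the production log level at `INFO` or above would also cover the "debug spam" rule. Pattern-based redaction is a safety net and does not replace the habit of not passing personal data to the logger in the first place.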
Alerting Philosophy
Alerts are:
• minimal
• actionable
• rare
Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase
No vanity alerts.
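One way to keep alerts rare and actionable is to fire only when a condition persists across several evaluation windows. In Prometheus this is usually expressed with the `for` field of an alerting rule. The Python below only illustrates the logic, and `SustainedRateAlert`, the 5% threshold, and the window count are all assumed values.

```python
from collections import deque
from typing import Deque


class SustainedRateAlert:
    """Fire only when the error rate exceeds a threshold for N consecutive windows."""

    def __init__(self, threshold: float, windows_required: int) -> None:
        self.threshold = threshold
        # Keep only the most recent `windows_required` breach flags.
        self.recent: Deque[bool] = deque(maxlen=windows_required)

    def observe(self, errors: int, requests: int) -> bool:
        """Record one evaluation window and return True if the alert should fire."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


alert = SustainedRateAlert(threshold=0.05, windows_required=3)
results = [alert.observe(errors, 100) for errors in (10, 2, 8, 9, 7)]
# results == [False, False, False, False, True]
```

The isolated spike in the first window never fires, because the second window falls back below the threshold. The alert triggers only after three consecutive breaches, which matches the "sustained error rate increase" example above.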
Rationale
This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable
The system favors clarity over completeness and stability over tooling hype.
Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle
This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.