GridPilot — Observability & Data Separation Design
Purpose
This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.
Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability
⸻
Core Principle
Domain data and infrastructure data must never share the same storage, lifecycle, or access path.
They serve different purposes, have different risk profiles, and must be handled independently.
⸻
Data Categories
- Domain (Business) Data
Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts
Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups
Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory
Access
• Application backend
• Custom Admin UI (primary control surface)
⸻
- Infrastructure / Observability Data
Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals
Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic
Storage
• Dedicated observability stack
• Completely separate from domain database
Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic
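The storage rule above can be checked mechanically at startup. The following is a minimal sketch, assuming the deployment exposes its endpoints as URLs; `StorageConfig`, its field names, and `check_separation` are hypothetical and not part of GridPilot.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical deployment settings; field names are illustrative only."""
    domain_db_url: str   # PostgreSQL instance holding domain data
    loki_url: str        # log aggregation endpoint
    prometheus_url: str  # metrics endpoint


def _endpoint(url: str) -> Tuple[str, Optional[int]]:
    """Reduce a URL to the (host, port) pair it points at."""
    parts = urlsplit(url)
    return (parts.hostname or "", parts.port)


def check_separation(cfg: StorageConfig) -> List[str]:
    """Return violations; an empty list means no observability store
    shares a host and port with the domain database."""
    domain = _endpoint(cfg.domain_db_url)
    violations = []
    for name, url in (("loki", cfg.loki_url), ("prometheus", cfg.prometheus_url)):
        if _endpoint(url) == domain:
            violations.append(f"{name} shares an endpoint with the domain database")
    return violations
```

A check like this only catches the most obvious misconfiguration (identical host and port). It does not prove that lifecycles or access paths are separate.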
⸻
Observability Architecture (Self-Hosted)
GridPilot uses a single consolidated self-hosted observability stack.
Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis
All components are accessed exclusively through Grafana.
⸻
Responsibility Split
Custom Admin (GridPilot)
Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views
Never handles:
• raw logs
• metrics
• system traces
⸻
Observability Stack (Grafana)
Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts
Never handles:
• business decisions
• user-visible data
• domain state
⸻
Logging & Metrics Policy
What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events
What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam
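One way to enforce part of this policy in code is a logging filter that runs before any record leaves the process. The sketch below uses Python's standard `logging` module. The `PolicyFilter` class, the `context` attribute, and the key names in `SENSITIVE_KEYS` are assumptions made for illustration, not GridPilot's actual field names.

```python
import logging

# Assumed names of sensitive fields; adjust to the real payloads.
SENSITIVE_KEYS = frozenset({"email", "password", "token", "iban"})


class PolicyFilter(logging.Filter):
    """Drop debug noise and redact sensitive structured fields."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno < logging.INFO:
            return False  # debug-level records are dropped entirely
        # `context` is a hypothetical attribute set via `extra={"context": {...}}`.
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = {
                key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
                for key, value in context.items()
            }
        return True
```

Attached with `logger.addFilter(PolicyFilter())`, a call such as `logger.error("payout failed", extra={"context": {"payout_id": 17, "email": "a@example.com"}})` would keep the payout identifier and replace the email value. The filter only covers structured fields; personal data interpolated into the message string would still need to be avoided at the call site.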
⸻
Alerting Philosophy
Alerts are:
• minimal
• actionable
• rare
Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase
No vanity alerts.
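To illustrate what "sustained" can mean for an alert such as the error rate example, here is a minimal sketch of the condition: the alert fires only when every one of the most recent samples is above the threshold, so a single spike stays silent. The function name, the threshold, and the window count are hypothetical. In the stack described here the equivalent logic would more likely live in a Prometheus alert rule, whose `for` clause expresses the same hold period.

```python
from typing import Sequence


def sustained_breach(error_rates: Sequence[float], threshold: float, windows: int) -> bool:
    """Return True if the last `windows` samples all exceed `threshold`.

    Returns False when there are not yet enough samples to decide.
    """
    if windows <= 0 or len(error_rates) < windows:
        return False
    return all(rate > threshold for rate in error_rates[-windows:])
```

Requiring several consecutive breaches trades a slower first notification for fewer false alarms, which matches the "minimal, actionable, rare" goal.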
⸻
Rationale
This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable
The system favors clarity over completeness and stability over tooling hype.
⸻
Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle
This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.