GridPilot: Observability & Data Separation Design

Purpose

This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.

Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability

⸻

Core Principle

Domain data and infrastructure data must never share the same storage, lifecycle, or access path.

They serve different purposes, have different risk profiles, and must be handled independently.

⸻

Data Categories

1. Domain (Business) Data

Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts

Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups

Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory

Access
• Application backend
• Custom Admin UI (primary control surface)

⸻

2. Infrastructure / Observability Data

Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals

Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic

Storage
• Dedicated observability stack
• Completely separate from domain database

Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic

⸻

Observability Architecture (Self-Hosted)

GridPilot uses a single consolidated self-hosted observability stack.
Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis

All components are accessed exclusively through Grafana.

⸻

Responsibility Split

Custom Admin (GridPilot)

Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views

Never handles:
• raw logs
• metrics
• system traces

⸻

Observability Stack (Grafana)

Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts

Never handles:
• business decisions
• user-visible data
• domain state

⸻

Logging & Metrics Policy

What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events

What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam

⸻

Alerting Philosophy

Alerts are:
• minimal
• actionable
• rare

Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase

No vanity alerts.

⸻

Rationale

This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable

The system favors clarity over completeness and stability over tooling hype.

⸻

Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle

This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.
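
⸻

Example: Enforcing the Logging Policy (Sketch)

The "What is not logged" rules in the Logging & Metrics Policy (no user personal data, no credentials) could be enforced in code before a record ever reaches the log pipeline. The following is a minimal sketch, not GridPilot's actual implementation. It assumes a Python backend using the standard `logging` module, which this document does not specify. The key names in `SENSITIVE_KEYS`, the `RedactSensitiveFields` class, and the `gridpilot.payments` logger name are all hypothetical.

```python
import logging

# Hypothetical key names: the design document does not list which fields
# carry personal data or credentials, so adjust this set to the real schema.
SENSITIVE_KEYS = frozenset({"email", "password", "token", "iban"})
REDACTED = "[REDACTED]"


def redact_fields(fields: dict) -> dict:
    """Return a copy of `fields` with the values of sensitive keys masked."""
    return {
        key: REDACTED if key in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class RedactSensitiveFields(logging.Filter):
    """Scrub dict-style log arguments; never drops a record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # logging unpacks a single mapping argument into record.args,
        # which is what makes "%(name)s" placeholders work.
        if isinstance(record.args, dict):
            record.args = redact_fields(record.args)
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    # Attaching the filter to the handler covers every logger that uses it.
    handler.addFilter(RedactSensitiveFields())

    logger = logging.getLogger("gridpilot.payments")  # hypothetical name
    logger.addHandler(handler)
    logger.error(
        "payout failed for %(email)s in league %(league_id)s",
        {"email": "driver@example.com", "league_id": 7},
    )
```

This sketch only scrubs values passed as a named-argument dict. Messages built with f-strings or positional arguments pass through unchanged, so a real deployment would likely pair a filter like this with a convention that sensitive values are only ever passed as named fields, or not passed to the logger at all.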