gridpilot.gg/docs/OBSERVABILITY.md
2025-12-31 19:55:43 +01:00

GridPilot — Observability & Data Separation Design

Purpose

This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.

Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability

Core Principle

Domain data and infrastructure data must never share the same storage, lifecycle, or access path.

They serve different purposes, have different risk profiles, and must be handled independently.

Data Categories

  1. Domain (Business) Data

Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts

Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups

Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory

Access
• Application backend
• Custom Admin UI (primary control surface)

  2. Infrastructure / Observability Data

Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals

Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic

Storage
• Dedicated observability stack
• Completely separate from domain database

Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic
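The two categories above can be summarized as a small policy table that a check can enforce. The following is a minimal sketch, not GridPilot's actual code: the names `CategoryPolicy`, `separation_violations`, and the string identifiers for stores and access paths are hypothetical and chosen only to mirror the lists above.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class CategoryPolicy:
    """Hypothetical description of one data category."""
    name: str
    store: str                    # where the data lives
    safe_to_delete: bool          # may it be rotated or dropped?
    access_paths: frozenset[str]  # who is allowed to read it


# Values mirror the two category descriptions; identifiers are illustrative.
DOMAIN = CategoryPolicy(
    name="domain",
    store="postgresql",
    safe_to_delete=False,
    access_paths=frozenset({"application-backend", "custom-admin"}),
)
OBSERVABILITY = CategoryPolicy(
    name="observability",
    store="observability-stack",
    safe_to_delete=True,
    access_paths=frozenset({"grafana"}),
)


def separation_violations(policies: list[CategoryPolicy]) -> list[str]:
    """Return a message for every pair that shares a store or an access path."""
    violations = []
    for a, b in combinations(policies, 2):
        if a.store == b.store:
            violations.append(f"{a.name} and {b.name} share store {a.store!r}")
        shared = a.access_paths & b.access_paths
        if shared:
            violations.append(
                f"{a.name} and {b.name} share access paths {sorted(shared)}"
            )
    return violations
```

Calling `separation_violations([DOMAIN, OBSERVABILITY])` returns an empty list. A policy that placed logs in PostgreSQL would be reported as sharing a store with the domain category.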

Observability Architecture (Self-Hosted)

GridPilot uses a single consolidated self-hosted observability stack.

Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis

All components are accessed exclusively through Grafana.

Responsibility Split

Custom Admin (GridPilot)

Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views

Never handles:
• raw logs
• metrics
• system traces

Observability Stack (Grafana)

Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts

Never handles:
• business decisions
• user-visible data
• domain state

Logging & Metrics Policy

What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events

What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam
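One way to enforce the "not logged" list at a single choke point is a filter on the log handler. The sketch below uses Python's standard `logging` module and is illustrative only: the key names in `SENSITIVE_KEYS`, the `context` field, and the logger name `gridpilot.example` are assumptions, since this document does not specify the backend language or log schema.

```python
import io
import json
import logging

# Hypothetical deny-list; the real set of sensitive keys is not defined here.
SENSITIVE_KEYS = frozenset({"email", "password", "token", "full_name"})


class PolicyFilter(logging.Filter):
    """Drop records below INFO and mask sensitive keys in `record.context`."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno < logging.INFO:
            return False  # suppress high-frequency debug output
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = {
                key: "[REDACTED]" if key in SENSITIVE_KEYS else value
                for key, value in context.items()
            }
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line, a shape log aggregators can ingest."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger whose only handler applies the policy filter."""
    logger = logging.getLogger("gridpilot.example")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(PolicyFilter())
    logger.addHandler(handler)
    return logger


# Usage: the error is written with the email masked; the debug line is dropped.
buf = io.StringIO()
log = build_logger(buf)
log.error("escrow release failed",
          extra={"context": {"league_id": 42, "email": "driver@example.com"}})
log.debug("tick")
print(buf.getvalue(), end="")
```

Placing the filter on the handler rather than at each call site means a forgotten redaction in application code does not leak into the log store. A deny-list is the simplest form of this idea; an allow-list of permitted keys would be stricter.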

Alerting Philosophy

Alerts are:
• minimal
• actionable
• rare

Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase
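The "sustained error rate increase" example shows how an alert can stay rare and actionable: it fires only when a condition holds across several consecutive evaluation windows, not on a single spike. The sketch below illustrates that logic in plain Python. The class name, threshold, and window count are hypothetical. In the stack described here, the same idea would more likely be expressed as a Prometheus alert rule with a `for` duration.

```python
from collections import deque


class SustainedRateAlert:
    """Fire only when the error rate exceeds a threshold for N windows in a row."""

    def __init__(self, threshold: float, windows_required: int) -> None:
        self.threshold = threshold
        # Keeps the most recent N breach flags; older ones fall off automatically.
        self.recent = deque(maxlen=windows_required)

    def observe(self, errors: int, total: int) -> bool:
        """Record one evaluation window and return True if the alert should fire."""
        rate = errors / total if total else 0.0
        self.recent.append(rate > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)
```

With `SustainedRateAlert(threshold=0.05, windows_required=3)`, a single window at a 10% error rate does not fire. Three consecutive windows above 5% do, and any healthy window in between resets the streak.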

No vanity alerts.

Rationale

This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable

The system favors clarity over completeness and stability over tooling hype.

Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle

This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.