gridpilot.gg/docs/OBSERVABILITY.md at 6154d544356806217f3bf477edf54c10bdab2d78

Marc Mintel 167e82a52b auth

2025-12-31 19:55:43 +01:00

GridPilot — Observability & Data Separation Design

Purpose

This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.

Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability

⸻

Core Principle

Domain data and infrastructure data must never share the same storage, lifecycle, or access path.

They serve different purposes, have different risk profiles, and must be handled independently.

⸻

Data Categories

Domain (Business) Data

Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts

Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups

Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory

Access
• Application backend
• Custom Admin UI (primary control surface)

⸻

Infrastructure / Observability Data

Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals

Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic

Storage
• Dedicated observability stack
• Completely separate from domain database

Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic
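
The storage rule for both categories can be checked mechanically when the application starts. The following is a minimal sketch, not GridPilot's actual configuration: the `StorageEndpoints` class, its field names, and the example URLs are all hypothetical, and comparing only host and port is a simplifying assumption about what "same storage" means.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class StorageEndpoints:
    """Hypothetical endpoint settings; names are illustrative only."""
    domain_db_url: str                    # PostgreSQL holding business data
    observability_urls: Tuple[str, ...]   # e.g. Loki and Prometheus endpoints


def _location(url: str) -> Tuple[str, Optional[int]]:
    """Reduce a URL to (host, port) so two endpoints can be compared."""
    parts = urlsplit(url)
    return (parts.hostname or "", parts.port)


def assert_separated(cfg: StorageEndpoints) -> None:
    """Raise ValueError if an observability endpoint points at the domain database."""
    domain = _location(cfg.domain_db_url)
    for url in cfg.observability_urls:
        if _location(url) == domain:
            raise ValueError(
                f"observability endpoint {url!r} shares storage with the domain database"
            )
```

Calling `assert_separated` once during startup turns a misconfiguration into an immediate failure rather than a slow mixing of the two data categories.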

⸻

Observability Architecture (Self-Hosted)

GridPilot uses a single consolidated self-hosted observability stack.

Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis

All components are accessed exclusively through Grafana.
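
On the application side, one common way to feed a schema-less log aggregator is to write one JSON object per line to stdout and let a separate collection agent forward it. The sketch below assumes that setup; the agent, the `JsonLineFormatter` class name, and the chosen field names are assumptions, not part of this design.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # Attach the formatted traceback so error traces stay with the log line.
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_stdout_handler() -> logging.Handler:
    """Create a handler that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    return handler
```

Because the application only writes to stdout, it needs no knowledge of where logs end up, which is consistent with application logic never querying the observability stack.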

⸻

Responsibility Split

Custom Admin (GridPilot)

Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views

Never handles:
• raw logs
• metrics
• system traces

⸻

Observability Stack (Grafana)

Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts

Never handles:
• business decisions
• user-visible data
• domain state

⸻

Logging & Metrics Policy

What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events

What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam
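
Part of this policy can be enforced in code rather than by convention. The following is a minimal sketch using Python's standard `logging.Filter`; the `PolicyFilter` name and the `SENSITIVE_KEYS` list are hypothetical, and the filter only masks values attached through `extra=`, so it does not scan the message text itself.

```python
import logging

# Hypothetical field names; the real list depends on what the application
# attaches to its log records.
SENSITIVE_KEYS = frozenset({"email", "password", "token", "iban"})


class PolicyFilter(logging.Filter):
    """Drop debug-level records and mask sensitive values passed via `extra=`."""

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno < logging.INFO:
            return False  # high-frequency debug output is not shipped
        for key in SENSITIVE_KEYS:
            if key in record.__dict__:
                setattr(record, key, "[REDACTED]")
        return True


def install_policy(handler: logging.Handler) -> None:
    """Attach the filter to a handler so it applies to every record the handler emits."""
    handler.addFilter(PolicyFilter())
```

Attaching the filter to the handler rather than to a single logger means records propagated from child loggers pass through it as well.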

⸻

Alerting Philosophy

Alerts are:
• minimal
• actionable
• rare

Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase

No vanity alerts.
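
To illustrate what "sustained" means in practice, the sketch below fires only after the error rate has exceeded a threshold for several consecutive evaluation windows, so a single bad window does not page anyone. It illustrates the logic only: the `SustainedRateAlert` class, the threshold, and the window count are hypothetical. In this stack the real rule would live in Prometheus, where the `for` clause of an alerting rule serves a similar purpose.

```python
from collections import deque


class SustainedRateAlert:
    """Fire only when the error rate stays above a threshold for N consecutive windows."""

    def __init__(self, threshold: float, windows: int) -> None:
        self.threshold = threshold
        # Keep only the most recent `windows` breach flags.
        self.recent = deque(maxlen=windows)

    def observe(self, errors: int, requests: int) -> bool:
        """Record one evaluation window and report whether the alert should fire."""
        rate = errors / requests if requests else 0.0
        self.recent.append(rate > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Example: a 5% threshold that must be breached in 3 windows in a row.
alert = SustainedRateAlert(threshold=0.05, windows=3)
```

Requiring consecutive breaches trades a short detection delay for fewer false alarms, which matches the goal of keeping alerts rare and actionable.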

⸻

Rationale

This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable

The system favors clarity over completeness and stability over tooling hype.

⸻

Summary

• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle

This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.
