auth
docs/OBSERVABILITY.md
GridPilot — Observability & Data Separation Design

Purpose

This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.

Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability

⸻

Core Principle

Domain data and infrastructure data must never share the same storage, lifecycle, or access path.

They serve different purposes, have different risk profiles, and must be handled independently.
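
One way to make this principle checkable is a startup guard that refuses to boot when an observability backend points at the same storage target as the domain database. The sketch below is an illustration only, not GridPilot's actual code: the function names, the example URLs, and the decision to compare host and port are all assumptions.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


def storage_target(url: str) -> Tuple[str, Optional[int]]:
    """Reduce a connection URL to the (host, port) pair it points at."""
    parts = urlsplit(url)
    return (parts.hostname or "", parts.port)


def assert_storage_separated(domain_db_url: str, observability_urls: List[str]) -> None:
    """Raise ValueError if any observability backend shares the domain DB's host and port.

    Hypothetical guard: comparing host and port is an assumed definition of
    "same storage"; a real deployment might also compare volumes or clusters.
    """
    domain_target = storage_target(domain_db_url)
    for url in observability_urls:
        if storage_target(url) == domain_target:
            raise ValueError(
                f"observability backend {url!r} shares storage with the domain database"
            )
```

Called once at startup with the configured URLs, for example `assert_storage_separated("postgresql://app@db.internal:5432/gridpilot", ["http://loki.internal:3100"])`, the guard turns an architectural rule into a failure that is visible before any data is written.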

⸻

Data Categories

1. Domain (Business) Data

Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts

Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups

Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory

Access
• Application backend
• Custom Admin UI (primary control surface)

⸻

2. Infrastructure / Observability Data

Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals

Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic

Storage
• Dedicated observability stack
• Completely separate from domain database

Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic

⸻

Observability Architecture (Self-Hosted)

GridPilot uses a single consolidated self-hosted observability stack.

Components
• Grafana
  • Central UI
  • Dashboards
  • Alerting
  • Single login
• Loki
  • Log aggregation
  • Append-only
  • Schema-less
  • Optimized for high-volume logs
• Prometheus
  • Metrics collection
  • Time-series data
  • Alert rules
• Tempo (optional)
  • Distributed traces
  • Request flow analysis

All components are accessed exclusively through Grafana.
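
Prometheus collects metrics by scraping a plain-text endpoint that the application exposes. As a rough sketch of what such a payload looks like, the functions below keep counters in memory and render them in the Prometheus text exposition format using only the standard library. The metric name `gridpilot_background_job_failures_total` and the `job` label are hypothetical, and a real service would more likely rely on an official client library than on hand-rolled rendering.

```python
from typing import Dict, Tuple

# (metric name, sorted label pairs) -> current counter value
Counters = Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float]


def increment(counters: Counters, name: str, labels: Dict[str, str], amount: float = 1.0) -> None:
    """Add `amount` to the counter identified by its name and label set."""
    key = (name, tuple(sorted(labels.items())))
    counters[key] = counters.get(key, 0.0) + amount


def escape_label(value: str) -> str:
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(counters: Counters, help_text: Dict[str, str]) -> str:
    """Render all counters as text, with one HELP/TYPE header per metric name."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            seen.add(name)
            lines.append(f"# HELP {name} {help_text.get(name, '')}")
            lines.append(f"# TYPE {name} counter")
        label_str = ",".join(f'{key}="{escape_label(val)}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"
```

Because the application only exposes these values and never reads them back, the sketch stays within the rule that observability data is not queried by application logic.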

⸻

Responsibility Split

Custom Admin (GridPilot)

Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views

Never handles:
• raw logs
• metrics
• system traces

⸻

Observability Stack (Grafana)

Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts

Never handles:
• business decisions
• user-visible data
• domain state

⸻

Logging & Metrics Policy

What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events

What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam
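
The "not logged" rules are easier to uphold when they are enforced in one place rather than at every call site. The sketch below shows one possible approach using a standard-library `logging.Filter` that masks sensitive values before a record is emitted. The `context` attribute, the `SENSITIVE_KEYS` deny-list, and the class name are assumptions made for illustration; they are not part of GridPilot.

```python
import logging

# Hypothetical deny-list; the real set of sensitive keys is an assumption.
SENSITIVE_KEYS = frozenset({"email", "password", "token", "iban"})


class RedactingFilter(logging.Filter):
    """Mask sensitive values in a record's `context` dict before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            record.context = {
                key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
                for key, value in context.items()
            }
        return True  # never drop the record, only sanitize it
```

A logger would attach the filter once with `logger.addFilter(RedactingFilter())` and pass structured fields through `extra={"context": {...}}`. A deny-list only catches the keys it knows about, so an allow-list of permitted fields would be the stricter variant.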

⸻

Alerting Philosophy

Alerts are:
• minimal
• actionable
• rare

Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase

No vanity alerts.
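
Keeping alerts rare mostly comes down to requiring that a condition persists before anyone is notified. In the stack described here that logic would normally live in Prometheus alert rules (their `for` clause serves this purpose), not in application code. The function below only sketches the idea in plain Python. Its name, the sample values, and the threshold are illustrative assumptions.

```python
from typing import Sequence


def sustained_breach(error_rates: Sequence[float], threshold: float, min_windows: int) -> bool:
    """Return True only if the most recent `min_windows` samples all exceed `threshold`."""
    if min_windows <= 0:
        raise ValueError("min_windows must be positive")
    if len(error_rates) < min_windows:
        return False  # not enough history to call the increase sustained
    return all(rate > threshold for rate in error_rates[-min_windows:])
```

With a threshold of 0.05 and three required windows, a single spike such as `[0.01, 0.20, 0.01]` does not fire, while `[0.01, 0.08, 0.09, 0.07]` does.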

⸻

Rationale

This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable

The system favors clarity over completeness and stability over tooling hype.

⸻

Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle

This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.