GridPilot — Observability & Data Separation Design

Purpose

This document defines how GridPilot separates business-critical domain data from infrastructure / observability data, while keeping operations simple, self-hosted, and cognitively manageable.

Goals:
• protect domain data at all costs
• avoid tool sprawl
• keep one clear mental model for operations
• enable debugging without polluting business logic
• ensure long-term maintainability

⸻

Core Principle

Domain data and infrastructure data must never share the same storage, lifecycle, or access path.

They serve different purposes, have different risk profiles, and must be handled independently.

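One way to make this rule checkable is to validate configuration at startup so the two stores cannot point at the same place or share a retention policy. The following is a minimal sketch under stated assumptions: `StoreConfig`, `assert_separated`, and the host names are hypothetical illustrations, not part of GridPilot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreConfig:
    """Connection details for one data store (hypothetical shape)."""
    name: str
    host: str
    port: int
    retention_days: Optional[int]  # None means "keep forever"


def assert_separated(domain: StoreConfig, observability: StoreConfig) -> None:
    """Raise ValueError if the two stores share an endpoint or a lifecycle."""
    if (domain.host, domain.port) == (observability.host, observability.port):
        raise ValueError("domain and observability data share a storage endpoint")
    if domain.retention_days is not None:
        # Domain data must never be lost, so it gets no expiry.
        raise ValueError("domain data must not have an expiry")
    if observability.retention_days is None:
        # Observability data is ephemeral by design, so it must rotate.
        raise ValueError("observability data must have a finite retention")


# Example values are assumptions chosen for illustration only.
domain_store = StoreConfig("postgres", "db.internal", 5432, None)
observability_store = StoreConfig("loki", "obs.internal", 3100, 14)
assert_separated(domain_store, observability_store)  # passes silently
```

A check like this only covers storage and lifecycle; the access-path half of the rule still has to be enforced through network policy and code review.
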
⸻

Data Categories

1. Domain (Business) Data

Includes
• users
• leagues
• seasons
• races
• results
• penalties
• escrow balances
• sponsorship contracts
• payments & payouts

Characteristics
• legally relevant
• trust-critical
• user-facing
• must never be lost
• requires strict migrations and backups

Storage
• Relational database (PostgreSQL)
• Strong consistency (ACID)
• Backups and disaster recovery mandatory

Access
• Application backend
• Custom Admin UI (primary control surface)

⸻

2. Infrastructure / Observability Data

Includes
• application logs
• error traces
• metrics (latency, throughput, failures)
• background job status
• system health signals

Characteristics
• high volume
• ephemeral by design
• not user-facing
• safe to rotate or delete
• supports debugging, not business logic

Storage
• Dedicated observability stack
• Completely separate from domain database

Access
• Grafana UI only
• Never exposed to users
• Never queried by application logic

⸻

Observability Architecture (Self-Hosted)

GridPilot uses a single consolidated self-hosted observability stack.

Components
• Grafana
    • Central UI
    • Dashboards
    • Alerting
    • Single login
• Loki
    • Log aggregation
    • Append-only
    • Schema-less
    • Optimized for high-volume logs
• Prometheus
    • Metrics collection
    • Time-series data
    • Alert rules
• Tempo (optional)
    • Distributed traces
    • Request flow analysis

All components are accessed exclusively through Grafana.

⸻

Responsibility Split

Custom Admin (GridPilot)

Handles:
• business workflows
• escrow state visibility
• payment events
• league integrity checks
• moderation actions
• audit views

Never handles:
• raw logs
• metrics
• system traces

⸻

Observability Stack (Grafana)

Handles:
• system health
• performance bottlenecks
• error rates
• background job failures
• infrastructure alerts

Never handles:
• business decisions
• user-visible data
• domain state

⸻

Logging & Metrics Policy

What is logged
• errors and exceptions
• payment and escrow failures
• background job failures
• unexpected external API responses
• startup and shutdown events

What is not logged
• user personal data
• credentials
• domain state snapshots
• high-frequency debug spam

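The "not logged" rules for personal data and credentials can be backed by a safety net in the logging pipeline itself. Below is a minimal sketch using Python's standard `logging` module. The patterns, the `redact` helper, and the `RedactingFilter` class are hypothetical and deliberately simple; a real deployment would need patterns tuned to its own data and would still rely on developers not logging sensitive values in the first place.

```python
import logging
import re

# Hypothetical patterns for illustration; they will not catch every case.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET = re.compile(r"(?i)\b(password|token|secret|api_key)=\S+")


def redact(text: str) -> str:
    """Replace e-mail addresses and key=value secrets with placeholders."""
    text = EMAIL.sub("[redacted-email]", text)
    return SECRET.sub(lambda m: f"{m.group(1)}=[redacted]", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record's message so sensitive values never reach a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format %-style args first so values passed as arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record; only its text is changed


# Attach the filter to a handler so every record it emits is scrubbed.
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger = logging.getLogger("gridpilot.example")  # logger name is an assumption
logger.addHandler(handler)
```

Attaching the filter to the handler rather than to a single logger means records propagated from child loggers pass through it as well.
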
⸻

Alerting Philosophy

Alerts are:
• minimal
• actionable
• rare

Examples:
• payment failure spike
• escrow release delay
• background jobs failing repeatedly
• sustained error rate increase

No vanity alerts.

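Keeping alerts rare usually means firing on sustained conditions rather than single spikes. The sketch below illustrates that idea for the "sustained error rate increase" example. The function name, the threshold, and the window count are assumptions chosen for illustration; in the actual stack this logic would more likely live in a Prometheus alert rule, whose `for` clause serves the same purpose.

```python
from typing import Sequence


def sustained_breach(error_rates: Sequence[float], threshold: float, windows: int) -> bool:
    """Return True only if the most recent `windows` samples all exceed `threshold`."""
    if windows <= 0:
        raise ValueError("windows must be positive")
    if len(error_rates) < windows:
        return False  # not enough history to call the condition sustained
    return all(rate > threshold for rate in error_rates[-windows:])


# One error-rate sample per evaluation interval (values are made up).
single_spike = [0.01, 0.20, 0.01]
steady_rise = [0.01, 0.08, 0.09, 0.07]

print(sustained_breach(single_spike, threshold=0.05, windows=3))  # False
print(sustained_breach(steady_rise, threshold=0.05, windows=3))   # True
```

A single spike does not page anyone, while three consecutive intervals above the threshold do, which keeps the alert actionable.
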
⸻

Rationale

This separation ensures:
• domain data remains clean and safe
• observability data can scale freely
• infra failures never corrupt business data
• operational complexity stays manageable

The system favors clarity over completeness and stability over tooling hype.

⸻

Summary
• Domain data lives in PostgreSQL
• Observability data lives in a dedicated stack
• Grafana is the single infra control surface
• Custom Admin is the single business control surface
• No shared storage, no shared lifecycle

This design minimizes risk, cognitive load, and operational overhead while remaining fully extensible.