Goals
π― Primary Goalsβ
The main objective is to design and implement a Highly Available Centralized Logging System for our on-premise Kubernetes infrastructure that ensures:
- Consistent log collection across multiple services and nodes.
- High Availability (HA) during pod, node, or network failures.
- Scalability, both in terms of log volume and system components.
- Structured logs with rich context for better observability.
- Search and visualization capabilities through Grafana/Kibana.
- Secure and auditable access for internal teams.
- Storage and retention using on-premise solutions like MinIO or NFS.
β Success Criteriaβ
A solution will be considered successful if it meets the following:
- β Logs from Node.js, Go, Python services are collected consistently.
- β System continues operating when one or more Kubernetes nodes go down.
- β
Structured JSON logs with fields like
request_id
,timestamp
,service
,severity
are supported. - β Query performance remains acceptable under load.
- β Helm-based deployment is available and repeatable for on-prem K8s.
- β No single point of failure exists in ingestion or query pipeline.
π Acceptance Criteriaβ
Criteria | Requirement |
---|---|
π Logging Stack | Support for Loki, ELK, or Graylog |
βΈοΈ Kubernetes Native | Components deployed using Helm, StatefulSet, DaemonSet, Ingress |
π¦ Storage | Compatible with MinIO, NFS, or CephFS |
π HA & Scalability | Ingesters and query services must scale horizontally |
π Log Format | Must support JSON format with standardized fields |
π Search & Alerting | Must integrate with Grafana or Kibana for dashboards and alert rules |
π Security | Access control via RBAC or reverse proxy auth |
π Persistence | Each stateful component has persistent volumes configured |
π§ͺ Fault Tolerance | System tolerates node or pod crashes without data loss |
π Out of Scopeβ
- Managed logging solutions like Datadog, CloudWatch, or GCP Logging
- Multi-cloud or hybrid-cloud scenarios
- Full SIEM integration (only optional)
π§βπ» Stakeholdersβ
- Backend Team β for log visibility and debugging
- DevOps / Infra Team β for deployment, scaling, and resilience
- Security / Compliance β for audit logs and access control