Mission
Own production reliability across Albright services. Establish the SRE practice — SLOs, on-call, incident response, and the runbooks that let us sleep at night.
Responsibilities
- Define and maintain SLOs, error budgets, and on-call rotations
- Operate the observability stack — Prometheus, Grafana, Loki, OpenTelemetry
- Lead incident response and post-incident review
- Build runbooks and automate toil
- Partner with platform engineering on K8s, CI/CD, and infra
- Establish disaster-recovery and backup posture
- Improve mean-time-to-detect (MTTD) and mean-time-to-recover (MTTR) metrics
Required qualifications
- 5+ years of SRE or DevOps engineering experience
- Production K8s and Linux administration experience
- Strong experience with the observability stack (Prometheus, Grafana, OpenTelemetry)
- Comfortable writing automation in Python or Go
Preferred qualifications
- Experience at a fintech or trading firm with strict reliability requirements
- On-call leadership for a 24x7 service
- Incident Command System (ICS) or comparable training
- Open-source contributions