Mission
Own the AI inference and training infrastructure that powers Victoria, BrightFlow ML, and Albright Studios generative pipelines. Make NVIDIA Triton, NeMo, and Ollama reliable and fast on GPU K8s.
Responsibilities
- Own the Triton inference server cluster on GPU K8s nodes
- Maintain the NeMo Curator + fine-tuning pipeline for in-house LLMs
- Operate Ollama and other lightweight inference engines for dev workflows
- Build model-serving APIs with autoscaling, quotas, and SLOs
- Lead GPU capacity planning across A6000, H100, and consumer-tier cards
- Partner with ML engineers on model packaging, optimization, and deployment
- Establish observability for tokens/sec, GPU utilization, and model latency
Required qualifications
- 8+ years of infrastructure engineering; 3+ years of ML serving
- Hands-on experience with NVIDIA Triton, TensorRT, or vLLM
- Strong K8s background; comfortable writing operators and CRDs
- Deep expertise in Python and at least one systems language (Go, Rust, or C++)
Preferred qualifications
- Experience with NeMo Curator or modern LLM fine-tuning pipelines
- Background managing multi-GPU training (FSDP, DeepSpeed, Megatron)
- Open-source contributions to ML infrastructure tooling
- Prior Principal or Staff Engineer title at a hyperscaler or AI lab