Senior SRE Engineer at NVIDIA

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is seeking a passionate, motivated, and technical Engineer to join its multifaceted and fast-paced Infrastructure, Planning, and Processes organization as a Senior SRE Engineer. You will own and scale our internal CI as a Service platform. This platform includes the shared GitLab CI and GitHub Actions infrastructure used daily by thousands of engineers. You will manage this platform like a product: highly available, self-service, observable, and elastic to handle build and test workloads across the company. The position is part of a fast-paced team that develops and maintains complex build and test environments. These environments support various hardware platforms, including NVIDIA GPUs and Tegra Processors, as well as multiple operating systems like Windows, Linux, and Android. The team collaborates with other NVIDIA Software units such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, Robotics, and Autonomous cars to meet their infrastructure and system needs.

What you'll be doing:

Develop, operate, and scale a multi-tenant CI platform built on GitLab CI and GitHub Actions, encompassing runner fleets, shared caches, artifact storage, and secrets brokering.
Own the underlying Kubernetes substrate end-to-end. This includes cluster lifecycle, upgrades, and autoscaling. Manage node pools for GPU, CPU, and ARM workloads. Handle network and storage policy. Operate the controllers and operators that schedule runner pods on demand.
Drive reliability and capacity engineering: SLOs and error budgets for queue time, job success, and runner availability; on-call, incident response, postmortems, and structural fixes that keep toil flat as usage grows.
Build the self-service layer (pipeline templates, reusable workflows, golden images, policy-as-code, and guardrails) so product teams onboard in hours, not weeks, with secure-by-default pipelines.
Improve developer experience continuously: faster cold-starts, smarter caching, hermetic builds, test sharding and flakiness reduction, and deep observability into pipeline performance and cost per team.

What we need to see:

5+ years in SRE/platform roles with strong fundamentals: SLO/SLI design, incident command, capacity planning, performance tuning, and production Linux administration at scale.
Deep Kubernetes administration experience: CRDs and operators, HPA/VPA/cluster autoscaling, ingress, service mesh, RBAC, network policies, and storage classes, along with deep problem-solving skills.
Hands-on expertise with GitLab CI and GitHub Actions at scale: runner architecture, executor tuning, self-hosted runner controllers (ARC, GitLab Runner Helm chart), cache and artifact strategy, and DAG-based pipeline development, or equivalent experience.
Strong scripting and automation skills in Python, Go, Bash, or equivalent. You should have production experience with IaC and configuration management tools like Terraform, Helm, and Ansible. Experience with GitOps tools such as Argo CD and Flux is also required.
BS/MS in CS or equivalent experience. Experience building observability with tools like Prometheus, Grafana, Loki/ELK, OpenTelemetry, or similar products. You have shipped platforms that other engineers enjoy using.

Ways to stand out from the crowd:

Strong understanding of containerization and microservices architecture. Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS) & Certified Kubernetes Application Developer (CKAD) preferred.
Built or extended the CI control plane itself: custom runner schedulers, autoscaling, webhook routers, or pipeline orchestration on top of GitLab/GitHub APIs.
Ability to thrive in a multitasking environment with continuously evolving priorities.
Ability to break complex problems down into simple subproblems and reuse available solutions to solve most of them. Ability to build simple systems that work efficiently without needing much support.
Prior experience with a large-scale operations team. Experience using and improving data centers. Background in computer algorithms and the ability to choose the most suitable algorithms to meet scaling challenges.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 148,000 USD - 235,750 USD for Level 3, and 176,000 USD - 276,000 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 6, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
