Site Reliability Engineer Job at Berkley Hunt, San Jose, CA

  • Berkley Hunt
  • San Jose, CA

Job Description

Senior Site Reliability Engineer (GPU Compute) | Hybrid – Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior/Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role:

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale, collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you're excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities:

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements:

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have:

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.
