Lead AI-HPC Cluster Engineer - MLOps Job in Santa Clara


Company Size: 5k+

Company Type: Product

Exp Level: Senior

Job Type: Full-Time

Language: English

Visa sponsorship

Requirements

Must:

- I hold a Bachelor’s degree in Computer Science, Electrical Engineering, or a related discipline, or possess equivalent experience.
- I have a minimum of 6 years of experience in designing and managing large-scale compute infrastructure.
- I am experienced with AI/HPC job schedulers and orchestrators, such as Slurm, Kubernetes, or LSF, as well as AI/HPC workflows incorporating MPI and NCCL.
- I possess proficiency in using Linux environments, specifically CentOS/RHEL or Ubuntu distributions, and have a robust understanding of container technologies such as Enroot, Docker, and Podman.
- I am skilled in at least one scripting language (e.g., Python or Bash) and one compiled language (e.g., Golang, Rust, C, or C++).
- I have experience with analyzing and optimizing performance across various AI/HPC workloads, along with strong problem-solving capabilities to assess intricate systems, pinpoint bottlenecks, and execute scalable solutions.
- I demonstrate exceptional communication and teamwork abilities, effectively collaborating with diverse teams and individuals.
- I am passionate about continuous learning and staying updated with the latest technologies and methods within the HPC and AI/ML infrastructure domains.

Technologies

CUDA

InfiniBand

Machine Learning

Podman

Prometheus

PyTorch

Python

RDMA

Responsibilities

- I will provide leadership and strategic guidance in managing extensive HPC systems, which includes deploying compute, networking, and storage resources.
- I will develop and enhance our ecosystem surrounding GPU-accelerated computing, focusing on creating scalable automation solutions.
- I will cultivate and maintain strong relationships with customers and cross-functional teams to consistently support the clusters and adapt to evolving user needs.
- I will assist researchers in executing their workloads by conducting performance analysis and optimizations.
- I will perform root cause analyses and propose corrective measures, identifying and resolving potential issues before they arise.
- I will create innovative tools to enhance researchers' productivity, streamline troubleshooting, and improve software performance at scale.

Description

NVIDIA has been at the forefront of transforming computer graphics, PC gaming, and accelerated computing for over 25 years, backed by remarkable technology and exceptional talent. We are currently exploring the vast possibilities of AI to create the next generation of computing, in which our GPUs will serve as the brains for computers, robots, and autonomous vehicles capable of comprehending their surroundings.

Being an NVIDIAN means being part of a diverse and supportive environment that inspires everyone to deliver their best work. We invite you to join our team and discover how you can have a lasting influence on the world. Our competitive salaries and benefits reflect the contributions of our skilled employees, driving the rapid growth of our esteemed engineering team. If you're a tech enthusiast, we encourage you to apply!

Base salaries will be determined by location, experience, and the compensation of employees in similar roles, with ranges set between $184,000 - $287,500 for Level 4 and $224,000 - $356,500 for Level 5. Additionally, you will be eligible for equity and benefits.

We will be accepting applications for this position until at least September 12, 2025.

At NVIDIA, we are dedicated to promoting a diverse workplace and take pride in being an equal opportunity employer. We highly value diversity among our employees and do not discriminate based on race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.

