
ML Infrastructure Engineer

Gridmatic
Full-time
On-site
Cupertino, CA
Software

The Company

Gridmatic Inc. is a high-growth startup with offices in the Bay Area and Houston that is accelerating the clean energy transition by applying our expertise in data, machine learning, and energy to power markets. We are the rare startup that has multiple years of profitability without raising venture capital. At Gridmatic, we foster a collaborative and inclusive culture where learning and growth are constant. We move quickly, solve problems with integrity, and balance environmental responsibility with data-driven excellence.


We are looking for a Machine Learning Infrastructure Engineer to accelerate the decarbonization of the electricity system by building and optimizing the backbone of our ML platform. The ideal candidate will have solid expertise in machine learning, distributed systems, and GPU-based training, and will design scalable, high-performance infrastructure for training, inference, and evaluation. They will push the boundaries of throughput and efficiency on large-scale time-series and weather datasets, while shaping the long-term vision of our ML platform and generalizing solutions for broader use. A successful candidate will thrive on continuous learning across engineering, ML systems, and energy markets, while contributing to a collaborative, mission-driven team. Strong deep learning fundamentals and strong software engineering skills are both required.



You will:
  • Own a significant piece of our ML platform while rapidly building and iterating scalable, robust distributed infrastructure for ML training, inference, and evaluation on large-scale time-series and weather datasets.
  • Optimize throughput and cost by supporting model training and deployment across multiple clusters and clouds.
  • Improve the efficiency of machine learning models and other workloads by optimizing latency, throughput, and memory consumption. This involves pushing the boundaries of current hardware capabilities through techniques like GPU performance engineering.
  • Help define the long-term vision for Gridmatic’s ML platform.
  • Play a key role in mentoring junior engineers and interns, contributing to a collaborative, innovative, and growth-oriented team culture.


You might be a good fit if you are:
  • A strong engineer with 3+ years of experience who is committed to technical excellence. You possess a deep understanding of the codebases you work in and write readable, scalable code.
  • Experienced in researching and implementing deep learning models.
  • Experienced in distributed training and inference of large models on GPU clusters, utilizing core libraries and frameworks such as PyTorch, PyTorch Lightning, and Ray.
  • Comfortable with large-scale data storage infrastructure and formats, e.g. Zarr, SQL, and feature stores.
  • A self-starter with a strong sense of independence and ownership, and the capability to engineer large, robust systems from the initial design and conceptualization to productionization.
  • A mission-driven individual who is enthusiastic about working toward a renewable grid and diving into the intersection of ML and energy. No prior energy experience required, but curiosity and a willingness to learn are must-haves!


Nice-to-haves:
  • End-to-end proficiency in building, maintaining, and debugging cluster infrastructure, utilizing Kubernetes and Terraform.
  • Expertise in identifying performance bottlenecks and designing and writing high-performance code for large-scale ML workloads.
  • Experience with at least one of: torch.profiler, TorchDynamo, TorchInductor, Triton, or other deep learning compiler stacks.
  • Knowledge of cluster communication libraries such as NCCL or Gloo.
  • Experience working with any of the following: weather data, energy systems, time-series forecasting, electricity markets, or financial trading.


$174,000 - $231,000 a year



Join our team and make a difference! Click below or email us at careers@gridmatic.com.