Senior Distributed Systems Engineer

Lumalabs

Our mission is to build multimodal AI to expand human imagination and capabilities

We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

We will deploy these systems to make a new kind of intelligent creative partner that can imagine with us. Free and away from the pressure of being creative. It's for all of us whose imaginations have been constrained, who've had to channel vivid dreams through broken words, hoping others will see what we see in our mind's eye. A partner that can help us show — not just tell.

Dream Machine is an early step toward building that.

Why you should join us:

  • Luma is bringing together the best team in the world to achieve our goal, from researchers to engineers and designers to growth operators
  • Luma is not just a lab - we are deeply product focused, and our vision of merging AI models with delightful products is unique in the industry
  • We build. We ship. Our early products have been wildly successful

What do we value?
  • Expertise in your field
  • Urgency, velocity and execution
  • Problem-solving mindset
  • Clear communication
  • Product focus

We are looking for people with strong ML and distributed systems backgrounds. This role sits within our Research team and collaborates closely with researchers to build the platforms for training our next generation of foundation models.

Competencies

Responsibilities

  • Work with researchers to scale up the systems required for our next generation of models trained on multi-thousand GPU clusters.
  • Profile and optimize our model training codebase to achieve best-in-class hardware efficiency.
  • Build systems to distribute work across massive GPU clusters efficiently.
  • Design and implement methods to robustly train models in the presence of hardware failures.
  • Build tooling to help us better understand problems in our largest training jobs.

Experience

  • 5+ years of work experience.
  • Experience working with multimodal ML pipelines, high-performance computing and/or low-level systems.
  • Passion for diving deep into system implementations and understanding their fundamentals in order to improve their performance and maintainability.
  • Experience building stable and highly efficient distributed systems.
  • Strong generalist Python and software engineering skills, including significant experience with PyTorch.
  • Nice to have: experience working with high-performance C++ or CUDA.

Please note this role is not meant for recent grads.