Field Site Reliability Engineer

Remote, San Francisco

Determined AI

Role Locations

  • Remote
  • San Francisco


26 - 50 people


324 5th St
San Francisco, CA, 94107-1002, US

Tech Stack

  • Go
  • Python
  • Docker
  • TensorFlow
  • PyTorch
  • Keras
  • Elm
  • Kubernetes
  • PostgreSQL
  • AWS

Role Description

Deep learning has enormous promise, but developing practical applications powered by deep learning is extremely complex and expensive. At Determined AI, we are working to change that by building software to make machine learning engineers dramatically more productive.

We’re hiring exceptional people to help us solve hard problems, design and build our product, shape our culture, and grow our company. We are looking for mature, productive, and intelligent people who share our passion for delivering value to our customers. We value diversity in opinions and background.

Join our small team of machine learning and distributed systems experts, which includes key contributors to Apache Spark MLlib, Apache Mesos, and PostgreSQL, as well as PhDs and faculty from UC Berkeley, Chicago, and CMU. We value open communication, collaboration, and empathy: strong opinions, weakly held.

As a Field Site Reliability Engineer, you will:

  • Build software, tools, and processes to enable our customers to deploy, operate, and monitor Determined AI’s software in both cloud and on-premise environments.
  • Work closely with customers and our sales organization to troubleshoot and resolve infrastructure challenges that arise when deploying and/or operating Determined AI’s software.
  • Lead technically-oriented customer meetings to gather system requirements and provide guidance on deploying the platform.
  • Establish best practices for onboarding system administrators to manage the Determined AI platform.

Requirements:

  • 2+ years of experience designing and operating enterprise infrastructure.
  • Familiarity with at least one high-level programming or scripting language, such as Python, Go, or bash.
  • Deep understanding of Unix/Linux operating system internals and administration (e.g., filesystems, inodes, system calls).
  • Deep understanding of standard networking protocols and stacks (e.g., TCP/IP, routing, network topologies and hardware, SDN).
  • Strong ability to debug, troubleshoot, and resolve complex technical issues spanning multiple levels of the stack.
  • Strong communication skills, both written and verbal.

Preferred:

  • Experience working with modern distributed systems such as Kubernetes, Mesos, Hadoop / HDFS, and Apache Spark.
  • Experience working with cloud infrastructure providers such as AWS, GCP, and Azure.
  • Experience working with HPC clusters or NVIDIA GPUs (e.g., CUDA, cuDNN).
  • Experience in sales engineering or customer-facing roles in the enterprise software industry.

At Determined AI, we are committed to building a team that welcomes colleagues with a diverse set of identities, backgrounds, experiences, and perspectives. We're proud to be an equal opportunity employer and consider qualified applicants without regard to race, color, religion, sex, national origin, ancestry, age, pregnancy, citizenship, genetic information, sexual orientation, gender identity, marital or family status, veteran status, medical condition or disability.

About Determined AI

We're working together to empower data scientists and ML engineers everywhere.

Our customers are highly skilled ML engineers and domain experts working on exciting problems in biotech, hardware design, autonomous vehicles, and more. We interact with them to learn more about their data sets, modeling problems, and infrastructure, to help them with our product, and to improve our product offering.

Company Culture

We believe the best ideas can come from anyone and anywhere, and we have to be humble enough to listen for them. We are customer-focused, but don't think the customer is always right. We are excited about the latest in ML and distributed systems research but try to implement the minimum valuable product. We believe in open communication and transparency in our process and priorities. We believe in the healing power of karaoke and hot sauce.
