Senior Site Reliability Engineer

New York, NY, United States



51 - 100 people


7 W 18th St, Fl 5
New York, NY, 10011, US

Tech Stack

  • React
  • Python
  • Flask
  • AWS
  • Docker
  • PostgreSQL
  • S3

Role Description

At Arthur, we are building the first platform for Responsible AI. We’re looking for an experienced Site Reliability Engineer who will be responsible for the design and implementation of scalable, resilient infrastructure to power our SaaS & On-Prem platform. The ideal candidate is a hands-on engineer who can also work closely with management to define a strategy for evolving our platform and our SRE team as we grow.

As a Senior SRE, you will:

  • Architect & design the infrastructure responsible for our distributed & highly scalable services with both SaaS and on-prem offerings
  • Build monitoring to assess system & pipeline health and ensure performance and reliability
  • Collaborate with product to determine low-cost support models for our multi-tenant SaaS & On-Prem deployments
  • Automate and optimize developer pipelines to make them as frictionless as possible, sharing the responsibility of delivering capabilities rapidly with our product engineering team
  • Mentor teammates on SRE best practices & guide technical direction
  • Exhibit continuous curiosity for emerging technology that could solve our challenges


Requirements:

  • 5+ years of experience in site reliability engineering, DevOps, and system administration
  • Expert in working with DevOps tools such as Kubernetes, Terraform, Ansible, Puppet and Chef
  • Have built, managed, and operated highly scalable, performant, and reliable large-scale infrastructure or platforms for complex services
  • Proficiency in Python and at least one other development language, such as Go, Ruby, or Java
  • Strong competencies in managing and automating cloud infrastructure in AWS, GCP, and/or Azure
  • Expertise in Linux administration, configuration, and networking protocols
  • CS (preferred) or other technical degree, or equivalent practical experience


Nice to have:

  • 1+ years of experience as a technical lead or principal SRE
  • Competency configuring and deploying API gateways and microservice orchestration control planes
  • Experience with on-prem deployment architectures
  • Experience running a 24x7 SaaS platform with defined SLIs, SLOs, and SLAs
  • Experience with machine learning & AI

We offer:

  • Working with a small, fast-growing team, with lots of opportunity to take ownership and run with projects
  • The opportunity to get in on the ground floor of a rapidly growing startup, working with a cutting-edge technology stack
  • Generous equity
  • A culture that empowers great people to accomplish great things
  • Full benefits package

About Arthur

Arthur AI is the first production AI monitoring platform, giving enterprises the tools to detect model issues proactively and in real time to maximize model effectiveness. The Arthur AI platform brings auditability and transparency to black box models, and can be configured to monitor for unwanted bias.

Arthur’s Trusted AI platform offers a single pane of glass for all of your production models, highlighting where models may be inaccurate according to any of a range of statistical metrics.

Company Culture

We are a scrappy, highly motivated team with varied & diverse backgrounds, who value honest feedback, transparency, and collaboration in order to make our product & our processes better. We are early enough in our journey that our next set of hires can influence the culture!
