Senior Site Reliability Engineer

Nordhealth

About the role 

We are seeking a dedicated individual for a role that centers around Provet Cloud, our cloud-based veterinary practice management software (https://www.provet.cloud/). Provet Cloud is designed to save veterinary practitioners time, so they can devote more attention to caring for their patients, and to make managing a veterinary practice simpler and more efficient. It offers features for appointment scheduling, electronic medical records, inventory management, billing, and communication within the veterinary team.

The purpose of the Senior SRE role in our company is to ensure the scalability, reliability, and high availability of our platform. This includes automating our infrastructure to accommodate higher loads resulting from increased usage and monitoring the cloud hosting costs to keep them at a proper level as our user base expands. Additionally, the SRE plays a crucial role in maintaining the system's reliability, especially with multiple enterprise-class customers relying on our platform. The SRE team's focus on automation, monitoring, and proactive maintenance helps us meet the demands of our expanding user base while ensuring that our services remain consistently available and performant.

This is a unique opportunity to join our team and contribute to enhancing the efficiency and simplicity of veterinary practice management through Provet Cloud! 🚀

Your key responsibilities include:

  • Automate infrastructure to accommodate growing user base and workload.
  • Monitor and optimize cloud hosting costs to maintain efficiency.
  • Ensure the system is highly available and reliable for all our customers, especially enterprise-class customers.
  • Implement and maintain monitoring systems for performance and reliability.
  • Troubleshoot and resolve incidents to minimize downtime.
  • Collaborate with development and operations teams to improve system performance and stability.
  • Plan and manage capacity to meet future demand.
  • Implement and maintain disaster recovery and failover procedures.
  • Continuously evaluate and improve system architecture for scalability and reliability.

What will help you to be successful in this role?

Ideally, you have already gained some experience working in a fast-growing, global SaaS company.

Success factors and key challenges of the role:

  • Maintaining high availability while simultaneously optimizing costs is crucial for the SRE role. This involves balancing the need for reliability with cost-effectiveness to ensure efficient operations.
  • Keeping infrastructure maintained and updated with minimal downtime is essential, ideally with no noticeable interruptions for our clients and users. This requires careful planning and execution to minimize disruptions while making necessary changes.
  • Effective resource planning in a rapidly changing environment is critical to avoid overprovisioning while still meeting increasing demands. This involves staying proactive and adaptable to ensure resources are utilized optimally.
  • Continuous review and improvement of disaster recovery plans and procedures are necessary to mitigate potential risks effectively. Regular testing and updates are vital to ensure readiness for any unforeseen events.
  • Quick analysis and mitigation of any issues or incidents are essential, along with a clear plan for permanent resolution. This includes identifying root causes and implementing corrective measures to prevent recurrence.

Critical Knowledge and Experience:

  • Proficiency in AWS, Azure, or Google Cloud, and infrastructure as code (IaC) tools like Terraform.
  • Strong scripting abilities using Python, Bash, or PowerShell for infrastructure automation.
  • Experience with monitoring tools like Prometheus or Grafana for real-time monitoring and alerting.
  • Knowledge of incident management processes and tools like PagerDuty for effective incident resolution.
  • Understanding of HA and reliability principles, including failover and disaster recovery strategies.
  • Familiarity with networking concepts such as TCP/IP, DNS, and VPNs.

Having one or more of these skills will help in succeeding in this role:

  • Experience with tools like Ansible or Terraform for managing infrastructure configuration.
  • Understanding of CI/CD pipelines and experience with Jenkins or GitLab CI/CD for automating software delivery.
  • Awareness of security best practices and experience implementing security controls like IAM and encryption.
  • Basic knowledge of DBMS and experience with MySQL, PostgreSQL, or MongoDB.
  • Familiarity with logging frameworks like ELK or Splunk for analyzing log data.
  • Experience in performance optimization techniques to improve system performance.
  • Understanding of Agile methodologies and experience with Scrum or Kanban for iterative development.

What’s in it for you?

At Nordhealth, we do things a little bit differently. We value continuous improvement, diverse teams, and autonomy, which drive our collaboration. Our global healthcare domain is rapidly developing, and we are seeking colleagues who enjoy working in this type of environment. 🌎

In addition, we offer:
  • The chance to work in a meaningful industry and in a fast-growing, global company on a path to changing digital healthcare
  • Competitive compensation and benefits
  • Learning and professional growth opportunities
  • The tools you need and enjoy using
  • Frequent company events and talented colleagues from around the world

If you enjoy working in a fast-growing and international environment with the possibility to make an impact, this might be the perfect job for you. Apply now! We'll fill the position as soon as we find the right person.