Staff Site Reliability Engineer, PaaS

Algolia • Paris, France • 1w ago

Algolia is set to enable every company to create world-class Search and Discovery experiences with an API-first approach. Performance and Scalability are at the heart of our mission: we power 1.5 trillion searches a year, for 10K+ customers all over the world.

If you're a problem solver, able to think outside the box and eager to nurture others and learn from them, then this is your challenge!

The Team

The Platform as a Service (PaaS) team is dedicated to empowering development teams by creating toolchains, guidelines, and standards. Our focus is on enabling seamless automation and CI/CD, comprehensive observability, and unwavering reliability in a secured cloud-native environment.

The Opportunity

The Staff Engineer position within the Platform as a Service team offers a compelling opportunity for an adept professional with a rich background in architecting, constructing, and managing scalable infrastructures. This role concentrates on three key areas: CI/CD, observability, and application hosting.

As a senior member of the Platform as a Service team, you will wield significant influence over Algolia’s Search Products. Your responsibilities will revolve around crafting and executing systems pivotal to ensuring reliability, scalability, and cost optimisation. You will be instrumental in architecting robust CI/CD pipelines, establishing comprehensive observability frameworks, and managing hosting solutions focused on API Management and microservices management. Moreover, as an expert within the team, you will actively participate in mentoring and guiding fellow team members, fostering a culture of collaboration and excellence. In addition, this role entails actively engaging in cross-team collaboration, spearheading projects alongside SREs and SWEs.

Your role will consist of:

Design and deploy a cloud-native API Management solution to boost platform scalability, security, and reliability, while expediting new feature setup for swift and seamless onboarding of development teams.
Spearhead the design and implementation of a robust and scalable CI/CD toolchain, serving as a centralised build factory to streamline development processes and ensure consistent quality across all services hosted on the product platform.
Lead the development and deployment of comprehensive observability standards and automation solutions, empowering teams with actionable insights and enabling proactive resolution of issues, enhancing overall system reliability and performance.
Drive the evolution and maintenance of a Kubernetes-based architecture, optimising resource utilisation, enhancing fault tolerance, and ensuring the platform's ability to meet evolving demands efficiently and effectively.
Provide guidance and mentorship to other SRE team members, helping them develop their skills and knowledge of best practices in site reliability engineering.
Establish and enforce engineering processes and best practices that ensure high-quality, reliable, and scalable systems, and work with other teams to promote their adoption across the organization.
Collaborate with senior leadership to shape the vision and direction of the company's (cloud) infrastructure, and help drive the development of SRE-specific strategies and initiatives that align with business objectives.
Build and maintain strong relationships with stakeholders across the organization, and represent the SRE organization in cross-functional meetings and discussions.

You might be a fit if you have:

Strong knowledge of the programming languages Golang and Python, and familiarity with software craftsmanship. Knowledge of Ruby is a plus.
Experience designing and building API Management and Kubernetes-based architectures.
Experience building and operating distributed systems at scale.
Experience with CI/CD setup and architecture. Strong knowledge of GitHub Actions, CircleCI, or alternatives is expected.
Experience designing new applications with reliability, operability, and availability in mind.
Experience with public cloud providers such as GCP, AWS, or Microsoft Azure, and with Kubernetes administration.
Excellent communication and organisation skills.

We’re looking for someone who can live our values:

GRIT - Capacity for problem-solving and perseverance in an ever-changing and growing environment.
TRUST - Willingness to trust our co-workers and to take ownership.
CANDOR - Ability to receive and give constructive feedback.
CARE - Genuine care about other team members, our clients, and the decisions we make in the company.
HUMILITY - Aptitude for learning from others, putting ego aside.
