We are looking for a Site Reliability Engineer in a start-up company in Antwerp. You will work closely with the SRE Lead and the different engineering teams to build out a service. You are passionate and enthusiastic about building a reliable, performant and, most importantly, observable solution that will guide the company to become a team where we feel in control and have predictable releases and operational resilience, while keeping toil to a minimum.
You would love to start from the ground up and make your mark on what will become our observability culture.
Join the Site Reliability Engineering team, which focuses on four aspects: Self Service (removal of toil), Incident Response, Production-ready design and Observability. For this role, your main (but not only) focus will be on Observability.
- You will drive the observability momentum forward by:
  - Defining best practices for engineering teams and guiding them to get deep insights into their applications in production
  - Ensuring that dashboards and information radiators provide the right level of information to the right people in the organisation
  - Making events traceable and introducing improvements to help the people that operate the services
- You will continuously refine monitoring processes, thresholds and configuration, for example SLOs and SLIs
- You will build out the base of what we call "instrumenting our code" with libraries, examples and tooling.
- You will manage our processing environments with Infrastructure-as-Code and automate processes (Observability-as-Code)
- You will facilitate and improve the release management pipeline with regard to observability
- Overall, you will have an enormous influence on the way we approach reliability, which will be a crucial aspect of our service.
- You'll be part of an international team brought together by a culture of technical excellence, grit, integrity and open communication. You'll find our compensation and rewards highly competitive and, better yet, you can expect an Agile and flat structure, dynamic growth opportunities, flexibility, and a lot of room for innovation and technological advancement.
- You have a bachelor's or master's degree in computer science or a related field
- You have a track record working as a Site Reliability Engineer, Operations Engineer, or a Software Engineer
- You have experience with scripting and automation and know Linux inside out
- You have experience in working with cloud environments, including hands-on experience with Amazon Web Services.
- You have experience with Terraform and configuration management tools (Ansible, Chef, Puppet...)
- You have experience with microservices and orchestrators (Kubernetes, Nomad...)
- You have worked with monitoring frameworks (e.g. Datadog, ELK...)
- You are up to date on recent developments in the observability landscape (OpenTelemetry is not new to you)
- You have experience with multiple deployment methods (blue/green, canary, ...)
- Nice to have, and considered a big plus:
  - Experience rolling out SLOs and error budgets
  - Experience delivering workshops or training on topics like monitoring, logging standards and coding guidelines
- You are fluent in English and a great communicator
- You have an open and entrepreneurial mindset
- You are able to work in an environment with rapidly changing priorities
- You maintain a high-quality standard, but can strike a balance between quality, flexibility and timely delivery, without compromising on reliability.