An international organization is looking for a Site Reliability Engineer to strengthen its Engineering team by improving the reliability, scalability, monitoring, and performance of its on-premises infrastructure. The role focuses on building modern observability solutions, automating operations, and ensuring high system availability while maintaining security compliance.
Responsibilities
- Design and maintain monitoring infrastructure.
- Build custom dashboards, alerts, and visualization solutions.
- Implement distributed tracing using OpenTelemetry.
- Implement centralized log aggregation with Elasticsearch.
- Establish monitoring best practices and SLI/SLO frameworks.
- Maintain security compliance for monitoring platforms.
- Automate deployment and configuration management.
- Collaborate with development teams on application instrumentation.
- Participate in 24/7 on-call incident support rotations.
- Improve system reliability and operational efficiency.
Technical skills
Must have:
- 3+ years of experience in monitoring and observability.
- 2+ years of production experience with Grafana and Prometheus.
- Advanced knowledge of Grafana.
- Strong expertise in Prometheus (PromQL).
- Experience with OpenTelemetry.
- Experience with Elasticsearch.
- Strong Linux system administration skills.
- Networking knowledge.
- Experience with on-premises infrastructure.
- Experience with enterprise security and compliance.
- Programming experience with Python, Bash, or Go.
- Ability to balance technical and business priorities.
- English at C1 level.
Should have:
- Experience defining SLI/SLO frameworks.
- Experience with deployment and configuration automation.
- Experience collaborating closely with software development teams.
- Incident response experience in production environments.
- Ability to work effectively within cross-functional teams.
Nice to have:
- German language.
- French language.
- Dutch language.