Medior Site Reliability Engineer
Would you like to work on the backbone we rely on?
We are looking for an experienced Site Reliability Engineer (SRE) to join the Engineering Chapter team and help ensure the reliability, scalability, and performance of critical on-premises services within the ERA product organization.
In this role, you'll focus on building and maintaining a modern observability platform, implementing monitoring best practices, and automating operational processes. Working closely with cross-functional engineering teams, you'll help improve system resilience, reduce incident response times, and ensure the availability of business-critical services.
If you're passionate about observability, automation, and operational excellence, this opportunity is for you.
Role
Observability & Monitoring
- Design, implement, and maintain enterprise monitoring solutions.
- Build intuitive Grafana dashboards and visualizations.
- Configure meaningful alerts to proactively detect issues.
- Implement distributed tracing and centralized log aggregation.
- Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Continuously improve monitoring coverage and platform visibility.
Infrastructure & Reliability
- Manage and optimize on-premises monitoring infrastructure.
- Ensure platform reliability, scalability, and high availability.
- Support Linux-based environments and troubleshoot infrastructure issues.
- Participate in 24/7 on-call rotations for incident response.
- Contribute to reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Automation & DevOps
- Automate deployment, configuration, and operational tasks.
- Develop automation scripts using Python, Bash, or Go.
- Improve infrastructure management through automation and standardization.
- Support Infrastructure as Code and operational best practices.
Collaboration
- Work closely with development teams to improve application instrumentation.
- Promote observability best practices across engineering teams.
- Balance technical improvements with business priorities.
- Contribute to continuous improvement initiatives within the Engineering Chapter.
Security & Compliance
- Ensure monitoring solutions comply with enterprise security standards.
- Maintain secure on-premises monitoring environments.
- Support compliance and governance requirements.
Profile
Core Technical Skills
- Advanced experience with Grafana
- Strong expertise in Prometheus and PromQL
- Hands-on experience with OpenTelemetry
- Experience with Elasticsearch
- Strong Linux system administration skills
- Good understanding of networking fundamentals
- Experience securing on-premises infrastructure
Programming & Automation
Experience with one or more of:
- Python
- Bash
- Go
Experience
- 3+ years of experience in monitoring, observability, or Site Reliability Engineering.
- At least 2 years of hands-on experience with Grafana and Prometheus in production environments.
- Strong experience supporting Linux-based production systems.
- Proven experience managing enterprise on-premises infrastructure.
- Experience participating in 24/7 operational support or on-call rotations.
Security
- Understanding of enterprise security practices.
- Experience working within compliance-driven environments.
Who You Are
- Passionate about reliability, automation, and operational excellence.
- Analytical with strong troubleshooting skills.
- Comfortable working in production-critical environments.
- Able to prioritize effectively and balance technical improvements with business needs.
- Collaborative and proactive in working with cross-functional teams.
- Committed to continuous improvement and knowledge sharing.
Offer
- Freelance, long-term contract
- 3 days remote
What You'll Help Deliver
As a Site Reliability Engineer, you'll contribute directly to:
- Improved platform reliability and system availability.
- Reduced MTTD (Mean Time to Detect) and MTTR (Mean Time to Recover).
- Comprehensive observability across critical services.
- Automated deployment, monitoring, and operational processes.
- Secure and compliant monitoring infrastructure supporting business-critical applications.