Site Reliability DevOps Engineer
InnovaziT
- Abu Dhabi
- Permanent
- Full-time
- Define and implement SLIs / SLOs and error budgets for business-critical digital banking services.
- Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.
- Leverage AI-driven insights and anomaly detection (Dynatrace Davis AI or equivalent AIOps platform) to proactively predict and resolve reliability issues before impact.
- Lead incident management — from on-call triage and root-cause analysis to blameless postmortems with actionable follow-ups.
- Improve deployment safety with robust rollout / rollback strategies, canary and blue-green deployments, and production readiness reviews.
- Support and optimize microservices-based architectures, ensuring service reliability, scalability, and inter-service resilience.
- Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.
- Automate operational toil — from runbooks and remediation scripts to proactive health checks and self-healing workflows.
- Collaborate with DevOps to embed reliability gates and validations into CI / CD pipelines (GitHub Actions, Jenkins, GitLab CI / CD, or Azure DevOps).
- Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.
- Maintain high-quality documentation, playbooks, and operational standards across environments.
- Ensure operational compliance and security alignment with internal controls and regulatory standards.
- Analyze system performance, availability, and cost data to continually optimize operations.
- Provide reliability support and escalation guidance for critical production systems during major incidents.