
Principal Lead - Observability and Incident Management
- Abu Dhabi
- Permanent
- Full-time
- Ensure proactive detection, diagnosis, and resolution of service health issues across all IT environments
- Establish a modern observability function that delivers full visibility into critical services, applications and infrastructure layers
- Own and lead the major incident management process, ensuring rapid containment, clear communication and structured resolution
- Drive actionable insights through metrics, traces and logs (MTL) and ensure system health telemetry is used to improve availability, performance and user experience
- Support operational risk reduction and continuous improvement through RCA, trend reporting and resilience engineering
- Monitoring and observability engineering
- Alerting, noise reduction and event correlation
- Incident management
- Post-incident review and RCA
- Dashboarding and health visibility
- Service reliability metrics
- Define the observability architecture strategy ensuring scalability, data security and cost optimisation
- Collaborate with app, infra and security teams to ensure instrumentation coverage and logging compliance
- Maintain operational documentation, runbooks, escalation matrices and incident playbooks
- Drive a blameless culture of improvement and incident learning
- Align monitoring practices with regulatory and compliance obligations
- Represent the observability and incident management function at governance forums
- Engage with vendors, SaaS providers, and cloud platforms to ensure integration with internal monitoring and incident workflows
- Coach and mentor monitoring and incident managers to raise maturity across people, processes and tooling
- Deep expertise in monitoring platforms (e.g. ELK, AppDynamics, Grafana, Elastic, Datadog), APM, synthetic monitoring and log aggregation
- Solid understanding of distributed systems, microservices and hybrid cloud environments
- Strong command of SRE, telemetry pipelines, SLI/SLO and alerting strategies
- Experience running 24/7 incident command processes, leading war rooms, managing comms to executives and driving post-mortems
- Ability to align observability practices to business-critical services and customer impact, not just infra health
- Mastery of ITIL event management and incident management, with ITSM platforms such as ServiceNow
- Calm, decisive leadership in high-pressure scenarios, with excellent cross-functional coordination and communication skills
- Overall 15+ years of technology experience is desirable