Principal Lead - Observability and Incident Management

First Abu Dhabi Bank

  • Abu Dhabi
  • Permanent
  • Full-time
  • 10 days ago
Company Description
Are you ready to join us on our exciting transformation journey at the largest bank in the UAE? This is an opportunity to make a real impact on our customers, employees, shareholders, and communities, as part of the FAB team. We're committed to our grow stronger movement, and as a member of our team, you'll have access to everything you need to advance your career and make a meaningful contribution to our shared success. If you're looking for a career that will help you stand out and make a difference, now is the time to join us. Let's work together to achieve great things.
Job Description
Overall objectives
  • Ensure proactive detection, diagnosis, and resolution of service health issues across all IT environments
  • Establish a modern observability function that delivers full visibility into critical services, applications and infrastructure layers
  • Own and lead the major incident management process, ensuring rapid containment, clear communication and structured resolution
  • Drive actionable insights through metrics, traces and logs (MTL) and ensure system health telemetry is used to improve availability, performance and user experience
  • Support operational risk reduction and continuous improvement through RCA, trend reporting and resilience engineering
Job scope
Role-specific responsibilities
  • Monitoring and observability engineering
  • Alerting, noise reduction and event correlation
  • Incident management
  • Post-incident review and RCA
  • Dashboarding and health visibility
  • Service reliability metrics
General functional responsibilities
  • Define the observability architecture strategy ensuring scalability, data security and cost optimisation
  • Collaborate with app, infra and security teams to ensure instrumentation coverage and logging compliance
  • Maintain operational documentation, runbooks, escalation matrices and incident playbooks
  • Drive a blameless culture of improvement and incident learning
  • Align monitoring practices with regulatory and compliance obligations
  • Represent the observability and incident management function at governance forums
  • Engage with vendors, SaaS providers, and cloud platforms to ensure integration with internal monitoring and incident workflows
  • Coach and mentor monitoring and incident managers to raise maturity across people, processes and tooling
Qualifications
Core competencies required
  • Deep expertise in monitoring platforms (e.g., ELK, AppDynamics, Grafana, Elastic, Datadog), APM, synthetic monitoring and log aggregation
  • Solid understanding of distributed systems, microservices and hybrid cloud environments
  • Strong command of SRE, telemetry pipelines, SLI/SLO and alerting strategies
  • Experience running 24/7 incident command processes, leading war rooms, managing comms to executives and driving post-mortems
  • Ability to align observability practices to business-critical services and customer impact, not just infra health
  • Mastery of ITIL event management and incident management with ITSM platforms like ServiceNow
  • Calm, decisive leadership in high-pressure scenarios, with excellent cross-functional coordination and communication skills
  • Overall 15+ years of technology experience is desirable
