
Senior Engineer - Alerting & Incident Management
- Abu Dhabi
- Permanent
- Full-time
- To establish and maintain an effective, intelligent, and timely alerting framework across infrastructure, application, and business services.
- To coordinate and continuously improve the incident management lifecycle with a focus on early detection, rapid response, and root cause accountability.
- To integrate observability data (logs, metrics, traces) into a unified alerting and incident response workflow.
- To reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through automation, clear escalation paths, and operational discipline.
- Manage and continuously improve the incident response process, including triage, escalation, status communications, and resolution tracking.
- Act as the incident commander during major outages or high-severity issues, coordinating technical teams toward resolution.
- Maintain and govern on-call schedules, escalation paths, and responder playbooks.
- Integrate observability tools with incident management platforms to enable real-time, contextual alerting.
- Lead and document root cause analysis (RCA) and ensure completion of follow-up actions and preventive measures.
- Report on incident metrics and trends, identifying areas for resilience and process improvement.
- Maintain detailed documentation on alert rules, incident workflows, contact rosters, and escalation trees.
- Ensure compliance with regulatory, audit, and risk management requirements related to incident response and system availability.
- Collaborate with monitoring, logging, and APM peers to align telemetry signals with operational response.
- Work with development, infrastructure, and support teams to embed alerting and incident management best practices in the SDLC and change management.
- Participate in regular incident simulations and on-call readiness drills.
- Drive continuous improvement through retrospective reviews, blameless post-mortems, and incident automation.
- Strong experience with alert management platforms such as Opsgenie, Splunk On-Call (formerly VictorOps), or ServiceNow Event Management.
- Familiarity with routing rules, escalation policies, noise suppression, on-call schedules, and alert deduplication.
- Deep understanding of the end-to-end incident management process: detection, triage, escalation, communication, and closure.
- Proficient in running major incident bridges, documenting timelines, and leading post-incident reviews (PIRs/RCAs).
- Calm and assertive in high-pressure incident scenarios.
- Excellent communicator, able to coordinate with technical and business stakeholders during incidents.