Observability SME (Datadog) (m/f/d)
Halian
- Abu Dhabi
- Contract
- Full-time
- Own our Datadog-as-code monorepo - dashboards (global, app, database, pipeline, performance, GraphQL, RUM, release-validation), monitors (APM, K8s, CNPG Postgres, Ingress, Logs, Composite, FinOps, TLS, Tenant drift, Deployment health), SLOs, Log Indexes & Pipelines, and RUM metrics.
- Drive the Observability Compliance release gate - enforce that every service ships with SLOs, monitors, dashboards, and log pipelines before it can go to prod.
- Design and run Datadog IRM / On-Call - escalation policies, routing rules, schedules, and JSM integration driven by client SLAs.
- Lead standardisation initiatives: health, log formats, trace tagging, RUM instrumentation, APM service naming.
- Build SRE dashboards and evidence reports tied to release gates and quarterly reviews.
- Close observability gaps.
- Partner with product/engineering to turn raw telemetry into SLOs that match client contracts.
- Mentor service teams on instrumentation - you are the internal Datadog expert.
- Datadog: APM, RUM, Logs, Infrastructure, Network, Synthetics, SLOs, IRM/On-Call, Notebooks, CI Visibility
- IaC: Terraform + Terragrunt (Datadog provider), GitHub Actions delivery
- Signals: OpenTelemetry (Go, Node/TS, Python), Datadog Agent, CNPG exporter, pgwatch, Kyverno policy metrics
- Adjacent: Elasticsearch, Prometheus (limited), Slack (flux-events), PagerDuty-style routing via Datadog IRM + JSM
- Languages: Terraform/HCL daily; Go and Python for tooling
- 6+ years in DevOps / SRE / Observability with deep, hands-on Datadog expertise (not just "used Datadog" - designed and scaled an estate).
- Strong Terraform skills - you are comfortable authoring Datadog provider resources at scale (hundreds of monitors/dashboards as code).
- Demonstrated ability to define and drive SLOs from business/contract requirements to implemented monitors and error budgets.
- Real experience with APM tracing, log pipelines, log-based metrics, composite monitors, anomaly detection, forecast alerts.
- Hands-on Kubernetes observability - Datadog Agent, DaemonSets, Admission Controller, cluster checks, autodiscovery.
- Experience building or operating an on-call / incident response program (Datadog IRM, PagerDuty, Opsgenie, or similar).
- Scripting in Python or Go - you can automate Datadog API workflows, backfill tags, migrate resources.
- You care about signal quality over noise - you have killed more monitors than you have created.
- OpenTelemetry contributions or deep tuning of the OTel Collector.
- Regulated industry experience (FinServ, HealthTech) with audit-ready observability.
- FinOps / cost-observability experience (Kubecost, Datadog Cloud Cost Management).
- Experience migrating from another APM (New Relic, Dynatrace, AppDynamics) to Datadog.
- Jira Service Management integration for incident → ticket workflows.
- Scale & impact: Our platform powers digital wealth for top-tier banks - real AUM, real regulatory stakes.
- Modern stack: Flux, Terragrunt, Datadog-as-code, Envoy Gateway, streaming SQL, Temporal - running real production workloads.
- Autonomy: Senior engineers own platforms end-to-end. No ticket-pushing, no gatekeepers.
- Strategic initiatives: Autonomous agent platform, automated release gates, SSL for SaaS, multi-region DR - lots to build.
- Team: Small, senior, opinionated DevOps/SRE group. You'll ship on day one.