Observability SME (Datadog) (m/f/d)

Abu Dhabi
Contract
Full-time

3 days ago

Observability is a first-class product on our platform. We run a large Datadog estate fully managed as code (Terraform/Terragrunt) covering APM, RUM, Logs, Infrastructure, Synthetics, SLOs, and IRM/On-Call for 100+ services across multiple regions.

You will own this estate end-to-end: dashboards, monitors, SLOs, on-call routing, and the automated release gates that block bad deploys from reaching production. You will partner with service teams to raise the bar on telemetry quality, instrument new services, and make Datadog genuinely actionable.

What you'll work on

Own our Datadog-as-code monorepo - dashboards (global, app, database, pipeline, performance, GraphQL, RUM, release-validation), monitors (APM, K8s, CNPG Postgres, Ingress, Logs, Composite, FinOps, TLS, Tenant drift, Deployment health), SLOs, Log Indexes & Pipelines, and RUM metrics.
Drive the Observability Compliance release gate - enforce that every service ships with SLOs, monitors, dashboards, and log pipelines before it can go to prod.
Design and run Datadog IRM / On-Call - escalation policies, routing rules, schedules, and Jira Service Management (JSM) integration driven by client SLAs.
Lead standardisation initiatives: health, log formats, trace tagging, RUM instrumentation, APM service naming.
Build SRE dashboards and evidence reports tied to release gates and quarterly reviews.
Close observability gaps.
Partner with product/engineering to turn raw telemetry into SLOs that match client contracts.
Mentor service teams on instrumentation - you are the internal Datadog expert.

Our observability stack

Datadog: APM, RUM, Logs, Infrastructure, Network, Synthetics, SLOs, IRM/On-Call, Notebooks, CI Visibility
IaC: Terraform + Terragrunt (Datadog provider), GitHub Actions delivery
Signals: OpenTelemetry (Go, Node/TS, Python), Datadog Agent, CNPG exporter, pgwatch, Kyverno policy metrics
Adjacent: Elasticsearch, Prometheus (limited), Slack (flux-events), PagerDuty-style routing via Datadog IRM + JSM
Languages: Terraform/HCL daily; Go and Python for tooling

What we're looking for

6+ years in DevOps / SRE / Observability with deep, hands-on Datadog expertise (you have designed and scaled an estate, not just "used Datadog").
Strong Terraform skills - you are comfortable authoring Datadog provider resources at scale (hundreds of monitors/dashboards as code).
Demonstrated ability to define and drive SLOs from business/contract requirements to implemented monitors and error budgets.
Real experience with APM tracing, log pipelines, log-based metrics, composite monitors, anomaly detection, and forecast alerts.
Hands-on Kubernetes observability - Datadog Agent, DaemonSets, Admission Controller, cluster checks, autodiscovery.
Experience building or operating an on-call / incident response program (Datadog IRM, PagerDuty, Opsgenie, or similar).
Scripting in Python or Go - you can automate Datadog API workflows, backfill tags, and migrate resources.
You care about signal quality over noise - you have killed more monitors than you have created.

Nice to have

OpenTelemetry contributions or deep tuning of the OTel Collector.
Regulated industry experience (FinServ, HealthTech) with audit-ready observability.
FinOps / cost-observability experience (Kubecost, Datadog Cloud Cost Management).
Experience migrating from another APM (New Relic, Dynatrace, AppDynamics) to Datadog.
Jira Service Management integration for incident → ticket workflows.

Why join us

Scale & impact: Our platform powers digital wealth for top-tier banks - real AUM, real regulatory stakes.
Modern stack: Flux, Terragrunt, Datadog-as-code, Envoy Gateway, streaming SQL, Temporal - running real production workloads.
Autonomy: Senior engineers own platforms end-to-end. No ticket-pushing, no gatekeepers.
Strategic initiatives: Autonomous agent platform, automated release gates, SSL for SaaS, multi-region DR - lots to build.
Team: Small, senior, opinionated DevOps/SRE group. You'll ship on day one.

Halian
