DevOps Support Engineer

NorthBay Solutions

  • Abu Dhabi
  • Permanent
  • Full-time
  • 1 month ago
About AI Factory

The AI Factory operates sovereign AI infrastructure including:
  • GPU clusters
  • Cloud subscriptions
  • Containerized workloads
  • API gateways
  • Multi-environment deployments (Sandbox → Staging → Production)

The DevOps Support Engineer ensures:
  • Infrastructure stability
  • Deployment reliability
  • Operational continuity for AI workloads

Role Overview

Employment Type: Full-time
Work Arrangement: Onsite (applicants based outside the UAE are required to relocate)

The DevOps Support Engineer is responsible for supporting:
  • Cloud infrastructure
  • CI/CD pipelines
  • Containerized AI workloads
  • API gateways
  • Production environments

The role focuses on:
  • Platform stability
  • Environment health
  • Deployment reliability
  • Infrastructure troubleshooting
  • Structured incident management
  • Environment discipline
  • Production governance

This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.

The engineer:
  • Acts as L1 operational responder for infrastructure/platform incidents
  • Ensures issues are diagnosed, contained, and escalated appropriately
  • Ensures resolution within defined service levels

Key Responsibilities

1. Infrastructure, Cloud & Environment Support
  • Support Azure subscriptions, resource groups, networking, and access control
  • Monitor GPU environments, container clusters, and AI runtime environments
  • Troubleshoot deployment failures across Sandbox, Staging, and Production

2. DevOps & CI/CD Support
  • Monitor CI/CD pipelines and resolve build/deployment issues
  • Support Git workflows, version control issues, and release rollouts
  • Ensure environment configuration consistency
  • Validate infrastructure changes post-deployment
  • Perform rollback support when required

3. GPU & AI Runtime Operations Support
  • Monitor GPU utilization and allocation
  • Identify memory saturation and CUDA/container runtime errors
  • Support AI model deployment on GPU nodes
  • Detect performance bottlenecks affecting inference services

4. API Gateway, WAF & Integrations
  • Troubleshoot API gateway routing issues and throttling policies
  • Monitor rate limiting and traffic control mechanisms
  • Investigate WAF-related blocking incidents
  • Support secure external integrations
  • Support integrations with enterprise systems:
    • Microsoft 365
    • SharePoint
    • Teams
    • Oracle
    • Jira
  • Troubleshoot authentication issues, webhook failures, and API timeouts

5. Observability & Incident Response
  • Monitor service availability, CPU/GPU utilization, memory, storage, and logs
  • Detect infrastructure bottlenecks affecting AI workloads
  • Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
  • Perform triage using logs, metrics, system databases, and environment diagnostics
  • Classify incidents by severity and business impact in line with defined SLAs
  • Contain and mitigate production-impacting issues
  • Coordinate with L2/L3 teams and vendors
  • Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
  • Track incident lifecycle to closure and ensure no SLA breach

6. Documentation & Knowledge Management
  • Maintain and improve:
    • Infrastructure runbooks
    • Deployment troubleshooting guides
    • Environment configuration documentation
    • FAQs
  • Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
  • Handle ITSM/ticketing documentation
  • Capture and publish Root Cause Analysis (RCA) summaries for major incidents
  • Update environment diagrams and operational checklists after changes

7. Platform Reliability
  • Support Kubernetes clusters, Docker containers, and orchestration layers
  • Validate scaling, failover, and resilience mechanisms
  • Ensure uptime SLAs for AI products, platforms, and APIs

8. Security & Compliance Coordination
  • Support IAM, access control, WAF, and network configurations
  • Coordinate with security teams for incident remediation
  • Ensure adherence to environment governance policies

Required Technical Skills
  • Strong hands-on experience with Azure (AWS/GCP acceptable)
  • Experience supporting Kubernetes and Docker environments
  • Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
  • Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
  • Understanding of networking, IAM, API gateways, and WAF
  • Experience supporting production cloud environments under SLA constraints
  • Familiarity with Infrastructure-as-Code concepts (ARM/Terraform)

Experience
  • 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE-aligned roles
  • Experience supporting containerized or AI workloads preferred
  • Exposure to regulated or government environments advantageous
  • Arabic speaker is a plus
