AI Support Engineer

Abu Dhabi
Permanent
Full-time

1 month ago

About AI Factory

The AI Factory is the product and technology arm of GovAI, responsible for:
- Building and operating sovereign AI platforms, models, and services across Abu Dhabi Government entities
- Scaling AI services from pilot to production
- Strengthening AI Operations to ensure:
  - Reliability
  - Governance
  - High-quality support for AI-powered workloads

Role Overview

Employment Type: Full-time
Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)

The AI Support Engineer is:
- The first line of operational support for AI platforms, AI models, APIs, and AI-enabled solutions
- Focused on supporting:
  - AI workloads
  - Model consumption
  - RAG pipelines
  - AI-driven applications running in production
- An AI-focused L1 operations role, aligned with AIOps practices (not a traditional IT helpdesk)
- Responsible for:
  - Triaging AI-related incidents
  - Monitoring model behavior and API performance
  - Supporting AI service integrations
  - Escalating issues to engineering, vendors, or platform teams

Key Responsibilities

AI Model & API Operations Support
- Support production AI model consumption (LLMs, embeddings, OCR, STT/TTS, inference APIs)
- Troubleshoot inference failures, latency spikes, malformed payloads, and API errors
- Diagnose authentication failures (OAuth, tokens, API keys, quota limits)
- Validate request structures and integration configurations
- Monitor token consumption trends and detect abnormal usage spikes
- Support quota management and controlled usage increases

RAG & AI Pipeline Operations
- Monitor RAG pipelines and retrieval workflows
- Troubleshoot embedding generation failures and indexing issues
- Identify ingestion failures affecting vector databases
- Validate document connector and data pipeline integrity
- Diagnose relevance or response degradation caused by configuration issues
- Escalate data-layer or infrastructure-level issues to DevOps support

AI Governance & Guardrail Monitoring
- Ensure AI service consumption complies with defined access controls and governance policies
- Validate rate limiting, usage policies, and guardrail configurations
- Detect abnormal usage patterns or policy violations
- Support enforcement of entity-level quotas and access restrictions
- Escalate governance breaches to appropriate stakeholders

Incident Triage & SLA Management
- Act as first responder for AI-layer incidents (P0–P3)
- Perform structured triage using logs, API traces, and monitoring dashboards
- Classify incidents based on severity and business impact
- Contain and mitigate AI service disruptions and coordinate with vendors when needed
- Escalate complex issues to L2/L3 engineering with complete diagnostic context
- Track incidents through full lifecycle and ensure SLA adherence
- Participate in Root Cause Analysis (RCA) for major AI service failures

Release Validation & Change Support
- Perform smoke validation after AI model updates or API releases
- Monitor regression risks following deployments
- Identify post-release anomalies and escalate early
- Support controlled rollout monitoring for new AI capabilities

Enterprise Integration & Connector Support
- Support integrations with enterprise systems (Microsoft 365, SharePoint, Teams, Oracle, Jira, etc.)
- Troubleshoot API integration failures, webhook errors, and data exchange issues
- Validate secure connectivity and authentication configurations
- Coordinate with DevOps support for infrastructure-related integration failures

Observability & Operational Monitoring
- Monitor AI API performance metrics (latency, error rates, throughput)
- Track token usage, consumption trends, and service availability
- Identify recurring failure patterns and propose preventive actions
- Maintain visibility dashboards for AI service health

Documentation & Knowledge Management
- Maintain AI troubleshooting runbooks and support playbooks
- Update known-issue repositories and FAQs
- Document recurring AI API and RAG-related issues
- Capture structured RCA documentation for major incidents
- Contribute to operational documentation for new AI services
- Handle ITSM/ticketing

Required Technical Skills
- Experience supporting REST APIs and API-based platforms
- Understanding of LLM consumption patterns (RAG, embeddings, inference APIs)
- Familiarity with authentication mechanisms (OAuth2, API keys, token-based access)
- Ability to troubleshoot using logs, traces, and monitoring dashboards
- Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
- Basic understanding of cloud environments (Azure preferred)
- Familiarity with enterprise system integrations
- Understanding of rate limiting, quotas, and API governance

Experience
- 3–8 years in:
  - AI platform support
  - API support
  - SaaS support
  - Application operations
- Experience supporting AI/ML services or developer platforms preferred
- Exposure to regulated or government environments advantageous
- Experience working with external vendors and enterprise stakeholders
- Arabic speaker is a plus

NorthBay Solutions