You'll also play a key role in supporting our MLOps and LLMOps workflows, helping scale AI model deployment and experimentation across our platform.
Establish and maintain robust MLOps and LLMOps workflows to support the scalable development, reliable deployment, and continuous optimisation of LLMs.
Experience
7+ years in DevOps/Infrastructure Engineering, including AI/ML workloads in production.
Cloud & Efficiency
Strong AWS and Cloudflare skills with hands-on experience in EB, ECS, RDS, MSK/Kinesis, CloudWatch, IAM, Lambda, S3, Route 53, etc., and a proven track record in infrastructure cost optimisation.
Multi-region & Scaling
Experience designing highly available, scalable, multi-region systems with disaster recovery strategies and cost optimisation.
Containerisation & Orchestration
Hands-on experience with Docker and orchestration platforms such as ECS, EKS, or Kubernetes.
Security & Reliability
Good understanding of cloud security best practices to ensure safe and resilient systems.
CI/CD & Observability
Experience with CI/CD pipelines, such as Bitbucket Pipelines or GitHub Actions, and observability tools like OpenTelemetry and Datadog or similar.
Proficient with Terraform or Pulumi for managing infrastructure.
MLOps & LLMOps
Familiarity with machine learning operations is a plus. Experience supporting ML workflows and managing the model lifecycle using tools like MLflow or SageMaker is beneficial, but not required.
An understanding of concepts such as model versioning, experiment tracking, feature stores, scalable deployment, and the unique challenges of LLM (Large Language Model) inference, fine-tuning, and performance observability would be an advantage.
Incident Management
Experience setting up incident processes, participating in on-call rotations, and resolving production issues.
Collaboration & Enablement
Experience working closely with engineering teams to build tailored infrastructure, provide reusable blueprints and self-service tooling, and promote DevOps best practices.
A fast-moving environment with minimal bureaucracy and quick decision-making
The opportunity to work on cutting-edge AI products and services
A strong focus on high-quality technical solutions
High autonomy and rapid feedback cycles
A great chance to learn how to play poker
Remote-friendly work culture
Unlimited vacation policy
Close collaboration with engineering teams and meaningful contributions to a shared product vision
This role is part of AceGuardian, a cutting-edge team within A5 Labs. AceGuardian is focused on building advanced AI agents through reinforcement learning, game-solving, fine-tuning, and planning. These AI agents tackle challenges such as anti-cheat detection (including collusion and bots) and optimising gameplay across various games. The team operates in stealth mode and is composed of experts in AI, machine learning, and game development, all working together to revolutionise both gaming and real-world problem-solving. By joining this team, you'll contribute to innovative projects that push the boundaries of AI in the gaming industry while working alongside some of the brightest minds in the field.