Site Reliability Engineer
29647
Posted: 23/05/2025
- Salary: Negotiable
- Location: Asia Pacific
About Our Client:
- A pioneering AI innovation leader
Key Responsibilities:
- Manage production-grade container ecosystems (Kubernetes/Docker) and open-source component clusters across multiple business units
- Develop infrastructure operation platforms encompassing CI/CD pipelines, monitoring/alerting systems, and centralized logging solutions
- Respond rapidly to incidents to maintain service continuity and minimize downtime
- Optimize system architecture and deployment strategies to ensure service availability of 99.9% or higher
- Spearhead automation programs to streamline operations and eliminate manual processes
- Partner with engineering teams to implement infrastructure-as-code (IaC) principles and reliability patterns
- Maintain 24/7 operational readiness through rotational on-call support
Qualifications:
- 5+ years in SRE/DevOps roles managing large-scale distributed systems
- Expert-level proficiency with AWS/Azure/GCP cloud ecosystems
- Advanced Linux administration skills with hands-on system maintenance experience
- Strong scripting skills in Python and Shell for operational automation
- Deep technical expertise in optimizing Nginx, JVM, Redis, Kafka, and SQL/NoSQL datastores
- Production experience managing Kubernetes clusters and containerized workloads
- CI/CD implementation experience using GitLab CI/ArgoCD or comparable tools
- Proven ability to diagnose complex system failures under time constraints
- Effective remote collaboration skills across technical teams
- Self-driven, with a strong sense of technical ownership
- Full professional fluency in English and Chinese

Rachel Mou
Divisional Director | China
Recruitment