Job Description
Core Responsibilities Overview
1. Deploy, optimize, and ensure the high availability of the company's core business systems, cloud platforms (such as AWS / Alibaba Cloud / Tencent Cloud), and foundational services (such as Kubernetes, Docker, Nginx, MySQL, Redis, Kafka).
2. Plan and implement system capacity management, performance optimization, and disaster recovery solutions to ensure service stability and scalability.
3. Build and maintain CI/CD pipelines that automate builds, testing, releases, and rollbacks.
4. Design and improve system monitoring, log collection, and alerting systems (Prometheus / Grafana / ELK / OpenSearch).
5. Participate in incident response for production failures, including root-cause identification and post-incident review; summarize findings and drive long-term improvements.
6. Help standardize operations processes and workflows, and produce best-practice documentation.
7. Participate in tracking and assessing overall operations costs, and audit IT expenditures.
Job Requirements
1. Full-time bachelor's degree or above in a computer-related field (associate degrees are not considered), with more than 5 years of experience in large-scale internet or cloud platform operations.
2. Familiar with mainstream Linux distributions; proficient in at least one of Shell, Python, or Go.
3. Proficient in Docker, Kubernetes, and CI/CD toolchains (Jenkins, GitLab CI, ArgoCD, etc.).
4. Familiar with monitoring and logging systems such as Prometheus, Grafana, ELK/OpenSearch.
5. Experience with public cloud architecture (AWS, Alibaba Cloud, GCP, Azure).
6. Good communication and teamwork skills, with the ability to quickly identify and solve complex problems.
7. Candidates based in Singapore or Malaysia are preferred; candidates in other regions outside mainland China will also be considered.
Bonus Points
1. Extensive experience in IT operations cost optimization.
2. Experience deploying global network acceleration and security protection for enterprise cloud environments.
