*Job Description*
Own the stability, availability, and performance of the company's global core-business infrastructure on AWS, and be accountable for the SLA of the production environment.
Understand the architecture of cloud-native components such as Kubernetes, Envoy, service meshes, and Ingress gateways well enough to carry out targeted troubleshooting.
Improve operational efficiency by building automation tools and platforms (IaC, CI/CD) that provide system observability, self-healing, and rapid fault recovery.
Help define and roll out the operational security framework, including but not limited to access control (AWS IAM/K8s RBAC), network security policies, vulnerability management, and emergency response.
Build and improve the global operations framework, including but not limited to capacity planning, monitoring and alerting (Prometheus/ELK), CI/CD (GitLab/Jenkins), disaster recovery, and automated fault recovery.
Develop a deep understanding of the business architecture, contribute to the design of high-availability and disaster-recovery solutions, and continuously optimize costs.
*Job Requirements / Qualifications*
*Essential Requirements:*
Over 5 years of experience in Linux operations/SRE/DevOps, with the ability to operate large distributed systems.
Proficient in the use, architecture, and operation of core AWS services (such as EC2, S3, VPC, IAM, ELB, and RDS), with extensive cost optimization experience.
Deep understanding of Kubernetes architecture, with experience in managing, troubleshooting, and performance tuning large-scale K8s clusters in production environments.
Familiar with Layer 7 traffic management technologies such as Envoy, the Istio or Linkerd service meshes, or the Nginx and Istio Ingress controllers.
Strong operational security awareness and practices, including familiarity with common security vulnerabilities and preventive measures at the operating system, network, and application layers.
Proficient in at least one programming language (Go/Python/Shell), with the engineering ability to solve operational problems by building automation tools and platforms.
Expert in Prometheus monitoring and the ELK logging stack, able to build efficient observability platforms on top of them.
Strong experience in capacity planning and performance testing, with the ability to analyze system bottlenecks quantitatively and plan for them.
