Job Description
Cloud Platform (Required)
Familiar with core AWS services, with hands-on experience in the following:
EKS: Pod status troubleshooting, node eviction, log viewing (basic kubectl operations)
ALB: Target group health checks, 5xx error troubleshooting, understanding of listener rules
CloudFront / CDN: Diagnosing origin anomalies, understanding of caching strategies, identifying edge node status
RDS / ElastiCache: Connection count monitoring, understanding of failover behavior
CloudWatch: Understanding of alarm configuration, basic log queries
Monitoring and Observability (Required)
Familiar with at least one monitoring system: Grafana, Prometheus, CloudWatch, ELK / Kibana
Able to use dashboards to quickly identify the affected services, the time window of the anomaly, and the scope of impact
Understand the meaning of common alert metrics: CPU/memory usage, P99 response time, error rate, QPS fluctuations
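As a rough illustration of the alert metrics listed above, the sketch below computes P99 response time, error rate, and QPS from a small batch of request records. The record fields (`latency_ms`, `status`), the sample data, and the function names are hypothetical, and the nearest-rank percentile used here is only one common definition; monitoring systems such as Prometheus or CloudWatch may interpolate or estimate percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_seconds):
    """Summarize one time window of request records (hypothetical schema)."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["status"] >= 500)  # count 5xx only
    return {
        "p99_ms": percentile(latencies, 99),
        "mean_ms": sum(latencies) / len(latencies),
        "error_rate": errors / len(requests),
        "qps": len(requests) / window_seconds,
    }


# Made-up sample: 200 requests in a 10-second window,
# 196 fast successes and 4 slow 503 responses.
sample = [{"latency_ms": 20, "status": 200}] * 196 + \
         [{"latency_ms": 900, "status": 503}] * 4
stats = summarize(sample, window_seconds=10)
print(stats)
```

In this made-up window the mean latency stays low (37.6 ms) while the P99 is 900 ms, which is why tail percentiles are usually alerted on alongside averages.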
Troubleshooting Ability (Required)
Systematic troubleshooting approach, able to narrow down root causes layer by layer: alerts → logs → traces
Familiar with basic Linux commands, able to perform basic diagnostics on nodes or containers
Able to quickly read and understand system architecture documents and runbooks, and to troubleshoot unfamiliar systems independently
On-call experience is preferred
Soft Skills
Strong sense of responsibility, able to maintain focus and respond promptly during night shifts
Clear communication: describes faults accurately without omitting key information
Works well under pressure, able to remain calm and act methodically during high-pressure incidents
Good documentation habits: standardized incident records and complete handover notes