Responsibilities
1. Kubernetes and AWS Cloud Resource Operations
- Manage the full lifecycle of production and testing Kubernetes clusters (including EKS managed clusters), ensuring high availability (SLA ≥ 99.95%).
- Optimize how Kubernetes clusters consume AWS resources (EC2, RDS, S3, ELB, VPC, etc.) to drive cost savings (e.g., Spot Instances, auto-scaling strategies).
- Lead version upgrades and stability tuning for EKS ecosystem components (CoreDNS, Ingress Controller, AWS Load Balancer Controller), troubleshooting network, storage, and compute anomalies in cloud-native environments.
2. Monitoring and Observability Platform
- Build an end-to-end monitoring and event management system covering containers, hosts, middleware, and AWS resources using Nightingale, Prometheus, Grafana, FlashDuty, and AWS CloudWatch.
- AWS Monitoring:
• Configure monitoring for AWS resources (EC2, RDS, S3, ELB, Lambda) via CloudWatch to collect metrics (CPU utilization, memory usage, disk I/O, request latency) and logs (CloudWatch Logs).
• Define custom AWS alert rules (e.g., RDS connection limits, S3 traffic spikes, Lambda error rates) and forward alerts to Nightingale and FlashDuty for cross-platform aggregation.
• Analyze CloudWatch metrics and Logs Insights to identify performance bottlenecks (EC2 CPU contention, RDS slow queries) and implement optimizations (instance right-sizing, SQL index tuning).
- Nightingale & FlashDuty:
• Maintain Nightingale’s time-series database and alerting engine, integrating Prometheus, CloudWatch, and other data sources for unified dashboards and alert consolidation.
• Operate the FlashDuty incident center, define alert severities (P0–P4) and dispatch policies (e.g., route EC2 failures to Infrastructure, RDS issues to Database), and shorten incident resolution times.
3. CI/CD Pipelines & AWS Cloud-Native Toolchain
- Integrate DevOps toolchains (Jenkins, GitLab CI, Argo CD, FluxCD) with AWS services (CodePipeline, CodeBuild, EKS) to design CI/CD workflows for hybrid cloud environments.
- Automate container image builds (ECR), artifact storage (S3), and canary and blue/green deployments on EKS to shorten release cycles.
- Improve pipeline performance, cost efficiency, and developer experience through AWS Spot Instances and caching strategies.
4. Middleware & AWS Service Reliability
- Maintain core middleware clusters (MySQL, Redis, RabbitMQ, Kafka, Elasticsearch) including managed services (RDS for MySQL, ElastiCache for Redis) and design high-availability architectures.
- Monitor middleware-to-AWS interactions (RDS connection pool usage, Kafka-to-S3 replication latency), proactively alert on cross-service bottlenecks, and implement remediation.
- Standardize middleware and AWS configurations (RDS parameter tuning, ElastiCache node selection), and produce O&M SOPs and runbooks.
5. Cross-Team Collaboration & AWS Enablement
- Collaborate with development and QA teams to provide AWS best practices (IAM policies, security group design), containerization guidance, and CI/CD support.
- Document AWS monitoring and cloud-native operations best practices, and conduct internal training sessions (CloudWatch metrics analysis, Nightingale-AWS integration troubleshooting).
Qualifications
• Education & Experience
– Bachelor’s degree or higher in Computer Science, Software Engineering, or a related field.
– Minimum 3 years of experience in operations or DevOps, with at least 1 year of AWS cloud operations experience.
• Technical Expertise
– Deep understanding of Kubernetes core concepts (scheduling, networking, storage) and hands-on experience deploying and operating Kubernetes clusters (EKS via eksctl, self-managed via kubeadm). Experience with large-scale EKS clusters (≥50 nodes) is a plus.
– Proficient with AWS CloudWatch for metrics collection, alert configuration, and log analysis (CloudWatch Logs Insights), and skilled in Metrics Explorer, Alarms, and Dashboards.
– Experienced in building monitoring and alerting systems with Nightingale, integrating multiple data sources (Prometheus, CloudWatch) for unified visualization.
– Familiar with FlashDuty incident management and AWS alert integration (e.g., Lambda-triggered FlashDuty events).
– Proficient in the Prometheus + Grafana + Alertmanager stack and knowledgeable in OpenTelemetry for metrics and tracing.
– Hands-on experience with at least one CI/CD toolchain (Argo CD, Jenkins) and with integrating it with AWS CodePipeline/CodeBuild.
– Skilled in operating MySQL, Redis, RabbitMQ, etc., including their AWS managed counterparts (RDS, ElastiCache), and in designing high-availability architectures.
– Strong problem-solving skills with the ability to diagnose complex issues (EKS node network packet loss, RDS cross-AZ replication latency) using CloudWatch logs, Nightingale metrics, and tracing.
• Certifications
– AWS Certified SysOps Administrator – Associate or AWS Certified Cloud Practitioner preferred.
• Soft Skills
– Excellent communication and collaboration skills, with the ability to drive cross-team technical initiatives.
– Strong documentation habits and a passion for knowledge sharing.