Senior Operations and Maintenance Engineer

Best Web3

SAR 15.8K-21K [Monthly]

On-site - China · 3-5 Yrs Exp · Bachelor · Full-time

Job Description

Responsibilities

1. Kubernetes and AWS Cloud Resource Operations
- Manage the full lifecycle of production and testing Kubernetes clusters (including EKS managed clusters), ensuring high availability (SLA ≥ 99.95%).
- Optimize the coordinated scheduling of AWS resources (EC2, RDS, S3, ELB, VPC, etc.) and Kubernetes clusters to reduce cost (e.g., Spot Instances, auto-scaling strategies).
- Lead version upgrades and stability tuning for EKS ecosystem components (CoreDNS, Ingress Controller, AWS Load Balancer Controller), and troubleshoot network, storage, and compute anomalies in cloud-native environments.

2. Monitoring and Observability Platform
- Build an end-to-end monitoring and event management system covering containers, hosts, middleware, and AWS resources using Nightingale, Prometheus, Grafana, FlashDuty, and AWS CloudWatch.
- AWS Monitoring:
  • Configure CloudWatch monitoring for AWS resources (EC2, RDS, S3, ELB, Lambda) to collect metrics (CPU utilization, memory usage, disk I/O, request latency) and logs (CloudWatch Logs).
  • Define custom AWS alert rules (e.g., RDS connection limits, S3 traffic spikes, Lambda error rates) and forward alerts to Nightingale and FlashDuty for cross-platform aggregation.
  • Analyze CloudWatch metrics and Logs Insights to identify performance bottlenecks (EC2 CPU contention, RDS slow queries) and implement optimizations (instance right-sizing, SQL index tuning).
- Nightingale & FlashDuty:
  • Maintain Nightingale's time-series database and alerting engine, integrating Prometheus, CloudWatch, and other data sources for unified dashboards and alert consolidation.
  • Operate the FlashDuty incident center, define alert severities (P0–P4) and dispatch policies (e.g., route EC2 failures to Infrastructure, RDS issues to Database), and improve incident resolution efficiency.

3. CI/CD Pipelines & AWS Cloud-Native Toolchain
- Integrate DevOps toolchains (Jenkins, GitLab CI, Argo CD, FluxCD) with AWS services (CodePipeline, CodeBuild, EKS) to design CI/CD workflows for hybrid cloud environments.
- Automate container image builds (ECR), artifact storage (S3), and canary or blue/green deployments to EKS environments, shortening release cycles.
- Improve pipeline performance, cost efficiency, and developer experience by leveraging AWS Spot Instances and caching strategies.

4. Middleware & AWS Service Reliability
- Maintain core middleware clusters (MySQL, Redis, RabbitMQ, Kafka, Elasticsearch), including managed services (RDS for MySQL, ElastiCache for Redis), and design high-availability architectures.
- Monitor middleware-to-AWS interactions (RDS connection pool usage, Kafka-to-S3 replication latency), proactively alert on cross-service bottlenecks, and implement remediation.
- Standardize middleware and AWS configurations (RDS parameter tuning, ElastiCache node selection), and produce O&M SOPs and runbooks.

5. Cross-Team Collaboration & AWS Enablement
- Collaborate with development and QA teams to provide AWS best practices (IAM policies, security group design), containerization guidance, and CI/CD support.
- Document best practices for AWS monitoring and cloud-native operations, and run internal training sessions (CloudWatch metrics analysis, Nightingale-AWS integration troubleshooting).

Qualifications

• Education & Experience
  – Bachelor's degree or higher in Computer Science, Software Engineering, or a related field.
  – Minimum 3 years of experience in operations or DevOps, with at least 1 year of AWS cloud operations experience.

• Technical Expertise
  – Deep understanding of Kubernetes core concepts (scheduling, networking, storage) and hands-on experience deploying and operating EKS clusters (e.g., eksctl, kubeadm). Experience with large-scale EKS clusters (≥50 nodes) is a plus.
  – Proficient with AWS CloudWatch for metrics collection, alert configuration, and log analysis (CloudWatch Logs Insights), and skilled in Metrics Explorer, Alarms, and Dashboards.
  – Experienced in building monitoring and alerting systems with Nightingale, integrating multiple data sources (Prometheus, CloudWatch) for unified visualization.
  – Familiar with FlashDuty incident management and AWS alert integration (e.g., Lambda-triggered FlashDuty events).
  – Proficient in the Prometheus + Grafana + Alertmanager stack and knowledgeable in OpenTelemetry for metrics and tracing.
  – Hands-on experience with at least one CI/CD toolchain (Argo CD, Jenkins) and with integrating it with AWS CodePipeline/CodeBuild.
  – Skilled in operating MySQL, Redis, RabbitMQ, etc., including their AWS managed counterparts (RDS, ElastiCache), and in designing high-availability architectures.
  – Strong problem-solving skills, with the ability to diagnose complex issues (EKS node network packet loss, RDS cross-AZ replication latency) using CloudWatch logs, Nightingale metrics, and tracing.

• Certifications
  – AWS Certified SysOps Administrator – Associate or AWS Certified Cloud Practitioner preferred.

• Soft Skills
  – Excellent communication and collaboration skills, with the ability to drive cross-team technical initiatives.
  – Strong documentation habits and a passion for knowledge sharing.