Senior IT Operations Engineer

Best Web3

15.8-21K SAR [Monthly]

On-site - China | 3-5 Yrs Exp | Bachelor | Full-time

Job Description

[Job Responsibilities]

Kubernetes Ecosystem and AWS Cloud Resource Operations:

Responsible for the full lifecycle management of production and test environment Kubernetes clusters (including EKS managed clusters), ensuring cluster high availability (SLA ≥ 99.95%).

Optimize AWS cloud resources (EC2, RDS, S3, ELB, VPC, etc.) and Kubernetes cluster resource scheduling to drive cost optimization (e.g., Spot Instances, auto scaling strategies).

Lead version upgrades and stability tuning of EKS ecosystem components (e.g., CoreDNS, Ingress Controller, AWS Load Balancer Controller), addressing network, storage, and compute resource anomalies in cloud-native scenarios.

Monitoring and Observability System Construction:

Build a full-stack monitoring and incident management system covering containers, hosts, middleware, and AWS cloud resources using a tech stack based on Nightingale + Prometheus + Grafana + FlashDuty + AWS CloudWatch.

AWS Monitoring:

Configure monitoring for AWS resources (EC2, RDS, S3, ELB, Lambda, etc.), collecting metrics (CPU utilization, memory usage, disk I/O, request latency) and logs via CloudWatch.

Create AWS-specific alert rules (e.g., RDS connection limits, S3 bucket traffic spikes, Lambda function error rate increases) and synchronize alerts to Nightingale and FlashDuty for cross-platform aggregation.

Analyze AWS resource monitoring data (CloudWatch Metrics, Logs Insights) to identify performance bottlenecks (e.g., EC2 CPU contention, RDS slow queries) and drive optimization measures (e.g., adjust instance types, optimize SQL indexes).

Nightingale & FlashDuty:

Maintain Nightingale’s time-series database and alert rule engine, integrate multiple data sources (Prometheus, CloudWatch), and provide unified metric visualization and alert consolidation.

Operate the FlashDuty incident center: define alert severity levels (P0–P4) and dispatch strategies (e.g., EC2 incidents to the Infrastructure team, RDS incidents to the Database team), track incident resolution, and improve response efficiency.

CI/CD Pipeline and AWS Cloud-Native Toolchain Integration:

Integrate DevOps toolchains (Jenkins, GitLab CI, Argo CD, FluxCD) with AWS services (CodePipeline, CodeBuild, EKS) and design CI/CD workflows for hybrid cloud environments.

Automate container image builds and pushes to ECR, artifact storage in S3, and blue/green deployments to EKS to shorten release cycles.

Optimize pipeline performance by leveraging AWS Spot Instances and caching strategies to reduce build costs and enhance developer experience.

Middleware and AWS Service Stability Assurance:

Maintain core middleware clusters (MySQL, Redis, RabbitMQ, Kafka, Elasticsearch), including AWS managed services (RDS for MySQL, ElastiCache for Redis), and design high-availability architectures.

Monitor interactions between middleware and AWS services (e.g., RDS connection pool utilization, Kafka-to-S3 message sync latency), provide early warnings, and resolve cross-service performance bottlenecks.

Standardize configurations for middleware and AWS services (e.g., RDS parameter group tuning, ElastiCache node type selection) and produce operations SOPs and incident response guides.

Cross-Team Collaboration and AWS Technology Enablement:

Work with development and QA teams to provide technical guidance on AWS resource usage best practices (IAM permissions, security group configurations), containerization, and CI/CD processes.

Document AWS monitoring and cloud-native operations best practices, create technical documentation, and conduct internal training sessions (e.g., interpreting CloudWatch metrics, troubleshooting Nightingale-AWS integrations).

[Qualifications]

Basic Requirements: Bachelor’s degree or above in Computer Science, Software Engineering, or related field; 3+ years of operations/DevOps experience; 1+ year of AWS cloud operations experience.

Technical Expertise:

Deep understanding of Kubernetes core principles (scheduling, networking, storage) and experience deploying and managing Kubernetes clusters (e.g., eksctl for EKS, kubeadm for self-managed clusters); experience with large-scale EKS clusters (≥ 50 nodes) is a plus.

Proficient in using AWS CloudWatch for metric collection, alert configuration, and log analysis (CloudWatch Logs Insights); skilled with CloudWatch Metrics Explorer, Alarms, and Dashboards.

Experienced in building monitoring and alerting systems with Nightingale, integrating multiple data sources (Prometheus, CloudWatch), and providing unified visualizations.

Familiar with FlashDuty incident management workflows and configuring integrations with AWS alerts (e.g., triggering FlashDuty events via Lambda).

Hands-on experience with the Prometheus + Grafana + Alertmanager stack and OpenTelemetry for data collection and distributed tracing.

Proficient with at least one CI/CD toolchain (Argo CD, Jenkins) and practical experience integrating with AWS CodePipeline and CodeBuild.

Skilled in operating middleware (MySQL, Redis, RabbitMQ) alongside AWS managed services (RDS, ElastiCache), including high-availability architecture design.

Strong problem-solving skills with the ability to quickly troubleshoot complex issues (e.g., EKS node network packet loss, RDS cross-AZ sync latency) using CloudWatch logs, Nightingale metrics, and tracing data to pinpoint root causes.

AWS Certification: AWS Certified SysOps Administrator – Associate or AWS Certified Cloud Practitioner is preferred.

Soft Skills: Excellent communication and collaboration abilities to drive cross-team technical initiatives; strong technical documentation skills and a willingness to share knowledge.