Senior Operations and Maintenance Engineer

Best Web3

15.8-21K SAR [Monthly]
On-site - China | 3-5 Yrs Exp | Bachelor | Full-time

Job Description

Responsibilities

1. Kubernetes and AWS Cloud Resource Operations
- Manage the full lifecycle of production and testing Kubernetes clusters (including EKS managed clusters), ensuring high availability (SLA ≥ 99.95%).
- Optimize scheduling of AWS resources (EC2, RDS, S3, ELB, VPC, etc.) with Kubernetes clusters to drive cost savings (e.g., Spot Instances, auto-scaling strategies).
- Lead version upgrades and stability tuning for EKS ecosystem components (CoreDNS, Ingress Controller, AWS Load Balancer Controller), troubleshooting network, storage, and compute anomalies in cloud-native environments.

2. Monitoring and Observability Platform
- Build an end-to-end monitoring and event management system covering containers, hosts, middleware, and AWS resources using Nightingale, Prometheus, Grafana, FlashDuty, and AWS CloudWatch.
- AWS Monitoring:
  • Configure monitoring for AWS resources (EC2, RDS, S3, ELB, Lambda) via CloudWatch to collect metrics (CPU utilization, memory usage, disk I/O, request latency) and logs (CloudWatch Logs).
  • Define custom AWS alert rules (e.g., RDS connection limits, S3 traffic spikes, Lambda error rates) and forward alerts to Nightingale and FlashDuty for cross-platform aggregation.
  • Analyze CloudWatch metrics and Logs Insights to identify performance bottlenecks (EC2 CPU contention, RDS slow queries) and implement optimizations (instance right-sizing, SQL index tuning).
- Nightingale & FlashDuty:
  • Maintain Nightingale’s time-series database and alerting engine, integrating Prometheus, CloudWatch, and other data sources for unified dashboards and alert consolidation.
  • Operate the FlashDuty incident center, define alert severities (P0–P4) and dispatch policies (e.g., route EC2 failures to Infrastructure, RDS issues to Database), and drive incident resolution efficiency.

3. CI/CD Pipelines & AWS Cloud-Native Toolchain
- Integrate DevOps toolchains (Jenkins, GitLab CI, Argo CD, FluxCD) with AWS services (CodePipeline, CodeBuild, EKS) to design CI/CD workflows for hybrid cloud environments.
- Automate container image builds (ECR), artifact storage (S3), and environment canary deployments (EKS Blue/Green), reducing release cycles.
- Enhance pipeline performance and cost efficiency by leveraging AWS Spot Instances and caching strategies to improve developer experience.

4. Middleware & AWS Service Reliability
- Maintain core middleware clusters (MySQL, Redis, RabbitMQ, Kafka, Elasticsearch) including managed services (RDS for MySQL, ElastiCache for Redis) and design high-availability architectures.
- Monitor middleware-to-AWS interactions (RDS connection pool usage, Kafka-to-S3 replication latency), proactively alert on cross-service bottlenecks, and implement remediation.
- Standardize middleware and AWS configurations (RDS parameter tuning, ElastiCache node selection), and produce O&M SOPs and runbooks.

5. Cross-Team Collaboration & AWS Enablement
- Collaborate with development and QA teams to provide AWS best practices (IAM policies, security group design), containerization guidance, and CI/CD support.
- Document AWS monitoring and cloud-native operations best practices, and conduct internal trainings (CloudWatch metrics analysis, Nightingale-AWS integration troubleshooting).

Qualifications

• Education & Experience
  – Bachelor’s degree or higher in Computer Science, Software Engineering, or related field.
  – Minimum 3 years of experience in operations or DevOps, with at least 1 year of AWS cloud operations experience.

• Technical Expertise
  – Deep understanding of Kubernetes core concepts (scheduling, networking, storage) and hands-on experience deploying and operating EKS clusters (eksctl, EKS kubeadm). Experience with large-scale EKS clusters (≥50 nodes) is a plus.
  – Proficient with AWS CloudWatch for metrics collection, alert configuration, and log analysis (CloudWatch Logs Insights), and skilled in Metrics Explorer, Alarms, and Dashboards.
  – Experienced in building monitoring and alerting systems with Nightingale, integrating multiple data sources (Prometheus, CloudWatch) for unified visualization.
  – Familiar with FlashDuty incident management and AWS alert integration (e.g., Lambda-triggered FlashDuty events).
  – Proficient in the Prometheus + Grafana + Alertmanager stack and knowledgeable in OpenTelemetry for metrics and tracing.
  – Hands-on experience with at least one CI/CD toolchain (Argo CD, Jenkins) and integrating with AWS CodePipeline/CodeBuild.
  – Skilled in operating MySQL, Redis, RabbitMQ, etc., with AWS managed services (RDS, ElastiCache) and designing high-availability architectures.
  – Strong problem-solving skills with the ability to diagnose complex issues (EKS node network packet loss, RDS cross-AZ replication latency) using CloudWatch logs, Nightingale metrics, and tracing.

• Certifications
  – AWS Certified SysOps Administrator – Associate or AWS Certified Cloud Practitioner preferred.

• Soft Skills
  – Excellent communication and collaboration skills, with the ability to drive cross-team technical initiatives.
  – Strong documentation habits and a passion for knowledge sharing.

Miko M

HR Manager, Best Web3

Reply 0 Times Today

Working Location

Shenzhen, Guangdong Province, China

Posted on 22 January 2026


Bossjob Safety Reminder

If the job requires working abroad, please stay alert and beware of fraud.

If, during your job search, you encounter an employer who does any of the following, please report them immediately:

  • withholds your identity documents,
  • requires you to provide a guarantee or collects your property,
  • forces you to invest or raise funds,
  • collects illegal benefits,
  • or engages in other illegal conduct.