Responsibilities:
Platform Reliability & Automation
- Design, implement, and operate reliable, scalable, and observable data platforms.
- Automate incident triage, remediation, and postmortems using GenAI-powered tools.
- Develop intelligent runbooks and self-healing workflows using LLMs.
GenAI-Enabled SRE Practices
- Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA (root cause analysis).
- Fine-tune LLMs or engineer prompts for specific use cases such as summarizing logs, interpreting metrics, or generating remediation steps.
- Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.
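As a rough illustration of the retrieval pattern described above: past incident write-ups are embedded, the ones most similar to the current alert are retrieved, and they are folded into the LLM prompt. A production setup would use a real embedding model and a vector store such as FAISS or Weaviate; here a toy bag-of-words cosine similarity and invented incident texts stand in.

```python
# Sketch of retrieval-augmented prompting over incident history.
# Toy similarity only; swap in real embeddings + FAISS/Weaviate in practice.
import math
from collections import Counter

INCIDENT_HISTORY = [  # hypothetical past incidents
    "Kafka consumer lag spiked after broker rolling restart; fixed by rebalancing partitions.",
    "S3 ingestion failures caused by expired IAM credentials; rotated keys via Terraform.",
    "Spark job OOM on skewed join; mitigated with salting and larger executor memory.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k incidents most similar to the query."""
    q = embed(query)
    ranked = sorted(INCIDENT_HISTORY, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(alert: str) -> str:
    """Assemble the retrieved context and the live alert into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(alert))
    return f"Past incidents:\n{context}\n\nCurrent alert: {alert}\nSuggest remediation steps."

prompt = build_prompt("Kafka consumer lag is climbing on the ingestion cluster")
```

The same shape applies whether the corpus is postmortems, runbooks, or telemetry summaries; only the embedding and store change.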
Observability & Anomaly Detection
- Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).
- Build systems for natural language querying of platform health and pipeline performance.
- Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.
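A minimal sketch of the natural-language querying idea above: in production an LLM would translate the question into PromQL or SQL against the observability stack; here a keyword-to-query lookup stands in, and every metric name is hypothetical.

```python
# Sketch of natural-language querying of platform health.
# Keyword lookup stands in for LLM translation; metric names are hypothetical.
NL_TO_QUERY = {
    "error rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "pipeline lag": "max(kafka_consumer_lag_seconds) by (topic)",
    "p99 latency": "histogram_quantile(0.99, rate(request_duration_bucket[5m]))",
}

def translate(question: str) -> str:
    """Map a plain-English question to a PromQL-style query string."""
    q = question.lower()
    for phrase, query in NL_TO_QUERY.items():
        if phrase in q:
            return query
    return "# no matching metric; fall back to LLM translation"

query = translate("What is the current error rate on the ingestion API?")
```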
CI/CD & Risk Management
- Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.
- Use LLMs to assess the risk of configuration or schema changes before production rollout.
- Automate validation and rollback strategies based on historical outcomes.
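To make the control flow of such a deployment guardrail concrete: a change is scored against historical outcomes before rollout, and the score decides between blocking and deploying with rollback armed. In the role described above an LLM would produce the assessment; here a simple heuristic over hypothetical historical failure rates stands in.

```python
# Sketch of a pre-deployment risk gate. Heuristic scoring stands in for an
# LLM assessment; failure rates and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class Change:
    kind: str            # e.g. "schema", "config", "code"
    drops_column: bool   # destructive schema change?
    touched_tables: int

# hypothetical historical failure rates by change kind
HISTORICAL_FAILURE_RATE = {"schema": 0.18, "config": 0.07, "code": 0.04}

def assess_risk(change: Change) -> float:
    """Blend historical failure rate with change-specific signals (0..1)."""
    score = HISTORICAL_FAILURE_RATE.get(change.kind, 0.10)
    if change.drops_column:
        score += 0.40  # dropping a column has a wide blast radius
    score += min(change.touched_tables, 10) * 0.02
    return min(score, 1.0)

def gate(change: Change, threshold: float = 0.5) -> str:
    """Block risky changes; allow the rest with automated rollback armed."""
    if assess_risk(change) >= threshold:
        return "block: require manual review and staged rollout"
    return "allow: deploy with automated rollback armed"

decision = gate(Change(kind="schema", drops_column=True, touched_tables=3))
```

The gate slots naturally into a CI/CD stage: a "block" verdict fails the pipeline, an "allow" verdict proceeds with the rollback hook registered.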
Requirements:
- 5+ years in SRE, DevOps, or Data Engineering roles with a strong focus on automation and observability.
- Solid experience in cloud-native data platforms (e.g., Databricks, Glue, Kafka, Flink, S3, Lambda).
- Proven experience using or integrating GenAI tools (OpenAI, Claude, Hugging Face Transformers).
- Proficiency in Python or Scala; experience with Spark and Airflow a plus.
- Familiarity with LLM techniques: prompt engineering, embeddings, retrieval-augmented generation (RAG).
- Hands-on experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Datadog).
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
Nice to have:
- Experience fine-tuning LLMs or integrating GenAI agents into production systems.
- Familiarity with vector databases (e.g., Pinecone, Qdrant, FAISS).
- Knowledge of data quality frameworks and lineage tools (e.g., Deequ, Great Expectations, Amundsen, Unity Catalog).
- Understanding of ITIL/incident management frameworks.
- Strong communication and documentation skills, especially in on-call and postmortem environments.
HOW TO APPLY: Please send your CV to the consultant in charge:
E-mail: anh.duong@ev-search.com
All applications will be considered without regard to race, color, religion, sex (including pregnancy and gender identity), national origin, political affiliation, sexual orientation, marital status, disability, genetic information, age, membership in an employee organization, parental status, military service, or any other non-merit factor.

