Job description
Job Responsibilities:
1️⃣ AI/LLM Workflow Orchestration
Design and implement multi-step reasoning, agent collaboration, tool/function calling, asynchronous task queues, and compensation (rollback) mechanisms. Build and optimize RAG pipelines: data ingestion, chunking and embedding, retrieval and reranking, context compression, caching, and cost reduction.
2️⃣ Evaluation and Quality Assurance
Establish an automated evaluation and alignment system (benchmark sets, Ragas/G-Eval/custom metrics), and integrate A/B testing and online monitoring. Use Diffy (or an equivalent solution) for shadow traffic replay and response diffing to identify regression risks in model, prompt, and service upgrades; support gray/canary releases and quick rollbacks.
3️⃣ Engineering and Observability
Build version management for models, prompts, features, and data, along with experiment tracking (MLflow/W&B) and audit logs. Establish end-to-end observability: latency, error rates, prompt/context length, hit rates, and cost monitoring (tokens/$).
4️⃣ Platform and Integration
Expose workflows as APIs/SDKs/microservices; integrate with business backends (Go/PHP/Node), queues (Kafka/RabbitMQ), storage (Postgres/Redis/object storage), and vector databases (Milvus/Qdrant/pgvector). Implement security and compliance measures (data masking, PII protection, auditing, rate limiting and quota management, model governance).
Job Requirements:
1️⃣ 3+ years of backend or data/platform engineering experience, and 1-2 years of hands-on experience in LLM/generative AI projects.
2️⃣ Familiar with LLM application engineering: prompt engineering, function/tool calling, dialogue state management, memory, structured output, alignment, and evaluation.
3️⃣ Familiar with at least one orchestration framework: LangChain/LangGraph, LlamaIndex, Temporal/Prefect/Airflow, or a self-built orchestrator with DAG/state machine/compensation mechanisms.
4️⃣ Practical experience with RAG: end-to-end experience in data cleaning → embedding → retrieval → reranking → evaluation; familiar with at least one of Milvus/Qdrant/pgvector.
5️⃣ Experience with Diffy or an equivalent tool for online traffic replay and response diffing (e.g., shadow traffic, record/replay, regression output comparison, gray release, and automated acceptance testing).
6️⃣ Solid engineering skills: Docker, CI/CD, Git workflows, logging and metrics systems (any of OpenTelemetry/Prometheus/Grafana).
7️⃣ Proficient in at least one major programming language (Go, Python, or TypeScript; more than one is a plus); able to write reliable services and tests.
8️⃣ Good remote collaboration and documentation skills; able to deliver and review work against metrics.
Preferred Qualifications (any of the following is a plus)
1️⃣ Deep practical experience with Diffy (or with integrating microservice gateways for shadow traffic, traffic splitting, and response comparison).
2️⃣ Experience implementing LLMOps/evaluation platforms (Arize Phoenix, Evidently, PromptLayer, OpenAI Evals, Ragas).
3️⃣ Hands-on project experience with agent frameworks (LangGraph, AutoGen/CrewAI, GraphRAG, tool ecosystems).
4️⃣ Familiar with security and compliance (data masking, permission scoping, PDPA/GDPR), with practical experience in human review, automated review, and content interception (Llama Guard/Moderation).
5️⃣ Familiar with IM/customer service/marketing automation business domains, or experience in multilingual scenarios (Chinese/English/Vietnamese, etc.).
6️⃣ Experience in cost optimization: caching, retrieval compression, model mixing and routing, and multi-provider switching (OpenAI/Anthropic/Google/local models).
