Job description
1. Job Highlights
1.1 Help build an AI orchestration and automation platform from scratch (0→1) and scale it out (1→N): RAG, tool invocation, agent collaboration, asynchronous orchestration.
1.2 Introduce Diffy or a similar "shadow traffic replay + response difference comparison" solution for model/version regression detection and gray (canary) release.
1.3 Work with multiple business lines (IM, customer service, marketing automation, data processing, etc.) to deliver a real production-grade closed loop.
2. Main Responsibilities
2.1 AI/LLM Workflow Orchestration
Design and implement multi-step reasoning, agent collaboration, tool invocation (Tool-Calling/Function-Calling), asynchronous task queues, and compensation mechanisms.
Build and optimize RAG pipelines: data ingestion, chunking and vectorization, retrieval/reranking, context compression, caching, and cost reduction.
2.2 Evaluation and Quality Assurance
Establish an automated evaluation and alignment system (benchmark sets, Ragas/G-Eval/self-developed metrics), and integrate it with A/B testing and online monitoring.
Use Diffy (or equivalent solutions) for shadow traffic replay and response difference comparison to identify regression risks in model/prompt/service upgrades; support gray/canary releases and rapid rollbacks.
2.3 Engineering and Observability
Build model/prompt version management, feature and data version management, experiment tracking (MLflow/W&B), and audit logs.
Establish end-to-end observability: latency, error rates, prompt/context length, hit rates, cost monitoring (tokens/$).
2.4 Platform and Integration
Expose workflows as APIs/SDKs/microservices; integrate with business backends (Go/PHP/Node), queues (Kafka/RabbitMQ), storage (Postgres/Redis/object storage), and vector databases (Milvus/Qdrant/pgvector).
Implement security and compliance (data masking, PII protection, auditing, rate limiting and quotas, model governance).
Job Requirements:
3. Qualifications (Mandatory)
3.1 3+ years of backend or data/platform engineering experience, and 1-2 years of hands-on experience in LLM/generative AI projects.
3.2 Familiar with LLM application engineering: prompt engineering, function/tool invocation, dialogue state management, memory, structured output, alignment, and evaluation.
3.3 Familiar with at least one orchestration framework: LangChain/LangGraph, LlamaIndex, Temporal/Prefect/Airflow, or a self-developed framework with DAG, state machine, or compensation mechanisms.
3.4 Practical end-to-end RAG experience: data cleaning, vectorization, retrieval, reranking, and evaluation; familiar with at least one of Milvus/Qdrant/pgvector.
3.5 Experience with Diffy or an equivalent online replay and difference comparison approach (e.g., shadow traffic, recording/replay, regression output comparison, gray release, and automated acceptance).
3.6 Solid engineering skills: Docker, CI/CD, Git workflows, logging and metrics systems (any of OpenTelemetry/Prometheus/Grafana).
3.7 Proficient in at least one primary language (Go, Python, or TypeScript; more than one is welcome); able to write reliable services and tests.
3.8 Good remote collaboration and documentation skills; able to deliver work and review results against metrics.
4. Preferred Qualifications (any one is a plus)
4.1 In-depth hands-on experience with Diffy (or with integrating microservice gateways for shadow traffic, traffic splitting, and comparison).
4.2 Experience in implementing LLMOps/evaluation platforms (Arize Phoenix, Evidently, PromptLayer, OpenAI Evals, Ragas).
4.3 Real-world project experience with agent frameworks (LangGraph, AutoGen/CrewAI, GraphRAG, tool ecosystems).
4.4 Familiar with security and compliance (data masking, permission domains, PDPA/GDPR), with practical experience in human review, machine review, and content interception (Llama Guard/Moderation).
4.5 Familiar with IM/customer service/marketing automation business domains, or experience in multilingual scenarios (Chinese/English/Vietnamese, etc.).
4.6 Experience in cost optimization: caching, retrieval compression, model mixing/routing, multi-provider switching (OpenAI/Anthropic/Google/local models).
5. Main Technical Stack (any one or more)
5.1 Orchestration/Agent: LangChain/LangGraph, LlamaIndex, Temporal/Prefect/Airflow
5.2 Models and Evaluation: OpenAI/Anthropic/Google, vLLM/Ollama, Ragas, G-Eval, MLflow/W&B
5.3 Retrieval: Milvus, Qdrant, pgvector, Elasticsearch, reranking (bge/multilingual/E5, etc.)
5.4 Services: Go/Python/TypeScript, gRPC/REST, Redis, Postgres, Kafka/RabbitMQ, Docker/K8s
5.5 Observability: OpenTelemetry, Prometheus, Grafana, ELK/ClickHouse
5.6 Diffy/replay comparison: Twitter Diffy or similar shadow traffic/replay + difference comparison systems
Benefits:
Fully remote
