Job Description
Job Responsibilities
Build an automated evaluation process around conversational AI agents (text/voice/multimodal), covering dimensions such as functionality, performance, and robustness.
Design and maintain a test case library covering areas including but not limited to intent recognition, knowledge recall, long-dialogue consistency, multi-turn context, interruption and restart, and noise robustness.
Establish monitoring and regression frameworks for latency (TTFT, time to first byte, time to first audio frame, end-to-end), stability, resource usage, and other metrics.
Conduct subjective and objective evaluations of visual generation and audio-video synchronization systems (e.g., lip sync, video smoothness, audio-video delay, and compliance of generated content).
Reproduce and localize bugs, write clear reproduction steps with data samples, and work with algorithm engineers to drive fixes.
Track industry and open-source evaluation benchmarks and the latest papers, and bring reusable methods into the team’s evaluation system.
Job Requirements
Bachelor's degree or above (completed or in progress) in a relevant major such as Computer Science, Artificial Intelligence, Electronics, or Automation; able to commit to at least 4 days on-site per week for a continuous internship of at least 3 months.
Proficient in Python and able to independently write automation scripts and testing frameworks; familiar with at least one of pytest / asyncio / websocket / requests.
Understand the basic principles and common evaluation metrics (e.g., CER/WER, BLEU, RTF, FID) of at least one of the following categories: LLM, ASR, TTS, or diffusion models.
Strong data sense: able to identify patterns in large volumes of logs and samples, with clear logical thinking and precise written communication.
Proactive and self-driven; able to move tasks forward independently in a fast-paced small team.
Bonus Points (any one of these is sufficient)
Experience in evaluating or tuning LLM / dialogue systems / voice systems.
Familiarity with audio and video processing toolchains (ffmpeg, pydub, librosa, etc.) or image quality evaluation methods.
Experience using evaluation frameworks such as LLM-as-a-Judge, Ragas, DeepEval, lm-eval-harness, etc.
Experience working on Linux GPU servers and comfortable with tmux / docker / git.
Personal open-source projects, a technical blog, or participation in Kaggle or other evaluation competitions.
What We Offer
Exposure to a cutting-edge multimodal agent technology stack and the chance to help take a product from 0 to 1.
Direct collaboration with experienced senior engineers, with no intermediaries.
Outstanding performers will be given priority for a full-time return offer or offered long-term collaboration opportunities.
Flexible working arrangements (remote).
