Challenge Rules
All submissions must be complete, reproducible, and directly executable by the organizers on the official hidden evaluation set. Each team must clearly specify its selected track and leaderboard category at submission time.
- No cascaded ASR-LLM-TTS pipelines
- No online APIs or remote model calls
- Real-time factor (RTF) must be no greater than 3.0
- Parameter-constrained submissions must use fewer than 1B inference-time parameters
- All external resources must be declared
1. Tracks and Categories
The challenge includes two tracks.
Track 1: Text-Context-Aware CoT-TTS
Systems receive textual dialogue history, target text, and reference speech, and must generate both a reasoning analysis and the target speech waveform.
Track 2: Audio-Context-Aware CoT-TTS
Systems receive continuous historical dialogue audio, target text, and reference speech, and must generate both a reasoning analysis and the target speech waveform.
Each track has two leaderboard categories.
Unrestricted Category
There is no strict model-size limit. This category is intended for comparison with large-scale systems.
Parameter-Constrained Category
The total number of parameters used during inference must be fewer than 1B. All loaded or invoked modules count toward this limit, including frozen models, vocoders, speech tokenizers, speaker encoders, auxiliary models, and post-processing models.
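The counting rule above can be sketched as a simple sum over every module that is loaded or invoked at inference time. This is only an illustration: the module names and parameter counts below are hypothetical, and the organizers' own counting procedure may differ in detail.

```python
# Minimal sketch of the parameter budget check, assuming each module's
# parameter count is already known. All names and numbers are hypothetical.

PARAM_LIMIT = 1_000_000_000  # "fewer than 1B" is a strict inequality


def total_inference_params(module_param_counts: dict) -> int:
    """Sum the parameters of every module loaded or invoked at inference."""
    return sum(module_param_counts.values())


def is_within_limit(module_param_counts: dict, limit: int = PARAM_LIMIT) -> bool:
    """Return True only if the total is strictly below the limit."""
    return total_inference_params(module_param_counts) < limit


# Hypothetical system: frozen and auxiliary modules count as well.
example_system = {
    "backbone": 820_000_000,
    "speech_tokenizer": 95_000_000,   # frozen, still counted
    "vocoder": 60_000_000,
    "speaker_encoder": 20_000_000,
}

# Adding a 5M post-processing model brings the total to exactly 1B,
# which no longer satisfies "fewer than 1B".
with_postprocessing = {**example_system, "post_processing": 5_000_000}

if __name__ == "__main__":
    print(total_inference_params(example_system), is_within_limit(example_system))
    print(total_inference_params(with_postprocessing), is_within_limit(with_postprocessing))
```

For PyTorch models, the per-module counts could be obtained with `sum(p.numel() for p in model.parameters())`, taking care to include every module the system loads, not only the trainable ones.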
2. Allowed Resources
Participants may use the official training and development data released by the organizers. Publicly available academic datasets and pretrained models are also allowed, provided that their licenses permit use for academic research.
External data must not contain, reconstruct, or derive from hidden evaluation sources.
All external datasets, pretrained models, synthetic data, data augmentation strategies, post-processing methods, and auxiliary resources must be clearly declared in the system description document.
3. Submission Requirements
Each valid submission must include:
- Trained model or checkpoints
- Inference code
- Runtime environment configuration
- System description document
The inference code must be directly executable on the official evaluation set and must generate both reasoning outputs and speech waveforms for all test samples.
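Before submitting, a team might verify that its inference run produced both outputs for every sample. The sketch below assumes outputs can be identified by sample ID; the rules quoted here do not specify an output naming scheme, so the IDs and function name are hypothetical.

```python
# Minimal sketch of an output-coverage check, assuming each test sample has
# an ID and each output can be mapped back to that ID (hypothetical scheme).


def missing_outputs(sample_ids, reasoning_ids, waveform_ids):
    """Return the sample IDs that lack a reasoning output or a waveform."""
    reasoning_ids = set(reasoning_ids)
    waveform_ids = set(waveform_ids)
    return {
        "reasoning": [s for s in sample_ids if s not in reasoning_ids],
        "waveform": [s for s in sample_ids if s not in waveform_ids],
    }


if __name__ == "__main__":
    report = missing_outputs(
        sample_ids=["utt_001", "utt_002", "utt_003"],
        reasoning_ids=["utt_001", "utt_002"],
        waveform_ids=["utt_001", "utt_003"],
    )
    print(report)
```

An empty list under both keys means every test sample has a reasoning output and a waveform.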
The runtime environment should include a Dockerfile, environment.yml, requirements.txt, or installation script.
All tokenizers, vocoders, speaker encoders, auxiliary files, and other resources required for inference must be included in the submitted package.
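A pre-submission check of the package layout could look like the sketch below. The environment file names `Dockerfile`, `environment.yml`, and `requirements.txt` come from the rules above; `install.sh`, `checkpoints`, `inference.py`, and `system_description.pdf` are hypothetical names chosen for illustration, since the rules do not prescribe them.

```python
# Minimal sketch of a package completeness check. Only the three environment
# file names are taken from the rules; every other name is an assumption.
from pathlib import Path

ENV_CONFIG_NAMES = ("Dockerfile", "environment.yml", "requirements.txt", "install.sh")
REQUIRED_ITEMS = ("checkpoints", "inference.py", "system_description.pdf")


def find_env_configs(package_dir):
    """List the runtime environment files present in the package root."""
    root = Path(package_dir)
    return [name for name in ENV_CONFIG_NAMES if (root / name).is_file()]


def check_package(package_dir, required=REQUIRED_ITEMS):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(package_dir)
    problems = [f"missing: {name}" for name in required if not (root / name).exists()]
    if not find_env_configs(package_dir):
        problems.append("no runtime environment configuration found")
    return problems


if __name__ == "__main__":
    for problem in check_package("."):
        print(problem)
```

This only checks that files exist. It does not confirm that the inference code runs offline or that every tokenizer and auxiliary file is actually bundled.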
4. Prohibited Systems and Invalid Submissions
Cascaded ASR-LLM-TTS pipelines are not allowed. Participants may not call separate off-the-shelf ASR, LLM, TTS, voice conversion, or speech enhancement modules as a cascaded pipeline during inference, even if these modules are wrapped into a single submitted system.
Online APIs or remote model calls are not allowed. Official evaluation will be conducted without internet access.
Manual inspection, use, or reverse engineering of hidden test labels is prohibited. Participants must not attempt to identify, retrieve, reconstruct, or use the original source media corresponding to hidden evaluation samples.
Submissions with missing files, invalid formats, severe content mismatch, extremely low speech quality, incomplete generation, or an officially measured RTF greater than 3.0 may be excluded from official evaluation and final ranking.
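The RTF threshold can be checked locally with a sketch like the one below. It assumes the common definition of real-time factor, processing time divided by the duration of the generated audio. The rules quoted here do not state how the official RTF is measured (hardware, averaging, or whether reasoning generation time is included), so the function names and measurement procedure are illustrative only.

```python
# Minimal sketch of an RTF check, assuming RTF = processing time / audio duration.
# The official measurement procedure is not specified in the rules quoted here.
import time

RTF_LIMIT = 3.0  # "no greater than 3.0", so exactly 3.0 still passes


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def meets_rtf_limit(processing_seconds: float, audio_seconds: float,
                    limit: float = RTF_LIMIT) -> bool:
    return real_time_factor(processing_seconds, audio_seconds) <= limit


def timed_rtf(synthesize, sample_rate: int, *args, **kwargs) -> float:
    """Time one call to a hypothetical `synthesize` function that returns
    a sequence of audio samples, and return the resulting RTF."""
    start = time.perf_counter()
    samples = synthesize(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # 12 s of processing for 4 s of audio gives an RTF of 3.0 (passes).
    print(real_time_factor(12.0, 4.0), meets_rtf_limit(12.0, 4.0))
```

Local measurements depend heavily on hardware, so a result close to the limit on a team's own machine is no guarantee of passing in the official environment.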
5. Reasoning Output Requirements
Each system must output a reasoning analysis for every test sample.
The reasoning should demonstrate:
- Contextual understanding
- Internal logical coherence
- Information richness
Generic or template-like reasoning may receive a low reasoning-informativeness score even if it appears superficially correct.
6. Final Ranking Eligibility
A submission is eligible for final ranking only if it satisfies all of the following conditions:
- Runs successfully in the official evaluation environment
- Generates both reasoning and speech outputs
- Follows the selected track and category constraints
- Declares all external resources
- Does not use prohibited cascaded systems or online APIs
- Passes the basic validity check before detailed objective, LLM-based, and human evaluation