Rules | ISCSLP 2026

Challenge Rules

All submissions must be complete, reproducible, and directly executable by the organizers on the official hidden evaluation set. Each team must clearly specify its selected track and leaderboard category at submission time.

Key Restrictions

No cascaded ASR-LLM-TTS pipelines
No online APIs or remote model calls
Real-time factor (RTF) must be no greater than 3.0
Parameter-constrained submissions must use fewer than 1B inference-time parameters
All external resources must be declared

1. Tracks and Categories

The challenge includes two tracks.

Track 1: Text-Context-Aware CoT-TTS

Systems receive textual dialogue history, target text, and reference speech, and must generate both a reasoning analysis and the target speech waveform.

Track 2: Audio-Context-Aware CoT-TTS

Systems receive continuous historical dialogue audio, target text, and reference speech, and must generate both a reasoning analysis and the target speech waveform.
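
The inputs and outputs of the two tracks can be pictured as simple records. The sketch below is illustrative only: the class names, field names, and the use of file paths are assumptions, not the official data format, which is defined by the organizers' data release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Track1Input:
    """Hypothetical view of one Track 1 sample (field names are assumptions)."""
    dialogue_history: List[str]      # textual dialogue history, one turn per entry
    target_text: str                 # text to be synthesized
    reference_speech_path: str       # reference speech for the target speaker


@dataclass
class Track2Input:
    """Hypothetical view of one Track 2 sample (field names are assumptions)."""
    history_audio_path: str          # continuous historical dialogue audio
    target_text: str
    reference_speech_path: str


@dataclass
class SystemOutput:
    """Both tracks require a reasoning analysis and a target waveform."""
    reasoning: str                   # reasoning analysis for this sample
    waveform_path: str               # generated target speech


sample = Track1Input(
    dialogue_history=["A: Did you hear back from them?", "B: Not yet."],
    target_text="I am sure they will call tomorrow.",
    reference_speech_path="ref/speaker_b.wav",
)
```

The only difference between the two input records is the form of the dialogue context: text turns in Track 1, continuous audio in Track 2.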

Each track has two leaderboard categories.

Unrestricted Category

There is no strict model-size limit. This category is intended for comparison with large-scale systems.

Parameter-Constrained Category

The total number of parameters used during inference must be fewer than 1B. All loaded or invoked modules count toward this limit, including frozen models, vocoders, speech tokenizers, speaker encoders, auxiliary models, and post-processing models.
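
As a rough self-check, a team can sum the parameter counts of every module its system loads and compare the total against the limit. The sketch below assumes that "1B" means 1,000,000,000 parameters and uses made-up module names and sizes; in PyTorch, each count is typically obtained with `sum(p.numel() for p in model.parameters())`.

```python
ONE_BILLION = 1_000_000_000  # assumption: "1B" is read as 10**9 parameters


def total_inference_params(module_params):
    """Sum parameter counts over every module loaded or invoked at inference.

    `module_params` maps a module name to its parameter count. Frozen models
    and auxiliary components must be listed too, since they count toward the limit.
    """
    return sum(module_params.values())


def fits_parameter_constrained(module_params, limit=ONE_BILLION):
    # "Fewer than 1B" is a strict inequality: exactly 1B does not qualify.
    return total_inference_params(module_params) < limit


# Hypothetical system; the names and sizes are illustrative only.
example_modules = {
    "speech_lm": 700_000_000,
    "speech_tokenizer": 120_000_000,
    "speaker_encoder": 25_000_000,
    "vocoder": 55_000_000,
}

print(total_inference_params(example_modules))
print(fits_parameter_constrained(example_modules))
```

Adding a frozen 100M-parameter post-processing model to this hypothetical system would bring the total to exactly 1B, which no longer satisfies "fewer than 1B".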

2. Allowed Resources

Participants may use the official training and development data released by the organizers. Publicly available academic datasets and publicly available pretrained models are also allowed, provided that their licenses permit use in academic research.

External data must not contain, reconstruct, or derive from hidden evaluation sources.

All external datasets, pretrained models, synthetic data, data augmentation strategies, post-processing methods, and auxiliary resources must be clearly declared in the system description document.

3. Submission Requirements

Each valid submission must include:

Trained model or checkpoints
Inference code
Runtime environment configuration
System description document

The inference code must be directly executable on the official evaluation set and must generate both reasoning outputs and speech waveforms for all test samples.

The runtime environment configuration should include at least one of the following: a Dockerfile, environment.yml, requirements.txt, or an installation script.

All tokenizers, vocoders, speaker encoders, auxiliary files, and other resources required for inference must be included in the submitted package.
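
Before uploading, a team may want to verify that its package contains the four required components. The sketch below is one possible pre-submission check; the directory layout and file names (`checkpoints/`, `inference.py`, `install.sh`, `system_description.pdf`) are hypothetical and are not prescribed by the organizers.

```python
from pathlib import Path

# Any one of these is taken as the runtime environment configuration.
# "install.sh" is an assumed name for the installation script.
ENV_FILES = ("Dockerfile", "environment.yml", "requirements.txt", "install.sh")


def check_package(root,
                  checkpoint_dir="checkpoints",
                  inference_script="inference.py",
                  description="system_description.pdf"):
    """Return a list of problems found in the package; an empty list means none."""
    root = Path(root)
    problems = []

    ckpt = root / checkpoint_dir
    if not ckpt.is_dir() or not any(ckpt.iterdir()):
        problems.append("missing trained model or checkpoints")

    if not (root / inference_script).is_file():
        problems.append("missing inference code")

    if not any((root / name).is_file() for name in ENV_FILES):
        problems.append("missing runtime environment configuration")

    if not (root / description).is_file():
        problems.append("missing system description document")

    return problems
```

A check of this kind only confirms that files are present. It does not confirm that every tokenizer, vocoder, or auxiliary resource needed at inference is actually bundled, which still requires a clean-environment test run.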

4. Prohibited Systems and Invalid Submissions

Cascaded ASR-LLM-TTS pipelines are not allowed. During inference, participants may not chain separate off-the-shelf ASR, LLM, TTS, voice conversion, or speech enhancement modules into a cascaded pipeline, even if these modules are wrapped into a single submitted system.

Online APIs or remote model calls are not allowed. Official evaluation will be conducted without internet access.
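
Because evaluation runs without internet access, it can help to rehearse inference locally with networking disabled, so that hidden downloads or remote calls surface early. The sketch below is a local dry-run aid only, not the organizers' enforcement mechanism: it blocks socket creation from Python code in the current process, and it does not cover subprocesses or native libraries that open connections on their own. Running inside a container with networking disabled is a stricter test.

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Make any attempt to create a Python socket fail while the block is active."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted during offline dry run")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real socket class, even if inference raised.
        socket.socket = original


# Usage: wrap the call to your own inference entry point, for example
#     with no_network():
#         run_inference(...)   # run_inference is a placeholder name
```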

Manual inspection, use, or reverse engineering of hidden test labels is prohibited. Participants must not attempt to identify, retrieve, reconstruct, or use the original source media corresponding to hidden evaluation samples.

Submissions with missing files, invalid formats, severe content mismatch, extremely low speech quality, incomplete generation, or an officially measured RTF greater than 3.0 may be excluded from official evaluation and final ranking.
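
RTF (real-time factor) is conventionally the processing time divided by the duration of the generated audio, so an RTF of 3.0 means that one second of speech takes three seconds to produce. The sketch below shows how a team might estimate it locally. It assumes this conventional definition and a `synthesize` callable that returns a sequence of samples; the official measurement procedure and hardware are determined by the organizers and may yield different values.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def timed_rtf(synthesize, sample_rate, *args, **kwargs):
    """Time one call to `synthesize` (a placeholder for your own function)
    and return the RTF for that call.

    `synthesize` is assumed to return a 1-D sequence of audio samples.
    """
    start = time.perf_counter()
    waveform = synthesize(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


# 6 seconds of compute for 2 seconds of speech sits exactly at the limit.
print(real_time_factor(6.0, 2.0) <= 3.0)
```

Timing should include everything the system does for a sample, including the reasoning step, since the limit applies to the submitted system as a whole in this reading of the rule.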

5. Reasoning Output Requirements

Each system must output a reasoning analysis for every test sample.

The reasoning should demonstrate:

Contextual understanding
Internal logical coherence
Information richness

Generic or template-like reasoning may receive a low reasoning-informativeness score even if it appears superficially correct.

6. Final Ranking Eligibility

A submission is eligible for final ranking only if it satisfies all of the following conditions:

Runs successfully in the official evaluation environment
Generates both reasoning and speech outputs
Follows the selected track and category constraints
Declares all external resources
Does not use prohibited cascaded systems or online APIs
Passes the basic validity check before detailed objective, LLM-based, and human evaluation
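
The eligibility conditions are conjunctive: failing any single one removes a submission from the final ranking. A team could track them with a checklist such as the sketch below, in which the class and field names are illustrative paraphrases of the conditions above rather than official terminology.

```python
from dataclasses import dataclass, fields


@dataclass
class EligibilityChecklist:
    runs_in_official_environment: bool
    generates_reasoning_and_speech: bool
    follows_track_and_category_constraints: bool
    declares_all_external_resources: bool
    avoids_cascades_and_online_apis: bool
    passes_basic_validity_check: bool

    def failed(self):
        """Names of the conditions that are not satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def eligible(self):
        # All conditions must hold; a single failure is disqualifying.
        return not self.failed()


checklist = EligibilityChecklist(
    runs_in_official_environment=True,
    generates_reasoning_and_speech=True,
    follows_track_and_category_constraints=True,
    declares_all_external_resources=False,  # e.g. an undeclared pretrained model
    avoids_cascades_and_online_apis=True,
    passes_basic_validity_check=True,
)
print(checklist.eligible(), checklist.failed())
```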