Frequently Asked Questions
What is the ISCSLP 2026 CoT-TTS Challenge?
The ISCSLP 2026 CoT-TTS Challenge focuses on Chain-of-Thought Reasoning for Context-Aware Text-to-Speech. Given dialogue context, a target sentence, and a reference speech sample, participating systems are expected to infer how the target sentence should be spoken, generate an explicit reasoning analysis, and synthesize speech that matches both the context and the reference speaker's timbre.
What does CoT-TTS mean?
CoT-TTS stands for Chain-of-Thought Text-to-Speech. Instead of requiring users to manually provide style prompts such as "angry," "sad," or "excited," the system should reason from the previous context and decide the appropriate speaking style automatically.
Who can participate in the challenge?
The challenge is open to teams and individuals from academia and industry. Researchers, students, and engineers working on text-to-speech synthesis, speech foundation models, spoken dialogue systems, multimodal learning, and expressive speech generation are welcome to participate.
How do we register for the challenge?
Registration is handled through the official Google Form. Teams should complete the form by July 4, 2026, so that the organizers can match each registration to the team's later system submissions.
What are the tracks in this challenge?
The challenge contains two tracks. Track 1 is Text-Context-Aware CoT-TTS, where the system receives previous dialogue turns as speaker-labeled text. Track 2 is Audio-Context-Aware CoT-TTS, where the system receives the previous dialogue as a continuous audio segment.
What should a submitted system generate?
For each test sample, the system should generate two outputs: a reasoning analysis and a speech waveform. The reasoning analysis should explain the intended speaking manner of the target sentence, while the generated speech should preserve the reference speaker's timbre and remain consistent with the inferred speaking style.
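As a rough illustration of the inputs and outputs described above, the sketch below models one test sample and one system output as Python dataclasses. The class and field names are hypothetical and are not an official data format; the check that a sample carries exactly one kind of context is also an assumption based on the two track descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CoTTTSSample:
    """One test sample. All field names here are hypothetical."""
    target_text: str
    reference_audio_path: str
    # Track 1: previous dialogue turns as (speaker_label, text) pairs.
    text_context: Optional[List[Tuple[str, str]]] = None
    # Track 2: previous dialogue as one continuous audio segment.
    audio_context_path: Optional[str] = None

    def track(self) -> int:
        """Return 1 for text context, 2 for audio context."""
        has_text = self.text_context is not None
        has_audio = self.audio_context_path is not None
        if has_text == has_audio:
            # Assumption: a sample belongs to exactly one track.
            raise ValueError("provide exactly one kind of context")
        return 1 if has_text else 2


@dataclass
class CoTTTSOutput:
    """The two required outputs for a sample (hypothetical layout)."""
    reasoning: str       # explicit analysis of the intended speaking manner
    waveform_path: str   # synthesized speech in the reference speaker's timbre

    def is_complete(self) -> bool:
        """True when both a non-empty reasoning text and a waveform path exist."""
        return bool(self.reasoning.strip()) and bool(self.waveform_path)
```

A system wrapper could call is_complete() on every output before packaging results, so that a missing reasoning analysis or waveform is caught early.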
Can we use external datasets or pretrained models?
Yes. Participants may use external datasets and publicly available pretrained models, as long as their licenses permit use for academic research. Any external data, pretrained models, synthetic data, data augmentation strategies, or post-processing methods must be clearly described in the system description paper.
Are cascaded ASR-LLM-TTS systems allowed?
No. Cascaded systems such as ASR-LLM-TTS pipelines are not allowed. Participants are expected to build end-to-end systems that directly perform context-aware reasoning and speech generation.
What is the parameter-constrained category?
Each track includes a parameter-constrained category for systems with fewer than 1B parameters during inference. This category is designed to encourage efficient modeling and make the challenge more accessible to teams with limited computational resources.
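A team could sanity-check its own parameter count along the following lines. This is a minimal sketch, not the organizers' counting procedure: it assumes the model can be described as a mapping from parameter names to tensor shapes, and the function names are hypothetical. How shared weights or auxiliary components are counted is not stated in this FAQ.

```python
from math import prod
from typing import Dict, Tuple

# "fewer than 1B parameters" as stated in the FAQ answer above.
PARAM_LIMIT = 1_000_000_000


def count_parameters(shapes: Dict[str, Tuple[int, ...]]) -> int:
    """Sum the number of elements over all parameter tensors."""
    # prod(()) is 1, so a scalar parameter counts as one element.
    return sum(prod(shape) for shape in shapes.values())


def is_parameter_constrained(shapes: Dict[str, Tuple[int, ...]]) -> bool:
    """True when the total is strictly below the 1B limit."""
    return count_parameters(shapes) < PARAM_LIMIT


# Toy example with made-up layer names and shapes.
toy_shapes = {
    "embed.weight": (32000, 1024),
    "proj.weight": (1024, 1024),
    "proj.bias": (1024,),
}
total = count_parameters(toy_shapes)
```

With a PyTorch model, the same total is usually obtained by summing p.numel() over model.parameters().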
How will submissions be evaluated?
The final score will combine objective evaluation, LLM-based evaluation, and human subjective evaluation. Objective metrics will measure factors such as speech quality, intelligibility, speaker similarity, prosody, and efficiency. LLM-based evaluation will assess the reasoning output and its consistency with the generated speech. Human evaluation will focus on contextual coherence, reasoning accuracy, reasoning informativeness, and speech-reasoning consistency.
What should be included in the submission?
Each submission should include the inference code, runtime environment, trained model or checkpoints, and a short system description paper. The inference code must be executable on the official evaluation set and should generate both reasoning analyses and speech waveforms. The submitted system must run without internet access during official evaluation.
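Before submitting, it can help to confirm locally that inference does not try to reach the network. The sketch below is one informal way to do that in Python by temporarily replacing socket.socket; the helper name no_network is hypothetical, and this is not how the official evaluation enforces the rule. It only catches Python code that creates sockets through the socket module while the block is active, so running in a container with networking disabled remains a stronger test.

```python
import socket
from contextlib import contextmanager


@contextmanager
def no_network():
    """Make any attempt to create a socket fail inside the with-block."""
    original = socket.socket

    def blocked(*args, **kwargs):
        raise RuntimeError("network access attempted during offline run")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real class, even if inference raised.
        socket.socket = original
```

Wrapping the inference entry point in "with no_network():" then turns an accidental model download or API call into an immediate error instead of a silent dependency.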