Data & models

Challenge Resources

This page provides the official training dataset and two baseline models, one for Track 1 and one for Track 2.

Official dataset

Training Dataset

The official training release includes approximately 16K hours of bilingual (English and Chinese) speech, split into around 3M segments, for context-aware CoT-TTS research and system development.

English: ~8.6K hours, about 1.62M segments
Chinese: ~7.4K hours, about 1.38M segments
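
As a quick sanity check on the figures above, the short Python sketch below confirms that the per-language numbers add up to the stated totals and derives the average segment length they imply. The input values are the rounded figures quoted on this page, so the derived durations are rough estimates and not official statistics.

```python
# Approximate figures quoted on this page; all values are rounded.
SPLITS = {
    "English": {"hours": 8_600, "segments": 1_620_000},
    "Chinese": {"hours": 7_400, "segments": 1_380_000},
}


def mean_segment_seconds(hours: float, segments: int) -> float:
    """Average segment length in seconds implied by a duration and a segment count."""
    return hours * 3600 / segments


total_hours = sum(s["hours"] for s in SPLITS.values())
total_segments = sum(s["segments"] for s in SPLITS.values())

for name, s in SPLITS.items():
    avg = mean_segment_seconds(s["hours"], s["segments"])
    print(f"{name}: ~{avg:.1f} s per segment")

overall = mean_segment_seconds(total_hours, total_segments)
print(f"Total: {total_hours} h, {total_segments} segments, ~{overall:.1f} s per segment")
```

Under these rounded inputs, the implied average segment is on the order of 19 seconds in both languages, which may help when budgeting storage or batch sizes.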

Baseline Models

Track 1 Text context

Track 1 Baseline: Text-Context-Aware CoT-TTS

A reproducible 0.6B-parameter Qwen3-style baseline trained with a three-stage strategy for context-aware reasoning and speech generation.

Context signal

Uses speaker-labeled textual dialogue context together with target text and reference speech.

Track 2 Audio context

Track 2 Baseline: Audio-Context-Aware CoT-TTS

A reproducible 0.6B-parameter Qwen3-style baseline trained with a three-stage strategy for reasoning from acoustic dialogue history.

Context signal

Uses continuous multi-speaker audio context together with target text and reference speech.
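
The difference between the two tracks' context signals can be pictured as two input records. The Python sketch below is illustrative only: the class names, field names, and the "speaker: text" rendering are hypothetical and do not describe the official data format or the baselines' actual interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextTurn:
    """One turn of dialogue history (hypothetical structure)."""
    speaker: str  # speaker label, e.g. "A" or "B"
    text: str     # what that speaker said


@dataclass
class Track1Input:
    """Text context: speaker-labeled turns plus target text and reference speech."""
    context: List[TextTurn]
    target_text: str
    reference_speech_path: str  # hypothetical: path to the reference audio


@dataclass
class Track2Input:
    """Audio context: one continuous multi-speaker recording in place of text turns."""
    context_audio_path: str     # hypothetical: path to the dialogue-history audio
    target_text: str
    reference_speech_path: str


def render_text_context(turns: List[TextTurn]) -> str:
    """Flatten speaker-labeled turns into one 'speaker: text' line per turn."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)
```

In this picture the target text and reference speech are shared by both tracks, and only the form of the dialogue history changes: labeled text for Track 1, raw audio for Track 2.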

Resource Release Note

All resources will be released through the official challenge website. Models and data are provided for academic research and challenge participation only. Submitted systems must include all required files because official evaluation runs without internet access.
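
Because evaluation runs offline, it may be worth checking a submission folder for completeness before uploading. The Python sketch below shows one way to do that. The file names in `REQUIRED_FILES` and the `my_submission` directory are hypothetical placeholders; the actual list of required files is whatever the official challenge rules specify.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical placeholders: replace with the files the official rules require.
REQUIRED_FILES = ["model/weights.bin", "model/config.json", "run_inference.py"]


def missing_files(submission_dir: str, required: Iterable[str]) -> List[str]:
    """Return the required relative paths that are not regular files in submission_dir."""
    root = Path(submission_dir)
    return [rel for rel in required if not (root / rel).is_file()]


# Example: report anything absent from a local folder named "my_submission".
missing = missing_files("my_submission", REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present.")
```

A check like this only confirms that files exist. It does not verify that the system runs without network access, so a dry run on a machine with networking disabled is still advisable.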