Challenge Tasks

Challenge Description

Given a dialogue context, a target text, and a reference speech sample, the system should produce a reasoning analysis of how the target text should be spoken, then generate a speech waveform that is consistent with that analysis and preserves the reference speaker's timbre.

2 Tracks: Text & Audio Context
Bilingual: English + Chinese
~16K Hours Training Data
~3M Segments
30% Objective + 20% LLM + 50% Human

Task Pipeline

01

Context Input

Provide dialogue context, target text, and a reference speech sample as the conditioning signal for style inference.

02

CoT Reasoning

Before synthesis, the system analyzes the causes and consequences in the dialogue and summarizes the speaking style of the target audio.

03

Speech Generation

Generate speech that preserves the reference speaker's timbre while matching the inferred speaking manner and scene context.

Track 1: Text context

Text-Context-Aware CoT-TTS

The model reads speaker-labeled dialogue history and infers the target utterance's emotion, tone, rhythm, and communicative intent.

Input
Dialogue context, target text, reference speech
Expected Output

The system must generate both a reasoning analysis and an output audio waveform aligned with the inferred speaking style.

Example
{
  "context": ["speaker-0: I cannot believe this happened."],
  "target_text": "Then we must act now.",
  "reference_audio": "ref_speaker.wav"
}

{
  "reasoning": "The previous turn signals shock and urgency, so the target sentence should sound decisive, tense, and action-oriented.",
  "output_audio": "track1_prediction.wav"
}

Track 2: Audio context

Audio-Context-Aware CoT-TTS

The model listens to a continuous multi-speaker dialogue segment, then generates the target speech in a coherent speaking style.

Input
Dialogue audio, target text, reference speech
Expected Output

The system must generate both a reasoning analysis and an output audio waveform aligned with the inferred speaking style.

Example
{
  "context_audio": "history_dialogue.wav",
  "target_text": "Then we must act now.",
  "reference_audio": "ref_speaker.wav"
}

{
  "reasoning": "The acoustic context suggests escalating tension and shared urgency, so the response should be firm, immediate, and emotionally heightened.",
  "output_audio": "track2_prediction.wav"
}
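
As a minimal sketch, the two input examples above can be told apart by which context field they carry. The field names come from the examples; the function name `detect_track` and the validation rules are assumptions for illustration, not part of any official toolkit.

```python
# Keys that appear in the example inputs of both tracks.
COMMON_KEYS = {"target_text", "reference_audio"}


def detect_track(record: dict) -> int:
    """Return 1 for a text-context record, 2 for an audio-context record.

    Hypothetical helper: the checks below are assumed, not an official schema.
    """
    missing = COMMON_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")

    has_text = "context" in record
    has_audio = "context_audio" in record
    if has_text == has_audio:
        raise ValueError("expected exactly one of 'context' or 'context_audio'")

    if has_text:
        turns = record["context"]
        # Track 1: a list of speaker-labeled dialogue turns.
        if not isinstance(turns, list) or not all(isinstance(t, str) for t in turns):
            raise ValueError("'context' must be a list of strings")
        return 1

    # Track 2: a path to the dialogue audio segment.
    if not isinstance(record["context_audio"], str):
        raise ValueError("'context_audio' must be a file path string")
    return 2
```

Passing the Track 1 example input to `detect_track` returns 1, and the Track 2 example input returns 2; a record that carries both context fields, or neither, is rejected.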

What Makes This Challenge Different?

Focus: Contextual Understanding. The model must interpret evolving dialogue scenes instead of relying on isolated style labels.
Focus: Explicit Reasoning. Systems must expose why a sentence should be spoken in a certain way through a reasoning analysis.
Focus: Speech-Reasoning Consistency. The generated waveform should faithfully reflect the reasoning output rather than merely sounding natural.

Evaluation Snapshot

The official evaluation is designed to measure both the quality of the generated speech and the reliability of the model's reasoning process. Each submission is assessed through a unified pipeline that combines automatic speech metrics, multimodal LLM-based judgment, and human subjective evaluation, so that the final ranking reflects naturalness, contextual appropriateness, reasoning quality, and speech-reasoning consistency.

30% Objective Evaluation

Speech quality, intelligibility, speaker similarity, prosody, expression, and efficiency.

20% LLM-Based Evaluation

Contextual understanding, internal logical coherence, and informativeness of the reasoning.

50% Human Evaluation

Contextual coherence, reasoning accuracy, informativeness, naturalness, and speech-reasoning consistency.
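
The 30% / 20% / 50% weighting above can be illustrated with a small sketch. It assumes each component has already been normalized to the range [0, 1]; how the organizers actually normalize and aggregate the individual metrics is not stated here, so the function name `combined_score` and the score keys are hypothetical.

```python
# Weights taken from the evaluation breakdown above.
WEIGHTS = {"objective": 0.30, "llm": 0.20, "human": 0.50}


def combined_score(scores: dict) -> float:
    """Weighted sum of the three evaluation components.

    Assumption: every component score is pre-normalized to [0, 1].
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        if name not in scores:
            raise ValueError(f"missing component score: {name}")
        value = scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
        total += weight * value
    return total
```

For instance, component scores of 0.8 (objective), 0.6 (LLM), and 0.9 (human) combine to about 0.81 under this assumed scheme, which shows how strongly the human evaluation dominates the total.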