AdaTT: Text-Guided Instrument Timbre Transfer with Target-Adaptive Structural Control
Dabin Kim¹, Junwon Lee², Juhan Nam¹,²
¹ Graduate School of Cultural Technology, KAIST, South Korea
² Graduate School of AI, KAIST, South Korea
Supporting Webpage for INTERSPEECH 2026.
[arXiv]
Abstract
This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue that this issue stems from instrument-specific expressive details carried by these conditions, which conflict with the timbral properties of the target instrument. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system built on the ControlNet scheme that ensures high timbral fidelity across diverse timbre transfer scenarios. Guided by text prompts, it selectively scales the frame-wise influence of the pitch and loudness controls to match the target instrument's identity. We also present a semi-automatic data construction pipeline that teaches the model which expressive details to transform and which to preserve. Results show that AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content.
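To make the idea of frame-wise control scaling concrete, the following is a minimal illustrative sketch, not the AdaTT implementation. It assumes that the pitch and loudness controls are available as per-frame feature vectors and that some upstream module has already produced one gate value per frame for each control. The abstract does not specify how the gates are derived from the text prompt or where they are applied inside the ControlNet branch, so the function `scale_controls`, the helper `clamp01`, and all input values are hypothetical.

```python
from typing import List

Frames = List[List[float]]  # shape: (frames, channels)


def clamp01(x: float) -> float:
    """Clamp a gate value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def scale_controls(
    pitch_feat: Frames,
    loud_feat: Frames,
    pitch_gate: List[float],
    loud_gate: List[float],
) -> Frames:
    """Mix two control streams with one gate per frame (hypothetical helper).

    Each frame of the pitch and loudness features is multiplied by its own
    gate, then the two scaled streams are summed channel by channel.
    """
    if not (len(pitch_feat) == len(loud_feat) == len(pitch_gate) == len(loud_gate)):
        raise ValueError("all inputs must have the same number of frames")

    mixed: Frames = []
    for p_row, l_row, pg, lg in zip(pitch_feat, loud_feat, pitch_gate, loud_gate):
        pg, lg = clamp01(pg), clamp01(lg)
        mixed.append([pg * p + lg * l for p, l in zip(p_row, l_row)])
    return mixed


# Toy input: 3 frames, 2 channels. The gate values are hand-set for
# illustration; in a real system they would come from a learned module.
pitch_feat = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
loud_feat = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
pitch_gate = [1.0, 0.5, 0.0]  # pitch control progressively suppressed
loud_gate = [1.0, 1.0, 1.0]   # loudness control kept at full strength

mixed = scale_controls(pitch_feat, loud_feat, pitch_gate, loud_gate)
print(mixed)
```

In this toy setting, lowering the pitch gate on later frames mimics the flute example above: the source's pitch-dominant vibrato is attenuated while the loudness contour still passes through at full strength.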