AdaTT: Text-Guided Instrument Timbre Transfer with Target-Adaptive Structural Control

Proposed Architecture Demo

Abstract

This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system that ensures high timbral fidelity across diverse timbre transfer scenarios within the ControlNet scheme. It selectively scales the frame-wise influence of pitch and loudness controls via text prompts to match the target instrument's identity. We also present a semi-automatic data construction pipeline to teach the model which expressive details to transform or preserve. Results show AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content.

Demo Examples

Method Description
Stable Audio Open: A text-to-music latent diffusion model that synthesizes audio based on text prompts.
ControlNet: A DiT-based ControlNet trained with a frozen Stable Audio Open backbone. It is conditioned on two heterogeneous structural features: \(f_0\) and RMS contours.
SmartControl: Incorporates Control Scale Predictors (CSPs) with the frozen ControlNet to adaptively adjust intermediate control features.
AdaTT (Ours): Jointly trains CSPs and Text-Guided CSPs (TG-CSPs). TG-CSPs modulate control scales to balance the contributions of each structural condition.
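The idea of frame-wise control scaling described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that the pitch and loudness conditions yield two separate control feature streams that are added to a backbone hidden state, and that a predictor has already produced one scale per frame for each stream. The function name `scale_and_inject` and all argument names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical layout: one list of channel values per time frame.
Frames = List[List[float]]


def scale_and_inject(
    hidden: Frames,
    pitch_feat: Frames,
    loud_feat: Frames,
    pitch_scale: Sequence[float],
    loud_scale: Sequence[float],
) -> Frames:
    """Add per-frame scaled control features to a hidden state.

    Assumption (not from the source): controls are injected additively,
    and each condition has its own per-frame scalar scale.
    """
    n = len(hidden)
    if not all(len(x) == n for x in (pitch_feat, loud_feat, pitch_scale, loud_scale)):
        raise ValueError("all inputs must cover the same number of frames")

    out: Frames = []
    for h, p, l, a, b in zip(hidden, pitch_feat, loud_feat, pitch_scale, loud_scale):
        # Each frame gets its own mix of pitch and loudness control.
        out.append([hc + a * pc + b * lc for hc, pc, lc in zip(h, p, l)])
    return out


# Two frames, two channels: frame 0 follows only the pitch control,
# frame 1 follows only the loudness control.
hidden = [[0.0, 0.0], [0.0, 0.0]]
pitch_feat = [[1.0, 1.0], [1.0, 1.0]]
loud_feat = [[2.0, 2.0], [2.0, 2.0]]
mixed = scale_and_inject(hidden, pitch_feat, loud_feat, [1.0, 0.0], [0.0, 1.0])
print(mixed)  # [[1.0, 1.0], [2.0, 2.0]]
```

In this toy setting, lowering the pitch scale on a frame weakens the influence of the source \(f_0\) contour there, which is the kind of per-frame trade-off the text attributes to the control scale predictors.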

Violin → Clarinet

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Violin → Flute

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Trumpet → Flute

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Oboe → Trumpet

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Clarinet → Violin

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Double Bass → Trombone

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]

Saxophone → Viola

[Spectrograms: Source, AdaTT (Ours), Stable Audio Open, ControlNet, SmartControl]