A Quick Review

Background

Previous work on target speaker extraction (TSE) has mainly focused on highly overlapped scenarios. For example, the DNS Challenge evaluates TSE under single-speaker noisy and highly overlapped multi-speaker noisy conditions. However, real-world applications often involve more complex scenarios, such as variable speaker overlap and speaker absence.

Experimental setup

The experiments were performed on a simulated dataset. The test set is divided into four configurations by number of speakers: 2, 3, 4, and 5. Each configuration is evaluated at six overlap rates, labeled OL, OS, OV10, OV20, OV30, and OV40 in the result tables.
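The evaluation grid described above can be enumerated as in the following minimal sketch. The condition labels are copied from the result tables on this page; the constant and function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Speaker counts and overlap-condition labels as listed on this page.
SPEAKER_COUNTS = [2, 3, 4, 5]
OVERLAP_CONDITIONS = ["OL", "OS", "OV10", "OV20", "OV30", "OV40"]


def test_conditions():
    """Return every (speaker count, overlap condition) pair in the test set."""
    return list(product(SPEAKER_COUNTS, OVERLAP_CONDITIONS))


conditions = test_conditions()
print(len(conditions))  # 4 speaker counts x 6 overlap conditions
```

Each pair corresponds to one column of one of the tables under "Detailed results".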

Framework & Method

(1) The introduction of transformer blocks improves the performance of TSVAD.

(2) For the continuous target speech extraction task in complex scenes, we propose the C-TSE network.

Results

(1) A TSVAD model with transformer blocks is introduced to capture the target speaker's activity, surpassing classic clustering-based methods.

(2) A joint framework for C-TSE is proposed that fuses the TSVAD and TSE results in parallel, further improving performance on both diarization and enhancement metrics.
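To give a rough intuition for combining a TSVAD output with a TSE output, the sketch below silences extracted samples in frames where the target speaker is judged inactive. This is an assumption made purely for illustration: the page does not describe the actual fusion used in the C-TSE network, and the hard-gating rule, the function name `gate_with_vad`, the frame length, and the threshold are all hypothetical.

```python
def gate_with_vad(tse_samples, vad_probs, frame_len, threshold=0.5):
    """Zero out TSE samples in frames where the target is judged inactive.

    Hypothetical hard-gating rule, not the fusion used by C-TSE.
    tse_samples: extracted waveform samples (floats).
    vad_probs:   per-frame target-speaker activity probabilities.
    frame_len:   number of samples covered by one activity frame.
    """
    if not vad_probs:
        raise ValueError("vad_probs must contain at least one frame")
    gated = []
    for i, sample in enumerate(tse_samples):
        # Map the sample index to its frame; clamp to the last frame.
        frame = min(i // frame_len, len(vad_probs) - 1)
        active = vad_probs[frame] >= threshold
        gated.append(sample if active else 0.0)
    return gated


tse_out = [0.2, -0.1, 0.3, 0.05, -0.4, 0.1]  # toy extracted signal
vad = [0.9, 0.1, 0.8]                        # toy activity, one value per 2 samples
gated = gate_with_vad(tse_out, vad, frame_len=2)
print(gated)  # [0.2, -0.1, 0.0, 0.0, -0.4, 0.1]
```

Gating of this kind mainly matters in the speaker-absence case, where an extraction model alone may still emit residual interfering speech.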

Detailed results

Simulated-recorded audio test

Number of speakers: 2

Columns (overlap condition): OL, OS, OV10, OV20, OV30, OV40
Rows (method): ANCHOR, MIXTURE, TARGET, TSVAD, pBSRNN, Cascade approach 1, Cascade approach 2, Parallel approach

Number of speakers: 3

Columns (overlap condition): OL, OS, OV10, OV20, OV30, OV40
Rows (method): ANCHOR, MIXTURE, TARGET, TSVAD, pBSRNN, Cascade approach 1, Cascade approach 2, Parallel approach

Number of speakers: 4

Columns (overlap condition): OL, OS, OV10, OV20, OV30, OV40
Rows (method): ANCHOR, MIXTURE, TARGET, TSVAD, pBSRNN, Cascade approach 1, Cascade approach 2, Parallel approach

Number of speakers: 5

Columns (overlap condition): OL, OS, OV10, OV20, OV30, OV40
Rows (method): ANCHOR, MIXTURE, TARGET, TSVAD, pBSRNN, Cascade approach 1, Cascade approach 2, Parallel approach

Real-recorded audio test