Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Long Recordings

Authors: He Zhao & Hangting Chen & Jianwei Yu & Yuehai Wang

School of Information and Electronic Engineering, Zhejiang University & Tencent AI Lab, Audio and Speech Signal Processing Oteam

Email: zhao_he@zju.edu.cn

Abstract: Target speaker extraction (TSE) aims to extract the target speaker's voice from the input mixture. Previous studies have concentrated on high-overlap scenarios. However, real-world applications often involve more complex scenarios, such as variable speaker overlap and target speaker absence. In this paper, we introduce a framework to perform continuous TSE (C-TSE), comprising a target speaker voice activation detection (TSVAD) model and a TSE model. This framework significantly improves TSE performance on similar speakers and enhances personalization, which is lacking in traditional diarization methods. In detail, unlike conventional TSVAD, which is deployed to refine diarization results, the proposed Attention-target speaker voice activation detection (A-TSVAD) directly generates timestamps of the target speaker. We also explore different ways of integrating A-TSVAD and TSE by comparing cascaded and parallel methods. The framework's effectiveness is assessed using a range of metrics, including diarization and enhancement metrics. Our experiments demonstrate that A-TSVAD outperforms conventional methods in reducing diarization errors. Furthermore, integrating A-TSVAD and TSE in a sequential cascaded manner further enhances extraction accuracy.

A Quick Review

Background

Previous work on target speaker extraction has mainly focused on high-overlap scenarios. For example, the DNS Challenge evaluates TSE under single-speaker noisy and highly overlapped multi-speaker noisy conditions. However, real-world applications often involve more complex scenarios, such as variable speaker overlap and target speaker absence.

Experimental setup

The experiments were performed on a simulated dataset. The test set is divided into four configurations by number of speakers (2, 3, 4, or 5), and each configuration covers six overlap rates.
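To make the notion of an overlap rate concrete, the following is a minimal sketch of how it could be measured for one simulated mixture. It assumes the overlap rate is the fraction of total speech time during which two or more speakers are active; the definition used for the dataset may differ, and the function name `overlap_ratio` and the segment format are hypothetical.

```python
def overlap_ratio(segments):
    """Fraction of speech time with two or more active speakers.

    `segments` is a list of (start, end) times in seconds, pooled over
    all speakers in the mixture (assumed format, for illustration only).
    """
    events = []
    for start, end in segments:
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    # At equal times, -1 sorts before +1, so segments that merely touch
    # are not counted as overlapping.
    events.sort()

    active = 0          # number of speakers currently talking
    prev_time = None
    speech = 0.0        # time with at least one active speaker
    overlap = 0.0       # time with at least two active speakers
    for time, delta in events:
        if prev_time is not None and active > 0:
            duration = time - prev_time
            speech += duration
            if active >= 2:
                overlap += duration
        active += delta
        prev_time = time
    return overlap / speech if speech > 0 else 0.0
```

For example, two segments (0, 4) and (2, 6) give 6 seconds of speech with 2 seconds of overlap, so the ratio is one third.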

Framework & Method

(1) The introduction of transformer blocks improves the performance of TSVAD.

(2) For the continuous target speech extraction task in complex scenarios, we propose the C-TSE network.
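The two integration styles compared in this work, cascaded and parallel, can be illustrated with a toy sketch. This is only one plausible reading, not the authors' implementation: here "cascaded" means the extractor is run only on the regions the detector marks as target-active, and "parallel" means both models see the full mixture and the detector's decision gates the extractor's output. The names `active_runs`, `runs_to_timestamps`, `cascaded`, and `parallel`, the 0.5 threshold, and the one-value-per-frame signal representation are all assumptions; `vad` and `extract` stand in for trained A-TSVAD and TSE models.

```python
def active_runs(probs, threshold=0.5):
    """Return [start, end) frame-index runs where probs >= threshold."""
    runs, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a target-active run begins
        elif p < threshold and start is not None:
            runs.append((start, i))        # the run ends before frame i
            start = None
    if start is not None:                  # run extends to the last frame
        runs.append((start, len(probs)))
    return runs


def runs_to_timestamps(runs, frame_shift):
    """Convert frame-index runs to (start, end) times in seconds."""
    return [(s * frame_shift, e * frame_shift) for s, e in runs]


def cascaded(mixture, vad, extract, threshold=0.5):
    """Detector first; the extractor only sees target-active regions."""
    output = [0.0] * len(mixture)          # silence where target is absent
    for s, e in active_runs(vad(mixture), threshold):
        output[s:e] = extract(mixture[s:e])
    return output


def parallel(mixture, vad, extract, threshold=0.5):
    """Both models see the full mixture; the detector gates the output."""
    probs = vad(mixture)
    estimate = extract(mixture)
    return [x if p >= threshold else 0.0 for x, p in zip(estimate, probs)]
```

With simple frame-wise stand-ins the two functions return the same result. With real models they generally differ, because a cascaded extractor only receives the detected regions as context, while a parallel extractor processes the whole recording.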

Results

(1) A TSVAD model with transformer blocks is introduced to capture the target speaker's activity, surpassing classic clustering-based methods.

(2) A joint framework for C-TSE is proposed based on the parallel fusion of TSVAD and TSE results, further improving performance on diarization and enhancement metrics.
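As a rough illustration of what a diarization-style metric measures for a single target speaker, the sketch below counts missed and falsely detected frames. This summary does not name the exact metrics used, so the function is an assumption for illustration: the name `target_detection_error` is hypothetical, and the rate is taken as (missed + false-alarm frames) divided by the number of reference target frames, which is how a diarization error rate reduces when only one speaker is scored.

```python
def target_detection_error(reference, hypothesis):
    """Frame-level miss / false-alarm counts for one target speaker.

    `reference` and `hypothesis` are equal-length sequences of 0/1
    labels, one per frame (1 = target speaker active).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    # Target was speaking but the system said inactive.
    miss = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # System said active but the target was silent or absent.
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if h and not r)
    ref_speech = sum(1 for r in reference if r)

    # The rate is undefined when the target never speaks (target absence),
    # so only the raw counts are meaningful in that case.
    rate = (miss + false_alarm) / ref_speech if ref_speech else None
    return {"miss": miss, "false_alarm": false_alarm, "error_rate": rate}
```

Returning the raw counts alongside the rate keeps the target-absent case usable: there, every detected frame is a false alarm, even though no normalized rate can be computed.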