Speech separation based on cross-residual connection with deep learning
CHU Juntong, WEI Shuang
College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China
Abstract:
In the field of multi-modal speech separation, traditional feature fusion methods typically concatenate features after simple dimension alignment. With three modalities, such concatenation links only adjacent modalities and establishes no direct connection between the first and last feature streams, so the multi-modal information is not fully exploited. To overcome this limitation, this paper proposes an audio-visual and text fusion method based on cross-residual connections, which deeply fuses audio, video, and text features to improve speech separation. The method establishes connections between every pair of modalities: through cross connections, each modality shares information with all of the others, while residual connections combine the original input features with the feature representations being processed. This preserves the integrity of each modality's original features, fully exploits inter-modal correlations, enables every modality to learn effectively from the others, and improves the robustness of the fused features. Experimental results show that, compared with traditional concatenation-based audio-visual and audio-visual-text speech separation methods, the proposed method achieves significant improvements in key metrics such as the source-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ), demonstrating its advantages.
Key words: multi-modal speech separation; audio-visual feature; text feature; feature fusion; cross-residual connection
DOI: 10.20192/j.cnki.JSHNU(NS).2025.02.015
CLC number: TN911.7
Foundation item: Natural Science Foundation of Shanghai (19ZR1437600)
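
To make the fusion scheme concrete, the following is a minimal PyTorch sketch of a cross-residual fusion block as the abstract describes it: every ordered pair of modalities exchanges information through a cross connection, and a residual path re-injects each modality's original features before the streams are concatenated. The module name CrossResidualFusion, the linear projections, the layer normalization, and all dimensions are illustrative assumptions, not the authors' released implementation.

# A minimal sketch (assumed details, not the paper's code) of cross-residual
# fusion for audio (a), video (v), and text (t) features of shared dimension d.
import torch
import torch.nn as nn


class CrossResidualFusion(nn.Module):
    """Fuses audio, video, and text features via cross + residual connections."""

    def __init__(self, d: int):
        super().__init__()
        # One projection per ordered modality pair (the cross connections),
        # so every modality receives information from both of the others.
        self.cross = nn.ModuleDict({
            f"{src}->{dst}": nn.Linear(d, d)
            for src in ("a", "v", "t")
            for dst in ("a", "v", "t")
            if src != dst
        })
        self.norm = nn.LayerNorm(d)

    def forward(self, a, v, t):
        # Updated modality = its original features (residual path)
        # plus projected messages from the other two modalities (cross paths).
        feats = {"a": a, "v": v, "t": t}
        fused = {}
        for dst in feats:
            msg = sum(
                self.cross[f"{src}->{dst}"](feats[src])
                for src in feats if src != dst
            )
            fused[dst] = self.norm(feats[dst] + msg)  # residual connection
        # Concatenate the three enriched streams for the separation network.
        return torch.cat([fused["a"], fused["v"], fused["t"]], dim=-1)


# Usage with dummy features: batch of 2, sequence length 50, dimension 256.
if __name__ == "__main__":
    a = torch.randn(2, 50, 256)  # audio features
    v = torch.randn(2, 50, 256)  # video (lip) features
    t = torch.randn(2, 50, 256)  # text features
    out = CrossResidualFusion(256)(a, v, t)
    print(out.shape)  # torch.Size([2, 50, 768])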