Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task

Abstract

Human perception has the unique ability to focus on specific events in a mixture of signals, a challenging task for existing non-intrusive assessment methods. In this work, we introduce semi-intrusive assessment, which emulates human attention by framing audio assessment as a text-prediction task with audio-text inputs. To this end, we extend the multi-modal PENGI model [2] through instruction fine-tuning for mean opinion score (MOS) and signal-to-noise ratio (SNR) estimation. For MOS, our approach achieves absolute Pearson correlation gains of 0.06 and 0.20 over the re-trained MOSRA model [1] and the pre-trained PAM model [3], respectively. We further propose a novel SNR estimator that can focus on a specific audio source in a mixture, outperforming a random baseline and the fixed-prompt counterpart. Our findings suggest that semi-intrusive assessment can effectively capture human-like selective listening capabilities.

Sample outputs

MOS prediction

The Output Caption With Target column shows the caption the model generates when the prompt names the target sound class:

e.g. Paying attention to the music assess the audio quality

The MOSRA MOS prediction column shows the MOS predicted by the MOSRA [1] model. This model makes predictions from the audio alone and has no access to the target sound class.

Input Prompt	Target Sound Ground Truth	Output Caption With Target	MOSRA [1] MOS prediction
Paying attention to the music assess the audio quality	The audio quality is 4.6	The audio quality is 2.9	1.1
Paying attention to the music assess the audio quality	The audio quality is 4.4	The audio quality is 4.1	1.1
Paying attention to the music assess the audio quality	The audio quality is 5.0	The audio quality is 4.3	1.2
Paying attention to the speech assess the audio quality	The audio quality is 3.8	The audio quality is 4.8	4.8
Paying attention to the speech assess the audio quality	The audio quality is 4.2	The audio quality is 2.8	2.1
Paying attention to the speech assess the audio quality	The audio quality is 2.1	The audio quality is 2.8	3.0
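The prompts and captions in the sample tables follow fixed text templates, so building a prompt and reading the numeric score back out of a caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: the templates are copied from the sample rows on this page, and the helper names `build_prompt` and `parse_caption` are hypothetical.

```python
import re

# Templates copied from the sample rows on this page (assumption: the model
# was prompted with exactly these strings; the released code may differ).
PROMPT_TEMPLATES = {
    "mos": "Paying attention to the {target} assess the audio quality",
    "snr": "Paying attention to the {target} estimate the SNR",
}

# Matches an optionally signed decimal number such as 4.6 or -8.8.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def build_prompt(target: str, task: str = "mos") -> str:
    """Fill the target sound class into the prompt template for the task."""
    if task not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return PROMPT_TEMPLATES[task].format(target=target)


def parse_caption(caption: str) -> float:
    """Extract the first number from a caption like 'The SNR is -8.8'."""
    match = _NUMBER.search(caption)
    if match is None:
        raise ValueError(f"no numeric value found in caption: {caption!r}")
    return float(match.group())
```

Parsing the caption back into a number is what allows a text-prediction model to be scored with regression-style metrics such as Pearson correlation.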

SNR prediction

In the Output Caption With Target column, we show the output caption with the target sound class in the prompt:

e.g. Paying attention to the dog estimate the SNR.

The MOSRA SNR prediction column shows the SNR predicted by the MOSRA [1] model. This model makes predictions from the audio alone and has no access to the target sound class.

Input Prompt	Target Sound Ground Truth	Output Caption With Target	MOSRA [1] SNR prediction
Paying attention to the dog estimate the SNR	The SNR is 12.1	The SNR is 9.8	-16.9
Paying attention to the chainsaw estimate the SNR	The SNR is 11.8	The SNR is 17.6	-15.5
Paying attention to the keyboard_typing estimate the SNR	The SNR is 7.2	The SNR is 7.2	-23.9
Paying attention to the crickets estimate the SNR	The SNR is -19.5	The SNR is -17.6	-19.6
Paying attention to the drinking_sipping estimate the SNR	The SNR is -8.8	The SNR is -4.9	-23.9
Paying attention to the sneezing estimate the SNR	The SNR is 19.2	The SNR is 19.9	-20.4
Paying attention to the frog estimate the SNR	The SNR is 5.4	The SNR is 7.4	6.5
Paying attention to the sea_waves estimate the SNR	The SNR is -8.7	The SNR is -10.8	-20.5
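To make the comparison in the table above easier to read, the sketch below computes the mean absolute error of each prediction column against the ground truth. It covers only the eight sample rows shown on this page, so it illustrates the table and is not the evaluation reported in the paper. The variable and function names are hypothetical.

```python
# Each tuple: (ground-truth SNR, target-prompted output, MOSRA prediction),
# transcribed from the eight sample rows in the table above.
SNR_ROWS = [
    (12.1, 9.8, -16.9),    # dog
    (11.8, 17.6, -15.5),   # chainsaw
    (7.2, 7.2, -23.9),     # keyboard_typing
    (-19.5, -17.6, -19.6), # crickets
    (-8.8, -4.9, -23.9),   # drinking_sipping
    (19.2, 19.9, -20.4),   # sneezing
    (5.4, 7.4, 6.5),       # frog
    (-8.7, -10.8, -20.5),  # sea_waves
]


def mean_absolute_error(truth, predicted):
    """Average of |truth - predicted| over paired values."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


truth = [row[0] for row in SNR_ROWS]
mae_prompted = mean_absolute_error(truth, [row[1] for row in SNR_ROWS])
mae_mosra = mean_absolute_error(truth, [row[2] for row in SNR_ROWS])

print(f"target-prompted MAE: {mae_prompted:.1f}")
print(f"MOSRA (audio-only) MAE: {mae_mosra:.1f}")
```

On these eight rows the target-prompted captions are off by roughly 2.3 on average, against roughly 19.4 for the audio-only predictions. Eight hand-picked samples are far too few to draw conclusions from.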

Sources

[1] El Hajal, Karl, et al. "Efficient speech quality assessment using self-supervised framewise embeddings." ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.

[2] Deshmukh, Soham, et al. "Pengi: An audio language model for audio tasks." Advances in Neural Information Processing Systems 36 (2023): 18090-18108.

[3] Deshmukh, Soham, et al. "Pam: Prompting audio-language models for audio quality assessment." arXiv preprint arXiv:2402.00282 (2024).

BibTeX

@misc{coldenhoff2024semiintrusiveaudioevaluationcasting,
        title={Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task},
        author={Jozef Coldenhoff and Milos Cernak},
        year={2024},
        eprint={2409.14069},
        archivePrefix={arXiv},
        primaryClass={eess.AS},
        url={https://arxiv.org/abs/2409.14069},
}