M. Fazal Ur Rehman, Nasir Saleem, Asif Nawaz, Sadeeq Jan, Zeeshan Najam, M. Irfan Khattak, Sheeraz Ahmed



CASA, IBM, intelligibility, time-frequency masking, supervised speech separation, quality


Computational auditory scene analysis (CASA) based speech separation is used in a number of speech processing applications to separate a target speech signal from target-interference mixtures, and the task is usually treated as a signal processing problem. Target speech separation can instead be formulated as a supervised learning problem, in which discriminative patterns of speech, speakers and background noises are learned from training data. In this paper, we present a single-channel supervised speech separation approach based on estimation of the ideal binary mask (IBM). In the proposed approach, a speaker-independent speech separation system is trained with sets of clean speech magnitudes. During separation, the SNR is estimated in each time-frequency (TF) channel using the clean magnitudes and compared to a pre-defined threshold. TF channels that satisfy the threshold are retained, while TF channels that violate it are discarded, which yields the IBM. The estimated mask is then applied to the mixture, and the target speech is reconstructed using the phase of the mixture. Experiments are conducted in three speaker-independent mixture scenarios, termed 2-talker, 3-talker and 4-talker mixtures, at four input SNRs: -5 dB, 0 dB, 5 dB and 10 dB. The experimental results show that the proposed CASA-based supervised speaker-independent mask estimation outperforms the competing approaches, namely nonnegative matrix factorization (NMF), the nonnegative dynamical system (NNDS) and log minimum mean square error (LMMSE) estimation, in terms of the PESQ, SegSNR, LLR, WSS, SIG, BAK and STOI objective measures.
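
The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes oracle target and interference magnitudes are available (the paper instead estimates the SNR from trained clean-speech magnitudes), it assumes a 0 dB threshold, and the function names `ideal_binary_mask` and `apply_mask` are hypothetical. The inputs are small hand-made arrays standing in for STFT frames.

```python
import numpy as np

def ideal_binary_mask(target_mag, interference_mag, threshold_db=0.0, eps=1e-12):
    """Return 1.0 for TF units whose local SNR (dB) exceeds threshold_db, else 0.0.

    Assumption: oracle magnitudes of target and interference are known.
    """
    local_snr_db = 10.0 * np.log10((target_mag ** 2 + eps) / (interference_mag ** 2 + eps))
    return (local_snr_db > threshold_db).astype(float)

def apply_mask(mixture_stft, mask):
    """Scale the mixture magnitude by the mask and reuse the mixture phase."""
    magnitude = np.abs(mixture_stft) * mask
    phase = np.angle(mixture_stft)
    return magnitude * np.exp(1j * phase)

# Toy 2x2 "spectrogram" (frequency x time); values are illustrative only.
target_mag = np.array([[2.0, 0.1], [0.5, 3.0]])
interference_mag = np.array([[1.0, 1.0], [1.0, 1.0]])
mixture_stft = np.array([[3.0 + 0.0j, 0.0 + 1.1j], [-1.5 + 0.0j, 4.0 + 0.0j]])

mask = ideal_binary_mask(target_mag, interference_mag, threshold_db=0.0)
separated = apply_mask(mixture_stft, mask)
```

In this toy case only the two TF units where the target dominates are kept. A full system would follow this step with an inverse STFT to obtain the time-domain signal.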


