Husam Ali Abdulmohsin, Hala Bahjat Abdul wahab, Abdul Mohssen Jaber Abdul hossen




feature extraction, feature selection and classification, real-time system, robotics, SER


Speech emotion recognition (SER) research extends back to 1996, yet one main obstacle remains: achieving real-time SER systems. The once-imaginary relationship between humans and robots is rapidly approaching reality. Robots already play major roles, particularly in manufacturing, but until recently they did only what they were programmed to do. With the development of artificial intelligence (AI) approaches, however, SER researchers are seeking to move robotics to a higher level, giving robots the ability to predict human actions and recognize facial expressions and allowing them to interact with humans in more natural and intelligent ways. Humans are complicated; understanding only what they say is not sufficient in every situation. One complication is that humans express identical emotions in multiple ways. For robots to act more like humans, understand them, and follow their orders more intelligently, they need to understand emotions in order to make appropriate decisions. Thus, to reach the ideal SER state, a more up-to-date survey that considers how SER research has evolved over the past decade is needed. In this survey, our main goal is to explain the different research approaches followed in the SER field, particularly Path 6, which represents a new technique in the field. To clarify the techniques for readers, the details of SER systems and their different approaches are elaborated.


