
Review Article

Published: 27 March 2023

The role of facial movements in emotion recognition

  • Eva G. Krumhuber, ORCID: orcid.org/0000-0003-1894-2517 1,
  • Lina I. Skora, ORCID: orcid.org/0000-0002-1323-6595 2,3,
  • Harold C. H. Hill, ORCID: orcid.org/0000-0002-6800-7204 4 &
  • Karen Lander, ORCID: orcid.org/0000-0002-4738-1176 5

Nature Reviews Psychology volume 2, pages 283–296 (2023)


Subjects: Human behaviour

Most past research on emotion recognition has used photographs of posed expressions intended to depict the apex of the emotional display. Although these studies have provided important insights into how emotions are perceived in the face, they necessarily leave out any role of dynamic information. In this Review, we synthesize evidence from vision science, affective science and neuroscience to ask when, how and why dynamic information contributes to emotion recognition, beyond the information conveyed in static images. Dynamic displays offer distinctive temporal information such as the direction, quality and speed of movement, which recruit higher-level cognitive processes and support social and emotional inferences that enhance judgements of facial affect. The positive influence of dynamic information on emotion recognition is most evident in suboptimal conditions when observers are impaired and/or facial expressions are degraded or subtle. Dynamic displays further recruit early attentional and motivational resources in the perceiver, facilitating the prompt detection and prediction of others’ emotional states, with benefits for social interaction. Finally, because emotions can be expressed in various modalities, we examine the multimodal integration of dynamic and static cues across different channels, and conclude with suggestions for future research.




Acknowledgements

The authors thank A. Young for comments on an earlier draft of this manuscript.

Author information

Authors and Affiliations

Department of Experimental Psychology, University College London, London, UK

Eva G. Krumhuber

Institute for Experimental Psychology, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany

Lina I. Skora

School of Psychology, University of Sussex, Falmer, UK

Lina I. Skora

School of Psychology, University of Wollongong, Wollongong, New South Wales, Australia

Harold C. H. Hill

Division of Psychology, Communication and Human Neuroscience, University of Manchester, Manchester, UK

Karen Lander


Contributions

E.G.K., H.C.H.H. and K.L. researched data for the article. All authors wrote the article and reviewed and/or edited the manuscript before submission.

Corresponding author

Correspondence to Eva G. Krumhuber.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Psychology thanks Claus-Christian Carbon, Guillermo Recio and Disa Sauter for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Krumhuber, E.G., Skora, L.I., Hill, H.C.H. et al. The role of facial movements in emotion recognition. Nat Rev Psychol 2, 283–296 (2023). https://doi.org/10.1038/s44159-023-00172-1


Accepted: 06 March 2023

Published: 27 March 2023

Issue Date: May 2023

DOI: https://doi.org/10.1038/s44159-023-00172-1


This article is cited by

A dynamic disadvantage? Social perceptions of dynamic morphed emotions differ from videos and photos

  • Casey Becker
  • Russell Conduit
  • Robin Laycock

Journal of Nonverbal Behavior (2024)

The Predictive Role of the Posterior Cerebellum in the Processing of Dynamic Emotions

  • Gianluca Malatesta
  • Anita D’Anselmo
  • Luca Tommasi

The Cerebellum (2023)



  • Open access
  • Published: 21 May 2024

The bright side of sports: a systematic review on well-being, positive emotions and performance

  • David Peris-Delcampo 1 ,
  • Antonio Núñez 2 ,
  • Paula Ortiz-Marholz 3 ,
  • Aurelio Olmedilla 4 ,
  • Enrique Cantón 1 ,
  • Javier Ponseti 2 &
  • Alejandro Garcia-Mas 2  

BMC Psychology volume  12 , Article number:  284 ( 2024 ) Cite this article

79 Accesses

Metrics details

The objective of this study is to conduct a systematic review regarding the relationship between positive psychological factors, such as psychological well-being and pleasant emotions, and sports performance.

This study, carried out through a systematic review using PRISMA guidelines and considering the Web of Science, PsycINFO, PubMed and SPORT Discus databases, seeks to highlight the relationship between more ‘positive’ factors, such as well-being and positive emotions, and sports performance.

The keywords were decided by a Delphi method in two rounds with sport psychology experts.

Participants

There are no participants in the present research.

The main exclusion criteria were: non-sport themes; samples younger or older than 20–65 years; qualitative or other methodological designs; COVID-related studies; and journals not devoted exclusively to psychology.

Main outcome measures

We obtained an initial sample of 238 papers, which was reduced to a final sample of 11 papers.

The results obtained are intended to represent the ‘bright side’ of sports practice and its role as a complement to or mediator of the negative variables that have an impact on athletes’ and coaches’ performance.

Conclusions

There is clear recognition that acting on intrinsic motivation continues to be the best and most effective way to motivate oneself towards the highest levels of performance, a good perception of competence and a source of personal satisfaction.

Peer Review reports

Introduction

In recent decades, research in the psychology of sport and physical exercise has focused on the analysis of psychological variables that could have a disturbing, unfavourable or detrimental role, including emotions considered ‘negative’, such as anxiety/stress, sadness or anger, concentrating on their unfavourable relationship with sports performance [ 1 , 2 , 3 , 4 ], sports injuries [ 5 , 6 , 7 ] or, more generally, damage to the athlete’s health [ 8 , 9 , 10 ]. The study of ‘positive’ emotions such as happiness or, more broadly, psychological well-being, has long been postponed, although in recent years it has seen an increase that reveals a field of study of great interest to researchers and professionals [ 11 , 12 , 13 ], including work on the physiological, psychological, moral and social benefits of physical activity as portrayed in comic-book heroes such as Tintin, a team leader who can serve as a model for promoting healthy lifestyles or for seeking ‘eternal youth’ [ 14 ].

Emotions in relation to their effects on sports practice and performance rarely go in one direction, being either negative or positive—generally positive and negative emotions do not act alone [ 15 ]. Athletes experience different emotions simultaneously, even if they are in opposition and especially if they are of mild or moderate intensity [ 16 ]. The athlete can feel satisfied and happy and at the same time perceive a high level of stress or anxiety before a specific test or competition. Some studies [ 17 ] have shown how sports participation and the perceived value of elite sports positively affect the subjective well-being of the athlete. This also seems to be the case in non-elite sports practice. The review by Mansfield et al. [ 18 ] showed that the published literature suggests that practising sports and dance, in a group or supported by peers, can improve the subjective well-being of the participants, and also identifies negative feelings towards competence and ability, although the quantity and quality of the evidence published is low, requiring better designed studies. All these investigations are also supported by the development of the concept of eudaimonic well-being [ 19 ], which is linked to the development of intrinsic motivation, not only in its aspect of enjoyment but also in its relationship with the perception of competition and overcoming and achieving goals, even if this is accompanied by other unpleasant hedonic emotions or even physical discomfort. Shortly after a person has practised sports, he will remember those feelings of exhaustion and possibly stiffness, linked to feelings of satisfaction and even enjoyment.

Furthermore, the mediating role of parents, coaches and other psychosocial agents can be significant. In this sense, Lemelin et al. [ 20 ], with the aim of investigating the role of autonomy support from parents and coaches in the prediction of well-being and performance of athletes, found that autonomy support from parents and coaches has positive relationships with the well-being of the athlete, but that only coach autonomy support is associated with sports performance. This research suggests that parents and coaches play important but distinct roles in athlete well-being and that coach autonomy support could help athletes achieve high levels of performance.

On the other hand, an analysis of emotions in the sociocultural environment in which they arise and gain meaning is always interesting, both from an individual perspective and from a sports team perspective. Adler et al. [ 21 ] in a study with military teams showed that teams with a strong emotional culture of optimism were better positioned to recover from poor performance, suggesting that organisations that promote an optimistic culture develop more resilient teams. Pekrun et al. [ 22 ] observed with mathematics students that individual success boosts emotional well-being, while placing people in high-performance groups can undermine it, which is of great interest in investigating the effectiveness and adjustment of the individual in sports teams.

There is still little scientific literature in the field of positive emotions and their relationship with sports practice and athlete performance, although this approach has long had its clear supporters [ 23 , 24 ]. It is comforting to observe the significant increase in studies in this field, even as some authors (e.g. [ 25 , 26 ]) point out the need to overcome certain methodological and conceptual problems, paying special attention to the development of specific instruments and methodologies for the evaluation of well-being in the sports field.

As McCarthy [ 15 ] indicates, positive (hedonically pleasant) emotions can be the catalysts for excellence in sport and deserve a space in our research and in professional intervention to raise the level of athletes’ performance. From a holistic perspective, positive emotions are permanently linked to psychological well-being, and research in this field is necessary: firstly because of the leading role they play in human behaviour, cognition and affect, and secondly because, after a few years of international uncertainty due to the COVID-19 pandemic and wars, it seems ‘healthy and intelligent’ to encourage positive emotions in our athletes. An additional reason is that they are known to improve motivational processes, reducing abandonment and negative emotional costs [ 11 ]. In this vein, concepts such as emotional intelligence make sense and can help to identify and properly manage emotions in the sports field and to determine their relationship with performance [ 27 ], which facilitates the inclusion of emotional training programmes based on the ‘bright side’ of sports practice [ 28 ].

Based on all of the above, one might wonder how these positive emotions are related to a given event and what role each one of them plays in the athlete’s performance. Do they directly affect performance, or do they affect other psychological variables such as concentration, motivation and self-efficacy? Do they favour the availability and competent performance of the athlete in a competition? How can they be regulated, controlled for their own benefit? How can other psychosocial agents, such as parents or coaches, help to increase the well-being of their athletes?

This work aims to enhance the leading role, not the secondary, of the ‘good and pleasant side’ of sports practice, either with its own entity, or as a complement or mediator of the negative variables that have an impact on the performance of athletes and coaches. Therefore, the objective of this study is to conduct a systematic review regarding the relationship between positive psychological factors, such as psychological well-being and pleasant emotions, and sports performance. For this, the methodological criteria that constitute the systematic review procedure will be followed.

Materials and methods

This study was carried out through a systematic review using PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, considering the Web of Science (WoS) and PsycINFO databases. These two databases were selected using the Delphi method [ 29 ]. It does not include a meta-analysis because there is great data dispersion due to the different methodologies used [ 30 ].

The keywords were decided by the Delphi method in two rounds with sport psychology experts.

It was determined that the main construct was to be psychological well-being, and that it was to be paired with optimism, healthy practice, realisation, positive mood, and performance and sport. The search period was limited to papers published between 2000 and 2023, and the final list of papers was obtained on February 13, 2023. This research was conducted in two languages—English and Spanish—and was limited to psychological journals and, specifically, to articles where the sample consisted of athletes.

Each word was searched for in each database, followed by searches involving combinations of the same in pairs and then in trios. In relation to the results obtained, it was decided that the best approach was to group the words connected to positive psychology on the one hand and, on the other, those related to self-realisation/performance/health. In this way, we used parentheses to group the words (psychological well-being; or optimism; or positive mood) with the Boolean ‘or’ between them (all three refer to positive psychology); and, on the other hand, we grouped those related to performance/health/realisation (realisation; or healthy practice; or performance), separating both sets of parentheses by the Boolean ‘and’. To further filter the search, a keyword included in the title and in the inclusion criteria was added, namely ‘sport’, joined with the Boolean ‘and’. In this way, the search achieved results that combined at least one of the three positive psychology terms and one of the other three.
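To make the structure of the search string concrete, the sketch below assembles a query of the kind described above. It is only an illustration: the exact field tags and syntax differ between Web of Science and PsycINFO, and the string shown is not the authors' actual query.

```python
# Illustrative construction of the Boolean search string described above.
# The title-field tag "TI=" follows Web of Science syntax and is an assumption;
# PsycINFO uses different field codes.
positive_terms = ["psychological well-being", "optimism", "positive mood"]
outcome_terms = ["realisation", "healthy practice", "performance"]

positive_block = "(" + " OR ".join(f'"{t}"' for t in positive_terms) + ")"
outcome_block = "(" + " OR ".join(f'"{t}"' for t in outcome_terms) + ")"

# 'sport' is additionally required in the title, per the inclusion criteria.
query = f"{positive_block} AND {outcome_block} AND TI=(sport)"
print(query)
```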

Results (first phase)

The mentioned keywords were cross-matched, obtaining the combination with a sufficient number of papers. From the first research phase, the total number of papers obtained was 238. Screening was then carried out in four well-differentiated phases, which are summarised in Fig.  1 . These phases helped to reduce the original sample to a more accurate one.

Figure 1. Phases of the selection process for the final sample. Four phases were carried out to select the final sample of articles. The first phase allowed the elimination of duplicates. In the second stage, those that, by title or abstract, did not fit the objectives of the article were eliminated. Previously selected exclusion criteria were applied to the remaining sample. Thus, in phase 4, the final sample of 11 selected articles was obtained.

Results (second phase)

The first screening examined the title, and the abstract if needed, excluding papers that were duplicated, contained errors, had formal problems, had a low N or were case studies. This screening reduced the initial sample to a more accurate one of 109 selected papers.

Results (third phase)

This was followed by the second screening, which examined the abstract and full texts, excluding where necessary papers related to non-sports themes, samples that were too old or too young for our interests, papers using qualitative methodologies, articles related to the COVID period, and others published in non-psychological journals. Furthermore, papers related to ‘negative psychological variables’ were also excluded.

Results (fourth phase)

At the end of this second screening the remaining number of papers was 11. In this final phase we organised their main characteristics and their main conclusions/results in a comprehensible list (Table  1 ). Moreover, in order to enrich our sample of papers, we decided to include some articles from other sources, mainly those presented in the introduction, to support the conceptual framework of the ‘bright side’ of sports.

The usual position of researchers studying psychological variables that affect sports performance is to look for relationships between ‘negative’ variables, whether basic psychological processes or distorting cognitive-behavioural factors that are unpleasant or can be evaluated as deficiencies or problems, within a psychology for the ‘risk’ society, which emphasises the rehabilitation that stems from overcoming personal and social pathologies [ 31 ] and, lately, concern for the athlete’s mental health [ 32 ]. This seems to be true in many cases and situations and to openly contradict the proclaimed psychological benefits of practising sports (among others: Cantón [ 33 ]; Froment and González [ 34 ]; Jürgens [ 35 ]).

However, it is possible to adopt another approach focused on the ‘positive’ variables, also in relation to the athlete’s performance. This has been the main objective of this systematic review of the existing literature and far from being a novel approach, although a minority one, it fits perfectly with the definition of our area of knowledge in the broad field of health, as has been pointed out for some time [ 36 , 37 ].

After carrying out the aforementioned systematic review, a relatively low number of articles were identified by experts that met the established conditions—according to the PRISMA method [ 37 , 38 , 39 , 40 ]—regarding databases, keywords, and exclusion and inclusion criteria. These precautions were taken to obtain the most accurate results possible, and thus guarantee the quality of the conclusions.

The first clear result that stands out is the great difficulty in finding articles in which sports ‘performance’ is treated as a well-defined study variable adapted to the situation and the athletes studied. In fact, among the results (11 papers), only 3 associate one or several positive psychological variables with performance (which is evaluated in very different ways, combining objective measures with other subjective ones). This result is not surprising, since in several previous studies (e.g. Nuñez et al. [ 41 ]) using a systematic review, this relationship is found to be very weak and nuanced by the role of different mediating factors, such as previous sports experience or the competitive level (e.g. Rascado, et al. [ 42 ]; Reche, Cepero & Rojas [ 43 ]), despite the belief—even among professional and academic circles—that there is a strong relationship between negative variables and poor performance, and vice versa, with respect to the positive variables.

Regarding what has been evidenced in relation to the latter, even with these restrictions in the inclusion and exclusion criteria, and the filters applied to the first findings, a true ‘galaxy’ of variables is obtained, which also belong to different categories and levels of psychological complexity.

A preliminary consideration regarding the current paradigm of sport psychology: some recent works have already announced a swing of the pendulum in the objects of study of sport psychology, returning to the study of traits and dispositions, and even to the personality of athletes [ 43 , 44 , 45 , 46 ], and our results fully corroborate this trend. Alongside five variables present in the studies selected at the end of the systematic review, a total of three traits/dispositions were found, which were also the most repeated—optimism present in four articles, mental toughness in three, and finally perfectionism—as the representative concepts of this field of psychology, which lately, as already indicated, is significantly represented in research in this area [ 46 , 47 , 48 , 49 , 50 , 51 , 52 ]. In short, the psychological variables that finally appear in the selected articles are: psychological well-being (PWB) [ 53 ]; self-compassion, which has recently been gaining much relevance with respect to the positive attributional resolution of personal behaviours [ 54 ]; satisfaction with life (the balance between sports practice, its results, and life and personal fulfilment) [ 55 ]; the existence of approach-achievement goals [ 56 ]; and perceived social support [ 57 ]. This last concept is maintained transversally in several theoretical frameworks, such as Sports Commitment [ 58 ].

The most relevant concept, both quantitatively and qualitatively, supported by the fact that it is found in combination with different variables and situations, is not a basic psychological process but a high-level cognitive construct: psychological well-being, in its eudaimonic aspect, first defined in the general population by Carol Ryff [ 59 , 60 ] and introduced at the beginning of this century in sport (e.g., Romero, Brustad & García-Mas [ 13 ]; Romero, García-Mas & Brustad [ 61 ]). It is important to note that this concept understands psychological well-being as multifactorial (including autonomy, control of the environment in which the activity takes place, social relationships, etc.), meaning personal fulfilment through a given activity and the achievement of, or progress towards, goals and one’s own objectives, without any direct relationship with simpler concepts such as vitality or fun. PWB appears in five of the selected studies and is related to several of the other variables/traits.

The most relevant result regarding this variable is its link with motivational aspects, as a central axis that relates to different concepts, hence its connection to sports performance, as a goal of constant improvement that requires resistance, perseverance, management of errors and great confidence in the possibility that achievements can be attained, that is, associated with ideas of optimism, which is reflected in expectations of effectiveness.

If we detail the relationships more specifically, we can first review this relationship with the ‘way of being’, understood as personality traits or behavioural tendencies, depending on whether more or less emphasis is placed on their possibilities for change and learning. In these cases, well-being derives from satisfaction with progress towards the desired goal, for which resistance (mental toughness) and confidence (optimism) are needed. When, in addition, the search for improvement is constant and aiming for excellence, its relationship with perfectionism is clear, although it is a factor that should be explored further due to its potential negative effect, at least in the long term.

The relationship between well-being and satisfaction with life is almost tautological, in the precise sense that what produces well-being is the perception of a relationship or positive balance between effort (or the perception of control, if we use stricter terminology) and the results thereof (or the effectiveness of such control). This direct link is especially important when assessing achievement in personally relevant activities, which, in the case of the subjects evaluated in the papers, specifically concern athletes of a certain level of performance, which makes it a more valuable objective than would surely be found in the general population. And precisely because of this effect of the value of performance for athletes of a certain level, it also allows us to understand how well-being is linked to self-compassion, since as a psychological concept it is very close to that of self-esteem, but with a lower ‘demand’ or a greater ‘generosity’, when we encounter failures, mistakes or even defeats along the way, which offers us greater protection from the risk of abandonment and therefore reinforces persistence, a key element for any successful sports career [ 62 ].

It also has a very direct relationship with approach-achievement goals, since precisely one of the central aspects characterising this eudaimonic well-being and differentiating it from hedonic well-being is specifically its relationship with self-determined and persistent progress towards goals or achievements with incentive value for the person, as is sports performance evidently [ 63 ].

Finally, it is interesting to see that we can also find a facet or link relating to the need for human affiliation, with feeling part of a group or human collective in which we recognise others, and recognise ourselves, in the achievements obtained and in the social reinforcement they bring, as indicated by its relationship with perceived social support. This construct is very labile; in fact, it is common to find results in which the pressure of social support is hardly differentiated, for example, from that of the parents of athletes and/or their coaches [ 64 ]. However, its relevance within this set of psychological variables and traits is proof of its possible conceptual validity.

Analysing the results obtained, the first conclusion is that in no case is an integrated model based solely on ‘positive’ variables or traits obtained, since some ‘negative’ ones appear (anxiety, stress, irrational thoughts), affecting the former.

The second conclusion is that, among the positive elements, the variable of coping strategies (their use, or the perception of their effectiveness) and the traits of optimism, perfectionism and self-compassion prevail, since mental toughness and psychological well-being (which also appear as important, but are more complex in nature) are seen to draw on the aforementioned traits.

Finally, it must be taken into account that the generation of positive elements, such as resilience, or the learning of coping strategies, are directly affected by the educational style received, or by the culture in which the athlete is immersed. Thus, the applied potential of these findings is great, but it must be calibrated according to the educational and/or cultural features of the specific setting.

Limitations

The limitations of this study are those evident and common in SR methodology using the PRISMA system, since the selection of keywords (and their logical connections used in the search), the databases, and the inclusion/exclusion criteria bias the work in its entirety and, therefore, constrain the generalisation of the results obtained.

Likewise, the conclusions must—based on the above and the results obtained—be made with the greatest concreteness and simplicity possible. Although we have tried to reduce these limitations as much as possible through the use of experts in the first steps of the method, they remain and must be considered in terms of the use of the results.

Future developments

Undoubtedly, progress is needed in research to more precisely elucidate the role of well-being, as it has been proposed here, from a bidirectional perspective: as a motivational element to push towards improvement and the achievement of goals, and as a product or effect of the self-determined and competent behaviour of the person, in relation to different factors, such as that indicated here of ‘perfectionism’ or the potential interference of material and social rewards, which are linked to sports performance—in our case—and that could act as a risk factor so that our achievements, far from being a source of well-being and satisfaction, become an insatiable demand in the search to obtain more and more frequent rewards.

From a practical point of view, an empirical investigation should be conducted to see if these relationships hold from a statistical point of view, either in the classical (correlational) or in the probabilistic (Bayesian Networks) plane.
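As a purely illustrative sketch of the ‘classical (correlational)’ analysis suggested above, the snippet below computes a Pearson correlation between hypothetical well-being and performance scores; the variable names and data are invented for the example, and a Bayesian-network analysis would instead model the same variables as a directed probabilistic graph.

```python
# Toy correlational check between well-being and performance (made-up data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
wellbeing = rng.normal(5.0, 1.0, size=50)               # hypothetical PWB scores
performance = 0.4 * wellbeing + rng.normal(0, 1.0, 50)  # hypothetical performance index

r, p = pearsonr(wellbeing, performance)
print(f"r = {r:.2f}, p = {p:.3f}")
```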

The results obtained in this study, exclusively researched from the desk, force the authors to develop subsequent empirical and/or experimental studies in two senses: (1) what interrelationships exist between the so called ‘positive’ and ‘negative’ psychological variables and traits in sport, and in what sense are each of them produced; and, (2) from a global, motivational point of view, can currently accepted theoretical frameworks, such as SDT, easily accommodate this duality, which is becoming increasingly evident in applied work?

Finally, these studies should lead to proposals applied to the two fields that have appeared to be relevant: educational and cultural.

Application/transfer of results

A clear application of these results is to guide the training of sports and physical exercise practitioners, directing it towards strategies for assessing achievements, improvements and failure management that remain aligned with the enhancement of eudaimonic, intrinsic and self-determined well-being, which improves the quality of their learning and their results and also favours personal health and social relationships.

Data availability

There are no further external data.

Cantón E, Checa I. Los estados emocionales y su relación con las atribuciones y las expectativas de autoeficacia en El deporte. Revista De Psicología Del Deporte. 2012;21(1):171–6.


Cantón E, Checa I, Espejo B. (2015). Evidencias de validez convergente y test-criterio en la aplicación del Instrumento de Evaluación de Emociones en la Competición Deportiva. 24(2), 311–313.

Olmedilla A, Martins B, Ponseti-Verdaguer FJ, Ruiz-Barquín R, García-Mas A. It is not just stress: a bayesian Approach to the shape of the Negative Psychological Features Associated with Sport injuries. Healthcare. 2022;10(2):236. https://doi.org/10.3390/healthcare10020236 .


Ong NCH, Chua JHE. Effects of psychological interventions on competitive anxiety in sport: a meta-analysis. Psychol Sport Exerc. 2015;52:101836. https://doi.org/10.1016/j.psychsport.2020.101836 .

Candel MJ, Mompeán R, Olmedilla A, Giménez-Egido JM. Pensamiento catastrofista y evolución del estado de ánimo en futbolistas lesionados (Catastrophic thinking and temporary evolf mood state in injured football players). Retos. 2023;47:710–9.

Li C, Ivarsson A, Lam LT, Sun J. Basic Psychological needs satisfaction and frustration, stress, and sports Injury among University athletes: a Four-Wave prospective survey. Front Psychol. 2019;26:10. https://doi.org/10.3389/fpsyg.2019.00665 .

Wiese-Bjornstal DM. Psychological predictors and consequences of injuries in sport settings. In: Anshel MH, Petrie TA, Steinfelt JA, editors. APA handbook of sport and exercise psychology, volume 1: Sport psychology. Volume 1. Washington: American Psychological Association; 2019. pp. 699–725. https://doi.org/10.1037/0000123-035 .


Godoy PS, Redondo AB, Olmedilla A. (2022). Indicadores De Salud mental en jugadoras de fútbol en función de la edad. J Univers Mov Perform 21(5).

Golding L, Gillingham RG, Perera NKP. The prevalence of depressive symptoms in high-performance athletes: a systematic review. Physician Sportsmed. 2020;48(3):247–58. https://doi.org/10.1080/00913847.2020.1713708 .

Xanthopoulos MS, Benton T, Lewis J, Case JA, Master CL. Mental Health in the Young Athlete. Curr Psychiatry Rep. 2020;22(11):1–15. https://doi.org/10.1007/s11920-020-01185-w .

Cantón E, Checa I, Vellisca-González MY. Bienestar psicológico Y ansiedad competitiva: El Papel De las estrategias de afrontamiento / competitive anxiety and Psychological Well-being: the role of coping strategies. Revista Costarricense De Psicología. 2015;34(2):71–8.

Hahn E. Emotions in sports. In: Hackfort D, Spielberg CD, editors. Anxiety in Sports. Taylor & Francis; 2021. pp. 153–62. ISBN: 9781315781594.

Carrasco A, Brustad R, García-Mas A. Bienestar psicológico Y Su uso en la psicología del ejercicio, la actividad física y El Deporte. Revista Iberoamericana De psicología del ejercicio y El Deporte. 2007;2(2):31–52.

García-Mas A, Olmedilla A, Laffage-Cosnier S, Cruz J, Descamps Y, Vivier C. Forever Young! Tintin’s adventures as an Example of Physical Activity and Sport. Sustainability. 2021;13(4):2349. https://doi.org/10.3390/su13042349 .

McCarthy P. Positive emotion in sport performance: current status and future directions. Int Rev Sport Exerc Psychol. 2011;4(1):50–69. https://doi.org/10.1080/1750984X.2011.560955 .

Cerin E. Predictors of competitive anxiety direction in male Tae Kwon do practitioners: a multilevel mixed idiographic/nomothetic interactional approach. Psychol Sport Exerc. 2004;5(4):497–516. https://doi.org/10.1016/S1469-0292(03)00041-4 .

Silva A, Monteiro D, Sobreiro P. Effects of sports participation and the perceived value of elite sport on subjective well-being. Sport Soc. 2020;23(7):1202–16. https://doi.org/10.1080/17430437.2019.1613376 .

Mansfield L, Kay T, Meads C, Grigsby-Duffy L, Lane J, John A, et al. Sport and dance interventions for healthy young people (15–24 years) to promote subjective well-being: a systematic review. BMJ Open. 2018;8(7). https://doi.org/10.1136/bmjopen-2017-020959 . e020959.

Ryff CD. Happiness is everything, or is it? Explorations on the meaning of psychological well-being. J Personal Soc Psychol. 1989;57(6):1069–81. https://doi.org/10.1037/0022-3514.57.6.1069 .

Lemelin E, Verner-Filion J, Carpentier J, Carbonneau N, Mageau G. Autonomy support in sport contexts: the role of parents and coaches in the promotion of athlete well-being and performance. Sport Exerc Perform Psychol. 2022;11(3):305–19. https://doi.org/10.1037/spy0000287 .

Adler AB, Bliese PD, Barsade SG, Sowden WJ. Hitting the mark: the influence of emotional culture on resilient performance. J Appl Psychol. 2022;107(2):319–27. https://doi.org/10.1037/apl0000897 .


Pekrun R, Murayama K, Marsh HW, Goetz T, Frenzel AC. Happy fish in little ponds: testing a reference group model of achievement and emotion. J Personal Soc Psychol. 2019;117(1):166–85. https://doi.org/10.1037/pspp0000230 .

Seligman M. Authentic happiness. New York: Free Press/Simon and Schuster; 2002.

Seligman M, Florecer. La Nueva psicología positiva y la búsqueda del bienestar. Editorial Océano; 2016.

Giles S, Fletcher D, Arnold R, Ashfield A, Harrison J. Measuring well-being in Sport performers: where are we now and how do we Progress? Sports Med. 2020;50(7):1255–70. https://doi.org/10.1007/s40279-020-01274-z .


Piñeiro-Cossio J, Fernández-Martínez A, Nuviala A, Pérez-Ordás R. Psychological wellbeing in Physical Education and School sports: a systematic review. Int J Environ Res Public Health. 2021;18(3):864. https://doi.org/10.3390/ijerph18030864 .

Gómez-García L, Olmedilla-Zafra A, Peris-Delcampo D. Inteligencia emocional y características psicológicas relevantes en mujeres futbolistas profesionales. Revista De Psicología Aplicada Al Deporte Y El Ejercicio Físico. 2023;15(72). https://doi.org/10.5093/rpadef2022a9 .

Balk YA, Englert C. Recovery self-regulation in sport: Theory, research, and practice. International Journal of Sports Science and Coaching. SAGE Publications Inc.; 2020. https://doi.org/10.1177/1747954119897528 .

King PR Jr, Beehler GP, Donnelly K, Funderburk JS, Wray LO. A practical guide to applying the Delphi Technique in Mental Health Treatment Adaptation: the example of enhanced problem-solving training (E-PST). Prof Psychol Res Pract. 2021;52(4):376–86. https://doi.org/10.1037/pro0000371 .

Glass G. Primary, secondary, and Meta-Analysis of Research. Educational Researcher. 1976;5(10):3. https://doi.org/10.3102/0013189X005010003 .

Gillham J, Seligman M. Footsteps on the road to a positive psychology. Behav Res Ther. 1999;37:163–73. https://doi.org/10.1016/s0005-7967(99)00055-8 .

Castillo J. Salud mental en El Deporte individual: importancia de estrategias de afrontamiento eficaces. Fundación Universitaria Católica Lumen Gentium; 2021.

Cantón E. Deporte, salud, bienestar y calidad de vida. Cuad De Psicología Del Deporte. 2001;1(1):27–38.

Froment F, García-González A. Beneficios de la actividad física sobre la autoestima y la calidad de vida de personas mayores (Benefits of physical activity on self-esteem and quality of life of older people). Retos. 2017;33:3–9. https://doi.org/10.47197/retos.v0i33.50969 .

Jürgens I. Práctica deportiva y percepción de calidad de vida. Revista Int De Med Y Ciencias De La Actividad Física Y Del Deporte. 2006;6(22):62–74.

Carpintero H. (2004). Psicología, Comportamiento Y Salud. El Lugar De La Psicología en Los campos de conocimiento. Infocop Num Extr, 93–101.

Page M, McKenzie J, Bossuyt P, Boutron I, Hoffmann T, Mulrow C, et al. Declaración PRISMA 2020: una guía actualizada para la publicación de revisiones sistemáticas. Rev Esp Cardiol. 2021;74(9):790–9.

Royo M, Biblio-Guías. Revisiones sistemáticas: PRISMA 2020: guías oficiales para informar (redactar) una revisión sistemática. Universidad De Navarra. 2020. https://doi.org/10.1016/j.recesp.2021.06.016 .

Urrútia G, Bonfill X. PRISMA declaration: a proposal to improve the publication of systematic reviews and meta-analyses. Medicina Clínica. 2010;135(11):507–11. https://doi.org/10.1016/j.medcli.2010.01.015 .

Núñez A, Ponseti FX, Sesé A, Garcia-Mas A. Anxiety and perceived performance in athletes and musicians: revisiting Martens. Revista de Psicología del Deporte / Journal of Sport Psychology. 2020;29(1):21–8.

Rascado S, Rial-Boubeta A, Folgar M, Fernández D. Niveles De rendimiento y factores psicológicos en deportistas en formación. Reflexiones para entender la exigencia psicológica del alto rendimiento. Revista Iberoamericana De Psicología Del Ejercicio Y El Deporte. 2014;9(2):373–92.

Reche-García C, Cepero M, Rojas F. Efecto De La Experiencia deportiva en las habilidades psicológicas de esgrimistas del ranking nacional español. Cuad De Psicología Del Deporte. 2010;10(2):33–42.

Kang C, Bennett G, Welty-Peachey J. Five dimensions of brand personality traits in sport. Sport Manage Rev. 2016;19(4):441–53. https://doi.org/10.1016/j.smr.2016.01.004 .

De Vries R. The main dimensions of Sport personality traits: a Lexical Approach. Front Psychol. 2020;23:11. https://doi.org/10.3389/fpsyg.2020.02211 .

Laborde S, Allen M, Katschak K, Mattonet K, Lachner N. Trait personality in sport and exercise psychology: a mapping review and research agenda. Int J Sport Exerc Psychol. 2020;18(6):701–16. https://doi.org/10.1080/1612197X.2019.1570536 .

Stamp E, Crust L, Swann C, Perry J, Clough P, Marchant D. Relationships between mental toughness and psychological wellbeing in undergraduate students. Pers Indiv Differ. 2015;75:170–4. https://doi.org/10.1016/j.paid.2014.11.038 .

Nicholls A, Polman R, Levy A, Backhouse S. Mental toughness, optimism, pessimism, and coping among athletes. Personality Individ Differences. 2008;44(5):1182–92. https://doi.org/10.1016/j.paid.2007.11.011 .

Weissensteiner JR, Abernethy B, Farrow D, Gross J. Distinguishing psychological characteristics of expert cricket batsmen. J Sci Med Sport. 2012;15(1):74–9. https://doi.org/10.1016/j.jsams.2011.07.003 .

García-Naveira A, Díaz-Morales J. Relationship between optimism/dispositional pessimism, performance and age in competitive soccer players. Revista Iberoamericana De Psicología Del Ejercicio Y El Deporte. 2010;5(1):45–59.

Reche C, Gómez-Díaz M, Martínez-Rodríguez A, Tutte V. Optimism as contribution to sports resilience. Revista Iberoamericana De Psicología Del Ejercicio Y El Deporte. 2018;13(1):131–6.

Lizmore MR, Dunn JGH, Causgrove Dunn J. Perfectionistic strivings, perfectionistic concerns, and reactions to poor personal performances among intercollegiate athletes. Psychol Sport Exerc. 2017;33:75–84. https://doi.org/10.1016/j.psychsport.2017.07.010 .

Mansell P. Stress mindset in athletes: investigating the relationships between beliefs, challenge and threat with psychological wellbeing. Psychol Sport Exerc. 2021;57:102020. https://doi.org/10.1016/j.psychsport.2021.102020 .

Reis N, Kowalski K, Mosewich A, Ferguson L. Exploring Self-Compassion and versions of masculinity in men athletes. J Sport Exerc Psychol. 2019;41(6):368–79. https://doi.org/10.1123/jsep.2019-0061 .

Cantón E, Checa I, Budzynska N. (2013). Coping, optimism and satisfaction with life among Spanish and Polish football players: a preliminary study. Revista de Psicología del Deporte. 22(2), 337–43.

Mulvenna M, Adie J, Sage L, Wilson N, Howat D. Approach-achievement goals and motivational context on psycho-physiological functioning and performance among novice basketball players. Psychol Sport Exerc. 2020;51:101714. https://doi.org/10.1016/j.psychsport.2020.101714 .

Malinauskas R, Malinauskiene V. The mediation effect of perceived social support and perceived stress on the relationship between emotional intelligence and psychological wellbeing in male athletes. J Hum Kinet. 2018;65(1):291–303. https://doi.org/10.2478/hukin-2018-0017 .

Scanlan T, Carpenter PJ, Simons J, Schmidt G, Keeler B. An introduction to the Sport Commitment Model. J Sport Exerc Psychol. 1993;1(1):1–15. https://doi.org/10.1123/jsep.15.1.1 .

Ryff CD. Eudaimonic well-being, inequality, and health: recent findings and future directions. Int Rev Econ. 2017;64(2):159–78. https://doi.org/10.1007/s12232-017-0277-4 .

Ryff CD, Singer B. The contours of positive human health. Psychol Inq. 1998;9(1):1–28. https://doi.org/10.1207/s15327965pli0901_1 .

Romero-Carrasco A, García-Mas A, Brustad RJ. Estado del arte, y perspectiva actual del concepto de bienestar psicológico en psicología del deporte. Revista Latinoam De Psicología. 2009;41(2):335–47.

James IA, Medea B, Harding M, Glover D, Carraça B. The use of self-compassion techniques in elite footballers: mistakes as opportunities to learn. Cogn Behav Therapist. 2022;15:e43. https://doi.org/10.1017/S1754470X22000411 .

Fernández-Río J, Cecchini JA, Méndez-Giménez A, Terrados N, García M. Understanding olympic champions and their achievement goal orientation, dominance and pursuit and motivational regulations: a case study. Psicothema. 2018;30(1):46–52. https://doi.org/10.7334/psicothema2017.302 .

Ortiz-Marholz P, Chirosa LJ, Martín I, Reigal R, García-Mas A. Compromiso Deportivo a través del clima motivacional creado por madre, padre y entrenador en jóvenes futbolistas. J Sport Psychol. 2016;25(2):245–52.

Ortiz-Marholz P, Gómez-López M, Martín I, Reigal R, García-Mas A, Chirosa LJ. Role played by the coach in the adolescent players’ commitment. Studia Psychol. 2016;58(3):184–98. https://doi.org/10.21909/sp.2016.03.716 .


Funding

This research received no external funding.

Author information

Authors and Affiliations

General Psychology Department, Valencia University, Valencia, 46010, Spain

David Peris-Delcampo & Enrique Cantón

Basic Psychology and Pedagogy Departments, Balearic Islands University, Palma de Mallorca, 07122, Spain

Antonio Núñez, Javier Ponseti & Alejandro Garcia-Mas

Education and Social Sciences Faculty, Andres Bello University, Santiago, 7550000, Chile

Paula Ortiz-Marholz

Personality, Evaluation and Psychological Treatment Deparment, Murcia University, Campus MareNostrum, Murcia, 30100, Spain

Aurelio Olmedilla


Contributions

Conceptualization, AGM, EC and ANP.; planification, AO; methodology, ANP, AGM and PO.; software, ANP, DP and PO.; validation, ANP and PO.; formal analysis, DP, PO and ANP; investigation, DP, PO and ANP.; resources, DVP and JP; data curation, AO and DP.; writing—original draft preparation, ANP, DP and AGM; writing—review and editing, EC and JP.; visualization, ANP and PO.; supervision, AGM.; project administration, DP.; funding acquisition, DP and JP. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Antonio Núñez .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Informed consent statement

Consent for publication

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Peris-Delcampo, D., Núñez, A., Ortiz-Marholz, P. et al. The bright side of sports: a systematic review on well-being, positive emotions and performance. BMC Psychol 12 , 284 (2024). https://doi.org/10.1186/s40359-024-01769-8


Received : 04 October 2023

Accepted : 07 May 2024

Published : 21 May 2024

DOI : https://doi.org/10.1186/s40359-024-01769-8


  • Positive emotions
  • Sports performance


Sensors (Basel)


Facial Emotion Recognition: A Survey and Real-World User Experiences in Mixed Reality

The extensive possibilities for applications have made emotion recognition an unavoidable and challenging problem in computer science. Non-verbal cues such as gestures, body movement, and facial expressions convey feeling and feedback to the user. This discipline of Human–Computer Interaction relies on algorithmic robustness and on the sensitivity of the sensor to improve recognition. Sensors play a significant role in accurate detection by providing very high-quality input, hence increasing the efficiency and reliability of the system. Automatic recognition of human emotions would help in teaching social intelligence to machines. This paper presents a brief study of the various approaches and techniques of emotion recognition. The survey covers a succinct review of the databases that are considered as data sets for algorithms detecting emotions from facial expressions. Later, the mixed reality device Microsoft HoloLens (MHL) is introduced for observing emotion recognition in Augmented Reality (AR). A brief introduction to its sensors, their application in emotion recognition and some preliminary results of emotion recognition using MHL are presented. The paper then concludes by comparing the results of emotion recognition by the MHL and a regular webcam.

1. Introduction

Humans interact socially with the help of emotions, which are considered a universal language. These emotions surpass cultural diversity and ethnicity. Facial expressions convey information that would otherwise be difficult to perceive: they reveal the mental state of a person, which relates directly to their intentions or to the physical effort they are applying to a task. As a result, automatic recognition of emotion with the help of high-quality sensors is quite useful in a variety of areas such as image processing, cybersecurity, robotics, psychological studies, and virtual reality applications, to name a few. Efforts in this area are being made to gather high-quality information to meet the demands of systems that can read, process and simulate human emotions. Geometric and machine learning based algorithms for effective recognition are being refined, emphasizing emotion recognition in real time and not just under ideal laboratory conditions. Hence, building a system that is capable of both face detection and emotion recognition has been a crucial area of research.

It is a well-established fact that human beings are responsible for the depiction of six basic emotions, namely happiness, anger, surprise, sadness, fear, and disgust [ 1 ]. These primary emotions form the primary classification of the study of human emotional responses. Apart from these basic emotions, several other emotions have been considered for research. These include contempt, envy, pain, drowsiness and various micro expressions. Facial expression is seen as the primary mode of recognition of human emotion. It works on facial motion and the deformations of facial features to classify them into emotion categories. This classification is based on visual information and may not be the sole indicator of emotion. Other factors also contribute to the recognition of a person’s emotional state such as voice, body language, gestures or even the direction of the gaze. Emotion recognition, therefore, demands a more precise knowledge of all these factors together with contextual information to convey more accurate results.

Facial emotion recognition is essentially pattern recognition and involves finding regularities in the set of data being analyzed. Using these regularities, faces, as well as emotions, can be recognized. Various techniques are followed to carry out these tasks, which broadly fall into two classes: methods of parameterization and methods of recognition. The method of parameterization includes segmentation, assigning binary labels to each pixel, and detection, where a bounding box is obtained when the face is located in the given data [ 2 ]. FACS (Facial Action Coding System) is an example of this method, where facial emotions are described by the contractions of facial muscles, coded as AUs (Action Units) [ 1 , 3 ]. These geometric feature-based techniques give importance to the structural shape of facial components such as the nose, mouth, and eyes. The other method is appearance-based, where attributes such as intensities, pixel values, and histograms are considered. After exhaustive training with the help of prelabeled datasets, machine learning techniques are applied to detect emotions. Figure 1 and Figure 2 give an overall idea of the use of these machine learning and geometric feature based processes, respectively, in emotion recognition. An essential requirement for identifying emotion with a machine learning algorithm is the availability of appropriate datasets for training. Different experiments have been conducted in the literature with different sizes and types of data to check the maximum achievable accuracy of these algorithms. For building a robust system that recognizes basic emotions, some preprocessing is required, including image standardization, face detection, facial component detection, emotion extraction, and emotion matching or classification [ 4 ]. These tasks become more challenging and complicated because of the variance in several known factors. In general, the factors that have been considered include pose and lighting conditions, gender, age, and facial hair. This study provides a brief overview of the variety of available databases, and a comparison of the accuracy of the algorithms available in the literature is made for all the described databases.
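To make the FACS-based parameterization more concrete, the sketch below encodes a few commonly cited prototypical AU combinations as a simple lookup table. The exact AU sets vary between FACS/EMFACS sources, so the mapping is illustrative rather than authoritative.

```python
# Commonly cited prototypical AU combinations for some basic emotions
# (illustrative only; published FACS/EMFACS prototypes differ in detail).
EMOTION_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser, brow lowerer, lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers, upper lid raiser, jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer, lid raiser/tightener, lip tightener
}

def matches_emotion(active_aus: set, emotion: str) -> bool:
    """Return True if all prototype AUs of `emotion` are among the detected AUs."""
    return EMOTION_AUS[emotion] <= active_aus

print(matches_emotion({6, 12, 25}, "happiness"))  # True
```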

Figure 1. Face detection and emotion recognition using machine learning.

Figure 2. Face detection and emotion recognition using geometric feature-based process.

This paper presents results from several experiments, performed with the help of a popular mixed reality device known as Microsoft HoloLens (MHL) (Microsoft, Redmond, WA, USA), and a webcam for emotion recognition. The emotions considered for this experimentation include anger, neutral, happiness, sadness, and surprise [ 1 ]. As deformations of facial features are used to classify them into particular emotions, it is also important to consider the sensors of the device used for capturing the deformations in the faces. MHL provides the sensors that are essential in conducting experiments for facial emotion recognition. Depth sensors in these mixed reality devices have become very popular, allowing for the development of new algorithms for the identification of human pose, gestures, face, and facial expressions. MHL provides a variety of sensors such as an ambient light sensor, four microphones, one depth camera, an IMU (inertial measurement unit), four environment understanding cameras, mixed reality capture, and one 2.0 MP (Mega Pixels) photo/HD video camera. These sensors make the emotion recognition system robust against varying environmental conditions that may be difficult for other forms of emotion recognition systems (e.g., system having only a 2D camera). These sensors provide a high-quality input that assists in a superior dissection of subjects facial components, thus improving the efficiency of algorithms even in difficult lighting conditions.

1.1. Face Detection and Emotion Recognition Using Machine Learning

Facial emotion recognition [ 5 ] is a complex task, and the machine learning approach to recognizing faces requires several steps, as shown in Figure 1 ; a minimal code sketch of this pipeline follows the list below.

  • Feature selection: This stage refers to attribute selection for training the machine learning algorithm. The process includes the selection of predictors for constructing the learning system. It helps improve prediction rate, efficiency, and cost-effectiveness. Many toolkits such as Weka and scikit-learn have built-in tools for automated feature selection.
  • Feature classification: Supervised learning consists of two stages, training and classification, where training helps in discovering which features are helpful for classification. In the classification stage, new examples are presented and assigned to the classes that were learned during training.
  • Feature extraction: Machine learning requires numerical data for learning and training. During feature extraction, processing is done to transform arbitrary data, text or images, to gather the numerical data. Algorithms used in this step include principal component analysis, local binary patterns, linear discriminant analysis, independent component analysis, etc.
  • Classifiers: This is the final step in this process. Based on the inference from the features, the algorithm performs data classification. It comprises classifying the emotions into a set of predefined emotion categories or mapping to a continuous space where each point corresponds to an expressive trait. It uses various algorithms such as Support Vector Machine (SVM), Neural Networks, and Random Forest Search.
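A minimal sketch of this pipeline using scikit-learn is shown below. The random arrays merely stand in for flattened grayscale face crops; a real system would load a labelled expression dataset (such as one of the databases surveyed later) and typically use richer features than raw pixels.

```python
# Minimal feature-extraction + classification pipeline (PCA followed by an SVM),
# mirroring the stages listed above. The data here are random placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 48 * 48))       # 200 fake 48x48 face crops, flattened
y = rng.integers(0, 6, size=200)     # 6 emotion labels (happiness, anger, ...)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(PCA(n_components=50), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```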

1.2. Face Detection and Emotion Recognition Using Geometric Feature-Based Process

The geometric feature-based approach requires several steps to perform facial emotion recognition, as shown in Figure 2 [ 5 ]; a brief sketch of the first two stages follows the list below.

  • Image standardization: It includes various sub-processes such as the removal of noise from the image, making all the images uniform in size and conversion from RGB (Red, Green and Blue) to grayscale. This makes the image data available for image analysis.
  • Face detection: This phase involves detecting a face in the given image data. It aims to remove everything unwanted from the picture, such as the background, and to keep only the relevant information, the face. This phase employs various methodologies such as face segmentation techniques and curvature features. Some of the algorithms used in this step include edge detection filters such as Sobel, Prewitt, Laplacian, and Canny.
  • Facial component detection: Here, regions of interests are detected. These regions vary from eyes to nose to mouth, etc. The primary step is to localize and track a dense set of facial points. This step is necessary as it helps to minimize the errors that can arise due to the rotation or the alignment of the face.
  • Decision function: After the facial feature points have been tracked, for example with a localized Lucas-Kanade optical flow tracker [ 6 ], the decision function is responsible for detecting the emotion displayed by the subject. These functions make use of classifiers such as AdaBoost and SVM for facial emotion recognition.
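The snippet below sketches the first two stages (image standardisation and face detection) with OpenCV. The random frame is only a placeholder for a camera image, and a real pipeline would continue with landmark tracking (for example, Lucas-Kanade optical flow via cv2.calcOpticalFlowPyrLK) and a trained classifier.

```python
# Image standardisation and face detection with OpenCV; later stages
# (facial component detection and the decision function) are omitted.
import cv2
import numpy as np

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame

# Standardisation: convert to grayscale and normalise the size.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (320, 240))

# Face detection with a pre-trained Haar cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    roi = gray[y:y + h, x:x + w]       # facial region of interest
    edges = cv2.Canny(roi, 100, 200)   # edge map feeding later feature extraction
    print("face at", (x, y, w, h), "edge pixels:", int(np.count_nonzero(edges)))
```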

1.3. Popular Mixed Reality Device: Microsoft HoloLens (MHL)

In this paper, we introduce a popular mixed reality device called Microsoft HoloLens (MHL), shown in Figure 3 . Mixed reality (MR) devices provide users with a combined experience of both Virtual Reality (VR) and Augmented Reality (AR). Holograms are used for interaction with the surroundings and help to create a digital world in the nearby environment, thus forming a self-contained computer that is holographic in nature. AR is experienced when a computer generates sounds, graphics, photos and videos and displays them in 3D in our world, all through the sensors. It is a blend of the virtual with the real world to enhance the perception of reality; in this way, the actual view of the world is modified by the computer to show different things. The main difference between VR and AR is that virtual reality substitutes the real world with a simulated holographic world, and hence none of the surroundings are seen, whereas AR operates in real time and is interactive with the environment, bringing digital components into the real world as observed by the human. Hence, MHL can give us most of the experiences that were first available either only in AR or only in VR.

Figure 3. Popular mixed-reality device (MRD): Microsoft HoloLens (MHL).

1.4. Sensor Importance in Mixed Reality Devices for Emotion Recognition

Even the slightest change in expression must be detected to recognize the emotion portrayed by a person, so efficient and high-quality sensors are essential for measuring even an exiguous change in expression. MHL contains sensors such as an inertial measurement unit (IMU), which is responsible for keeping track of the orientation, gravitational force and velocity of the device. The primary function of the IMU in mixed reality is to perform rotational tracking for input devices and HMDs (head-mounted displays); it measures rotational movements of yaw, pitch, and roll. For facial emotion recognition, algorithms that use the video camera are susceptible to elements such as the facial illumination environment (caused by bad lighting or an unevenly lit scene), which affects the accuracy of the algorithm. A possible solution to this problem is to have a sensor with scene depth information. The sensors in the depth camera use additional light source information, which, in turn, makes capturing the face and emotion easier in real time, as they are insensitive to ambient lighting conditions. The grayscale sensing cameras complement the depth camera and keep track of head movement, hands, and environment understanding. Other capabilities, like mixed reality capture, help in displaying the captured emotions with ease, which is otherwise burdensome for other devices. To enhance emotion recognition from the face along with other aspects such as voice, it also contains four microphones. The combination of all these sensors in this mixed reality device adds up to a more efficient system, yielding more accurate results, as will be shown in the experimentation section of the paper.

1.5. MHL Experimentation

MHL shows a merged world; that is, virtual objects are merged with the places and people in the surroundings. MHL is a multi-sensor device whose sensors include one depth camera, four environment cameras, and a light sensor. For human understanding, MHL provides spatial sound, gaze tracking, gesture input, voice support, built-in speakers, a 3.5 mm audio jack, volume up/down and power buttons, etc., which make the HoloLens more interactive and increase its usefulness. It requires neither a PC connection nor any wires, which makes it extremely portable and self-contained.

In this study, the importance of the sensors in the MHL is highlighted by using the device to observe how accurately an emotion recognition algorithm identifies human emotions. An application has been built that uses MHL to detect faces and recognize the emotion of the person facing it. The emotions recognized are Happiness, Sadness, Anger, Surprise, and Neutral. These results are compared with the results of the same facial emotion recognition experiment performed using a simple webcam.

1.6. Closest Competitors of MHL

Ever since the first such project, Google Glass, came into the picture, wearable eyewear has seemed inevitable. Since then, developers around the world have been trying to build different MR devices. They share an underlying goal: to provide users with live information, let them view and project data in high resolution, and increase security and learning capabilities across various fields. Devices already on the market today include Meta, Google Glass, Meta 2, Google Daydream, Sony PlayStation VR, etc. To the best of our knowledge, the closest competitors of MHL are Google Glass and Meta 2 [ 7 ].

Google Glass is an eyeglass in which the lens is replaced by an optical head-up display that communicates with the Internet through natural language commands; it is essentially a smartphone in the form of glasses. MHL, in contrast, is designed to recognize the wearer's vocal communication, eye movements, hand gestures, etc., which Google Glass cannot do. Google Glass was in effect a repackaged smartphone that many regarded as novel but overly expensive, which gives Microsoft HoloLens a brighter future. The Meta 2 also feels like a complete AR experience: it offers the widest field of view, direct interaction with holograms, and intuitive access to digital information through a neuroscience-driven interface. Although it shares only a few similarities with Meta 2, MHL has proved to be better. It offers environment understanding, human interaction with holograms, speech control in the form of Cortana and, most importantly, tetherless operation, which, as of today, is not possible with Meta 2. MHL has chosen portability over processing power, which makes it possible to use the device practically anywhere.

The structure of the paper is as follows. Section 2 gives a brief insight into techniques used for facial emotion recognition, mentions the algorithms used, and describes the types of databases used as datasets. Section 3 discusses our motivation for gathering this information and performing the experiments. Section 4 describes the face detection and emotion recognition experiments performed with the MHL and a webcam and the results obtained. Section 5 concludes the literature survey and the experimentation and presents future work.

2. Literature Survey

In the 19th century, the value of emotion recognition was acknowledged when Charles Darwin wrote "The Expression of the Emotions in Man and Animals" [ 8 ], a book that greatly inspired the study of emotions. Emotion recognition has since gained immense importance owing to its many applications; for example, a drowsy driver can be spotted using an emotion recognition system [ 9 ]. Corneanu et al., Matsugu et al., and Viola et al. [ 8 , 10 , 11 ] gave a primary classification for emotion recognition using multimodal approaches, mainly discussing the techniques and parameters used for emotion recognition. The methods covered included localization of the face using detection and segmentation, which made use of Support Vector Machine (SVM) and Convolutional Neural Network (CNN) algorithms. In addition, Corneanu et al. focused on a categorization of emotion recognition built on two principal components: parametrization and recognition of facial expressions. In their work, parametrization described the detected emotions, while recognition of facial expressions was accomplished using algorithms such as Viola–Jones. That study also experimented with other algorithms such as CNN [ 12 ] and SVM [ 13 ], and concluded that CNN achieves comparatively better accuracy than the Viola–Jones algorithm.

Matsugu et al. [ 10 ] developed the first facial ER model. The developed system was claimed to be robust to changes in appearance and independent of the subject. They used a CNN model to find local differences between a neutral and an emotional face. A single-structure CNN was used in the experiments instead of two CNN models, in contrast to Fasel’s model [ 14 ], which had two independent CNNs, one for facial expression and the other for face identity recognition, combined by an MLP. The experiment was performed with images of various types and achieved a performance rate of 97.6% on 5600 still images of 10 subjects. Tanaya et al. [ 15 , 16 ] applied curvelet-based feature extraction, taking advantage of the ability of curvelets to represent discontinuities in 2D functions. In their work, they converted the images to grayscale and then successively reduced them from 256 gray levels to 16, 8, and 4 levels; the curvelet coefficients were then used to train the algorithm. The rationale for this pipeline was that, if a person's face is not recognized in the 8-bit image, it can still be recognized from the larger curves that remain at the lower bit resolutions. Finally, a One-Against-All (OAA) SVM was applied, and the results of wavelet- and curvelet-based methods were compared on several well-known databases, in which the curvelet method showed higher performance than the wavelet methods.

Li et al. [ 17 ] noted that emotion recognition is based entirely on visual information. They conducted an experiment on the recognition of smiles, in which subjects were asked to depict smiles, and compared 3D and 2D emotion recognition. For registration, the symmetric property of the face was used, and cubic spline interpolation was applied to fill holes in the image caused by dark hair over the face. Feature extraction was done using Principal Component Analysis (PCA), and the LIBSVM (a library for SVM) package was used together with a linear discriminant classifier to execute SVM for emotion recognition. Both Linear Discriminant Analysis (LDA) and SVM achieved a performance rate of more than 90% for 3D images and around 80% for 2D images.
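As a rough illustration of this kind of pipeline (not the authors' exact code), the following Python sketch combines PCA-based feature extraction with an SVM classifier using scikit-learn; the image size, number of components, and split ratio are arbitrary assumptions.

# Minimal PCA + SVM pipeline for facial emotion classification (illustrative only).
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# X: flattened grayscale face crops, e.g. 64x64 pixels -> 4096 features per row
# y: integer emotion labels
def build_and_evaluate(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = make_pipeline(PCA(n_components=50), SVC(kernel="linear"))
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)  # classification accuracy on held-out faces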

Anil et al. [ 4 ] made a succinct survey of the techniques used for emotion recognition along with the accuracies measured on various databases, including a brief comparison between 2D and 3D techniques. A standard classification was used in which the algorithms fall into the categories of geometric-feature-based and appearance-based methods [ 18 ]. The methods reviewed were Gabor filters [ 19 ], Local Directional Number Pattern [ 20 ], Patched Geodesic Texture [ 15 ], curvelet feature extraction [ 4 ], FARO: Face Recognition Against Occlusions and Expression Variations [ 21 ], and gradient feature matching for expression-invariant face recognition from a single reference image [ 22 ]. In this approach, both local and holistic features were taken into account to achieve a higher performance rate. The databases used in that study included the CMU Advanced Multimedia Processing (AMP) database, the AR database, the Cohn–Kanade (CK) database [ 23 ], the Yale database, the Japanese Female Facial Expression (JAFFE) database [ 24 ], and the Binghamton University 3D Facial Expression (BU-3DFE) database [ 25 , 26 ]. Their work concluded that the Bag of Words and FARO methods could recognize emotion even in the presence of occlusions. Matthews and Baker [ 27 ] later extended this line of work by distinguishing two types of Active Appearance Models (AAMs): independent AAMs, which model the shape and the appearance of deformable objects separately, and combined AAMs, which use a single set of parameters for both shape and appearance.

Mohammed et al. [ 19 ] introduced a new algorithm combining Bilateral Two-Dimensional Principal Component Analysis (B2DPCA) and an Extreme Learning Machine (ELM), together with a curvelet-based feature extractor. Extensive experimentation was performed on databases such as FERET [ 28 ], Faces94 [ 29 ], JAFFE [ 24 ], Georgia Tech [ 30 ], Sheffield [ 31 ], ORL [ 32 ], and YALE [ 33 ]. The curvelet features were dimensionally reduced with B2DPCA and then fed to the ELM for learning, and emotion recognition was achieved at a very high rate. The method was largely insensitive to the number of hidden neurons and to the size of the training dataset.

Rivera et al. [ 20 ] presented the Local Directional Number Pattern (LDN), a method capable of producing a more discriminative code than many other existing methods. Using a compass mask, the structure of each micropattern responsible for extracting directional information was computed. They also used the indices of the prominent directions, together with their signs, to help distinguish similar structural patterns with different intensities. These methods were tested under various conditions such as noise, time lapse, and illumination changes, and the accuracies of LDN for different expressions were observed and compared. They analyzed the use of different compass masks (a derivative-of-Gaussian mask and the Kirsch mask) for extracting directional information and measured their performance in multiple applications. The study suggested that the LDN approach to face and expression recognition is very robust and reliable under different lighting conditions and can recognize even subtle emotions.
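To make the idea of a directional number code concrete, here is a simplified Python sketch in the spirit of LDN with Kirsch compass masks; the mask construction, block grid, and code layout are illustrative assumptions rather than the exact published formulation.

# Simplified LDN-style encoding: each pixel is coded from the directions of its
# strongest positive and strongest negative Kirsch edge responses.
import numpy as np
from scipy.ndimage import convolve

def kirsch_masks():
    # Border positions of a 3x3 neighborhood in clockwise order; rolling the
    # weights around the border yields the eight compass masks.
    border = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    weights = [5, 5, 5, -3, -3, -3, -3, -3]
    masks = []
    for shift in range(8):
        m = np.zeros((3, 3))
        for (r, c), w in zip(border, np.roll(weights, shift)):
            m[r, c] = w
        masks.append(m)
    return masks

def ldn_code(gray):
    # gray: 2D float array. Returns an integer code image with values in [0, 63].
    responses = np.stack([convolve(gray, k, mode="nearest") for k in kirsch_masks()])
    top = responses.argmax(axis=0)      # direction of strongest positive response
    bottom = responses.argmin(axis=0)   # direction of strongest negative response
    return top * 8 + bottom             # 6-bit directional number per pixel

def ldn_histogram(gray, grid=4):
    # Concatenate per-block histograms of the code image as the face descriptor.
    codes = ldn_code(gray)
    h, w = codes.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = codes[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            feats.append(np.bincount(block.ravel(), minlength=64))
    return np.concatenate(feats).astype(float)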

Kahou et al. [ 34 ] worked on a hybrid CNN–RNN (Recurrent Neural Network) architecture for emotion recognition, in which CNN frame features were aggregated by an RNN. The paper explored three CNN structures:

  • a very deep one with 3 × 3 filters;
  • a three-layer network with 5 × 5 filters; and
  • a third one with the filter size increased to 9 × 9.

Their work used the RNN to aggregate the frame features; the main reason for doing so was that an RNN can learn from an event irrespective of when it occurs in a sequence. They performed both feature-level and decision-level fusion, which provided a significant improvement. Feature-level fusion was done using an MLP with separate hidden layers for each modality, while decision-level fusion used a weighted sum of the estimated class probabilities. This CNN–RNN aggregation architecture outperformed the other methods considered, such as averaging per-frame classifications.
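A minimal Keras-style sketch of this kind of hybrid (frame-wise CNN features aggregated by a recurrent layer) might look as follows; the layer sizes, sequence length, and number of emotion classes are assumptions for illustration, not the published configuration.

# Illustrative CNN -> RNN aggregation over a short sequence of face frames.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7        # e.g. basic emotions; an assumption for the sketch
SEQ_LEN, H, W = 16, 48, 48

frame_cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(H, W, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
])

model = models.Sequential([
    layers.TimeDistributed(frame_cnn, input_shape=(SEQ_LEN, H, W, 1)),
    layers.LSTM(64),                              # aggregates per-frame features over time
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])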

Enrique Correa et al. [ 35 ] evaluated three deep neural network architectures for emotion recognition, out of which the best one was chosen for further optimization.

  • Their first architecture was based on Krizhevsky and Hinton [ 36 ]; it consisted of three convolutional layers followed by two fully connected layers. Image size was reduced through max-pooling, and a dropout layer was added to counter overfitting.
  • In the second architecture, instead of two fully connected layers, they applied three fully connected layers, with local normalization to speed up the process.
  • The third architecture had three different layer types: a convolution layer, a local contrast normalization layer, and a max-pooling layer; in later stages, a third max-pooling layer was added to reduce the number of parameters.

The accuracies achieved were 63%, 53%, and 63%, respectively, for the three architectures above. The paper observed that reducing the network size degraded performance more than expected; hence, the second architecture was not competitive with the other two. Yu et al. [ 37 ] proposed a technique in which face detection was performed by three state-of-the-art face detectors followed by an ensemble of deep CNN models, each with seven hidden layers. The CNNs were randomly initialized and then trained on a larger dataset, and two methods were used to combine them: minimizing the hinge loss and minimizing the log-likelihood loss. The experiments were performed on the Static Facial Expressions in the Wild (SFEW) database [ 38 ], and an SVM was trained and tested on the output responses of the concatenated network; the results are shown in Table 1. Shan et al. [ 39 ] showed that Local Binary Pattern (LBP) features can be derived faster than Gabor wavelets: the discriminative features are stored in compact representations, and more discriminative LBP features are learned with AdaBoost (Adaptive Boosting). Feeding these discriminative features to an SVM was shown to give the best recognition; the study also evaluated classifiers other than SVM, such as linear discriminant analysis and template matching. Zhang et al. [ 40 ] improved on this LBP-feature work [ 39 ] and also outperformed Gabor-filter-based emotion recognition, using a regression-based AU intensity detector, emotion clustering for recognition of emotions, and an unsupervised facial point detector. Their facial point detector outperformed AAM and the Constrained Local Model (CLM) by 13% and 9%, respectively. A summary of the literature survey is presented in Table 1, which gives a brief idea of the accuracies achieved with different methods and databases; the table lists the technique used, the database used, the accuracy achieved, and the drawbacks.
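For readers unfamiliar with LBP features, the following Python sketch (using scikit-image, with an arbitrary block grid and parameter choice) shows how an LBP histogram descriptor of the kind fed to AdaBoost or an SVM can be computed; it is illustrative only.

# Illustrative LBP histogram descriptor for a grayscale (uint8) face image.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_descriptor(gray, points=8, radius=1, grid=6):
    # Uniform LBP code image; codes fall in [0, points + 1] for the 'uniform' method.
    codes = local_binary_pattern(gray, P=points, R=radius, method="uniform")
    n_bins = points + 2
    h, w = codes.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            block = codes[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))   # per-block normalized histogram
    return np.concatenate(feats)  # concatenated descriptor for a classifier such as SVM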

Table 1. Comparison of the accuracies achieved through various techniques.

2.1. Database Description

Databases form an essential part of the training of machine learning algorithms. Approaches using supervised learning for emotion classification require a vast and varied dataset for proper training and testing, so using proper training and testing datasets is of great importance. The correctness of the results is strongly affected by whether the database is abundant and highly accurate. Even though many databases of facial images are available for face recognition [ 41 ], fewer databases have been created that support recognition of facial emotions. Most of the available databases contain 2D images, although some offer 3D images captured through stereography or 3D scanners; 3D images improve recognition accuracy even under uneven lighting conditions. These various classes of databases are used to train algorithms on human facial expressions for recognizing emotions.

The images/video recordings in the databases can be divided into two major categories: spontaneous and posed datasets. Each of them has advantages and disadvantages.

  • Posed Datasets: Popular for capturing extreme emotions. The disadvantage is the artificial human behavior.
  • Spontaneous Datasets: Natural human behavior. However, capturing the right emotion is extremely time-consuming.

Thus, we can conclude that a database for facial emotion recognition should contain a combination of both categories. A close relationship exists between the advancement of face recognition algorithms and the availability of face databases. This section describes in depth the various RGB databases, thermal image databases, and 3D image databases and their characteristics.

2.1.1. RGB Databases

Many databases are available online for public use in facial emotion recognition experiments, but only a few of them contain 3D information. The first comprehensive database to be made public was the Cohn–Kanade (CK) database [ 23 ]. The CK database was organized into sequences, and each sequence was labeled with the emotion the subject was asked to express and with the facial movements observed using FACS [ 1 , 3 ]. A major drawback of this database was that the sequences were not verified against the facial emotions they actually contained: the emotion label referred to what expression was requested rather than what was actually performed. The CK database had only frontal views of the subjects and little variety in gender and age, and its illumination conditions were similar across recordings. The biggest drawback of this RGB dataset is the lack of intensity labels. Hence, modifications were made, and the database was extended to CK+ [ 53 ]. CK+ increased the number of samples and added spontaneous expressions of the recorded subjects: 593 sequences were added (327 of which have discrete emotion labels), with more frames per sequence. CK+ provides FACS coding with validated emotion labels, contains both posed and spontaneous facial emotions, and has a frame resolution of 640 × 490.

JAFFE [ 24 ] is a database of static images created in laboratory conditions with acted emotions. It consists of 213 images of seven facial expressions enacted by 10 Japanese female models, with a frame resolution of 256 × 256 and facial expression labels; it contains only posed expressions. Dataset quality improved with the creation of the MMI datasets [ 54 ] (named after their creators Maja Pantic, Michel Valstar, and Ioannis Patras). The MMI datasets cover many of the AU expressions of FACS and include 43 subjects (1280 videos and over 250 images) with a frame resolution of 720 × 526. The MMI database provides an AU label for the image frame with the apex facial expression in each image sequence and contains both posed and spontaneous facial expressions. Among the RGB databases, JAFFE is entirely grayscale, whereas MMI is in color. The MULTI-PIE [ 55 ] database offers multiple views of subjects at various angles and adds a variety of illumination conditions. It comprises more than 750,000 images of 337 subjects recorded in up to four sessions over a span of five months, including high-resolution frontal-view images. The 337 subjects were imaged under 19 illumination conditions while displaying a range of facial expressions, and from 15 viewpoints; in total, the database contains more than 305 GB of face data. The AFEW [ 56 ] database covers six primary emotions and additionally annotates age, pose, and gender. In comparison to AFEW, SEMAINE [ 57 ] provides labels for shakes and nods, laughs, and states while interacting with agents. The SEMAINE database consists of 150 participants and a total of 959 conversations conducted with individual Sensitive Artificial Listener (SAL) agents, each lasting approximately five minutes. SEMAINE also includes FACS-related information, such as laughter identification and annotation on selected extracts.

2.1.2. Thermal Databases

Presently, thermal databases are used for a variety of purposes, such as target acquisition in the military and medical diagnosis. Thermal IR sensors are responsible for capturing the heat patterns emitted by objects; hence, thermal infrared (IR) imagery does not depend on ambient lighting conditions, making it a promising alternative for recognition of facial emotions. Thermal databases are widely used partly because many of them also contain the data in RGB form. Thermal databases, however, are few in number, and most facial expression databases lack thermal information. One further drawback is that the existing thermal databases include only posed facial images. The very first ones contained only three expressions (laugh, surprise, and angry) with different head poses under different illuminations; these are the Imaging, Robotics and Intelligent System (IRIS) [ 58 ] and National Institute of Standards and Technology (NIST) [ 59 ] databases. The IRIS database consists of images of 30 individuals (28 men and two women) with a resolution of 320 × 240. Two different sensors were used to create this database: a thermal sensor, the Raytheon Palm-IR-Pro (Raytheon, Waltham, Massachusetts, USA), and a visible-light sensor, the Panasonic WV-CP234 (Panasonic, Kadoma, Osaka Prefecture, Japan). Eleven images per rotation were recorded, yielding 176–250 images per person; of the 3058 images in total, 1529 are thermal. NIST consists of 1573 individual images, of which 78 are of females and the rest of males, and it includes side and front profiles.

The Natural Visible and Infrared Facial Expression (NVIE) [ 60 ] database provides a collection of six expressions in total, both posed and spontaneous, where the posed expressions were captured both with and without glasses. It includes more than 100 subjects, recorded simultaneously by a thermal infrared camera and a visible camera, with illumination from three different positions. The drawbacks of this database are that not all six recorded expressions are spontaneous and that the gap between video clips is only 1–2 min, which is too short for a person to return to a neutral state. The Kotani Thermal Facial Emotion (KTFE) [ 61 ] database, like NVIE, includes both spontaneous and triggered versions of the emotions. It contains 26 Vietnamese, Japanese, and Thai subjects aged 11 to 32 years, covers seven emotions, and comprises 131 GB of visible and thermal facial emotion videos.

2.1.3. 3D Databases

It is difficult to handle subtle facial expression changes and large pose variations in 2D image-based databases. 3D facial expression databases therefore come into the picture to facilitate the recognition of subtle emotions and improve the overall accuracy of algorithms. One of the most widely used 3D databases for emotion recognition is the BU-3DFE [ 25 ] database. It comprises 100 subjects (56% female, 44% male) and 2500 facial expression models. Because its subjects include Indian, Latino, White, and Middle-East Asian participants, it is considered a well-formed database with a variety of ethnic/racial ancestries. Six different expressions were depicted at four different intensity levels. The database is high resolution and includes videos from 101 subjects, each consisting of more than a hundred frames. The Bosphorus database [ 62 ] offers many expressions, at the cost of lower ethnic diversity of subjects in comparison to BU-3DFE. It has systematically varied poses with different types of occlusions and some unique properties, including a rich variation of head pose, face occlusion, and facial expressions composed of a judiciously selected subset of Action Units. However, both Bosphorus and BU-3DFE contain only posed expressions.

3. Motivation

The term Augmented Reality (AR) only came into existence in the 1990s, although the related Virtual Reality (VR) technology had been patented as early as 1962. These terms are now gaining importance as the market has started developing devices based on AR and VR; however, not much work has been done with these devices in real time. Research is underway on the use of such devices for various applications, including judging suspicion by the military, helping medical students learn surgery, emotional branding to create an emotional connection with customers, and observing behavior in Congress during discussion of critical issues. Automated verification of human identity is indispensable in security and surveillance applications these days. Biometric authentication schemes based on video modalities are non-intrusive and are therefore more suitable in real-world settings than intrusive methods such as fingerprint and retina scans. This forms the basis of our research on observing and detecting emotions through facial expressions in AR. Facial expressions constitute the prime source for emotion recognition, supported in some cases by other modalities such as speech or physiological signals. A high-accuracy algorithm supported by exemplary sensors (providing high-grade input data) can achieve useful results for recognizing emotions in real time in AR. With a whole new world of mixed reality devices gearing up to astonish the world with their multiple uses, testing emotion recognition algorithms with such devices plays a crucial part in research for both developers and users. For this study, the mixed reality device introduced by Microsoft, the Microsoft HoloLens (MHL), has been used.

MHL is more than just Augmented Reality (AR) and Virtual Reality (VR); it is about mixed reality. Microsoft HoloLens introduces a new type of computing, called holographic computing, built on high-quality sensors. Microsoft is inviting people to build applications for the HoloLens in the fields of 3D, medicine, and STEM learning (STEM education is an interdisciplinary approach to learning in which rigorous academic concepts are coupled with real-world lessons as students apply science, technology, engineering, and mathematics in contexts that connect school, community, and work). Using a mixed reality device for emotion recognition opens the door to practical purposes such as cyber-security and learning the mental state of a person. Emotion detection can help uncover concealed emotions when a person is being sarcastic, happy, sad, or has suspicious intentions. MHL offers one of the most influential factors for experimenting with such MR devices: excellent sensors. High-quality, varied sensors allow the development of new systems for the recognition of human pose, gestures, faces, and facial expressions. With the help of these sensors, existing algorithms (based, e.g., on an RGB camera) can be tweaked to improve recognition efficiency in difficult illumination conditions along with better localization of facial parts. MHL has many high-quality sensors, including one depth camera, four environment cameras, and light sensors. For human understanding, MHL provides spatial sound, gaze tracking, gesture input, voice support, built-in speakers, a 3.5 mm audio jack, volume up/down and power buttons, etc., which makes the HoloLens more interactive and increases its usefulness. Facial emotion recognition with Microsoft HoloLens is more effective because only the observer has to wear the MHL; no sensors need to be placed on the subject's body. An experiment was performed with this MR device and a webcam to compare the achieved accuracy.

4. Experimental Results

This section presents an experiment performed with the MHL and a webcam, followed by a brief comparison supported by the results.

4.1. Experimental Setup

The setup for this experiment included a Microsoft HoloLens, a webcam, and a laptop/desktop running Windows 10 (Microsoft, Redmond, WA, USA). The standard webcam was used with the desktop or laptop, while the Microsoft HoloLens runs its own Windows 10 operating system. The webcam used a single camera, whereas the MHL uses six individual cameras to capture images from various angles and then combines them [ 63 ]; because of this merging, the best exposure of each image section makes MHL pictures better in quality. Three people participated as subjects in this experiment, depicting five different emotions with no change in the lighting. In addition to those 15 images taken by MHL and 15 images taken by the webcam, a dataset of 200 images (100 taken by MHL and 100 taken by the webcam) was included to provide concrete comparison results. The experiment was performed under normal laboratory conditions, without altering the background. All light sources that maintained ambient lighting in the laboratory were switched on while conducting the experiment. We also captured the emotions from various angles and in different environments in the lab in order to reduce the effects of rotation and background. Likewise, some emotions were captured with the subject standing and some while sitting on a chair. Emphasis was given to capturing poses as natural as possible in a typical environment.

4.2. MHL Results

The experiment focused on detecting the emotions of the subjects under consideration. The first step of the process was to detect the face and then the facial expression of the person. The analysis was done using the existing Microsoft Application Programming Interface (API) titled "Emotion API" [ 64 ], which takes an input image and places a bounding box around the face in the image. The API uses Microsoft's cloud-based emotion recognition algorithm. While the three subjects enacted the emotions, the experimenter wearing the Microsoft HoloLens recorded the video; the processing was done on the Microsoft HoloLens, and the detected emotion was displayed. The five emotions considered for this experiment are happiness, sadness, anger, surprise, and neutral. The final result images displayed attributes such as gender, age, glasses, beard, and emotion. Unity (Unity Technologies, San Francisco, CA, USA) and Visual Studio Community 2015 were the platforms used for this experiment.
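For orientation, a call to a cloud emotion recognition endpoint of this kind could look roughly like the Python sketch below. The URL and subscription key are placeholders, and the original Emotion API has since been retired in favor of the Azure Face service, so the exact endpoint and response format may differ from what is shown.

# Hypothetical sketch of posting an image to a cloud emotion recognition API.
import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com/emotion/v1.0/recognize"  # placeholder
SUBSCRIPTION_KEY = "<your-key>"  # placeholder

def recognize_emotions(image_path):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/octet-stream",
    }
    with open(image_path, "rb") as f:
        response = requests.post(ENDPOINT, headers=headers, data=f.read())
    response.raise_for_status()
    # Expected shape at the time of the study: a list of faces, each with a
    # bounding box ("faceRectangle") and per-emotion scores ("scores").
    return response.json()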

4.3. Webcam Results

Using a webcam, a similar experiment was performed with the same three subjects for comparison. The same existing API was used: it takes an input image and returns a box marking the detected face. The basic emotions are communicated cross-culturally and universally via the same facial expressions, which is how they are identified by the Emotion API [ 64 ], using Microsoft's cloud-based emotion recognition algorithm. Using the webcam, a picture of the subject depicting the expression was taken and then processed. For the sake of comparison, the same attributes were considered for all subjects, i.e., the values of expressions such as happiness, sadness, etc.

4.4. Analysis

The MHL and a webcam were used to detect emotions with the Microsoft Emotion API. The results are shown in Table 2, which compares the emotions identified under the same environmental conditions. The result from the standard webcam is tabulated as the probability of each emotion, where a value of one indicates absolute certainty for a particular emotion. The results were displayed for all five emotions, and the one with the highest probability was taken as the observed emotion. A few additional emotions (such as disgust and contempt) are supported by the API but were not considered in this study because they were too hard to enact. In a few cases, the probability was very high and the detected emotions showed high accuracy; this can be observed in the first image of Table 2, where the emotion happy is detected with a probability of 0.99999. In other cases, the probability of the identified emotion was ambiguous and far from 1; however, it was still reasonably high (>0.5), so the category of the depicted emotion could be inferred. For instance, image 15 of Table 2 shows that the emotion was not recognized very accurately, but, among the detected emotions, the probability of the emotion angry is sufficiently higher than the others (0.575), and hence the emotion angry can be inferred.
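The decision rule described above can be summarized in a few lines of Python; the 0.5 cut-off reflects the discussion above and is an assumption, not a documented API threshold.

# Pick the most probable emotion from an API-style score dictionary.
def pick_emotion(scores, min_confidence=0.5):
    # scores: e.g. {"happiness": 0.99, "sadness": 0.002, "anger": 0.001, ...}
    emotion, prob = max(scores.items(), key=lambda kv: kv[1])
    return (emotion, prob) if prob >= min_confidence else ("uncertain", prob)

print(pick_emotion({"happiness": 0.1, "sadness": 0.2, "anger": 0.575,
                    "surprise": 0.05, "neutral": 0.075}))
# -> ('anger', 0.575)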

Table 2. Comparison of emotion recognition using the HoloLens and a webcam.

In the case of the HoloLens, the depicted emotions were recognized with remarkable accuracy. Here, the result was not given as probabilities of the detected emotions; instead, a single emotion was reported. The device was also able to identify certain other features of the subjects, such as gender, age, mustache, and beard. The square boxes seen in the MHL section of Table 2 mark the recognized faces of the subjects. Although some of these attribute estimates were not very accurate (such as age), others, such as gender and mustache, were determined fairly accurately. Compared with the standard webcam, the emotions of the subjects recognized by the MHL were more accurate. As can be seen in Table 2, even for subtle emotions and emotions expressed at low intensity, the MHL detected the emotions more accurately than the webcam most of the time; this is visible in images 13 and 14 of Table 2. The factors behind this accuracy can be attributed to the more advanced camera in the HoloLens and its greater number of built-in sensors.

A graphical representation of the results is given in the bar graphs in Figure 4, Figure 5 and Figure 6, where the highlighted regions depict the recognized emotions. The x-axis represents the emotions recognized through the camera/MHL, and the y-axis represents the accuracy achieved, with 0 indicating the absence of an emotion and 1 indicating an emotion detected with absolute certainty. In Figure 4, considering the emotion "Happy" shown in red, it can be seen that MHL has higher accuracy and does not confuse the emotion with other emotions, unlike the webcam: the webcam, despite achieving accuracy higher than 0.9, still confuses it with emotions such as sad, surprise, and neutral. For the emotion "Sad", shown in green, comparing the results of MHL and the webcam shows that MHL has a lower tendency to misread the emotion; while MHL confuses sad with neutral and angry, the webcam misreads it as a greater number of emotions, namely happy, neutral, angry, and surprise. The emotions "Angry" and "Surprise" have almost the same accuracy, with only one emotion misread by the webcam for "Angry" and two for "Surprise", and no emotions misread by the MHL. The emotion shown in orange is neutral; the MHL has higher accuracy for neutral, while the webcam confuses it with the sad emotion. The bar graphs clearly show that accuracy is significantly affected by the sensors in the mixed reality device, which achieves higher accuracy for the predicted emotions and fewer misread emotions.

Figure 4. Comparison of average emotion recognition accuracy for the five emotions for Subject 1 using MHL and webcam.

Figure 5. Comparison of average emotion recognition accuracy for the five emotions for Subject 2 using MHL and webcam.

Figure 6. Comparison of average emotion recognition accuracy for the five emotions for Subject 3 using MHL and webcam.

The confusion matrices are averaged over the results obtained from the whole dataset of 200 images, with 20 images per emotion per device. Figure 7 and Figure 8 show the confusion matrices (in the form of heat maps) for the whole dataset: 100 images taken by MHL and 100 taken by a webcam, in each case 20 images of each of the five emotions. Figure 7 and Figure 8 show that the emotions measured on the 200-image dataset remained in alignment with the results obtained from the images of the three subjects.
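Confusion matrices such as these can be produced with a few lines of Python; the sketch below, using scikit-learn and matplotlib with the five emotion labels from this study, is illustrative only and not the authors' analysis script.

# Build and plot a confusion-matrix heat map from true vs. predicted emotion labels.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

LABELS = ["happiness", "sadness", "anger", "surprise", "neutral"]

def plot_confusion(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    disp = ConfusionMatrixDisplay(cm, display_labels=LABELS)
    disp.plot(cmap="Blues", xticks_rotation=45)
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Example: plot_confusion(true_emotions_webcam, predicted_emotions_webcam, "Webcam")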

Figure 7. Confusion matrix of webcam-based emotion recognition results for the complete dataset.

Figure 8. Confusion matrix of MHL-based emotion recognition results for the complete dataset.

4.5. Limitations

After conducting the experiment, some of the observed limitations are listed below:

  • MHL was used to detect emotions enacted by people; because people found it difficult to hold the depicted emotions, the accuracy of the algorithm was affected.
  • People had to make several attempts for detection of the 'Sad' emotion, as detecting it was a major limitation of MHL.
  • The MHL code runs in video mode and not in real-time mode, so for every real-time change in emotion, the MHL has to be set up again to detect it.
  • Little technical support was available for MHL, since only a limited amount of work has been done with it.
  • People's expressions change every minute, indeed every millisecond, which makes it challenging for MHL to perform face detection and emotion recognition in real time.

5. Conclusions

Face detection and emotion recognition are challenging problems that require substantial effort to improve their performance. The area of emotion recognition is gaining attention owing to its applications in various domains such as gaming, software engineering, and education. This paper presented a detailed survey of the various approaches, and the techniques implementing those approaches, used to detect human facial expressions for identifying emotions. Furthermore, the combination of steps involved in machine-learning-based and geometric-based approaches for face detection, emotion recognition, and classification was briefly discussed. As part of this survey, the accuracies obtained on the databases used for training and testing were compared, as shown in Table 1. Different kinds of databases were described in detail to give an outline of how the datasets were built: whether they are posed or spontaneous, static or dynamic, recorded in lab or non-lab conditions, and how diverse they are. One conclusion from this survey of databases is that RGB databases lack intensity labels, making them less convenient for experiments and hence compromising efficiency. The drawbacks of thermal databases are that they do not cope well with pose variation, variation in temperature, aging, or differences in scale (e.g., the identical twin problem); disguises cannot be captured if the person is wearing glasses; and thermal images have a very low resolution, which affects database quality. The 3D databases are not available in abundance, which limits experimentation and accuracy improvements. The accuracies of different algorithms on these databases were also reported, showing that there is scope for improvement in emotion recognition accuracy and in detecting subtle micro-expressions.

Next, a quick comparison was made between MHL and existing AR devices, and an experiment was performed using two different devices: the Microsoft HoloLens and a webcam. Five expressions—anger, neutral, happy, surprise, and sad—were depicted by the three selected subjects. The primary factor limiting the accuracy of both devices was that the emotions to be predicted had to be held for a long time; as it was difficult to hold the sad expression until its detection, this created a serious issue. The experiment was conducted in a lab under different lighting conditions, and it demonstrated that, because of the sensors available in the MHL, it gives better emotion recognition accuracy than the webcam under similar conditions. From this experiment, we conclude that sensors play a very influential role in recognizing an emotion: a number of high-quality sensors are required for capturing subjects' emotions in real time with better accuracy under different environmental conditions. The results of the experiment are presented in a comparison table containing the predictions and the probabilities of the emotions: the left column of the table shows the webcam's outputs, where all the emotions with their values are listed and the emotion with the maximum value is taken, while the right column shows the Microsoft HoloLens outputs, where the gender, age, and emotion of the subject are given. The paper concludes by listing the limitations of the device and the problems observed while experimenting.

Mixed reality is the future. The Microsoft HoloLens can be used to improve the accuracy of the current experiment by exploiting its high-quality sensors, and more robust testing can be done using an extensive database for real-time emotion recognition. An expression change should be detected as soon as a person's expression changes over time, which leads us to our first goal of making the algorithm work in real time. Secondly, the accuracy of emotion recognition, notably for the "Sad" emotion, will be improved. Thirdly, various algorithms will be tested and validated to see which of them has the best accuracy in recognizing emotions. In addition, the bug observed during experimentation in the previous version of MHL has now been removed, so experiments can currently be repeated multiple times quickly.

Acknowledgments

The authors would like to thank the members of Paul A. Hotmer Family Cybersecurity and Teaming Research (CSTAR) Laboratory at The University of Toledo for the infrastructural support to perform the experiments. The authors would also like to thank Electrical Engineering and Computer Science Department at The University of Toledo for financially supporting the students working on this project. Finally, the authors are thankful to students who volunteered to help in developing the data set for the experiments.

Author Contributions

In this paper, the experiment was designed and performed by Dhwani Mehta and Mohammad Faridul Haque Siddiqui under the supervision of Ahmad Y. Javaid. Dhwani Mehta performed the experiment with Microsoft HoloLens, while Mohammad Faridul Haque Siddiqui performed it using a webcam. Both of them collected the data from the results and analyzed the data. The whole experiment was performed in a secure laboratory environment, with the Microsoft HoloLens, webcam, and high-end computers provided by Ahmad Y. Javaid. Dhwani Mehta wrote the paper with the help of Mohammad Faridul Haque Siddiqui and Ahmad Y. Javaid. Further editing and proofreading of the paper was done by Mohammad Faridul Haque Siddiqui and Ahmad Y. Javaid.

Conflicts of Interest

The authors declare that there are no conflicts of interest as of the date of this publication.


artificial emotional intelligence —

Major ChatGPT-4o update allows audio-video talks with an "emotional" AI chatbot
New GPT-4o model can sing a bedtime story, detect facial expressions, read emotions.

Benj Edwards and Kyle Orland - May 13, 2024 5:58 pm UTC


On Monday, OpenAI debuted GPT-4o (o for "omni"), a major new AI model that can ostensibly converse using speech in real time, reading emotional cues and responding to visual input. It operates faster than OpenAI's previous best model, GPT-4 Turbo , and will be free for ChatGPT users and available as a service through API, rolling out over the next few weeks, OpenAI says.


OpenAI revealed the new audio conversation and vision comprehension capabilities in a YouTube livestream titled "OpenAI Spring Update," presented by OpenAI CTO Mira Murati and employees Mark Chen and Barret Zoph that included live demos of GPT-4o in action.

OpenAI claims that GPT-4o responds to audio inputs in about 320 milliseconds on average, which is similar to human response times in conversation, according to a 2009 study , and much shorter than the typical 2–3 second lag experienced with previous models. With GPT-4o, OpenAI says it trained a brand-new AI model end-to-end using text, vision, and audio in a way that all inputs and outputs "are processed by the same neural network."

"Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations," OpenAI says.

During the livestream, OpenAI demonstrated GPT-4o's real-time audio conversation capabilities, showcasing its ability to engage in natural, responsive dialogue. The AI assistant seemed to easily pick up on emotions, adapted its tone and style to match the user's requests, and even incorporated sound effects, laughing, and singing into its responses.

OpenAI CTO Mira Murati seen debuting GPT-4o during OpenAI's Spring Update livestream on May 13, 2024.

The presenters also highlighted GPT-4o's enhanced visual comprehension. By uploading screenshots, documents containing text and images, or charts, users can apparently hold conversations about the visual content and receive data analysis from GPT-4o. In the live demo, the AI assistant demonstrated its ability to analyze selfies, detect emotions, and engage in lighthearted banter about the images.

Additionally, GPT-4o exhibited improved speed and quality in more than 50 languages, which OpenAI says covers 97 percent of the world's population. The model also showcased its real-time translation capabilities, facilitating conversations between speakers of different languages with near-instantaneous translations.

OpenAI first added conversational voice features to ChatGPT in September 2023 that utilized Whisper , an AI speech recognition model, for input and a custom voice synthesis technology for output. In the past, OpenAI's multimodal ChatGPT interface used three processes: transcription (from speech to text), intelligence (processing the text as tokens), and text to speech, bringing increased latency with each step. With GPT-4o, all of those steps reportedly happen at once. It "reasons across voice, text, and vision," according to Murati. They called this an "omnimodel" in a slide shown on-screen behind Murati during the livestream.
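To make the contrast concrete, a rough sketch of the older cascaded pipeline (speech to text, text to text, text to speech) is shown below using the OpenAI Python client; the model names and client methods reflect one version of that library and may have changed, so treat this as an illustration of the three-hop architecture rather than a reference implementation.

# Rough sketch of the cascaded voice pipeline that GPT-4o replaces with a single model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Transcription: speech to text.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Intelligence: text in, text out.
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3) Text to speech: synthesize the spoken reply.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)  # each hop adds latency, which an end-to-end model avoids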

OpenAI announced that GPT-4o will be accessible to all ChatGPT users, with paid subscribers having access to five times the rate limits of free users. GPT-4o in API form will also reportedly feature twice the speed, 50 percent lower cost, and five-times higher rate limits compared to GPT-4 Turbo. (Right now, GPT-4o is only available as a text model in ChatGPT, and the audio/video features have not launched yet.)

In Her, the main character talks to an AI personality through wireless earbuds similar to AirPods.

The capabilities demonstrated during the livestream and numerous videos on OpenAI's website recall the conversational AI agent in the 2013 sci-fi film Her . In that film, the lead character develops a personal attachment to the AI personality. With the simulated emotional expressiveness of GPT-4o from OpenAI (artificial emotional intelligence, you could call it), it's not inconceivable that similar emotional attachments on the human side may develop with OpenAI's assistant, as we've already seen in the past.

Murati acknowledged the new challenges posed by GPT-4o's real-time audio and image capabilities in terms of safety, and stated that the company will continue researching safety and soliciting feedback from test users during its iterative deployment over the coming weeks.

"GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities," says OpenAI. "We used these learnings [sic] to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered."

Updates to ChatGPT

Also on Monday, OpenAI announced several updates to ChatGPT, including a ChatGPT desktop app for macOS, which began to roll out to a few testers who subscribe to ChatGPT Plus today and will become "more broadly available" in the coming weeks, according to OpenAI. OpenAI is also streamlining the ChatGPT interface with a new home screen and message layout.

And as we mentioned briefly above, when using the GPT-4o model (once it becomes widely available), ChatGPT Free users will have access to web browsing, data analytics, the GPT Store , and Memory features, which were previously limited to ChatGPT Plus, Team, and Enterprise subscribers.


Speech emotion recognition systems and their security aspects

  • Open access
  • Published: 21 May 2024
  • Volume 57 , article number  148 , ( 2024 )

Cite this article



  • Itzik Gurowiec 1 , 2 &
  • Nir Nissim 1 , 2  

Speech emotion recognition (SER) systems leverage information derived from sound waves produced by humans to identify the concealed emotions in utterances. Since 1996, researchers have placed effort on improving the accuracy of SER systems, their functionalities, and the diversity of emotions that can be identified by the system. Although SER systems have become very popular in a variety of domains in modern life and are highly connected to other systems and types of data, the security of SER systems has not been adequately explored. In this paper, we conduct a comprehensive analysis of potential cyber-attacks aimed at SER systems and the security mechanisms that may prevent such attacks. To do so, we first describe the core principles of SER systems and discuss prior work performed in this area, which was mainly aimed at expanding and improving the existing capabilities of SER systems. Then, we present the SER system ecosystem, describing the dataflow and interactions between each component and entity within SER systems and explore their vulnerabilities, which might be exploited by attackers. Based on the vulnerabilities we identified within the ecosystem, we then review existing cyber-attacks from different domains and discuss their relevance to SER systems. We also introduce potential cyber-attacks targeting SER systems that have not been proposed before. Our analysis showed that only 30% of the attacks can be addressed by existing security mechanisms, leaving SER systems unprotected in the face of the other 70% of potential attacks. Therefore, we also describe various concrete directions that could be explored in order to improve the security of SER systems.


1 Introduction

Human beings can recognize emotions in the human voice (Blanton 1915 ). Even animals like dogs and horses can recognize and interpret the human voice, discerning tones of love, fear, anger, anxiety, and even depression. "The language of tones" is perhaps the most universal and oldest language known to human beings, animals, and all living creatures (Blanton 1915 ); in fact, this language is the basis of our communication.

Humans’ language of tones contains a range of personal information regarding the speaker. By understanding people's emotions, one can both better understand other people and be better understood by others. When interacting with others, we often provide clues that help them understand what we are feeling; these clues may involve tone changes, body language, facial expressions, etc. The emotional expressions of the people around us are a major aspect of our social communication, and being able to interpret and react to others' emotions is essential. This capability allows us to respond appropriately and build deeper connections with our surroundings and the people in them: by knowing how a person feels, we can understand their emotional state and current needs and react accordingly.

Charles Darwin believed that the emotions are adaptations that allow animals and humans to survive and reproduce. Footnote 1 The ability to identify emotions in human speech has existed since the late 1970s when John D. Williamson created a speech analyzer to analyze the pitch or frequency of human speech and determine the emotional state of the speaker (Williamson 1978 ). Since that time, verbal communication and interactions between humans and computerized systems have increased. Systems like Apple’s Siri, Footnote 2 Amazon’s Alexa, Footnote 3 and Microsoft’s Cortana Footnote 4 have become one of the basic functionalities in daily-used devices. The main question that arises, however, is whether such systems can truly recognize the speaker's emotions and react accordingly.

Emotion recognition systems, which are mainly based on analyzing facial expressions (Dzedzickis et al. 2020 ), learn to identify the link between an emotion and its external manifestation from large arrays of labeled data. This data may include audio or video recordings of TV shows, interviews, and experiments involving real people, clips of theatrical performances or movies, or dialogue delivered by professional actors.

Many areas can benefit from emotion recognition capabilities, including security (Utane and Nalbalwar 2013 ), customer-focused services (Utane and Nalbalwar 2013 ), and even the socialization of people with special needs (Rázuri et al. 2015 ). According to Gartner, Footnote 5 by the end of 2023, one in 10 gadgets (meaning 10% of technologies) will include emotion recognition technology.

The methods and sensors used to recognize emotions (Dzedzickis et al. 2020 ) can be categorized broadly as self-report techniques that are based on the examined person's assessment and machine assessment techniques that are based on physiologic and biologic parameters collected from the examined person, which include: electroencephalography (EEG) produced from the brain’s electrical system; electrocardiography (ECG) and heart rate variability (HRV) produced from the heart’s actions; skin temperature and skin response to different stimuli; respiration rate (RR); facial expression (FE); and speech analysis, which is known as speech emotion recognition (SER). Machine assessment techniques lie at the core of many emotion recognition systems, including SER systems, and their market share is projected to be valued at $37.1 billion by the end of 2026. Footnote 6 Figure  1 presents various domains in which SER systems have been implemented in recent years. As can be seen, SER systems are implemented in a wide range of domains. Figure  2 presents the number of papers pertaining to SER systems published over the last four decades (based on Google Scholar), with a forecast for the next decade (2020–2030) which was created using simple exponential regression (the red solid line in the graph), including the upper and lower bounds of the confidence interval using 5% statistical significance (the two dashed orange lines). Figure  2 includes papers relevant to SER systems, regardless of whether they pertain to the security of such systems. As can be seen, there has been exponential growth in the number of publications in the SER domain, with a forecast of over 230 publications per year by the end of 2030.
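As a rough illustration of how such a forecast can be produced (the counts in the example and the construction of the 5% interval are placeholders, not the authors' actual data or script), a log-linear least-squares fit in Python might look as follows.

# Exponential (log-linear) trend fit with an approximate 95% band for yearly publication counts.
import numpy as np
from scipy import stats

def exponential_forecast(years, counts, horizon_year):
    # Fit log(counts) = a + b * year by ordinary least squares.
    log_c = np.log(counts)
    res = stats.linregress(years, log_c)
    future = np.arange(years[0], horizon_year + 1)
    pred_log = res.intercept + res.slope * future
    # Rough symmetric band from the residual spread (5% significance, normal approximation).
    resid = log_c - (res.intercept + res.slope * years)
    z = stats.norm.ppf(0.975)
    band = z * resid.std(ddof=2)
    return future, np.exp(pred_log), np.exp(pred_log - band), np.exp(pred_log + band)

# Example with made-up counts:
# yrs, fit, lo, hi = exponential_forecast(np.arange(2000, 2021), np.geomspace(5, 120, 21), 2030)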

figure 1

Domains in which SER systems have been implemented

figure 2

SER publications per year
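The forecast shown in Fig. 2 relies on a simple exponential (log-linear) regression. As a rough illustration of how such a trend line and its 95% confidence band can be produced, consider the following Python sketch; the yearly counts used here are hypothetical placeholders, not the data underlying Fig. 2:

# A minimal sketch (not the authors' exact procedure) of fitting an exponential
# trend to yearly publication counts and extrapolating it with a 95% interval.
# The counts below are hypothetical placeholders, not the data behind Fig. 2.
import numpy as np

years = np.arange(1990, 2020)
counts = np.array([max(1, int(np.exp(0.12 * (y - 1990)))) for y in years])  # placeholder data

# Exponential regression via a log-linear least-squares fit: log(count) = a + b*year
log_counts = np.log(counts)
b, a = np.polyfit(years, log_counts, 1)

# Extrapolate to 2020-2030 and build an approximate 95% band from the residual spread
future = np.arange(2020, 2031)
pred_log = a + b * future
resid_std = np.std(log_counts - (a + b * years))
lower, upper = np.exp(pred_log - 1.96 * resid_std), np.exp(pred_log + 1.96 * resid_std)
print(dict(zip(future, np.round(np.exp(pred_log)))))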

The growing popularity and distribution of emotion recognition technologies (Dzedzickis et al. 2020; Garcia-Garcia et al. 2017), including SER systems, raise another important issue. Such technologies and systems can invade individuals’ personal cyber space and compromise their privacy and security; therefore, they must be secured with mechanisms like user authentication and encryption. The General Data Protection Regulation Footnote 7 (GDPR) defines the rules for processing personal data in the European Union (EU). Under this regulation, voice and speech are not in themselves considered personal data, but voice or speech recordings are considered personal data if they can be related to an identified person. According to the GDPR, sound data (voice/speech) can even be considered sensitive data (a special category of personal data benefitting from a higher level of protection because of its nature), since it may reveal ethnicity, political opinions, etc. SER systems, which rely on voice and speech recordings, must therefore be secured in order to adhere to such regulations.

As voice-based systems have become more ubiquitous, they have become an attractive target for cyber-attacks. In March 2019, The Wall Street Journal (WSJ) Footnote 8 reported on a cyber-attack aimed at a British energy company: attackers used an artificial intelligence (AI) algorithm to impersonate the CEO in a phone call to an employee, demanding a fraudulent transfer of $243,000. According to the WSJ, the attackers used publicly available sound recordings (such as those used in the SER system training phase) to perform the attack. In 2018, The New York Times Footnote 9 reported on the ease of performing a "dolphin attack" on voice assistant systems, in which the attacker plays inaudible voice commands to exploit the fact that smartphones in the surrounding area can be operated and controlled by voice. The article reported that this capability can be used to switch a smartphone to airplane mode (cutting off its Internet access) or direct it to visit a website (which could be malicious). In October 2021, a group of attackers used voice-based deep fake technology ("deep-voice" Footnote 10) to transfer $35 million from a company's bank account to several other accounts. Footnote 11 In the phone call to the bank, they mimicked the voice of a senior manager of the company. The use of a SER system by the bank might have prevented such an attack. For example, a SER system with synthetic sound as a neutral reference (Mdhaffar et al. 2021) can be used to detect fraudulent calls: such a system can detect both synthetic sound (as used in "deep voice" attacks) and neutral emotion in the attacker's voice. The classification of the attacker's emotion during the attack as "neutral" could raise suspicion, since most attackers would probably feel nervous, excited, or stressed when performing the attack.

Although extensive research has been conducted in the area of SER systems (as shown in Fig.  2 ), and there is a wide range of potential attacks (as described above), not enough research focusing on the security of SER systems has been performed; while studies have been conducted on this topic (Latif et al. 2018 ; Zhang et al. 2017 ; Jati et al. 2020 ), there is a significant gap between the solutions they provide, the vulnerabilities we discovered (discussed in Sect. 6), and the potential attacks outlined in our paper.

It is essential to differentiate between direct cyber-attacks on SER systems and attacks focused on SER system model alteration or imitation. Cyber-attacks that directly compromise the integrity, availability, or confidentiality of SER systems are considered direct cyber-attacks. These include system breaches (Aloufi et al. 2019), data theft (McCormick 2008), denial-of-service attacks (McCormick 2008), and other malicious activities that target the core functionality and security of the system (McCormick 2008). On the other hand, attacks focused on the alteration or imitation of SER system models aim to manipulate the underlying machine learning (ML) models employed by SER systems. By modifying the training data or injecting adversarial input into it, these attacks attempt to deceive or manipulate the SER system's decision-making process. Although both types of attacks pose significant threats to SER systems, they represent distinct categories of vulnerabilities (see Sect. 6.1). By examining both, we aim to provide a holistic and comprehensive overview of the diverse range of challenges and risks faced by SER systems. While we present a variety of cyber-attacks aimed at SER systems, due to the increased use of ML in diverse domains, most of the attacks belong to the second group, aimed at altering or imitating the SER model. Such attacks exploit existing vulnerabilities of SER systems (e.g., their use of external recording devices and their training on publicly available datasets), enabling attackers to launch them.

In addition to the existing cyber-attacks, new attacks can always be performed. This is exemplified by the COVID-19 pandemic, which created three kinds of worldwide crises: economic, healthcare, and mental health crises (Lotfian and Busso 2015 ). The periods of isolation and need to quarantine affected millions of people around the world, causing the depression rate to rise (27.8% of US adults reported suffering from depression during the pandemic, compared to just 8.5% before the pandemic, according to a study performed by the Boston University School of Public Health Footnote 12 ). Early detection of radical changes in a person's mood, especially during a pandemic, is crucial. Emotion recognition systems are a valuable tool in detecting changes in a person's mood, and just as the pandemic created new applications for SER systems, it also created new opportunities for attackers. The global shift to online work, learning, and other daily activities allowed people to apply filters to their voice (transferred via any online medium, e.g., Zoom), changing the way they sound. This ability can be utilized by cyber-attackers at any time.

Despite the wide range of existing and potential cyber-attacks, to the best of our knowledge, no studies have explored and analyzed the security aspects of SER systems, such as potential cyber-attacks on these systems and the systems’ vulnerabilities, issues which might have a great impact on individuals, society, companies, the economy, and technology.

In this paper we address this gap. We provide the basic definitions required to understand SER systems and improve their security. We discuss the main studies performed in the SER field; the methods used in those studies include support vector machines (SVMs), hidden Markov models (HMMs), and deep learning algorithms (e.g., convolutional neural networks). We also present the SER system ecosystem and analyze potential cyber-attacks aimed at SER systems. In addition, we describe the existing security mechanisms aimed at providing protection against SER cyber-attacks and introduce two concrete directions that could be explored in order to improve the security of SER systems.

Although several studies have reviewed the SER domain (Joshi and Zalte 2013; Ayadi et al. 2011; Schuller 2018; Swain et al. 2018; Yan et al. 2022), they provided limited information regarding aspects of the SER process, focusing instead on the basic and fundamental information needed to develop such systems. For example, Joshi and Zalte (2013) focused mainly on the classifier selection and feature extraction and selection methods suitable for speech data, while Ayadi et al. (2011) focused on the databases available for the task of classifying emotions, the recommended features to extract, and the existing classification schemes. However, none of the abovementioned papers provided detailed information on the sound wave itself or the data representation techniques used in SER systems. Yan et al. (2022) explored the security of voice assistant systems, providing a thorough survey of the attacks and countermeasures for voice assistants. Despite the study’s comprehensiveness, it did not discuss SER systems in particular or the SER system ecosystem, which is crucial for the analysis of potential cyber-attacks aimed directly at SER systems. Although the authors presented a wide range of cyber-attacks aimed at voice assistants, providing a comprehensive analysis of each attack, they only focused on existing cyber-attacks, without suggesting new attacks or security mechanisms, as we do in this paper. Our paper aims to address the abovementioned gaps identified in the previous studies, and its contributions are as follows:

We identify the different players (humans and components) within SER systems, analyze their interactions, and by doing so, create ecosystem diagrams for SER systems for the main domains and applications they are implemented in.

We discuss 10 possible attacks and vulnerabilities relevant for SER systems. Using the understanding derived from the ecosystem diagrams, we identify the vulnerable components and elements within SER systems that are exposed to cyber-attacks, as well as the attack vectors from which a cyber-attack can be initiated against SER systems.

We describe nine existing security mechanisms that can be used to secure SER systems against potential cyber-attacks and analyze the mechanisms’ ability to address the possible attacks; by doing so, we identify gaps, i.e., attacks and vulnerabilities that remain unaddressed.

We propose two security mechanisms for SER systems that can help address some of the attacks and vulnerabilities that are not currently covered by existing security mechanisms.

2 Research methodology

In this section, we provide an overview of the structured methodology employed to explore the security aspects of SER systems, which is the main goal of our study. This methodology enabled our comprehensive analysis of SER systems. The methodology's six steps, which are presented in Fig.  3 , can be summarized as follows: In step 1, we perform a technical analysis of the foundations, principles, and building blocks of SER systems. In step 2, we explore the evolution of SER systems and the domains they are used in. In step 3, we analyze the SER system's ecosystem in the domains explored in step 2. In step 4, we perform a security analysis of SER systems in which we explore their vulnerabilities and identify potential cyber-attacks targeting them. In step 5, we assess the coverage of the existing security mechanisms against the cyber-attacks identified in step 4. The methodology concludes with step 6 in which we identify the security gaps associated with SER systems and propose security enhancements. This research methodology guided our investigation of the security aspects of SER systems, ensuring a systematic and comprehensive approach to the analysis of both the technical and security-related aspects of the SER system domain.

figure 3

Overview of the research methodology employed in this study

2.1 Technical analysis of the foundations and building blocks of SER systems

To lay the groundwork for our study, we perform a comprehensive technical analysis of the principles of SER systems. This involves the in-depth exploration of sound waves, speech signals, signal processing techniques, feature extraction methods, and more.

2.2 Exploration of the SER system domain and the evolution of SER systems

A thorough literature review is performed, in which we cover the existing studies in the SER field and identify the main domains in which SER systems are used. This step is crucial for the definition of the SER ecosystem.

2.3 Analysis and formulation of SER ecosystems

Building on the insights gained in the first two steps, we analyze and formulate the SER ecosystems in each domain. To do so, we identify all of the components in SER systems and the associated dataflow, starting from the development phase and continuing to the diverse applications across various domains. The SER ecosystem provides a holistic framework for our subsequent security analysis.

2.4 Security analysis and potential cyber-attacks aimed at SER systems

To assess the security aspects of SER systems, we survey cyber-attacks targeting speech-based systems, with a particular focus on SER systems. By leveraging our understanding of the SER ecosystem, we analyze the vulnerabilities inherent in SER systems, examine the relevance of existing cyber attacks to the SER system domain, and present some new cyber attacks that could target SER systems.

2.5 Analysis of the coverage of existing security mechanisms against SER cyber attacks

In this step, we survey existing security mechanisms against cyber-attacks aimed at SER systems, reviewing the countermeasures designed to safeguard speech-based systems. This step results in an assessment of the current state of security measures and their efficacy in mitigating potential threats to SER systems.

2.6 Identification of security gaps and security enhancements

Based on our analysis of the potential cyber-attacks and existing security mechanisms, we first identify gaps in the security measure coverage against cyber-attacks aimed at SER systems. Then, we propose security enhancements for SER systems aimed at addressing the identified vulnerabilities and strengthening the overall resilience of these systems against potential threats.

3 Emotions and the principles of speech emotion recognition systems

To understand speech emotion recognition systems and to design and develop proper security mechanisms for them, we must first provide several basic definitions, for example, what an emotion is and how it is manifested in sound waves. Thus, in this section we provide information regarding emotions, sound, and sound waves, and explain how these abstract concepts become data that can be used for emotion recognition or, alternatively, exploited by an attacker. This section is organized as follows: the "Emotions" subsection provides basic information on different emotional concepts, including types of emotions and sub-emotions and how they are classified; the "Sound Waves" subsection briefly explains what a sound wave is; the "Sound Data Representation" subsection presents different representations of sound waves that enable a digital system to analyze them; the "Feature Extraction" subsection provides information regarding the different families of audio features that can be used to train a SER system; and finally, the "Reflection of Emotions in Sound" subsection presents how different emotions are expressed in human speech and how each emotional state affects the sound humans produce.

3.1 Emotions

The scientific community has made numerous attempts to classify emotions and differentiate between emotions, mood, and affect; we now briefly explain some important terms.

An emotion is the response of a person to a specific stimulus; the stimulus can be another person, a real or virtual scenario, a smell, taste, sound, image, or event (Wang et al. 2021). Usually the stimulus is intense, the experience is brief, and the person is aware of it. Mood is the way a person feels at a particular time; it tends to be mild, long lasting, and less intense than a feeling (Dzedzickis et al. 2020). Affect is the result of the effect caused by a certain emotion (Dzedzickis et al. 2020; Wang et al. 2021); in other words, affect is the observable manifestation of a person's inner feelings.

In (Feidakis et al. 2011a), the authors described 66 different emotions and divided them into two main groups: basic emotions, which include ten emotions (anger, anticipation, distrust, fear, happiness, joy, love, sadness, surprise, and trust), and secondary emotions, which include 56 semi-emotions. Distinguishing between different emotions is an extremely difficult task, especially when the classification process needs to be performed automatically. This is because the definition of emotion and its meaning have changed from one scientific era to another (Feidakis et al. 2011b); it is therefore hard to define which emotional classes a SER system should include. It is also difficult to identify the relevant features to extract from the raw audio, since the features selected can dramatically affect the performance of the SER system (Ayadi et al. 2011). Moreover, among the 66 emotions described by the authors, some emotions are considered "similar" (for example, calm and neutral). To handle the issue of emotion similarity in the classification process, researchers have focused on classifying emotions according to their parameters, namely valence (negative/positive) and arousal (high/low), and on analyzing just the basic emotions, which can be defined more easily. Russell’s circumplex model of emotions (Dzedzickis et al. 2020) (presented in Fig.  4) provides a two-dimensional parameter space that differentiates between emotions with respect to valence and arousal.

figure 4

Russell's circumplex model of emotions

Using the abovementioned model, the classification of emotions becomes easier for a human expert, but still, as mentioned earlier, there are many challenges related to automated emotion recognition performed by a machine. To accomplish such automated recognition, several measurable parameters for emotion assessment must be used, including heart rate, respiration rate, brain electrical activity (measured using electroencephalography), facial expression, natural speech, etc. Understanding the differences between emotions, especially when some emotions are similar to one another, is crucial for developing a SER system. On the other hand, an attacker interested in creating bias and interfering with the accurate emotion classification process of a SER system could exploit the similarities that exist between emotions. Therefore, the detection of the appropriate attributes (level of arousal and valence) of each emotion is a basic step in the development of a proper security mechanism for SER.
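To make the valence–arousal idea concrete, the following toy Python sketch places a few emotion labels in Russell's two-dimensional space and maps an arbitrary (valence, arousal) point to the nearest label; the coordinates and labels are rough illustrative assumptions, not values taken from the model:

# Toy illustration of Russell's circumplex idea: place a few emotion labels in the
# valence-arousal plane and map an arbitrary (valence, arousal) point to the nearest
# label. Coordinates are rough illustrative assumptions, not values from the model.
import math

EMOTIONS = {
    "happiness": (0.8, 0.6),   # positive valence, high arousal
    "anger":     (-0.7, 0.7),  # negative valence, high arousal
    "sadness":   (-0.7, -0.5), # negative valence, low arousal
    "calm":      (0.6, -0.6),  # positive valence, low arousal
}

def nearest_emotion(valence, arousal):
    """Return the emotion label whose (valence, arousal) coordinates are closest."""
    return min(EMOTIONS, key=lambda e: math.dist((valence, arousal), EMOTIONS[e]))

print(nearest_emotion(0.5, 0.4))    # -> happiness
print(nearest_emotion(-0.2, -0.6))  # -> sadness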

3.2 Sound waves

Before discussing how sound is represented by a machine, we need a basic understanding of sound waves and their attributes. Every sound we hear is the result of a sound source that has induced a vibration; these vibrations create fluctuations in the atmosphere, which are called "sound waves." Figure  5 illustrates such an (invisible) sound wave. The "pressure" axis in Fig.  5 represents the difference between the local atmospheric pressure and the ambient pressure.

figure 5

Invisible sound wave

Sound waves are nothing but organized vibrations that pass from molecule to molecule through almost every medium, including air, steel, concrete, wood, water, and metal. As a sound vibration is produced, the fluctuations pass through these mediums, transferring energy from one particle of the medium to its neighboring particles. When air carries sound, the waves reach our ear, and the eardrum vibrates at the same frequency as the sound wave. Tiny ear bones then stimulate nerves that deliver the energy we recognize as sound. While some sounds are pleasant and soothing to our brain, others are not, and these are considered noise. The loudness (amplitude) of a sound wave is measured in decibels (dB), while the pitch of the sound wave is measured as frequency in hertz (Hz). One hertz is equal to one sound wave cycle per second, as illustrated in Fig.  5. The frequency (hertz) of a sound does not decay over time or distance, but its decibel level does.

3.3 Sound data representation

In order to use algorithms that analyze and learn informative patterns in sound (as in SER systems), sound waves must be converted into data types that can be read by a digital system. Sound waves can be represented in a variety of ways, depending on the conversion process applied. There are several algorithms used to convert sound waves, each of which utilizes different attributes of the sound wave (mainly frequency, amplitude, and time) and represents the sound differently. The main methods used for the conversion process are analog-to-digital conversion (ADC) and time–frequency conversion. Once the sound is converted to a digital format and saved in a computerized system, it becomes vulnerable to cyber-attacks; therefore, the process of representing the sound data in the computer for the task of SER must be performed with appropriate care, to decrease the possibility of its malicious use.

A taxonomy of the main sound data representation methods (A/D conversion and time–frequency domain representations) is presented in Fig.  6 . It includes a layer for the various conversion techniques (with reference to the relevant paper), as well as a layer for the conversion algorithm sub-type, where we present several conversion sub-algorithms (with the year it was first presented), related to their higher-level main algorithm. Each method is suitable for a different task and provides different information regarding the original sound wave. Some of the methods (e.g., Mel frequency representation) are more suitable for learning tasks associated with the human perception of sound. Footnote 13 A description and comparison of the methods follows the taxonomy.

figure 6

Taxonomy of raw sound data representation methods

3.3.1 Analog-to-digital conversion

Sound waves have a continuous analog form, whereas computer systems store data in a binary format using discrete values (sequences of 1s and 0s). Therefore, in order to be processed by computers, sound must be converted to a digital format. First, the sound is recorded using a device that can turn sound waves into an electrical signal. Following that, regular measurements of the signal's level (referred to as samples) are obtained. The samples are then converted into binary form, and a computer can process and store the digitized sound as a sequence of 1s and 0s. This method was invented by Bernard Marshall Gordon (Batliner et al. 2011), who is known as "the father of high-speed analog-to-digital conversion." During the conversion process, two main parameters need to be defined: the sampling frequency and the sample size. The sampling frequency is the number of samples obtained per second, measured in hertz; an audio recording made with a high sampling frequency can be represented more accurately in its digital form. Figure  7 illustrates the effect of the sampling frequency on the representation accuracy. The upper plot shows the analog form of the sound, while the middle and bottom plots illustrate the effect of the sampling frequency on the sound wave; the middle plot shows a lower sampling frequency (and therefore a less accurate representation of the sound wave) than the bottom plot. The sample size is defined as the number of bits used to represent each sample. A larger sample size improves the quality of an audio recording; when more bits are available for each sample, more signal levels can be captured, resulting in more information in the recording, as illustrated in Fig.  8. The top plot illustrates the use of one bit (1/0) per sample. The middle plot demonstrates the use of two bits per sample, which provides a more accurate representation of the sound. In the bottom plot, by using 16 bits to represent each sample, the accuracy of the digitized sound wave is almost identical to the analog form of the sound.

figure 7

The effect of the sampling frequency on the representation accuracy

figure 8

The effect of the sample size on the representation accuracy
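The following minimal Python sketch makes the two parameters discussed above concrete by sampling a pure tone at a chosen sampling frequency and quantizing each sample to a chosen number of bits; the tone, sampling rate, and bit depths are arbitrary illustrative values:

# Minimal sketch of the two A/D parameters discussed above: sampling frequency
# (samples per second) and sample size (bits per sample). Values are arbitrary.
import numpy as np

def sample_and_quantize(freq_hz=440.0, duration_s=0.01, fs=8000, bits=8):
    """Sample a pure tone at fs Hz and quantize each sample to `bits` bits."""
    t = np.arange(0, duration_s, 1.0 / fs)                  # sampling instants
    analog = np.sin(2 * np.pi * freq_hz * t)                # "analog" signal, in [-1, 1]
    levels = 2 ** bits                                       # number of representable levels
    quantized = np.round((analog + 1) / 2 * (levels - 1))    # map to 0..levels-1
    return quantized.astype(int)

print(sample_and_quantize(bits=1)[:10])   # coarse: only 2 levels (0 or 1)
print(sample_and_quantize(bits=16)[:10])  # fine: 65,536 levels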

To determine the size of a sound file, we need to multiply the sampling frequency by the sample size and the length of the sound recording in seconds:

file size (in bits) = sampling frequency (Hz) × sample size (bits per sample) × duration (s)

Equation  1 . Calculation of a sound file size in bits.

Therefore, a sound file will become larger if the sampling frequency, sample size, or duration of the recording increase. When playing sound files over the Internet, the quality of the sound is affected by the bit rate, which is the number of bits transmitted over a network per second. The higher the bit rate, the more quickly a file can be transmitted and the better the sound quality of its playback. One should note that, in the case of securing SER systems, small-sized files are easier to transmit as part of a cyber-attack. That means that large sound files, which are more accurate in terms of the original sound recorded, are harder to utilize in order to perform a cyber-attack.
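As a short worked example of Equation 1, the following lines compute the size of a one-minute mono recording at a common sampling frequency and sample size (the chosen values are illustrative):

# Worked example of Equation 1 with common CD-quality mono parameters.
sampling_frequency = 44_100       # samples per second (Hz)
sample_size = 16                  # bits per sample
duration = 60                     # seconds

size_bits = sampling_frequency * sample_size * duration
print(size_bits)                  # 42,336,000 bits
print(size_bits / 8 / 1_000_000)  # ≈ 5.29 MB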

There are several types of analog-to-digital conversion algorithms used today to produce a digital form of a sound wave (Failed 2016 ), including the counter (which contains sub-types, e.g., ramp compare), ramp, integrative (which contains sub-converters, e.g., dual slope), and delta-sigma types.

3.3.2 Time-frequency domain

  • Spectrogram

A sound wave can be visualized in two domains: the time domain and the frequency domain. In order to convert the time domain to the frequency domain, we need to apply mathematical transformations. The time domain visualization shows the amplitude of a sound wave as it changes over time; when the amplitude in the time domain is equal to zero, it represents silence. These amplitude values are not very informative, as they only refer to the loudness of an audio recording. To better understand the audio signal, it must be transformed into the frequency domain. The frequency domain representation of a signal tells us which frequencies are present in the signal. The Fourier transform (FT) is a mathematical transformation that can convert a sound wave (which is a continuous signal) from the time domain to the frequency domain.

Sound waves, which are audio signals, are complex signals: they travel through a medium as a combination of constituent waves, each with a specific frequency. When we record a sound, we can only record the resultant amplitudes of its constituent waves. By applying the FT, a signal can be broken into its constituent frequencies; the transformation provides not only the frequency of each component but also its magnitude. To process a digitized sound wave, which is a sequence of amplitudes sampled from a continuous audio signal, we use the fast Fourier transform (FFT) (Huzaifah 2017). The FFT differs from the continuous FT in that its input is a discrete (sampled) signal rather than a continuous one. Figures  9 and 10 illustrate an input audio signal and the output obtained after applying the FFT to the same signal. The original signal in Fig.  10 is a recording of the term "speech emotion recognition."

figure 9

Fourier transform applied on a given sound wave

figure 10

Sound wave in the time domain (left) and frequency domain (right), after applying FFT
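The following minimal Python sketch illustrates the same time-domain to frequency-domain move with NumPy's FFT; the two-tone signal is synthetic and purely illustrative:

# Minimal sketch of moving from the time domain to the frequency domain with the FFT.
# The two-tone signal below is synthetic and purely illustrative.
import numpy as np

fs = 8000                                    # sampling frequency (Hz)
t = np.arange(0, 1.0, 1.0 / fs)              # one second of samples
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.fft.rfft(signal)               # FFT of a real-valued signal
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
magnitude = np.abs(spectrum)

# The two strongest frequency bins should sit near 300 Hz and 1200 Hz
top = freqs[np.argsort(magnitude)[-2:]]
print(sorted(top))                            # -> [300.0, 1200.0]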

When we apply the FFT to an audio file, we obtain the frequency values, but the time information is lost. In tasks such as speech recognition, the time information is critical for understanding the flow of a sentence. Spectrograms are visual representations of the frequencies of a signal as they change over time. In a spectrogram plot, one axis represents time, while the other axis represents frequency. The colors represent the amplitude (in dB) of the observed frequency at a particular time. Figure  11 presents the spectrogram of the same audio signal shown above in Fig.  10. Bright colors represent high amplitudes; similar to the FFT plot, the lower frequencies (ranging from 0 to 1 kHz) are bright.

figure 11

Spectrogram of an audio signal

Mel spectrogram

The Mel Scale

The human brain does not perceive frequencies on a linear scale; it is more sensitive to differences in low frequencies than in high frequencies, meaning that we can detect the difference between 1000 and 1500 Hz, but we can barely tell the difference between 10,000 Hz and 10,500 Hz. Back in 1937, Stevens et al. ( 1937 ) proposed a new unit of pitch such that equal differences in pitch would sound equally distant to the listener. This unit is called the Mel, which comes from the word "melody" and indicates that the scale is based on pitch comparisons. To convert a frequency (f, in Hz) into Mel (m), we need to perform the following mathematical operation:

m = 2595 · log10(1 + f / 700)

Equation  2 . Frequency to Mel scale conversion formula.

Mel Spectrogram

A Mel spectrogram is a spectrogram in which the frequencies (the y-axis) are converted to the Mel scale. Figure  12 presents the Mel spectrogram produced by the Librosa package Footnote 14 in Python. Note that in contrast to the spectrogram shown in Fig.  11, the Mel spectrogram of the same sound signal has a different frequency axis, which is directly determined by Eq.  2.

figure 12

Mel spectrogram of an audio signal
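As a brief sketch of how the two representations discussed above can be produced with the Librosa package (the file path "speech.wav" is a placeholder; any mono speech recording would do):

# Sketch of producing the two representations discussed above with Librosa.
# "speech.wav" is a placeholder path; any mono speech recording would do.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)            # digitized waveform + sampling rate

# Linear-frequency spectrogram: magnitude of the short-time Fourier transform, in dB
stft = librosa.stft(y)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Mel spectrogram: the same information with frequencies warped onto the Mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(spectrogram_db.shape, mel_db.shape)               # (freq_bins, frames), (128, frames)
# For reference, librosa.hz_to_mel(f, htk=True) implements the 2595*log10(1 + f/700) form of Eq. 2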

Given the above methods for the time–frequency domain, we can now compare the two representation methods (spectrograms and Mel spectrograms). The frequency bins of a spectrogram are spaced at equal intervals based on linear scaling. In contrast, the Mel frequency scale uses a logarithmic scheme similar to that of the human auditory system.

Table 1 provides a comparison of the differences between the abovementioned methods. As can be seen in the table, there are several main differences between the analog-to-digital conversion and time–frequency domain representation methods. The parameters for comparison were chosen based on their relevance to audio analysis.

As Table  1 shows, raw digital representations and linear spectrograms, for example, are not well suited to human sound perception, meaning that they do not represent speech as a human hears it; this is a desirable property in sound analysis tasks, as we wish the machine to "think," "hear," and "react" like a human being. Therefore, their use is not recommended for the SER task. Analog representation captures sound in a way that suits human sound perception; however, it poses analysis challenges (Batliner et al. 2011) (e.g., imprecise representation and high computational complexity). Spectrogram representation is widely used for text-to-speech tasks, as it can distinguish letters and words based on how the amplitudes of the sound vary over time; however, it is limited in its ability to identify informative changes occurring within the same amplitude, such as variations in the sound's intonation, which are crucial for SER and the categorization of sub-emotions. Therefore, and based on a variety of recent successful studies, we consider the Mel spectrogram the ideal representation of sound for the SER task, particularly when using deep learning algorithms (Lech et al. 2018; Yao et al. 2020).

3.4 Feature extraction from the raw sound data

After converting the audio signals using one of the methods described in the previous section, a feature extraction phase needs to be performed. Anagnostopoulos et al. (Yan et al. 2022) described various features that can be extracted from the raw sound data. Swain et al. (Swain et al. 2018) surveyed the work performed in the speech emotion recognition field and divided the features used in the SER domain into two main approaches. The first approach is based on prosodic features; it includes information regarding the flow of the speech signal and consists of features such as intonation, duration, intensity, and sound units correlated to pitch, energy, duration, glottal pulse, etc. (Rao and Yegnanarayana 2006). The second approach is based on vocal tract information, and these features are known as spectral features. To produce the spectral features, the FT is applied to the speech frame; the main features are the Mel frequency cepstral coefficient (MFCC), Footnote 15 perceptual linear prediction coefficient (PLPC), Footnote 16 and linear prediction cepstral coefficient (LPCC). Different studies used different features or a combination of the features described above. Each feature has its own advantages and disadvantages, but they all have one property in common: they contain the most sensitive information regarding the audio file, meaning that sabotaging the creation of the features will sabotage the entire SER system, including its accuracy and functionality. The related work section presents the studies that have leveraged these kinds of features for the SER task.
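As a brief illustration of both feature families, the following Python sketch extracts frame-level MFCCs (spectral) together with simple prosodic correlates (pitch and energy) using Librosa, and pools them into a single utterance-level vector; "speech.wav" is a placeholder path, and the pooling scheme is just one common choice:

# Sketch of extracting the two feature families described above with Librosa:
# spectral features (MFCCs) and simple prosodic correlates (pitch and energy).
# "speech.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # spectral: 13 MFCCs per frame
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)           # prosodic: frame-level pitch (Hz)
energy = librosa.feature.rms(y=y)[0]                    # prosodic: frame-level energy

# A common, simple utterance-level representation: per-feature means and variances
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.var(axis=1),
    [np.nanmean(f0), np.nanvar(f0), energy.mean(), energy.var()],
])
print(features.shape)  # -> (30,)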

3.5 The reflection of emotions in human’s speech

Reliable acoustic correlates of emotion or affect in the audio signal are required for the detection of emotions in speech. This has been investigated by several researchers (Harrington 1951; Cooley and Tukey 1965), who found that speech correlates are derived from physiological constraints and reflect a broad range of emotions. Their results are more controversial when examining the differences between fear and surprise, or boredom and sadness. Physiological changes are often correlated with emotional states, and such changes produce mechanical effects on speech, especially on its pitch, timing, and voice quality. People who are angry, scared, or happy, for instance, experience an increase in heart rate and blood pressure, as well as a dry mouth; this results in higher-frequency, louder, faster, and strongly enunciated speech. When a person is bored or sad, their parasympathetic nervous system is activated, their heart rate and blood pressure decrease, and their salivation increases; this results in monotonous, slow speech with little high-frequency energy. Based on this, we can identify four main emotions that have different effects on human speech (Pierre-Yves 2003). Table 2 compares the effects of different emotions on speech. Using the visualization of a sound wave described earlier, we can identify emotions in speech based on the specific characteristics associated with each emotion.
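The following toy Python sketch illustrates the kind of mapping summarized in Table 2, using invented thresholds on three prosodic cues to make a coarse high-arousal versus low-arousal decision; real SER systems learn such mappings from data rather than relying on fixed rules:

# Toy illustration of the acoustic correlates summarized above. The thresholds are
# invented for illustration only; real SER systems learn such mappings from data.
def coarse_arousal(mean_pitch_hz, mean_energy, speech_rate_syll_per_s):
    """Very rough high/low-arousal decision from three prosodic cues."""
    votes = sum([
        mean_pitch_hz > 200,          # raised pitch (e.g., anger, fear, joy)
        mean_energy > 0.05,           # louder speech
        speech_rate_syll_per_s > 4,   # faster speech
    ])
    return "high arousal (e.g., anger/fear/joy)" if votes >= 2 else \
           "low arousal (e.g., sadness/boredom)"

print(coarse_arousal(240, 0.08, 5.5))  # -> high arousal
print(coarse_arousal(140, 0.02, 2.5))  # -> low arousal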

4 The evolution of speech emotion recognition methods along the years

Understanding and categorizing the main methods by which an emotion is recognized by SER systems and how the systems analyze speech are important in identifying SER systems’ vulnerabilities and potential attacks that might exploit these vulnerabilities. Moreover, the ability to identify trends in the use of the methods could shed light on ways to mitigate potential attacks and address existing vulnerabilities.

The field of speech emotion recognition has engaged many researchers over the last three decades. At the beginning, the main challenges were determining how to analyze speech waves and which features could be extracted directly from sound waves. Later studies leveraged this knowledge and developed different algorithms for modern SER systems. The evolution of SER systems has been a continuous process of incorporating new components, algorithms, and databases to improve accuracy and efficiency. However, with each addition, the need for protection and security also increased. Therefore, previous studies on SER systems must be reviewed to identify potential vulnerabilities and develop measures to safeguard the system's integrity. By doing so, we can create more secure and reliable SER systems that can effectively recognize and respond to human emotions.

Although many studies have been performed in the field of SER, some of them have advanced the SER domain by using novel techniques or unique methods, while others have focused on improving system performance using existing methods. Table 3 presents the studies performed over the years that introduced something unique, along with the main methods and features proposed in those studies. The last column in Table  3 ("Uniqueness") presents the uniqueness of each study in a nutshell, while other studies presenting similar work are mentioned in the text following Table  3. As can be seen in Table  3, the studies used different methods and techniques to detect the emotions in the speakers' utterances, employing unique and/or novel methods.

First analysis of emotions in speech

The first attempt to analyze emotions in speech was made in the 1970s; the study’s main goal was to determine the emotional state of a person using a novel speech analyzer (Williamson 1978). In this study, Williamson used multiple processors to analyze pitch or frequency perturbations in human speech to determine the emotional state of the speaker. This study was the first to analyze the existence of emotions in speech. Twenty years passed before another study proposed novel technologies in the context of modern SER; in the 1990s, with the increased use of machine learning algorithms and advancements in computational technology, researchers began to use a variety of classic machine learning methods to detect emotions in speech.

First use of ML algorithms and feature extraction methods

The early twenty-first century saw significant advances in artificial intelligence (AI) in general and in the machine learning domain in particular, accompanied by the development of unique advanced machine learning algorithms and designated feature extraction methodologies. Dellaert et al. ( 1996 ) used both new features (based on a smoothing spline approximation of the pitch contour) and three different ML algorithms (kernel regression, k-nearest neighbors, and a maximum likelihood Bayes classifier) for the task of SER. Until 2000, no large-scale study using the modern tools developed in the data mining and machine learning community had been conducted. Only one or two learning schemes were tested (Polzin and Waibel 2000; Slaney and McRoberts 1998), and only a few or very simple features were used (Polzin and Waibel 2000; Slaney and McRoberts 1998; Whiteside 1998), which caused these statistical learning schemes to be inaccurate and unsatisfactory. In 2000, McGilloway et al. ( 2000 ) used the ASSESS system (Automatic Statistical Summary of Elementary Speech Structures) to extract features from a sound wave; this produced poor-quality features, resulting in 55% accuracy with a linear discriminant classification method.

First use of neural networks

Progress was made in 2003 when Pierre-Yves ( 2003 ) used neural networks, mainly radial basis function artificial neural networks (RBFNNs) (Orr 1996), for the task of SER for a human–robot interface. In this case, basic prosodic features, such as pitch and intensity extracted from audio recordings, served as input to the algorithm. Since then, many other studies have been conducted in the field of SER, in a variety of domains, ranging from purely linguistic (Kryzhanovsky et al. 2018; Badshah et al. 2017; Lech et al. 2018; Lim et al. 2017; Bakir and Yuzkat 2018) to para-linguistic (Pierre-Yves 2003; Satt et al. 2017; Hajarolasvadi and Demirel 2019; Khanjani et al. 2021), from real-life utterances (Pierre-Yves 2003) to the recorded utterances of actors (Kryzhanovsky et al. 2018; Satt et al. 2017; Badshah et al. 2017; Lech et al. 2018; Hajarolasvadi and Demirel 2019), and from the use of digital data in the time domain (Williamson 1978) to the use of spectrograms in the frequency–time domain (Kryzhanovsky et al. 2018; Satt et al. 2017; Badshah et al. 2017; Lech et al. 2018; Hajarolasvadi and Demirel 2019).

Combining prosodic and spectral features

Between 2005 and 2010, several experiments were performed using prosodic features and/or spectral features (Luengo et al. 2005; Kao and Lee 2006; Zhu and Luo 2007; Zhang 2008; Iliou and Anagnostopoulos 2009; Pao et al. 2005; Neiberg et al. 2006; Khanjani et al. 2021). Those studies compared the performance of different machine learning and deep learning algorithms in the task of detecting the correct emotion in a specific utterance. Since then, many studies have used a plethora of advanced data science methods for the task of SER. In Satt et al. ( 2017 ), for example, the authors used an ensemble of neural networks (convolutional and recurrent neural networks) applied to spectrograms of the audio files to detect the emotions concealed in each recording; moreover, they used harmonic analysis to remove non-speech components from the spectrograms. Later research (Badshah et al. 2017) also applied deep neural networks to spectrograms, but with the use of transfer learning based on the pre-trained AlexNet. In (Alshamsi et al. 2018), the researchers used cloud computing to classify real-time recordings from smartphones (using an SVM model stored in the cloud). In the last few years, many researchers have attempted to improve the accuracy of methods proposed in prior research by adjusting the algorithms (e.g., replacing layers in a neural network to adapt it for the SER task), creating new feature extraction methods, and modifying the algorithms for different types of technologies (e.g., robots, human–computer interfaces).
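As a minimal sketch of the classic-ML style of pipeline referenced in this subsection, the following Python lines pair utterance-level feature vectors (such as those extracted earlier) with emotion labels and train an SVM; the random arrays are stand-ins for features computed from a real labeled corpus:

# Sketch of the classic-ML SER pipeline discussed above: utterance-level feature
# vectors paired with emotion labels and fed to an SVM. The random arrays below are
# stand-ins for features extracted from a real labeled corpus (e.g., RAVDESS).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                        # 200 utterances x 30 features
y = rng.choice(["anger", "joy", "sadness", "neutral"], size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")  # ~chance on random data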

Figure  13 contains a timeline showing advancements in the SER domain and the most important milestones in the domain’s evolution. As can be seen, SER has advanced significantly over the last three decades. The first attempt to detect emotions in speech, made in the 1970s, in which prosodic features were proposed, paved the way for the development of the advanced AI methodologies used in modern SER systems.

figure 13

SER development timeline

5 Main domains and applications of SER systems

Understanding a human's emotions can be useful in many ways. As human beings, reacting appropriately to an emotion expressed in a conversation is part of our daily life, and when interacting with a machine, we expect it to react like a human being. Therefore, emotion recognition systems have been, and are still being, developed to improve human–machine interfaces. Emotion recognition systems are used in many fields, and in recent studies researchers have applied SER software to real-life systems. For example, in 2015, Beherini et al. ( 2015 ) presented FILTWAM, a framework for improving learning through webcams and microphones. The system was developed to provide online feedback based on students’ vocal and facial expressions to improve the learning process; the data was collected in real time and fed into a SER system to attempt to determine whether a student was satisfied with the learning, frustrated, depressed, etc. In another study, robots were programmed to recognize people's emotions in order to improve human–robot interaction (Rázuri et al. 2015). Such robots can react based on the emotional feedback of the person speaking, which could be a useful tool for understanding people with autism and the actual content of their speech. According to Utane and Nalbalwar ( 2013 ), SER systems can be used in a variety of different domains; for example, the use of SER systems at call centers could be helpful in identifying the best agent to respond to a specific customer's needs. Likewise, in airplane cockpits, SER systems can help recognize stress among pilots; conversations between criminals can be analyzed by SER systems to determine whether a crime is about to be performed; the accuracy of psychiatric diagnosis and the performance of lie detection systems could be improved with the use of a SER system; and in the field of cyber-security, the use of sound biometrics for authentication is being explored as another application of SER. Based on the studies mentioned above, we created a taxonomy of the various domains in which SER systems are used. The taxonomy is presented in Fig.  14.

figure 14

Taxonomy of SER domains and applications

As can be seen in the taxonomy, SER systems are used in diverse domains and applications. In the previous section, our overview of prior studies and the work performed in the SER field showed that despite their applications in many domains, no studies have focused on the security aspects of SER systems. Because SER systems are used in many daily activities, cyber-attacks aimed at them may cause damage in a variety of different domains. Attackers might attack customer service applications to sabotage a company's reputation; moreover, attacks might be conducted on security systems using SER technology to disrupt investigations. Therefore, a secure system is required. As far as we could find, no work has been done to analyze the security of SER systems. Since SER systems use the actual voices of humans, they can be seen as a huge database of people's voices. For example, a voice biometric system, which can use a SER algorithm to accurately identify an individual in different emotional states, is vulnerable to a "spoofing attack," an attack in which a person or a program masquerades as another. As noted above, a great deal of personal information is embedded in human speech, especially when a person's emotions are detected. Cyber-attacks on SER systems may drastically decrease the demand for those systems, due to the privacy invasion latent in such systems. Therefore, in the following section we elaborate on and analyze the security aspects associated with SER systems.

6 Security analysis of speech emotion recognition systems

6.1 Speech emotion recognition ecosystem

Before we discuss the cyber-attacks aimed at SER systems, it is necessary to understand the SER ecosystem. An ecosystem (Kintzlinger and Nissim 2019 ; Landau et al. 2020 ; Eliash et al. 2020 ) is the combination of all of the players and devices contained in the main system and the interactions between them, which are crucial for the information flow in the system. This knowledge is important for understanding the SER system process, its existing vulnerabilities, and the potential cyber-attacks associated with it. Figure  15 below shows the SER ecosystem, and Table  4 contains the legend for Fig.  15 .

figure 15

SER ecosystem

A SER system has two main phases, the training and production phases. In the training phase, the sound wave data is collected via a device in the (A) personal cyber space; such devices include external recording devices or various microphone-integrated devices, such as a smartphone, smartwatch, tablet, Bluetooth earphone, personal computer, beeper, or hearing aid device (Kintzlinger and Nissim 2019 ). The recordings collected in this phase are the raw data, which serves as input to the SER system. Note that to produce clear sound waves, without noise or background sounds, a noise reduction device is needed. Then, after recording the person, data processing and analysis is performed, in which features are extracted and a (B) classifier is induced and used to determine the emotion of the person when the recording was made. In addition, the SER system may use an external database (DB) that contains additional information, such as demographic information or gender, to improve its performance. After identifying the emotions expressed in the recorded utterances, it is possible to store both the original sound and its labels (the emotions) in a DB (i.e., the training DB in Fig.  15 ). Note that this training DB can also be used to train other classifiers or perform statistical analysis.

In the production phase, the SER system can be used in different applications, each of which has its own DB, end users, and operators. In some cases, the application’s end user and the application’s operator are the same person (for example, in entertainment applications), and in other cases, they are different people (for example, in employee recruitment applications). In addition, each end user or operator may be the recorded person for the SER algorithm. This usually occurs when the end user’s sounds are needed to continuously update (re-train and re-induce) SER classification models (e.g., cockpit controlling systems, entertainment SER-based systems). Note that in a SER system aimed at maintaining its updateability and relevance in the long term, these test DBs (after being verified and labeled) may be used to enrich the training DB and induce an updated SER classifier.

As can be seen in the SER ecosystem, in some cases the end user receives and uses feedback from the SER system for his/her own benefit (e.g., entertainment systems); in other cases, the operator is the only one who receives and uses the feedback, while the end user only provides the speech that serves as input to the SER system.

To better understand the importance of the end user and the operator to the ecosystem, we provide a brief description of their interaction in each of the domains shown in Fig.  15 (in the production phase).

Cockpit systems and physical security systems—the end user's utterances are fed to the SER system, and the operator analyzes the user’s emotional state based on the SER classification results. Since SER systems do not have the ability to fully understand human common sense in general, and in particular are unable to identify the context in which the utterances were said, intervention by a human operator is needed in some cases. For example, a soldier entering a battlefield wearing a helmet in which a SER system is embedded sends signals to the operator (see the one-directional arrow in Fig.  15), who needs to decide whether to drill down (asking the end user questions and receiving answers for additional classification).

Employee recruitment systems—the operator (in this case, the human resources recruiter) uses the SER system to analyze the end user's (the candidate for the position) mental state during the job interview. The operator asks questions, and the SER system receives the answers from the end user, analyzes the user’s utterances, classifies his/her emotions, and provides feedback to the operator.

Educational systems—the operator, who may be a teacher, social worker, pedagogical director, etc., uses the SER system to better understand the emotions expressed by the end user (a student). For example, the end user may be a student with an autistic spectrum disorder who has difficulty expressing his/her emotions during a class or guidance session; the use of a SER system, which can automatically extract the student’s utterances and accurately classify them and provide additional information regarding the student’s mental and emotional state, may enable the human consultant (i.e., the operator) to better understand the student and meet his/her needs.

Entertainment systems—in this case, the same person acts as an operator and end user. An entertainment SER system allows the user to interact with the system, meaning that the user sends speech waves and receives the emotions classified during his/her leisure time. Virtual assistants (e.g., Siri, Alexa) in smartphones may contain a SER system that enables them to react properly in response to the end user's mood (see the two-directional arrow which reflects the interaction between the end user and the system in Fig.  15 ).

Cyber-security systems—the end user of SER systems in the cyber-security domain may be a person being queried by a lie detection system (e.g., polygraph). Lie detection systems that use an emotion-based approach (EBA) analyze the answers provided by the user in response to the operator's questions. Then the EBA system classifies the answers using its designated SER module, providing the operator with the classification decision regarding the emotions hidden in the user’s answers (in this case, a one-directional arrow reflects the interaction between the EBA system and the operator or the examined person in Fig.  15 ).

The SER ecosystem can suffer from vulnerabilities that leave it exposed to cyber-attacks; in some cases, specific components may be vulnerable, and in others, the malicious use of the SER system itself can result in a cyber-attack. The different players (humans and components) and their interactions are presented in Fig.  15, which describes the full SER ecosystem.

6.2 Potential cyber-attacks aimed at SER systems

Our analysis of the ecosystem presented in the previous subsection enabled us to identify vulnerabilities that can be exploited by adversaries to perform cyber-attacks. This analysis of the SER ecosystem, along with several review studies that focused on the cyber-security domain (Orr 1996; Chen et al. 2011), enabled us to further explore and identify potential attacks that can be performed on SER systems. In addition to the new attacks we suggest, there is a wide range of existing cyber-attacks aimed at voice-based systems (Orr 1996).

Given that SER systems incorporate an ML model and operate in the domain of voice-based technology, it is important to emphasize that any cyber-attack targeting voice-based or ML-based systems is also relevant to SER systems. While the core principles of cyber-security apply universally, SER systems' unique characteristics mean that attacks on these systems can have different, far-reaching consequences. Cyber-attacks aimed at SER systems can not only compromise model integrity and data privacy; they can also manipulate the interpretation of emotions and impact user experience. Therefore, understanding the threats pertaining to voice and ML-based systems is crucial. In this section we present the attacks that have the greatest impact on SER systems yet still lack a security mechanism capable of providing a defense against them.

Table 5 summarizes our analysis of 10 attacks aimed at SER systems, as well as the causes and impact of the attacks. For each attack, we indicate whether the attack is passive or active (in passive attacks there is no impact on the system’s resources, but the attacker can observe and/or copy content from the system, whereas in active attacks the attacker tries to modify the data and/or the system's functionality), the system phase in which the attack occurs (the training or production phase), the implications of the attack (meaning the impact the attack has on the system or its users), and the relevant application. For some attacks known in the cyber-security community (e.g., replay or poisoning attacks), we present the attack and its variation in the SER domain; for example, a replay attack, which is usually performed by replaying an original transaction for a different and malicious purpose, can be performed in the SER domain by combining parts of the original transaction (e.g., voice recording) in a different order, creating "new" content with the same voice (see attack #8 in Table  5 ).

The text that follows Table  5 provides more detailed information regarding each attack, particularly about the possible attack vectors (the path used by the attacker to access the system and exploit its vulnerabilities) and the attack's flow description.

6.2.1 Attack No. 1- data exfiltration

Possible attack vectors.

Malware is transmitted to the user’s device (e.g., smartphone, tablet, laptop) via the Web (e.g., Google Play) or a USB (e.g., using a malicious USB-based device).

Attack flow description

The compromised device uploads the recorded voice to a malicious third-party who exploits it for malicious purposes. Although this attack may be general and relevant to multiple systems, its execution in the SER domain is relatively easy, making the SER system extremely vulnerable to this kind of attack. Since SER systems (in production environments) use actual voice recordings of their users, which are usually collected via a user's personal cyber space devices, any malware distributed to such devices can compromise the SER system, allowing an attacker to exploit the voice recordings collected by the system and use them for their own malicious purposes.

6.2.2 Attack No. 2- malware distribution

An SER developer innocently downloads a malicious version of a common library for SER system development (e.g., Librosa, SoundFile, or FFTPACK) from the Web (in Python, Java, etc.).

SER system developers commonly use third-party libraries to streamline the development process. For example, SER systems usually rely on various libraries and software packages to perform audio processing, feature extraction, etc. Therefore, the integration of a malicious library can compromise the core functionality and security of a SER system. Moreover, since SER systems are usually implemented as part of larger systems, attackers might exploit SER systems to launch attacks on connected networks or devices. For example, after a malicious library has been imported to and compiled in the developed SER system, the system would become infected; the infected system could then be uploaded to application markets or Web applications and thereby get distributed to additional users, with the ability to compromise their systems for a variety of malicious purposes.

6.2.3 Attack No. 3- SER DB poisoning

An online SER DB used to train a SER-based model is maliciously manipulated by downloading an existing DB (e.g., RAVDESS Footnote 17 ) and uploading a new malicious one.

The attacker downloads an existing and widely used SER DB and adds some adversarial sound examples (e.g., mislabeled samples, misleading noise that will confuse the learning model) to it; this will result in an inaccurate induced SER model trained on a poisoned DB. Then the attacker uploads the poisoned DB to the Web in the form of a DB update or alternatively shares it as a new, publicly available SER DB.
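To make the mechanics of this attack concrete, the following minimal Python sketch flips a fraction of the emotion labels in a dataset's metadata file. The file name and column layout (`labels.csv` with `path` and `emotion` columns) are assumptions for illustration and do not correspond to the actual layout of RAVDESS or any specific DB; a simple integrity check (e.g., comparing the downloaded DB's hash against a value published by a trusted source) would already expose this kind of tampering.

```python
# Hypothetical label-flipping poisoning of a SER dataset's metadata (attack #3).
# "labels.csv" with columns [path, emotion] is an assumed layout for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
meta = pd.read_csv("labels.csv")                      # e.g., columns: path, emotion

flip_fraction = 0.10                                  # poison 10% of the samples
emotions = meta["emotion"].unique()
poison_idx = rng.choice(len(meta), size=int(flip_fraction * len(meta)), replace=False)

for i in poison_idx:
    wrong = rng.choice([e for e in emotions if e != meta.loc[i, "emotion"]])
    meta.loc[i, "emotion"] = wrong                    # mislabel, e.g., "sad" -> "happy"

meta.to_csv("labels_poisoned.csv", index=False)       # redistributed as a "new" DB
```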

6.2.4 Attack No. 4- malicious SER model distribution

Uploading an open-source malicious SER induced model to the Web, making it publicly available for use.

The trained malicious SER model is uploaded to the Web for public use (as a publicly available trained model, such as a VGG neural network, or ResNet model). The malicious model is intentionally trained using misleading data, which will result in incorrect emotion classification. The user may download a malicious SER model and adjust it for his/her SER system without being aware that it is a malicious model.

6.2.5 Attack No. 5- inaudible sound injection

The sound can be added via sound playback (playing high-frequency audio recordings) or ultrasound injection.

The attacker employs one of the following vectors: (1) the attacker is physically proximate to the target speaker; or (2) the attacker leverages a position-fixed speaker that produces ultrasounds (sound waves at a frequency so high that they can only be sensed by a microphone and not heard by human beings). The device that stores and executes the SER system picks up the ultrasounds, which may contain utterances expressing emotions chosen by the attacker (such as calm, self-confident utterances) that differ from the victim's actual mental state during the attack. By doing so, a candidate for a job position, for example, may be classified by the SER system as calm and relaxed instead of nervous and anxious, and thus may be considered an appropriate candidate. Note that in this attack the attacker does not change the original voice recording but overrides the victim's voice with an inaudible sound.

6.2.6 Attack No. 6- emotion removal

The emotion can be removed through the following procedures:

A generative adversarial network (GAN) is used to learn sensitive representations in speech and produce neutral emotion utterances.

Malware is transmitted to the user’s device, via either:

the Web (e.g., Google Play).

a malicious USB-based device.

The attacker uses an emotion removal ML model (based on CycleGAN-VC (Aloufi et al. 2019 )), which creates utterances with neutral emotion by removing the prosodic and spectral features from the original recording. Then the emotionless utterance is sent to the SER system which classifies every input as neutral.

6.2.7 Attack No. 7- adversarial sound

Playback of sound waves produced by a GAN.

An attacker uses a GAN to produce adversarial examples, which are samples aimed at distorting the model's classification (e.g., producing a sample that expresses anger but will be classified as joy). Using the classification of the discriminator (one of a GAN's neural networks, which differentiates genuine from artificial samples), the attacker can generate artificial (i.e., perturbed) samples. After producing the adversarial samples, the attacker can fool the SER system by inserting the samples into the system, thereby misleading the classification process. For example, by producing such samples, an attacker could fool the SER system regarding his/her emotional state (e.g., the system could misclassify the attacker's emotional state as a depressed state, allowing the attacker to obtain a prescription for a specific medication).
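The attack described above relies on a GAN; as a simpler stand-in that illustrates the same principle of adversarial perturbation, the following hedged sketch applies an FGSM-style gradient step to a feature vector fed to a toy SER classifier. The model architecture, feature dimensionality, and emotion classes are placeholders, not the setup used in the cited work.

```python
# FGSM-style sketch of an adversarial perturbation against a toy SER classifier
# operating on a feature vector (e.g., MFCC statistics). Illustrative only;
# the attack described above uses a GAN rather than this gradient method.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))  # 4 emotion classes
model.eval()

x = torch.randn(1, 40, requires_grad=True)    # stand-in for extracted utterance features
true_label = torch.tensor([0])                # e.g., 0 = "anger"

loss = nn.functional.cross_entropy(model(x), true_label)
loss.backward()

epsilon = 0.05                                # perturbation budget
x_adv = x + epsilon * x.grad.sign()           # small nudge intended to change the prediction

print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # the predicted emotion may flip
```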

6.2.8 Attack No. 8- voice impersonation replay attack

Audio samples collected via a spam call to a victim in which the person's voice is recorded, cloud servers that store audio files, online datasets used to train the SER system, and/or a malicious application uploaded to the market that exfiltrates a recording of a user.

An attacker who wishes to impersonate another person collects an audio sample of the victim via one of the attack vectors. A user authentication system (e.g., semi-autonomous car system that identifies the car owner's emotional state to determine his/her ability to drive or his/her emotional state while driving (Sini et al. 2020 )) can be misled by an attacker who uses recordings of the car owner in different emotional states to sabotage the user authentication system.

In a different scenario, the attacker can use the audio samples collected to produce new utterances with the same voice of the victim (by using the deep fake cut and paste method (Khanjani et al. 2021 ) that re-orders parts of the full utterance according to a text dependent system). By replaying the new utterances, an attacker can create fund transfers, smart home commands, etc.
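As a rough illustration of the cut-and-paste re-ordering described above, the sketch below concatenates re-ordered segments of a recording into a "new" utterance in the same voice. The file names and segment boundaries are placeholders; a real attack would align the cuts with word boundaries, which this sketch does not attempt.

```python
# Illustrative re-ordering ("cut and paste") of segments from a voice recording.
# File names and segment times are placeholders.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("victim_recording.wav", sr=16000)   # hypothetical source audio

segments = [(0.0, 1.2), (3.5, 4.1), (1.8, 2.6)]           # (start, end) in seconds, re-ordered
pieces = [y[int(start * sr):int(end * sr)] for start, end in segments]
forged = np.concatenate(pieces)

sf.write("forged_utterance.wav", forged, sr)              # the "new" content replayed to the target
```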

6.2.9 Attack No. 9- induced classification model inference

Querying an online SER service via its API and inferring the model itself.

The SER model f is uploaded to an ML online service (e.g., PredictionIO Footnote 18 ) for public use (mainly in the entertainment sector). The attacker, who obtains black-box access (accessible only via prediction queries) to the model via its API, queries the model as many times as needed to infer the model's learned decision boundaries and then produces a model f̂ that approximates f. By doing so, the black-box SER model effectively becomes a nearly white-box model available to the attacker, who can exploit or steal it to meet his/her needs or otherwise profit from it. With a model f̂ that approximates the original SER model, the attacker gains an in-depth understanding of how the SER system functions, and this knowledge can be exploited for various purposes; for example, the attacker can manipulate the model to misclassify emotions, potentially causing the SER system to provide incorrect results, with consequences ranging from a degraded user experience to faulty decision-making. This cyber-attack, originally aimed at inferring and stealing any ML model, can be adapted to SER systems by accessing the SER model via its API, with implications for both the SER application operator and its end users.
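The following self-contained sketch illustrates the extraction loop under stated assumptions: the "victim" service is simulated locally by a hidden classifier that the attacker can reach only through prediction queries, and the surrogate is an off-the-shelf scikit-learn model. Neither the feature dimensionality nor the model families reflect any particular online SER service.

```python
# Sketch of model inference/extraction (attack #9): probe a black-box predictor
# and fit a surrogate f_hat that approximates f. The "victim" below is a local
# stand-in for an online SER service reachable only via prediction queries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_secret = rng.normal(size=(2000, 40))
y_secret = rng.integers(0, 4, size=2000)                  # 4 emotion classes
victim = RandomForestClassifier(n_estimators=50).fit(X_secret, y_secret)

def query_ser_api(features):
    return victim.predict(features)                       # black-box access: labels only

X_probe = rng.normal(size=(10_000, 40))                   # attacker-chosen probe inputs
y_probe = query_ser_api(X_probe)
surrogate = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200).fit(X_probe, y_probe)

agreement = (surrogate.predict(X_probe) == y_probe).mean()
print(f"surrogate agreement with victim on probe set: {agreement:.2%}")
```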

6.2.10 Attack No. 10- sound addition using embedded malicious filters

Concealing malicious components in a microphone during the manufacturing process or as part of a supply chain attack and selling them as benign components.

In the process of collecting data and recording the samples to train the SER system, the person being recorded can use a modified malicious recording device (e.g., microphone) rather than a benign device. The malicious microphone can add noise to the original sound wave (thereby distorting the recording process), producing perturbed samples which serve as the raw data used to train the SER model. Since a benign microphone also produces perturbations in some situations, it is impossible for the recorded person to know that he/she is using malicious hardware. Since there are companies developing AI solutions for SER systems (EMOSpeech, CrowdEmotion, deepAffects, etc.), an attacker could impair one of these services or degrade their quality or accuracy by marketing malicious recording equipment.

7 Security mechanisms for SER systems

Several security mechanisms that were not specifically designed for SER systems can be utilized to secure speech-based systems from cyber-attacks. The security mechanisms that are more tailored to SER systems are mainly aimed at preventing adversarial ML attacks on SER systems, as in Latif et al. ( 2018 ) and Jati et al. ( 2020 ).

In this section, we first describe each of the relevant security mechanisms; then, in Table  6 , we map each security mechanism to the 10 attacks aimed at SER systems listed in Table  5 , indicating which attacks it covers. For each security mechanism, we calculated the percentage of attacks covered by that mechanism; as can be seen, some attacks remain unaddressed, leaving SER systems vulnerable to them.

Latif et al. (Latif et al. 2018 ) suggested training the model with adversarial examples to defend against adversarial ML attacks on SER systems. In their paper, they trained a SER model with 10% adversarial samples and 90% benign samples to improve the model's robustness against adversarial ML attacks. They also trained a different neural network-based model on a dataset of samples with additional noise, to generate a model that is robust to sound addition attacks. Another possible defense methodology against adversarial ML attacks, also described in Latif et al. ( 2018 ), involves the use of a GAN to clean the perturbed utterance before running the classifier on it. Their results show that training on samples with additional noise produces a higher error rate than training on adversarial samples. Moreover, using a GAN to clean the perturbed utterances before training produces the lowest error rate (37.18% error on average across two datasets), yet for training, GANs require precise information on the type and nature of the adversarial examples.
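A minimal sketch of the 10%/90% mixing idea follows. It is not Latif et al.'s implementation: the attack used to craft the adversarial samples is abstracted behind a placeholder function (here just bounded random noise), and the true labels are kept for the perturbed copies, as in standard adversarial training.

```python
# Sketch of adversarial training: mix ~10% perturbed samples (with their true
# labels) into an otherwise benign training set. `make_adversarial` is a
# placeholder for whichever attack is used to craft the perturbations.
import numpy as np

def make_adversarial(x_benign: np.ndarray, epsilon: float = 0.05) -> np.ndarray:
    # Placeholder perturbation: bounded random sign noise stands in for a real attack.
    return x_benign + epsilon * np.sign(np.random.randn(*x_benign.shape))

def build_training_set(X_benign: np.ndarray, y_benign: np.ndarray, adv_fraction: float = 0.10):
    n_adv = int(adv_fraction * len(X_benign))
    idx = np.random.choice(len(X_benign), size=n_adv, replace=False)
    X_adv, y_adv = make_adversarial(X_benign[idx]), y_benign[idx]   # keep the true labels
    return np.concatenate([X_benign, X_adv]), np.concatenate([y_benign, y_adv])
```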

In (Zhang et al. 2017 ), the authors described a simple yet efficient solution for preventing attacks that use inaudible sound playback. The main reason for the success of an inaudible sound attack (attack #5 in Table  5 above) is that microphones sense sound waves at high frequencies (over 20 kHz) that humans cannot hear; most microphones implemented in smartphones (MEMS microphones) are built the same way. To prevent such an attack, the microphone should be enhanced and redesigned to block any sound waves in the ultrasound range (e.g., the iPhone 6 Plus microphone is designed to resist voice commands at high frequencies).

A security mechanism used to defend against an ML model theft attack via API querying was suggested by Lee et al. (Lee et al. 2018 ). As described in the previous section, in this attack the attacker is able to steal an ML model if he/she obtains the model's outputs for chosen inputs together with the class probabilities (e.g., for a certain utterance the attacker obtains the emotion recognized in the utterance and its probability). The simplest method of avoiding the attack is for the SER system to provide only the final classification decision, without the class probabilities. The authors suggested a different API query design that forces the attacker to discard the class probabilities when querying the model many times; without usable probabilities from repeated queries, the attacker is unable to reconstruct the original model used for the SER task.
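A hedged sketch of the simplest variant mentioned above (returning only the final decision) is shown below; Lee et al.'s actual proposal goes further by perturbing the returned probabilities, which is not reproduced here. The emotion label set and the `predict_proba` interface are assumptions about the underlying classifier.

```python
# Sketch of a label-only prediction endpoint: class probabilities stay internal,
# so repeated queries yield less information for model extraction.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]   # assumed label set

def public_predict(model, features):
    probs = model.predict_proba([features])[0]            # assumes a scikit-learn-style classifier
    return EMOTIONS[int(probs.argmax())]                  # return the label only
```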

In (Blue et al. 2018 ), the authors proposed a method to defend sound-controlled systems against replay attacks and adversarial ML attacks by differentiating between sounds produced by humans and artificial sound (human vs artificial sound differentiation). Their strategy relies on identifying the sound source of the received utterance: utterances produced by playback devices exhibit characteristic low-frequency properties, and based on this property, the authors were able to determine whether a voice command came from a human being or a playback device. By leveraging this mechanism, one can differentiate between emotional utterances produced by a SER system's end user and utterances produced by a playback device or GAN (in the case of attack #7 in Table  5 ).

Another security mechanism for defending against replay attacks was suggested by Gong et al. ( 2019 ). In their study, the authors created a publicly available dataset containing genuine voice commands and replayed recordings of the same voice commands in various environmental conditions (noise, distance between the speaker and the recording device, etc.). By training on this dataset, an ML model can learn to differentiate between genuine audio samples and their replications, which may make the model robust to replay attacks.
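In the same spirit, a minimal sketch of a genuine-vs-replayed detector trained on such a corpus is given below. The file lists, sampling rate, and choice of mean MFCCs as features are illustrative assumptions rather than the configuration used by Gong et al.

```python
# Sketch of a replay detector trained on paired genuine/replayed recordings.
# Paths and feature choice (mean MFCCs via librosa) are placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

genuine_files = ["genuine_001.wav"]      # placeholder paths; a real corpus is far larger
replayed_files = ["replayed_001.wav"]

X = np.array([mfcc_features(p) for p in genuine_files + replayed_files])
y = np.array([0] * len(genuine_files) + [1] * len(replayed_files))

detector = SVC(kernel="rbf").fit(X, y)   # 0 = genuine, 1 = replayed
```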

Another countermeasure was proposed to prevent voice impersonation replay attacks (Li et al. 2000 ). Li et al. proposed automatic verbal information verification (VIV) for user authentication. In their method, spoken utterances of the speaker attempting to gain access are verified against key information provided during the speaker's registration. During the authentication process, the speaker trying to gain access is asked a set of diverse questions, and his/her answers are compared to the answers stored for that speaker. Using this method, an attacker who uses voice recordings collected from a SER system's database will not necessarily have the correct answers to the verification questions, which will prevent the attacker from gaining unauthorized access. Although this countermeasure exists, it is important to note that it is only relevant for SER systems implemented in authentication systems.

Gui et al. (Gui et al. 2016 ) proposed a mechanism for defending against artificial input attacks, focusing on replay attacks. The scope of their study was brain print (EEG recording) biometric systems, so the main data used were brain print data (EEG signals). To determine whether noise was added to the original data, the authors used an ensemble of classifiers. Although the study was conducted on brain print biometric systems, it can be adapted to SER systems, since brain prints, like sound, are essentially simple waveforms and can be represented using similar methods.

To defend against malware distribution attacks, in 2019 an article was published by Veracode, Footnote 19 an American application security company, proposing a method for discovering malicious packages. Building on the work of Wysopal et al. (Wysopal et al. 2010 ), the authors identified the patterns commonly seen in malicious open-source libraries and then implemented a malicious software package detector based on static analysis for each of the patterns described in Wysopal et al. ( 2010 ). Since malware distribution attacks in the form of malicious software packages are not well known in the programming community (meaning that a typical programmer would not be concerned with cyber-attacks when using an external software package in his/her code), the method proposed by Veracode is not widely used.

Regarding data exfiltration, in Ullah et al. ( 2018 ) the authors covered a wide range of defense mechanisms (not specifically designed for SER systems but rather for data exfiltration attacks in general). The three main groups of countermeasures mentioned are preventive, detective, and investigative countermeasures. Of these, the preventive countermeasures are the most relevant for SER systems; this group includes mechanisms such as data classification, encryption, and distributed storage. In (Kate et al. 2018 ), the authors presented a novel encryption method for voice data. Their method includes three main steps: receiving the audio file as a sequence of zeros and ones, increasing each sequence by two, and multiplying each sequence by 1e+15 to create a 16-digit integer; after that, DNA encryption and a permutation function are applied. Note that the use of this method requires the audio file to be represented in digital form, necessitating an ADC (as described in Section C).
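As a concrete illustration of the preventive class of countermeasures, the sketch below encrypts a recording at rest before it leaves the user's device, using a standard symmetric scheme (Fernet from the Python `cryptography` package). This is not the DNA-based method of Kate et al., which is only summarized above; the file names are placeholders.

```python
# Encryption-at-rest sketch for a voice recording (a generic preventive
# countermeasure, not the DNA-based scheme described above).
from cryptography.fernet import Fernet

key = Fernet.generate_key()                  # must be stored and managed securely
cipher = Fernet(key)

with open("utterance.wav", "rb") as f:       # hypothetical recording file
    ciphertext = cipher.encrypt(f.read())

with open("utterance.wav.enc", "wb") as f:   # only the ciphertext leaves the device
    f.write(ciphertext)
```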

As the results of our analysis of SER systems presented in the previous sections show, cyber-attacks that are unique to the SER domain are less common, and most of the attacks that can be performed on such systems are general attacks aimed at voice-controlled systems; only a few of the security mechanisms are specifically aimed at SER systems. Table 6 , which maps the security mechanisms' coverage of the attacks aimed at SER systems, shows that many attacks, such as emotion removal, poisoning, and malicious SER model attacks (30% of the attacks), remain unaddressed by the existing security mechanisms; these attacks pose a significant threat that must be considered, particularly when developing new SER systems. In addition, the security mechanism with the widest attack coverage, human vs artificial sound differentiation, covers only 30% of the attacks, meaning that even the best security mechanism is not relevant for 70% of the potential cyber-attacks aimed at SER systems.

8 Directions for enhancing the security of SER systems

Given the potential cyber-attacks aimed at SER systems and the lack of sufficient defense mechanisms against such attacks, particularly emotion removal, poisoning, and malicious SER model attacks, for which no security mechanism currently exists, there is a need to develop simple yet efficient defense solutions.

The first direction we present is aimed at improving SER systems' defense against emotion removal attacks, in which modified inputs are presented to the model, which is then unable to classify the emotions that were stripped from the modified input. Given a low-resolution audio sample that contains no emotion features, we suggest reconstructing the original, high-resolution audio sample containing the emotion features; the SER system will then be able to classify the high-resolution audio sample based on the emotion expressed in it. Our suggestion can be illustrated using the following origami example. Imagine an origami bird constructed of folded paper. Unfolding the origami bird produces a piece of paper that contains traces of the folding; this unfolded paper corresponds to the "modified input" of the origami bird. The goal is to reconstruct the original bird using the traces of the original folding that can be seen on the unfolded piece of paper. In the context of SER systems, we turn to a prior study (Kuleshov et al. 2017 ) that proposed a method for reconstructing a high-resolution audio sample from a low-resolution one, using an artificial neural network trained on a large set of high- and low-resolution sample pairs. While that research was not focused on enhancing the security of SER systems, it can be leveraged for this purpose as follows. For each time stamp, a speech signal has a duration and amplitude, and the resolution of the speech signal is represented by its sampling rate: a higher sampling rate provides a higher resolution and vice versa. The reconstructed high-resolution audio sample then serves as the input to the SER system, which will now be able to identify the emotion it contains.

More formally, as described in Kuleshov et al. ( 2017 ), an audio sample is denoted as a function \(s\left(t\right):[0,T]\to {\mathbb{R}}\) , where \(T\) is the duration of the sample in seconds and \(s\left(t\right)\) is the amplitude of the sample at time \(t\) . To obtain the digital measurements of \(s\) , \(s\left(t\right)\) is discretized into a vector \(x\left(t\right)\) using a parameter \(R\) , the sampling rate of \(x\) , which determines the resolution of \(x\) . Based on (Kuleshov et al. 2017 ), the idea is to increase \(R\) by predicting \(x\) from a portion of its samples taken at any timestamp. The high-resolution version of \(x\) is \(y\) , where the sampling rate of \(y\) , denoted by \({R}_{2}\) , is larger than the sampling rate of \(x\) , denoted by \({R}_{1}\) . \(y\) is computed via a function \({f}_{\theta }\left(x\right)\) , where \(\theta\) is determined by training a fully convolutional neural network with parameters \(\theta\) on a set of sample pairs \(({x}_{i},{y}_{i})\) . After computing \(y\) , the reconstructed speech sample with emotion features, we can use it as input to the SER model, which will then be more robust to emotion removal and modified input attacks.
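To make the reconstruction idea tangible, the toy sketch below trains a very small 1-D convolutional upsampler f_theta to map crudely downsampled waveforms x back to their high-resolution versions y. It is a stand-in under stated assumptions (synthetic data, a 2x downsampling factor, a few layers) and not a reimplementation of the much larger network of Kuleshov et al.

```python
# Toy sketch of audio super-resolution: a small 1-D convolutional network maps
# low-resolution input x (rate R1) to a high-resolution estimate of y (rate R2).
# Synthetic data and a 2x factor are used purely for illustration.
import torch
import torch.nn as nn

class TinyUpsampler(nn.Module):
    def __init__(self, factor: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=factor, mode="linear", align_corners=False),
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, x):                # x: (batch, 1, samples at rate R1)
        return self.net(x)               # estimate of y: (batch, 1, samples at rate R2)

model = TinyUpsampler()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

y = torch.randn(8, 1, 1024)              # placeholder high-resolution targets
x = y[:, :, ::2]                         # crude downsampling simulates the low-resolution input

for _ in range(100):                     # minimal training loop
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```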

The above process, along with the origami bird example, is visualized in Fig.  16 . As can be seen, (A) is an unfolded piece of paper containing the folding traces, while (B) is the bird produced by refolding the paper in a specific order. The same principle carries over to the sound domain, where (C) is a spectrogram of a low-resolution audio file containing no emotion features, and (D) is the spectrogram of the high-resolution audio reconstructed from the same audio sample shown in (C).

figure 16

Origami bird and audio file reconstruction

Implementing the suggested concept as a defense mechanism in SER systems could protect the systems from attacks associated with modified inputs (such as an emotion removal attack, i.e., attack #6 in Table  5 ), since the mechanism can also reconstruct the original audio sample in cases in which the sound has been modified.

Another direction for security enhancement aims at improving SER systems' robustness against malicious model attacks (attack #4 in Table  5 ) by using natural language processing (NLP) algorithms (Batbaatar et al. 2019 ). We suggest using NLP algorithms to extract and understand the context of the words in the utterances contained in the model's training set. We suggest combining two methods for classifying the expressed emotion: (1) the spectral and prosodic features can be used to determine the emotion concealed in the speaker's voice (as in every SER system), and (2) the context of the utterance can be used to improve the classification accuracy. By applying NLP models to understand the meaning of every utterance a person says, his/her mood and emotional state can be estimated (negative words will indicate a bad mood, while positive words will indicate high spirits). This combination can help in situations where the content of the utterance is negative but the way it is spoken sounds positive, and vice versa. Moreover, implementing an NLP model in the SER system's training phase may improve the SER system's robustness against poisoning attacks. As described in Sect. 6.2, a poisoning attack is one in which the attacker downloads a publicly available dataset used to train a SER model, replaces the correct labels with incorrect ones, and uploads the dataset to the Web as a new dataset; in this way, the attacker can label an utterance as "happy" when its emotion is actually "sad." By using an NLP model, the SER system can combine the acoustic emotion features with the context of the words in every utterance to correctly classify the emotions expressed by a person.
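One simple way to realize the proposed combination is late fusion of the two probability vectors, sketched below. Both the acoustic probabilities (from any SER model) and the text probabilities (from any NLP emotion classifier run on the transcript) are placeholders, and the weighted average is just one of several possible fusion strategies.

```python
# Late-fusion sketch: combine acoustic emotion probabilities with text-based
# emotion probabilities obtained from the transcript. Inputs are placeholders.
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]   # assumed label set

def fuse(acoustic_probs: np.ndarray, text_probs: np.ndarray, w_acoustic: float = 0.6) -> str:
    fused = w_acoustic * acoustic_probs + (1.0 - w_acoustic) * text_probs
    return EMOTIONS[int(fused.argmax())]

# Example: the prosody suggests "happiness" while the words suggest "sadness";
# the fused decision weighs both sources.
acoustic = np.array([0.05, 0.60, 0.25, 0.10])
text = np.array([0.05, 0.10, 0.75, 0.10])
print(fuse(acoustic, text))
```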

Implementing the abovementioned mechanisms (all of which are based on machine learning algorithms) in a SER system can enhance its security, but in the long term such machine learning mechanisms might suffer from limitations. For example, our voices change over time (as a result of aging and lifestyle). Moreover, with technological developments, the recording devices used to collect the data change frequently and become better equipped, with more functionality and voice filters. Therefore, every SER system is exposed to the concept drift phenomenon (Žliobaitė 2010 ). Concept drift occurs when the statistical characteristics of the target variable that the model is trying to predict (in the case of SER systems, the emotion concealed in the utterance) change over time; it can result in a decrease in the detection model's generalization capability as time passes. Therefore, the learning models need to be updated frequently, both the core models aimed at emotion recognition and the models used in machine learning-based security mechanisms. An active learning approach can efficiently address the update gap that currently exists in SER systems. Applying active learning in the development of the core SER model and in machine learning-based security mechanisms could reduce the effort and costs associated with the training phase: active learning reduces the number of samples required to train the model by selecting a small yet informative set of samples, and because each informative sample's true label must be determined (usually by a human expert) before it is added to the training set, reducing the number of samples also reduces the cost and time of the labeling procedure. In recent years, the use of active learning methods in the cyber-security domain has grown (Banse and Scherer 1996 ; Burkhardt and Sendlmeier 2000 ; Nissim et al. 2019 ; Nissim et al. 2017 ; Moskovitch et al. 2007 ), since it has been shown to enhance a detection model's capabilities over time and ensure that the model remains up to date. Therefore, we suggest using active learning methods to cope with the concept drift phenomenon and further improve defense mechanisms' detection capabilities. We also suggest considering relevant active learning methods presented in other domains, such as biomedical informatics (Moskovitch et al. 2010 ; Nissim et al. 2014 , 2015 ), for use in SER system security mechanisms.
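A minimal sketch of uncertainty-based sample selection, one common active learning strategy, is given below. The data are synthetic and the logistic regression model is a stand-in; the cited active learning frameworks use considerably more elaborate selection criteria.

```python
# Uncertainty-sampling sketch for active learning under concept drift: pick the
# unlabelled recordings the current model is least certain about and send only
# those to a human annotator before retraining. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(200, 40))
y_labelled = rng.integers(0, 4, size=200)                  # 4 emotion classes
X_pool = rng.normal(size=(5000, 40))                       # new, unlabelled recordings

model = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)

uncertainty = 1.0 - model.predict_proba(X_pool).max(axis=1)
query_idx = np.argsort(uncertainty)[-50:]                  # 50 most informative samples
# X_pool[query_idx] would be sent to an expert for labelling, then the model retrained.
```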

9 Discussion and conclusions

In the last 10 years, speech emotion recognition systems have been widely implemented in various domains, allowing people to interact with products and services that utilize SER systems and allowing companies to improve their services and interfaces. SER systems have been the subject of research for over four decades; most studies have aimed at improving the accuracy and capabilities of these systems, while the security aspects of SER systems have received little attention from researchers. This paper is the first to explore and analyze SER system security and contributes to the scientific community's understanding of this underexplored area.

We started by providing information about the main principles of the SER domain and an overview of the work that has been performed in this domain over the years. This ranged from basic knowledge regarding emotions and the definition of sound waves to the main methods used to represent sound waves (stored as audio files) on a computer: analog-to-digital conversion, implemented with various mechanisms, and time–frequency domain conversion, which results in a spectrogram. We also analyzed how different emotions are expressed in a speaker's voice.

To better understand the vulnerabilities of SER systems, we analyzed the entire SER ecosystem, describing the data flow between each component and player in the ecosystem. We identified 10 potential cyber-attacks targeting SER systems and the security mechanisms capable of preventing such attacks. Our analysis of the attacks shed light on the relevant attack vectors, possible attack scenarios, the phase in which each attack can be performed, the relevant domains, and the implications of each attack. This comprehensive analysis revealed major gaps in the existing protection against the attacks. We found that 30% of the attacks (including emotion removal, poisoning, and malicious SER model attacks) are not covered by any security mechanism, posing a very real threat to SER systems. We also found that voice impersonation replay attacks are the attacks best covered by the available defense mechanisms. From the security mechanism perspective, our analysis showed that the best mechanism available, human vs artificial sound differentiation, covers just 30% of the potential attacks, pointing to the need to develop improved security mechanisms.

The abovementioned insights raise questions about why three of the attacks we identified have no protection mechanism (i.e., no protection mechanism has been published so far). The first question relates to the effort required to address these attacks: one may claim that the unaddressed attacks are too difficult to address or, alternatively, that they are trivial to address using existing tools; however, the existing tools were not designed to specifically address these problems. Another possible reason why the three cyber-attacks have not been addressed is that they may not be considered important enough. From our perspective, the main reason these attacks have gone unaddressed stems from the limited use of SER systems in the past. Only in recent years have SER systems been more widely deployed and integrated in devices heavily used in modern life; as a result, many important aspects of these systems (such as their security and ethical concerns) have not been thoroughly examined.

Future research on the security of SER systems could explore the main security gap identified in this paper: the systems' vulnerability to emotion removal, malicious model, and poisoning attacks. We suggested a potential security mechanism for each of these attacks; for example, reconstruction of the audio signal could improve the robustness of SER systems against any type of modified input attack, while combining NLP algorithms with SER algorithms could create an improved SER model, both in terms of the model's classification accuracy and its ability to defend against poisoning and malicious SER model attacks. Investigating the use of these mechanisms in SER systems, which could dramatically improve the systems' robustness while preserving the privacy of every SER system's user, is a direction for future research.

https://www.verywellmind.com/the-purpose-of-emotions-2795181

https://www.apple.com/siri/

https://www.alexa.com/

https://www.microsoft.com/en-us/cortana

https://www.gartner.com/en

https://www.alliedmarketresearch.com/emotion-detection-and-recognition-market

https://gdpr.eu/

https://www.wsj.com/

www.nytimes.com

http://research.baidu.com/Blog/index-view?id=91

https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/?sh=441c54157559

https://www.sciencedaily.com/

https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0

https://librosa.org/doc/latest/index.html

http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

https://www.vocal.com/perceptual-filtering/perceptual-linear-prediction-cepstral-coefficients-in-speech/

https://smartlaboratory.org/ravdess/

https://predictionio.apache.org/

https://www.veracode.com/

Aloufi R, Haddadi H, Boyle D (2019) Emotionless: privacy-preserving speech analysis for voice assistants. arXiv preprint arXiv:1908.03632

Alshamsi H, Këpuska V, Alshamisi H (2018) Automated speech emotion recognition app development on smart phones using cloud computing. https://doi.org/10.9790/9622-0805027177

Badshah AM, Ahmad J, Rahim N, Baik SW (2017) Speech emotion recognition from spectrograms with deep convolutional neural network. 2017 international conference on platform technology and service, PlatCon 2017 - Proceedings, (July 2019). https://doi.org/10.1109/PlatCon.2017.7883728

Bahreini K, Nadolski R, Westera W (2015) Towards real-time speech emotion recognition for affective e-learning. Educ Inf Technol 1–20. https://doi.org/10.1007/s10639-015-9388-2

Bakir C, Yuzkat M (2018) Speech emotion classification and recognition with different methods for Turkish language. Balkan J Electr Comput Eng 6(2):54–60. https://doi.org/10.17694/bajece.419557


Banse R, Scherer KR (1996) Acoustic profiles in vocal emotion expression. J Pers Soc Psychol 70(3):614. https://doi.org/10.1037/0022-3514.70.3.614

Bashir S, Ali S, Ahmed S, Kakkar V (2016) "Analog-to-digital converters: a comparative study and performance analysis," 2016 international conference on computing, communication and automation (ICCCA), Noida, pp 999–1001

Batbaatar E, Li M, Ryu KH (2019) Semantic-emotion neural network for emotion recognition from text. IEEE Access 7:111866–111878. https://doi.org/10.1109/ACCESS.2019.2934529

Batliner A, Steidl S, Schuller B, Seppi D, Vogt T, Wagner J, ... Amir N (2011) Whodunnit–searching for the most important feature types signalling emotion-related user states in speech. Comput Speech Lang 25(1):4–28

Blanton S (1915) The voice and the emotions. Q J Speech 1(2):154–172. https://doi.org/10.1145/3129340

Blue L, Vargas L, Traynor P (2018) Hello, is it me you're looking for? differentiating between human and electronic speakers for voice interface security. In Proceedings of the 11th ACM conference on security & privacy in wireless and mobile networks. pp 123–133. https://doi.org/10.1145/3212480.3212505

Burkhardt F, Sendlmeier WF (2000) Verification of acoustical correlates of emotional speech using formant-synthesis. In: ISCA Tutorial and Research Workshop (ITRW) on speech and emotion

Chen Y-T, Yeh J-H, Pao T-L (2011) Emotion recognition on mandarin speech: a comparative study and performance evaluation. VDM Verlag, Saarbrücken, DEU


Cooley JW, Tukey JW (1965) An algorithm for the machine calculation of complex Fourier series. Math Comput 19(90):297–301


Dellaert F, Polzin T, Waibel A (1996) Recognizing emotion in speech. In Proceedings of ICSLP 3, (Philadelphia, PA, 1996). IEEE, pp 1970–1973. https://doi.org/10.1109/ICSLP.1996.608022

Dzedzickis A, Kaklauskas A, Bucinskas V (2020) Human emotion recognition: review of sensors and methods. Sensors (Switzerland) 20(3):1–41. https://doi.org/10.3390/s20030592

El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit 44(3):572–587

Eliash C, Lazar I, Nissim N (2020) SEC-CU: the security of intensive care unit medical devices and their ecosystems. IEEE Access 8:64193–64224. https://doi.org/10.1109/ACCESS.2020.2984726

Farhi N, Nissim N, Elovici Y (2019) Malboard: a novel user keystroke impersonation attack and trusted detection framework based on side-channel analysis. Comput Secur 85:240–269. https://doi.org/10.1016/j.cose.2019.05.008

Feidakis M, Daradoumis T, Caballe S (2011a) "Emotion measurement in intelligent tutoring systems: what, when and how to measure," 2011 third international conference on intelligent networking and collaborative systems. pp 807-812. https://doi.org/10.1109/INCoS.2011.82

Feidakis M, Daradoumis T, Caballé S (2011b) Endowing e-learning systems with emotion awareness. In 2011 third international conference on intelligent networking and collaborative systems. IEEE, pp 68–75. https://doi.org/10.1109/INCoS.2011.83

Garcia-Garcia JM, Penichet VM, Lozano MD (2017) Emotion detection: a technology review. 1–8. https://doi.org/10.1145/3123818.3123852

Gong Y, Yang J, Huber J, MacKnight M, Poellabauer C (2019) ReMASC: realistic replay attack corpus for voice controlled systems. https://doi.org/10.21437/Interspeech.2019-1541 . arXiv preprint arXiv:1904.03365

Gui Q, Yang W, Jin Z, Ruiz-Blondet MV, Laszlo S (2016) A residual feature-based replay attack detection approach for brainprint biometric systems. In 2016 IEEE international workshop on information forensics and security (WIFS). IEEE, pp 1–6. https://doi.org/10.1109/WIFS.2016.7823907

Hajarolasvadi N, Demirel H (2019) 3D CNN-based speech emotion recognition using k-means clustering and spectrograms. Entropy 21(5). https://doi.org/10.3390/e21050479

Harrington DA (1951) An experimental study of the subjective and objective characteristics of sustained vowels at high pitches

Huzaifah M (2017) Comparison of time-frequency representations for environmental sound classification using convolutional neural networks. https://doi.org/10.48550/arXiv.1706.07156 . arXiv preprint arXiv:1706.07156

Iliou T, Anagnostopoulos CN (2009) Statistical evaluation of speech features for emotion recognition. In Fourth international conference on digital telecommunications, Colmar, France, pp 121–126. https://doi.org/10.1109/ICDT.2009.30

Jati A, Hsu CC, Pal M, Peri R, AbdAlmageed W, Narayanan S (2020) Adversarial attack and defense strategies for deep speaker recognition systems. https://doi.org/10.1016/j.csl.2021.101199 . arXiv preprint arXiv:2008.07685

Joshi DD, Zalte MB (2013) Speech emotion recognition: a review. IOSR J Electron Commun Eng (IOSR-JECE) 4(4):34–37

Kao YH, Lee LS (2006) Feature analysis for emotion recognition from Mandarin speech considering the special characteristics of Chinese language. In INTERSPEECH—ICSLP, Pittsburgh, Pennsylvania, pp 1814–1817. https://doi.org/10.21437/Interspeech.2006-501

Kate HK, Razmara J, Isazadeh A (2018) A novel fast and secure approach for voice encryption based on DNA computing. 3D Res 9(2):1–11. https://doi.org/10.1007/s13319-018-0167-x

Khanjani Z, Watson G, Janeja VP (2021) How deep are the fakes? Focusing on audio deepfake: a survey. arXiv preprint arXiv:2111.14203

Kintzlinger M, Nissim N (2019) Keep an eye on your personal belongings! The security of personal medical devices and their ecosystems. J Biomed Inform 95:103233. https://doi.org/10.1016/j.jbi.2019.103233

Kryzhanovsky B, Dunin-Barkowski W, Redko V (2018) Advances in neural computation, machine learning, and cognitive research: Selected papers from the XIX international conference on neuroinformatics, october 2–6, 2017, Moscow, Russia. Studies Comput Intell 736(October 2017):iii–iv. https://doi.org/10.1007/978-3-319-66604-4

Kuleshov V, Enam SZ, Ermon S (2017) Audio super-resolution using neural nets. In ICLR (Workshop Track). https://doi.org/10.48550/arXiv.1708.00853

Landau O, Puzis R, Nissim N (2020) Mind your mind: EEG-based brain-computer interfaces and their security in cyber space. ACM Comput Surv (CSUR) 53(1):1–38. https://doi.org/10.1145/3372043

Latif S, Rana R, Qadir J (2018) Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness. arXiv preprint arXiv:1811.11402

Lech M, Stolar M, Bolia R, Skinner M (2018) Amplitude-frequency analysis of emotional speech using transfer learning and classification of spectrogram images. Adv Sci Technol Eng Syst 3(4):363–371. https://doi.org/10.25046/aj030437

Lee T, Edwards B, Molloy I, Su D (2018) Defending against machine learning model stealing attacks using deceptive perturbations. https://doi.org/10.48550/arXiv.1806.00054 . arXiv preprint arXiv:1806.00054

Li Q, Juang BH, Lee CH (2000) Automatic verbal information verification for user authentication. IEEE Trans Speech Audio Process 8(5):585–596. https://doi.org/10.1109/89.861378

Lim W, Jang D, Lee T (2017) Speech emotion recognition using convolutional and recurrent neural networks. 2016 Asia-pacific signal and information processing association annual summit and conference, APSIPA 2016. pp 1–4 https://doi.org/10.1109/APSIPA.2016.7820699

Liu Y, Ma S, Aafer Y, Lee WC, Zhai J, Wang W, Zhang X (2018) Trojaning attack on neural networks. In: 25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet Soc

Lotfian R, Busso C (2015) Emotion recognition using synthetic speech as neutral reference. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4759–4763. https://doi.org/10.1109/ICASSP.2015.7178874

Luengo I, Navas E, Hernez I, Snchez J (2005) Automatic emotion recognition using prosodic parameters. In INTERSPEECH, Lisbon, Portugal, pp 493–496). https://doi.org/10.21437/Interspeech.2005-324

McCormick M (2008) Data theft: a prototypical insider threat. In Insider attack and cyber security: beyond the hacker. Springer US, Boston MA, pp 53–68 https://doi.org/10.1007/978-0-387-77322-3_4

McGilloway S, Cowie R, Douglas-Cowie E, Gielen S, Westerdijk M, Stroeve S (2000) Approaching automatic recognition of emotion from voice: A rough benchmark. In: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion

Mdhaffar S, Bonastre JF, Tommasi M, Tomashenko N, Estève Y (2021) Retrieving speaker information from personalized acoustic models for speech recognition. https://doi.org/10.48550/arXiv.2111.04194 . arXiv preprint arXiv:2111.04194

Moskovitch R, Nissim N, Elovici Y (2007) Malicious code detection and acquisition using active learning. In: ISI 2007, IEEE Intelligence and Security Informatics, p 372. https://doi.org/10.1109/ISI.2007.379505

Moskovitch R, Nissim N, Elovici Y (2010) Acquisition of malicious code using active learning. https://www.researchgate.net/publication/228953558

Neiberg D, Elenius K, Laskowski K (2006) Emotion recognition in spontaneous speech using GMMs. In Interspeech—ICSLP. Pittsburgh, Pennsylvania, pp 809–812. https://doi.org/10.21437/Interspeech.2006-277

Nissim N et al (2015) An active learning framework for efficient condition severity classification. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 9105:13–24. https://doi.org/10.1007/978-3-319-19551-3_3

Nissim N et al (2019) Sec-lib: protecting scholarly digital libraries from infected papers using active machine learning framework. IEEE Access 7:110050–110073. https://doi.org/10.1109/ACCESS.2019.2933197

Nissim N, Cohen A, Elovici Y (2017) ALDOCX: detection of unknown malicious Microsoft office documents using designated active learning methods based on new structural feature extraction methodology. IEEE Trans Inf Forensics Secur 12(3):631–646. https://doi.org/10.1109/TIFS.2016.2631905

Nissim N, Moskovitch R, Rokach L, Elovici Y (2014) Novel active learning methods for enhanced PC malware detection in windows OS. Expert Syst Appl 41(13):5843–5857

Oh SJ, Schiele B, Fritz M (2019) Towards reverse-engineering black-box neural networks. In explainable AI: interpreting, explaining and visualizing deep learning. Springer, Cham, pp 121–144. https://doi.org/10.1007/978-3-030-28954-6_7

Orr MJ (1996) Introduction to radial basis function networks

Pao TL, Chen YT, Yeh JH, Liao WY (2005) Combining acoustic features for improved emotion recognition in Mandarin speech. In Tao J, Tan T, Picard R (Eds.), LNCS. ACII, Berlin, Heidelberg (pp. 279–285), Berlin: Springer. https://doi.org/10.1007/11573548_36

Pierre-Yves O (2003) The production and recognition of emotions in speech: features and algorithms. Int J Hum Comput Stud 59(1–2):157–183. https://doi.org/10.1016/S1071-5819(02)00141-6

Polzin TS, Waibel A (2000) Emotion-sensitive human-computer interfaces. In: ISCA tutorial and research workshop (ITRW) on speech and emotion

Rao KS, Yegnanarayana B (2006) Prosody modification using instants of significant excitation. IEEE Trans Audio Speech Lang Process 14(3):972–980. https://doi.org/10.1109/TSA.2005.858051

Rázuri JG, Sundgren D, Rahmani R, Moran A, Bonet I, Larsson A (2015) Speech emotion recognition in emotional feedback for human-robot interaction. Int J Adv Res Artif Intell (IJARAI) 4(2):20–27. https://doi.org/10.14569/IJARAI.2015.040204

Satt A, Rozenberg S, Hoory R (2017) Efficient emotion recognition from speech using deep learning on spectrograms. Proceedings of the Annual conference of the international speech communication association, interspeech, pp 1089–1093. https://doi.org/10.21437/Interspeech.2017-200

Schuller BW (2018) Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends. Commun ACM 61(5):90–99

Sini J, Marceddu AC, Violante M (2020) Automatic emotion recognition for the calibration of autonomous driving functions. Electronics 9(3):518. https://doi.org/10.3390/electronics9030518

Slaney M, McRoberts G (1998) Baby ears: a recognition system for affective vocalization. In: proceedings of ICASSP 1998. https://doi.org/10.1109/ICASSP.1998.675432

Song L, Mittal P (2017) POSTER: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. pp 2583–2585. https://doi.org/10.1145/3133956.3138836

Stevens SS, Volkmann J, Newman EB (1937) A scale for the measurement of the psychological magnitude pitch. J Acoust Soc Am 8.3:185–190. https://doi.org/10.1121/1.1915893

Swain M, Routray A, Kabisatpathy P (2018) Databases, features and classifiers for speech emotion recognition: a review. Int J Speech Technol 21:93–120. https://doi.org/10.1007/s10772-018-9491-z

Tramèr F, Zhang F, Juels A, Reiter MK, Ristenpart T (2016) Stealing machine learning models via prediction {APIs}. In: 25th USENIX security symposium (USENIX Security 16), pp 601–618

Ullah F, Edwards M, Ramdhany R, Chitchyan R, Babar MA, Rashid A (2018) Data exfiltration: a review of external attack vectors and countermeasures. J Netw Comput Appl 101:18–54. https://doi.org/10.1016/j.jnca.2017.10.016

Utane AS, Nalbalwar SL (2013) Emotion recognition through Speech. Int J Appl Inf Syst (IJAIS) 5–8

Wang C, Wang D, Abbas J, Duan K, Mubeen R (2021) Global financial crisis, smart lockdown strategies, and the COVID-19 spillover impacts: A global perspective implications from Southeast Asia. Front Psychiatry 12:643783

Whiteside SP (1998) Simulated emotions: an acoustic study of voice and perturbation measures. In: Fifth International Conference on Spoken Language Processing

Williamson JD (1978) U.S. Patent No. 4,093,821. Washington, DC: U.S. Patent and Trademark Office.

Wysopal C, Eng C, Shields T (2010) Static detection of application backdoors. Datenschutz Und Datensicherheit-DuD 34(3):149–155. https://doi.org/10.1007/s11623-010-0024-4

Yan C, Ji X, Wang K, Jiang Q, Jin Z, Xu W (2022) A survey on voice assistant security: attacks and countermeasures. ACM Comput Surv (CSUR). https://doi.org/10.1145/3527153

Yao Z, Wang Z, Liu W, Liu Y, Pan J (2020) Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun. https://doi.org/10.1016/j.specom.2020.03.005

Zhang S (2008) Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In Sun, et. al. (Eds.), Lecture notes in computer science. Advances in neural networks (pp. 457–464). Berlin: Springer. https://doi.org/10.1007/978-3-540-87734-9_52

Zhang G, Yan C, Ji X, Zhang T, Zhang T, Xu W (2017) Dolphinattack: inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. pp 103–117. https://doi.org/10.1145/3133956.3134052

Zhu A, Luo Q (2007) Study on speech emotion recognition system in E-learning. In J. Jacko (Ed.), LNCS. Human computer interaction, Part III, HCII (pp. 544–552). Berlin: Springer

Žliobaitė I (2010) Learning under concept drift: an overview. arXiv preprint arXiv:1010.4784

Download references

Author information

Authors and affiliations.

Malware Lab, Cyber Security Research Center, Ben-Gurion University of the Negev, Beer-Sheva, Israel

Itzik Gurowiec & Nir Nissim

Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Beer-Sheva, Israel


Contributions

Nir Nissim – Conceptualization, Investigation, Funding acquisition, Methodology, Supervision.

Nir Nissim and Itzik Gurowiec – Formal analysis, Resources, Visualization, Writing—original draft, Writing—review & editing.

Corresponding author

Correspondence to Nir Nissim .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Gurowiec, I., Nissim, N. Speech emotion recognition systems and their security aspects. Artif Intell Rev 57 , 148 (2024). https://doi.org/10.1007/s10462-024-10760-z

Download citation

Accepted : 07 April 2024

Published : 21 May 2024

DOI : https://doi.org/10.1007/s10462-024-10760-z


  • Recognition
  • Cyber-attack
  • Case Report
  • Open access
  • Published: 14 May 2024

Motor polyradiculoneuropathy as an unusual presentation of neurobrucellosis: a case report and literature review

  • Ahmad Alikhani 1 ,
  • Noushin Ahmadi 1 ,
  • Mehran Frouzanian 2 &
  • Amirsaleh Abdollahi 2  

BMC Infectious Diseases volume  24 , Article number:  491 ( 2024 ) Cite this article

161 Accesses

Metrics details

Brucellosis, a zoonotic disease caused by Brucella species, poses a significant global health concern. Among its diverse clinical manifestations, neurobrucellosis remains an infrequent yet debilitating complication. Here, we present a rare case of neurobrucellosis with unusual presentations in a 45-year-old woman. The patient's clinical course included progressive lower extremity weakness, muscle wasting, and double vision, prompting a comprehensive diagnostic evaluation. Notable findings included polyneuropathy, elevated brucella agglutination titers in both cerebrospinal fluid and blood, abnormal EMG-NCV tests, and resolving symptoms with antibiotic therapy. The clinical presentation, diagnostic challenges, and differentiation from other neurological conditions are discussed. This case underscores the importance of considering neurobrucellosis in regions where brucellosis is prevalent and highlights the distinctive clinical and radiological features of this rare neurological complication. Early recognition and appropriate treatment are crucial to mitigate the significant morbidity associated with neurobrucellosis.

Peer Review reports

Introduction

Brucellosis, caused by Brucella species, is an infectious ailment recognized by various names such as remitting, undulant, Mediterranean, Maltese, Crimean, and goat fever. Humans contract it through the consumption of unpasteurized milk and dairy products, undercooked meat, or skin contact with infected livestock [ 1 , 2 , 3 ]. Various Brucella species, including Brucella melitensis (primarily sourced from sheep and goats), Brucella abortus (found in cattle), Brucella suis (associated with pigs/hogs), and Brucella canis (linked to dogs), can lead to illness in humans [ 3 , 4 , 5 ]. While brucellosis in humans is rarely fatal, it can lead to disability [ 6 ]. Brucellosis ranks among the most prevalent zoonotic diseases, impacting approximately 500,000 individuals yearly [ 7 ]. The combined estimate for the prevalence of brucellosis was 15.53% [ 8 ].

Neurobrucellosis, a rare complication of systemic brucellosis, can occur in adult and pediatric cases [ 9 ] and can manifest at any stage of the disease. It can present with various clinical manifestations such as meningitis, encephalitis, meningoencephalitis, myelitis, radiculopathy, polyneuropathy, stroke, cerebral venous thrombosis, and occasionally psychiatric symptoms [ 10 , 11 ]. Although the mortality rate is low, patients often experience persistent neurological issues following neurobrucellosis [ 12 ]. Studies suggest that around 20% of neurobrucellosis cases result in lasting neurological problems [ 13 ]. It is uncommonly considered in cases of meningoencephalitis or polyneuropathy, making it crucial for clinicians to have a high suspicion of it in patients displaying such symptoms, especially in endemic regions, to prevent severe clinical outcomes. In this study, we present a rare case of neurobrucellosis with unusual clinical presentations in a patient admitted to our center.

Case presentation

A 45-year-old female patient, with no prior medical history, presented to our center after enduring distal pain and weakness in her lower extremities for approximately 10 months. Over this period, the muscle weakness progressed, affecting the proximal muscles of the upper and lower limbs and leading to a substantial weight loss of 25–30 kg despite a maintained appetite. Initially dismissive of the limb weakness and pain, the patient sought medical attention six months after symptom onset due to the worsening symptoms and gait impairment. Over the subsequent four months, she underwent multiple medical evaluations and tests, including a lumbar X-ray. Following these initial investigations, and because of low serum vitamin D levels, vitamin D and calcium supplements were prescribed, and a lumbar MRI was requested for further evaluation. (Table  1 )

Upon referral to an infectious disease specialist, the patient’s history of local dairy consumption and positive serologic test for brucellosis prompted treatment with rifampin and doxycycline. However, the patient’s condition deteriorated significantly five days after starting this treatment. She experienced severe gait disorder, lower extremity weakness, diplopia, and blurred vision that had gradually worsened over two weeks. Subsequently, she presented to our center for further assessment.

Upon admission, the patient was unable to stand even with assistance and exhibited diplopia. Cranial nerve examination revealed no abnormalities, except for the II, III, and IV cranial nerves, which could not be thoroughly examined due to the presence of diplopia. The patient tested negative for Kernig and Brudzinski signs. There were no palpable supraclavicular or inguinal lymph nodes. Physical examinations of the breast, axilla, lungs, heart, and abdomen were unremarkable. Muscle strength was reduced in the lower extremities, and deep tendon reflexes of the knee and Achilles were absent. The plantar reflex was non-responsive, and certain reflexes, including biceps, triceps, and brachioradialis, were absent despite normal movement of the upper extremities. Anorectal muscle tone and anal reflex were normal.

Further investigations included normal urinalysis and abdominal and pelvic ultrasound. Chest X-ray and brain CT were also ordered. Due to the patient’s refusal of lumbar puncture, a suspicion of neurobrucellosis led to the initiation of a three-drug regimen (Table  2 ); ceftriaxone 2 g IV twice daily, rifampin 600 mg PO daily, and doxycycline 100 mg PO twice daily. The ophthalmology consultation did not reveal any ocular pathology, and the neurologist ordered brain MRI and EMG-NCV tests. The patient’s brain MRI was unremarkable, but EMG-NCV showed sensory and motor polyneuropathy. Consequently, intravenous immunoglobulin (IVIG) therapy was initiated at a daily dose of 25 g. After five days, the patient consented to lumbar puncture, confirming the diagnosis of brucellosis. Co-trimoxazole 960 mg PO three times daily was added to her treatment regimen, and IVIG therapy continued for seven days. Following a 3-day course of IVIG treatment, the neuropathy symptoms showed significant improvement. By the seventh day, there was a notable enhancement in limb strength, particularly in the upper limbs, reaching a 2-point improvement. After undergoing three weeks of intravenous therapy, the patient transitioned to oral medication. Despite disagreement regarding the necessity of a second CSF examination, the patient was discharged with a prescription for doxycycline, rifampin, and cotrimoxazole. Upon discharge, the patient could walk with the aid of a walker. However, within a month, a slight limp persisted, and by the third-month post-discharge, all symptoms had resolved completely.

Discussion

Brucellosis is widely spread globally, with more than half a million reported human cases annually [ 14 , 15 ]. Countries like Kenya, Yemen, Syria, Greece, and Eritrea have experienced high rates of brucellosis. The situation has shown signs of improvement in many epidemic regions; however, new areas with high occurrences of this disease continue to emerge, particularly in Africa and the Middle East, where the incidence of the disease varies [ 16 ]. Brucellosis is linked to various neurological complications collectively known as neurobrucellosis, an uncommon condition of which only a few cases have been reported globally [ 17 , 18 , 19 , 20 , 21 ]. Our patient exhibited muscle weakness, polyneuropathy, and an inability to walk, findings that many physicians would not regard as indicative of a Brucella infection. While the diagnosis of neurobrucellosis can typically be confirmed through classical clinical signs, radiological examinations, and serological tests, patients might not always display typical symptoms, as observed in our case. Hence, in regions where the disease is prevalent, clinicians should maintain a high level of suspicion if patients do not show improvement with standard treatment. Additionally, the lack of awareness among healthcare professionals and limited access to advanced laboratory facilities can lead to misdiagnosis.

The most frequent manifestations of neurobrucellosis are meningitis and meningoencephalitis, typically beginning with a sudden headache, vomiting, and altered mental state, which can progress to unconsciousness with or without seizures [22]. Brucellosis can also cause other central nervous system complications, including inflammation of cerebral blood vessels, abscesses in the brain or epidural space, stroke, and cerebellar ataxia, while peripheral nervous system involvement may include nerve damage or radiculopathy, Guillain-Barré syndrome, and a poliomyelitis-like syndrome [13]. Our patient, however, showed no evidence of seizures, brain hemorrhage, stroke, or focal neurological deficits; her symptoms were instead consistent with radiculopathy and muscular weakness.

The peripheral nervous system is affected in only about 7% of neurobrucellosis cases, and our patient falls within this rare category. Previous case studies have described polyradiculoneuropathies in acute, subacute, or chronic forms [23]; our patient's condition is consistent with a chronic motor polyradiculopathy. Some of these cases show sensory deficits or resemble Guillain-Barré syndrome [23, 24]. Abuzinadah and colleagues described a comparable case of subacute motor polyradiculopathy in which the patient developed gradual bilateral lower-limb weakness over three weeks and lost mobility within seven weeks; Brucella was isolated from the cerebrospinal fluid after a two-week incubation period, and high antibody titers were detected in the serum [23]. In another report, Alanazi and colleagues described a 56-year-old man initially diagnosed with Guillain-Barré syndrome whose symptoms worsened despite appropriate treatment; after plasma exchange and antibiotics his condition improved temporarily, only to relapse, raising suspicion of chronic inflammatory demyelinating polyneuropathy, and IVIG treatment produced substantial improvement. On further investigation he was diagnosed with brucellosis [24]. These cases highlight the importance of recognizing Guillain-Barré-like presentations in regions where brucellosis is prevalent and of considering brucellosis in the differential diagnosis.

Although there are no established criteria for diagnosing neurobrucellosis [25], several diagnostic approaches have been proposed: symptoms compatible with neurobrucellosis; isolation of Brucella from cerebrospinal fluid (CSF) or a positive Brucella agglutination titer in CSF; CSF lymphocytosis with elevated protein and decreased glucose; or characteristic findings on cranial imaging by magnetic resonance imaging or computed tomography (MRI or CT) [13, 26, 27, 28]. Neurobrucellosis does not have a distinct clinical profile or specific CSF characteristics. Imaging findings fall into four categories: normal; inflammatory (granulomas and enhancement of the meninges, perivascular spaces, or lumbar nerve roots); white matter changes; and vascular changes [29]. We suspected neurobrucellosis on the basis of the patient's clinical symptoms, geographic exposure, high Brucella agglutination titers in both cerebrospinal fluid and blood, symptom resolution following treatment, and the exclusion of other common causes.

In Iran, a differential diagnosis often confused with brucellosis is tuberculosis, as both chronic granulomatous infectious diseases are prevalent in the region [30, 31]. Neurobrucellosis and tuberculosis show considerable overlap in clinical symptoms, laboratory results, and neuroimaging findings [33]; however, deep grey matter involvement and widespread white matter lesions resembling demyelinating disorders appear to be distinctive of brucellosis [32]. It is therefore crucial to exclude tuberculosis thoroughly in any suspected or confirmed case of brucellosis before starting antibiotic treatment.

Because brucellosis is difficult to treat and relapses are common, an extended course of therapy is essential [27], using a combination of antibiotics that penetrate cells readily and reach the central nervous system effectively [27, 34]. Neurobrucellosis is treated with 3 to 6 months of combination therapy comprising doxycycline, rifampicin, and ceftriaxone or trimethoprim-sulfamethoxazole [35], similar to the regimen administered to our patient. For patients allergic to cephalosporins, quinolones are recommended and are considered effective against brucellosis [36, 37]. In complicated situations such as meningitis or endocarditis, streptomycin or gentamicin is added to this regimen for the first 14 days of treatment. With timely and appropriate treatment the prognosis is good, with a fatality rate below 1% even in such complex cases [17, 38]. Our patient had a highly favorable outcome: she initially relied on a walker, a slight limp persisted for about a month, and by the third month after discharge all symptoms had disappeared completely.

The present case underscores the importance of considering neurobrucellosis as a potential diagnosis when evaluating muscle weakness and radiculopathy, especially in regions where the disease is prevalent. A comprehensive patient history, careful clinical examination, positive serology in blood or cerebrospinal fluid, imaging results, and cerebrospinal fluid analysis can all contribute to establishing a conclusive diagnosis.

Data availability

The datasets generated and/or analysed during the current study are not publicly available owing to privacy concerns but are available from the corresponding author on reasonable request.

References

1. Galińska EM, Zagórski J. Brucellosis in humans–etiology, diagnostics, clinical forms. Ann Agric Environ Med. 2013;20(2):233–8.
2. Głowacka P, Żakowska D, Naylor K, Niemcewicz M, Bielawska-Drózd A. Brucella - virulence factors, pathogenesis and treatment. Pol J Microbiol. 2018;67(2):151–61.
3. Khurana SK, Sehrawat A, Tiwari R, Prasad M, Gulati B, Shabbir MZ, et al. Bovine brucellosis - a comprehensive review. Vet Q. 2021;41(1):61–88.
4. Yagupsky P, Morata P, Colmenero JD. Laboratory diagnosis of human brucellosis. Clin Microbiol Rev. 2019;33(1).
5. Kurmanov B, Zincke D, Su W, Hadfield TL, Aikimbayev A, Karibayev T, et al. Assays for identification and differentiation of Brucella species: a review. Microorganisms. 2022;10(8).
6. Franco MP, Mulder M, Gilman RH, Smits HL. Human brucellosis. Lancet Infect Dis. 2007;7(12):775–86.
7. Mantur BG, Amarnath SK, Shinde RS. Review of clinical and laboratory features of human brucellosis. Indian J Med Microbiol. 2007;25(3):188–202.
8. Khoshnood S, Pakzad R, Koupaei M, Shirani M, Araghi A, Irani GM, et al. Prevalence, diagnosis, and manifestations of brucellosis: a systematic review and meta-analysis. Front Vet Sci. 2022;9:976215.
9. Dhar D, Jaipuriar RS, Mondal MS, Shunmugakani SP, Nagarathna S, Kumari P, et al. Pediatric neurobrucellosis: a systematic review with case report. J Trop Pediatr. 2022;69(1).
10. Mahajan SK, Sharma A, Kaushik M, Raina R, Sharma S, Banyal V. Neurobrucellosis: an often forgotten cause of chronic meningitis. Trop Doct. 2016;46(1):54–6.
11. Dreshaj S, Shala N, Dreshaj G, Ramadani N, Ponosheci A. Clinical manifestations in 82 neurobrucellosis patients from Kosovo. Mater Sociomed. 2016;28(6):408–11.
12. Gul HC, Erdem H, Bek S. Overview of neurobrucellosis: a pooled analysis of 187 cases. Int J Infect Dis. 2009;13(6):e339–43.
13. Guven T, Ugurlu K, Ergonul O, Celikbas AK, Gok SE, Comoglu S, et al. Neurobrucellosis: clinical and diagnostic features. Clin Infect Dis. 2013;56(10):1407–12.
14. Alkahtani AM, Assiry MM, Chandramoorthy HC, Al-Hakami AM, Hamid ME. Sero-prevalence and risk factors of brucellosis among suspected febrile patients attending a referral hospital in southern Saudi Arabia (2014–2018). BMC Infect Dis. 2020;20(1):26.
15. Pappas G, Papadimitriou P, Akritidis N, Christou L, Tsianos EV. The new global map of human brucellosis. Lancet Infect Dis. 2006;6(2):91–9.
16. Liu Z, Gao L, Wang M, Yuan M, Li Z. Long ignored but making a comeback: a worldwide epidemiological evolution of human brucellosis. Emerg Microbes Infect. 2024;13(1):2290839.
17. Naderi H, Sheybani F, Parsa A, Haddad M, Khoroushi F. Neurobrucellosis: report of 54 cases. Trop Med Health. 2022;50(1):77.
18. Farhan N, Khan EA, Ahmad A, Ahmed KS. Neurobrucellosis: a report of two cases. J Pak Med Assoc. 2017;67(11):1762–3.
19. Karsen H, Tekin Koruk S, Duygu F, Yapici K, Kati M. Review of 17 cases of neurobrucellosis: clinical manifestations, diagnosis, and management. Arch Iran Med. 2012;15(8):491–4.
20. Türel O, Sanli K, Hatipoğlu N, Aydoğmuş C, Hatipoğlu H, Siraneci R. Acute meningoencephalitis due to Brucella: case report and review of neurobrucellosis in children. Turk J Pediatr. 2010;52(4):426–9.
21. Guney F, Gumus H, Ogmegul A, Kandemir B, Emlik D, Arslan U, et al. First case report of neurobrucellosis associated with hydrocephalus. Clin Neurol Neurosurg. 2008;110(7):739–42.
22. Corbel MJ. Brucellosis: an overview. Emerg Infect Dis. 1997;3(2):213–21.
23. Abuzinadah AR, Milyani HA, Alshareef A, Bamaga AK, Alshehri A, Kurdi ME. Brucellosis causing subacute motor polyradiculopathy and the pathological correlation of pseudomyopathic electromyography: a case report. Clin Neurophysiol Pract. 2020;5:130–4.
24. Alanazi A, Al Najjar S, Madkhali J, Al Malik Y, Al-Khalaf A, Alharbi A. Acute brucellosis with a Guillain-Barré syndrome-like presentation: a case report and literature review. Infect Dis Rep. 2021;13(1):1–10.
25. Raina S, Sharma A, Sharma R, Bhardwaj A. Neurobrucellosis: a case report from Himachal Pradesh, India, and review of the literature. Case Rep Infect Dis. 2016;2016:2019535.
26. McLean DR, Russell N, Khan MY. Neurobrucellosis: clinical and therapeutic features. Clin Infect Dis. 1992;15(4):582–90.
27. Bouferraa Y, Bou Zerdan M, Hamouche R, Azar E, Afif C, Jabbour R. Neurobrucellosis: brief review. Neurologist. 2021;26(6):248–52.
28. Aygen B, Doğanay M, Sümerkan B, Yildiz O, Kayabaş Ü. Clinical manifestations, complications and treatment of brucellosis: a retrospective evaluation of 480 patients. Méd Mal Infect. 2002;32(9):485–93.
29. Kizilkilic O, Calli C. Neurobrucellosis. Neuroimaging Clin N Am. 2011;21(4):927–37, ix.
30. Chalabiani S, Khodadad Nazari M, Razavi Davoodi N, Shabani M, Mardani M, Sarafnejad A, et al. The prevalence of brucellosis in different provinces of Iran during 2013–2015. Iran J Public Health. 2019;48(1):132–8.
31. Doosti A, Nasehi M, Moradi G, Roshani D, Sharafi S, Ghaderi E. The pattern of tuberculosis in Iran: a national cross-sectional study. Iran J Public Health. 2023;52(1):193–200.
32. Rajan R, Khurana D, Kesav P. Deep gray matter involvement in neurobrucellosis. Neurology. 2013;80(3):e28–9.
33. Dasari S, Naha K, Prabhu M. Brucellosis and tuberculosis: clinical overlap and pitfalls. Asian Pac J Trop Med. 2013;6(10):823–5.
34. Ko J, Splitter GA. Molecular host-pathogen interaction in brucellosis: current understanding and future approaches to vaccine development for mice and humans. Clin Microbiol Rev. 2003;16(1):65–78.
35. Zhao S, Cheng Y, Liao Y, Zhang Z, Yin X, Shi S. Treatment efficacy and risk factors of neurobrucellosis. Med Sci Monit. 2016;22:1005–12.
36. Hasanain A, Mahdy R, Mohamed A, Ali M. A randomized, comparative study of dual therapy (doxycycline-rifampin) versus triple therapy (doxycycline-rifampin-levofloxacin) for treating acute/subacute brucellosis. Braz J Infect Dis. 2016;20(3):250–4.
37. Falagas ME, Bliziotis IA. Quinolones for treatment of human brucellosis: critical review of the evidence from microbiological and clinical studies. Antimicrob Agents Chemother. 2006;50(1):22–33.
38. Budnik I, Fuchs I, Shelef I, Krymko H, Greenberg D. Unusual presentations of pediatric neurobrucellosis. Am J Trop Med Hyg. 2012;86(2):258–60.


Funding

This research did not receive any funding or financial support.

Author information

Authors and affiliations

Infectious Diseases Department and Antimicrobial Resistance Research Center and Transmissible Diseases Institute, Mazandaran University of Medical Sciences, Sari, Iran

Ahmad Alikhani & Noushin Ahmadi

Student Research Committee, School of Medicine, Mazandaran University of Medical Sciences, Sari, Iran

Mehran Frouzanian & Amirsaleh Abdollahi


Contributions

A.A. oversaw and treated the case and contributed to the entire revision process. N.A. contributed to the composition of the article. M.F. authored the discussion section and contributed to the complete revision. AS.A. helped draft the case report discussion and participated in the entire revision process.

Corresponding author

Correspondence to Amirsaleh Abdollahi .

Ethics declarations

Ethics approval and consent to participate

In adherence to ethical standards, approval was obtained from the relevant ethics committee and informed consent was secured from the patient.

Consent for publication

Informed consent was obtained from the patient for both study participation and publication of identifying information/images in an online open-access publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Alikhani, A., Ahmadi, N., Frouzanian, M. et al. Motor polyradiculoneuropathy as an unusual presentation of neurobrucellosis: a case report and literature review. BMC Infect Dis 24 , 491 (2024). https://doi.org/10.1186/s12879-024-09365-2


Received : 04 December 2023

Accepted : 29 April 2024

Published : 14 May 2024

DOI : https://doi.org/10.1186/s12879-024-09365-2


Keywords

  • Neurobrucellosis
  • EMG-NCV tests
  • Polyradiculoneuropathy
  • Antibiotic therapy
  • Intravenous immunoglobulin therapy
  • Zoonotic disease
  • Gait disorder
  • Lower extremity weakness
  • Blurred vision

BMC Infectious Diseases

ISSN: 1471-2334
