Monday, September 17, 2012

Paper Blogs 02: Towards Visual and Vocal Mimicry Recognition in Human-Human Interactions



Reference Paper
Towards Visual and Vocal Mimicry Recognition in Human-Human Interactions
Xiaofan Sun, Khiet P. Truong, Maja Pantic, Anton Nijholt
(Digital Object Identifier 10.1109/ICSMC.2011.6083693)

Overview of the Paper

Mimicry occurs in face-to-face conversation both when we agree and when we disagree, although studies find more mimicry when people agree than when they do not. People tend to display a shared opinion by mirroring their counterpart's behavior. In this paper, the authors present a method to detect and measure behavioral mimicry in face-to-face conversation by analyzing human body movements and human vocal behavior.

They developed an audiovisual corpus specifically for mimicry research. The data were drawn from face-to-face discussion and conversation sessions with 43 different subjects. Each experiment was divided into two sessions. In the first session, each participant was asked to present their own stance on a specific topic (the presentation episode) and then to discuss that topic with their partner (the discussion episode). In the second session, the participants were asked to talk about a non-task-oriented topic among themselves (the conversation episode). For visual recording, 7 cameras per person and 1 camera covering both persons were used, and the participants' voices were recorded as well. The corpus was later annotated by specialists in human behavioral science.

For visual mimicry detection, they first extract motion features. They use accumulated motion images (AMI) as features to represent the motions; in an AMI, a higher intensity value represents a higher degree of complex motion. They then focus on hand-gesture mimicry in the conversation and compute the cross-correlation of movements between the two persons. Where the cross-correlation of body movements is high, they assume that behavioral mimicry probably occurs in those time periods.
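
To make this pipeline concrete, here is a minimal Python sketch, assuming grayscale video frames as NumPy arrays. The function names, the window size, and the use of absolute frame differences are my assumptions for illustration; the paper's exact AMI formulation and correlation procedure may differ.

    import numpy as np

    def accumulated_motion_image(frames):
        # frames: (T, H, W) grayscale window; accumulate absolute
        # frame-to-frame differences so that brighter pixels mark
        # regions with more motion (assumed AMI variant).
        frames = np.asarray(frames, dtype=np.float32)
        return np.abs(np.diff(frames, axis=0)).sum(axis=0)

    def motion_energy_series(frames, window=25):
        # One scalar motion-energy value (mean AMI intensity) per
        # non-overlapping window of `window` frames.
        return np.array([
            accumulated_motion_image(frames[i:i + window]).mean()
            for i in range(0, len(frames) - window + 1, window)
        ])

    def motion_correlation(series_a, series_b):
        # Pearson correlation between the two persons' motion-energy
        # series; high values flag candidate mimicry periods.
        return float(np.corrcoef(series_a, series_b)[0, 1])

Running motion_correlation over sliding segments of the two persons' series would give a correlation curve whose peaks mark candidate mimicry periods, in the spirit of the authors' cross-correlation analysis.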

For detection of non-verbal vocal mimicry, the authors use the speech rate divided by the length of the signal as the feature. They calculate correlations between the speech patterns in different episodes and compare these correlations to each other, as sketched below. The participant's behavior in the presentation episode is used as the participant's baseline. The correlation with this baseline decreases if the participant adapts to the partner's speech behavior, and the participant-partner correlation increases if the partner's speech behavior becomes more similar to the participant's.
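
The sketch below shows one plausible way to compute these two correlation curves over sliding windows, assuming per-window speech-rate values have already been extracted. The window length and the preprocessing are hypothetical, not the paper's exact procedure.

    import numpy as np

    def correlation_curves(baseline, disc_self, disc_partner, win=5):
        # baseline:     participant's speech-rate sequence in the
        #               presentation episode (baseline behavior)
        # disc_self:    participant's sequence in the discussion episode
        # disc_partner: partner's sequence in the discussion episode
        def corr(x, y):
            return float(np.corrcoef(x, y)[0, 1])

        n = min(len(baseline), len(disc_self), len(disc_partner))
        curve_a, curve_b = [], []
        for i in range(n - win + 1):
            # Curve A: participant in discussion vs. own baseline.
            curve_a.append(corr(baseline[i:i + win], disc_self[i:i + win]))
            # Curve B: participant vs. partner within the discussion.
            curve_b.append(corr(disc_self[i:i + win], disc_partner[i:i + win]))
        return np.array(curve_a), np.array(curve_b)

A drop in curve A together with a rise in curve B would then be read as the participant adapting toward the partner.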

Evaluation

To validate their results, the correlation curves are presented in the following figure. The solid line represents the correlation between the participant's performance in the presentation and discussion episodes (curve A). The dashed line represents the correlation between the participant's and the partner's performance in the discussion episode (curve B).



Three phases are observed in the result. Up to window number 8, both correlations increase. Between window numbers 8 and 17, curve A decreases and curve B increases. After that, curve A increases and curve B decreases. In phase 1, correlation A increases because the participant begins with a similar speech style in both the presentation and discussion episodes, and curve B also increases because the confederate starts the discussion with speech behavior more similar to the participant's. In phase 2, correlation A decreases because the participant starts mimicking the confederate, while correlation B still increases because the participant and the confederate are mimicking each other. In phase 3, the participant and the confederate both know that the end of the discussion is approaching, and they start expressing their own opinions in their own styles, so correlation A increases and correlation B decreases.
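
The paper identifies these three phases by inspecting the curves. Purely as an illustration of that reading, the hypothetical sketch below labels each step of the two curves by their joint trend; it is not an algorithm from the paper.

    import numpy as np

    def label_phases(curve_a, curve_b):
        # Sign of the step-to-step change: +1 rising, -1 falling.
        da = np.sign(np.diff(curve_a))
        db = np.sign(np.diff(curve_b))
        labels = []
        for a, b in zip(da, db):
            if a > 0 and b > 0:
                labels.append("phase 1: A and B both rise")
            elif a < 0 and b > 0:
                labels.append("phase 2: A falls, B rises (mimicry)")
            elif a > 0 and b < 0:
                labels.append("phase 3: A rises, B falls")
            else:
                labels.append("unclassified")
        return labels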
   

Validity of the Paper

In this paper, the authors show that behavioral information can be extracted from audiovisual data and used to measure mimicry. However, the correlation of visual behavior presented in this paper is not reliable enough to detect visual mimicry; from the results, it can only be said that the participants show similar behavior to a certain degree. The authors also outline future work for approaching this research problem.

 

Improvement Scopes

I think the main challenge faced by researchers in this field is annotated data. The authors created a face-to-face meeting corpus that is very helpful for researchers in this field, although the data were collected in a highly constrained environment. There is also room for improvement in making the visual mimicry prediction more reliable. In my opinion, fusing the results obtained from multiple modalities will be the main direction for improvement in this research area.

Further Reading

One of the interesting articles cited by this paper is “Histograms of oriented gradients for human detection” by N. Dalal and B. Triggs (Digital Object Identifier 10.1109/CVPR.2005.177), which is cited by around 4000 articles. In that article, the authors use the Histograms of Oriented Gradients (HOG) descriptor for human detection. Their results show that HOG outperforms most other existing feature sets for this purpose.
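
As a minimal sketch of what a HOG descriptor looks like in practice, the snippet below uses scikit-image's hog function with the standard Dalal-Triggs parameters (9 orientation bins, 8x8-pixel cells, 2x2-cell blocks) on a random stand-in for a 128x64 detection window. In a real detector, this descriptor would be fed to a linear SVM, as in the cited paper.

    import numpy as np
    from skimage.feature import hog

    # Random stand-in for a 128x64 grayscale detection window
    # (the window size used by Dalal and Triggs).
    window = np.random.rand(128, 64)

    descriptor = hog(
        window,
        orientations=9,           # 9 orientation bins
        pixels_per_cell=(8, 8),   # 8x8-pixel cells
        cells_per_block=(2, 2),   # 2x2-cell blocks
        block_norm="L2-Hys",
    )
    print(descriptor.shape)       # (3780,) for a 128x64 window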
 

References

[1] X. Sun, K. Truong, M. Pantic, and A. Nijholt, “Towards visual and vocal mimicry recognition in human-human interactions,” in Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on. IEEE, 2011, pp. 367–373.

[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
