Reference Paper
Towards Visual and Vocal Mimicry Recognition in Human-Human
Interactions
Xiaofan Sun, Khiet P. Truong, Maja Pantic, Anton Nijholt
(Digital Object Identifier 10.1109/ICSMC.2011.6083693)
Overview of the Paper
Mimicry occurs in face-to-face conversation both when we
agree and when we disagree, but it is found that people mimic more when
they agree than when they do not. People signal a shared opinion by
displaying behavior similar to that of their counterpart. In this paper, the authors present
a method to detect and measure behavioral mimicry in face-to-face conversation
by analyzing human body movements and non-verbal vocal behavior.
They developed an audiovisual corpus specifically for
mimicry research. The data were collected from face-to-face discussion and
conversation sessions with 43 different subjects. Each experiment was divided
into two sessions. In the first session, each participant was asked to present
his or her own stance on a specific topic (the presentation episode) and then
to discuss the topic with a partner (the discussion episode). In the second
session, the participants were asked to talk with each other about a non-task-oriented
topic (the conversation episode). For the visual recordings, 7 cameras per person
and 1 camera covering both persons were used, and the voices were recorded as well.
The corpus was later annotated by annotators specialized in human behavior science.
For visual mimicry detection, they first extract motion
features. They use accumulated motion images (AMI) to represent the motion;
in an AMI, a higher intensity value indicates a higher degree of motion. They
then focus on hand gesture mimicry in the conversation and compute the
cross-correlation of the movements of the two persons. In time periods where
the body movements are strongly cross-correlated, they assume that behavioral
mimicry probably occurs.
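To make the idea concrete, below is a minimal sketch of how such a windowed cross-correlation of motion between two persons could be computed. This is not the authors' implementation: the crude motion-energy feature stands in for the AMI intensities, and the window and step sizes are illustrative assumptions.

import numpy as np

def motion_energy(frames):
    # Per-frame motion energy from absolute frame differences.
    # `frames` has shape (T, H, W); this simple feature stands in for
    # the intensities accumulated in an AMI.
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.reshape(diffs.shape[0], -1).sum(axis=1)

def windowed_correlation(sig_a, sig_b, win=100, step=25):
    # Pearson correlation of the two motion signals per sliding window.
    scores = []
    for start in range(0, min(len(sig_a), len(sig_b)) - win + 1, step):
        a = sig_a[start:start + win]
        b = sig_b[start:start + win]
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        scores.append(float(np.mean(a * b)))
    return np.array(scores)

# Windows with a high score are candidate periods of visual mimicry.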
For the detection of non-verbal vocal mimicry, the authors use
the speech rate, normalized by the length of the signal, as the feature. They
compute correlations between the speech patterns in different episodes and compare
these correlations to each other. The participant's speech in the presentation
episode is used as the participant's baseline behavior. The correlation with this
baseline decreases if the participant adapts to the speech behavior of the partner,
while the correlation between participant and partner in the discussion increases
if the partner's speech behavior becomes more similar to the participant's.
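As a rough illustration of this comparison, the two correlation curves discussed in the evaluation below could be computed along the lines of the following sketch. The speech-rate values, window size, and variable names are placeholders of my own; the paper does not specify this implementation.

import numpy as np

def rolling_corr(x, y, win=5):
    # Pearson correlation between two feature series over a sliding window.
    n = min(len(x), len(y)) - win + 1
    return np.array([np.corrcoef(x[i:i + win], y[i:i + win])[0, 1]
                     for i in range(n)])

# Placeholder per-segment speech-rate features (random, for illustration only).
rng = np.random.default_rng(0)
rate_participant_presentation = rng.normal(4.0, 0.5, size=60)
rate_participant_discussion = rng.normal(4.0, 0.5, size=60)
rate_partner_discussion = rng.normal(4.5, 0.5, size=60)

# Curve A: participant in presentation vs. participant in discussion.
curve_a = rolling_corr(rate_participant_presentation, rate_participant_discussion)
# Curve B: participant vs. partner within the discussion episode.
curve_b = rolling_corr(rate_participant_discussion, rate_partner_discussion)

# A falling curve A together with a rising curve B suggests the participant
# is adapting towards the partner's speech behavior.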
Evaluation
To validate their result, the correlation curves are presented
in the figure below. The solid line represents
the correlation between the participant's performance in the presentation and discussion
episodes (curve A). The dashed line represents the correlation between the
participant's and the partner's performance in the discussion episode (curve B).
Three phases can be observed in the result. Up to window number
8, both correlations increase. Between window numbers 8 and 17, curve A decreases
and curve B increases. After that, curve A increases and curve B decreases. In phase
1, correlation A increases because the participant begins with a similar
speech style in both the presentation and discussion episodes, and curve B also
increases because the confederate starts the discussion with speech behavior
more similar to the participant's. In phase 2, correlation A decreases because
the participant starts mimicking the confederate, while correlation B keeps
increasing because the participant and confederate are mimicking each other.
In phase 3, both the participant and the confederate know that the end of the
discussion is approaching, so they start expressing their own opinions in their
own styles; as a result, correlation A increases and correlation B decreases.
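Purely as an illustration of this reading of the curves, the sketch below labels each step of the two curves by the joint direction of change; the actual phase boundaries in the paper were read off the plotted curves, not computed this way.

import numpy as np

def phase_of(delta_a, delta_b):
    # Map the joint direction of change of the two curves to a phase label.
    if delta_a > 0 and delta_b > 0:
        return "phase 1: both correlations rising"
    if delta_a < 0 and delta_b > 0:
        return "phase 2: participant adapting to the confederate"
    if delta_a > 0 and delta_b < 0:
        return "phase 3: both returning to their own style"
    return "unlabelled"

def label_phases(curve_a, curve_b):
    # `curve_a` and `curve_b` are windowed correlation curves,
    # e.g. as produced by the earlier speech-rate sketch.
    return [phase_of(da, db)
            for da, db in zip(np.diff(curve_a), np.diff(curve_b))]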
Validity of the Paper
In this paper, the authors show that behavioral information
can be extracted from audiovisual data and used to measure mimicry. However, the
correlation of visual behavior presented in this paper is not reliable enough to
detect visual mimicry; from the results it can only be said that the participants
show similar behavior to a certain degree. The authors also outline future work
for approaching this research problem.
Improvement Scopes
I think the main challenge faced by researchers in this
field is the lack of annotated data. The authors created a face-to-face meeting corpus
which is very helpful for researchers in this field, although the data were
collected in a highly constrained environment. There is also room for improvement
in making the visual mimicry prediction more reliable. In my opinion, combining
the results obtained from multiple modalities will be the main direction for
improvement in this research area.
Further Reading
One of the interesting articles cited by this paper is
"Histograms of oriented gradients for human detection"
by N. Dalal and B. Triggs (Digital Object Identifier 10.1109/CVPR.2005.177),
which has been cited by around 4000 articles. In that article, the authors use
the Histograms of Oriented Gradients (HOG) descriptor for human detection.
Their results show that HOG outperforms most other existing feature sets for
this purpose.
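As a quick taste of the descriptor, the snippet below extracts an HOG feature vector with scikit-image. The sample image and parameter values are illustrative, and the linear SVM that Dalal and Triggs train on top of such descriptors is omitted.

from skimage import color, data
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())   # any grayscale image will do
descriptor = hog(image,
                 orientations=9,           # number of gradient orientation bins
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm='L2-Hys')
print(descriptor.shape)                    # one flattened feature vector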
References
[1] X. Sun, K. Truong, M. Pantic, and A. Nijholt, "Towards visual and vocal mimicry recognition in human-human interactions," in Proc. IEEE Int. Conf. on Systems, Man, and Cybernetics (SMC), 2011, pp. 367–373.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.