Automating The Facial Action Coding System:
Issues And Image Representations

Bartlett, M.S., Donato, G.L., Movellan, J.R., Ekman, P., & Sejnowski, T.J.
NIPS Post-Conference Workshop on Affective Computing, Breckenridge, CO, December 2.

Abstract

Faces contain much information beyond what is conveyed by basic emotion categories, including signs of cognitive state such as interest, boredom, and confusion, conversational signals that provide emphasis to speech and information about syntax, and blends of two or more emotions (e.g., happiness + disgust = smug). In addition, variations within an emotional category (e.g., vengeance vs. resentment) and variations in magnitude (annoyance vs. fury) may be signaled by which muscles are contracted as well as by the intensity of the contraction. Instead of classifying expressions into a few basic emotion categories, this system attempts to measure the full range of facial behavior by recognizing the facial action units that comprise facial expressions. The system is based on the Facial Action Coding System (FACS) (Ekman & Friesen, 1978), which was developed by experimental psychologists to objectively measure facial movement. In FACS, human scorers decompose each facial expression into component muscle movements. Advantages of FACS over other sets of animation parameters defined by the engineering community include 1) Comprehensiveness: each independent motion of the face is described by one of the forty-six action units, and 2) Robust link with ground truth: there is over 20 years of behavioral data on the relationships between FACS movement parameters and underlying emotional or cognitive states.
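To make the idea of a FACS coding concrete, the sketch below shows one way such a coding could be represented as data. The action unit numbers and names follow the published FACS manual, but the dictionary layout and the letter intensity grades shown here are purely illustrative, not the authors' format.

```python
# Purely illustrative sketch of a FACS coding as a data structure.
# AU numbers/names follow the FACS manual; the layout and the letter
# intensity grades are assumptions made for this example only.
facs_action_units = {
    1: "Inner Brow Raiser",
    2: "Outer Brow Raiser",
    4: "Brow Lowerer",
    6: "Cheek Raiser",
    12: "Lip Corner Puller",
    # ... FACS defines further action units covering the rest of the face
}

# One scored expression: the action units present, each with an intensity grade.
duchenne_smile = {6: "B", 12: "D"}  # cheek raiser + lip corner puller

for au, intensity in duchenne_smile.items():
    print(f"AU{au} ({facs_action_units[au]}): intensity {intensity}")
```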

The first part of the talk described the Facial Action Coding System and motivated its application to affective computing. The second part of the talk explored and compared techniques for automatically recognizing facial actions in sequences of images. These methods include unsupervised learning techniques for finding image filters, such as principal component analysis, independent component analysis, and local feature analysis, and supervised learning techniques such as Fisher's linear discriminants. These data-driven filters were compared to Gabor wavelets, in which the filter kernels are predefined. Best performances were obtained with the Gabor wavelet representation and the independent component representation, both of which achieved 96% accuracy for classifying twelve facial actions. Both the ICA and the Gabor wavelet kernels share the property of spatial locality. In addition, both bear a relationship to receptive fields in primary visual cortex, and both are sensitive to high-order dependencies in the image ensemble. The ICA representation employed two orders of magnitude fewer kernels than the Gabor representation and required 90% less CPU time to compute for new images. The results provide evidence for the importance of local filter kernels, high spatial frequencies, and statistical independence for classifying facial actions.
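As a rough illustration of the predefined-kernel branch of this comparison, the sketch below builds a small Gabor filter bank with NumPy, convolves it with a face-region image, and uses the response magnitudes as a feature vector for a cosine-similarity nearest-neighbour label assignment. It is a minimal sketch, not the authors' implementation: the kernel sizes, wavelengths, orientations, and the classifier are illustrative assumptions.

```python
# Minimal Gabor filter bank sketch (illustrative parameters, not the
# authors' system): complex Gabor kernels at several scales and
# orientations, with response magnitudes used as features.
import numpy as np
from scipy.signal import fftconvolve


def gabor_kernel(wavelength, theta, sigma, size=31):
    """Complex Gabor kernel: Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)      # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + y_t ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * x_t / wavelength)
    return envelope * carrier


def gabor_features(image, wavelengths=(4, 8, 16), n_orientations=8):
    """Concatenate filter response magnitudes over scales and orientations."""
    feats = []
    for lam in wavelengths:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            kernel = gabor_kernel(lam, theta, sigma=0.5 * lam)
            response = fftconvolve(image, kernel, mode="same")
            feats.append(np.abs(response).ravel())
    return np.concatenate(feats)


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


if __name__ == "__main__":
    # Toy usage with random images and placeholder action-unit labels.
    rng = np.random.default_rng(0)
    train = [(rng.random((48, 48)), "AU1"), (rng.random((48, 48)), "AU12")]
    probe = rng.random((48, 48))
    probe_feat = gabor_features(probe)
    label = max(train,
                key=lambda t: cosine_similarity(gabor_features(t[0]), probe_feat))[1]
    print("nearest action unit:", label)
```

The data-driven alternatives discussed in the talk (PCA, ICA, local feature analysis, Fisher's linear discriminants) differ only in how the filter kernels are obtained: they are learned from the image ensemble rather than fixed in advance as above.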