8:30 – 9:00

Chairs’ Welcome

9:00 – 9:50

Prof. Susan Goldin-Meadow

From Action to Abstraction: Gesture as a Mechanism of Change


The spontaneous gestures that people produce when they talk can index cognitive instability and reflect thoughts not yet found in speech. But gesture can go beyond reflecting thought to play a role in changing thought. I consider whether gesture brings about change because it is itself an action and thus brings action into our mental representations. I provide evidence for this hypothesis but suggest that it's not the whole story. Gesture is a special kind of action: it is representational and thus more abstract than direct action on objects, which may be what allows gesture to play a role in learning.


Susan Goldin-Meadow is the Beardsley Ruml Distinguished Service Professor in the Departments of Psychology and Comparative Human Development at the University of Chicago. Her research focuses on the most basic building blocks of language and thought as they develop in early childhood. Specifically, she is interested in uncovering linguistic components so basic that they arise in a child's communication system even when the child has limited access to outside linguistic input. Professor Goldin-Meadow's research has also generated more broadly applicable insights into how the spontaneous gestures that learners produce can reveal their readiness to learn language, math, and scientific concepts. Professor Goldin-Meadow is the founding Editor of Language Learning and Development and has served as president of the International Society for Gesture Studies, president of the Cognitive Development Society, chair of the Cognitive Science Society, and president of the Association for Psychological Science. She is the recipient of a Guggenheim Fellowship, received the William James Award for lifetime achievement in basic research from APS, and was elected to the American Academy of Arts and Sciences in 2005 and to the National Academy of Sciences in 2020. She was awarded the David E. Rumelhart Prize in 2020 for significant contributions to the theoretical foundations of human cognition.

10:00 – 10:50

Special Session 1: Challenges in Modeling and Representation of Gestures in Human Interactions

Gesture Agreement Assessment Using Description Vectors

Naveen Madapana (Purdue University)*; Glebys Gonzalez (Purdue University); Juan Wachs (Purdue University)

Beyond MAGIC: Matching Collaborative Gestures using an Optimization-based Approach

Edgar J Rojas Muñoz (Purdue University)*; Juan Wachs (Purdue University)

Towards a visual Sign Language dataset for home care services

Dimitrios Kosmopoulos (University of Patras)*; Iason Oikonomidis (FORTH); Konstantinos Konstantinopoulos (University of Patras); Nikolaos Arvanitis (University of Patras); Klimis Antzakas (University of Patras); Aristidis Bifis (University of Patras); Georgios Lydakis (ICS-FORTH); Anastasios Roussos (Institute of Computer Science, Foundation for Research and Technology Hellas); Antonis A Argyros (CSD-UOC and ICS-FORTH)

Image-based Pose Representation for Action Recognition and Hand Gesture Recognition

Zeyi Lin (Institute of Software, Chinese Academy of Sciences); Wei Zhang (Institute of Software, Chinese Academy of Sciences); Xiaoming Deng (Institute of Software, Chinese Academy of Sciences)*; Cuixia Ma (Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Hongan Wang (Institute of Software, Chinese Academy of Sciences)

11:00 – 11:50

Oral Session 1

DeeSCo: Deep heterogeneous ensemble with Stochastic Combinatory loss for gaze estimation

Edouard Yvinec (Datakalab); Arnaud Dapogny (Pierre and Marie Curie University (UPMC)); Kevin Bailly (Sorbonne University / Datakalab)*

Can We Read Speech Beyond Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Yuan-Hang Zhang (University of Chinese Academy of Sciences)*; Shuang Yang (ICT, CAS); Jingyun Xiao (University of Chinese Academy of Sciences); Shiguang Shan (Chinese Academy of Sciences); Xilin Chen (Institute of Computing Technology, Chinese Academy of Sciences)

Generative Video Face Reenactment by AUs and Gaze Regularization

Josep Famades (UB); Meysam Madadi (CVC); Cristina Palmero (UB); Sergio Escalera (CVC and University of Barcelona)*

Visual Saliency Detection guided by Neural Signals

Simone Palazzo (University of Catania); Francesco Rundo (STMicroelectronics ADG—Central R&D); Sebastiano Battiato (Università di Catania); Daniela Giordano (University of Catania); Concetto Spampinato (University of Catania)*

12:00 – 12:50

Poster Session 1

Image Enhancement for Remote Photoplethysmography in a Low-Light Environment

Lin Xi (Beihang University); Weihai Chen (Beihang University); Xingming Wu (Beihang University); Jianhua Wang (Beihang University); Changchen Zhao (Zhejiang University of Technology)*

IF-GAN: Generative Adversarial Network for Identity Preserving Facial Image Inpainting and Frontalization

Kunjian Li (Sichuan University); Qijun Zhao (Sichuan University)*

Face Video Generation from a Single Image and Landmarks

Kritaphat Songsri-in (Imperial College London)*; Stefanos Zafeiriou (Imperial College London)

End-to-end facial and physiological model for Affective Computing and applications

Joaquim Comas Martínez (Universitat Pompeu Fabra)*; Decky Aspandi (Universitat Pompeu Fabra); Xavier Binefa (UPF)

Clustering based Contrastive Learning for Improving Face Representations

Vivek Sharma (MIT, KIT)*; Makarand Tapaswi (INRIA); Saquib Sarfraz (Karlsruhe Institute of Technology); Rainer Stiefelhagen (Karlsruhe Institute of Technology)

Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement

Matthew J Vowels (University of Surrey)*; Necati Cihan Camgoz (University of Surrey); Richard Bowden (University of Surrey)

Recognizing Gestures from Videos using a Network with Two-branch Structure and Additional Motion Cues

Jiaxin Zhou (Saitama University)*; Takashi Komuro (Saitama University)

DeepVI: A Novel Framework for Learning Deep View-Invariant Human Action Representations using a Single RGB Camera

Konstantinos Papadopoulos (University of Luxembourg)*; Enjie Ghorbel (SnT, University of Luxembourg); Oyebade K Oyedotun (University of Luxembourg); Djamila Aouada (SnT); Bjorn Ottersten (SnT)

MoDuL: Deep Modal and Dual Landmark-wise Gated Network for Facial Expression Recognition

Sacha Bernheim (ISIR); Estephe ARNAUD (Sorbonne University); Arnaud Dapogny (Pierre and Marie Curie University (UPMC)); Kevin Bailly (Sorbonne University / Datakalab)*

EV-Action: Electromyography-Vision Multi-Modal Action Dataset

Lichen Wang (Northeastern University)*; Bin Sun (Northeastern University); Joseph P Robinson (Northeastern University); Taotao Jing (Northeastern University); Yun Fu (Northeastern University)

Hybrid Video and Image Hashing for Robust Face Retrieval

Ruikui Wang (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences); Shishi Qiao (ICT, CAS); Ruiping Wang (ICT, CAS)*; Shiguang Shan (Chinese Academy of Sciences); Xilin Chen (Institute of Computing Technology, Chinese Academy of Sciences)

Block Mobilenet: Align Large-Pose Faces with <1MB Model Size

Bin Sun (Northeastern University)*; Jun Li (MIT); Yun Fu (Northeastern University)

End-to-end Spatial Attention Network with Feature Mimicking for Head Detection

Junjie Zhang (National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan)*; Yuntao Liu (National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan); Rongchun Li (National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan); Yong Dou (National University of Defense Technology)

13:00 – 13:50

Prof. Aleix Martinez

Toward an AI Theory of Mind: Understanding people’s intent and interests


If we want to improve the user's experience, we need algorithms that can understand their emotions, intents, and interests. Furthermore, to determine how best to help a user, we need systems that can answer hypothetical questions, i.e., "what if" questions. In short, we need to equip AI systems with a theory of mind. In this talk, I will present a number of projects that my research group has worked on to address this general goal. Specifically, I will first present our work on the interpretation of emotion from faces, bodies, and context. I will present a number of medical applications, the interpretation of users' interactions with a device, and the automatic recognition of sign languages. Following these example applications, I will show how the biomechanics of agents determine whether their actions are performed intentionally or not. I will also describe a first attempt at designing algorithms that can answer hypothetical questions. Throughout, we will derive supervised and unsupervised methods, as well as a new approach that allows developers to know whether their deep neural networks are learning to generalize or simply learning to memorize. I will conclude with a discussion of how these efforts are bringing us closer to our goal of designing a theory of mind for AI, and how this will improve the user's experience with the devices of the future.


Aleix M. Martinez is a Sr. Manager of Applied Sciences at Amazon and a Professor in the Department of Electrical and Computer Engineering at The Ohio State University (OSU). Prior to joining Amazon and OSU, he was with the Department of Electrical and Computer Engineering at Purdue University and was a Research Scientist at the Sony Computer Science Lab. Aleix has served as Associate Editor of several journals, including IEEE Transactions on PAMI, has been an Area Chair for many top conferences, and, in 2014, was Program Chair of CVPR. Aleix is best known for being the first to define many problems and solutions in face recognition (e.g., recognition under occlusion, expression, imprecise landmark detection), discriminant analysis (e.g., Bayes optimal solutions, subclass approaches, optimal kernels), and structure from motion (e.g., using kernel mappings to better model non-rigid deformations, noise invariance); for demonstrating the existence of a much larger set of cross-cultural facial expressions of emotion than previously known (i.e., compound emotions), as well as the transmission of emotion through changes in facial color; and for defining a new algebraic-topology approach to explaining how deep networks learn to generalize. He has received best paper awards at CVPR and ECCV, a Google Faculty Research Award, and a Lumley Research Award, and from 2012 to 2018 he served as a member of NIH's Cognition and Perception study section. His research has been featured in the media, including the New York Times, CNN, NPR, the Washington Post, Time, The Guardian, Spiegel, El País, and Le Monde, among others.

14:00 – 15:20

Oral Session 2

Real-time Facial Expression Recognition “In The Wild” by Disentangling 3D Expression from Identity

Mohammad Rami Koujan (University of Exeter)*; Luma A Alharbawee (University of Exeter); Giorgos Giannakakis (Institute of Computer Science, Foundation for Research and Technology Hellas); Nicolas Pugeault (University of Exeter); Anastasios Roussos (Institute of Computer Science, Foundation for Research and Technology Hellas)

An EEG-Based Multi-Modal Emotion Database with Both Posed and Authentic Facial Actions for Emotion Analysis

Xiaotian Li (Binghamton University)*; Xiang Zhang (State University of New York at Binghamton); Huiyuan Yang (Binghamton University-SUNY); Wenna Duan (Binghamton University); Weiying Dai (Binghamton University); Lijun Yin (State University of New York at Binghamton)

Dynamic versus Static Facial Expressions in the Presence of Speech

Ali Salman (University of Texas at Dallas); Carlos Busso (University of Texas at Dallas)*

Learning Guided Attention Masks for Facial Action Unit Recognition

Nagashri Lakshminarayana (University at Buffalo)*; Srirangaraj Setlur (University at Buffalo, SUNY); Venu Govindaraju (University at Buffalo, SUNY)