Overview
Sight and hearing are two of the most important senses for human perception. From a cognitive perspective, visual and auditory information can be slightly discrepant, yet multisensory integration yields a unified percept. Moreover, when multiple senses provide input, humans usually react more accurately and efficiently than with a single sense. Inspired by this, our community has begun to marry computer vision with audition in computational models, aiming to address essential problems of audio-visual learning and to develop them into interesting and worthwhile tasks. In recent years, we have been delighted to witness many developments in learning from both visual and auditory data.
This tutorial covers recent advances in audio-visual learning, from neuroscience studies of human perception to computational models for machines. For each research sub-topic, we give a concrete introduction to its problems and tasks, the current research progress, and the open problems. We hope the audience, from graduate students to researchers new to this area, can benefit from this tutorial and learn the principal problems and cutting-edge approaches of audio-visual learning.
Schedule
Time | Session | Materials | Speaker
10:00 - 10:05 | Welcome | [Slides] | Chenliang Xu
10:05 - 10:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
10:55 - 11:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
11:45 - 12:35 | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
12:35 - 13:25 | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
13:25 - 13:40 | Coffee Break | |
13:40 - 14:30 | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
14:30 - 15:20 | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
15:20 - 16:10 | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
16:10 - 17:00 | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
17:00 - 17:30 | Panel Discussion | [Recording] |

Time | Session | Materials | Speaker
22:00 - 22:05 | Welcome | [Slides] | Chenliang Xu
22:05 - 22:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
22:55 - 23:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
23:45 - 00:35 (Day+1) | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
00:35 - 01:25 (Day+1) | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
01:25 - 01:40 (Day+1) | Coffee Break | |
01:40 - 02:30 (Day+1) | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
02:30 - 03:20 (Day+1) | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
03:20 - 04:10 (Day+1) | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
04:10 - 05:00 (Day+1) | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
05:00 - 05:30 (Day+1) | Panel Discussion | [Recording] |

Time | Session | Materials | Speaker
16:00 - 16:05 | Welcome | [Slides] | Chenliang Xu
16:05 - 16:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
16:55 - 17:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
17:45 - 18:35 | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
18:35 - 19:25 | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
19:25 - 19:40 | Coffee Break | |
19:40 - 20:30 | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
20:30 - 21:20 | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
21:20 - 22:10 | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
22:10 - 23:00 | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
23:00 - 23:30 | Panel Discussion | [Recording] |