Overview
Sight and hearing are two of the most important senses for human perception. From a cognitive perspective, visual and auditory information can be slightly discrepant, yet multisensory integration yields a unified percept. Moreover, when multiple senses provide input, humans usually react more accurately and efficiently than with a single sense. Inspired by this, our community has begun to marry computer vision with audition in computational models, aiming to address essential problems of audio-visual learning and to develop them into interesting and worthwhile tasks. In recent years, we have been delighted to witness many developments in learning from both visual and auditory data.
This tutorial covers recent advances in audio-visual learning, from neuroscience studies of human perception to computational models for machines. For each research sub-topic, we give a concrete introduction to its problems and tasks, the current research progress, and the open problems. We hope the audience, from graduate students to researchers new to this area, can benefit from this tutorial and learn the principal problems and cutting-edge approaches of audio-visual learning.
Schedule
Time | Session | Materials | Speaker
10:00 - 10:05 | Welcome | [Slides] | Chenliang Xu
10:05 - 10:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
10:55 - 11:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
11:45 - 12:35 | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
12:35 - 13:25 | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
13:25 - 13:40 | Coffee Break | |
13:40 - 14:30 | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
14:30 - 15:20 | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
15:20 - 16:10 | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
16:10 - 17:00 | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
17:00 - 17:30 | Panel Discussion | [Recording] |

Time | Session | Materials | Speaker
22:00 - 22:05 | Welcome | [Slides] | Chenliang Xu
22:05 - 22:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
22:55 - 23:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
23:45 - 00:35 (Day+1) | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
00:35 - 01:25 (Day+1) | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
01:25 - 01:40 (Day+1) | Coffee Break | |
01:40 - 02:30 (Day+1) | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
02:30 - 03:20 (Day+1) | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
03:20 - 04:10 (Day+1) | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
04:10 - 05:00 (Day+1) | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
05:00 - 05:30 (Day+1) | Panel Discussion | [Recording] |

Time | Session | Materials | Speaker
16:00 - 16:05 | Welcome | [Slides] | Chenliang Xu
16:05 - 16:55 | Neuroscience in audio-visual perception | [Slides] [Recording] | Ross K. Maddox
16:55 - 17:45 | Audio scene understanding | [Slides] [Recording] | Zhiyao Duan
17:45 - 18:35 | Audio visual scene-aware dialog based on human perspective scene understanding | [Slides] [Recording] | Chiori Hori
18:35 - 19:25 | Audio-visual self-supervised learning | [Slides] [Recording] | Di Hu
19:25 - 19:40 | Coffee Break | |
19:40 - 20:30 | Natural interaction with audiovisual messages | [Slides] [Recording] | Amir Zadeh
20:30 - 21:20 | Audio-visual sound source localization and separation | [Recording] | Chuang Gan
21:20 - 22:10 | Audio-visual cross-modal generation | [Slides] [Recording] | Lele Chen
22:10 - 23:00 | Audio-visual video understanding | [Slides] [Recording] | Yapeng Tian
23:00 - 23:30 | Panel Discussion | [Recording] |