OpenAV: Dataset for Audio-Visual Voice Control of a Computer for Hand Disabled People

Authors are hidden for peer review ¹
¹ Affiliation is hidden for peer review
SPECOM 2024 (submitted)

Abstract

In recent years, audio-visual speech recognition (AVSR) assistance systems have gained increasing attention from researchers as an important part of human-computer interaction (HCI). The objective of this paper is to further advance the development of assistive technologies in the AVSR field by introducing OpenAV, a multi-modal dataset intended for training state-of-the-art neural network models. OpenAV is designed for training AVSR models that assist people without hands, or with disabilities of the hands or arms, in HCI. The dataset can also be useful to ordinary users for hands-free, contactless HCI. It currently includes recordings of 15 speakers, with a minimum of 10 recording sessions each. We provide a detailed description of the dataset and its collection pipeline. In addition, we evaluate a state-of-the-art audio-visual (AV) speech recognition approach and present baseline recognition results. We also describe the recording methodology, release the recording software to the public, and open access to the dataset.

OpenAV Dataset Description


Snapshots of the Speakers


Web-Service for Recording


Audio-Visual Neural Network Model Architecture


Conclusion

In this paper, we have created OpenAV, a multi-speaker audio-visual dataset designed for training state-of-the-art neural network models and intended for building AVSR assistance systems for people with hand disabilities. The dataset can also be useful to ordinary users for hands-free, contactless HCI. It currently includes recordings of 15 speakers, with a minimum of 10 recording sessions each. We have also provided a detailed description of the dataset and its collection pipeline.

In addition, we have evaluated a state-of-the-art audio-visual (AV) speech recognition approach and presented baseline recognition results. Fusing the audio and visual modalities through a model-level fusion approach yields an accuracy of 91.54%. In terms of recognition accuracy, this is comparable to the state-of-the-art results achieved on other AV corpora.
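To make the notion of model-level fusion concrete, the listing below gives a minimal PyTorch sketch of the general scheme: each modality is encoded by its own network, the two embeddings are concatenated, and a joint head classifies the spoken command. This is an illustrative assumption only; the encoder architectures, feature dimensions, input shapes, and command vocabulary size are invented here and are not the OpenAV baseline model itself.

# Minimal sketch of model-level audio-visual fusion for command classification.
# All layer sizes, input shapes, and the command vocabulary are hypothetical.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram (batch, 1, n_mels, frames) into a fixed vector."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))


class VisualEncoder(nn.Module):
    """Encodes a lip-region clip (batch, 3, frames, height, width) into a fixed vector."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((2, 4, 4)),
        )
        self.proj = nn.Linear(32 * 2 * 4 * 4, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))


class AVFusionClassifier(nn.Module):
    """Model-level fusion: concatenate modality embeddings, then classify jointly."""
    def __init__(self, num_commands: int, dim: int = 256):
        super().__init__()
        self.audio = AudioEncoder(dim)
        self.visual = VisualEncoder(dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, num_commands),
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.audio(audio), self.visual(video)], dim=-1)
        return self.head(fused)  # logits over the command vocabulary


if __name__ == "__main__":
    model = AVFusionClassifier(num_commands=50)   # hypothetical vocabulary size
    spec = torch.randn(2, 1, 80, 100)             # dummy log-mel batch
    clip = torch.randn(2, 3, 16, 88, 88)          # dummy lip-region clip batch
    print(model(spec, clip).shape)                # -> torch.Size([2, 50])

Fusing at the embedding level, rather than averaging per-modality decisions, lets the joint head learn cross-modal interactions, which is the usual motivation for model-level fusion over late (decision-level) fusion.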