OpenAV: Dataset for Audio-Visual Voice Control of a Computer for Hand Disabled People

Authors are hidden for peer review ¹
¹ Affiliation is hidden for peer review
SPECOM 2024 (submitted)

Abstract

In recent years, audio-visual speech recognition (AVSR) assistance systems have gained increasing attention from researchers as an important part of human-computer interaction (HCI). The objective of this paper is to further advance the development of assistive technologies in the AVSR field by introducing OpenAV, a multi-modal dataset intended for training state-of-the-art neural network models. OpenAV is designed for training AVSR models that assist people without hands, or with disabilities of the hands or arms, in HCI. The dataset can also be useful to ordinary users for hands-free, contactless HCI. It currently includes recordings of 15 speakers, with a minimum of 10 recording sessions each. We provide a detailed description of the dataset and its collection pipeline. In addition, we evaluate a state-of-the-art audio-visual (AV) speech recognition approach and present baseline recognition results. We also describe the recording methodology, release the recording software to the public, and open access to the dataset.

OpenAV Dataset Description


Snapshots of the Speakers


Web-Service for Recording


Audio-Visual Neural Network Model Architecture


Conclusion

In this paper, we have created OpenAV, a multi-speaker audio-visual dataset designed for training state-of-the-art neural network models and intended for building AVSR assistance systems for people with hand disabilities. The dataset can also be useful to ordinary users for hands-free, contactless HCI. It currently includes recordings of 15 speakers, with a minimum of 10 recording sessions each. We have also provided a detailed description of the dataset and its collection pipeline.

In addition, we have evaluated a state-of-the-art audio-visual (AV) speech recognition approach and presented baseline recognition results. Fusing the audio and visual modalities through a model-level fusion approach yields an accuracy of 91.54%. In terms of recognition accuracy, this is comparable to the state-of-the-art results achieved on other AV corpora.
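To make the notion of model-level fusion concrete, the listing below gives a minimal PyTorch sketch of the general scheme: each modality is encoded by its own network, the two embeddings are concatenated, and a joint head classifies the spoken command. This is an illustrative assumption only; the encoder architectures, feature dimensions, input shapes, and command vocabulary size are invented here and are not the OpenAV baseline model itself.

# Minimal sketch of model-level audio-visual fusion for command classification.
# All layer sizes, input shapes, and the command vocabulary are hypothetical.
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram (batch, 1, n_mels, frames) into a fixed vector."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))


class VisualEncoder(nn.Module):
    """Encodes a lip-region clip (batch, 3, frames, height, width) into a fixed vector."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((2, 4, 4)),
        )
        self.proj = nn.Linear(32 * 2 * 4 * 4, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv(x).flatten(1))


class AVFusionClassifier(nn.Module):
    """Model-level fusion: concatenate modality embeddings, then classify jointly."""
    def __init__(self, num_commands: int, dim: int = 256):
        super().__init__()
        self.audio = AudioEncoder(dim)
        self.visual = VisualEncoder(dim)
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, num_commands),
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.audio(audio), self.visual(video)], dim=-1)
        return self.head(fused)  # logits over the command vocabulary


if __name__ == "__main__":
    model = AVFusionClassifier(num_commands=50)   # hypothetical vocabulary size
    spec = torch.randn(2, 1, 80, 100)             # dummy log-mel batch
    clip = torch.randn(2, 3, 16, 88, 88)          # dummy lip-region clip batch
    print(model(spec, clip).shape)                # -> torch.Size([2, 50])

Fusing at the embedding level, rather than averaging per-modality decisions, lets the joint head learn cross-modal interactions, which is the usual motivation for model-level fusion over late (decision-level) fusion.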