Audio-Visual Command Recognition Based on Regulated Transformer
and Spatio-Temporal Fusion Strategy
for Driver Assistive Systems

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg, Russia
2 ITMO University, St. Petersburg, Russia

Abstract

This article presents a methodology for audio-visual speech recognition (AVSR) in driver assistive systems. For safety reasons, these systems must interact with drivers continuously through voice control while driving. The article introduces a novel audio-visual speech command recognition transformer (AVCRFormer) designed specifically for robust AVSR. We propose (1) a multimodal fusion strategy based on spatio-temporal fusion of audio and video feature matrices, (2) a regulated transformer based on an iterative model refinement module with multiple encoders, and (3) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves the contextual information of both modalities and synchronizes them. The iterative model refinement module bridges the gap between acoustic and visual data by leveraging their respective impact on speech recognition accuracy. The proposed multi-prediction strategy outperforms the traditional single-prediction strategy, demonstrating the model's adaptability across diverse audio-visual contexts. The proposed transformer achieves the highest speech command recognition accuracy among the compared methods, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human-computer interaction. The capabilities of AVCRFormer extend beyond AVSR, making it a valuable contribution at the intersection of audio-visual processing and artificial intelligence.
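
To make the multi-prediction idea concrete, the sketch below combines the outputs of several decoders into one decision. The abstract does not state how the decoder outputs are combined, so the averaging rule is an assumption for illustration, and the names `softmax` and `ensemble_predict` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(decoder_logits):
    """Average per-decoder class probabilities and return (label, mean_probs).

    ASSUMPTION: simple probability averaging; the source only says that the
    ensemble is "based on multiple decoders".
    """
    probs = [softmax(logits) for logits in decoder_logits]
    n_classes = len(probs[0])
    mean_probs = [
        sum(p[c] for p in probs) / len(probs) for c in range(n_classes)
    ]
    label = max(range(n_classes), key=mean_probs.__getitem__)
    return label, mean_probs


# Toy input: three decoders scoring four command classes (made-up numbers).
toy_logits = [
    [2.0, 1.0, 0.1, 0.0],  # this decoder prefers class 0
    [0.5, 2.5, 0.2, 0.1],  # this decoder prefers class 1
    [0.4, 2.2, 0.3, 0.0],  # this decoder prefers class 1
]
label, mean_probs = ensemble_predict(toy_logits)
print(label, [round(p, 3) for p in mean_probs])
```

In this toy case one decoder disagrees with the other two, and the averaged distribution follows the majority. A single-prediction setup would correspond to passing only one logit vector.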

RUSAVIC Corpus

Corpus Parameters

Parameter | Value
Number of speakers | 20
Number of voice commands | 62
Command repetitions by each speaker | 10
Number of annotated videos | ≈ 22,350
Duration of audio-visual data | ≈ 6 h 54 min
Clean speech percentage | 95%
Video data format | mp4
Frame rate | 60 FPS
Data volume | ≈ 840 GB
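
As a rough reading aid, the approximate totals above can be turned into per-video averages. The sketch below assumes that the total duration and data volume are spread over the annotated videos, with each video counted once, and that 1 GB = 1000 MB. Neither assumption is stated in the source, so the results are order-of-magnitude estimates only.

```python
# Values copied from the corpus parameter table (marked as approximate there).
n_videos = 22_350
total_seconds = 6 * 3600 + 54 * 60  # 6 h 54 min
fps = 60
data_volume_gb = 840

# ASSUMPTION: totals are spread evenly over the annotated videos.
mean_clip_seconds = total_seconds / n_videos
mean_frames_per_clip = mean_clip_seconds * fps
mean_size_mb = data_volume_gb * 1000 / n_videos  # ASSUMPTION: 1 GB = 1000 MB

print(f"mean clip length: {mean_clip_seconds:.2f} s")
print(f"mean frames per clip: {mean_frames_per_clip:.0f}")
print(f"mean size per video: {mean_size_mb:.1f} MB")
```

Under these assumptions an average annotated video lasts a little over one second, which is consistent with short voice commands.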

Phrases

# | Phrase | # | Phrase
1 | Call | 32 | Reset route
2 | Dial a number | 33 | End route
3 | Send message | 34 | Find a hospital
4 | Send e-mail | 35 | Find a petrol station
5 | Answer the call | 36 | Find a pharmacy
6 | End call | 37 | Find a bank
7 | Radio | 38 | Find a coffee shop
8 | Music | 39 | Find a restaurant
9 | Play | 40 | Find an airport
10 | Pause | 41 | Find a bus station

Phrases in Russian

# | Phrase | # | Phrase
1 | Позвонить | 32 | Сбросить маршрут
2 | Набрать номер | 33 | Завершить маршрут
3 | Отправить сообщение | 34 | Найти больницу
4 | Отправить e-mail | 35 | Найти заправку
5 | Ответить на звонок | 36 | Найти аптеку
6 | Завершить вызов | 37 | Найти банк
7 | Радио | 38 | Найти кофейню
8 | Музыка | 39 | Найти ресторан
9 | Воспроизвести | 40 | Найти аэропорт
10 | Пауза | 41 | Найти автовокзал

Method Overview

[Figure: overview of the proposed method]

Architecture of the Encoders and Decoders

[Figure: architecture of the encoders and decoders]

AVCRFormer Application for Driver Assistive Systems

[Figure: application of the model in a driver assistive system]

Examples of Heatmaps for Samples from the RUSAVIC Corpus

[Figure: heatmap examples for samples from the RUSAVIC corpus]

Evaluation Results

# | Method | Accuracy, % (RUSAVIC, V; LRW, V)
1 | Pan et al. (2022) | 85.00
2 | Zhang et al. (2020) | 85.02
3 | Martinez et al. (2020) | 85.30
4 | Kim et al. (2021) | 85.40
5 | Feng et al. (2020) | 88.40
6 | Ma et al. (2021) | 88.50
7 | Kim et al. (2022) | 88.50
8 | Ivanko et al. (2022) | 88.70
9 | Koumparoulis et al. (2022) | 89.52
10 | Ma et al. (2022) | 94.10
11 | Axyonov et al. (2023) | 83.46

# | Method | Accuracy, % (RUSAVIC, A / V / AV; LRW, A / V / AV)
12 | Petridis et al. (2018) | 97.70 / 82.00 / 98.00
13 | Miao et al. (2020) | — / 82.80 / 98.30
14 | Ryumin et al. (2023) | 96.07 / 87.16 / 98.76
15 | Axyonov et al. (2024) | 95.61 / 83.46 / 98.65
16 | Ours with single prediction (2024) | 94.89 / 82.31 / 98.23; 94.45 / 88.36 / 96.60
17 | Ours with multi-prediction (2024) | 95.75 / 84.03 / 98.87; 97.25 / 89.57 / 98.81
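
The gap between the single-prediction and multi-prediction rows can be read in two ways: as an absolute gain in percentage points, or as a relative reduction of the remaining error. The sketch below computes both for the audio-visual (AV) results. The accuracy values come from the last two rows above. The helper `compare` and the choice of relative error reduction as a metric are illustrative additions, not something the source reports.

```python
# AV accuracies (%) taken from the single- and multi-prediction rows.
av_accuracy = {
    "RUSAVIC": {"single": 98.23, "multi": 98.87},
    "LRW": {"single": 96.60, "multi": 98.81},
}


def compare(single, multi):
    """Return (absolute gain in points, relative error reduction in %)."""
    gain = multi - single
    error_before = 100.0 - single
    return gain, gain / error_before * 100.0


summary = {corpus: compare(**acc) for corpus, acc in av_accuracy.items()}
for corpus, (gain, reduction) in summary.items():
    print(f"{corpus}: +{gain:.2f} points, {reduction:.1f}% fewer errors")
```

Because both baselines already sit above 96%, the absolute gains are small, and the relative error reduction shows more clearly how much of the residual error the multi-prediction strategy removes.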