Audio-Visual Occlusion-Robust Gender Recognition and Age Estimation Approach Based on Multi-Task Cross-Modal Attention

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg, Russia
2 Ulm University, Ulm, Germany
3 ITMO University, St. Petersburg, Russia

Abstract

Gender recognition and age estimation are essential tasks within soft biometric systems, where identifying these characteristics supports a wide range of applications. In real-world scenarios, challenges such as partial facial occlusion complicate these tasks by obscuring crucial voice and facial characteristics. These challenges highlight the importance of developing robust and efficient approaches for gender recognition and age estimation. In this study, we develop a novel audio-visual Occlusion-Robust Gender Recognition and Age Estimation (ORAGEN) approach. The proposed approach is based on intermediate features of unimodal transformer-based models and two Multi-Task Cross-Modal Attention (MTCMA) blocks, which predict gender, age, and protective mask type from voice and facial characteristics. We conduct detailed cross-corpus experiments on the TIMIT, aGender, CommonVoice, LAGENDA, IMDB-Clean, AFEW, VoxCeleb2, and BRAVE-MASKS corpora. The proposed unimodal models outperform State-of-the-Art approaches for gender recognition and age estimation. We also investigate the impact of various protective mask types on the performance of audio-visual gender recognition and age estimation. The results show that current large-scale data are still insufficient for robust gender recognition and age estimation under partial facial occlusion. On the Test subset of the VoxCeleb2 corpus, the proposed approach achieves an Unweighted Average Recall (UAR) of 99.51%, a Mean Absolute Error (MAE) of 5.42, and a UAR of 100% for gender recognition, age estimation, and protective mask type recognition, respectively, while on the Test subset of the BRAVE-MASKS corpus it achieves a UAR of 96.63%, an MAE of 7.52, and a UAR of 95.87% for the same tasks. These results indicate that using data from people wearing protective masks, as well as including the protective mask type recognition task, yields performance gains on all tasks considered. ORAGEN can be integrated into the OCEAN-AI framework for optimizing HR processes, as well as into expert systems with practical applications in various domains, including forensics, healthcare, and industrial safety.

Pipeline of the Proposed Approach

Proposed approach

The transformer layer and the transformer block have different architectures. FCL refers to Fully-Connected Layer, SDPSA to Scaled Dot-Product Self-Attention, MTCMA to Multi-Task Cross-Modal Attention, VAD to Voice Activity Detection, NM to the “No mask” class, TM to “Tissue mask”, MM to “Medical mask”, PM to “Protective mask (FFP2/FFP3)”, PFS to “Protective face shield”, and R to “Respirator”.
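The unimodal encoders rely on scaled dot-product self-attention over frame-level features. Below is a minimal PyTorch sketch of a single-head SDPSA layer; the class name, dimensions, and single-head layout are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDPSA(nn.Module):
    """Minimal single-head scaled dot-product self-attention (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- a sequence of frame-level features
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

# Toy usage: 2 samples, 50 frames, 256-dimensional features.
features = torch.randn(2, 50, 256)
print(SDPSA(dim=256)(features).shape)  # torch.Size([2, 50, 256])
```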

Architecture of the Proposed Model for Audio-Visual Multi-Task Fusion

Audio-visual model

MTCMA refers to Multi-Task Cross-Modal Attention, MLP to Multi-Layer Perceptron, and GELU to Gaussian Error Linear Unit.
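The exact MTCMA block is defined in the paper; the sketch below is only a hedged illustration of one common way to combine cross-modal attention with a GELU MLP, where queries come from one modality and keys/values from the other. The layer sizes, the use of `nn.MultiheadAttention`, layer normalization, and residual connections are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttentionBlock(nn.Module):
    """Illustrative cross-modal attention block: queries from one modality,
    keys/values from the other, followed by a GELU MLP (assumed layout)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, seq_q, dim), e.g., visual features
        # context_feats: (batch, seq_kv, dim), e.g., acoustic features
        attended, _ = self.attn(self.norm1(query_feats), context_feats, context_feats)
        x = query_feats + attended        # residual connection (assumed)
        x = x + self.mlp(self.norm2(x))   # position-wise GELU MLP (assumed)
        return x

# Toy usage: attend visual features (2x16 frames) to acoustic features (2x50 frames).
video = torch.randn(2, 16, 256)
audio = torch.randn(2, 50, 256)
print(CrossModalAttentionBlock()(video, audio).shape)  # torch.Size([2, 16, 256])
```

In the proposed approach, two such blocks exchange information between the acoustic and visual streams before task-specific heads predict gender, age, and protective mask type.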

Research Data

| Corpus | Modalities | Language | Speaker Distribution | Audio/Images Distribution | File Format | Annotation |
|---|---|---|---|---|---|---|
| TIMIT | A | English | 192 females, 438 males; aged 20-75 y.o., mean 29.9 y.o., std. dev. 7.9 y.o. | Total dur.: 6 hours; num. of utt.: 6300; avg. utt. dur.: 3.1 s | PCM WAV, 16 kHz, Mono/16-bit | Gender, Age |
| aGender* | A | German | 335 females, 329 males, 106 children; aged 7-80 y.o., mean 39.3 y.o., std. dev. 21.4 y.o. | Total dur.: 38.1 hours; num. of utt.: 53074; avg. utt. dur.: 2.5 s | PCM WAV, 8 kHz, Mono/8-bit | Gender, Age |
| CommonVoice | A | English, German, Russian | 4352 females, 16828 males; aged 13-99 y.o. | Total dur.: 5203 hours; num. of utt.: 3583K; avg. utt. dur.: 5.2 s | MPEG, 48 kHz | Gender, Age group |
| LAGENDA | V | N/A | 42735 females, 41457 males; aged 0-95 y.o., mean 36.8 y.o., std. dev. 21.6 y.o. | 67K images | JPG, various resolutions | Gender, Age |
| IMDB-Clean | V | N/A | 127K females, 158K males; aged 1-95 y.o., mean 37.1 y.o., std. dev. 12.8 y.o. | 296K images | JPG, various resolutions | Gender, Age |
| AFEW* | A, V | English | 451 females, 705 males; aged 5-76 y.o., mean 35.2 y.o., std. dev. 13.4 y.o. | Total dur.: 0.8 hours; num. of utt.: 1156; avg. utt. dur.: 2.5 s | MPEG, 16 kHz; MPEG-4, 720x568, 25 FPS | Gender, Age |
| VoxCeleb2 | A, V | English | 5333 females, 8888 males; aged 10-95 y.o., mean 40.8 y.o., std. dev. 14.1 y.o. | Total dur.: 238 hours; num. of utt.: 109K; avg. utt. dur.: 7.8 s | AAC, 16 kHz; MPEG-4, 224x224, 25 FPS | Gender, Age |
| BRAVE-MASKS | A, V | Russian | 15 females, 15 males; aged 19-86 y.o., mean 40.8 y.o., std. dev. 19.0 y.o. | Total dur.: 21 hours; num. of utt.: 14940; avg. utt. dur.: 5.0 s | PCM WAV, 48 kHz, Mono/16-bit; MPEG-4, 3840x2160, 30/60 FPS | Protective mask type, Gender, Age |

Metadata statistics of the corpora used. Std. dev. refers to standard deviation, avg. to average, utt. to utterance, dur. to duration, and N/A to not available. * means that the distribution of the Test subset is not available.

Experimental Results

| Model | Train subset | TIMIT | aGender | CommonVoice | VoxCeleb2 | BRAVE-MASKS (Dev.) | BRAVE-MASKS (Test) |
|---|---|---|---|---|---|---|---|
| Gender recognition (UAR, %) | | | | | | | |
| Burkhardt et al. (2023) | All corpora | 98.60 | 86.15 | 92.20 | 89.50 | 84.16 | 85.15 |
| W2V2-based | All corpora | 98.60 | 87.17 | 92.59 | 90.00 | 85.14 | 86.22 |
| HuBERT-based | All corpora | 97.97 | 86.99 | 92.22 | 89.83 | 84.95 | 86.21 |
| Age estimation (MAE) | | | | | | | |
| Burkhardt et al. (2023) | All corpora | 7.10 | 10.80 | 11.15 | 10.31 | 13.96 | 13.09 |
| W2V2-based | All corpora | 6.90 | 10.60 | 10.47 | 9.91 | 11.65 | 11.89 |
| HuBERT-based | All corpora | 7.00 | 10.92 | 11.32 | 10.34 | 13.01 | 12.23 |

Audio-based experimental results. "All corpora" includes only the TIMIT, aGender, and CommonVoice corpora. Dev. refers to development.
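All result tables report UAR for the classification tasks and MAE (in years) for age estimation. A minimal scikit-learn sketch of how these metrics can be computed (the labels below are toy values for illustration only):

```python
from sklearn.metrics import mean_absolute_error, recall_score

# Toy labels for illustration only.
gender_true = [0, 0, 1, 1, 1]             # 0 = female, 1 = male
gender_pred = [0, 1, 1, 1, 1]
age_true = [23.0, 31.0, 45.0, 60.0, 19.0]
age_pred = [25.0, 28.0, 50.0, 55.0, 22.0]

# UAR is the unweighted (macro-averaged) recall over classes.
uar = recall_score(gender_true, gender_pred, average="macro") * 100
# MAE is the mean absolute error between predicted and true ages, in years.
mae = mean_absolute_error(age_true, age_pred)

print(f"UAR = {uar:.2f}%, MAE = {mae:.2f}")  # UAR = 75.00%, MAE = 3.60
```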

| Model | Train subset | LAGENDA | AFEW | IMDB-Clean (Dev.) | IMDB-Clean (Test) | VoxCeleb2 | BRAVE-MASKS (Dev.) | BRAVE-MASKS (Test) |
|---|---|---|---|---|---|---|---|---|
| Gender recognition (UAR, %) | | | | | | | | |
| Kuprashevich et al. (2023) | IMDB-Clean | 91.11 | 94.60 | 99.70 | 99.40 | 98.35 | 87.84 | 93.39 |
| SDPSA-based | All corpora | 92.89 | 95.16 | 98.49 | 98.37 | 98.37 | 88.12 | 94.44 |
| GSA-based | All corpora | 92.72 | 95.41 | 98.70 | 98.46 | 98.25 | 89.83 | 90.16 |
| Age estimation (MAE) | | | | | | | | |
| Kuprashevich et al. (2023) | IMDB-Clean | 5.40 | 6.09 | 3.48 | 4.28 | 7.32 | 9.60 | 9.22 |
| SDPSA-based | All corpora | 5.18 | 5.62 | 5.23 | 5.47 | 5.97 | 8.86 | 8.71 |
| GSA-based | All corpora | 5.05 | 5.59 | 5.03 | 5.43 | 6.59 | 9.65 | 9.42 |

Video-based experimental results. "All corpora" includes only the LAGENDA, AFEW, and IMDB-Clean corpora. Dev. refers to development.

| Method | VoxCeleb2 | BRAVE-MASKS |
|---|---|---|
| Gender recognition (UAR, %) | | |
| Audio-based | 90.00 | 86.22 |
| Video-based | 98.37 | 94.44 |
| Early fusion | 98.90 | 94.80 |
| Intermediate fusion | 99.11 | 94.95 |
| Late fusion | 99.02 | 97.21 |
| Age estimation (MAE) | | |
| Audio-based | 9.91 | 11.89 |
| Video-based | 5.97 | 8.71 |
| Early fusion | 5.80 | 9.00 |
| Intermediate fusion | 5.68 | 8.73 |
| Late fusion | 6.51 | 7.23 |

Experimental results on audio-visual multi-task gender recognition (classification task, 2 classes) and age estimation (regression task) on the Test subsets of the VoxCeleb2 and BRAVE-MASKS corpora.
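The fusion strategies compared above differ in where the modalities are combined: early fusion typically concatenates low-level features, intermediate fusion merges hidden representations, and late fusion combines the outputs of the unimodal models. As one hedged illustration (not the authors' exact implementation; the output keys and equal weighting are assumptions), late fusion can be realized by averaging per-modality predictions:

```python
import torch

def late_fusion(audio_out: dict, video_out: dict, w_audio: float = 0.5) -> dict:
    """Illustrative late fusion: combine predictions after each unimodal model
    has produced its own outputs (keys and weights are assumptions)."""
    w_video = 1.0 - w_audio
    gender_probs = w_audio * audio_out["gender_probs"] + w_video * video_out["gender_probs"]
    age = w_audio * audio_out["age"] + w_video * video_out["age"]
    return {"gender_probs": gender_probs, "age": age}

# Toy example with a batch of one sample.
audio_out = {"gender_probs": torch.tensor([[0.30, 0.70]]), "age": torch.tensor([38.0])}
video_out = {"gender_probs": torch.tensor([[0.10, 0.90]]), "age": torch.tensor([42.0])}
print(late_fusion(audio_out, video_out))
```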

Multi-Task Cross-Modal Attention Visualization

Cross-modal attention visualization

Protective Mask Type Influence on Gender Recognition and Age Estimation Performance

| Data | All masks | No mask | Tissue mask | Medical mask | Protective mask (FFP2/FFP3) | Respirator | Protective face shield |
|---|---|---|---|---|---|---|---|
| Gender recognition (UAR, %) | | | | | | | |
| Audio-visual | 94.95 | 98.33 | 96.38 | 97.69 | 98.02 | 90.56 | 89.00 |
| Audio-only | 93.98 | 98.33 | 96.77 | 94.75 | 97.80 | 88.61 | 89.00 |
| Video-only | 93.06 | 97.67 | 96.11 | 97.69 | 94.74 | 89.91 | 89.00 |
| Age estimation (MAE) | | | | | | | |
| Audio-visual | 8.73 | 4.83 | 7.74 | 11.12 | 10.44 | 15.27 | 5.63 |
| Audio-only | 11.91 | 10.93 | 12.09 | 15.39 | 12.04 | 17.44 | 8.85 |
| Video-only | 8.98 | 5.92 | 8.78 | 12.29 | 11.06 | 19.87 | 6.62 |

Influence of each protective mask type on gender recognition (classification task, 2 classes) and age estimation (regression task) performance on the Test subset of the BRAVE-MASKS corpus.

Possible Application of the Proposed Approach for Optimizing HR Processes

Possible application