Automatic emotion recognition methods are critical to human-computer interaction. However, current methods suffer from limited applicability because they tend to overfit a single training corpus, which reduces their real-world effectiveness on new, unseen corpora. We propose the first multi-corpus multimodal emotion recognition method with high generalizability, evaluated through a leave-one-corpus-out protocol. The method uses three fine-tuned encoders (one per modality: audio, video, and text) and a decoder that employs context-independent gated attention to combine features from all three modalities. The research is conducted on four benchmark corpora: MOSEI, MELD, IEMOCAP, and AFEW. The proposed method achieves state-of-the-art results on these corpora and establishes the first baselines for multi-corpus studies. We demonstrate that, owing to MELD's rich emotional expressiveness across all three modalities, models trained on it exhibit the best generalization ability when applied to the other corpora. We also reveal that the AFEW annotation correlates best with the annotations of MOSEI, MELD, and IEMOCAP, and shows the best cross-corpus performance, as it is consistent with the widely accepted concept of basic emotions.
| Corpus | Dur., h | Dur. range, sec (Min / Max / Mean) | Train | Val. | Test | Emotion classes | Class imbalance (Mean / STD) | Evaluation measures |
|---|---|---|---|---|---|---|---|---|
| IEMOCAP | 9:24 | 0.6 / 34.1 / 4.6 | 5229 | 581 | 1623 | 6 | 959.7 / 344.4 | Acc / F1 / UAR |
| MOSEI | 45:51 | 0.2 / 108.9 / 7.2 | 16216 | 1835 | 4625 | 6 | 3734.3 / 2441.1 | Acc / wAcc / F1 / UAR |
| MELD | 12:04 | 0.2 / 304.9 / 3.2 | 9989 | 1109 | 2610 | 7 | 1426.9 / 1427.0 | Acc / F1 / UAR |
| AFEW | 0:48 | 0.5 / 6.3 / 2.5 | 773 | 383 | — | 7 | 110.4 / 31.1 | Acc / F1 / UAR |
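As a reference for the last column, the measures shared by all four corpora map directly onto standard scikit-learn calls; the snippet below is a minimal sketch with toy labels (the corpus-specific weighted accuracy, wAcc, used for MOSEI is omitted here).

```python
# Minimal sketch of the shared evaluation measures (Acc, macro-F1, UAR)
# using scikit-learn; the toy labels are illustrative only.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0, 0, 1, 2, 2, 2]  # ground-truth emotion class indices
y_pred = [0, 1, 1, 2, 2, 0]  # model predictions

acc = accuracy_score(y_true, y_pred)                 # overall accuracy
f1 = f1_score(y_true, y_pred, average="macro")       # macro-averaged F1
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall

print(f"Acc={acc:.3f}  F1={f1:.3f}  UAR={uar:.3f}")
```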
**IEMOCAP**

| Modality | Neutral | Happy | Sad | Angry | Excited | Frustrated | UAR |
|---|---|---|---|---|---|---|---|
| A+V+T | 76.6 | 51.7 | 78.0 | 78.8 | 75.3 | 69.6 | 71.6 |
| *Setup 1. Modality feature nullification* | | | | | | | |
| w/o A | 69.5 | 42.7 | 74.3 | 60.0 β | 72.6 | 68.5 | 64.6 |
| w/o V | 57.0 β | 58.0 β | 69.0 | 72.4 | 51.5 β | 77.4 β | 64.2 |
| w/o T | 82.8 β | 14.7 β | 66.1 β | 77.6 | 72.9 | 3.4 β | 52.9 β |
| *Setup 2. Modality exclusion* | | | | | | | |
| w/o A | 69.3 | 50.3 | 77.1 | 74.1 | 70.2 | 66.1 | 67.9 |
| w/o V | 66.9 β | 48.3 β | 73.9 β | 75.9 | 65.6 | 66.7 | 66.2 |
| w/o T | 71.9 | 64.3 β | 85.7 β | 56.5 β | 62.2 β | 54.9 β | 65.9 β |

**MOSEI**

| Modality | Neutral | Happy / Joy | Sad | Angry | Surprise | Fear | Disgust | UAR |
|---|---|---|---|---|---|---|---|---|
| A+V+T | — | 77.4 | 57.3 | 56.9 | 48.4 | 54.0 | 68.9 | 60.5 |
| *Setup 1. Modality feature nullification* | | | | | | | | |
| w/o A | — | 72.8 | 52.7 β | 69.7 β | 44.3 β | 52.2 | 68.1 | 60.0 |
| w/o V | — | 71.6 β | 53.2 | 55.7 | 50.7 β | 51.7 | 72.5 β | 59.2 |
| w/o T | — | 72.2 | 62.9 β | 45.8 β | 45.0 | 49.9 β | 56.3 β | 55.4 β |
| *Setup 2. Modality exclusion* | | | | | | | | |
| w/o A | — | 75.1 | 63.1 β | 60.3 β | 55.2 β | 65.7 β | 68.8 | 64.7 β |
| w/o V | — | 70.0 β | 58.8 β | 52.8 β | 44.1 β | 57.1 β | 63.2 | 57.7 |
| w/o T | — | 73.7 | 60.4 β | 56.4 | 49.8 β | 44.2 β | 60.7 β | 57.5 β |

**MELD**

| Modality | Neutral | Happy | Sad | Angry | Surprise | Fear | Disgust | UAR |
|---|---|---|---|---|---|---|---|---|
| A+V+T | 68.7 | 65.9 | 53.8 | 59.7 | 69.0 | 44.0 | 42.6 | 57.7 |
| *Setup 1. Modality feature nullification* | | | | | | | | |
| w/o A | 64.1 | 67.4 β | 52.4 | 48.7 | 71.5 β | 40.0 | 30.9 β | 53.6 |
| w/o V | 58.2 | 65.9 | 59.1 β | 38.6 β | 70.5 β | 40.0 | 35.3 | 52.5 |
| w/o T | 37.9 β | 2.7 β | 10.6 β | 84.9 β | 1.1 β | 20.0 β | 41.2 | 28.3 β |
| *Setup 2. Modality exclusion* | | | | | | | | |
| w/o A | 67.6 | 64.9 | 44.7 | 49.9 | 66.5 | 32.0 | 42.6 | 52.6 |
| w/o V | 66.9 | 37.8 β | 43.3 β | 54.2 | 49.8 | 20.0 | 25.0 | 42.4 |
| w/o T | 31.8 β | 43.3 | 55.3 β | 46.4 β | 13.9 β | 16.0 β | 23.5 β | 32.9 β |

**AFEW**

| Modality | Neutral | Happy | Sad | Angry | Surprise | Fear | Disgust | UAR |
|---|---|---|---|---|---|---|---|---|
| A+V+T | 84.1 | 88.7 | 74.2 | 70.3 | 48.9 | 60.9 | 53.7 | 68.7 |
| *Setup 1. Modality feature nullification* | | | | | | | | |
| w/o A | 76.2 | 90.3 β | 53.2 | 43.8 β | 51.1 β | 54.3 | 46.3 | 59.3 |
| w/o V | 49.2 β | 40.3 β | 35.5 β | 54.7 | 26.7 β | 50.0 | 29.3 β | 40.8 β |
| w/o T | 60.3 | 74.4 | 75.8 β | 75.0 β | 42.2 | 32.6 β | 34.1 | 56.8 |
| *Setup 2. Modality exclusion* | | | | | | | | |
| w/o A | 68.3 | 61.3 | 41.9 β | 50.0 β | 53.3 β | 45.7 | 51.2 | 53.1 |
| w/o V | 60.3 β | 37.1 β | 41.9 β | 70.3 | 20.0 β | 43.5 β | 14.6 β | 41.1 β |
| w/o T | 69.8 | 79.0 | 74.2 | 70.3 | 62.2 β | 43.5 β | 43.9 | 63.3 |
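The two setups can be read as follows: Setup 1 keeps the trained tri-modal model intact and feeds zero vectors in place of the ablated modality's features at inference time, whereas Setup 2 removes the modality branch entirely, so the model is trained and evaluated without it. A minimal sketch of the difference, with a stand-in fusion network and hypothetical feature shapes (not the released implementation):

```python
import torch
import torch.nn as nn

# Hypothetical pre-extracted features: (batch, feature_dim) per modality.
audio, video, text = (torch.randn(8, 256) for _ in range(3))

# Stand-in for a trained tri-modal fusion network (illustrative only).
model = nn.Sequential(nn.Linear(3 * 256, 7))

def forward(a, v, t):
    return model(torch.cat([a, v, t], dim=-1))

# Setup 1 ("w/o A"): keep the input shape, zero out the audio features,
# and reuse the full model without retraining.
logits_wo_audio = forward(torch.zeros_like(audio), video, text)

# Setup 2 ("w/o A") would instead drop the audio branch from the
# architecture and retrain on video + text only.
print(logits_wo_audio.shape)  # torch.Size([8, 7])
```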
| Training corpus | IEMOCAP (test) | MOSEI (test) | MELD (test) | AFEW (test) | Average |
|---|---|---|---|---|---|
| IEMOCAP | 71.3 | 15.5 | 22.9 | 34.1 | 24.7 |
| MOSEI | 38.2 | 38.5 | 34.3 | 44.3 | 38.9 |
| MELD | 42.3 | 30.4 | 62.0 | 42.1 | 38.3 |
| AFEW | 47.3 | 28.4 | 33.3 | 79.3 | 37.6 |
| Average | 42.6 (Δ 28.7) | 24.8 (Δ 13.7) | 30.2 (Δ 31.8) | 40.2 (Δ 39.1) | — |
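The Δ values in the last row quantify the single- to cross-corpus gap: for each test subset, Δ is the within-corpus (diagonal) UAR minus the column average over the three cross-corpus entries. For the IEMOCAP test subset, for example, (38.2 + 42.3 + 47.3) / 3 = 42.6 and Δ = 71.3 − 42.6 = 28.7.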
| Encoder | W/o IEMOCAP | LOCO IEMOCAP | W/o MOSEI | LOCO MOSEI | W/o MELD | LOCO MELD | W/o AFEW | LOCO AFEW |
|---|---|---|---|---|---|---|---|---|
| IEMOCAP | — | — | 59.7 | 31.7 | 49.6 | 32.8 | 50.7 | 42.9 β |
| MOSEI | 43.0 β | 44.5 | — | — | 43.2 β | 33.0 | 49.4 | 48.6 |
| MELD | 50.1 | 44.4 | 59.9 | 31.6 | — | — | 53.1 | 49.0 |
| AFEW | 48.2 | 44.1 | 59.8 | 28.5 β | 52.3 | 32.2 | — | — |
| Average | 47.1 | 44.3 | 59.8 | 30.6 | 48.4 | 32.7 | 51.1 | 46.8 |
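For clarity, the leave-one-corpus-out (LOCO) protocol behind these columns can be sketched as follows; this is a minimal sketch in which `train_model` and `evaluate_uar` are hypothetical placeholders, not parts of the released implementation.

```python
# Minimal sketch of the leave-one-corpus-out (LOCO) protocol: train on the
# union of all corpora except one, then test on the held-out corpus.
CORPORA = ["IEMOCAP", "MOSEI", "MELD", "AFEW"]

def loco_evaluation(datasets, train_model, evaluate_uar):
    """datasets maps corpus name -> data; the two callables are placeholders."""
    scores = {}
    for held_out in CORPORA:
        train_data = [datasets[c] for c in CORPORA if c != held_out]
        model = train_model(train_data)  # multi-corpus training
        scores[held_out] = evaluate_uar(model, datasets[held_out])  # unseen corpus
    return scores
```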
| Method | Year | Corpus | Modality | wAcc | wF1 | Acc | F1 |
|---|---|---|---|---|---|---|---|
| Le et al. | 2023 | MOSEI | A+V+T | 67.8 | — | — | 47.6 |
| MAGDRA | 2024 | | | — | — | 48.8 | 56.3 |
| TAILOR | 2022 | | | — | — | 48.8 | 56.9 |
| CARAT | 2024 | | | 66.4 | 78.8 | 49.4 | 58.1 |
| Ours (w/o WL) | 2024 | | | 61.4 | 80.4 | 49.7 | 54.0 |
| Ours (WL) | 2024 | | | 69.3 | 77.7 | 46.2 | 53.4 |
| Method | Year | Corpus | Modality | Acc | wF1 | mF1 | UAR |
|---|---|---|---|---|---|---|---|
| TelME | 2024 | IEMOCAP | A+V+T | — | 70.5 | — | 68.6 |
| CORECT | 2023 | | | 69.9 | 70.0 | — | 70.9 |
| Yao et al. | 2024 | | | 71.2 | 71.2 | — | — |
| M³Net | 2023 | | | 72.5 | 72.5 | 71.5 | — |
| Ours (w/o WL) | 2024 | | | 71.9 | 71.7 | 70.5 | 70.2 |
| Ours (WL) | 2024 | | | 72.9 | 72.8 | 72.0 | 71.6 |
| SDT | 2023 | MELD | A+V+T | 66.6 | 67.5 | 49.8 | 48.0 |
| M³Net | 2023 | | | 68.3 | 67.1 | 51.0 | — |
| TelME | 2024 | | | — | 67.4 | 51.4 | 50.0 |
| Yao et al. | 2024 | | | 67.0 | 66.2 | — | — |
| Ours (w/o WL) | 2024 | | | 68.8 | 67.7 | 50.8 | 48.8 |
| Ours (WL) | 2024 | | | 64.8 | 65.7 | 54.5 | 57.7 |
| Nguyen et al. | 2019 | AFEW | A+V | 62.3 | — | — | — |
| Zhou et al. | 2019 | | | 65.5 | — | — | — |
| Abdrahimov et al. | 2022 | | | 67.8 | — | — | 62.0 |
| Ours (w/o WL) | 2024 | | A+V+T | 70.2 | 69.6 | 67.2 | 67.3 |
| Ours (WL) | 2024 | | | 70.8 | 70.5 | 68.7 | 68.7 |
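In the "Ours" rows, WL plausibly stands for a weighted loss that counteracts the class imbalance reported in the corpus statistics table; this expansion is our assumption, not a detail stated in the tables. Under that assumption, a common PyTorch formulation of class-weighted training is sketched below with hypothetical class counts.

```python
import torch
import torch.nn as nn

# Hypothetical per-class training counts for 7 imbalanced emotion classes
# (made-up numbers; not taken from any of the corpora above).
counts = torch.tensor([5000., 2000., 800., 1200., 1300., 300., 400.])

# Inverse-frequency class weights, normalized so their mean is 1; an assumed
# scheme, shown only to illustrate weighted-loss ("WL") training.
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 7)           # model outputs for a batch
targets = torch.randint(0, 7, (8,))  # ground-truth class indices
loss = criterion(logits, targets)    # rare classes contribute more to the loss
```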
In this paper, we propose the first multi-corpus multimodal emotion recognition (MER) method designed to exhibit high generalizability to new, unseen data. Our method incorporates three fine-tuned encoders that extract features from the audio, video, and text modalities. To characterize the audio and video signals without relying on temporal context, the method extracts statistical information from both modalities, enabling context-independent analysis. A gated attention mechanism then efficiently fuses information from all three modalities. The method has been evaluated on IEMOCAP, MOSEI, MELD, and AFEW in both single- and multi-corpus setups, and new baselines have been obtained on all four corpora. Despite these results, our method, developed through multi-corpus learning and cross-corpus testing, does not reach the performance of models trained on a single corpus. Our findings suggest that this gap may be due to single-corpus models overfitting their training corpus. Additionally, the cross-corpus setup faces challenges such as discrepancies in emotional expressiveness between corpora and inconsistencies in their annotations. These two challenges should become the focus of further research.
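To make the two mechanisms above concrete, the sketch below illustrates (i) statistical pooling of a frame-level feature sequence into a fixed-size, context-independent vector (mean and standard deviation over time) and (ii) a simple sigmoid-gated fusion of the three modality vectors. All layer sizes and the exact gating form are assumptions made for illustration; this is not the released implementation of our gated attention decoder.

```python
import torch
import torch.nn as nn

class StatPooling(nn.Module):
    """Collapse a (batch, time, dim) sequence into concatenated mean and std
    statistics, removing the dependence on sequence length and context."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)

class GatedFusion(nn.Module):
    """Sigmoid-gated combination of modality features; a simplified stand-in
    for gated attention, with assumed dimensions."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Sigmoid())
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, a, v, t):
        concat = torch.cat([a, v, t], dim=-1)
        g = self.gate(concat)  # one scalar gate per modality
        gated = torch.cat(
            [g[:, 0:1] * a, g[:, 1:2] * v, g[:, 2:3] * t], dim=-1)
        return self.classifier(gated)

pool = StatPooling()
audio = pool(torch.randn(4, 100, 128))  # (4, 256): mean and std over time
video = pool(torch.randn(4, 50, 128))   # sequence length differs; output doesn't
text = torch.randn(4, 256)              # e.g., a sentence-level embedding
logits = GatedFusion(dim=256, num_classes=7)(audio, video, text)
print(logits.shape)  # torch.Size([4, 7])
```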