Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion

1 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg, Russia
2 ITMO University, St. Petersburg, Russia

Abstract

Affective state recognition is a challenging task that requires large amounts of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor generalization across corpora, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both the hierarchical uni-modal and the fused multi-modal feature levels, enhancing intra- and inter-modal relationships across affective states and modalities. Additionally, to optimize the performance of the proposed approach, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion, with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach achieves a mean weighted F1-score (mWF) of 88.6% for emotion recognition and a weighted F1-score (WF) of 84.8% for sentiment recognition. On the Test subset of the MELD corpus, it achieves WF of 49.6% and 60.0%, respectively; on the Test subset of the RAMAS corpus, WF of 71.8% and 90.0%, respectively. Finally, we compare the performance of the proposed approach with that of state-of-the-art (SOTA) approaches.

Pipeline of the Proposed Approach

Proposed approach

General pipeline of the proposed approach
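The pipeline combines three uni-modal branches (audio, video, text), a multi-modal fusion block, and two task heads trained jointly for emotion and sentiment. The sketch below is a minimal PyTorch illustration of that multi-task wiring; the plain linear encoders, concatenation fusion, equal loss weights, and dimensions are placeholder assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class MultiTaskAffectPipeline(nn.Module):
    """Minimal multi-task sketch: three uni-modal encoders, a fusion block,
    and separate emotion/sentiment heads. Dimensions, the concatenation
    fusion, and the plain linear encoders are illustrative assumptions."""
    def __init__(self, d_model=256, n_emotions=7, n_sentiments=3):
        super().__init__()
        # Stand-ins for the audio/video/text branches; each real branch wraps a
        # pre-trained feature extractor plus a temporal encoder (see below).
        self.audio_enc = nn.LazyLinear(d_model)
        self.video_enc = nn.LazyLinear(d_model)
        self.text_enc = nn.LazyLinear(d_model)
        self.fusion = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.ReLU())
        self.emotion_head = nn.Linear(d_model, n_emotions)      # e.g. 7 emotion classes
        self.sentiment_head = nn.Linear(d_model, n_sentiments)  # e.g. 3 sentiment classes

    def forward(self, audio_feats, video_feats, text_feats):
        fused = self.fusion(torch.cat([self.audio_enc(audio_feats),
                                       self.video_enc(video_feats),
                                       self.text_enc(text_feats)], dim=-1))
        return self.emotion_head(fused), self.sentiment_head(fused)

def multitask_loss(emo_logits, sen_logits, emo_labels, sen_labels):
    """Joint objective for emotion and sentiment; equal weighting is an assumption."""
    ce = nn.CrossEntropyLoss()
    return ce(emo_logits, emo_labels) + ce(sen_logits, sen_labels)
```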

Uni-modal Affective States Recognition

Audio model

Audio-based model for affective states recognition

Video model

Video-based model for affective states recognition

Text model

Text-based model for affective states recognition
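Each uni-modal model turns a sequence of frame- or token-level features into an utterance-level representation via a temporal encoder; the experiments below compare Transformer-based, Mamba, and xLSTM encoders in exactly this slot. The sketch below shows one plausible shape of such a branch, with a standard Transformer encoder standing in for the temporal encoder; the feature dimensions and mean pooling are assumptions.

```python
import torch
import torch.nn as nn

class UnimodalBranch(nn.Module):
    """Sketch of a uni-modal branch: a temporal encoder contextualises a sequence
    of frame-/token-level features, which is then pooled to an utterance-level
    vector. A standard Transformer encoder fills the temporal-encoder slot here;
    Mamba or xLSTM blocks from their respective packages could be dropped into
    the same slot (assumption), which is the comparison reported in the tables."""
    def __init__(self, d_in=512, d_model=256, num_layers=2, nhead=4):
        super().__init__()
        self.proj = nn.Linear(d_in, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                        # x: (batch, time, d_in)
        h = self.temporal_encoder(self.proj(x))  # (batch, time, d_model)
        return h, h.mean(dim=1)                  # sequence features + pooled utterance vector
```

When a branch is trained on its own, task heads like those in the pipeline sketch above attach to the pooled vector; the sequence-level features are what the fusion strategies below would consume (again an assumption about the interface).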

Multi-Modal Affective States Recognition

Triple Fusion Strategy

Triple Fusion Strategy

Double Fusion Strategy

Double Fusion Strategy
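As described in the abstract, the Triple Fusion Strategy performs cross-modal integration both over the hierarchical uni-modal features and over the already-fused multi-modal features, while the Double Fusion Strategy stops after the pairwise stage. The sketch below is one plausible reading of that description using cross-modal attention; the actual attention layout, dimensions, and pooling of the published models may differ.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Query one modality with another via multi-head cross-attention."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)

class TripleFusion(nn.Module):
    """Illustrative fusion: (1) pairwise cross-modal integration of uni-modal
    features (the 'double' stage), (2) a second integration step between the
    fused multi-modal representation and the uni-modal features (the 'triple'
    stage). This layout is an assumption based on the description above."""
    def __init__(self, d_model=256):
        super().__init__()
        self.av = CrossModalBlock(d_model)      # audio attends to video
        self.at = CrossModalBlock(d_model)      # audio attends to text
        self.vt = CrossModalBlock(d_model)      # video attends to text
        self.mix = nn.Linear(3 * d_model, d_model)
        self.refine = CrossModalBlock(d_model)  # fused features attend back to uni-modal ones

    def forward(self, a, v, t):                 # each: (batch, time, d_model)
        fused = self.mix(torch.cat([self.av(a, v), self.at(a, t), self.vt(v, t)], dim=-1))
        refined = self.refine(fused, torch.cat([a, v, t], dim=1))
        return refined.mean(dim=1)              # utterance-level multi-modal embedding
```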

Label Encoder

Backbone of double and triple fusion strategies with label encoders
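In the label-encoder variants (DFSLE and TFSLE in the result tables), the class labels are themselves encoded and interact with the fused features. The sketch below shows one common way to realise such a head, with learnable class embeddings attended by the fused representation and similarity-based logits; this is an illustrative assumption, not the exact published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelEncoderHead(nn.Module):
    """Illustrative label-encoder head (assumption): learnable class embeddings
    are attended by the fused multi-modal feature, and each class is scored by
    cosine similarity between the attended feature and its class embedding."""
    def __init__(self, d_model=256, n_classes=7, nhead=4):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(n_classes, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.scale = nn.Parameter(torch.tensor(10.0))           # logit temperature

    def forward(self, fused):                                   # fused: (batch, d_model)
        labels = self.label_emb.unsqueeze(0).expand(fused.size(0), -1, -1)
        ctx, _ = self.attn(fused.unsqueeze(1), labels, labels)  # feature attends to labels
        feat = F.normalize(ctx.squeeze(1), dim=-1)
        cls = F.normalize(self.label_emb, dim=-1)
        return self.scale * (feat @ cls.t())                    # (batch, n_classes) logits
```

Two such heads, one over the emotion classes and one over the sentiment classes, would replace the plain linear heads in the pipeline sketch above (assumption).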

Research Data

Corpus | Language, speech type | Speaker distribution | Duration | Audio Format | Video Format
RAMAS | Russian, acted | 5 women, 5 men, aged 18-28 years | Total dur.: 5 hours; Num. of utt.: 1792; Avg. utt. dur.: 9.67 s; Avg. num. of words in utt.: 14 | PCM WAV, 44.1 kHz, Mono/16-bits | MPEG-4, 1920x1080, 25 FPS
CMU-MOSEI | English, in-the-wild | 430 women, 570 men | Total dur.: 46 hours; Num. of utt.: 22788; Avg. utt. dur.: 7.24 s; Avg. num. of words in utt.: 8 | PCM WAV, 16 kHz, Mono/16-bits | MPEG-4, 1280x720, 30 FPS
MELD | English, acted | 6 main speakers: 3 women, 3 men | Total dur.: 12 hours; Num. of utt.: 13706; Avg. utt. dur.: 3.18 s; Avg. num. of words in utt.: 19 | AAC, 48 kHz | MPEG-4, 1280x720, 24 FPS

Metadata statistics of the corpora used. Avg. refers to average, utt. to utterance, dur. to duration, N/A to not available.

Experimental Results

Temporal Encoder | Emotion: A7 UAR7 WF7 MF7 | Sentiment: A3 UAR3 WF3 MF3 | Average/δ
Single-Corpus trained model on RAMAS
EW2V 40.7/- 28.7/- 35.6/- 28.4/- 73.3/- 49.6/- 75.2/- 50.9/- 47.8/-
Transformer 52.3/11.6 41.7/13.1 51.3/15.7 43.0/14.6 75.2/1.9 50.9/1.3 75.8/0.6 51.3/0.4 55.2/7.4
Mamba 48.4/7.8 38.6/9.9 46.9/11.3 39.8/11.3 73.6/0.4 49.9/0.3 74.1/-1.0 50.2/-0.7 52.7/4.9
xLSTM 49.6/8.9 39.4/10.7 49.7/14.1 42.1/13.6 74.8/1.6 50.7/1.0 76.5/1.4 51.8/0.9 54.3/6.5
Single-Corpus trained model on MELD
EW2V 47.6/- 15.3/- 33.8/- 11.6/- 49.1/- 38.6/- 42.9/- 34.3/- 34.1/-
Transformer 46.1/-1.5 18.5/3.3 38.1/4.3 16.4/4.9 50.2/1.1 42.8/4.3 47.9/5.0 42.9/8.6 37.9/3.7
Mamba 46.8/-0.8 18.8/3.5 39.2/5.4 18.1/6.5 50.2/1.1 42.0/3.5 47.3/4.5 41.7/7.4 38.0/3.9
xLSTM 48.1/0.5 17.9/2.6 38.1/4.3 16.1/4.5 51.5/2.5 41.6/3.0 46.9/4.0 40.1/5.8 37.5/3.4
Single-Corpus trained model on CMU-MOSEI
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
EW2V 80.2/- 55.3/- 76.4/- 54.4/- 57.9/- 51.0/- 56.1/- 50.7/- 60.3/-
Transformer 80.6/0.4 56.7/1.4 77.4/1.0 56.3/1.9 58.7/0.8 52.1/1.1 57.0/0.9 51.7/1.0 61.3/1.1
Mamba 80.6/0.4 56.2/0.9 77.2/0.7 55.6/1.2 59.8/1.9 52.2/1.1 57.4/1.3 51.7/1.1 61.3/1.1
xLSTM 80.8/0.5 55.7/0.3 76.9/0.4 54.7/0.3 61.0/3.1 53.2/2.1 58.0/1.9 51.8/1.2 61.5/1.2
Multi-Corpus trained model using Transformer encoder
Corpus A7 UAR7 WF7 MF7 A3 UAR3 WF3 MF3
RAMAS 42.6/-9.7 35.7/-6.0 37.9/-13.3 31.8/-11.2 70.9/-4.3 64.2/13.3 72.2/-3.5 54.5/3.2 51.2/-3.9
MELD 41.2/-4.9 19.1/0.5 36.7/-1.3 18.0/1.6 46.6/-3.6 40.8/-2.0 45.3/-2.6 41.0/-1.8 36.1/-1.8
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
CMU-MOSEI 78.0/-2.6 57.0/0.3 75.9/-1.5 56.1/-0.2 57.3/-1.4 51.8/-0.3 56.6/-0.4 51.9/0.2 60.6/-0.7

Experimental results obtained for the audio-based affective states recognition. Performance measure superscript shows the number of classes.
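For reference, the measures used throughout the result tables follow their standard definitions: A is accuracy, UAR is unweighted average recall (macro recall), WF is weighted F1, and MF is macro F1; the "m" prefix on the CMU-MOSEI emotion columns (e.g. mWF, the mean weighted F1-score named in the abstract) denotes averaging over the six emotion classes. Assuming those standard definitions, the single-label measures can be computed as in the sketch below.

```python
# Minimal sketch of the single-label measures, assuming the standard definitions:
# A = accuracy, UAR = macro-averaged recall, WF = weighted F1, MF = macro F1
# (reported in percent in the tables).
from sklearn.metrics import accuracy_score, recall_score, f1_score

def affect_metrics(y_true, y_pred):
    return {
        "A":   100.0 * accuracy_score(y_true, y_pred),
        "UAR": 100.0 * recall_score(y_true, y_pred, average="macro"),
        "WF":  100.0 * f1_score(y_true, y_pred, average="weighted"),
        "MF":  100.0 * f1_score(y_true, y_pred, average="macro"),
    }

# Example: 3-class sentiment predictions.
print(affect_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]))
```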

Temporal Encoder | Emotion: A7 UAR7 WF7 MF7 | Sentiment: A3 UAR3 WF3 MF3 | Average/δ
Single-Corpus trained model on RAMAS
BiLSTM 75.6/-- 70.0/-- 75.8/-- 67.5/-- 94.2/-- 79.9/-- 94.4/-- 76.7/-- 79.3/--
Transformer 74.4/-1.2 65.0/-5.0 74.9/-0.9 64.6/-2.9 94.2/-- 88.0/8.1 94.7/0.3 79.0/2.3 79.4/0.1
Mamba 71.3/-4.3 64.1/-5.9 71.6/-4.2 62.0/-5.5 90.7/-3.5 77.6/-2.3 92.1/-2.3 69.5/-7.2 74.9/-4.4
xLSTM 69.8/-5.8 63.0/-7.0 69.8/-6.0 60.0/-7.5 89.1/-5.1 76.5/-3.4 90.2/-4.2 69.1/-7.6 73.4/-5.9
Single-Corpus trained model on MELD
BiLSTM 28.8/-- 23.9/-- 29.4/-- 20.2/-- 47.9/-- 47.7/-- 48.2/-- 46.7/-- 36.6/--
Transformer 28.1/-0.7 25.9/2.0 30.6/1.2 22.2/2.0 48.1/0.2 48.6/0.9 48.6/0.4 47.2/0.5 37.4/0.8
Mamba 29.8/1.0 23.8/-0.1 32.2/2.8 20.1/-0.1 50.2/2.3 48.1/0.4 50.2/2.0 48.0/1.3 37.8/1.2
xLSTM 28.2/-0.6 25.4/1.5 29.1/-0.3 21.0/0.8 48.2/0.3 48.8/1.1 48.7/0.5 47.2/0.5 37.1/0.5
Single-Corpus trained model on CMU-MOSEI
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
BiLSTM 78.8/-- 55.5/-- 75.0/-- 54.4/-- 48.9/-- 42.6/-- 47.7/-- 42.5/-- 55.9/--
Transformer 80.5/1.7 54.1/-1.4 75.9/0.9 52.5/-1.9 49.9/1.0 45.2/2.6 50.0/2.3 45.3/2.8 56.7/0.8
Mamba 80.1/1.3 54.2/-1.3 75.9/0.9 52.7/-1.7 48.3/-0.6 44.0/1.4 48.0/0.3 43.3/0.8 55.8/-0.1
xLSTM 78.7/-0.1 55.4/-0.1 76.2/1.2 55.3/0.9 45.6/-3.3 42.3/-0.3 45.9/-1.8 42.0/-0.5 55.2/-0.7
Multi-Corpus trained model using Transformer encoder
Corpus A7 UAR7 WF7 MF7 A3 UAR3 WF3 MF3
RAMAS 71.7/-2.7 69.0/4.0 71.8/-3.1 64.9/0.3 90.7/-3.5 77.6/-10.4 90.8/-3.9 75.8/-3.2 76.5/-2.9
MELD 35.8/7.7 19.4/-6.5 35.1/4.5 19.4/-2.8 43.4/-4.7 42.0/-6.6 43.9/-4.7 41.4/-5.8 35.1/-2.3
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
CMU-MOSEI 68.7/-11.8 56.0/1.9 71.7/-4.2 55.1/2.6 46.1/-3.8 40.5/-4.7 45.4/-4.6 40.4/-4.9 53.0/-3.7

Experimental results obtained for the video-based affective states recognition. Performance measure superscript shows the number of classes.

Temporal Encoder | Emotion: A7 UAR7 WF7 MF7 | Sentiment: A3 UAR3 WF3 MF3 | Average/δ
Single-Corpus trained model on RAMAS
LMHA 14.0/- 21.4/- 4.1/- 12.9/- 50.0/- 50.0/- 33.7/- 44.3/- 28.8/-
Transformer 24.4/10.4 30.6/9.2 22.0/17.9 22.8/9.9 53.5/3.5 52.4/2.4 56.9/23.2 41.4/-2.9 38.0/9.2
Mamba 28.7/14.7 30.6/9.2 29.0/24.9 27.7/14.8 55.4/5.4 53.7/3.7 58.2/24.5 42.7/-1.6 40.8/12.0
xLSTM 27.1/13.1 30.9/9.5 24.6/20.5 25.4/12.5 51.9/1.9 43.2/-6.8 56.2/22.5 39.4/-4.9 37.3/8.5
Single-Corpus trained model on MELD
LMHA 8.0/- 14.3/- 1.2/- 2.1/- 42.6/- 37.2/- 38.1/- 31.6/- 21.9/-
Transformer 52.4/44.4 36.3/22.0 53.7/52.5 35.1/33.0 63.3/20.7 59.0/21.8 61.4/23.3 57.8/26.2 52.4/30.5
Mamba 46.8/38.8 35.1/20.8 50.2/49.0 33.1/31.0 62.8/20.2 57.9/20.7 61.9/23.8 58.2/26.6 50.8/28.9
xLSTM 42.6/34.6 38.4/24.1 46.7/45.5 33.3/31.2 59.2/16.6 58.1/20.9 59.9/21.8 57.1/25.5 49.4/27.5
Single-Corpus trained model on CMU-MOSEI
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
LMHA 80.0/- 50.2/- 71.9/- 46.1/- 49.0/- 33.3/- 32.3/- 21.9/- 48.1/-
Transformer 80.2/0.2 56.6/6.4 77.1/5.2 55.8/9.7 60.9/11.9 58.1/24.8 61.2/28.9 57.6/35.7 63.4/15.4
Mamba 79.7/-0.3 53.8/3.6 75.3/3.4 51.8/5.7 57.3/8.3 55.7/22.4 57.5/25.2 54.4/32.5 60.7/12.6
xLSTM 80.1/0.1 55.8/5.6 76.4/4.5 55.5/9.4 56.8/7.8 55.4/22.1 57.2/24.9 54.3/32.4 61.4/13.4
Multi-Corpus trained model using Transformer encoder
Corpus A7 UAR7 WF7 MF7 A3 UAR3 WF3 MF3
RAMAS 24.4/0.0 26.1/-4.5 23.0/1.0 20.1/-2.7 54.7/1.2 45.1/-7.3 55.1/-1.8 39.6/-1.8 36.0/-2.0
MELD 52.0/-0.4 39.1/2.8 53.5/-0.2 36.5/1.4 62.5/-0.8 60.7/1.7 62.7/1.3 60.0/2.2 53.4/1.0
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
CMU-MOSEI 63.8/-16.4 61.8/5.2 71.3/-5.8 57.5/1.7 63.1/2.2 54.7/-3.4 60.3/-0.9 54.8/-2.8 60.9/-2.5

Experimental results obtained for the text-based affective states recognition. Performance measure superscript shows the number of classes.

Fusion Strategy | Emotion: A7 UAR7 WF7 MF7 | Sentiment: A3 UAR3 WF3 MF3 | Average/δ
RAMAS corpus
DFS AV 70.9/-0.8 66.3/-2.7 71.8/0.0 63.1/-1.8 86.0/-4.7 82.5/4.9 87.4/-3.4 68.9/-6.9 74.6/-1.9
DFS AT 68.6/-3.1 63.1/-5.9 69.6/-2.2 60.7/-4.2 86.0/-4.7 82.5/4.9 87.8/-3.0 68.2/-7.6 73.3/-3.2
DFS VT 47.3/-24.4 38.0/-31.0 41.6/-30.2 31.0/-33.9 79.8/-10.9 62.1/-15.5 81.6/-9.2 58.5/-17.3 55.0/-21.5
TFS AVT 71.3/-0.4 65.3/-3.7 71.8/0.0 62.8/-2.1 89.1/-1.6 84.6/7.0 90.0/-0.8 73.0/-2.8 76.0/-0.5
DFSLE AV 70.5/-1.2 68.1/-0.9 70.6/-1.2 63.9/-1.0 90.7/0.0 85.6/8.0 91.4/0.6 74.8/-1.0 77.0/0.4
DFSLE AT 39.1/-32.6 35.2/-33.8 26.5/-45.3 18.3/-46.6 74.8/-15.9 58.7/-18.9 77.3/-13.5 54.8/-21.0 48.1/-28.4
DFSLE VT 40.7/-31.0 30.6/-38.4 29.1/-42.7 20.2/-44.7 77.1/-13.6 60.3/-17.3 79.4/-11.4 56.5/-19.3 49.3/-27.3
TFSLE AVT 47.7/-24.0 41.4/-27.6 37.6/-34.2 29.2/-35.7 81.8/-8.9 63.5/-14.1 83.3/-7.5 60.0/-15.8 55.5/-21.0
TFS A 46.9/-24.8 43.5/-25.5 44.1/-27.7 36.4/-28.5 74.4/-16.3 58.5/-19.1 76.6/-14.2 54.5/-21.3 54.4/-22.2
TFS V 72.9/1.2 69.8/0.8 72.8/1.0 65.0/0.1 89.5/-1.2 84.8/7.2 90.4/-0.4 73.3/-2.5 77.3/0.8
TFS T 26.0/-45.7 21.4/-47.6 17.2/-54.6 11.7/-53.2 57.0/-33.7 46.7/-30.9 59.0/-31.8 42.3/-33.5 35.1/-41.4
TFS AV 72.5/0.8 67.5/-1.5 72.8/1.0 65.0/0.1 89.9/-0.8 85.1/7.5 90.5/-0.3 75.1/-0.7 77.3/0.8
TFS AT 48.1/-23.6 42.6/-26.4 47.8/-24.0 39.8/-25.1 75.2/-15.5 59.0/-18.6 77.5/-13.3 55.1/-20.7 55.6/-20.9
TFS VT 72.1/0.4 68.8/-0.2 72.5/0.7 63.9/-1.0 89.1/-1.6 84.6/7.0 90.4/-0.4 71.4/-4.4 76.6/0.1
MELD corpus
DFS AV 49.9/14.1 31.4/12.0 49.6/14.5 31.5/12.1 59.2/15.8 55.9/13.9 58.8/14.9 56.0/14.6 49.0/14.0
DFS AT 51.0/15.2 30.9/11.5 50.2/15.1 30.9/11.5 59.3/15.9 57.1/15.1 59.4/15.5 57.0/15.6 49.5/14.4
DFS VT 51.9/16.1 25.9/6.5 46.9/11.8 23.8/4.4 60.3/16.9 60.0/18.0 60.6/16.7 58.3/16.9 48.5/13.4
TFS AVT 49.3/13.5 28.7/9.3 48.1/13.0 29.0/9.6 58.4/15.0 55.1/13.1 58.3/14.4 55.4/14.0 47.8/12.7
DFSLE AV 41.4/5.6 23.2/3.8 40.8/5.7 23.2/3.8 51.0/7.6 48.1/6.1 50.9/7.0 48.0/6.6 40.8/5.8
DFSLE AT 51.9/16.1 25.2/5.8 45.4/10.3 21.0/1.6 62.7/19.3 58.7/16.7 62.1/18.2 59.2/17.8 48.3/13.2
DFSLE VT 51.9/16.1 25.5/6.1 46.0/10.9 21.7/2.3 61.0/17.6 58.7/16.7 60.8/16.9 58.0/16.6 47.9/12.9
TFSLE AVT 52.7/16.9 25.1/5.7 46.3/11.2 22.5/3.1 60.4/17.0 58.5/16.5 60.2/16.3 57.7/16.3 47.9/12.9
TFS A 41.3/5.5 22.6/3.2 39.9/4.8 22.1/2.7 48.4/5.0 45.1/3.1 48.4/4.5 45.1/3.7 39.1/4.0
TFS V 37.6/1.8 17.9/-1.5 35.3/0.2 18.0/-1.4 42.4/-1.0 40.1/-1.9 42.0/-1.9 39.1/-2.3 34.1/-1.0
TFS T 55.1/19.3 26.7/7.3 49.3/14.2 25.8/6.4 62.6/19.2 58.1/16.1 61.4/17.5 57.8/16.4 49.6/14.6
TFS AV 36.5/0.7 21.5/2.1 37.1/2.0 21.1/1.7 47.5/4.1 44.7/2.7 47.4/3.5 44.4/3.0 37.5/2.5
TFS AT 54.0/18.2 33.0/13.6 52.7/17.6 33.2/13.8 60.5/17.1 57.7/15.7 60.4/16.5 57.3/15.9 51.1/16.1
TFS VT 47.2/11.4 25.9/6.5 45.2/10.1 26.3/6.9 55.5/12.1 50.1/8.1 54.3/10.4 50.5/9.1 44.4/9.3
mA6 mWA6 mWF6 mMF6 A3 UAR3 WF3 MF3
CMU-MOSEI corpus
DFS AV 77.5/8.8 57.3/1.3 76.0/4.3 57.2/2.1 57.9/11.8 53.3/12.8 58.2/12.8 53.7/13.3 61.4/8.4
DFS AT 78.2/9.5 57.2/1.2 76.2/4.5 56.9/1.8 59.3/13.2 53.2/12.7 58.6/13.2 53.5/13.1 61.7/8.7
DFS VT 79.7/11.0 56.8/0.8 76.6/4.9 56.1/1.0 60.1/14.0 54.3/13.8 59.7/14.3 54.9/14.5 62.3/9.3
TFS AVT 77.7/9.0 57.3/1.3 76.2/4.5 57.3/2.2 58.7/12.6 53.5/13.0 58.7/13.3 54.0/13.6 61.7/8.7
DFSLE AV 77.7/9.0 55.5/-0.5 75.5/3.8 55.4/0.3 58.2/12.1 52.9/12.4 58.0/12.6 53.4/13.0 60.8/7.8
DFSLE AT 79.3/10.6 56.3/0.3 76.0/4.3 55.3/0.2 60.6/14.5 55.0/14.5 60.2/14.8 55.4/15.0 62.3/9.3
DFSLE VT 79.3/10.6 56.7/0.7 76.1/4.4 55.6/0.5 60.7/14.6 53.6/13.1 59.4/14.0 54.2/13.8 61.9/8.9
TFSLE AVT 79.5/10.8 55.8/-0.2 75.9/4.2 54.7/-0.4 61.0/14.9 53.9/13.4 59.7/14.3 54.5/14.1 61.9/8.9
TFS A 77.8/9.1 58.2/2.2 76.1/4.4 57.1/2.0 58.4/12.3 54.1/13.6 58.8/13.4 54.5/14.1 61.9/8.9
TFS V 70.7/2.0 57.2/1.2 72.3/0.6 56.1/1.0 46.7/0.6 40.5/0.0 45.4/0.0 40.4/0.0 53.7/0.7
TFS T 79.0/10.3 55.7/-0.3 75.4/3.7 54.7/-0.4 61.3/15.2 55.5/15.0 60.8/15.4 56.3/15.9 62.3/9.3
TFS AV 75.6/6.9 57.9/1.9 75.2/3.5 57.6/2.5 56.8/10.7 50.1/9.6 55.9/10.5 50.6/10.2 60.0/7.0
TFS AT 77.5/8.8 57.7/1.7 76.2/4.5 57.7/2.6 59.3/13.2 54.2/13.7 59.2/13.8 54.7/14.3 62.1/9.1
TFS VT 76.3/7.6 56.6/0.6 75.4/3.7 56.7/1.6 51.2/5.1 48.3/7.8 52.1/6.7 48.3/7.9 58.1/5.1

Experimental results obtained for the multi-modal affective states recognition. Performance measure superscript shows the number of classes. The δ values are calculated relative to the performance of the video-based model. DFS refers to Double Fusion Strategy, DFSLE to Double Fusion Strategy with Label Encoder, TFS to Triple Fusion Strategy, and TFSLE to Triple Fusion Strategy with Label Encoder. A, V, and T refer to acoustic, visual, and text (linguistic) features, respectively.

Approach | Corpus | MT | Emotion: UAR6 WF6 | Sentiment: A3 WF3
Ryumina et al. (2023) RAMAS + 82.8 92.3 93.8 93.4
Ours + 72.5 68.2 89.1 90.0
A7 WF7 A3 WF3
Zhang et al. (2023) MELD + 41.2 41.2 67.3 67.2
Hwang & Kim (2024) - 66.7 65.9 - -
Tu et al. (2024) - 67.9 67.0 - -
Hu et al. (2024) - 74.9 68.8 - -
Zhang et al. (2025) - - 68.7 - -
Ours + 51.3 49.6 60.4 60.0
mWA6 mWF6 A2 WF2
Akhtar et al. (2019) CMU-MOSEI + 62.8 78.6 80.5 78.8
Chauhan et al. (2019) + 63.0 79.0 80.4 78.2
Sangwan et al. (2019) + 63.2 79.1 80.2 78.3
Hwang & Kim (2024) - - - 87.5 87.5
Zheng et al. (2024) - - - 85.9 86.0
He et al. (2025) - - - 88.0 87.5
Li et al. (2023) - - - 86.2 85.9
Ours + 55.9 88.6 73.7 84.8

Comparison of the proposed approach with other multi-modal SOTA approaches. MT refers to multi-task. Performance measure superscript shows the number of classes.