Multi-Lingual Approach for Multi-Modal Emotion and Sentiment Recognition Based on Triple Fusion

Maxim Markitantov^1,*, Elena Ryumina^1,*, Anastasia Dvoynikova¹, Alexey Karpov^1,2,

¹ St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg, Russia
² ITMO University, St. Petersburg, Russia

Code Models (coming soon)

Abstract

Affective states recognition is a challenging task that requires a large amount of input data, such as audio, video, and text. Current multi-modal approaches are often single-task and corpus-specific, resulting in overfitting, poor generalization across corpora, and reduced real-world performance. In this work, we address these limitations by: (1) multi-lingual training on corpora that include Russian (RAMAS) and English (MELD, CMU-MOSEI) speech; (2) multi-task learning for joint emotion and sentiment recognition; and (3) a novel Triple Fusion strategy that employs cross-modal integration at both hierarchical unimodal and fused multi-modal feature levels, enhancing intra- and inter-modal relationships of different affective states and modalities. Additionally, to optimize performance of the approach proposed, we compare temporal encoders (Transformer-based, Mamba, xLSTM) and fusion strategies (double and triple fusion strategies with and without a label encoder) to comprehensively understand their capabilities and limitations. On the Test subset of the CMU-MOSEI corpus, the proposed approach showed mean weighted F1-score (mWF) of 88.6%, and weighted F1-score (WF) of 84.8% for emotion and sentiment recognition, respectively. On the Test subset of the MELD corpus, the proposed approach showed WF of 49.6% and WF of 60.0%, respectively. On the Test subset of the RAMAS corpus, the proposed approach showed WF of 71.8% and WF of 90.0%, respectively. We compare the performance of the approach proposed with that of the SOTA ones.

Pipeline of the Proposed Approach

General pipeline of the proposed approach

Uni-modal Affective States Recognition

Audio-based model for affective states recognition

Video-based model for affective states recognition

Text-based model for affective states recognition

Multi-Modal Affective States Recognition

Triple Fusion Strategy

Double Fusion Strategy

Backbone of double and triple fusion strategies with label encoders

Research Data

Corpus	Language, speech type	Speaker distribution	Duration	Audio Format	Video Format
RAMAS	Russian, acted	5 women, 5 men, Aged 18-28 years	Total dur.: 5 hours, Num. of utt.: 1792, Avg. utt. dur.: 9.67 s., Avg. num. of words in utt.: 14	PCM WAV, 44.1 kHz, Mono/16-bits	MPEG-4, 1920x1080, 25 FPS
CMU-MOSEI	English, in-the-wild	430 women, 570 men	Total dur.: 46 hours, Num. of utt.: 22788, Avg. utt. dur.: 7.24 s., Avg. num. of words in utt.: 8	PCM WAV, 16 kHz, Mono/16-bits	MPEG-4, 1280x720, 30 FPS
MELD	English, acted	6 main speakers, 3 women, 3 men	Total dur.: 12 hours, Num. of utt.: 13706, Avg. utt. dur.: 3.18 s., Avg. num. of words in utt.: 19	AAC, 48 kHz	MPEG-4, 1280x720, 24 FPS

Metadata statistics of the corpora used. Avg. refers to average, utt. to utterance, dur. to duration, N/A to not available.

Experimental Results

Temporal Encoder	Emotion				Sentiment
	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ	Average/δ
Single-Corpus trained model on RAMAS
EW2V	40.7/-	28.7/-	35.6/-	28.4/-	73.3/-	49.6/-	75.2/-	50.9/-	47.8/-
Transformer	52.3/11.6	41.7/13.1	51.3/15.7	43.0/14.6	75.2/1.9	50.9/1.3	75.8/0.6	51.3/0.4	55.2/7.4
Mamba	48.4/7.8	38.6/9.9	46.9/11.3	39.8/11.3	73.6/0.4	49.9/0.3	74.1/-1.0	50.2/-0.7	52.7/4.9
xLSTM	49.6/8.9	39.4/10.7	49.7/14.1	42.1/13.6	74.8/1.6	50.7/1.0	76.5/1.4	51.8/0.9	54.3/6.5
Single-Corpus trained model on MELD
EW2V	47.6/-	15.3/-	33.8/-	11.6/-	49.1/-	38.6/-	42.9/-	34.3/-	34.1/-
Transformer	46.1/-1.5	18.5/3.3	38.1/4.3	16.4/4.9	50.2/1.1	42.8/4.3	47.9/5.0	42.9/8.6	37.9/3.7
Mamba	46.8/-0.8	18.8/3.5	39.2/5.4	18.1/6.5	50.2/1.1	42.0/3.5	47.3/4.5	41.7/7.4	38.0/3.9
xLSTM	48.1/0.5	17.9/2.6	38.1/4.3	16.1/4.5	51.5/2.5	41.6/3.0	46.9/4.0	40.1/5.8	37.5/3.4
Single-Corpus trained model on CMU-MOSEI
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
EW2V	80.2/-	55.3/-	76.4/-	54.4/-	57.9/-	51.0/-	56.1/-	50.7/-	60.3/-
Transformer	80.6/0.4	56.7/1.4	77.4/1.0	56.3/1.9	58.7/0.8	52.1/1.1	57.0/0.9	51.7/1.0	61.3/1.1
Mamba	80.6/0.4	56.2/0.9	77.2/0.7	55.6/1.2	59.8/1.9	52.2/1.1	57.4/1.3	51.7/1.1	61.3/1.1
xLSTM	80.8/0.5	55.7/0.3	76.9/0.4	54.7/0.3	61.0/3.1	53.2/2.1	58.0/1.9	51.8/1.2	61.5/1.2
Multi-Corpus trained model using Transformer encoder
Corpus	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
RAMAS	42.6/-9.7	35.7/-6.0	37.9/-13.3	31.8/-11.2	70.9/-4.3	64.2/13.3	72.2/-3.5	54.5/3.2	51.2/-3.9
MELD	41.2/-4.9	19.1/0.5	36.7/-1.3	18.0/1.6	46.6/-3.6	40.8/-2.0	45.3/-2.6	41.0/-1.8	36.1/-1.8
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
CMU-MOSEI	78.0/-2.6	57.0/0.3	75.9/-1.5	56.1/-0.2	57.3/-1.4	51.8/-0.3	56.6/-0.4	51.9/0.2	60.6/-0.7

Experimental results obtained for the audio-based affective states recognition. Performance measure superscript shows the number of classes.

Temporal Encoder	Emotion				Sentiment
	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ	Average/δ
Single-Corpus trained model on RAMAS
BiLSTM	75.6/--	70.0/--	75.8/--	67.5/--	94.2/--	79.9/--	94.4/--	76.7/--	79.3/--
Transformer	74.4/-1.2	65.0/-5.0	74.9/-0.9	64.6/-2.9	94.2/--	88.0/8.1	94.7/0.3	79.0/2.3	79.4/0.1
Mamba	71.3/4.3	64.1/5.9	71.6/4.2	62.0/-5.5	90.7/-3.5	77.6/-2.3	92.1/-2.3	69.5/-7.2	74.9/-4.4
xLSTM	69.8/-5.8	63.0/-7.0	69.8/-6.0	60.0/-7.5	89.1/-5.1	76.5/-3.4	90.2/-4.2	69.1/-7.6	73.4/-5.9
Single-Corpus trained model on MELD
BiLSTM	28.8/--	23.9/--	29.4/--	20.2/--	47.9/--	47.7/--	48.2/--	46.7/--	36.6/--
Transformer	28.1/-0.7	25.9/2.0	30.6/1.2	22.2/2.0	48.1/0.2	48.6/0.9	48.6/0.4	47.2/0.5	37.4/0.8
Mamba	29.8/1.0	23.8/-0.1	32.2/2.8	20.1/-0.1	50.2/2.3	48.1/0.4	50.2/2.0	48.0/1.3	37.8/1.2
xLSTM	28.2/-0.6	25.4/1.5	29.1/-0.3	21.0/0.8	48.2/0.3	48.8/1.1	48.7/0.5	47.2/0.5	37.1/0.5
Single-Corpus trained model on CMU-MOSEI
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
BiLSTM	78.8/--	55.5/--	75.0/--	54.4/--	48.9/--	42.6/--	47.7/--	42.5/--	55.9/--
Transformer	80.5/1.7	54.1/-1.4	75.9/0.9	52.5/-1.9	49.9/1.0	45.2/2.6	50.0/2.3	45.3/2.8	56.7/0.8
Mamba	80.1/1.3	54.2/-1.3	75.9/0.9	52.7/-1.7	48.3/-0.6	44.0/1.4	48.0/0.3	43.3/0.8	55.8/-0.1
xLSTM	78.7/-0.1	55.4/-0.1	76.2/1.2	55.3/0.9	45.6/-3.3	42.3/-0.3	45.9/-1.8	42.0/-0.5	55.2/-0.7
Multi-Corpus trained model using Transformer encoder
Corpus	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
RAMAS	71.7/-2.7	69.0/4.0	71.8/-3.1	64.9/0.3	90.7/-3.5	77.6/-10.4	90.8/-3.9	75.8/-3.2	76.5/-2.9
MELD	35.8/7.7	19.4/-6.5	35.1/4.5	19.4/-2.8	43.4/-4.7	42.0/-6.6	43.9/-4.7	41.4/-5.8	35.1/-2.3
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
CMU-MOSEI	68.7/-11.8	56.0/1.9	71.7/-4.2	55.1/2.6	46.1/-3.8	40.5/-4.7	45.4/-4.6	40.4/-4.9	53.0/-3.7

Experimental results obtained for the video-based affective states recognition. Performance measure superscript shows the number of classes.

Temporal Encoder	Emotion				Sentiment
	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ	Average/δ
Single-Corpus trained model on RAMAS
LMHA	14.0/-	21.4/-	4.1/-	12.9/-	50.0/-	50.0/-	33.7/-	44.3/-	28.8/-
Transformer	24.4/10.4	30.6/9.2	22.0/17.9	22.8/9.9	53.5/3.5	52.4/2.4	56.9/23.2	41.4/-2.9	38.0/9.2
Mamba	28.7/14.7	30.6/9.2	29.0/24.9	27.7/14.8	55.4/5.4	53.7/3.7	58.2/24.5	42.7/-1.6	40.8/12.0
xLSTM	27.1/13.1	30.9/9.5	24.6/20.5	25.4/12.5	51.9/1.9	43.2/-6.8	56.2/22.5	39.4/-4.9	37.3/8.5
Single-Corpus trained model on MELD
LMHA	8.0/-	14.3/-	1.2/-	2.1/-	42.6/-	37.2/-	38.1/-	31.6/-	21.9/-
Transformer	52.4/44.4	36.3/22.0	53.7/52.5	35.1/33.0	63.3/20.7	59.0/21.8	61.4/23.3	57.8/26.2	52.4/30.5
Mamba	46.8/38.8	35.1/20.8	50.2/49.0	33.1/31.0	62.8/20.2	57.9/20.7	61.9/23.8	58.2/26.6	50.8/28.9
xLSTM	42.6/34.6	38.4/24.1	46.7/45.5	33.3/31.2	59.2/16.6	58.1/20.9	59.9/21.8	57.1/25.5	49.4/27.5
Single-Corpus trained model on CMU-MOSEI
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
LMHA	80.0/-	50.2/-	71.9/-	46.1/-	49.0/-	33.3/-	32.3/-	21.9/-	48.1/-
Transformer	80.2/0.2	56.6/6.4	77.1/5.2	55.8/9.7	60.9/11.9	58.1/24.8	61.2/28.9	57.6/35.7	63.4/15.4
Mamba	79.7/-0.3	53.8/3.6	75.3/3.4	51.8/5.7	57.3/8.3	55.7/22.4	57.5/25.2	54.4/32.5	60.7/12.6
xLSTM	80.1/0.1	55.8/5.6	76.4/4.5	55.5/9.4	56.8/7.8	55.4/22.1	57.2/24.9	54.3/32.4	61.4/13.4
Multi-Corpus trained model using Transformer encoder
Corpus	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
RAMAS	24.4/0.0	26.1/-4.5	23.0/1.0	20.1/-2.7	54.7/1.2	45.1/-7.3	55.1/-1.8	39.6/-1.8	36.0/-2.0
MELD	52.0/-0.4	39.1/2.8	53.5/-0.2	36.5/1.4	62.5/-0.8	60.7/1.7	62.7/1.3	60.0/2.2	53.4/1.0
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
CMU-MOSEI	63.8/-16.4	61.8/5.2	71.3/-5.8	57.5/1.7	63.1/2.2	54.7/-3.4	60.3/-0.9	54.8/-2.8	60.9/-2.5

Experimental results obtained for the text-based affective states recognition. Performance measure superscript shows the number of classes.

Fusion Strategy	Emotion				Sentiment
	A⁷/δ	UAR⁷/δ	WF⁷/δ	MF⁷/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ	Average/δ
RAMAS corpus
DFS AV	70.9/-0.8	66.3/-2.7	71.8/0.0	63.1/-1.8	86.0/-4.7	82.5/4.9	87.4/-3.4	68.9/-6.9	74.6/-1.9
DFS AT	68.6/-3.1	63.1/-5.9	69.6/-2.2	60.7/-4.2	86.0/-4.7	82.5/4.9	87.8/-3.0	68.2/-7.6	73.3/-3.2
DFS VT	47.3/-24.4	38.0/-31.0	41.6/-30.2	31.0/-33.9	79.8/-10.9	62.1/-15.5	81.6/-9.2	58.5/-17.3	55.0/-21.5
TFS AVT	71.3/-0.4	65.3/-3.7	71.8/0.0	62.8/-2.1	89.1/-1.6	84.6/7.0	90.0/-0.8	73.0/-2.8	76.0/-0.5
DFSLE AV	70.5/-1.2	68.1/-0.9	70.6/-1.2	63.9/-1.0	90.7/0.0	85.6/8.0	91.4/0.6	74.8/-1.0	77.0/0.4
DFSLE AT	39.1/-32.6	35.2/-33.8	26.5/-45.3	18.3/-46.6	74.8/-15.9	58.7/-18.9	77.3/-13.5	54.8/-21.0	48.1/-28.4
DFSLE VT	40.7/-31.0	30.6/-38.4	29.1/-42.7	20.2/-44.7	77.1/-13.6	60.3/-17.3	79.4/-11.4	56.5/-19.3	49.3/-27.3
TFSLE AVT	47.7/-24.0	41.4/-27.6	37.6/-34.2	29.2/-35.7	81.8/-8.9	63.5/-14.1	83.3/-7.5	60.0/-15.8	55.5/-21.0
TFS A	46.9/-24.8	43.5/-25.5	44.1/-27.7	36.4/-28.5	74.4/-16.3	58.5/-19.1	76.6/-14.2	54.5/-21.3	54.4/-22.2
TFS V	72.9/1.2	69.8/0.8	72.8/1.0	65.0/0.1	89.5/-1.2	84.8/7.2	90.4/-0.4	73.3/-2.5	77.3/0.8
TFS T	26.0/-45.7	21.4/-47.6	17.2/-54.6	11.7/-53.2	57.0/-33.7	46.7/-30.9	59.0/-31.8	42.3/-33.5	35.1/-41.4
TFS AV	72.5/0.8	67.5/-1.5	72.8/1.0	65.0/0.1	89.9/-0.8	85.1/7.5	90.5/-0.3	75.1/-0.7	77.3/0.8
TFS AT	48.1/-23.6	42.6/-26.4	47.8/-24.0	39.8/-25.1	75.2/-15.5	59.0/-18.6	77.5/-13.3	55.1/-20.7	55.6/-20.9
TFS VT	72.1/0.4	68.8/-0.2	72.5/0.7	63.9/-1.0	89.1/-1.6	84.6/7.0	90.4/-0.4	71.4/-4.4	76.6/0.1
MELD corpus
DFS AV	49.9/14.1	31.4/12.0	49.6/14.5	31.5/12.1	59.2/15.8	55.9/13.9	58.8/14.9	56.0/14.6	49.0/14.0
DFS AT	51.0/15.2	30.9/11.5	50.2/15.1	30.9/11.5	59.3/15.9	57.1/15.1	59.4/15.5	57.0/15.6	49.5/14.4
DFS VT	51.9/16.1	25.9/6.5	46.9/11.8	23.8/4.4	60.3/16.9	60.0/18.0	60.6/16.7	58.3/16.9	48.5/13.4
TFS AVT	49.3/13.5	28.7/9.3	48.1/13.0	29.0/9.6	58.4/15.0	55.1/13.1	58.3/14.4	55.4/14.0	47.8/12.7
DFSLE AV	41.4/5.6	23.2/3.8	40.8/5.7	23.2/3.8	51.0/7.6	48.1/6.1	50.9/7.0	48.0/6.6	40.8/5.8
DFSLE AT	51.9/16.1	25.2/5.8	45.4/10.3	21.0/1.6	62.7/19.3	58.7/16.7	62.1/18.2	59.2/17.8	48.3/13.2
DFSLE VT	51.9/16.1	25.5/6.1	46.0/10.9	21.7/2.3	61.0/17.6	58.7/16.7	60.8/16.9	58.0/16.6	47.9/12.9
TFSLE AVT	52.7/16.9	25.1/5.7	46.3/11.2	22.5/3.1	60.4/17.0	58.5/16.5	60.2/16.3	57.7/16.3	47.9/12.9
TFS A	41.3/5.5	22.6/3.2	39.9/4.8	22.1/2.7	48.4/5.0	45.1/3.1	48.4/4.5	45.1/3.7	39.1/4.0
TFS V	37.6/1.8	17.9/-1.5	35.3/0.2	18.0/-1.4	42.4/-1.0	40.1/-1.9	42.0/-1.9	39.1/-2.3	34.1/-1.0
TFS T	55.1/19.3	26.7/7.3	49.3/14.2	25.8/6.4	62.6/19.2	58.1/16.1	61.4/17.5	57.8/16.4	49.6/14.6
TFS AV	36.5/0.7	21.5/2.1	37.1/2.0	21.1/1.7	47.5/4.1	44.7/2.7	47.4/3.5	44.4/3.0	37.5/2.5
TFS AT	54.0/18.2	33.0/13.6	52.7/17.6	33.2/13.8	60.5/17.1	57.7/15.7	60.4/16.5	57.3/15.9	51.1/16.1
TFS VT	47.2/11.4	25.9/6.5	45.2/10.1	26.3/6.9	55.5/12.1	50.1/8.1	54.3/10.4	50.5/9.1	44.4/9.3
	mA⁶/δ	mWA⁶/δ	mWF⁶/δ	mMF⁶/δ	A³/δ	UAR³/δ	WF³/δ	MF³/δ
CMU-MOSEI corpus
DFS AV	77.5/8.8	57.3/1.3	76.0/4.3	57.2/2.1	57.9/11.8	53.3/12.8	58.2/12.8	53.7/13.3	61.4/8.4
DFS AT	78.2/9.5	57.2/1.2	76.2/4.5	56.9/1.8	59.3/13.2	53.2/12.7	58.6/13.2	53.5/13.1	61.7/8.7
DFS VT	79.7/11.0	56.8/0.8	76.6/4.9	56.1/1.0	60.1/14.0	54.3/13.8	59.7/14.3	54.9/14.5	62.3/9.3
TFS AVT	77.7/9.0	57.3/1.3	76.2/4.5	57.3/2.2	58.7/12.6	53.5/13.0	58.7/13.3	54.0/13.6	61.7/8.7
DFSLE AV	77.7/9.0	55.5/-0.5	75.5/3.8	55.4/0.3	58.2/12.1	52.9/12.4	58.0/12.6	53.4/13.0	60.8/7.8
DFSLE AT	79.3/10.6	56.3/0.3	76.0/4.3	55.3/0.2	60.6/14.5	55.0/14.5	60.2/14.8	55.4/15.0	62.3/9.3
DFSLE VT	79.3/10.6	56.7/0.7	76.1/4.4	55.6/0.5	60.7/14.6	53.6/13.1	59.4/14.0	54.2/13.8	61.9/8.9
TFSLE AVT	79.5/10.8	55.8/-0.2	75.9/4.2	54.7/-0.4	61.0/14.9	53.9/13.4	59.7/14.3	54.5/14.1	61.9/8.9
TFS A	77.8/9.1	58.2/2.2	76.1/4.4	57.1/2.0	58.4/12.3	54.1/13.6	58.8/13.4	54.5/14.1	61.9/8.9
TFS V	70.7/2.0	57.2/1.2	72.3/0.6	56.1/1.0	46.7/0.6	40.5/0.0	45.4/0.0	40.4/0.0	53.7/0.7
TFS T	79.0/10.3	55.7/-0.3	75.4/3.7	54.7/-0.4	61.3/15.2	55.5/15.0	60.8/15.4	56.3/15.9	62.3/9.3
TFS AV	75.6/6.9	57.9/1.9	75.2/3.5	57.6/2.5	56.8/10.7	50.1/9.6	55.9/10.5	50.6/10.2	60.0/7.0
TFS AT	77.5/8.8	57.7/1.7	76.2/4.5	57.7/2.6	59.3/13.2	54.2/13.7	59.2/13.8	54.7/14.3	62.1/9.1
TFS VT	76.3/7.6	56.6/0.6	75.4/3.7	56.7/1.6	51.2/5.1	48.3/7.8	52.1/6.7	48.3/7.9	58.1/5.1

Experimental results obtained for the multi-modal affective states recognition. Performance measure superscript shows the number of classes. The δ values are calculated relative to the performance of the video-based model. DFS refers to Double Fusion Strategy. DFSLE to Double Fusion Strategy with Label Encoder. TFS refers to Triple Fusion Strategy. TFSLE to Triple Fusion Strategy with Label Encoder. A to acoustic features. V to visual features. T to text (linguistic) features.

Approach	Corpus	MT	Emotion		Sentiment
			UAR⁶	WF⁶	A³	WF³
Ryumina et al. (2023)	RAMAS	+	82.8	92.3	93.8	93.4
Ours		+	72.5	68.2	89.1	90.0
			A⁷	WF⁷	A³	WF³
Zhang et al. (2023)	MELD	+	41.2	41.2	67.3	67.2
Hwang & Kim (2024)		-	66.7	65.9	-	-
Tu et al. (2024)		-	67.9	67.0	-	-
Hu et al. (2024)		-	74.9	68.8	-	-
Zhang et al. (2025)		-	-	68.7	-	-
Ours		+	51.3	49.6	60.4	60.0
			mWA⁶	mWF⁶	A²	WF²
Akhtar et al. (2019)	CMU-MOSEI	+	62.8	78.6	80.5	78.8
Chauhan et al. (2019)		+	63.0	79.0	80.4	78.2
Sangwan et al. (2019)		+	63.2	79.1	80.2	78.3
Hwang & Kim (2024)		-	-	-	87.5	87.5
Zheng et al. (2024)		-	-	-	85.9	86.0
He et al. (2025)		-	-	-	88.0	87.5
Li et al. (2023)		-	-	-	86.2	85.9
Ours		+	55.9	88.6	73.7	84.8

Comparison of the proposed approach with other multi-modal SOTA approaches. MT refers to multi-task. Performance measure superscript shows the number of classes.