Project Page

MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human-Computer Interaction

Abstract

Digital-avatar systems still provide limited control over emotionally expressive behavior in human-computer interaction, especially in LLM-based chatbots and virtual assistants with personalized visual embodiments. To address this problem, we propose MAVAGEN, a multimodal avatar generation framework for synthesizing upper-body digital avatars with personalized appearance and controllable emotional expression. The user specifies the desired gender and age, and provides a short text input from which the target emotional state is inferred. MAVAGEN retrieves an identity image from the HaGRIDv2-1M corpus and generates an avatar clip with synchronized facial expressions, hand gestures, and expressive speech.

The framework uses six feature streams: textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features. In quantitative evaluation against recent human animation methods, MAVAGEN achieves the best overall avatar quality, with FID 48.20, FVD 592.00, SSIM 0.741, Sync-C 7.40, HKC 0.929, HKV 25.30, CSIM 0.563, EmoAcc 0.88, and MOS 6.97.

Method

MAVAGEN consists of an LLM-based chatbot and an avatar generation module. The assistant collects avatar attributes, estimates the target emotion from the user message, retrieves a matching identity image, extracts landmark and depth representations, generates expressive speech, and passes aligned multimodal embeddings into a diffusion-based generator.
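The first stages of this flow can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `IdentityRecord` schema, the nearest-age retrieval rule, the keyword-based `estimate_emotion` stand-in, and every function name are assumptions made for the example, and the landmark, depth, speech, and diffusion stages are only listed, not implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityRecord:
    # Hypothetical metadata record; the real corpus schema may differ.
    image_path: str
    gender: str
    age: int


def retrieve_identity(records, gender, age):
    """Pick the record of the requested gender whose age is closest to the target."""
    candidates = [r for r in records if r.gender == gender]
    if not candidates:
        raise LookupError(f"no identity image with gender={gender!r}")
    return min(candidates, key=lambda r: abs(r.age - age))


def estimate_emotion(text):
    """Toy keyword lookup standing in for the emotion estimator (assumption)."""
    keywords = {"great": "happy", "sorry": "sad", "terrible": "angry"}
    lowered = text.lower()
    for word, emotion in keywords.items():
        if word in lowered:
            return emotion
    return "neutral"


def plan_avatar_request(records, gender, age, text):
    """Run the early stages in the order the text lists them."""
    emotion = estimate_emotion(text)
    identity = retrieve_identity(records, gender, age)
    return {
        "identity_image": identity.image_path,
        "target_emotion": emotion,
        "speech_text": text,
        # Later stages named in the text, left unimplemented in this sketch.
        "pending_stages": [
            "extract landmark and depth representations",
            "generate expressive speech",
            "run diffusion-based generator on aligned embeddings",
        ],
    }


# Made-up records for illustration only.
records = [
    IdentityRecord("id_001.jpg", "female", 27),
    IdentityRecord("id_002.jpg", "male", 41),
    IdentityRecord("id_003.jpg", "female", 52),
]
plan = plan_avatar_request(records, "female", 30, "That is great news!")
print(plan["identity_image"], plan["target_emotion"])
```

Returning a plain dictionary keeps the hand-off to the later stages explicit; a real system would presumably pass feature tensors rather than file paths and labels.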

Pipeline of the proposed MAVAGEN framework.

Reference Images

Input reference images used for the qualitative video examples.

Example 1 (reference image)

Example 2 (reference image)

Video Comparison

Qualitative comparison across the evaluated avatar generation methods.

Example 1: AnimateAnyone, MimicMotion, EchoMimicV2, MAVAGEN (ours)

Example 2: AnimateAnyone, MimicMotion, EchoMimicV2, MAVAGEN (ours)

Quantitative Results

Lower FID, FVD, E-FID, and Sync-D are better. Higher SSIM, PSNR, Sync-C, HKC, HKV, CSIM, EmoAcc, and MOS are better.

MAVAGEN headline results:

FID: 48.20 (frame realism)
FVD: 592.00 (temporal coherence)
Sync-C: 7.40 (audio-visual synchrony)
EmoAcc: 0.88 (target emotion agreement)
MOS: 6.97 (subjective quality)

Method         FID ↓  FVD ↓    SSIM ↑  PSNR ↑  E-FID ↓  Sync-D ↓  Sync-C ↑  HKC ↑  HKV ↑  CSIM ↑  EmoAcc ↑  MOS ↑
AnimateAnyone  60.10  1030.12  0.726   20.40   3.900    14.10     0.950     0.805  23.70  0.380   --        2.64 ± 1.44
MimicMotion    55.20  635.40   0.705   19.10   2.700    8.10      1.450     0.902  24.70  0.520   --        3.59 ± 2.20
EchoMimicV2    50.10  605.30   0.736   21.90   2.240    7.10      7.150     0.921  25.20  0.555   --        5.26 ± 2.16
MAVAGEN        48.20  592.00   0.741   21.95   2.250    6.85      7.40      0.929  25.30  0.563   0.88      6.97 ± 2.35

Distribution of subjective MOS ratings for the compared human animation methods. Higher scores indicate better perceived generation quality.

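Compared with EchoMimicV2, MAVAGEN improves frame and sequence quality (FID 48.20 versus 50.10, FVD 592.00 versus 605.30), slightly improves audio-visual synchrony and motion consistency, and receives the highest MOS in the subjective evaluation. E-FID is essentially on par (2.250 versus 2.240). MAVAGEN also introduces EmoAcc as an auxiliary task-specific measure of emotional agreement, which is not reported for the other methods.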
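The comparison can be checked directly from the reported numbers. The sketch below copies the values from the results table, picks the best method per metric according to the stated direction (lower or higher is better), and computes the relative change of MAVAGEN against EchoMimicV2. The helper names `best_method` and `relative_gain` are illustrative, not part of any released code; EmoAcc is left out because it is reported for MAVAGEN only, and MOS uses the mean without its spread.

```python
# Values copied from the results table above.
METRICS = ["FID", "FVD", "SSIM", "PSNR", "E-FID", "Sync-D",
           "Sync-C", "HKC", "HKV", "CSIM", "MOS"]
LOWER_IS_BETTER = {"FID", "FVD", "E-FID", "Sync-D"}

RESULTS = {
    "AnimateAnyone": [60.10, 1030.12, 0.726, 20.40, 3.900, 14.10,
                      0.950, 0.805, 23.70, 0.380, 2.64],
    "MimicMotion":   [55.20, 635.40, 0.705, 19.10, 2.700, 8.10,
                      1.450, 0.902, 24.70, 0.520, 3.59],
    "EchoMimicV2":   [50.10, 605.30, 0.736, 21.90, 2.240, 7.10,
                      7.150, 0.921, 25.20, 0.555, 5.26],
    "MAVAGEN":       [48.20, 592.00, 0.741, 21.95, 2.250, 6.85,
                      7.40, 0.929, 25.30, 0.563, 6.97],
}


def best_method(metric):
    """Return the method with the best value, respecting the metric direction."""
    idx = METRICS.index(metric)
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(RESULTS, key=lambda name: RESULTS[name][idx])


def relative_gain(metric, method="MAVAGEN", baseline="EchoMimicV2"):
    """Percent change versus the baseline, signed so that positive means better."""
    idx = METRICS.index(metric)
    new, old = RESULTS[method][idx], RESULTS[baseline][idx]
    change = (new - old) / old * 100.0
    return -change if metric in LOWER_IS_BETTER else change


for metric in METRICS:
    print(f"{metric:7s} best={best_method(metric):13s} "
          f"gain vs EchoMimicV2={relative_gain(metric):+.2f}%")
```

On these numbers MAVAGEN leads on every listed metric except E-FID, where EchoMimicV2 is ahead by 0.010.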

Citation

Journal: Multimodal Technologies and Interaction (MTI), MDPI.

@article{axyonov2026mavagen,
  title={MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human-Computer Interaction},
  author={Axyonov, Alexandr and Ryumina, Elena and Ryumin, Dmitry and Karpov, Alexey},
  journal={Multimodal Technologies and Interaction},
  publisher={MDPI},
  year={2026}
}