Digital-avatar systems still provide limited control over emotionally expressive behavior in human-computer interaction, especially in LLM-based chatbots and virtual assistants with personalized visual embodiments. To address this problem, we propose MAVAGEN, a multimodal avatar generation framework for synthesizing upper-body digital avatars with personalized appearance and controllable emotional expression. The user specifies the desired gender and age, and provides a short text input from which the target emotional state is inferred. MAVAGEN retrieves an identity image from the HaGRIDv2-1M corpus and generates an avatar clip with synchronized facial expressions, hand gestures, and expressive speech.
The framework uses six feature streams: textual features, emotion-distribution features, landmark-based pose features, depth-geometry features, RGB-appearance features, and acoustic features. In quantitative evaluation against recent human animation methods, MAVAGEN achieves the best overall avatar quality, with FID 48.20, FVD 592.00, SSIM 0.741, Sync-C 7.40, HKC 0.929, HKV 25.30, CSIM 0.563, EmoAcc 0.88, and MOS 6.97.
MAVAGEN consists of an LLM-based chatbot and an avatar generation module. The assistant collects avatar attributes, estimates the target emotion from the user message, retrieves a matching identity image, extracts landmark and depth representations, generates expressive speech, and passes aligned multimodal embeddings into a diffusion-based generator.
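The order of operations described above can be sketched as a simple staged pipeline. This is only an illustration of the data flow, not the MAVAGEN implementation: `AvatarRequest`, `run_pipeline`, the stage names, and the placeholder lambdas are all hypothetical, and each stage returns a descriptive string instead of real images, audio, or embeddings.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Tuple[str, Callable[[State], Any]]


@dataclass
class AvatarRequest:
    """User-facing inputs named in the text: gender, age, and a short message."""
    gender: str
    age: int
    text: str


def run_pipeline(request: AvatarRequest, stages: List[Stage]) -> State:
    """Run stages in order; each stage may read earlier outputs from `state`."""
    state: State = {"request": request, "order": []}
    for name, stage in stages:
        state[name] = stage(state)
        state["order"].append(name)
    return state


# Hypothetical placeholder stages. Real stages would return an emotion
# distribution, an image, landmark/depth maps, audio, and a video clip.
STAGES: List[Stage] = [
    ("emotion", lambda s: f"emotion inferred from: {s['request'].text!r}"),
    ("identity", lambda s: f"identity image for {s['request'].gender}, {s['request'].age}"),
    ("landmarks", lambda s: f"landmarks of [{s['identity']}]"),
    ("depth", lambda s: f"depth map of [{s['identity']}]"),
    ("speech", lambda s: f"speech conditioned on [{s['emotion']}]"),
    ("clip", lambda s: "clip from " + ", ".join(s["order"])),
]

result = run_pipeline(AvatarRequest("female", 30, "I passed my exam!"), STAGES)
print(result["clip"])
```

Keeping the stages in an ordered list makes the dependencies explicit: landmark and depth extraction need the retrieved identity image, speech synthesis needs the estimated emotion, and the generator runs last because it consumes every earlier output.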
Pipeline of the proposed MAVAGEN framework.
Qualitative comparison across the evaluated avatar generation methods.
Lower FID, FVD, E-FID, and Sync-D are better. Higher SSIM, PSNR, Sync-C, HKC, HKV, CSIM, EmoAcc, and MOS are better.
Key results for MAVAGEN: FID 48.20 (frame realism), FVD 592.00 (temporal coherence), Sync-C 7.40 (audio-visual synchrony), EmoAcc 0.88 (target emotion agreement), MOS 6.97 (subjective quality).
| Method | FID ↓ | FVD ↓ | SSIM ↑ | PSNR ↑ | E-FID ↓ | Sync-D ↓ | Sync-C ↑ | HKC ↑ | HKV ↑ | CSIM ↑ | EmoAcc ↑ | MOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AnimateAnyone | 60.10 | 1030.12 | 0.726 | 20.40 | 3.900 | 14.10 | 0.950 | 0.805 | 23.70 | 0.380 | -- | 2.64 ± 1.44 |
| MimicMotion | 55.20 | 635.40 | 0.705 | 19.10 | 2.700 | 8.10 | 1.450 | 0.902 | 24.70 | 0.520 | -- | 3.59 ± 2.20 |
| EchoMimicV2 | 50.10 | 605.30 | 0.736 | 21.90 | 2.240 | 7.10 | 7.150 | 0.921 | 25.20 | 0.555 | -- | 5.26 ± 2.16 |
| MAVAGEN | 48.20 | 592.00 | 0.741 | 21.95 | 2.250 | 6.85 | 7.40 | 0.929 | 25.30 | 0.563 | 0.88 | 6.97 ± 2.35 |
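The table can be read programmatically by combining each metric with its direction (lower or higher is better). The sketch below copies the objective metrics from the table and picks the best method per metric. The names `RESULTS`, `best_method`, and `relative_change` are illustrative. EmoAcc and MOS are left out because EmoAcc is reported for MAVAGEN only and MOS carries a spread.

```python
# Metrics for which a smaller value is better, as stated in the table note.
LOWER_IS_BETTER = {"FID", "FVD", "E-FID", "Sync-D"}

# Values copied from the comparison table.
RESULTS = {
    "AnimateAnyone": {"FID": 60.10, "FVD": 1030.12, "SSIM": 0.726, "PSNR": 20.40,
                      "E-FID": 3.900, "Sync-D": 14.10, "Sync-C": 0.950,
                      "HKC": 0.805, "HKV": 23.70, "CSIM": 0.380},
    "MimicMotion": {"FID": 55.20, "FVD": 635.40, "SSIM": 0.705, "PSNR": 19.10,
                    "E-FID": 2.700, "Sync-D": 8.10, "Sync-C": 1.450,
                    "HKC": 0.902, "HKV": 24.70, "CSIM": 0.520},
    "EchoMimicV2": {"FID": 50.10, "FVD": 605.30, "SSIM": 0.736, "PSNR": 21.90,
                    "E-FID": 2.240, "Sync-D": 7.10, "Sync-C": 7.150,
                    "HKC": 0.921, "HKV": 25.20, "CSIM": 0.555},
    "MAVAGEN": {"FID": 48.20, "FVD": 592.00, "SSIM": 0.741, "PSNR": 21.95,
                "E-FID": 2.250, "Sync-D": 6.85, "Sync-C": 7.40,
                "HKC": 0.929, "HKV": 25.30, "CSIM": 0.563},
}


def best_method(metric: str) -> str:
    """Return the method with the best value, respecting the metric direction."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(RESULTS, key=lambda method: RESULTS[method][metric])


def relative_change(metric: str, method: str = "MAVAGEN",
                    baseline: str = "EchoMimicV2") -> float:
    """Signed percent change of `method` relative to `baseline`."""
    new, old = RESULTS[method][metric], RESULTS[baseline][metric]
    return 100.0 * (new - old) / old


winners = {metric: best_method(metric) for metric in RESULTS["MAVAGEN"]}
for metric, method in winners.items():
    print(f"{metric:7s} best: {method:12s} change vs EchoMimicV2: {relative_change(metric):+.2f}%")
```

With these values, MAVAGEN ranks first on every listed metric except E-FID, where EchoMimicV2 is lower (2.240 versus 2.250).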
Distribution of subjective MOS ratings for the compared human animation methods. Higher scores indicate better perceived generation quality.
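A MOS entry of the form "mean ± spread" can be computed from raw ratings as in the following sketch. It assumes the ± value is a sample standard deviation, which the table does not state, and the ratings in `example_ratings` are invented for illustration only. They are not data from the study.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_summary(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (mean opinion score, sample standard deviation)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    return mean(ratings), stdev(ratings)


# Hypothetical ratings, not taken from the MAVAGEN evaluation.
example_ratings = [7, 9, 5, 8, 6]
mos, spread = mos_summary(example_ratings)
print(f"{mos:.2f} ± {spread:.2f}")
```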
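Compared with EchoMimicV2, MAVAGEN improves frame and sequence quality (FID 50.10 to 48.20, FVD 605.30 to 592.00) and slightly improves audio-visual synchrony (Sync-C 7.150 to 7.40, Sync-D 7.10 to 6.85) and motion consistency, while E-FID is essentially unchanged (2.250 versus 2.240). MAVAGEN also introduces EmoAcc as an auxiliary task-specific measure of emotional agreement and receives the highest MOS in the subjective evaluation (6.97 ± 2.35 versus 5.26 ± 2.16).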
@article{axyonov2026mavagen,
title={MAVAGEN: Multimodal Avatar Generation Framework for Personalized Human-Computer Interaction},
author={Axyonov, Alexandr and Ryumina, Elena and Ryumin, Dmitry and Karpov, Alexey},
journal={Multimodal Technologies and Interaction},
publisher={MDPI},
year={2026}
}