With the rapid development of artificial intelligence, digital human video generation has reached an important milestone. The OmniAvatar system, jointly developed by Zhejiang University and Alibaba Group, can generate a natural, fluid full-body dynamic video from nothing more than a static photo and an audio clip, opening up new possibilities for virtual digital human technology.

Innovations in digital human technology: from "talking heads" to "full-body performances"
Bottlenecks in traditional methods
For a long time, audio-driven portrait video generation has focused on animating the facial region, an approach often referred to as "Talking Head" generation. While it can achieve basic lip synchronization, it has the following significant limitations:
- Limited range of motion: drives only facial expression changes, without coordinated body movements
- Insufficient synchronization accuracy: matching between complex speech content and mouth shapes still needs improvement
- Limited controllability: difficult to achieve fine-grained control over movement, emotion, and context through text prompts
OmniAvatar's Innovative Breakthroughs
OmniAvatar, an efficient audio-driven system based on LoRA (Low-Rank Adaptation) technology, successfully breaks through the constraints of traditional methods. The system is capable of taking three inputs: a still photo of a person, an audio file, and a text prompt, and then generating a complete video with natural body movements.

Core Strengths Comparison:
| Technical characteristic | Traditional methods | OmniAvatar |
|---|---|---|
| Animation scope | Face area only | Full-body coordination |
| Audio synchronization | Basic mouth matching | High-precision audio-video alignment |
| Control flexibility | Audio-only driving | Dual control via audio + text |
| Video duration | Short clips only | Continuous long-video output |
| Identity consistency | Prone to drift | Stable preservation of character traits |
Core Technology Architecture: The Perfect Integration of Three Innovative Technologies
Pixel-by-pixel multi-level audio embedding
Traditional audio embedding methods typically use a cross-attention mechanism that simply blends audio features with visual features. OmniAvatar adopts a more refined strategy:
Technological Innovation Points:
- Extracts high-quality audio features using the Wav2Vec2 model
- Uses a dedicated Audio Pack module for feature compression and alignment
- Embeds audio information pixel-wise into multiple temporal layers of the diffusion model
- Significantly improves lip-sync precision and the naturalness of body movements
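As a rough illustration, the pixel-wise injection described above can be sketched as follows. This is a minimal toy sketch: the `AudioPack` internals, the helper names, and all tensor dimensions are hypothetical stand-ins (only the 768-dim Wav2Vec2 feature size is standard), not the actual OmniAvatar implementation:

```python
import torch
import torch.nn as nn

class AudioPack(nn.Module):
    """Hypothetical stand-in: compresses per-frame audio features and
    projects them to the latent channel dimension so they can be added
    pixel-wise to diffusion feature maps."""
    def __init__(self, audio_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, latent_dim)
        return self.proj(audio_feats)

def embed_pixelwise(latent: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Broadcast the audio embedding over every spatial position of a
    video latent. latent: (B, F, C, H, W); audio_emb: (B, F, C)."""
    return latent + audio_emb[:, :, :, None, None]

# Toy usage; the random tensor stands in for Wav2Vec2 output (dim 768).
B, F, C, H, W = 1, 8, 16, 4, 4
audio = torch.randn(B, F, 768)                            # per-frame audio features
pack = AudioPack(audio_dim=768, latent_dim=C)
emb = pack(audio)                                         # (1, 8, 16)
latents = [torch.randn(B, F, C, H, W) for _ in range(3)]  # multiple temporal layers
fused = [embed_pixelwise(z, emb) for z in latents]        # inject at every layer
print(fused[0].shape)  # torch.Size([1, 8, 16, 4, 4])
```

The key design point relative to cross-attention is that the audio signal is added at every spatial location and at multiple layers, rather than mixed in once through an attention map.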

LoRA fine-tuning strategy
To achieve efficient training while maintaining model generation capabilities, OmniAvatar employs LoRA fine-tuning:
Implementation approach:
- Low-rank matrices are inserted only into the attention and feed-forward layers of the Transformer
- Avoids the overfitting risk of full-parameter fine-tuning
- Achieves markedly better audio-video alignment than fully freezing the base model
- Substantially reduces training cost and time
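The general LoRA mechanism behind this strategy can be sketched in a few lines. This is a generic low-rank adapter on a single linear layer, with hypothetical rank and scaling values, not OmniAvatar's actual training code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where A is (r, in) and B is (out, r)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r               # B starts at zero: no initial change

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Such wrappers would be attached only to attention / feed-forward projections.
layer = nn.Linear(64, 64)
adapted = LoRALinear(layer, r=4)
x = torch.randn(2, 64)
# With B initialized to zero, the adapted layer initially matches the base.
assert torch.allclose(adapted(x), layer(x))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(trainable)  # 4*64 + 64*4 = 512 trainable vs 64*64 + 64 frozen parameters
```

Only the two small matrices are updated during fine-tuning, which is what keeps training cheap while leaving the base model's generation capability intact.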
Long video generation mechanism
OmniAvatar has designed a unique solution for the identity drift and coherence issues that are common in long video generation:
Technical points:
- Introduces a reference-image latent as an identity-anchoring mechanism
- Uses a frame-overlap strategy to ensure temporal consistency across segments
- Implements a progressive segment-by-segment generation algorithm
- Effectively mitigates color drift and cumulative error in long videos
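The interplay of these three points can be sketched as a simple generation loop. Everything here is a hypothetical stand-in (the `denoise` callable replaces the real diffusion sampler, and segment/overlap sizes are made up); it only illustrates the control flow, not OmniAvatar's actual algorithm:

```python
import torch

def generate_long_video(ref_latent, total_frames, seg_len=16, overlap=4, denoise=None):
    """Progressive segment generation sketch: each segment is conditioned on
    the last `overlap` frames of the previous segment plus a fixed reference
    latent, so identity stays anchored across the whole video."""
    if denoise is None:
        # Stand-in for the diffusion sampler: returns (seg_len, C, H, W).
        denoise = lambda cond_frames, ref: torch.randn(seg_len, *ref.shape)
    frames = []
    prev_tail = None                 # overlap frames carried between segments
    while len(frames) < total_frames:
        seg = denoise(prev_tail, ref_latent)
        if prev_tail is not None:
            seg = seg[overlap:]      # drop frames that duplicate the overlap
        frames.extend(seg)
        prev_tail = torch.stack(list(frames[-overlap:]))
    return torch.stack(frames[:total_frames])

ref = torch.randn(4, 8, 8)           # identity-anchoring reference latent
video = generate_long_video(ref, total_frames=40)
print(video.shape)  # torch.Size([40, 4, 8, 8])
```

Because every segment sees the same reference latent, errors from one segment are less able to accumulate into identity or color drift over the full video.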

Performance: leading experimental results across the board
Assessment systems and data sets
OmniAvatar has been thoroughly tested on multiple authoritative datasets using industry-recognized evaluation metrics:
Training data: A carefully filtered AVSpeech dataset containing 1320 hours of video content and about 770,000 short video samples
Test data: HDTF high quality face video dataset + AVSpeech test set
Evaluation dimensions:
| Evaluation category | Specific metrics | Assessment objective |
|---|---|---|
| Image quality | FID, IQA, ASE | Realism and clarity of generated frames |
| Video quality | FVD | Fluency and coherence of video sequences |
| Synchronization accuracy | Sync-C, Sync-D | How well the audio matches mouth movements |
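To give a feel for what the synchronization metrics measure, here is a deliberately simplified score in the spirit of Sync-C. The real Sync-C/Sync-D metrics use a pretrained SyncNet and search over temporal offsets; this toy version only computes mean cosine similarity between hypothetical per-frame audio and mouth-region embeddings:

```python
import torch
import torch.nn.functional as F

def sync_confidence(audio_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    """Toy Sync-C-style score: mean cosine similarity between per-frame
    audio embeddings and mouth-region embeddings (higher = better sync)."""
    return F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean()

torch.manual_seed(0)
a = torch.randn(32, 128)              # 32 frames of audio embeddings
v = a + 0.1 * torch.randn(32, 128)    # a well-synchronized visual track
r = torch.randn(32, 128)              # an unrelated (out-of-sync) track
print(float(sync_confidence(a, v)))   # close to 1.0
print(float(sync_confidence(a, r)))   # close to 0.0
```

A well-aligned audio-visual pair scores near 1, while a mismatched pair scores near 0, which is the intuition behind reporting Sync-C (confidence, higher is better) alongside Sync-D (distance, lower is better).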
Comparison of experimental results
Facial animation performance: On both the HDTF and AVSpeech test sets, OmniAvatar achieves the best results on the two key dimensions of image quality and lip synchronization. Compared with well-known methods such as SadTalker and MultiTalk, its generated videos show higher realism and more natural expression changes.


Full-body animation ability: This is OmniAvatar's most outstanding advantage. Experimental results show that the system is currently the only model that can generate coordinated, natural upper- and lower-body movements while maintaining high-precision lip synchronization. Compared with competing methods such as HunyuanAvatar and FantasyTalking, OmniAvatar successfully addresses the industry pain point of "the head moves but the body does not".


Verification of ablation experiments
Through detailed ablation experiments, the research team verified the effectiveness of the individual technology components:
- The advantages of the LoRA strategy are clear: it strikes an optimal balance between training efficiency and generation quality
- Multi-layer embedding is effective: it captures temporal features and semantic hierarchy better than single-layer embedding
- Parameter tuning matters: an appropriate CFG (classifier-free guidance) scale enhances synchronization, but too high a value leads to exaggerated expressions
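The CFG trade-off mentioned in the ablation can be seen from the standard classifier-free guidance formula, sketched below with a toy model. The `toy` model and the guidance scale value are illustrative only; the blending formula itself is the standard one:

```python
import torch

def cfg_predict(model, x, cond, guidance_scale=4.5):
    """Classifier-free guidance: blend conditional and unconditional noise
    predictions. Larger scales push the sample harder toward the audio/text
    condition (tighter sync), but overshoot exaggerates expressions."""
    eps_uncond = model(x, None)
    eps_cond = model(x, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy model: conditioning simply shifts the prediction by the condition vector.
toy = lambda x, c: x if c is None else x + c
x = torch.zeros(3)
cond = torch.ones(3)
print(cfg_predict(toy, x, cond, guidance_scale=2.0))  # tensor([2., 2., 2.])
```

With scale 1.0 the output equals the plain conditional prediction; scale 2.0 doubles the conditional shift, which is exactly the amplification that, taken too far, produces the overly exaggerated expressions the ablation reports.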
Technical challenges and future directions
While OmniAvatar has made significant progress, it still faces a number of technical challenges:
Technical limitations:
- Long-video stability: inherits the base model's color-drift issue in long video generation
- Multi-person interaction: control of multi-character scenarios needs strengthening
- Real-time performance: high inference latency makes real-time applications difficult
- Speaker identification: distinguishing identities in multi-speaker scenarios needs improvement
Directions for development: Future work will focus on improving long-video stability, enhancing control of multi-person interactions, optimizing inference speed for real-time applications, and improving speaker identification accuracy.
Concluding remarks
OmniAvatar represents an important milestone in audio-driven digital human technology. Its breakthroughs in full-body animation generation, mouth synchronization accuracy, and text control capability have laid a solid foundation for the industrialized application of digital human technology. With the continuous improvement and optimization of the technology, we have reason to believe that a more intelligent and natural digital human interaction experience will soon become a reality.
Project open source address: https://github.com/Omni-Avatar/OmniAvatar
Link to paper: https://arxiv.org/abs/2506.18866v1
Project home page: https://omni-avatar.github.io/