The line between real video and AI-generated video is getting blurrier, spurred on by OmniHuman, the latest release from ByteDance.
The new AI system is able to generate realistic videos of human movement and speech based on a single photograph.
What is OmniHuman?
OmniHuman is a new AI system from ByteDance that turns a photograph into multiple realistic videos of people speaking, singing and moving.
OmniHuman can generate full-body videos whose movement syncs with speech, a significant advance over previous AI models that could only animate upper bodies.
Read: Meet Friend: The AI Wearable That's Always Listening
OmniHuman can also generate singing, supporting various music genres and handling a range of body poses and singing styles.
How does OmniHuman work?
More specifically, OmniHuman is a deep learning-based video generation model.
Deep learning is a branch of artificial intelligence that processes information in a way loosely modelled on the human brain.
It is built on artificial neural networks: layers of interconnected nodes that pass information between them and learn to recognize patterns in data.
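For readers who want a concrete picture of what "layers of interconnected nodes" means, here is a minimal toy sketch in Python. It is purely illustrative and has nothing to do with OmniHuman's actual architecture, which is far larger and has not been released.

```python
import numpy as np

# A tiny two-layer neural network: each layer multiplies its inputs by learned
# weights and passes the result through a non-linear activation.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # layer 1 weights: 4 inputs -> 8 hidden nodes
W2 = rng.normal(size=(8, 2))   # layer 2 weights: 8 hidden nodes -> 2 outputs

def forward(x):
    hidden = np.maximum(0, x @ W1)   # ReLU activation: keep only positive signals
    return hidden @ W2               # output layer combines the hidden features

x = rng.normal(size=(1, 4))          # one example with 4 input features
print(forward(x))                    # two output values scoring the input pattern
```

In a real system, training adjusts the weight matrices so the outputs get closer to the desired answers for each example in the data.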
Read: What is Google Veo? Inside the AI Video Generator
OmniHuman was trained on over 18,700 hours of human video data. This data included multiple inputs such as text, audio and body movements.
This combination, known internally as the “omni-conditions” training strategy, is what lets the model learn more holistically than previous models and is where OmniHuman gets its name.
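To make the idea concrete, here is a simplified sketch of what a multi-condition training sample might look like. ByteDance has not published its data pipeline, so the structure and field names below are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical structure: one training sample bundling several optional
# conditioning signals alongside the target video.
@dataclass
class TrainingSample:
    reference_image: bytes            # the single photo the video is anchored to
    video_frames: bytes               # the target video clip
    audio: Optional[bytes] = None     # speech or singing track, if available
    text: Optional[str] = None        # caption or transcript, if available
    pose: Optional[bytes] = None      # body-movement/pose data, if available

def conditions_present(sample: TrainingSample) -> list[str]:
    """List which conditioning signals a sample carries; mixing samples with
    different combinations is the gist of an omni-conditions style of training."""
    return [name for name in ("audio", "text", "pose")
            if getattr(sample, name) is not None]
```

Because samples with different combinations of signals can all be used, the model is not limited to clips that happen to have every input at once, which lets it draw on far more training data.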
What is ByteDance?
ByteDance is a Chinese internet tech company that is best known for developing the very popular video-sharing app TikTok.
ByteDance has faced controversy and scrutiny over data privacy concerns, censorship issues, and its connections to the Chinese government.
Read: What is Sora AI and Is It Safe to Use?
Though ByteDance states that all images in OmniHuman demos were sourced from ‘public sources or generated by models’, there is currently no disclosure of the origin of the 18,700 hours of human video data that trained the model.