
From self-driving cars to household assistants, it's clear the next hurdle AI must clear is interacting with the real, physical world.
Meta’s latest innovation, V-JEPA 2, takes us one step closer to a world enhanced by advanced machine intelligence.
We’ve got you covered with this comprehensive guide to V-JEPA 2, Meta's world model that thinks before it acts, including how to use it and whether it’s safe.
What is V-JEPA 2?
V-JEPA 2 is a state-of-the-art AI model from Meta. The name stands for Video Joint Embedding Predictive Architecture 2.
It’s trained on video, helping robots and AI agents not only to better understand the physical world but also to predict how real-world objects will respond to the actions they take.
This capability is vital to building AI agents that think things through before they act.
Read: What Happened to the Metaverse? How Zuck's VR Dream Died
Ultimately, the introduction of V-JEPA 2 represents progress towards one of Meta’s key goals: advanced machine intelligence.
Similar to artificial general intelligence, advanced machine intelligence is a high-level AI system that possesses the capability to perform complex cognitive tasks with or without human-like reasoning.
While AGI specifically aims to replicate human thinking and learning across any task, AMI strives for systems that can learn from data, comprehend context, anticipate future scenarios, and make independent, rational decisions.
Read: Google DeepMind To Power Physical Robots With New Gemini Robotics Models
V-JEPA 2 is a “world model”: it builds an internal representation of its environment that allows it not only to understand that environment but also to make predictions and plans about it.
It’s trained on massive amounts of video data that teach V-JEPA 2 the physical rules of how objects move, how forces affect them, how people interact with them, and how events unfold in sequence.
This allows the model to develop a “common sense” about the physical world.
Through this, it learns what will happen when a given action is taken, and can predict the outcomes of similar actions.
V-JEPA 2 builds upon Meta's existing JEPA framework. JEPA stands for Joint Embedding Predictive Architecture.
This architecture uses an encoder to transform video into meaningful embeddings, and a predictor that uses those embeddings to forecast what comes next.
The model has 1.2 billion parameters, meaning it has 1.2 billion adjustable values that it learns and optimizes throughout its training process.
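The encoder-plus-predictor idea can be sketched with a toy example. This is not Meta's code: the dimensions, data, and training loop below are purely illustrative, shrunk down to two small matrices so the key point stands out, namely that prediction and learning happen in embedding space rather than pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy JEPA-style sketch (illustrative only, not Meta's implementation):
# an "encoder" maps each frame to an embedding, and a "predictor"
# forecasts the NEXT frame's embedding from the current one.
FRAME_DIM, EMBED_DIM = 16, 4
W_enc = rng.normal(size=(FRAME_DIM, EMBED_DIM)) * 0.1   # frozen encoder weights
W_pred = rng.normal(size=(EMBED_DIM, EMBED_DIM)) * 0.1  # learned predictor weights

def encode(frame):
    """Project a raw frame into the embedding space."""
    return frame @ W_enc

def predict(embedding):
    """Forecast the next frame's embedding from the current one."""
    return embedding @ W_pred

# Synthetic "video": each frame drifts slightly from the previous one,
# standing in for the smooth physical dynamics real video exhibits.
frames = [rng.normal(size=FRAME_DIM)]
for _ in range(31):
    frames.append(frames[-1] + 0.05 * rng.normal(size=FRAME_DIM))

# Train the predictor by gradient descent on the embedding-space
# squared error between predicted and actual next-frame embeddings.
lr = 0.05
for epoch in range(200):
    for t in range(len(frames) - 1):
        z_t = encode(frames[t])
        z_next = encode(frames[t + 1])
        err = predict(z_t) - z_next        # error measured in embedding space
        W_pred -= lr * np.outer(z_t, err)  # gradient step on the predictor

final_err = np.mean((predict(encode(frames[0])) - encode(frames[1])) ** 2)
print(f"embedding-space error after training: {final_err:.4f}")
```

Because consecutive frames are similar, the predictor quickly learns a mapping close to the identity, and the embedding-space error shrinks; the full-scale model applies the same principle with deep networks and 1.2 billion parameters.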
How to Use V-JEPA 2?
The easiest way to get your hands on V-JEPA 2 is through its open-source release from Meta AI. You’ll work with its code and pre-trained components.
- Visit Meta’s dedicated page for V-JEPA 2
- Click the blue download now button.
- You will be redirected to the GitHub Repository.
- Use Python and the PyTorch library to download and load the model.
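In practice, the steps above might look something like the following. This is a sketch only: the repository URL and dependency file are assumptions based on Meta's usual GitHub releases, so check the actual README for the exact commands.

```shell
# Hypothetical setup commands -- verify names against the real repository README.
git clone https://github.com/facebookresearch/vjepa2.git
cd vjepa2
pip install -r requirements.txt   # assumed dependency file; PyTorch among them
```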
Is V-JEPA 2 Safe?
In its current form, V-JEPA 2 is safe, because it operates as a research model rather than a product.
It's not a fully autonomous system being publicly deployed; its release is aimed at advancing the field of AI and robotics.