EgoTwin: Dreaming Body and View in First Person

Introduction

Given a static human pose and an initial scene observation, our goal is to generate synchronized sequences of
egocentric video and human motion, guided by the textual description.

Challenges

Viewpoint Alignment (the camera viewpoint must align with the head motion)

Throughout the sequence, the camera trajectory captured in egocentric video must precisely align with the head
trajectory derived from human motion.

Causal Interplay (video frames correspond to the human motion)

At each time step, the current visual frame provides spatial context that shapes human motion synthesis; conversely, the
newly generated motion influences subsequent video frames.

e.g. Opening a door:
The door's position in the current video frame influences the next action (opening the door); that action determines how the body pose and camera position change, which in turn influences the generation of subsequent video frames.

Related Work

Video Generation

Methods

  • Early Work: Augment UNet-based text-to-image (T2I) models with temporal modeling layers
  • Recent Work: Transformer-based architectures with improved temporal consistency and generation quality

Camera Control

  • Inject camera parameters into pretrained video diffusion models
  • rely on known camera trajectories

EgoTwin: the camera trajectory is not known in advance

  • Must maintain consistency with other synthesized content that is strongly correlated with the underlying camera
    motion (the camera information has to be inferred by the model itself, and camera-correlated content must remain consistent)

Motion Generation

i.e. Generating realistic and diverse human motions from text.

Methods

  • Early Work: Temporal VAE
  • Recent Work
      • Diffusion models: operate on continuous vectors, either in the latent space of a VAE or directly on raw motion sequences
      • Autoregressive models: discretize motion into tokens using vector quantization techniques
      • Generative masked models
      • Hybrid approaches

EgoTwin

  • observe the scene only once from the initial human pose

Multimodal Generation

  • Others
      • Audio-video
      • Text and images
      • Motion and frame-level language descriptions
  • EgoTwin
      • Joint modeling of human motion and its corresponding egocentric views

Methodology

Problem Definition

Input

  • initial human pose $P^0 \in \mathbb{R}^{J \times 3}$
  • egocentric observation $I^0 \in \mathbb{R}^{H \times W \times 3}$
  • textual description

Output

  • a human pose sequence $P^{1:N_m} \in \mathbb{R}^{N_m \times J \times 3}$
  • an egocentric view sequence $I^{1:N_v} \in \mathbb{R}^{N_v \times H \times W \times 3}$

$J$: number of joints
$N_m$: number of motion frames
$N_v$: number of video frames

Framework

(Figure: framework overview)

"Text"cRLt×Dt\text{"Text"} \to c \in \mathbb{R}^{L_t \times D_t}

"Video":I0RH×W×3zvR(Nv/4+1)×H/8×W/8×CvXvRNv×Dv\text{"Video"} : I^0 \in \mathbb{R}^{H \times W \times 3} \to z_v \in \mathbb{R}^{(N_v/4+1) \times H/8 \times W/8 \times C_v} \to X_v \in \mathbb{R}^{N_v \times D_v}


Pose: overparameterized canonical pose representation

i.e. $(\dot{r}^a, \dot{r}^{xz}, r^y, j^p, j^v, j^r, c^f)$

Not suitable for the task


Reason:
Mathematically, recovering the head joint pose requires integrating root velocities to obtain the root pose, then applying forward kinematics (FK) to propagate transformations through the kinematic chain to the head joint.

Too complex!
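A rough sketch of that recovery path, assuming the usual HumanML3D-style reading of the tuple ($\dot{r}^a$: root yaw rate, $\dot{r}^{xz}$: planar root velocity, $r^y$: root height); per-joint rotations from $j^r$ are omitted, so this only shows the structure of the computation, not exact kinematics.

```python
import numpy as np

def recover_head_positions(r_dot_a, r_dot_xz, r_y, offsets, head_chain):
    """r_dot_a: (N,) root yaw rate; r_dot_xz: (N, 2) planar root velocity; r_y: (N,) root height;
    offsets: dict joint -> (3,) bone offset in the parent frame; head_chain: joints pelvis -> head."""
    yaw, pos_xz, heads = 0.0, np.zeros(2), []
    for t in range(len(r_dot_a)):
        yaw += r_dot_a[t]                                     # integrate root yaw rate over time
        c, s = np.cos(yaw), np.sin(yaw)
        pos_xz += np.array([[c, -s], [s, c]]) @ r_dot_xz[t]   # rotate velocity into world frame, integrate
        p = np.array([pos_xz[0], r_y[t], pos_xz[1]])          # recovered root position
        R = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])  # yaw-only root rotation (simplified)
        for j in head_chain:                                  # FK: walk the kinematic chain to the head
            p = p + R @ offsets[j]
        heads.append(p)
    return np.stack(heads)
```

The point is that every head query requires integrating over all previous frames and then traversing the kinematic chain, so the head trajectory is never directly exposed, which motivates the head-centric representation below.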


Head-centric motion representation

(Figure: head-centric motion representation)

  • Explicitly exposes egocentric information

Motion VAE

  • Using 1D causal convolutions
  • Encoder and decoder symmetrically structured, each comprising two stages of 2× downsampling or upsampling, interleaved
    with ResNet blocks
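A minimal sketch of such an encoder, assuming illustrative channel widths and kernel size 3 (neither is specified here); only the causal padding and the two 2× downsampling stages reflect the description above.

```python
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution padded only on the left, so frame t never sees frames > t."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.pad = k - 1                       # left padding of k-1 keeps the convolution causal
        self.conv = nn.Conv1d(c_in, c_out, k, stride=stride)
    def forward(self, x):                      # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class ResBlock1d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(CausalConv1d(c, c), nn.SiLU(), CausalConv1d(c, c))
    def forward(self, x):
        return x + self.net(x)

def make_encoder(d_in, width=256, latent_ch=16):   # latent_ch ~ 2*C_m for (mu, logvar); illustrative
    return nn.Sequential(
        CausalConv1d(d_in, width),
        ResBlock1d(width), CausalConv1d(width, width, stride=2),   # stage 1: 2x temporal downsample
        ResBlock1d(width), CausalConv1d(width, width, stride=2),   # stage 2: 2x temporal downsample
        ResBlock1d(width), CausalConv1d(width, latent_ch),
    )
```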

VAE loss computed separately for the 3D head ($h^p, \dot{h}^p$), 6D head ($h^r, \dot{h}^r$), 3D joint ($j^p, j^v$),
and 6D joint ($j^r$) components:

$$\mathcal{L}_{VAE} = \frac{1}{4} \sum_c \big( \mathcal{L}^{(c)}_{rec} + \lambda_{KL} \mathcal{L}^{(c)}_{KL} \big), \quad c \in \{\text{head}_{3D}, \text{head}_{6D}, \text{joint}_{3D}, \text{joint}_{6D}\}$$
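A sketch of this loss, assuming mean-squared reconstruction and a standard Gaussian KL per component; the exact form of $\mathcal{L}_{rec}$, the value of $\lambda_{KL}$, and the per-component split of the latent statistics are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, lambda_kl=1e-4):
    """Average of per-component (reconstruction + weighted KL) terms.
    recon/target/mu/logvar are dicts keyed by the four component groups above."""
    comps = ["head_3d", "head_6d", "joint_3d", "joint_6d"]
    total = 0.0
    for c in comps:
        rec = F.mse_loss(recon[c], target[c])                                        # L_rec^(c)
        kl = -0.5 * torch.mean(1 + logvar[c] - mu[c].pow(2) - logvar[c].exp())       # L_KL^(c)
        total = total + rec + lambda_kl * kl
    return total / len(comps)
```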

Finally:

"Pose":P0RJ×3ZmR(Nm/4+1)×CmXmRLm×Dm\text{"Pose"} : P^0 \in \mathbb{R}^{J \times 3} \to Z_m \in \mathbb{R}^{(N_m/4+1) \times C_m} \to X_m \in \mathbb{R}^{L_m \times D_m}

Diffusion Transformer

CogVideoX

  • Text and video branch: CogVideoX (shared weights between text and video)
  • Motion: lower-level parts

Interaction Mechanism

Global-level cross-modal consistency

  • Sinusoidal positional encodings for both video and motion tokens
  • 3D rotary position embeddings (RoPE) for video tokens

Not enough: Each video frame must be temporally aligned with the corresponding motion frame.


Structured joint attention mask

$N_m = 2N_v$

Rewrite $I^i$ as the observation $O^i$, and $(P^{2i+1}, P^{2i+2})$ as the (chunked) action $A^i$, where $i \in [0, N_v - 1]$

  • $\{O^i, A^i\} \to O^{i+1}$
  • $\{O^i, O^{i+1}\} \to A^{i+1}$

Rules (see the mask-building sketch below):

  • $O^i$ → may attend only to $A^{i-1}$ across modalities (the previous action produced the current observation)
  • $A^i$ → may attend to $O^i$ and $O^{i+1}$ (the observations before and after determine the current action)
  • Text ↔ any modality: fully open
  • Same-modality self-attention: fully open
  • All other cross-modal attention is blocked

(Figure: structured joint attention mask)
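A minimal sketch of this mask, using one token per $O^i$ / $A^i$ for clarity (in practice each spans many latent tokens) and following the rules as listed; the exact index convention should be checked against the paper.

```python
import torch

def build_joint_attention_mask(n_text: int, n_v: int) -> torch.Tensor:
    """Boolean mask (True = may attend); rows are queries, columns are keys.
    Token order: [text (n_text), O^0..O^{n_v-1}, A^0..A^{n_v-1}]."""
    n = n_text + 2 * n_v
    obs = lambda i: n_text + i            # index of observation token O^i
    act = lambda i: n_text + n_v + i      # index of action token A^i

    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_text, :] = True                                    # text attends to everything
    mask[:, :n_text] = True                                    # everything attends to text
    mask[n_text:n_text + n_v, n_text:n_text + n_v] = True      # O <-> O self-attention open
    mask[n_text + n_v:, n_text + n_v:] = True                  # A <-> A self-attention open
    for i in range(n_v):
        if i > 0:
            mask[obs(i), act(i - 1)] = True                    # O^i sees the action that produced it
        mask[act(i), obs(i)] = True                            # A^i sees the observation before it
        if i + 1 < n_v:
            mask[act(i), obs(i + 1)] = True                    # ...and the observation after it
    return mask
```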


Asynchronous Diffusion

From Kimi:
If we force $t_v = t_m$, then:

  • For motion: a large step is taken every 62.5 ms → over-denoising, and high-frequency details (hand jitter, foot contacts) get smoothed away;
  • For video: a small step is taken only every 125 ms → under-denoising, with frame-to-frame flicker and motion blur.

Conclusion: a single shared noise schedule degrades both modalities at their different frame rates; only by sampling $t_v$ and $t_m$ independently can each modality use step sizes suited to its own frame rate.


Video denoiser

$$\epsilon^v_\theta (z_v^{t_v}, z_m^{t_m}, c, t_v, t_m)$$

Motion denoiser

$$\epsilon^m_\theta (z_m^{t_m}, z_v^{t_v}, c, t_m, t_v)$$

  • $t_v, t_m$: timesteps for video and motion, respectively
  • $z_v^{t_v}$: noised video latent at timestep $t_v$
  • $z_m^{t_m}$: noised motion latent at timestep $t_m$

$$\mathcal{L}_{DiT} = \mathbb{E}_{\epsilon_v, \epsilon_m, c, t_v, t_m} \left[ \| \epsilon_v - \epsilon^v_\theta (z_v^{t_v}, z_m^{t_m}, c, t_v, t_m) \|_2^2 + \| \epsilon_m - \epsilon^m_\theta (z_m^{t_m}, z_v^{t_v}, c, t_m, t_v) \|_2^2 \right]$$
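A sketch of one training step matching this loss, with $t_v$ and $t_m$ sampled independently; `scheduler.add_noise(x0, eps, t)` stands in for the usual forward process $q(x_t \mid x_0)$, and the denoisers follow the signatures above.

```python
import torch

def dit_loss_step(denoiser_v, denoiser_m, z_v0, z_m0, c, scheduler, T=1000):
    b = z_v0.shape[0]
    t_v = torch.randint(0, T, (b,), device=z_v0.device)   # video timestep
    t_m = torch.randint(0, T, (b,), device=z_m0.device)   # motion timestep, sampled independently
    eps_v, eps_m = torch.randn_like(z_v0), torch.randn_like(z_m0)
    z_vt = scheduler.add_noise(z_v0, eps_v, t_v)
    z_mt = scheduler.add_noise(z_m0, eps_m, t_m)
    # each denoiser sees both (noised) modalities and both timesteps
    pred_v = denoiser_v(z_vt, z_mt, c, t_v, t_m)
    pred_m = denoiser_m(z_mt, z_vt, c, t_m, t_v)
    return torch.mean((eps_v - pred_v) ** 2) + torch.mean((eps_m - pred_m) ** 2)
```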

Training

  • Stage 1: Motion VAE Training
  • Stage 2: T2M Pretraining
      • Pretrained on the text-to-motion task, using only text and motion embeddings as input
      • Omits the much longer video embeddings
      • Text condition dropped 10% of the time to enable CFG (see the sketch below)
  • Stage 3: Joint Text-Video-Motion Training
      • Learns the joint distribution of video and motion conditioned on text
      • Text condition dropped 10% of the time to enable CFG
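A sketch of the 10% condition dropout used in stages 2 and 3 so the model also learns the unconditional branch needed for CFG; `null_embedding` (the empty-prompt embedding) is an assumed detail.

```python
import torch

def drop_text_condition(c, null_embedding, p_drop=0.1):
    """c: (B, L_t, D_t) text embeddings; with probability p_drop, replace a sample's
    text embedding with the null embedding."""
    drop = torch.rand(c.shape[0], device=c.device) < p_drop          # (B,) dropout mask
    return torch.where(drop[:, None, None], null_embedding, c)
```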

Sampling Strategy

  • "Text""Video, Motion"\text{"Text"} \to \text{"Video, Motion"}
  • "Text, Motion""Video"\text{"Text, Motion"} \to \text{"Video"}
  • "Text, Video""Motion"\text{"Text, Video"} \to \text{"Motion"}

CFG for TM2V (Text, Motion → Video):

$$\hat{\epsilon}^v_\theta(z_v^t, z_m^0, c, t, 0) = \underbrace{\epsilon^v_\theta (z_v^t, z_m^T, \phi, t, T)}_{\text{unconditional}} + \underbrace{w_t \big( \epsilon^v_\theta (z_v^t, z_m^T, c, t, T) - \epsilon^v_\theta (z_v^t, z_m^T, \phi, t, T) \big)}_{\text{text guidance}} + \underbrace{w_v \big( \epsilon^v_\theta (z_v^t, z_m^0, c, t, 0) - \epsilon^v_\theta (z_v^t, z_m^T, c, t, T) \big)}_{\text{motion guidance}}$$
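A sketch of how the three denoiser calls above combine at each sampling step; the guidance weights `w_t`, `w_v` are illustrative values, not from the notes.

```python
def cfg_tm2v(eps_v, z_vt, z_m0, z_mT, c, null_c, t, T, w_t=7.5, w_v=2.0):
    """eps_v is the video denoiser epsilon^v_theta; z_m0 is the clean motion latent
    (t_m = 0), z_mT the fully noised one (t_m = T), null_c the empty-text embedding."""
    uncond      = eps_v(z_vt, z_mT, null_c, t, T)   # no text, fully noised motion
    text_cond   = eps_v(z_vt, z_mT, c,      t, T)   # text only
    motion_cond = eps_v(z_vt, z_m0, c,      t, 0)   # text + clean motion
    return uncond + w_t * (text_cond - uncond) + w_v * (motion_cond - text_cond)
```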

