EgoTwin: Dreaming Body and View in First Person

Introduction

Given a static human pose and an initial scene observation, our goal is to generate synchronized sequences of
egocentric video and human motion, guided by the textual description.

Challenges

Viewpoint Alignment (the camera viewpoint must align with the head motion)

Throughout the sequence, the camera trajectory captured in egocentric video must precisely align with the head
trajectory derived from human motion.

Causal Interplay (video frames correspond to the human motion)

At each time step, the current visual frame provides spatial context that shapes human motion synthesis; conversely, the
newly generated motion influences subsequent video frames.

e.g. Opening a door:
The door's position in the current video frame influences the next action (opening the door); that action determines how the body pose and camera position change, which in turn influences the generation of subsequent video frames.

Related Work

Video Generation

Methods

  • Early Work: Augment UNet-based text-to-image (T2I) models with temporal modeling layers
  • Recent Work: Transformer-based architectures with improved temporal consistency and generation quality

Camera Control

  • Inject camera parameters into pretrained video diffusion models
  • rely on known camera trajectories

EgoTwin: the camera trajectory is not known in advance

  • Must maintain consistency with other synthesized content that is strongly correlated with the underlying camera
    motion (the camera information has to be inferred by the model itself, and camera-correlated content must remain consistent)

Motion Generation

i.e. Generating realistic and diverse human motions from text.

Methods

  • Early Work: Temporal VAE
  • Recent Work
      • Diffusion models: operate on continuous vectors, either in the latent space of a VAE or directly on raw motion sequences
      • Autoregressive models: discretize motion into tokens using vector quantization techniques
      • Generative masked models
      • Hybrid approaches

EgoTwin

  • observe the scene only once from the initial human pose

Multimodal Generation

  • Others
      • Audio-video
      • Text and images
      • Motion and frame-level language descriptions
  • EgoTwin
      • Joint modeling of human motion and its corresponding egocentric views

Methodology

Problem Definition

Input

  • initial human pose $P^0 \in \mathbb{R}^{J \times 3}$
  • egocentric observation $I^0 \in \mathbb{R}^{H \times W \times 3}$
  • textual description

Output

  • a human pose sequence $P^{1:N_m} \in \mathbb{R}^{N_m \times J \times 3}$
  • an egocentric view sequence $I^{1:N_v} \in \mathbb{R}^{N_v \times H \times W \times 3}$

$J$: number of joints
$N_m$: number of motion frames
$N_v$: number of video frames

Framework

(Figure: framework overview)

"Text"cRLt×Dt\text{"Text"} \to c \in \mathbb{R}^{L_t \times D_t}

"Video":I0RH×W×3zvR(Nv/4+1)×H/8×W/8×CvXvRNv×Dv\text{"Video"} : I^0 \in \mathbb{R}^{H \times W \times 3} \to z_v \in \mathbb{R}^{(N_v/4+1) \times H/8 \times W/8 \times C_v} \to X_v \in \mathbb{R}^{N_v \times D_v}


Pose: overparameterized canonical pose representation

i.e. $(\dot{r}^a, \dot{r}^{xz}, r^y, j^p, j^v, j^r, c^f)$

Not suitable for the task


Reason:
Mathematically, recovering the head joint pose requires integrating root velocities to obtain the root pose, then applying forward kinematics (FK) to propagate transformations through the kinematic chain to the head joint.

Too complex!
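A rough sketch of that recovery path, assuming the usual HumanML3D-style reading of the tuple ($\dot{r}^a$: root yaw rate, $\dot{r}^{xz}$: planar root velocity, $r^y$: root height); per-joint rotations from $j^r$ are omitted, so this only shows the structure of the computation, not exact kinematics.

```python
import numpy as np

def recover_head_positions(r_dot_a, r_dot_xz, r_y, offsets, head_chain):
    """r_dot_a: (N,) root yaw rate; r_dot_xz: (N, 2) planar root velocity; r_y: (N,) root height;
    offsets: dict joint -> (3,) bone offset in the parent frame; head_chain: joints pelvis -> head."""
    yaw, pos_xz, heads = 0.0, np.zeros(2), []
    for t in range(len(r_dot_a)):
        yaw += r_dot_a[t]                                     # integrate root yaw rate over time
        c, s = np.cos(yaw), np.sin(yaw)
        pos_xz += np.array([[c, -s], [s, c]]) @ r_dot_xz[t]   # rotate velocity into world frame, integrate
        p = np.array([pos_xz[0], r_y[t], pos_xz[1]])          # recovered root position
        R = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])  # yaw-only root rotation (simplified)
        for j in head_chain:                                  # FK: walk the kinematic chain to the head
            p = p + R @ offsets[j]
        heads.append(p)
    return np.stack(heads)
```

The point is that every head query requires integrating over all previous frames and then traversing the kinematic chain, so the head trajectory is never directly exposed, which motivates the head-centric representation below.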


Head-centric motion representation

(Figure: head-centric motion representation)

  • Explicitly exposes egocentric information

Motion VAE

  • Using 1D causal convolutions
  • Encoder and decoder symmetrically structured, each comprising two stages of 2× downsampling or upsampling, interleaved
    with ResNet blocks
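A minimal sketch of such an encoder, assuming illustrative channel widths and kernel size 3 (neither is specified here); only the causal padding and the two 2× downsampling stages reflect the description above.

```python
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution padded only on the left, so frame t never sees frames > t."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.pad = k - 1                       # left padding of k-1 keeps the convolution causal
        self.conv = nn.Conv1d(c_in, c_out, k, stride=stride)
    def forward(self, x):                      # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class ResBlock1d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(CausalConv1d(c, c), nn.SiLU(), CausalConv1d(c, c))
    def forward(self, x):
        return x + self.net(x)

def make_encoder(d_in, width=256, latent_ch=16):   # latent_ch ~ 2*C_m for (mu, logvar); illustrative
    return nn.Sequential(
        CausalConv1d(d_in, width),
        ResBlock1d(width), CausalConv1d(width, width, stride=2),   # stage 1: 2x temporal downsample
        ResBlock1d(width), CausalConv1d(width, width, stride=2),   # stage 2: 2x temporal downsample
        ResBlock1d(width), CausalConv1d(width, latent_ch),
    )
```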

VAE loss computed separately for the 3D head ($h^p, \dot{h}^p$), 6D head ($h^r, \dot{h}^r$), 3D joint ($j^p, j^v$),
and 6D joint ($j^r$) components:

$$\mathcal{L}_{VAE} = \frac{1}{4} \sum_c \big( \mathcal{L}^{(c)}_{rec} + \lambda_{KL} \mathcal{L}^{(c)}_{KL} \big), \quad c \in \{\text{head}_{3D}, \text{head}_{6D}, \text{joint}_{3D}, \text{joint}_{6D}\}$$
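A sketch of this loss, assuming mean-squared reconstruction and a standard Gaussian KL per component; the exact form of $\mathcal{L}_{rec}$, the value of $\lambda_{KL}$, and the per-component split of the latent statistics are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, lambda_kl=1e-4):
    """Average of per-component (reconstruction + weighted KL) terms.
    recon/target/mu/logvar are dicts keyed by the four component groups above."""
    comps = ["head_3d", "head_6d", "joint_3d", "joint_6d"]
    total = 0.0
    for c in comps:
        rec = F.mse_loss(recon[c], target[c])                                        # L_rec^(c)
        kl = -0.5 * torch.mean(1 + logvar[c] - mu[c].pow(2) - logvar[c].exp())       # L_KL^(c)
        total = total + rec + lambda_kl * kl
    return total / len(comps)
```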

Finally:

"Pose":P0RJ×3ZmR(Nm/4+1)×CmXmRLm×Dm\text{"Pose"} : P^0 \in \mathbb{R}^{J \times 3} \to Z_m \in \mathbb{R}^{(N_m/4+1) \times C_m} \to X_m \in \mathbb{R}^{L_m \times D_m}

Diffusion Transformer

CogVideoX

  • Text and video branch: CogVideoX (shared weights between text and video)
  • Motion: lower-level parts

Interaction Mechanism

Global-level cross-modal consistency

  • Sinusoidal positional encodings for both video and motion tokens
  • 3D rotary position embeddings (RoPE) for video tokens

Not enough: Each video frame must be temporally aligned with the corresponding motion frame.


Structured joint attention mask

$N_m = 2N_v$

Rewrite $I^i$ as the observation $O^i$, and $(P^{2i+1}, P^{2i+2})$ as the (chunked) action $A^i$, where $i \in [0, N_v - 1]$

  • $\{O^i, A^i\} \to O^{i+1}$
  • $\{O^i, O^{i+1}\} \to A^{i+1}$

Rules (see the mask-building sketch below):

  • $O^i$ → may attend only to $A^{i-1}$ across modalities (the previous action produced the current observation)
  • $A^i$ → may attend to $O^i$ and $O^{i+1}$ (the observations before and after determine the current action)
  • Text ↔ any modality: fully open
  • Same-modality self-attention: fully open
  • All other cross-modal attention is blocked

(Figure: structured joint attention mask)
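A minimal sketch of this mask, using one token per $O^i$ / $A^i$ for clarity (in practice each spans many latent tokens) and following the rules as listed; the exact index convention should be checked against the paper.

```python
import torch

def build_joint_attention_mask(n_text: int, n_v: int) -> torch.Tensor:
    """Boolean mask (True = may attend); rows are queries, columns are keys.
    Token order: [text (n_text), O^0..O^{n_v-1}, A^0..A^{n_v-1}]."""
    n = n_text + 2 * n_v
    obs = lambda i: n_text + i            # index of observation token O^i
    act = lambda i: n_text + n_v + i      # index of action token A^i

    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_text, :] = True                                    # text attends to everything
    mask[:, :n_text] = True                                    # everything attends to text
    mask[n_text:n_text + n_v, n_text:n_text + n_v] = True      # O <-> O self-attention open
    mask[n_text + n_v:, n_text + n_v:] = True                  # A <-> A self-attention open
    for i in range(n_v):
        if i > 0:
            mask[obs(i), act(i - 1)] = True                    # O^i sees the action that produced it
        mask[act(i), obs(i)] = True                            # A^i sees the observation before it
        if i + 1 < n_v:
            mask[act(i), obs(i + 1)] = True                    # ...and the observation after it
    return mask
```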


Asynchronous Diffusion

From Kimi:
If we force $t_v = t_m$, then:

  • For motion: a large step is taken every 62.5 ms → over-denoising, and high-frequency details (hand jitter, foot contacts) get smoothed away;
  • For video: a small step is taken only every 125 ms → under-denoising, with frame-to-frame flicker and motion blur.

Conclusion: a single shared noise schedule degrades both modalities at their different frame rates; only by sampling $t_v$ and $t_m$ independently can each modality use step sizes suited to its own frame rate.


Video denoiser

$$\epsilon^v_\theta (z_v^{t_v}, z_m^{t_m}, c, t_v, t_m)$$

Motion denoiser

$$\epsilon^m_\theta (z_m^{t_m}, z_v^{t_v}, c, t_m, t_v)$$

  • $t_v, t_m$: timesteps for video and motion, respectively
  • $z_v^{t_v}$: noised video latent at timestep $t_v$
  • $z_m^{t_m}$: noised motion latent at timestep $t_m$

$$\mathcal{L}_{DiT} = \mathbb{E}_{\epsilon_v, \epsilon_m, c, t_v, t_m} \left[ \| \epsilon_v - \epsilon^v_\theta (z_v^{t_v}, z_m^{t_m}, c, t_v, t_m) \|_2^2 + \| \epsilon_m - \epsilon^m_\theta (z_m^{t_m}, z_v^{t_v}, c, t_m, t_v) \|_2^2 \right]$$
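A sketch of one training step matching this loss, with $t_v$ and $t_m$ sampled independently; `scheduler.add_noise(x0, eps, t)` stands in for the usual forward process $q(x_t \mid x_0)$, and the denoisers follow the signatures above.

```python
import torch

def dit_loss_step(denoiser_v, denoiser_m, z_v0, z_m0, c, scheduler, T=1000):
    b = z_v0.shape[0]
    t_v = torch.randint(0, T, (b,), device=z_v0.device)   # video timestep
    t_m = torch.randint(0, T, (b,), device=z_m0.device)   # motion timestep, sampled independently
    eps_v, eps_m = torch.randn_like(z_v0), torch.randn_like(z_m0)
    z_vt = scheduler.add_noise(z_v0, eps_v, t_v)
    z_mt = scheduler.add_noise(z_m0, eps_m, t_m)
    # each denoiser sees both (noised) modalities and both timesteps
    pred_v = denoiser_v(z_vt, z_mt, c, t_v, t_m)
    pred_m = denoiser_m(z_mt, z_vt, c, t_m, t_v)
    return torch.mean((eps_v - pred_v) ** 2) + torch.mean((eps_m - pred_m) ** 2)
```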

Training

  • Stage 1: Motion VAE Training
  • Stage 2: T2M Pretraining
      • Pretrained on the text-to-motion task, using only text and motion embeddings as input
      • Omits the much longer video embeddings
      • Text condition dropped 10% of the time to enable CFG (see the sketch below)
  • Stage 3: Joint Text-Video-Motion Training
      • Learns the joint distribution of video and motion conditioned on text
      • Text condition dropped 10% of the time to enable CFG
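A sketch of the 10% condition dropout used in stages 2 and 3 so the model also learns the unconditional branch needed for CFG; `null_embedding` (the empty-prompt embedding) is an assumed detail.

```python
import torch

def drop_text_condition(c, null_embedding, p_drop=0.1):
    """c: (B, L_t, D_t) text embeddings; with probability p_drop, replace a sample's
    text embedding with the null embedding."""
    drop = torch.rand(c.shape[0], device=c.device) < p_drop          # (B,) dropout mask
    return torch.where(drop[:, None, None], null_embedding, c)
```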

Sampling Strategy

  • "Text""Video, Motion"\text{"Text"} \to \text{"Video, Motion"}
  • "Text, Motion""Video"\text{"Text, Motion"} \to \text{"Video"}
  • "Text, Video""Motion"\text{"Text, Video"} \to \text{"Motion"}

CFG for TM2V (Text, Motion → Video):

$$\hat{\epsilon}^v_\theta(z_v^t, z_m^0, c, t, 0) = \underbrace{\epsilon^v_\theta (z_v^t, z_m^T, \phi, t, T)}_{\text{unconditional}} + \underbrace{w_t \big( \epsilon^v_\theta (z_v^t, z_m^T, c, t, T) - \epsilon^v_\theta (z_v^t, z_m^T, \phi, t, T) \big)}_{\text{text guidance}} + \underbrace{w_v \big( \epsilon^v_\theta (z_v^t, z_m^0, c, t, 0) - \epsilon^v_\theta (z_v^t, z_m^T, c, t, T) \big)}_{\text{motion guidance}}$$
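A sketch of how the three denoiser calls above combine at each sampling step; the guidance weights `w_t`, `w_v` are illustrative values, not from the notes.

```python
def cfg_tm2v(eps_v, z_vt, z_m0, z_mT, c, null_c, t, T, w_t=7.5, w_v=2.0):
    """eps_v is the video denoiser epsilon^v_theta; z_m0 is the clean motion latent
    (t_m = 0), z_mT the fully noised one (t_m = T), null_c the empty-text embedding."""
    uncond      = eps_v(z_vt, z_mT, null_c, t, T)   # no text, fully noised motion
    text_cond   = eps_v(z_vt, z_mT, c,      t, T)   # text only
    motion_cond = eps_v(z_vt, z_m0, c,      t, 0)   # text + clean motion
    return uncond + w_t * (text_cond - uncond) + w_v * (motion_cond - text_cond)
```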

