Reconstructing Hands in 3D with Transformers

Introduction

The key to HaMeR’s success lies in scaling up the techniques for hand mesh recovery. More specifically, we scale
both the training data and the deep network architecture
used for 3D hand reconstruction.

  • Training Data

    • multiple available sources of data with hand annotations, including both studio/controlled datasets with 3D ground truth
    • in-the-wild datasets annotated with 2D keypoint locations
  • Network: a large-scale transformer architecture

Benchmarking

HInt

  • annotating hands from diverse image sources
    • 2D hand keypoint annotations
    • videos from YouTube
    • egocentric captures
  • controlled conditions ❌
  • in-the-wild ✔️

3D hand pose and shape estimation

  • regress MANO parameters from images
    • e.g., FrankMocap
  • regress the vertices of the MANO mesh directly
    • aligns better with the image evidence
    • but fails more often under occlusions and truncations

Hand datasets

3D

  • FreiHAND: captured in a multi-camera setting; focuses on varied hand poses as well as hands interacting with objects.
  • HO-3D and DexYCB: captured in a controlled setting with multiple cameras, but focus more specifically on hands interacting with objects.
  • InterHand2.6M: captured in a studio with a focus on two interacting hands.
  • Hand pose datasets captured in the Panoptic Studio offer 3D hand annotations.
  • AssemblyHands: annotates 3D hand poses for synchronized images from Assembly101, where participants assemble and disassemble take-apart toys in a multi-camera setting.

2D

  • COCO-WholeBody: provides hand annotations for the people in the COCO dataset.
  • Halpe: annotates hands in the HICO-DET dataset. Both source images from datasets that contain very few egocentric views or transitional moments.
  • HInt
    • images from both egocentric and third-person video datasets
    • more natural interactions with the world

Technical approach

MANO parametric hand model

input

  • pose $\theta \in \mathbb{R}^{48}$
  • shape $\beta \in \mathbb{R}^{10}$

function

  • the hand mesh, $M(\theta, \beta) \in \mathbb{R}^{V \times 3}$, with $V = 778$ vertices
  • the hand joints, $X \in \mathbb{R}^{K \times 3}$, with $K = 21$ joints
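
As a shape-level sketch of this interface (`mano_interface` is an illustrative placeholder, not the actual MANO implementation, which applies pose/shape blend shapes and linear blend skinning to a template mesh):

```python
import numpy as np

def mano_interface(theta: np.ndarray, beta: np.ndarray):
    """Shape-level placeholder for the MANO function.

    theta: (48,) axis-angle pose (global rotation + 15 hand joints)
    beta:  (10,) shape coefficients
    """
    assert theta.shape == (48,) and beta.shape == (10,)
    M = np.zeros((778, 3))  # mesh vertices M(theta, beta) in R^{V x 3}, V = 778
    X = np.zeros((21, 3))   # 3D joints X in R^{K x 3}, K = 21
    return M, X
```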

Hand mesh recovery

  • $f$: image pixels $I$ to MANO parameters
  • the regressor also estimates camera parameters $\pi$, i.e., a translation $t \in \mathbb{R}^3$
    • used to project the 3D mesh and joints to 2D keypoints
    • $x = \pi(X) = \Pi_K(X + t)$, given camera intrinsics $K$

final mapping: $f(I) = \Theta$, where $\Theta = \{\theta, \beta, \pi\}$
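
A minimal sketch of the reprojection step (function and variable names are illustrative):

```python
import numpy as np

def project_joints(X: np.ndarray, t: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Perspective projection x = Pi_K(X + t) of 3D joints to 2D keypoints.

    X: (21, 3) 3D joints from MANO
    t: (3,)    camera translation estimated by the regressor (pi)
    K: (3, 3)  camera intrinsics
    """
    X_cam = X + t                    # move the joints into the camera frame
    uvw = X_cam @ K.T                # apply the intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide -> (21, 2) pixels
```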

Architecture

(architecture figure: ViT backbone followed by the transformer head)

transformer head

  • decoder that processes a single token while cross-attending to the ViT output tokens
  • output Θ\Theta
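
A rough PyTorch sketch of such a head (dimensions, number of layers, and the direct regression of $\theta$, $\beta$, $t$ are assumptions; the actual HaMeR head may use a different rotation parameterization):

```python
import torch
import torch.nn as nn

class HandTransformerHead(nn.Module):
    """One learnable query token cross-attends to the ViT output tokens and
    is decoded into Theta = {theta, beta, pi}. Sizes are illustrative."""

    def __init__(self, dim: int = 1024, num_layers: int = 6):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # single token
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.pose_head = nn.Linear(dim, 48)   # theta
        self.shape_head = nn.Linear(dim, 10)  # beta
        self.cam_head = nn.Linear(dim, 3)     # camera translation t

    def forward(self, vit_tokens: torch.Tensor):
        # vit_tokens: (B, N, dim) output tokens of the ViT backbone
        q = self.query.expand(vit_tokens.shape[0], -1, -1)
        out = self.decoder(q, vit_tokens)[:, 0]  # (B, dim)
        return self.pose_head(out), self.shape_head(out), self.cam_head(out)
```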

Losses

$$L_{3D} = ||\theta - \theta^*||^2_2 + ||\beta - \beta^*||^2_2 + ||X - X^*||_1$$

$$L_{2D} = ||x - x^*||_1$$
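
In code, these two supervision terms look roughly like the following (batched tensors; the batch reduction is an assumption):

```python
import torch

def loss_3d(theta, beta, X, theta_gt, beta_gt, X_gt):
    """L_3D = ||theta - theta*||_2^2 + ||beta - beta*||_2^2 + ||X - X*||_1,
    applied to samples with 3D ground truth; averaged over the batch here."""
    l_theta = ((theta - theta_gt) ** 2).sum(-1)          # (B,)
    l_beta = ((beta - beta_gt) ** 2).sum(-1)             # (B,)
    l_joints = (X - X_gt).abs().sum(dim=(-2, -1))        # (B,)
    return (l_theta + l_beta + l_joints).mean()

def loss_2d(x, x_gt):
    """L_2D = ||x - x*||_1 between projected and annotated 2D keypoints."""
    return (x - x_gt).abs().sum(dim=(-2, -1)).mean()
```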

  • discriminator $D_k$: adversarial prior, used when only 2D keypoints are available; factorized over
    • the hand shape $\beta$
    • the full hand pose $\theta$
    • each joint angle, separately

$$L_{adv} = \sum_{k} (D_k(\Theta) - 1)^2$$
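
A sketch of the generator-side term, assuming one small discriminator per factor (the exact factorization and pose parameterization are assumptions):

```python
import torch
import torch.nn as nn

def adversarial_loss(discriminators: nn.ModuleList, theta: torch.Tensor,
                     beta: torch.Tensor) -> torch.Tensor:
    """Generator-side L_adv = sum_k (D_k(Theta) - 1)^2, with one discriminator
    for the shape, one for the full pose, and one per joint rotation
    (theta split into 16 axis-angle joints here as an assumption)."""
    factors = [beta, theta] + list(theta.reshape(-1, 16, 3).unbind(1))
    loss = theta.new_zeros(())
    for d_k, f_k in zip(discriminators, factors):
        loss = loss + ((d_k(f_k) - 1) ** 2).mean()
    return loss
```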

Training data

2.7M training examples

  • 4x larger than FrankMocap
  • mostly in controlled environments
  • 5% in-the-wild images

HInt: $\underline{\text{H}}$and $\underline{\text{Int}}$eractions in the wild

  • annotates 2D hand keypoint locations and occlusion labels for each keypoint
  • built off
    • Hands23
    • Epic-Kitchens
    • Ego4D
  • first to provide “occlusion” annotations for 2D keypoints
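
A hypothetical record illustrating what each HInt annotation carries (field names are illustrative, not the released file format):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HIntAnnotation:
    """One annotated hand: 2D keypoints plus a per-keypoint occlusion flag.
    Hypothetical structure for illustration only."""
    image_path: str
    keypoints_2d: List[Tuple[float, float]]  # 21 (x, y) pixel locations
    occluded: List[bool]                     # True if the keypoint is occluded
```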

Experiments

3D pose accuracy

dataset

  • FreiHAND and HO3Dv2: controlled multi-camera environments and 3D ground truth annotations

metrics

  • PA-MPJPE and AUC$_J$ (3D joint evaluation)
  • PA-MPVPE, AUC$_V$, F@5mm and F@15mm (3D mesh evaluation)
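
For reference, a sketch of PA-MPJPE (PA-MPVPE is the same computation applied to the 778 mesh vertices):

```python
import numpy as np

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error after Procrustes alignment (similarity
    transform: rotation, uniform scale, translation). pred, gt: (K, 3)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # orthogonal Procrustes via SVD
    if np.linalg.det(U @ Vt) < 0:       # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
    R = (U @ Vt).T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())
```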

2D pose accuracy

dataset

  • HInt

metrics

  • reprojection accuracy of the 2D keypoints (PCK)
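
A sketch of this kind of 2D metric as a generic PCK (the threshold normalization used for HInt is an assumption):

```python
import numpy as np

def pck(pred_2d: np.ndarray, gt_2d: np.ndarray, bbox_size: float,
        thresh: float = 0.05) -> float:
    """Fraction of reprojected keypoints within thresh * bbox_size of the
    annotation. pred_2d, gt_2d: (K, 2) in pixels."""
    dist = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return float((dist < thresh * bbox_size).mean())
```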

Ablation analysis

  • Effect of large-scale data and a deep model
  • Training with HInt

