root@6971019e2498:/workspace# watch nvidia-smi
Thu Jun 05, 2025, 07:11:26 PM
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:61:00.0 Off |                  Off |
| 85%   78C    P0            425W /  450W |   22107MiB /  24564MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8432      C   python                                22105MiB |
+-----------------------------------------------------------------------------------------+
Model training in progress...

It's a Tiny World (Model)

Playable World Models

This project is inspired by recent advances in world model architectures, particularly the Latent Autoregressive Flow-Matching (LARF) approach for reinforcement learning environments. The implementation explores similar principles for interactive, playable world generation.

Project Overview

Stage 1: Data Infrastructure & Collection

Initial evaluation of existing datasets (XS-VID) showed they were neither of sufficient quality nor paired with any action data. To address this limitation, I developed an Android app for data collection that records both high-quality video and all relevant sensor data.

Custom interactive data collection framework demonstrating controlled environment generation
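
To illustrate what a collected sample might look like, here is a hypothetical sketch of one synchronized record pairing a video frame with the nearest-in-time sensor readings. The field names and alignment strategy are illustrative assumptions, not the app's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameRecord:
    timestamp_ns: int                                   # capture time of the video frame
    frame_path: str                                     # location of the decoded frame on disk
    accel: List[float] = field(default_factory=list)    # 3-axis accelerometer reading
    gyro: List[float] = field(default_factory=list)     # 3-axis gyroscope reading
    touch: List[float] = field(default_factory=list)    # normalized (x, y) touch position, if any

def nearest(samples, t_ns):
    # Sensor streams are sampled faster than video, so each frame is paired
    # with the sensor sample closest in time before training.
    return min(samples, key=lambda s: abs(s["t"] - t_ns))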

Stage 2: Variational Autoencoder Architecture

Implemented a VAE with multi-objective optimization: an L1 reconstruction loss for pixel fidelity, a KL divergence term for latent-space regularization, LPIPS for alignment with human visual perception, and a DINO feature-consistency term for semantic preservation. Semantically regularizing the latent representations in this way addresses the trade-off between reconstruction quality and generation capability.

VAE reconstruction results showing latent space compression and semantic preservation
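
The composite objective above maps fairly directly onto code. Below is a minimal sketch of that loss in PyTorch, assuming a VAE that returns the reconstruction together with the posterior mean and log-variance; the loss weights, LPIPS backbone, and DINO variant are illustrative placeholders rather than the values used in this project, and input resizing/normalization for DINO is omitted for brevity.

import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg').eval()
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()
for p in list(lpips_fn.parameters()) + list(dino.parameters()):
    p.requires_grad_(False)  # both perceptual networks stay frozen

def vae_loss(x, recon, mu, logvar, w_kl=1e-4, w_lpips=0.5, w_dino=0.25):
    l1 = F.l1_loss(recon, x)                                       # pixel fidelity
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # pull posterior toward unit Gaussian
    perc = lpips_fn(recon, x).mean()                               # perceptual similarity, inputs in [-1, 1]
    feat_x = dino(x).detach()                                      # semantic features of the target
    feat_r = dino(recon)                                           # features of the reconstruction (grads flow to the VAE)
    sem = F.mse_loss(feat_r, feat_x)                               # semantic feature consistency
    return l1 + w_kl * kl + w_lpips * perc + w_dino * sem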

Stage 3: Transformer + Flow Matching Integration

Developed a causal transformer that operates in the compressed latent space, paired with a flow-matching module for next-frame prediction. The transformer applies alternating temporal and spatial attention with rotary position embeddings, while the flow-matching head is trained with a velocity-field objective that transforms noise-conditioned embeddings into samples from the target distribution. This enables efficient autoregressive generation with reduced error accumulation through learned noise embeddings.

Initial transformer + flow matching experiments demonstrating temporal coherence in latent space
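
To make the velocity-field objective concrete, here is a minimal sketch of one flow-matching training step in latent space. It assumes a hypothetical model(context, actions, x_t, t) interface and follows the standard linear-interpolation (rectified-flow-style) formulation, which is not necessarily the exact schedule used in this project.

import torch
import torch.nn.functional as F

def flow_matching_step(model, context_latents, actions, target_latent):
    x1 = target_latent                               # clean next-frame latent
    x0 = torch.randn_like(x1)                        # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1                    # point on the straight path from noise to data
    v_target = x1 - x0                               # constant velocity of that path
    v_pred = model(context_latents, actions, x_t, t.flatten())
    return F.mse_loss(v_pred, v_target)              # velocity-field regression objective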

Inference Pipeline

encode → predict → flow → integrate → decode
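
A minimal sketch of this pipeline as an autoregressive rollout loop, assuming the same hypothetical vae and model interfaces as above; the number of Euler integration steps is an illustrative choice.

import torch

@torch.no_grad()
def rollout(vae, model, context_frames, actions, n_steps=8):
    latents = [vae.encode(f) for f in context_frames]      # encode: pixels -> latents
    for action in actions:
        context = torch.stack(latents, dim=1)              # predict: condition on past latents and the action
        x = torch.randn_like(latents[-1])                   # flow: start from Gaussian noise
        for i in range(n_steps):                            # integrate: Euler steps along the learned ODE
            t = torch.full((x.shape[0],), i / n_steps, device=x.device)
            x = x + model(context, action, x, t) / n_steps
        latents.append(x)                                    # feed the new latent back in autoregressively
    return [vae.decode(z) for z in latents[len(context_frames):]]  # decode: latents -> pixels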