This project is inspired by recent advances in world model architectures, particularly the Latent Autoregressive Flow-Matching (LARF) approach for reinforcement learning environments. It explores similar principles for interactive, playable world generation.
Initial evaluation of existing datasets (XS-VID) showed that they lacked both sufficient quality and any action annotations. To address this limitation, I developed an Android app for data collection, recording both high-quality video and all significant sensor data.
Custom interactive data collection framework demonstrating controlled environment generation
Implemented a VAE with multi-objective optimization: L1 reconstruction loss for pixel fidelity, KL divergence for latent space regularization, LPIPS loss for alignment with human visual perception, and DINO feature consistency for semantic preservation. This combination addresses the trade-off between reconstruction quality and generation capability through semantic regularization of the latent representations.
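The combined objective can be written as a weighted sum of the four terms. Below is a minimal NumPy sketch of that combination; the weights are illustrative rather than the values used in the project, and `lpips_fn` / `dino_fn` are hypothetical callables standing in for the pretrained perceptual and semantic networks.

```python
import numpy as np

def l1_loss(x, x_hat):
    # Pixel-space reconstruction term.
    return np.mean(np.abs(x - x_hat))

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, averaged over the batch.
    return 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1))

def vae_loss(x, x_hat, mu, logvar, lpips_fn, dino_fn,
             w_l1=1.0, w_kl=1e-4, w_lpips=0.5, w_dino=0.25):
    # Weighted multi-objective loss; weights here are placeholders, not tuned values.
    return (w_l1 * l1_loss(x, x_hat)
            + w_kl * kl_divergence(mu, logvar)
            + w_lpips * lpips_fn(x, x_hat)
            + w_dino * dino_fn(x, x_hat))
```

In practice `lpips_fn` would wrap an LPIPS network and `dino_fn` would compare features from a frozen DINO encoder; any differentiable stand-ins plug into the same structure.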
VAE reconstruction results showing latent space compression and semantic preservation
Developed causal transformer architecture operating in compressed latent space with flow matching module for next-frame prediction. The transformer applies alternating temporal and spatial attention mechanisms with rotary position embeddings, while the flow matching head implements velocity field objectives to transform noise-conditioned embeddings into samples from the target distribution. This enables efficient autoregressive generation with reduced error accumulation through learned noise embeddings.
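The alternating temporal/spatial attention can be factorized as self-attention applied along one axis of the latent token grid at a time, treating the other axis as batch. A minimal NumPy sketch of that factorization, with projections, multiple heads, rotary embeddings, and causal masking omitted for brevity:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, axis):
    # Single-head self-attention along one axis of a (T, S, D) token grid;
    # the remaining axis acts as a batch dimension.
    x = np.moveaxis(x, axis, -2)                              # (..., L, D)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x
    return np.moveaxis(out, -2, axis)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 32))  # (T frames, S latent patches, D dims)
h = attend(tokens, axis=0)                 # temporal attention across frames
h = attend(h, axis=1)                      # spatial attention within each frame
```

Each block of the real transformer alternates these two attention patterns, which keeps the cost of attention linear in T·S rather than quadratic in the full token count.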
Initial transformer + flow matching experiments demonstrating temporal coherence in latent space
encode → predict → flow → integrate → decode
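The pipeline above can be sketched as a generation loop: encode the frame history, predict a conditioning embedding, integrate the learned velocity field from noise to the next latent frame, and decode. The sketch below uses fixed-step Euler integration (the simplest ODE solver for flow-matching sampling); the `encode`, `decode`, `predict`, and `velocity` stand-ins are hypothetical placeholders for the VAE and transformer described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

def euler_integrate(velocity_fn, z_noise, cond, n_steps=8):
    # Integrate dz/dt = v(z, t, cond) from t=0 (noise) to t=1 (next latent frame).
    z, dt = z_noise, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t, cond)
    return z

# Hypothetical stand-ins for the real VAE encoder/decoder and transformer:
encode = lambda frame: frame.reshape(-1)[:32]
decode = lambda z: z
predict = lambda history, action: np.mean(history, axis=0)
velocity = lambda z, t, cond: cond - z  # toy field that pulls z toward cond

history = [encode(rng.standard_normal((8, 8))) for _ in range(3)]
cond = predict(np.stack(history), action=None)
z0 = rng.standard_normal(32)
z_next = euler_integrate(velocity, z0, cond)
frame = decode(z_next)
```

At inference time this loop runs autoregressively: the decoded frame is displayed, the user's action is fed into the conditioning step, and the new latent is appended to the history.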