Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
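
The report does not publish Sora's implementation, but the core idea of "operating on spacetime patches of latent codes" can be illustrated concretely. The sketch below is a hypothetical helper that carves a latent video tensor into non-overlapping spacetime patches and flattens them into a token sequence for a transformer; the patch sizes (`pt`, `ph`, `pw`), the `(T, C, H, W)` latent layout, and the function name are assumptions for illustration, not details from the report.

```python
import torch

def to_spacetime_patches(latents: torch.Tensor,
                         pt: int = 2, ph: int = 2, pw: int = 2) -> torch.Tensor:
    """Flatten a latent video into a sequence of spacetime patch tokens.

    latents: (T, C, H, W) latent codes, e.g. from a video compression network.
    Returns: (num_patches, pt * ph * pw * C) token sequence.
    NOTE: an assumed patchification scheme, not Sora's actual implementation.
    """
    T, C, H, W = latents.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide patch size"
    # Split the time, height and width axes into (block index, within-block offset).
    x = latents.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Bring the three block-index axes to the front, offsets and channels last.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)      # (T/pt, H/ph, W/pw, pt, ph, pw, C)
    # Flatten each pt x ph x pw x C block into one token row.
    return x.reshape(-1, pt * ph * pw * C)

# e.g. 16 latent frames of 4x32x32 latents -> 8 * 16 * 16 = 2048 tokens of dim 32
tokens = to_spacetime_patches(torch.randn(16, 4, 32, 32))
```

Because patches are just a flat token sequence, the same interface can in principle cover both videos and single images (an image being a one-frame video), which matches the report's claim of joint training on videos and images of variable durations, resolutions and aspect ratios.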