Google introduced Lumiere – a neural network for generating realistic videos
Google has introduced Lumiere, a neural network for generating short realistic videos. It is built on a space-time diffusion model, the Space-Time U-Net (STUNet).
Lumiere generates five-second videos from a text prompt. The key difference of the Space-Time U-Net used in Lumiere is that it produces the entire video in a single pass. Other neural networks first generate key frames and then fill in the time between them, which often leads to inconsistencies between frames.
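For intuition, the difference can be sketched in a few lines: a keyframe-based pipeline generates sparse frames and then interpolates the frames between them, so any mismatch between neighboring keyframes shows up as flicker. The sketch below is an illustrative toy in NumPy, not Lumiere's actual implementation; the frame shapes and the linear interpolation stand in for a real temporal super-resolution stage.

```python
import numpy as np

def interpolate_between_keyframes(keyframes, factor):
    """Fill in-between frames by linear interpolation (a toy stand-in for
    the temporal upsampling stage of a keyframe-based pipeline)."""
    frames = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for t in range(factor):
            alpha = t / factor
            frames.append((1 - alpha) * a + alpha * b)
    frames.append(keyframes[-1])  # keep the final keyframe
    return np.stack(frames)

# Toy example: 5 "keyframes" of 4x4 grayscale pixels, upsampled 4x in time.
rng = np.random.default_rng(0)
keyframes = rng.random((5, 4, 4))
video = interpolate_between_keyframes(keyframes, factor=4)
print(video.shape)  # (17, 4, 4): 4 gaps * 4 frames each + final keyframe
```

A single-pass model like STUNet instead denoises the whole space-time volume jointly, so temporal consistency does not hinge on a separate interpolation step between independently generated keyframes.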
The text-to-video model was trained on a dataset of 30 million videos with text captions. All videos in the dataset were 128×128 pixels in resolution and 80 frames long at 16 frames per second. As output, the neural network generates five-second videos at 1024×1024 pixels.
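The training-clip numbers are internally consistent with the stated output length: 80 frames at 16 frames per second is exactly five seconds.

```python
# Clip duration from the dataset parameters reported for Lumiere.
frames = 80
fps = 16
duration_s = frames / fps
print(duration_s)  # 5.0 seconds
```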
Lumiere has several modes of operation. The neural network can generate a video from a text description, animate a still image, follow a specified style, and modify only selected regions of a video.
The developers published a detailed study of the new generation method on arXiv.org. Examples of the neural network's output are available on the Lumiere project website.