Pyramid Flow, an open-source neural network for video generation, has been released

Researchers from Peking University, China's Kuaishou Technology, and Beijing University of Posts and Telecommunications have unveiled Pyramid Flow, an open-source machine learning model for video generation.

The neural network can generate videos up to 10 seconds long at 768p resolution and 24 FPS. Both text-to-video and image-to-video modes are supported. The developers note that the model was trained on open datasets in 20,000 GPU hours, using Nvidia A100 graphics accelerators.

Developer tests show that Pyramid Flow outperforms existing open-source video generation models, as well as proprietary systems such as Kling and Gen-3 Alpha.

The project team also ran a comparative user study with more than 20 participants. According to the published charts, respondents most often highlighted the smoothness of videos generated with Pyramid Flow.

The model weights are published on Hugging Face, where a demo application is also available. Instructions for getting started can be found on GitHub.
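Both snippets below assume the checkpoint has already been downloaded and the model loaded. A minimal setup sketch, following the project's GitHub instructions; the `PyramidDiTForVideoGeneration` class, module name, and variant strings are taken from the repository README and may change between versions, and the checkpoint path is a placeholder:

import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16    # bf16 is recommended; fp16 is not supported

model = PyramidDiTForVideoGeneration(
    'PATH',                                          # placeholder: local checkpoint directory downloaded from Hugging Face
    model_dtype,
    model_variant='diffusion_transformer_768p',      # use 'diffusion_transformer_384p' for the 384p variant
)

model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
model.vae.enable_tiling()                            # tiled VAE decoding to reduce peak memory

With the model on the GPU and `torch_dtype` defined, the inference examples below can be run as-is.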

Model inference in text-to-video mode looks like this:

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,     
        width=1280,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame, set it to 7 for 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)

Model inference in image-to-video mode:

image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((1280, 768))
prompt = "FPV flying over the Great Wall"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate_i2v(
        prompt=prompt,
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
