Google VideoPoet: A New AI Tool That Can Generate Videos From Text

Saurabh Mhatre
3 min read · Dec 27, 2023


Google has recently introduced VideoPoet, a new artificial intelligence tool that can generate high-quality videos from text and other inputs. VideoPoet is a large language model (LLM) that is trained on a massive dataset of videos, images, audio, and text from the internet and other sources. It can perform various video generation tasks, such as text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio.

VideoPoet is based on a simple idea: any autoregressive language model can be turned into a video generator. Autoregressive language models, such as GPT-3 and Codex, are powerful AI systems that generate coherent and diverse text and code. However, they operate on sequences of discrete tokens, whereas video, images, and audio are continuous signals that such models cannot consume directly. To bridge this gap, VideoPoet uses multiple tokenizers that encode video, image, and audio clips into sequences of discrete tokens and decode generated tokens back into viewable or audible output. These tokenizers are based on existing models: MAGVIT-v2 for video and images, and SoundStream for audio.
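To make the tokenizer idea concrete, here is a minimal sketch of what "encoding into discrete tokens" means. It is a toy, not how MAGVIT-v2 or SoundStream actually work: those are learned neural networks, while this example uses a made-up, fixed codebook (`CODEBOOK`) and hypothetical `encode` and `decode` helpers that snap each continuous value to its nearest code.

```python
# Toy illustration only: a fixed 1-D codebook standing in for a learned
# video/audio tokenizer. All names and values here are assumptions made
# for this example and do not reflect VideoPoet's real tokenizers.

CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical code values


def encode(values):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - v))
        for v in values
    ]


def decode(tokens):
    """Map token indices back to their (approximate) continuous values."""
    return [CODEBOOK[t] for t in tokens]


pixels = [0.1, 0.3, 0.62, 0.97]   # pretend pixel intensities in [0, 1]
tokens = encode(pixels)           # discrete IDs a language model could read
reconstructed = decode(tokens)    # lossy reconstruction of the input

print(tokens)         # [0, 1, 2, 4]
print(reconstructed)  # [0.0, 0.25, 0.5, 1.0]
```

The reconstruction is lossy, which is the trade-off any such tokenizer makes: the signal is compressed into a small vocabulary so that a language model can treat it like text.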

By using these tokenizers, VideoPoet can learn the relationship between different modalities and generate videos that are both coherent and visually appealing. For example, given a text prompt, such as "a dog listening to music with headphones", VideoPoet can produce a video clip that matches the description.
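The following sketch shows one way a single model could handle several modalities in one token sequence and then generate video tokens one at a time. Everything in it is an assumption for illustration: the vocabulary sizes, the ID offsets, and especially `dummy_next_token`, which is a meaningless stand-in for a trained model. The source does not describe how VideoPoet lays out its vocabulary.

```python
# Hypothetical vocabulary layout: each modality gets its own ID range so
# that one sequence can mix text, video, and audio tokens.
TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 100, 50, 30
VIDEO_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VIDEO_VOCAB


def to_shared_ids(text_ids, video_ids=(), audio_ids=()):
    """Shift per-modality token IDs into one shared ID space."""
    return (
        list(text_ids)
        + [VIDEO_OFFSET + v for v in video_ids]
        + [AUDIO_OFFSET + a for a in audio_ids]
    )


def modality_of(token_id):
    """Recover which modality a shared ID belongs to."""
    if token_id < VIDEO_OFFSET:
        return "text"
    if token_id < AUDIO_OFFSET:
        return "video"
    return "audio"


def dummy_next_token(sequence):
    """Stand-in for a trained model: deterministic and meaningless."""
    return VIDEO_OFFSET + (sum(sequence) % VIDEO_VOCAB)


def generate(prompt_ids, n_new):
    """Autoregressive loop: each new token is predicted from all prior ones."""
    sequence = list(prompt_ids)
    for _ in range(n_new):
        sequence.append(dummy_next_token(sequence))
    return sequence[len(prompt_ids):]


prompt = to_shared_ids(text_ids=[5, 17, 42])  # pretend tokenized text prompt
video_tokens = generate(prompt, n_new=4)
print([modality_of(t) for t in video_tokens])
```

In a real system, the generated video tokens would then be passed to the video tokenizer's decoder to produce frames.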

Similarly, given an image, such as a picture of a teddy bear, VideoPoet can animate it and add sound effects.

VideoPoet can also edit existing videos, such as inpainting missing regions or outpainting beyond the original frame.

Moreover, VideoPoet can stylize videos according to the text input, such as "a horse galloping through Van Gogh's 'starry night'".

To showcase VideoPoet's capabilities, the Google Research team has produced a short movie composed of many short clips generated by the model. The movie tells the story of Rookie the Raccoon, a traveling adventurer who explores different places and meets new friends.

The script for the movie was written by Bard, Google's conversational AI, which was asked to write a short story along with a text prompt for each scene. The clips generated from those prompts were then stitched together. The movie demonstrates how VideoPoet can generate videos with large and complex motions, such as a raccoon riding a bike, a dragon breathing fire, or a train driving through a fantasy landscape.

VideoPoet is a remarkable achievement in the field of video generation, as it shows how language models can be extended to other modalities and tasks. It is also versatile: it works with several kinds of input and output, including text, images, video, and audio, and integrates many capabilities into a single model. VideoPoet opens up new possibilities for creating and editing videos, as well as telling visual stories with AI.

I hope you enjoyed reading this article. 😊

Check out my YouTube channel for more content:

SaurabhNative-Youtube

Sources:
(1) VideoPoet – Google Research. https://sites.research.google/videopoet/
(2) VideoPoet: A large language model for zero-shot video generation. http://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
