Seaweed

A foundation model for video generation

Introducing Seaweed

Seaweed, short for "Seed-Video," is a research initiative aimed at building a foundation model for video generation. It comprises diffusion transformers with approximately 7 billion (7B) parameters, trained using an average of 1,000 GPUs. Seaweed learns a representation of the world from massive amounts of multi-modal data, including video, images, and text, and can create videos of various resolutions, aspect ratios, and durations from text descriptions. In this article, we present its generated videos and highlight its hallmark capability as a foundation model supporting a wide range of downstream applications.
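For intuition, the sketch below shows the kind of diffusion-transformer (DiT) block such a model stacks many times. It is a minimal illustration under common DiT conventions, not Seaweed's actual architecture; every name and size is hypothetical.

```python
# Illustrative only: a minimal diffusion-transformer (DiT) block of the kind
# video diffusion models are built from. Names and sizes are hypothetical.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Timestep conditioning via adaptive scale/shift (adaLN-style).
        self.ada = nn.Linear(dim, 2 * dim)

    def forward(self, x, t_emb):
        # x: (batch, tokens, dim) latent video patches
        # t_emb: (batch, dim) diffusion-timestep embedding
        scale, shift = self.ada(t_emb).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Usage: one block over a batch of 256 video-patch tokens at one step.
block = DiTBlock()
tokens = torch.randn(2, 256, 1024)
t_emb = torch.randn(2, 1024)
out = block(tokens, t_emb)  # (2, 256, 1024)
```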

Our model is highly adept at generating lifelike human characters that exhibit a diverse array of actions, gestures, and emotions.

Seaweed excels at generating a wide variety of landscapes. With intricate detail and dynamic composition, it can create visually stunning environments that enhance storytelling.

Seaweed is capable of generating highly realistic animals and other living beings with remarkable detail and accuracy.

Seaweed can also generate stunning fantasy scenes, including futuristic environments, mystical landscapes, and otherworldly settings. It can bring complex, imaginative worlds to life, complete with intricate details and striking visuals.

We demonstrate our model's generative capabilities through a short film. All of the footage is generated; the only manually added components are the background music and the ending titles.

Watch a generated short film

Generate videos from images

Our video generation model offers enhanced controls that allow users to precisely create the content they envision. By providing an image as the first frame, users can direct the model to generate the rest of the video with consistent motion and style. This grants users full control over the visual aesthetics, making it ideal for applications where accuracy and creative direction are crucial.

Creators: Yuanjing, Jiahong Huang, Baiyi Li

We demonstrate our model's image-to-video generation capability: the same input image can produce diverse videos under different prompts, as sketched below.

Prompts: Heart, Cheer, Teddy, Wave
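A minimal sketch of what a first-frame-conditioned call could look like. This is not Seaweed's public API; the module my_video_sdk, the load_model function, and all parameter names are placeholders.

```python
# Hypothetical interface, not Seaweed's public API: condition on a first
# frame, then vary the prompt to get diverse continuations of the same image.
from my_video_sdk import load_model  # placeholder module and function

model = load_model("seaweed-7b")
first_frame = open("portrait.png", "rb").read()

for name, prompt in [("heart", "the person makes a heart gesture"),
                     ("cheer", "the person cheers with raised arms"),
                     ("wave",  "the person waves at the camera")]:
    video = model.generate(
        prompt=prompt,
        first_frame=first_frame,   # pins identity, style, and composition
        num_frames=120,            # e.g. 5 seconds at 24 fps
        resolution=(1280, 720),
    )
    video.save(f"{name}.mp4")
```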

Our model can also condition on both the first and last frames, generating transition videos between them for greater creative control.
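Continuing the same hypothetical interface, conditioning on both endpoints might look like this; last_frame is an assumed parameter.

```python
# Hypothetical: supply both endpoints and let the model fill the transition.
video = model.generate(
    prompt="a smooth cinematic transition between the two scenes",
    first_frame=open("scene_a.png", "rb").read(),
    last_frame=open("scene_b.png", "rb").read(),  # assumed parameter
)
```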

Generate videos from references

Our model can also be fine-tuned to generate videos based on reference images, offering flexible input options. Whether given a human reference image, an object reference image, or a combination of multiple reference images, the model synthesizes them into dynamic video sequences.

Learn more about Phantom
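As a rough sketch of Phantom-style reference conditioning, reusing the hypothetical model handle from above; reference_images is an assumed parameter.

```python
# Hypothetical: identity and appearance come from reference images rather
# than a fixed first frame, so subjects can be placed in new scenes.
video = model.generate(
    prompt="the woman hands the toy robot to the child in a sunny park",
    reference_images=[                   # assumed parameter
        open("woman.png", "rb").read(),  # human reference
        open("robot.png", "rb").read(),  # object reference
    ],
)
```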

Human-centric video generation

Through OmniHuman, Seaweed is adapted to generate content conditioned on audio input, enabling realistic human characters that match the voice in the audio. The model produces lip movements and body gestures synchronized with the tone and timing of the audio, creating a seamless, lifelike performance.
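A sketch of what OmniHuman-style audio conditioning could look like under the same hypothetical interface; the audio parameter is an assumption.

```python
# Hypothetical sketch of audio conditioning: a portrait plus a speech track
# drive synchronized lip movement and gesture.
video = model.generate(
    prompt="the speaker addresses the camera, gesturing naturally",
    first_frame=open("speaker.png", "rb").read(),
    audio=open("speech.wav", "rb").read(),  # assumed parameter; sets lip-sync timing
)
```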

Generate audio with video

Seaweed is also capable of generating audio and video together. The generated audio is synchronized with the action, scene, tone, rhythm, and style of the video, complementing and elevating the visual storytelling for a seamless multimedia experience.

Long-shot generation

Seaweed natively supports generating a single shot lasting 20 seconds without any extension technique. With extension, it can generate videos up to one minute long.
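One plausible shape of such an extension loop, again using the hypothetical interface from above; duration and last_frame are assumed names.

```python
# Hypothetical sketch of extension: generate a native 20 s shot, then keep
# continuing from the final frame until roughly one minute of footage exists.
clips = [model.generate(prompt="a sailboat crosses a calm bay", duration=20.0)]
for _ in range(2):  # two extensions, about 60 s in total
    clips.append(model.generate(
        prompt="the sailboat continues across the bay",
        first_frame=clips[-1].last_frame(),  # assumed accessor on the result
        duration=20.0,                       # assumed parameter, in seconds
    ))
```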

Consistent storytelling

Seaweed is capable of generating consistent, multi-shot, long-form stories, maintaining continuity across scenes and shots. Users provide both a global text description for the overarching narrative and fine-grained text descriptions for each individual shot, as in the example below.

Learn more about Long-Context-Tuning
Learn more about VideoAuteur
Shot 1: Overview of the forest.
Shot 2: Close-up shot of the trees.
Shot 3: The boy follows the girl.
Shot 4: Cut to the front of the girl.
Shot 5: The boy follows.
Shot 6: Another overview of the forest.
Shot 7: The girl talks to the boy.
Shot 8: The boy becomes serious.
Shot 9: The girl becomes nervous.
Shot 10: The girl walks forward.
Shot 11: Drone view.
Shot 12: Wide angle ground view.
Shot 13: Camera dolly in.
Shot 14: A house appears in front.
Shot 15: Close-up shot of the house.
Shot 16: Close-up view of the characters.
Shot 17: The door. Dolly out.
Shot 18: The boy tries to open the door.
Shot 19: Inside the empty room.
Shot 20: The characters walk inside.
Shot 21: They look around in the house.
Shot 22: They walk into a new room.
Shot 23: An old bookshelf.
Shot 24: Close-up shot on the shelf.
Shot 25: The characters walk to a table.
Shot 26: A glowing ball floating on a map.
Shot 27: They look at each other.

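One way a multi-shot request could be structured under the hypothetical interface used throughout; generate_story and its parameters are assumptions, with the shot prompts taken from the list above.

```python
# Hypothetical: one global narrative prompt plus per-shot prompts, mirroring
# the shot list above.
story = model.generate_story(   # assumed method
    global_prompt="Two children explore a forest and discover an abandoned house.",
    shot_prompts=[
        "Overview of the forest.",
        "Close-up shot of the trees.",
        "The boy follows the girl.",
        # ... one entry per shot, through "They look at each other."
    ],
)
```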

High-resolution generation

Seaweed natively supports generating videos at up to 1280x720 resolution. Results can be further upsampled to 2K QHD (2560x1440). The super-resolution module can also be applied on its own to existing videos for upsampling and restoration.

Learn more about SeedVR
Comparison: 480p input vs. 1440p upsampled output
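A sketch of standalone upsampling in the spirit of SeedVR; load_upsampler and its arguments are placeholders, not a published API.

```python
# Hypothetical sketch of standalone super-resolution: upsample and restore
# an existing 480p clip to 2K QHD. All names are placeholders.
from my_video_sdk import load_upsampler  # placeholder

sr = load_upsampler("seedvr")
restored = sr.upsample(
    open("old_clip_480p.mp4", "rb").read(),
    target_resolution=(2560, 1440),  # 2K QHD
)
```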

Real-time generation

Seaweed can also generate video in real time at 1280x720 resolution and 24 fps. This is particularly valuable for real-time, interactive applications where immediate video generation is essential.

Learn more about Seaweed-APT
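A sketch of what a real-time consumer loop could look like with a fast, few-step variant such as Seaweed-APT; the stream method is an assumption.

```python
# Hypothetical sketch of a real-time loop: frames are consumed as soon as
# they are produced, rather than waiting for the full clip.
def render(frame):
    ...  # stand-in for any display or streaming sink

for frame in model.stream(          # assumed streaming method
    prompt="first-person walk through a night market",
    resolution=(1280, 720),
    fps=24,
):
    render(frame)
```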

Camera control

Seaweed natively supports camera control through text descriptions, offering a range of options for different camera movements. Users can specify camera actions such as panning, tilting, and zooming, or more complex movements like tracking and orbiting; a prompt sketch follows the list below.

Orbiting
Tracking
Dolly-in
Dolly-out
Pan-left
Pan-right
Tilt-up
Tilt-down
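Because camera control here is text-based, a camera move can be expressed as part of the prompt itself; a minimal sketch, reusing the hypothetical handle from earlier:

```python
# Hypothetical: with text-based control, the camera move is simply appended
# to the prompt.
for move in ["orbiting", "tracking", "dolly-in", "pan-left", "tilt-up"]:
    clip = model.generate(prompt=f"a lighthouse on a cliff at sunset, camera {move}")
    clip.save(f"lighthouse_{move}.mp4")
```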

World exploration

Seaweed can also model precise camera control through user-defined trajectories, providing enhanced creative direction as well as an interactive way for users to explore the simulated world. With its real-time generation capability, Seaweed serves as a foundation model for advanced research in world simulation.

Learn more about CameraCtrl-II
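A sketch of trajectory-conditioned generation in the spirit of CameraCtrl-II; the camera_trajectory parameter and pose format are assumptions.

```python
# Hypothetical sketch of trajectory conditioning: one camera pose per output
# frame instead of a textual camera cue.
trajectory = [
    {"frame": i, "position": (0.0, 1.6, -0.1 * i), "yaw_deg": 2.0 * i}
    for i in range(120)  # a slow forward push with a gentle turn
]
video = model.generate(
    prompt="explore a medieval courtyard",
    camera_trajectory=trajectory,  # assumed parameter
)
```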

Enhanced physically consistent generation

Seaweed can also be post-trained on synthetic videos rendered with computer-generated imagery (CGI), enhancing physical consistency in video generation while preserving photorealism. Below, we showcase generated videos with superior 3D consistency and accurate human poses in complex actions, alongside the synthetic videos used for training.

Research

Alphabetical order
* denotes individuals who held the role of point of contact

Research Model

  • Ceyuan Yang*
  • Fei Xiao
  • Feng Cheng
  • Hao Chen
  • Haoyuan Guo*
  • Meng Wei
  • Peihao Zhu
  • Qi Zhao
  • Shanchuan Lin*
  • Yang Zhao*
  • Zhijie Lin*
  • Zhiwu Qing

Research Data

  • Fangyuan Kong
  • Feilong Zuo*
  • Jiangqiao Yan
  • Liangke Gui
  • Lu Qi*
  • Sen Wang*
  • Sheng Bi
  • Siyu Zhang
  • Tuyen Hoang
  • Xuejiao Zeng*
  • Zhibei Ma*
  • Ziyan Yang

Research Lead

Infrastructure

Alphabetical order
* denotes individuals who held the role of point of contact

Feng Ling, Huafeng Kuang, Huixia Li*, Jerry Duncan, Jiashi Li*, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Rui Wang*, Shu Liu*, Xiaojie Li, Xin Xia, Xuefeng Xiao*, Xuyan Chi, Yanghua Peng*, Yuxi Ren*, Zhongkai Zhao, Zuquan Song

Contributors

Alphabetical order

Bingchuan Li, Chao Liang, Chongyang Ma, Deyao Zhu, Gaojie Lin, Gen Li, Haibin Huang, Hao He, Jianwen Jiang, Jianyi Wang, Jiaqi Yang, Jiawei Liu, Junfei Xiao, Lijie Liu, Qian He, Siyu Zhou, Tianxiang Ma, Xiaobin Zhuang, Xiaohui Shen, Xinglong Wu, Yuping Wang, Yuwei Guo, Yuxuan Wang, Zerong Zheng, Zhuo Chen, Zhuowei Chen