diffusion video
- parent: stablediffusion
- SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
- memory no longer blows up with more frames (SSM temporal layers scale linearly in frame count, unlike attention)
1. TUNING
- Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (alleviates the need for dataset-specific tuning)
- prevents loss of image detail and noise prediction bias during the denoising process
- adds noise to the image latent, then denoises it with a rectification step that counters the noise prediction bias (see the sketch below)
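A minimal sketch of the idea as I read it, not the paper's exact formulation: the conditioning image latent is noised, the added noise is kept, and during denoising the model's noise prediction is blended back toward that known noise. The `eps_model` interface, the alpha-bar schedule, and the blend weight are all illustrative assumptions.

```python
import torch

def rectified_denoise(eps_model, image_latent, num_frames, steps=50, rectify_weight=0.4):
    """eps_model(x_t, step) -> predicted noise; stands in for a video diffusion U-Net."""
    alpha_bar = torch.linspace(0.001, 0.999, steps)              # toy schedule: noisy -> clean
    x0 = image_latent.unsqueeze(0).repeat(num_frames, 1, 1, 1)   # (F, C, H, W)
    added_noise = torch.randn_like(x0)
    # Forward-noise the repeated image latent to the terminal step, keeping the noise around.
    x = alpha_bar[0].sqrt() * x0 + (1 - alpha_bar[0]).sqrt() * added_noise
    for i in range(steps):
        ab = alpha_bar[i]
        eps = eps_model(x, i)
        # Rectification: pull the prediction toward the noise we actually added,
        # countering the bias that would otherwise wash out image detail.
        eps = rectify_weight * added_noise + (1 - rectify_weight) * eps
        x0_hat = (x - (1 - ab).sqrt() * eps) / ab.sqrt()          # predicted clean latent
        ab_next = alpha_bar[i + 1] if i + 1 < steps else torch.tensor(1.0)
        x = ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps  # DDIM-style update
    return x
```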
- Attention Prompt Tuning: Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition
- efficient prompt tuning for video applications such as action recognition
2. ANIMATION
3. 4D CONTROL
- see also: BIGGER COHERENCE, FACE, ARTIST EDITING
- DiffDreamer: Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
- landscape (mountain) fly-overs
- DisCo: Disentangled Control for Referring Human Dance Generation in Real World
- human dance (movement) images and videos (using skeleton rigs)
3.1. GIF
- Hotshot-XL, text-to-GIF model for Stable Diffusion XL
- Generative Image Dynamics: interactive GIFs (looping dynamic videos)
- frequency-coordinated diffusion sampling process
- neural stochastic motion texture
- Pix2Gif: Motion-Guided Diffusion for GIF Generation
- the motion-transformed feature map remains in the same space as the target, which preserves consistency and coherence
- DynamiCrafter: generative frame interpolation and looping video generation (320x512)
- Explorative Inbetweening of Time and Space
- bounded generation with a pre-trained image-to-video model, without any tuning or optimization
- two images that capture a subject motion, translation between different viewpoints, or looping
3.2. INTERACTIVE
3.2.1. COLORIZATION
- Learning Inclusion Matching for Animation Paint Bucket Colorization
- for hand-drawn cel animation
- comprehend the inclusion relationships between segments
- paint based on previous frame
3.2.2. DRAG
=DragGAN=
: Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold - dragging as the input primitive, using pairs of points; excellent results; StyleGAN derivative
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
- moving, resizing, appearance replacement, dragging
- StableDrag: Stable Dragging for Point-based Image Editing
- models: StableDrag-GAN and StableDrag-Diff
- confidence-based latent enhancement strategy for motion supervision
3.2.2.1. DRAG DIFFUSION
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- RotationDrag: Point-based Image Editing with Rotated Diffusion Features
- rotates the diffusion feature maps so point-based dragging can handle in-plane rotation
- DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
- controls trajectories at different granularities
- Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- superior control and semantic retention, reducing optimization time by 50% compared to DragDiffusion
3.2.3. HEAD POSE
- see 5.3.1.1 (TALKING FACES)
- Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
- 4D GAN + 2D diffusion-based editor = consistent 4D editing
=best one=
- change the face in a video
- AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections
- facial expression, head pose, and shoulder movements
- trained on unstructured 2D images
- MagiCapture: High-Resolution Multi-Concept Portrait Customization
- generate high-resolution portrait images given a handful of random selfies
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- input: unposed portrait image, retains identity and facial expression
- Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
- novel view synthesis; input: single image and morphable mesh for desired facial expression (emotion)
4. SEMANTICALLY DEFORMED
- VideoLDM: HD, but still semantically deformed (NVIDIA)
4.1. SEMANTICAL FIELD
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- consistency in edited video can be obtained by enforcing consistency in the diffusion feature space
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- video to video, frame consistency
- aggregates the entire video into one canonical image, then propagates edits to every frame through a deformation field
=best one=
- S2DM: Sector-Shaped Diffusion Models for Video Generation
=best=
- explore the use of optical flow as temporal conditions
- prompt correctness while keeping semantic consistency; can integrate with other temporal conditions
- decouple the generation of temporal features from semantic-content features
4.2. SD BASED
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
- temporal shift module that can leverage the spatial U-Net as is (see the sketch below)
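For reference, the generic temporal-shift trick this entry points at (Latent-Shift's exact module may differ): a fraction of channels is shifted one step forward or backward along the frame axis, so the frozen spatial U-Net layers see neighbouring-frame features at zero extra parameter cost.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_frac: float = 0.125) -> torch.Tensor:
    """x: (batch, frames, channels, height, width) activations between U-Net layers."""
    b, f, c, h, w = x.shape
    n = int(c * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]             # these channels look at the previous frame
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]   # these channels look at the next frame
    out[:, :, 2 * n:] = x[:, :, 2 * n:]        # the remaining channels are untouched
    return out
```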
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
- compatible with existing diffusion
=best one=
- hierarchical cross-frame constraints applied to enforce coherence
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- inflates the SD image model into a video model
- FROZEN SD
- FateZero: Fusing Attentions (MIT) for Zero-shot Text-based Video Editing
- most fluid one, without training
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
=best=
- employs a novel noise shuffling strategy to leverage temporal interactions (coherence); see the sketch below
- guidance with ControlNet
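A rough sketch of the shuffle-into-grids idea behind RAVE (grid size, divisibility handling, and the per-step re-shuffling policy are my assumptions): frames are permuted and tiled into large "grid" latents so a 2D model, optionally ControlNet-guided, denoises several frames in one spatial pass, and the permutation is undone afterwards.

```python
import torch

def shuffle_to_grids(latents: torch.Tensor, grid_frames: int, generator=None):
    """latents: (F, C, H, W); F must divide by grid_frames, grid_frames a perfect square."""
    f, c, h, w = latents.shape
    perm = torch.randperm(f, generator=generator)
    shuffled = latents[perm]
    side = int(grid_frames ** 0.5)                       # e.g. 4 frames -> a 2x2 grid
    grids = shuffled.reshape(f // grid_frames, side, side, c, h, w)
    grids = grids.permute(0, 3, 1, 4, 2, 5).reshape(f // grid_frames, c, side * h, side * w)
    return grids, perm

def unshuffle_from_grids(grids: torch.Tensor, perm: torch.Tensor, grid_frames: int):
    g, c, gh, gw = grids.shape
    side = int(grid_frames ** 0.5)
    h, w = gh // side, gw // side
    frames = grids.reshape(g, c, side, h, side, w).permute(0, 2, 4, 1, 3, 5)
    frames = frames.reshape(g * grid_frames, c, h, w)
    return frames[torch.argsort(perm)]                   # undo the frame permutation
```

Drawing a fresh permutation at each denoising step is what spreads appearance information across the whole clip.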
- FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
- doesn't strictly adhere to the optical flow
- first frame = supplementary reference in the diffusion model
- works seamlessly with existing I2I models
4.2.1. I2VGEN-XL
- I2VGen-XL (MS-Image2Video): non-commercial; good consistency and continuity; animates an input image
- built on SD; U-Net designed to perform spatiotemporal modeling in the latent space
- pre-trained on video and images
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
- utilizing static images as a form of crucial guidance
4.2.2. ANIMATEDIFF
=best one=
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
- inserts a motion module into the frozen (standard SD) text-to-image model (see the sketch after this entry)
- examples: (nsfw) video1 video2 video3 video4>>96101928 notnsfw: video1>>96052859 sword and sun>>96155685
- current state of things: https://banodoco.ai/Animatediff (more insight)
- techniques:
- Animatediff-cli-prompt-travel+Upscale: https://twitter.com/toyxyz3/status/1695134607317012749
- Controlling AnimateDiff using starting and ending frames (from Twitter user @TDS95514874)
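Sketch of what such a motion module roughly looks like (layer sizes and initialization are guesses, not AnimateDiff's exact code): temporal self-attention applied per spatial location across the frame axis, zero-initialized so the frozen T2I U-Net is unchanged at the start of training.

```python
import torch
import torch.nn as nn

class TemporalMotionModule(nn.Module):
    """Self-attention across the frame axis only, inserted between frozen spatial blocks."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)   # starts as an identity (residual) mapping
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) feature map from a spatial block
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)  # attend over frames
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + self.proj(attn_out)                        # residual update
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```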
- AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
- T2I generation is more controllable and efficient compared to T2V
- we can transform pre-trained T2V models into I2V models
- LongAnimateDiff: now 64 frames
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise-AnimateDiff)
- removed the semantic flickering
- AnimateDiff-Lightning: fast text-to-video model; can generate videos more than ten times faster than AnimateDiff
4.2.2.1. DIFFDIRECTOR
- DiffDirector (AnimateDiff-MotionDirector): train a MotionLoRA with MotionDirector and run it on any compatible AnimateDiff UI
4.2.2.2. PIA
4.2.2.3. ANIMATELCM
- AnimateLCM: decouples the distillation of image generation priors and motion generation priors
4.2.2.4. CMD
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
=best=
- content-motion latent diffusion model (CMD)
- autoencoder that succinctly encodes a video as a combination of a content frame (image) and a low-dimensional motion latent representation (see the sketch below)
- pretrained image diffusion model plus lightweight diffusion motion model
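A heavily simplified sketch of the decomposition (the real CMD autoencoder is learned end-to-end and its motion latent is richer; the dimensions and pooling here are invented): a learned weighted average over time gives a content frame that an image diffusion model can handle, and a small projection of the per-frame residuals gives the motion latent.

```python
import torch
import torch.nn as nn

class ContentMotionEncoder(nn.Module):
    """Toy content/motion split: content frame + low-dimensional per-frame motion code."""
    def __init__(self, channels: int, frames: int, motion_dim: int = 64):
        super().__init__()
        self.frame_weights = nn.Parameter(torch.zeros(frames))      # softmaxed over time
        self.to_motion = nn.Linear(channels, motion_dim)

    def forward(self, video: torch.Tensor):
        # video: (batch, frames, channels, height, width)
        w = torch.softmax(self.frame_weights, dim=0).view(1, -1, 1, 1, 1)
        content_frame = (w * video).sum(dim=1)                       # (B, C, H, W)
        residual = video - content_frame.unsqueeze(1)                # what the image misses
        # Pool space away and project channels to a low-dimensional motion code per frame.
        motion_latent = self.to_motion(residual.mean(dim=(3, 4)))    # (B, F, motion_dim)
        return content_frame, motion_latent
```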
4.3. 3D SD
- VideoCrafter: Open Diffusion Models for High-Quality Video Generation and Editing (A Toolkit for Text-to-Video)
- has loras and controlnet, 3d unet; deeper lesson
- VideoFusion: damo/text-to-video-synthesis, summary tiny, paper
5. BY INPUT
5.1. VIDEO COHERENCE
- BIGGER COHERENCE than normal SD image generation
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- recast reward fine-tuning as editing: process corrupted video rated by image reward model
5.3. IMAGES
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
=best=
- SEINE: images of different scenes as inputs, plus text-based control, generates transition videos
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors (prompt and image)
=best=
- AtomoVideo: High Fidelity Image-to-Video Generation
=best=
- from input images; good motion intensity and temporal consistency; compatible with SD models without specific tuning
- pre-trained SD plus added 1D temporal convolution and temporal attention (see the sketch below)
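Sketch of the generic "bolt a 1D temporal convolution onto the frozen spatial layers" recipe this entry describes (kernel size and zero-init are assumptions); it complements the temporal-attention sketch in the AnimateDiff section.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Residual 1D convolution along the frame axis, applied per spatial location."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.zeros_(self.conv.weight)   # starts as a no-op so the image prior is kept
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)  # convolve over frames
        seq = seq + self.conv(seq)                                # residual temporal mixing
        return seq.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
```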
5.3.1. DANCING
- CLOTH
- PixelDance: Make Pixels Dance: High-Dynamic Video Generation
- synthesizing videos with complex scenes and intricate motions
- incorporates image instructions (not just text instructions)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- video diffusion model to encode temporal information
- see 6.2.3 (REFERENCENET)
- Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion
- zero shot on existing t2i, no training or fine-tuning
- pixel-wise guidance steers the diffusion to minimize visual discrepancies
- DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models
- Video ControlNet for motion-controlling and a Content Guider for identity preserving
- Motionshop: An application of replacing the human motion in the video with a virtual 3D human
- segment, retarget, and inpaint (with lighting awareness)
- Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
- directly renders (turns) photorealistic videos into anime style while keeping consistency
- AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising
- best trade-off between motion analogy and identity preservation
- MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer
=best=
- real people references
5.3.1.1. TALKING FACES
- DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
- inputs: songs, speech in multiple languages, noisy audio, and out-of-domain portraits
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
- emotion input; produces different emotional intensities by adjusting the fine-grained emotion
- HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
- generating animatable avatars from textual prompts, visually appealing
- PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- disentangled controls while preserving the identity, realistic
- trained using synthetic data at first
5.4. VIDEO INPUT
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
- edit one frame, then propagate to all
- Hierarchical Masked 3D Diffusion Model for Video Outpainting
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- zero-shot diffusion and EbSynth come together for a new vid2vid
5.5. BY PROMPT
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- DDIM latents enriched with motion dynamics, plus cross-frame attention on the first frame to preserve identity (see the sketch below)
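The cross-frame attention trick is simple enough to show directly (single-head, no projections, shapes simplified): every frame's queries attend to the keys/values of the first frame, which is what anchors identity across frames. In Text2Video-Zero-style methods it replaces the self-attention of the frozen SD U-Net at inference time.

```python
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (frames, tokens, dim); multi-head splitting omitted for brevity."""
    k0 = k[:1].expand_as(k)                     # frame 0's keys, broadcast to all frames
    v0 = v[:1].expand_as(v)                     # frame 0's values, broadcast to all frames
    attn = torch.softmax(q @ k0.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0                            # (frames, tokens, dim)
```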
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models (vid2vid zero)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency
- Video-P2P: cross-attention control (more coherence than Instruct-Pix2Pix) (Adobe)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing (temporal smoothness)
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing (14 gb vram)
- temporal dependency = consistent appearance for the edited objects
=best one=
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
=best=
- reschedules a sequence of noises and performs window-based temporal attention = longer videos conditioned on multiple texts (see the sketch below)
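Hedged sketch of the noise-rescheduling idea (the paper's exact shuffle pattern and its window-based attention fusion are not reproduced here): noise sampled for one window is reused, locally shuffled, to initialize later frames, so distant windows stay correlated.

```python
import torch

def reschedule_noise(window: int, total_frames: int, channels: int = 4,
                     height: int = 64, width: int = 64, generator=None) -> torch.Tensor:
    """Build initial noise for a long clip from one window's worth of noise."""
    base = torch.randn(window, channels, height, width, generator=generator)
    frames = [base]
    while sum(f.shape[0] for f in frames) < total_frames:
        perm = torch.randperm(window, generator=generator)
        frames.append(base[perm])                 # reuse the same noise, locally shuffled
    return torch.cat(frames)[:total_frames]       # (total_frames, C, H, W)
```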
5.5.1. MODELS
- Stable Video Diffusion
- LoRAs for camera control, multiview generation
- MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
=best=
- more coherent movements
5.5.2. LATENT OF BOTH IMAGES AND VIDEO
- Phenaki
- Photorealistic Video Generation with Diffusion Models
- compress images and videos within a unified latent space
- see 4.2.1 (I2VGEN-XL)
5.5.3. WITH ARCHITECTURE STRUCTURE
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
- first pixel-based t2v generation then latent-based upscaling
5.5.3.1. CASCADED
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
- cascaded video latent diffusion models, temporal interpolation model
- incorporates simple temporal self-attention with rotary positional encoding to capture the temporal correlations inherent in video
=best one=
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
=best=
- utilizing static images as a form of crucial guidance
- guarantee coherent semantics by using two hierarchical encoders
6. EXTRA PRIORS
- see also: DRAG DIFFUSION, IDENTITY IN VIDEO
- Dual-Stream Diffusion Net for Text-to-Video Generation
- two diffusion streams (video content and motion branches) = video variations; continuous with no flickering
6.1. STYLECRAFTER
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
=best=
- high-quality stylized videos that align with the content of the texts
- trains a style control adapter on an image dataset, then transfers it to video
6.2. MOTION
- MCDiff: Motion-Conditioned Diffusion Model for Controllable Video Synthesis
- VideoComposer: Compositional Video Synthesis with Motion Controllability (temporal consistency)
- motion vectors as the control signal
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models
=best=
- dual-path LoRA architecture to decouple the learning of appearance and motion (see the sketch after this entry)
- see 4.2.2 (ANIMATEDIFF)
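A sketch of the dual-path LoRA idea described above; the LoRA layer itself is standard, while the rule for deciding which projections count as "spatial/appearance" vs "temporal/motion" (matching on module names) is purely illustrative and not MotionDirector's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a low-rank trainable adapter on top."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

def add_dual_path_loras(unet: nn.Module, rank: int = 4) -> None:
    """Attach 'motion' LoRAs to temporal projections and 'appearance' LoRAs to spatial
    attention projections; the name-based matching below is a hypothetical convention."""
    for name, module in unet.named_children():
        if isinstance(module, nn.Linear) and "temporal" in name:
            setattr(unet, name, LoRALinear(module, rank))      # motion path (train on clips)
        elif isinstance(module, nn.Linear) and "attn" in name:
            setattr(unet, name, LoRALinear(module, rank))      # appearance path (train on frames)
        else:
            add_dual_path_loras(module, rank)                  # recurse into submodules
```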
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation (8~16 videos = 1 Motion)
- expand pretrained 2D T2I convolution layers to temporal-spatial motion learning layers
- shared-noise sampling = improves the stability of the generated videos (see the sketch below)
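Shared-noise sampling is easy to sketch (the mixing ratio is an illustrative guess): every frame's initial noise mixes one shared map with a per-frame map, then is rescaled back to roughly unit variance.

```python
import torch

def shared_noise(frames: int, channels: int, height: int, width: int,
                 shared_ratio: float = 0.8, generator=None) -> torch.Tensor:
    """Initial noise where all frames share a common component for stability."""
    common = torch.randn(1, channels, height, width, generator=generator)
    per_frame = torch.randn(frames, channels, height, width, generator=generator)
    noise = shared_ratio * common + (1 - shared_ratio) * per_frame
    # Rescale so the mixture of two independent unit-variance noises stays unit variance.
    return noise / (shared_ratio ** 2 + (1 - shared_ratio) ** 2) ** 0.5
```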
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- desired subject and a few videos of target motion (subject, motion learning on top of video model)
6.2.1. SVD
- AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance (anything)
- fine-tunes Stable Video Diffusion
6.2.2. CONTROLLER
- MagicStick: Controllable Video Editing via Control Handle Transformations
- keyframe transformations can easily propagate to other frames to provide generation guidance
- inflates the image model and ControlNet to the temporal dimension; trains a LoRA to fit the specific scene
- Customizing Motion in Text-to-Video Diffusion Models
- map depicted motion to a new unique token, and can invoke the motion in combination with other motions
- Peekaboo: Interactive Video Generation via Masked-Diffusion
- based on attention masking; controls size and position
- see also: DIFFDIRECTOR, ANIMATELCM
- Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
- a guidance loss that encourages the sample to have the desired motion
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
- pre-trained model without further model training (bounding boxes to guide)
- Boximator: Generating Rich and Controllable Motions for Video Synthesis
- hard box and soft box
- plug-in for existing video diffusion models, training only a module
- Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts
- locally aware and not moving the entire scene
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
- camera pose control, parameterizing the camera trajectory
- AnimateDiff more more
6.2.2.1. MOTION FROM VIDEO
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
- aligns motion vectors using Fourier and wavelet transforms
- maintaining computational efficiency and compatibility with other customizations
- Motion Inversion for Video Customization
- Motion Embeddings: temporally coherent embeddings derived from a given video
- less than 10 minutes of training time
6.2.2.2. DRAGANYTHING
- DragAnything: Motion Control for Anything using Entity Representation
- trajectory-based control is more user-friendly; controls motion for diverse entities
6.2.2.3. DEFINE CAMERA MOVEMENT
- LivePhoto: Real Image Animation with Text-guided Motion Control
- motion-related textual instructions: actions, camera movements, new contents
- motion intensity estimation module (control signal)
- MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion
- define camera movement and then object motion using bounding box
6.2.3. REFERENCENET
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- ReferenceNet (a ControlNet-style copy) merges detail features via spatial attention; temporal modeling handles inter-frame transitions between video frames (see the sketch below)
- Moore-AnimateAnyone (over sd 1.5)
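Sketch of the spatial-attention merge the ReferenceNet line describes (single-head, projections passed in; the real model also adds temporal attention and a pose guider): the denoising U-Net's self-attention keys/values are extended with the reference image's tokens so appearance detail flows in from the reference.

```python
import torch

def reference_spatial_attention(x_tokens: torch.Tensor, ref_tokens: torch.Tensor,
                                to_q, to_k, to_v) -> torch.Tensor:
    """x_tokens: (batch, N, dim) U-Net tokens; ref_tokens: (batch, M, dim) reference tokens.
    to_q/to_k/to_v are the existing attention projections (e.g. nn.Linear modules)."""
    q = to_q(x_tokens)                                   # (batch, N, dim)
    kv_src = torch.cat([x_tokens, ref_tokens], dim=1)    # append reference tokens
    k, v = to_k(kv_src), to_v(kv_src)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                      # (batch, N, dim)
```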
6.3. LONG VIDEO
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
- coarse-to-fine process that iteratively completes the middle frames
- sparseformer
- Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
- autoregressive with patches
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
- keyframe synthesis to lay out the storyline of the video, then interpolation
- 1
7. GENERATED VIDEO ENHANCEMENT
- optical flow background removal
7.2. USING MODEL
- MS-Vid2Vid
- enhance the resolution and spatiotemporal continuity of text-generated videos and image-generated videos
8. OTHERS EDITING VIDEO
- VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
- MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions
- zero-shot subject customization, ControlNet-only control, video transformation
- ActAnywhere: Subject-Aware Video Background Generation
- input: a segmented subject and a contextual image
- STABLEIDENTITY inserting identity
8.1. VIDEO INPAINT
- Anything in Any Scene: Photorealistic Video Object Insertion (geometric realism, lighting realism, and photorealism)
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
- uses human painting and drag-and-drop as priors for the inpainting generation; dynamic interaction
- Place Anything into Any Video
- using just a photograph of the object, looks like enhanced VR
- Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
- add or remove objects, semantically change objects, insert stock photos into videos
8.1.1. OUTPAINTER
- Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
=best=
- input-specific adaptation and pattern-aware outpainting
8.2. VIDEO EXCHANGE
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
- exploits semantic point correspondences,
- only a small number of semantic points are necessary to align the subject’s motion trajectory and modify its shape
8.3. CONTROLNET VIDEO
- Stable Video Diffusion Temporal Controlnet
8.4. FRAME INTERPOLATION
- MA-VFI: Motion-Aware Video Frame Interpolation
- BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
- illumination histograms that precisely capture flickering and local exposure variation
- to restore faithful and consistent texture affected by lighting changes; 10 times faster