diffusion video
- parent: stablediffusion
- SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
- memory no longer blows up with more frames (SSM temporal layers scale linearly in frame count, unlike attention)
1. TUNING
- Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (alleviates the need for dataset-specific tuning)
- prevents loss of image detail and noise prediction bias during the denoising process
- adds noise to the image latent, then denoises it with a rectification step that counters the noise prediction bias (see the sketch below)
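A minimal sketch of the idea as I read it, not the paper's exact formulation: the conditioning image latent is noised, the added noise is kept, and during denoising the model's noise prediction is blended back toward that known noise. The `eps_model` interface, the alpha-bar schedule, and the blend weight are all illustrative assumptions.

```python
import torch

def rectified_denoise(eps_model, image_latent, num_frames, steps=50, rectify_weight=0.4):
    """eps_model(x_t, step) -> predicted noise; stands in for a video diffusion U-Net."""
    alpha_bar = torch.linspace(0.001, 0.999, steps)              # toy schedule: noisy -> clean
    x0 = image_latent.unsqueeze(0).repeat(num_frames, 1, 1, 1)   # (F, C, H, W)
    added_noise = torch.randn_like(x0)
    # Forward-noise the repeated image latent to the terminal step, keeping the noise around.
    x = alpha_bar[0].sqrt() * x0 + (1 - alpha_bar[0]).sqrt() * added_noise
    for i in range(steps):
        ab = alpha_bar[i]
        eps = eps_model(x, i)
        # Rectification: pull the prediction toward the noise we actually added,
        # countering the bias that would otherwise wash out image detail.
        eps = rectify_weight * added_noise + (1 - rectify_weight) * eps
        x0_hat = (x - (1 - ab).sqrt() * eps) / ab.sqrt()          # predicted clean latent
        ab_next = alpha_bar[i + 1] if i + 1 < steps else torch.tensor(1.0)
        x = ab_next.sqrt() * x0_hat + (1 - ab_next).sqrt() * eps  # DDIM-style update
    return x
```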
- Attention Prompt Tuning: Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition
- efficient prompt tuning for video applications such as action recognition
2. ANIMATION
3. 4D CONTROL
- see also: BIGGER COHERENCE, FACE, ARTIST EDITING
- DiffDreamer: Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
- landscape (mountain) fly-overs
- DisCo: Disentangled Control for Referring Human Dance Generation in Real World
- human dance (movement) images and videos (using skeleton rigs)
3.1. GIF
- Hotshot-XL, text-to-GIF model for Stable Diffusion XL
- Generative Image Dynamics: interactive GIFs (looping dynamic videos)
- frequency-coordinated diffusion sampling process
- neural stochastic motion texture
- Pix2Gif: Motion-Guided Diffusion for GIF Generation
- the motion-transformed feature map remains in the same space as the target, which preserves consistency and coherence
- DynamiCrafter: generative frame interpolation and looping video generation (320x512)
- Explorative Inbetweening of Time and Space
- bounded generation with a pre-trained image-to-video model, without any tuning or optimization
- two images that capture a subject motion, translation between different viewpoints, or looping
3.2. INTERACTIVE
3.2.1. COLORIZATION
- Learning Inclusion Matching for Animation Paint Bucket Colorization
- for hand-drawn cel animation
- comprehend the inclusion relationships between segments
- paint based on previous frame
3.2.2. DRAG
=DragGAN=
: Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold - dragging as the input primitive, using pairs of points; excellent results; StyleGAN derivative
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
- moving, resizing, appearance replacement, dragging
- StableDrag: Stable Dragging for Point-based Image Editing
- models: StableDrag-GAN and StableDrag-Diff
- confidence-based latent enhancement strategy for motion supervision
3.2.2.1. DRAG DIFFUSION
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
- RotationDrag: Point-based Image Editing with Rotated Diffusion Features
- rotates the diffusion feature maps so point-based dragging can handle in-plane rotation
- DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
- controls trajectories at different granularities
- Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
- superior control and semantic retention, reducing optimization time by 50% compared to DragDiffusion
3.2.3. HEAD POSE
- see 5.3.1.1 (TALKING FACES)
- Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
- 4D GAN + 2D diffusion-based editor = consistent 4D editing
=best one=
- change the face in a video
- AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections
- facial expression, head pose, and shoulder movements
- trained on unstructured 2D images
- MagiCapture: High-Resolution Multi-Concept Portrait Customization
- generate high-resolution portrait images given a handful of random selfies
- DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
- input: unposed portrait image, retains identity and facial expression
- Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
- novel view synthesis; input: single image and morphable mesh for desired facial expression (emotion)
4. SEMANTICALLY DEFORMED
- VideoLDM: HD, but still semantically deformed (NVIDIA)
4.1. SEMANTICAL FIELD
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing
- consistency in edited video can be obtained by enforcing consistency in the diffusion feature space
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
- video to video, frame consistency
- aggregates the entire video into one canonical image, then propagates edits to every frame through a deformation field
=best one=
- S2DM: Sector-Shaped Diffusion Models for Video Generation
=best=
- explore the use of optical flow as temporal conditions
- prompt correctness while keeping semantic consistency; can integrate with other temporal conditions
- decouple the generation of temporal features from semantic-content features
4.2. SD BASED
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
- temporal shift module that can leverage the spatial U-Net as is (see the sketch below)
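For reference, the generic temporal-shift trick this entry points at (Latent-Shift's exact module may differ): a fraction of channels is shifted one step forward or backward along the frame axis, so the frozen spatial U-Net layers see neighbouring-frame features at zero extra parameter cost.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_frac: float = 0.125) -> torch.Tensor:
    """x: (batch, frames, channels, height, width) activations between U-Net layers."""
    b, f, c, h, w = x.shape
    n = int(c * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]             # these channels look at the previous frame
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]   # these channels look at the next frame
    out[:, :, 2 * n:] = x[:, :, 2 * n:]        # the remaining channels are untouched
    return out
```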
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
- compatible with existing diffusion
=best one=
- hierarchical cross-frame constraints applied to enforce coherence
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- inflates the SD image model into a video model
- FROZEN SD
- FateZero: Fusing Attentions (MIT) for Zero-shot Text-based Video Editing
- most fluid one, without training
- RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
=best=
- employs a novel noise shuffling strategy to leverage temporal interactions (coherence); see the sketch below
- guidance with ControlNet
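A rough sketch of the shuffle-into-grids idea behind RAVE (grid size, divisibility handling, and the per-step re-shuffling policy are my assumptions): frames are permuted and tiled into large "grid" latents so a 2D model, optionally ControlNet-guided, denoises several frames in one spatial pass, and the permutation is undone afterwards.

```python
import torch

def shuffle_to_grids(latents: torch.Tensor, grid_frames: int, generator=None):
    """latents: (F, C, H, W); F must divide by grid_frames, grid_frames a perfect square."""
    f, c, h, w = latents.shape
    perm = torch.randperm(f, generator=generator)
    shuffled = latents[perm]
    side = int(grid_frames ** 0.5)                       # e.g. 4 frames -> a 2x2 grid
    grids = shuffled.reshape(f // grid_frames, side, side, c, h, w)
    grids = grids.permute(0, 3, 1, 4, 2, 5).reshape(f // grid_frames, c, side * h, side * w)
    return grids, perm

def unshuffle_from_grids(grids: torch.Tensor, perm: torch.Tensor, grid_frames: int):
    g, c, gh, gw = grids.shape
    side = int(grid_frames ** 0.5)
    h, w = gh // side, gw // side
    frames = grids.reshape(g, c, side, h, side, w).permute(0, 2, 4, 1, 3, 5)
    frames = frames.reshape(g * grid_frames, c, h, w)
    return frames[torch.argsort(perm)]                   # undo the frame permutation
```

Drawing a fresh permutation at each denoising step is what spreads appearance information across the whole clip.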
- FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
- doesn't strictly adhere to the optical flow
- first frame = supplementary reference in the diffusion model
- works seamlessly with existing I2I models
4.2.1. I2VGEN-XL
- I2VGen-XL (MS-Image2Video): non-commercial; good consistency and continuity; animates an input image
- built on SD; U-Net designed to perform spatiotemporal modeling in the latent space
- pre-trained on video and images
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
- utilizing static images as a form of crucial guidance
4.2.2. ANIMATEDIFF
=best one=
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
- inserts a motion module into the frozen (standard SD) text-to-image model (see the sketch after this entry)
- examples: (nsfw) video1 video2 video3 video4>>96101928 notnsfw: video1>>96052859 sword and sun>>96155685
- current state of things: https://banodoco.ai/Animatediff (more insight)
- techniques:
- Animatediff-cli-prompt-travel+Upscale: https://twitter.com/toyxyz3/status/1695134607317012749
- Controlling AnimateDiff using starting and ending frames (from Twitter user @TDS95514874)
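Sketch of what such a motion module roughly looks like (layer sizes and initialization are guesses, not AnimateDiff's exact code): temporal self-attention applied per spatial location across the frame axis, zero-initialized so the frozen T2I U-Net is unchanged at the start of training.

```python
import torch
import torch.nn as nn

class TemporalMotionModule(nn.Module):
    """Self-attention across the frame axis only, inserted between frozen spatial blocks."""
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)
        nn.init.zeros_(self.proj.weight)   # starts as an identity (residual) mapping
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) feature map from a spatial block
        b, f, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)  # attend over frames
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + self.proj(attn_out)                        # residual update
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```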
- AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
- T2I generation is more controllable and efficient compared to T2V
- we can transform pre-trained T2V models into I2V models
- LongAnimateDiff: now 64 frames
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise-AnimateDiff)
- removed the semantic flickering
- AnimateDiff-Lightning: fast text-to-video model; can generate videos more than ten times faster than AnimateDiff
4.2.2.1. DIFFDIRECTOR
- DiffDirector (AnimateDiff-MotionDirector): train a MotionLoRA with MotionDirector and run it on any compatible AnimateDiff UI
4.2.2.2. PIA
4.2.2.3. ANIMATELCM
- AnimateLCM: decouples the distillation of image generation priors and motion generation priors
4.2.2.4. CMD
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
=best=
- content-motion latent diffusion model (CMD)
- autoencoder that succinctly encodes a video as a combination of a content frame (image) and a low-dimensional motion latent representation (see the sketch below)
- pretrained image diffusion model plus lightweight diffusion motion model
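A heavily simplified sketch of the decomposition (the real CMD autoencoder is learned end-to-end and its motion latent is richer; the dimensions and pooling here are invented): a learned weighted average over time gives a content frame that an image diffusion model can handle, and a small projection of the per-frame residuals gives the motion latent.

```python
import torch
import torch.nn as nn

class ContentMotionEncoder(nn.Module):
    """Toy content/motion split: content frame + low-dimensional per-frame motion code."""
    def __init__(self, channels: int, frames: int, motion_dim: int = 64):
        super().__init__()
        self.frame_weights = nn.Parameter(torch.zeros(frames))      # softmaxed over time
        self.to_motion = nn.Linear(channels, motion_dim)

    def forward(self, video: torch.Tensor):
        # video: (batch, frames, channels, height, width)
        w = torch.softmax(self.frame_weights, dim=0).view(1, -1, 1, 1, 1)
        content_frame = (w * video).sum(dim=1)                       # (B, C, H, W)
        residual = video - content_frame.unsqueeze(1)                # what the image misses
        # Pool space away and project channels to a low-dimensional motion code per frame.
        motion_latent = self.to_motion(residual.mean(dim=(3, 4)))    # (B, F, motion_dim)
        return content_frame, motion_latent
```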
4.3. 3D SD
- VideoCrafter: Open Diffusion Models for High-Quality Video Generation and Editing (A Toolkit for Text-to-Video)
- has loras and controlnet, 3d unet; deeper lesson
- VideoFusion: damo/text-to-video-synthesis, summary tiny, paper
5. BY INPUT
5.1. VIDEO COHERENCE
- BIGGER COHERENCE than normal SD image generation
- InstructVideo: Instructing Video Diffusion Models with Human Feedback
- recast reward fine-tuning as editing: process corrupted video rated by image reward model
5.3. IMAGES
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
=best=
- SEINE: images of different scenes as inputs, plus text-based control, generates transition videos
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors (prompt and image)
=best=
- AtomoVideo: High Fidelity Image-to-Video Generation
=best=
- from input images; good motion intensity and temporal consistency; compatible with SD models without specific tuning
- pre-trained SD plus added 1D temporal convolution and temporal attention (see the sketch below)
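Sketch of the generic "bolt a 1D temporal convolution onto the frozen spatial layers" recipe this entry describes (kernel size and zero-init are assumptions); it complements the temporal-attention sketch in the AnimateDiff section.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Residual 1D convolution along the frame axis, applied per spatial location."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.zeros_(self.conv.weight)   # starts as a no-op so the image prior is kept
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)  # convolve over frames
        seq = seq + self.conv(seq)                                # residual temporal mixing
        return seq.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
```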
5.3.1. DANCING
- CLOTH
- PixelDance: Make Pixels Dance: High-Dynamic Video Generation
- synthesizing videos with complex scenes and intricate motions
- incorporates image instructions (not just text instructions)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
- video diffusion model to encode temporal information
- see 6.2.3 (REFERENCENET)
- Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion
- zero shot on existing t2i, no training or fine-tuning
- pixel-wise guidance steers the diffusion to minimize visual discrepancies
- DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models
- Video ControlNet for motion-controlling and a Content Guider for identity preserving
- Motionshop: An application of replacing the human motion in the video with a virtual 3D human
- segment, retarget, and inpaint (with lighting awareness)
- Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
- directly renders (turns) photorealistic videos into anime style while keeping consistency
- AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising
- best trade-off between motion analogy and identity preservation
- MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer
=best=
- real people references
5.3.1.1. TALKING FACES
- DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
- inputs: songs, speech in multiple languages, noisy audio, and out-of-domain portraits
- EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
- emotion input; produces different emotional intensities by adjusting the fine-grained emotion
- HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
- generating animatable avatars from textual prompts, visually appealing
- PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
- disentangled controls while preserving the identity, realistic
- trained using synthetic data at first
5.4. VIDEO INPUT
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
- edit one frame, then propagate to all
- Hierarchical Masked 3D Diffusion Model for Video Outpainting
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
- zero-shot diffusion and EbSynth come together for a new vid2vid
5.5. BY PROMPT
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- DDIM latents enriched with motion dynamics, plus cross-frame attention on the first frame to preserve identity (see the sketch below)
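The cross-frame attention trick is simple enough to show directly (single-head, no projections, shapes simplified): every frame's queries attend to the keys/values of the first frame, which is what anchors identity across frames. In Text2Video-Zero-style methods it replaces the self-attention of the frozen SD U-Net at inference time.

```python
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (frames, tokens, dim); multi-head splitting omitted for brevity."""
    k0 = k[:1].expand_as(k)                     # frame 0's keys, broadcast to all frames
    v0 = v[:1].expand_as(v)                     # frame 0's values, broadcast to all frames
    attn = torch.softmax(q @ k0.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0                            # (frames, tokens, dim)
```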
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models (vid2vid zero)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency
- Video-P2P: cross-attention control (more coherence than Instruct-Pix2Pix) (Adobe)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing (temporal smoothness)
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing (14 gb vram)
- temporal dependency = consistent appearance for the edited objects
=best one=
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
=best=
- reschedules a sequence of noises and performs window-based temporal attention = longer videos conditioned on multiple texts (see the sketch below)
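Hedged sketch of the noise-rescheduling idea (the paper's exact shuffle pattern and its window-based attention fusion are not reproduced here): noise sampled for one window is reused, locally shuffled, to initialize later frames, so distant windows stay correlated.

```python
import torch

def reschedule_noise(window: int, total_frames: int, channels: int = 4,
                     height: int = 64, width: int = 64, generator=None) -> torch.Tensor:
    """Build initial noise for a long clip from one window's worth of noise."""
    base = torch.randn(window, channels, height, width, generator=generator)
    frames = [base]
    while sum(f.shape[0] for f in frames) < total_frames:
        perm = torch.randperm(window, generator=generator)
        frames.append(base[perm])                 # reuse the same noise, locally shuffled
    return torch.cat(frames)[:total_frames]       # (total_frames, C, H, W)
```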
5.5.1. MODELS
- Stable Video Diffusion
- LoRAs for camera control, multiview generation
- MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
=best=
- more coherent movements
5.5.2. LATENT OF BOTH IMAGES AND VIDEO
- Phenaki
- Photorealistic Video Generation with Diffusion Models
- compress images and videos within a unified latent space
- see 4.2.1 (I2VGEN-XL)
5.5.3. WITH ARCHITECTURE STRUCTURE
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
- first pixel-based t2v generation then latent-based upscaling
5.5.3.1. CASCADED
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
- cascaded video latent diffusion models, temporal interpolation model
- incorporates simple temporal self-attention with rotary positional encoding to capture the temporal correlations inherent in video
=best one=
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
=best=
- utilizing static images as a form of crucial guidance
- guarantee coherent semantics by using two hierarchical encoders
6. EXTRA PRIORS
- see also: DRAG DIFFUSION, IDENTITY IN VIDEO
- Dual-Stream Diffusion Net for Text-to-Video Generation
- two diffusion streams (video content and motion branches) = video variations; continuous with no flickering
6.1. STYLECRAFTER
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
=best=
- high-quality stylized videos that align with the content of the texts
- trains a style control adapter on an image dataset, then transfers it to video
6.2. MOTION
- MCDiff: Motion-Conditioned Diffusion Model for Controllable Video Synthesis
- VideoComposer: Compositional Video Synthesis with Motion Controllability (temporal consistency)
- motion vectors as the control signal
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models
=best=
- dual-path LoRA architecture to decouple the learning of appearance and motion (see the sketch after this entry)
- see 4.2.2 (ANIMATEDIFF)
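A sketch of the dual-path LoRA idea described above; the LoRA layer itself is standard, while the rule for deciding which projections count as "spatial/appearance" vs "temporal/motion" (matching on module names) is purely illustrative and not MotionDirector's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a low-rank trainable adapter on top."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

def add_dual_path_loras(unet: nn.Module, rank: int = 4) -> None:
    """Attach 'motion' LoRAs to temporal projections and 'appearance' LoRAs to spatial
    attention projections; the name-based matching below is a hypothetical convention."""
    for name, module in unet.named_children():
        if isinstance(module, nn.Linear) and "temporal" in name:
            setattr(unet, name, LoRALinear(module, rank))      # motion path (train on clips)
        elif isinstance(module, nn.Linear) and "attn" in name:
            setattr(unet, name, LoRALinear(module, rank))      # appearance path (train on frames)
        else:
            add_dual_path_loras(module, rank)                  # recurse into submodules
```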
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation (8~16 videos = 1 Motion)
- expand pretrained 2D T2I convolution layers to temporal-spatial motion learning layers
- shared-noise sampling = improves the stability of the generated videos (see the sketch below)
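Shared-noise sampling is easy to sketch (the mixing ratio is an illustrative guess): every frame's initial noise mixes one shared map with a per-frame map, then is rescaled back to roughly unit variance.

```python
import torch

def shared_noise(frames: int, channels: int, height: int, width: int,
                 shared_ratio: float = 0.8, generator=None) -> torch.Tensor:
    """Initial noise where all frames share a common component for stability."""
    common = torch.randn(1, channels, height, width, generator=generator)
    per_frame = torch.randn(frames, channels, height, width, generator=generator)
    noise = shared_ratio * common + (1 - shared_ratio) * per_frame
    # Rescale so the mixture of two independent unit-variance noises stays unit variance.
    return noise / (shared_ratio ** 2 + (1 - shared_ratio) ** 2) ** 0.5
```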
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- desired subject and a few videos of target motion (subject, motion learning on top of video model)
6.2.1. SVD
- AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance (anything)
- fine-tunes Stable Video Diffusion
6.2.2. CONTROLLER
- MagicStick: Controllable Video Editing via Control Handle Transformations
- keyframe transformations can easily propagate to other frames to provide generation guidance
- inflates the image model and ControlNet to the temporal dimension; trains a LoRA to fit the specific scene
- Customizing Motion in Text-to-Video Diffusion Models
- map depicted motion to a new unique token, and can invoke the motion in combination with other motions
- Peekaboo: Interactive Video Generation via Masked-Diffusion
- based on attention masking; controls size and position
- see also: DIFFDIRECTOR, ANIMATELCM
- Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
- a guidance loss that encourages the sample to have the desired motion
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
- pre-trained model without further model training (bounding boxes to guide)
- Boximator: Generating Rich and Controllable Motions for Video Synthesis
- hard box and soft box
- plug-in for existing video diffusion models, training only a module
- Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts
- locally aware and not moving the entire scene
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
- camera pose control, parameterizing the camera trajectory
- AnimateDiff more more
6.2.2.1. MOTION FROM VIDEO
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
- aligns motion vectors using Fourier and wavelet transforms
- maintaining computational efficiency and compatibility with other customizations
- Motion Inversion for Video Customization
- Motion Embeddings: temporally coherent embeddings derived from a given video
- less than 10 minutes of training time
6.2.2.2. DRAGANYTHING
- DragAnything: Motion Control for Anything using Entity Representation
- trajectory-based control is more user-friendly; controls motion for diverse entities
6.2.2.3. DEFINE CAMERA MOVEMENT
- LivePhoto: Real Image Animation with Text-guided Motion Control
- motion-related textual instructions: actions, camera movements, new contents
- motion intensity estimation module (control signal)
- MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion
- define camera movement and then object motion using bounding box
6.2.3. REFERENCENET
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
- ReferenceNet (a ControlNet-style copy) merges detail features via spatial attention; temporal modeling handles inter-frame transitions between video frames (see the sketch below)
- Moore-AnimateAnyone (over sd 1.5)
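Sketch of the spatial-attention merge the ReferenceNet line describes (single-head, projections passed in; the real model also adds temporal attention and a pose guider): the denoising U-Net's self-attention keys/values are extended with the reference image's tokens so appearance detail flows in from the reference.

```python
import torch

def reference_spatial_attention(x_tokens: torch.Tensor, ref_tokens: torch.Tensor,
                                to_q, to_k, to_v) -> torch.Tensor:
    """x_tokens: (batch, N, dim) U-Net tokens; ref_tokens: (batch, M, dim) reference tokens.
    to_q/to_k/to_v are the existing attention projections (e.g. nn.Linear modules)."""
    q = to_q(x_tokens)                                   # (batch, N, dim)
    kv_src = torch.cat([x_tokens, ref_tokens], dim=1)    # append reference tokens
    k, v = to_k(kv_src), to_v(kv_src)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                      # (batch, N, dim)
```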
6.3. LONG VIDEO
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
- coarse-to-fine process that iteratively completes the middle frames
- sparseformer
- Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
- autoregressive with patches
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
- keyframe synthesis to lay out the storyline of the video, then interpolation
- 1
7. GENERATED VIDEO ENHANCEMENT
- optical flow background removal
7.2. USING MODEL
- MS-Vid2Vid
- enhance the resolution and spatiotemporal continuity of text-generated videos and image-generated videos
8. OTHERS EDITING VIDEO
- VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
- MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions
- zero-shot subject customization, ControlNet-only control, video transformation
- ActAnywhere: Subject-Aware Video Background Generation
- input: a segmented subject and a contextual image
- STABLEIDENTITY inserting identity
8.1. VIDEO INPAINT
- Anything in Any Scene: Photorealistic Video Object Insertion (geometric realism, lighting realism, and photorealism)
- InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
- uses human painting and drag-and-drop as priors for the inpainting generation; dynamic interaction
- Place Anything into Any Video
- using just a photograph of the object, looks like enhanced VR
- Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
- add or remove objects, semantically change objects, insert stock photos into videos
8.1.1. OUTPAINTER
- Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
=best=
- input-specific adaptation and pattern-aware outpainting
8.2. VIDEO EXCHANGE
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
- exploits semantic point correspondences,
- only a small number of semantic points are necessary to align the subject’s motion trajectory and modify its shape
8.3. CONTROLNET VIDEO
- Stable Video Diffusion Temporal Controlnet
8.4. FRAME INTERPOLATION
- MA-VFI: Motion-Aware Video Frame Interpolation
- BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
- illumination histograms that precisely capture flickering and local exposure variation
- to restore faithful and consistent texture affected by lighting changes; 10 times faster