diffusion video

Table of Contents

1. TUNING

  • Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation (dataset alleviation)
    • prevents loss of image details and counters noise prediction biases during the denoising process
    • adds noise, then denoises the noisy latent with a proper rectification to alleviate the noise prediction biases (see sketch at the end of this section)
  • Attention Prompt Tuning: Parameter-Efficient Adaptation of Pre-Trained Models for Action Recognition
    • efficient prompt tuning for video applications such as action recognition
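
  A minimal sketch of the add-noise-then-rectify idea above, assuming a diffusers-style scheduler and UNet; the function name and the blending weight alpha are illustrative, not the paper's API.

    import torch

    def rectified_init(image_latent, noise_scheduler, unet, t_start, text_emb, alpha=0.5):
        # 1. Diffuse the clean image latent to timestep t_start with a *known* noise.
        noise = torch.randn_like(image_latent)
        noisy = noise_scheduler.add_noise(image_latent, noise, t_start)
        # 2. Ask the model what noise it thinks is in the latent.
        pred = unet(noisy, t_start, encoder_hidden_states=text_emb).sample
        # 3. Rectify: pull the prediction toward the noise actually injected,
        #    so image details are not lost to the prediction bias.
        rectified = alpha * noise + (1.0 - alpha) * pred
        return noisy, rectified  # continue ordinary sampling from these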

2. ANIMATION

  • Keyframer: Empowering Animation Design using Large Language Models
    • animating static images (SVGs) with natural language
  • Animated Stickers: Bringing Stickers to Life with Video Diffusion (animated emojis)

3. 4D CONTROL

  • BIGGER COHERENCE FACE ARTIST EDITING
  • DiffDreamer: Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
    • landscape(mountains) fly overs
  • DisCo: Disentangled Control for Referring Human Dance Generation in Real World
    • human dance (movement) images and videos (using skeleton rigs)

3.1. GIF

  • Hotshot-XL, text-to-GIF model for Stable Diffusion XL
  • Generative Image Dynamics, interactive GIFs (looping dynamic videos)
    • frequency-coordinated diffusion sampling process
    • neural stochastic motion texture
  • Pix2Gif: Motion-Guided Diffusion for GIF Generation
    • transformed feature map (motion) remains within the same space as the target, thus preserving consistency and coherence
  • DynamiCrafter: generative frame interpolation and looping video generation (320x512)
  • Explorative Inbetweening of Time and Space
    • bounded generation of a pre-trained image-to-video model without any tuning and optimization
    • two images that capture a subject motion, translation between different viewpoints, or looping

3.2. INTERACTIVE

3.2.1. COLORIZATION

  • Learning Inclusion Matching for Animation Paint Bucket Colorization
    • for hand-drawn cel animation
    • comprehend the inclusion relationships between segments
    • paint based on previous frame

3.2.2. DRAG

  • =DragGAN=: Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold
    • dragging as input primitive, using pairs of points, excellent results, stylegan derivative
  • DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
    • moving, resizing, appearance replacement, dragging
  • StableDrag: Stable Dragging for Point-based Image Editing
    • models: StableDrag-GAN and StableDrag-Diff
    • confidence-based latent enhancement strategy for motion supervision
3.2.2.1. DRAG DIFFUSION
  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
    • RotationDrag: Point-based Image Editing with Rotated Diffusion Features
      • utilizing the feature map to rotate-move images
  • DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
    • control trajectories in different granularities
  • Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
    • superior control and semantic retention, reducing optimization time by 50% compared to DragDiffusion

3.2.3. HEAD POSE

  • 5.3.1.1
  • Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
    • 4D GAN, 2D diffusion, consistent 4D =best one=
    • change face of video
  • AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections
    • facial expression, head pose, and shoulder movements
    • trained on unstructured 2D images
  • MagiCapture: High-Resolution Multi-Concept Portrait Customization
    • generate high-resolution portrait images given a handful of random selfies
  • DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
    • input: unposed portrait image, retains identity and facial expression
  • Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
    • novel view synthesis; input: single image and morphable mesh for desired facial expression (emotion)

4. SEMANTICALLY DEFORMED

  • VideoLDM: HD, but still semantically deformed (NVIDIA)

4.1. SEMANTICAL FIELD

  • TokenFlow: Consistent Diffusion Features for Consistent Video Editing
    • consistency in edited video can be obtained by enforcing consistency in the diffusion feature space
  • CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
    • video to video, frame consistency
    • aggregates the entire video and then applies a deformation field to one canonical image =best one=
  • S2DM: Sector-Shaped Diffusion Models for Video Generation =best=
    • explores the use of optical flow as a temporal condition
    • prompt correctness while keeping semantic consistency; can integrate with other temporal conditions
    • decouples the generation of temporal features from semantic-content features

4.2. SD BASED

  • Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
    • temporal shift module that can leverage the spatial UNet as-is (see sketch at the end of this list)
  • Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
    • compatible with existing diffusion models =best one=
    • hierarchical cross-frame constraints applied to enforce coherence
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    • inflates the SD model into a video model
    • FROZEN SD
      • Fate/Zero: Fusing Attentions(MIT) for Zero-shot Text-based Video Editing
        • most fluid one, without training
      • RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models =best=
        • employs a novel noise shuffling strategy to leverage temporal interactions (coherence)
        • guidance with ControlNet
  • FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
    • doesn't strictly adhere to optical flow
    • first frame = supplementary reference in the diffusion model
    • works seamlessly with existing I2I models
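
  A minimal sketch of a temporal shift over the frame axis, in the spirit of the Latent-Shift module noted above; the channel split and its placement inside the UNet are illustrative assumptions.

    import torch

    def temporal_shift(x, shift_frac=0.25):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        n = int(c * shift_frac)
        out = torch.zeros_like(x)
        out[:, 1:, :n] = x[:, :-1, :n]            # shift some channels forward in time
        out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # shift some channels backward in time
        out[:, :, 2 * n:] = x[:, :, 2 * n:]       # leave the rest untouched
        # the unchanged 2D spatial UNet now sees neighbouring-frame information
        return out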

4.2.1. I2VGEN-XL

  • I2VGen-XL (MS-Image2Video): non-commercial; good consistency and continuity; animates images
    • built on SD; UNet designed to perform spatiotemporal modeling in the latent space
    • pre-trained on video and images
    • I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
      • utilizing static images as a form of training guidance

4.2.2. ANIMATEDIFF

4.2.2.1. DIFFDIRECTOR
  • DiffDirector: AnimateDiff + MotionDirector; train a MotionLoRA and run it on any compatible AnimateDiff UI
4.2.2.2. PIA
  • PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
    • motion controllability by text: temporal alignment layers (TA) out of token
4.2.2.3. ANIMATELCM
4.2.2.4. CMD
  • Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition =best=
    • content-motion latent diffusion model (CMD)
      • autoencoder that succinctly encodes a video as a combination of a content frame and a low-dimensional motion latent representation (see sketch below)
      • pretrained image diffusion model plus lightweight diffusion motion model
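
  A toy sketch of the content/motion split described above; the weighted-average content frame and the tiny motion encoder are illustrative, not the CMD architecture.

    import torch
    import torch.nn as nn

    class ContentMotionAE(nn.Module):
        def __init__(self, frames=16, channels=3, motion_dim=64):
            super().__init__()
            self.frame_weights = nn.Parameter(torch.ones(frames) / frames)
            self.motion_enc = nn.Linear(frames * channels, motion_dim)

        def encode(self, video):                  # video: (B, F, C, H, W)
            w = self.frame_weights.softmax(0).view(1, -1, 1, 1, 1)
            content = (video * w).sum(dim=1)      # one image-like content frame
            motion = self.motion_enc(video.mean(dim=(3, 4)).flatten(1))
            # content goes to a pretrained image diffusion model,
            # motion to a lightweight motion diffusion model
            return content, motion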

4.3. 3D SD

5. BY INPUT

5.1. VIDEO COHERENCE

  • BIGGER COHERENCE from normal sd image generation
  • InstructVideo: Instructing Video Diffusion Models with Human Feedback
    • recast reward fine-tuning as editing: process corrupted video rated by image reward model

5.2. VCHITECT

5.3. IMAGES

  • SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction =best=
    • SEINE: images of different scenes as inputs, plus text-based control, generates transition videos
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors (prompt and image) =best=
  • AtomoVideo: High Fidelity Image-to-Video Generation =best=
    • from input images: motion intensity and consistency; compatible with SD models without specific tuning
    • pre-trained SD plus 1D temporal convolutions and temporal attention (see sketch below)
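
  A minimal sketch of the generic recipe noted for AtomoVideo-style models: keep the pre-trained spatial SD layers frozen and append a 1D temporal convolution plus temporal attention; the head count and exact wiring here are assumptions.

    import torch
    import torch.nn as nn

    class TemporalBlock(nn.Module):
        def __init__(self, channels, heads=4):
            super().__init__()
            self.conv1d = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

        def forward(self, x):                     # x: (B, F, C, H, W)
            b, f, c, h, w = x.shape
            t = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
            t = self.conv1d(t)                    # mix information across frames
            t = t.permute(0, 2, 1)                # (B*H*W, F, C)
            t, _ = self.attn(t, t, t)             # temporal self-attention
            t = t.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
            return x + t                          # residual: spatial layers stay untouched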

5.3.1. DANCING

  • CLOTH
  • PixelDance: Make Pixels Dance: High-Dynamic Video Generation
    • synthesizing videos with complex scenes and intricate motions
    • incorporates image instructions (not just text instructions)
  • MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
    • video diffusion model to encode temporal information
  • 6.2.3
  • Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion
    • zero shot on existing t2i, no training or fine-tuning
    • pixel-wise guidance to steer the diffusion to minimize visual discrepancies
  • DreaMoving: A Human Dance Video Generation Framework based on Diffusion Models
    • Video ControlNet for motion-controlling and a Content Guider for identity preserving
  • Motionshop: an application that replaces the human motion in a video with a virtual 3D human
    • segment, retarget, and inpaint (with lighting awareness)
  • Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
    • aims to directly render (turn) photorealistic videos into anime styles while keeping consistency
  • AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising
    • best trade-off between motion analogy and identity preservation
  • MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer =best=
    • real people references
5.3.1.1. TALKING FACES
  • DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
    • inputs: songs, speech in multiple languages, noisy audio, and out-of-domain portraits
  • EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
    • emotion input, different emotional intensities by adjusting the fine-grained emotion
  • HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting
    • generating animatable avatars from textual prompts, visually appealing
  • PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
    • disentangled controls while preserving the identity, realistic
    • trained using synthetic data at first

5.4. VIDEO INPUT

  • MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
    • edit one frame, then propagate to all
  • Hierarchical Masked 3D Diffusion Model for Video Outpainting
  • FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
    • Zero shot and EBsynth come together for a new vid2vid

5.5. BY PROMPT

  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
    • DDIM latents enriched with motion dynamics, plus cross-frame attention to preserve identity (see sketch at the end of this section)
    • Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models (vid2vid zero)
  • Edit-A-Video: Single Video Editing with Object-Aware Consistency
  • Video-P2P: cross-attention control (more coherence than InstructPix2Pix) (Adobe)
    • VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing (temporal smoothness)
  • StableVideo: Text-driven Consistency-aware Diffusion Video Editing (14 gb vram)
    • temporal dependency = consistent appearance for the edited objects =best one=
  • FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling =best=
    • reschedules a sequence of noises, performing a window-based function = longer videos conditioned on multiple texts
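
  A simplified single-head sketch of the cross-frame attention idea used by the zero-shot methods above (Text2Video-Zero-style): every frame attends to the keys/values of an anchor frame so appearance stays consistent. Not the exact implementation.

    import torch

    def cross_frame_attention(q, k, v, anchor=0):
        # q, k, v: (frames, tokens, dim)
        f = q.shape[0]
        k0 = k[anchor].expand(f, -1, -1)          # reuse the anchor frame's keys
        v0 = v[anchor].expand(f, -1, -1)          # ...and values for every frame
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k0.transpose(1, 2) * scale, dim=-1)
        return attn @ v0                          # identity is tied to the anchor frame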

5.5.1. MODELS

  • Stable Video Diffusion
    • LoRAs for camera control, multiview generation
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation =best=
    • more coherent movements

5.5.2. LATENT OF BOTH IMAGES AND VIDEO

  • Phenaki
    • C-ViViT is the video encoder (ViViT repo)
    • single images are treated like videos
  • Photorealistic Video Generation with Diffusion Models
    • compress images and videos within a unified latent space
  • 4.2.1

5.5.3. WITH ARCHITECTURE STRUCTURE

  • Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
    • first pixel-based t2v generation then latent-based upscaling
5.5.3.1. CASCADED
  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
    • cascaded video latent diffusion models, temporal interpolation model
    • incorporates simple temporal self-attention with rotary positional encoding, capturing correlations inherent in video =best one=
  • I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models =best=
    • utilizing static images as a form of crucial guidance
      • guarantee coherent semantics by using two hierarchical encoders

6. EXTRA PRIORS

6.1. STYLECRAFTER

  • StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter =best=
    • high-quality stylized videos that align with the content of the texts
    • train a style control adapter from image dataset then transfer to video

6.2. MOTION

  • MCDiff: Motion-Conditioned Diffusion Model for Controllable Video Synthesis
  • VideoComposer: Compositional Video Synthesis with Motion Controllability (temporal consistency)
    • motion vectors as the control signal
  • MotionDirector: Motion Customization of Text-to-Video Diffusion Models =best=
    • dual-path LoRA architecture to decouple the learning of appearance and motion (see sketch after this list)
  • 4.2.2
    • LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation (8~16 videos = 1 Motion)
      • expand pretrained 2D T2I convolution layers to temporal-spatial motion learning layers
      • shared-noise sampling = improve the stability of videos
  • DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
    • desired subject and a few videos of target motion (subject, motion learning on top of video model)
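
  A minimal LoRA layer illustrating the dual-path idea noted for MotionDirector: spatial (appearance) and temporal (motion) layers each get their own low-rank adapters, so the two factors can be trained and recombined separately. A sketch, not the paper's code.

    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=4, scale=1.0):
            super().__init__()
            self.base = base.requires_grad_(False)      # frozen pretrained weights
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)              # start as a no-op update
            self.scale = scale

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))

    # dual path: wrap spatial attention projections with an "appearance" LoRA
    # and temporal attention projections with a separate "motion" LoRA.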

6.2.1. SVD

  • AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance
    • fine-tunes Stable Video Diffusion

6.2.2. CONTROLLER

  • MagicStick: Controllable Video Editing via Control Handle Transformations
    • keyframe transformations can easily propagate to other frames to provide generation guidance
    • inflate image model and ControlNet to temporal dimension, train lora to fit the specific scenes
  • Customizing Motion in Text-to-Video Diffusion Models
    • map depicted motion to a new unique token, and can invoke the motion in combination with other motions
  • Peekaboo: Interactive Video Generation via Masked-Diffusion
    • based on masked attention; controls size and position (see sketch after this list)
  • DIFFDIRECTOR ANIMATELCM
  • Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
    • a guidance loss that encourages the sample to have the desired motion
  • TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
    • pre-trained model without further model training (bounding boxes to guide)
  • Boximator: Generating Rich and Controllable Motions for Video Synthesis
    • hard box and soft box
    • plug-in for existing video diffusion models, training only a module
  • Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts
    • locally aware and not moving the entire scene
  • CameraCtrl: Enabling Camera Control for Text-to-Video Generation
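
  A toy spatial attention mask in the spirit of the masked-attention / bounding-box controllers above (Peekaboo, Boximator, TrailBlazer): only latent positions inside the user's box may attend to the object's text tokens. Names and the exact masking scheme are illustrative.

    import torch

    def bbox_attention_mask(h, w, box, n_text_tokens, obj_token_ids):
        # box = (x0, y0, x1, y1) in latent coordinates; True = may attend
        mask = torch.ones(h * w, n_text_tokens, dtype=torch.bool)
        x0, y0, x1, y1 = box
        inside = torch.zeros(h, w, dtype=torch.bool)
        inside[y0:y1, x0:x1] = True
        # restrict the object's tokens to the boxed region
        mask[:, obj_token_ids] = inside.flatten().unsqueeze(1)
        return mask  # apply as attn_logits.masked_fill(~mask, float("-inf"))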
6.2.2.1. MOTION FROM VIDEO
  • Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
    • aligns motion vectors using Fourier and wavelet transforms
    • maintaining computational efficiency and compatibility with other customizations
  • Motion Inversion for Video Customization
    • Motion Embeddings: temporally coherent embeddings derived from a given video
    • less than 10 minutes of training time
6.2.2.2. DRAGANYTHING
  • DragAnything: Motion Control for Anything using Entity Representation
    • trajectory-based control is more user-friendly; motion control for diverse entities
6.2.2.3. DEFINE CAMERA MOVEMENT
  • LivePhoto: Real Image Animation with Text-guided Motion Control
    • motion-related textual instructions: actions, camera movements, new content
    • motion intensity estimation module (control signal)
  • MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
    • independently control camera and object motion, determined by camera poses and trajectories
    • using drawn lines
    • MotionCtrl for SVD; ComfyUI
  • Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion
    • define camera movement and then object motion using bounding box

6.2.3. REFERENCENET

  • Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
    • ReferenceNet (ControlNet-style) to merge detail features via spatial attention; temporal modeling for inter-frame transitions between video frames
    • Moore-AnimateAnyone (built on SD 1.5)

6.3. LONG VIDEO

  • NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
    • coarse-to-fine process that iteratively completes the middle frames (see sketch after this list)
  • sparseformer
  • FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
    • keyframe synthesis to lay out the storyline of a video, then interpolation
  • 1

7. GENERATED VIDEO ENHANCEMENT

7.1. TRICKS

7.2. USING MODEL

  • MS-Vid2Vid
    • enhance the resolution and spatiotemporal continuity of text-generated videos and image-generated videos

8. OTHERS EDITING VIDEO

  • VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
  • MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions
    • zero-shot subject customization, ControlNet-only conditioning, video transformation
  • ActAnywhere: Subject-Aware Video Background Generation
    • input: segmented subject and contextual image input
  • STABLEIDENTITY inserting identity

8.1. VIDEO INPAINT

  • Anything in Any Scene: Photorealistic Video Object Insertion (realism, lighting realism, and photorealism)
  • InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions
    • uses human painting and drag-and-drop as priors for inpainting generation; dynamic interaction
  • Place Anything into Any Video
    • using just a photograph of the object, looks like enhanced VR
  • Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
    • add or remove objects, semantically change objects, insert stock photos into videos

8.1.1. OUTPAINTER

  • Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation =best=
    • input-specific adaptation and pattern-aware outpainting

8.2. VIDEO EXCHANGE

  • VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
    • exploits semantic point correspondences
    • only a small number of semantic points are necessary to align the subject’s motion trajectory and modify its shape

8.3. CONTROLNET VIDEO

  • Stable Video Diffusion Temporal Controlnet

8.4. FRAME INTERPOLATION

  • MA-VFI: Motion-Aware Video Frame Interpolation
  • BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
    • illumination histograms that precisely capture flickering and local exposure variation (see sketch below)
    • restores faithful and consistent textures affected by lighting changes; 10x faster
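
  A small sketch of the kind of signal mentioned for BlazeBVD: per-frame luminance histograms and a naive frame-to-frame distance as a flicker proxy. This illustrates the cue, not the paper's algorithm.

    import numpy as np

    def illumination_histograms(frames, bins=64):
        # frames: iterable of (H, W, 3) uint8 RGB images
        hists = []
        for f in frames:
            luma = 0.299 * f[..., 0] + 0.587 * f[..., 1] + 0.114 * f[..., 2]
            h, _ = np.histogram(luma, bins=bins, range=(0, 255), density=True)
            hists.append(h)
        return np.stack(hists)

    def flicker_score(hists):
        # L1 distance between consecutive histograms; larger = more global flicker
        return np.abs(np.diff(hists, axis=0)).sum(axis=1)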

Author: Tekakutli

Created: 2024-04-07 Sun 13:56