stable diffusion

1. SD MODELS

  • CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
    • CC-licensed images with BLIP-2 captions, similar performance to Stable Diffusion 2 (apache license)
  • Terminus XL Gamma: simpler SDXL, for inpainting tasks, super-resolution, style transfer
  • 5.2
  • AnimateLCM-SVD-xt: image to video
  • stable-cascade: würstchen architecture = even smaller latent space
    • Stable-Cascade-FP16
    • SD's x8 spatial compression (1024x1024 > 128x128) vs Cascade's x42 compression (1024x1024 > 24x24); see the arithmetic sketch after this list
    • faster inference, cheaper training
  • STABLE DIFFUSION 3
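
A quick back-of-the-envelope check of the latent sizes implied by the compression factors noted above (arithmetic only, numbers taken from these notes):

    # Latent grid sides implied by the spatial compression factors above.
    image_side = 1024
    sd_latent_side = image_side // 8        # 128 -> 128x128 latent (SD / SDXL VAE)
    cascade_latent_side = image_side // 42  # ~24 -> 24x24 latent (Würstchen / Stable Cascade)

    ratio = (sd_latent_side ** 2) / (cascade_latent_side ** 2)
    print(ratio)  # ~28x fewer latent positions to denoise, hence cheaper training and faster inference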

1.1. DISTILLATION

1.1.1. ONE STEP DIFFUSION

  • One-step Diffusion with Distribution Matching Distillation
    • comparable with v1.5 while being 30x faster
    • critic is similar to a GAN's in that it is jointly trained with the generator
      • differs in that it does not play an adversarial game, and can fully leverage a pretrained model

1.1.2. SDXS

  • SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
    • knowledge distillation to streamline the U-Net and image decoder architectures
    • one-step DM training technique that utilizes feature matching and score distillation
    • speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a GPU
    • image-conditioned control, facilitating efficient image-to-image translation.

1.2. IRIS LUX

https://civitai.com/models/201287 - a model created via a novel consensus merge (statistical filtering): https://gist.github.com/Extraltodeus/0700821a3df907914994eb48036fc23e

1.3. EMOJIS

  • Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
    • emojis, stickers

1.4. MERGING MODELS

1.4.1. SEGMOE

  • SegMoE: The Stable Diffusion Mixture of Experts for Image Generation (Mixture of Diffusion Experts)
    • training-free creation of larger models on the fly, with broader knowledge

2. GENERATION CONTROL

2.1. MATERIAL EXTRACTION

  • U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
    • generates images with the material or color extracted from the input image
    • sentence describing the desired attribute
    • learn user-specified visual attributes
  • ZeST: Zero-Shot Material Transfer from a Single Image
    • leverages adapters to extract implicit material representation from exemplar image

2.2. LIGHT CONTROL

  • DiffusionLight: Light Probes for Free by Painting a Chrome Ball
    • render a chrome ball into the input image
    • produces convincing light estimates

2.3. BACKGROUND

  • BriaAI: Open-Source Background Removal (RMBG v1.4)
  • LayerDiffusion: Transparent Image Layer Diffusion using Latent Transparency
    • layers with alpha; generates PNGs and removes backgrounds (more precisely, generates images with a removable background)
    • method learns a “latent transparency”
    • models

2.4. EMOTIONS

  • Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
    • face swapping and reenactment, interpolate between emotions
  • EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
    • clip, abstract emotions
  • Make Me Happier: Evoking Emotions Through Image Diffusion Models
    • understanding and editing the emotional cues of source images

2.5. NOISE CONTROL

  • offset noise (enables darkness-capable LoRAs; sketch after this list), pyramid noise
    • Common Diffusion Noise Schedules and Sample Steps are Flawed (and several proposed fixes)
    • native offset noise
  • noisy perlin latent
    • the same noise pattern can be reinjected after an upscale for more coherent results and better upscaling
  • Blue noise for diffusion models
    • allows introducing correlation across images within a single mini-batch to improve gradient flow
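
A minimal sketch of the offset-noise idea referenced above; the offset strength is an assumption, real training scripts vary it:

    import torch

    def offset_noise(latents, offset_strength=0.1):
        # Standard training noise plus a per-(sample, channel) constant offset.
        # The offset lets the model learn global brightness shifts, which is what
        # makes "darkness-capable" (and very bright) generations possible.
        noise = torch.randn_like(latents)
        offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1,
                             device=latents.device, dtype=latents.dtype)
        return noise + offset_strength * offset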

2.6. GUIDING FUNCTION

  • Universal Guided Diffusion (face and style transfer)
    • FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
      • extra: repo has list of deblurring, super-resolution and restoration methods
      • masks as the energy function (see the guidance sketch after this list)
  • Diffusion Self-Guidance for Controllable Image Generation
    • steer sampling, similarly to classifier guidance, but using signals in the pretrained model itself
    • instructional transformations
  • MCM: Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (module after the denoiser)
    • mask-like control to tilt the noise; maybe useful for text
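
A toy sketch of the training-free guidance idea shared by these methods: differentiate an off-the-shelf energy (a mask loss, a face-embedding distance, a CLIP score) with respect to the latent and nudge the sample downhill. `denoiser` and `energy_fn` are hypothetical placeholders; real implementations rescale the gradient per timestep:

    import torch

    def energy_guided_update(denoiser, x_t, t, energy_fn, guidance_scale=1.0):
        # One guidance step: predict the clean sample, score it with the energy,
        # and push the noisy latent against the gradient of that energy.
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, t)                  # predicted clean latent/image
        energy = energy_fn(x0_hat).sum()           # e.g. masked L2, face-ID distance
        grad = torch.autograd.grad(energy, x_t)[0]
        return x_t.detach() - guidance_scale * grad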

2.6.1. ADAPTIVE GUIDANCE

  • Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models
    • AG, an efficient variant of CFG (Classifier-Free Guidance); reduces computation by 25%
    • omits network evaluations once the denoising process shows convergence (see the sketch below)
    • the second half of the denoising process is largely redundant; a plug-and-play alternative to Guidance Distillation
    • LinearAG: entire network evaluations can be replaced by affine transformations of past estimates
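
A hedged sketch of the idea: standard CFG runs the network twice per step, and adaptive guidance stops paying for the unconditional pass once the two branches have (approximately) converged. The convergence test below is a simplification, not the paper's exact criterion; `model` and `step_fn` are hypothetical callables:

    def adaptive_cfg_sample(model, x, timesteps, cond, uncond, scale, step_fn, tol=1e-3):
        # model(x, t, c) predicts noise; step_fn applies the scheduler update.
        prev_delta, skip_uncond = None, False
        for t in timesteps:
            eps_c = model(x, t, cond)
            if skip_uncond:
                eps = eps_c                                   # CFG dropped: 1 pass/step
            else:
                eps_u = model(x, t, uncond)
                delta = eps_c - eps_u
                eps = eps_u + scale * delta                   # classic CFG: 2 passes/step
                if prev_delta is not None and (delta - prev_delta).abs().mean() < tol:
                    skip_uncond = True                        # branches converged
                prev_delta = delta
            x = step_fn(x, eps, t)
        return x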

2.7. CONTROL NETWORKS, CONTROLNET

2.7.1. SKETCH

  • diffmorph: text-less image morphing with diffusion models
    • sketch-to-image module
  • Block and Detail: Scaffolding Sketch-to-Image Generation
    • sketch-to-image, can generate coherent elements from partial sketches, generate beyond the sketch following the prompt
  • CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
    • two line types: one for contours, the other flow lines representing texture

2.7.2. ALTERNATIVES

  • controlNet (total control of image generation, from doodles to masks; usage sketch after this list)
  • SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
    • lightweight tuning module named SC-Tuner; controllable synthesis by injecting different conditions
    • reduces training parameters and memory requirements
    • Integrated Into SCEPTER and SWIFT
  • Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
    • imposing global semantics onto targeted regions without the use of any additional localization cues
    • alternative to controlnet and t2i-adapter
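
For reference, a minimal diffusers-style usage sketch for a canny ControlNet; model IDs and exact argument names may differ across diffusers versions:

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Edge map used as the control signal.
    img = np.array(Image.open("input.png").convert("RGB"))
    edges = cv2.Canny(img, 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    out = pipe("a castle at sunset", image=control, num_inference_steps=30).images[0]
    out.save("controlled.png")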

2.7.3. TIP: text restoration

  • TIP: Text-Driven Image Processing with Semantic and Restoration Instructions =best=
    • controlnet architecture, leverages natural language as interface to control image restoration
    • instruction-driven; can imprint text into the image

2.7.4. HANDS

  • HANDS DATASET
  • HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models
    • two-hand interactions, motion in-betweening and trajectory control
2.7.4.1. RESTORING HANDS
  • Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images
    • body pose estimation to understand hand orientation for accurate anomaly correction
    • integration of ControlNet and InstructPix2Pix
  • HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
    • effectively rectifies incorrect numbers of fingers and irregular shapes
    • utilizes ControlNet modules to re-inject the corrected information (SD 1.5)

2.7.5. USING ATTENTION MAP

2.7.5.1. MASA
  • MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
    • same subject in different views or poses
      • by querying the attention map from another image
    • better than ddim inversion, consistent SD animations; mixable with T2I-Adapter
  1. TI-GUIDED-EDIT
    • Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
      • rigid=conserve the structure
2.7.5.2. LLLYASVIEL
  • the reference-only preprocessor doesn't require any control models; generates variations
    • can guide the diffusion directly using images as references
  • Guess Mode / Non-Prompt Mode, now named Control Modes: balances prompt vs. ControlNet influence; ComfyUI node

2.7.6. SEVERAL CONTROLS IN ONE

  • UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
    • several controlnets in one, contextual understanding
    • image deblurring, image colorization
    • using UniControl with Stable Diffusion XL 1.0 Refiner; sketch to image tool
  • In-Context Learning Unlocked for Diffusion Models
    • learns to translate an image to HED, depth, segmentation, outline

2.8. HUMAN PAINT

  • SDEdit: guided image synthesis and editing with stochastic differential equations
    • stroke-based inpainting and editing
    • FOOLSDEDIT: Deceptively Steering Your Edits Towards Targeted Attribute-aware Distribution
      • forces SDEdit to generate a data distribution aligned with a specified attribute (e.g. female)
  • Control Color: Multimodal Diffusion-Based Interactive Image Colorization
    • paint over grayscale to recolor it

2.9. LAYOUT DIFFUSION

  • 3d: ROOM LAYOUT
  • 3.5.1 STORYTELLER DIFFUSION
  • ZestGuide: Zero-shot spatial layout conditioning for text-to-image diffusion models
    • implicit segmentation maps can be extracted from cross-attention layers
    • spatial conditioning for SD without finetuning
  • Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints
    • constraints representing design intentions
    • continuous state-space design can incorporate differentiable aesthetic constraint functions in training
      • by introducing conditions via masked input
  • RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models
    • dynamically balances the strengths of the two models during the denoising process
  • Getting it Right: Improving Spatial Consistency in Text-to-Image Models
    • better representing spatial relationships
    • faithfully follow the spatial relationships specified in the text prompt

2.9.1. SCENES

  • Generate Anything Anywhere in Any Scene
    • training guidance to focus on object identity; personalized concepts with localization controllability
  • 2.10.3.2 ALDM

2.9.2. WITH BOXES

  • GLIGEN: Open-Set Grounded Text-to-Image Generation (boxes)
    • Training-Free Layout Control with Cross-Attention Guidance
    • SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
  • BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
  • InstanceDiffusion: Instance-level Control for Image Generation
    • conditional generation, hierarchical bounding-box structure, feature (prompt) at a point
    • single points, scribbles, bounding boxes or segmentation masks
  • Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
    • bounding boxes with attribute(prompt) binding

2.9.3. ALDM

  • ALDM: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
    • layout faithfulness

2.9.4. OPEN-VOCABULARY

  • Spatial-Aware Latent Initialization for Controllable Image Generation
    • inverted reference image contains spatial awareness regarding positions, resulting in similar layouts
    • open-vocabulary framework to customize a spatial-aware initialization

2.9.5. CARTOON

  • Desigen: A Pipeline for Controllable Design Template Generation
    • generating images with proper layout space for text; generating the template itself
2.9.5.1. COGCARTOON
  • CogCartoon: Towards Practical Story Visualization
    • plugin-guided and layout-guided inference; specific character = 316 KB plugin

2.10. IMAGE PROMPT - ONE IMAGE

2.10.1. UNET LESS

  • ProFusion: Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach
    • and can interpolate between two
    • PromptNet (embedding), encoder-based, for style transfer
    • one image, no regularization needed
  • Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
    • using CLIP features extracted from the subject

2.10.2. IMAGE-SUGGESTION

  • 5.4.1.1
  • UMM-Diffusion, TIUE: Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
    • takes joint texts and images
    • only the image-mapping to a pseudo word embedding is learned
2.10.2.1. ZERO SHOT
  • Context Diffusion: In-Context Aware Image Generation
    • separates the encoding of the visual context; prompt not needed
  • ReVision - Unclip https://comfyanonymous.github.io/ComfyUI_examples/sdxl/
    • Revision gives the model the pooled output from CLIPVision G instead of the CLIP G text encoder
  • SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
    • architecture designed for selectively capturing any subject from single or multiple reference images
  1. IP-ADAPTER
    1. LCM-LOOKAHEAD
      • LCM-Lookahead for Encoder-based Text-to-Image Personalization
        • LCM-based approach for propagating image-space losses to personalization model training and classifier guidance
  2. SEECODERS
    • Seecoders: Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
      • Semantic Context Encoder, replaces clip with seecoder; works with =stock SD=
      • input image and controlnet
      • unlike unCLIP, SeeCoder uses an extra model
      • one image into several perspectives (MULTIVIEW DIFFUSION)
      • the embeddings can be textures, effects, objects, semantics (contexts), etc.

2.10.2.2. PERSONALIZATION
  • InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
    • personalized images with only a single forward pass
  • HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models; just one image

2.10.3. IDENTITY

2.10.3.1. STORYTELLER DIFFUSION
  • ConsiStory: Training-Free Consistent Text-to-Image Generation
    • training-free approach for consistent subject (object) generation, 20x faster, multi-subject scenarios
    • works by sharing the internal activations of the pretrained model
2.10.3.2. ANYDOOR
  • AnyDoor: Zero-shot Object-level Image Customization
    • teleport target objects to new scenes at user-specified locations
    • identity feature with detail feature
    • moving objects, swapping them, multi-subject composition, try-on a cloth
2.10.3.3. SUBJECT
  • Inserting Anybody in Diffusion Models via Celeb Basis
    • one facial photograph, 1024 learnable parameters, 3 minutes; several at once
  • Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
    • multi subject, single reference image
  • PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models
    • incorporates facial identity loss, single facial photo, single training phase
  • The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
    • sole input being text
    • generate gallery of images, use pre-trained feature extractor to choose the most cohesive cluster
  • FaceStudio: Put Your Face Everywhere in Seconds =best=
    • direct feed-forward mechanism, circumventing the need for intensive fine-tuning
    • stylized images, facial images, and textual prompts to guide the image generation process
  • SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
    • face-wise attention loss to fit the face region
  1. IDENTITY IN VIDEO
    • Magic-Me: Identity-Specific Video Customized Diffusion
    1. STABLEIDENTITY
      • StableIdentity: Inserting Anybody into Anywhere at First Sight
        • identity recontextualization with just one face image without finetuning
        • also applicable to video/3D generation
  2. IDENTITY ZERO-SHOT
    • InstantID: Zero-shot Identity-Preserving Generation in Seconds (using face encoder)
      • PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
        • Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm =best=
          • identity provided by the reference image while mitigating interference from textual input
    • CapHuman: Capture Your Moments in Parallel Universes
      • encode then learn to align, identity preservation for new individuals without tuning
    • SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation =best=
      • Token-to-Patch Aligner = preserving fine features of the subjects; multiple subjects
      • combinable with controlnet, and across styles
    • RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
      • gradually narrowing to the specific subject, iteratively update the influence scope
  3. PHOTOMAKER
2.10.3.4. ANIME
  • DreamArtist: a single image and target text (mainly works with anime)
    • DreamTuner: Single Image is Enough for Subject-Driven Generation
      • subject-encoder for coarse subject identity preservation, training-free
  • PFG (prompt-free generation): learns to interpret (anime) input images

2.10.4. VARIATIONS

3. BETTER DIFFUSION

  • editing the implicit assumptions (defaults) of a prompt: https://github.com/bahjat-kawar/time-diffusion
  • Self-Attention Guidance (SAG): SAG leverages intermediate attention maps of diffusion models at each iteration to capture essential information for the generative process and guide it accordingly
    • pretty much just reimplemented the attention function without changing much else
  • FreeU: Free Lunch in Diffusion U-Net (unet) =best=
    • improves diffusion model sample quality at no cost (enable sketch after this list)
    • more color variance
  • Diffusion Sampling with Momentum for Mitigating Divergence Artifacts
    • incorporation of: Heavy Ball (HB) momentum = expanded stability regions; Generalized HB (GHVB) = suppression
    • better low step sampling
  • DG: Detector Guidance for Multi-Object Text-to-Image Generation
    • mid-diffusion, performs latent object detection and then enhances the subsequent CAMs (cross-attention maps)
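
FreeU ships as a one-liner in recent diffusers versions; the scaling factors below are the commonly recommended SD 1.5 values and should be treated as a starting point, not canonical:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Rescales U-Net backbone features (b1, b2) and skip-connection features (s1, s2).
    pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)
    image = pipe("a portrait photo, soft light").images[0]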

3.1. SCHEDULER

3.2. QUALITY

  • 3.6.2
  • Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (dataset method)
    • guide pre-trained model to exclusively generate good images
  • HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
    • Latent Structural Diffusion Model that simultaneously denoises depth and surface normal with RGB image
  • Consistency Distilled Diff VAE
    • Improved decoding for stable diffusion vaes

3.3. HUMAN FEEDBACK

  • RLCM
  • Aligning Text-to-Image Models using Human Feedback https://arxiv.org/abs/2302.12192
    • Better Aligning Text-to-Image Models with Human Preference
    • RRHF: Rank Responses to Align Language Models with Human Feedback without tears
    • ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
  • FABRIC: Personalizing Diffusion Models with Iterative Feedback
    • training-free approach, exploits the self-attention layer
    • improve the results of any Stable Diffusion model
  • Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
    • Direct Preference for Denoising Diffusion Policy Optimization (D3PO)
    • omits training a reward model
  • Diffusion-DPO: Diffusion Model Alignment Using Direct Preference Optimization (training script)
    • improving visual appeal and prompt alignment, using direct preference optimization
    • SDXL: Direct Preference Optimization (better images) (and SD 1.5)
  • ALDM layout
  • RL Diffusion: Large-scale Reinforcement Learning for Diffusion Models (improves pretrained)
  • PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models =best=
    • better training stability for unseen prompts
    • reward difference of generated image pairs from their denoising trajectories
  • MESH HUMAN FEEDBACK

3.3.1. ACTUALLY SELF-FEEDBACK

  • SPIN-Diffusion: Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation =best=
    • diffusion model engages in competition with its earlier versions, iterative self-improvement
  • AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
    • uses Vision-Language Models (VLMs) to assess quality across style, coherence, and aesthetics, generating feedback

3.4. SD GENERATION OPTIMIZATION

  • ONE STEP DIFFUSION 4 STABLE CASCADE
  • turning off CFG when the denoising sigma drops below 1.1 (see the sketch after this list)
  • Tomesd: Token Merging for Stable Diffusion code
    • ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
      • token downsampling of key and value tokens to accelerate inference 2x-4x
  • Nested Diffusion Processes for Anytime Image Generation
    • can generate a viable image when stopped arbitrarily before completion
  • BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
    • uses SD as the teacher model and trains a faster one by bootstrapping from it; 30 FPS
  • Divide & Bind Your Attention for Improved Generative Semantic Nursing
  • Conditional Diffusion Distillation
    • added parameters, supplementing image conditions to the diffusion priors
    • super-resolution, image editing, and depth-to-image generation
  • 4 2.6.1
  • OneDiff: acceleration library for diffusion models, ComfyUI Nodes
  • T-Stitch: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching
    • improve sampling efficiency with no generation degradation
    • smaller DPM in the initial steps, larger DPM at a later stage, 40% of the early timesteps
  • The Missing U for Efficient Diffusion Models
    • operates with approximately a quarter of the parameters, making diffusion models 80% faster
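
A minimal sketch of the CFG-cutoff trick from the list above, for samplers that expose the noise level sigma; the 1.1 threshold is the value from the note, not a universal constant, and `model` is a hypothetical eps-predicting callable:

    def guided_eps(model, x, sigma, cond, uncond, scale, sigma_cutoff=1.1):
        # Late in denoising (low sigma) the conditional prediction alone suffices,
        # so the unconditional pass can be skipped, halving the remaining compute.
        eps_c = model(x, sigma, cond)
        if sigma < sigma_cutoff:
            return eps_c
        eps_u = model(x, sigma, uncond)
        return eps_u + scale * (eps_c - eps_u)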

3.4.1. ULTRA SPEED

  • SDXL Turbo: A real-time text-to-image generation model (distillation)
  • ArtSpew: SD at 149 images per second (high volume random image generation)
  • StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation (10ms)
    • transforms sequential denoising into batched denoising
  • MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
    • diffusion-GAN finetuning techniques to achieve 8-step and 1-step inference
  • Accelerating Diffusion Sampling with Optimized Time Steps
    • better image quality compared to using uniform time steps

3.4.2. CACHE

  • DeepCache: Accelerating Diffusion Models for Free =best=
    • exploits the temporal redundancy observed in sequential denoising steps (see the caching sketch after this list)
    • superior to existing pruning and distillation methods
  • Cache Me if You Can: Accelerating Diffusion Models through Block Caching
    • reuse outputs from layer blocks of previous steps, automatically determine caching schedules
  • Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models =best=
    • reuse cyclically the encoder features in the previous time-steps for the decoder
  • Fast Inference Through The Reuse Of Attention Maps In Diffusion Models
    • structured reuse of attention maps during sampling
  • T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
    • two stages: semantics-planning phase, and subsequent fidelity-improving phase
    • so the cross-attention output is cached once it converges and kept fixed during the remaining inference
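
A toy sketch of the shared caching idea (in the spirit of DeepCache / block caching); `shallow_blocks`, `deep_blocks`, `combine`, and `step_fn` are hypothetical stand-ins for a split U-Net, its skip connections, and the scheduler:

    def cached_denoise(x, timesteps, shallow_blocks, deep_blocks, combine, step_fn, refresh_every=3):
        # Deep features change slowly across adjacent denoising steps, so recompute
        # them only every `refresh_every` steps and reuse the cached result otherwise.
        deep_cache = None
        for i, t in enumerate(timesteps):
            shallow = shallow_blocks(x, t)                   # cheap, every step
            if deep_cache is None or i % refresh_every == 0:
                deep_cache = deep_blocks(shallow, t)         # expensive, cached
            eps = combine(shallow, deep_cache, t)
            x = step_fn(x, eps, t)
        return x
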
3.4.2.1. EXPLOITING FEATURES
  • FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models
    • Reusing feature maps with high temporal similarity
  • Clockwork Diffusion: Efficient Generation With Model-Step Distillation
    • high-res features are sensitive to small perturbations; low-res features only set the semantic layout
    • so it reuses computation from preceding steps for the low-res features

3.4.3. LCM

3.4.3.1. CCM
  • CCM: Adding Conditional Controls to Text-to-Image Consistency Models
    • ControlNet-like, lightweight adapter can be jointly optimized while consistency training
3.4.3.2. PERFLOW
  • PeRFlow (Piecewise Rectified Flow)
    • fast generation, 4 steps, 4,000 training iterations
    • multiview normal maps and textures from text prompts instantly

3.5. PROMPT CORRECTNESS

  • ReCo: region control, counting donuts
  • sd-webui-cutoff, hide tokens for each separated group, limits the token influence scope (color control)
  • hard-prompts-made-easy
  • Get What You Want, Not What You Don’t: Image Content Suppression for Text-to-Image Diffusion Models
    • suppress unwanted content generation of the prompt, and encourages the generation of desired content
    • better than negative prompts
  • Discriminative Probing and Tuning for Text-to-Image Generation
    • discriminative adapter to improve their text-image alignment
    • global matching and local grounding
  • CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
    • fine-tuning strategy with an image-to-text(captioning model) concept matching mechanism
  • ELLA - A Powerful Adapter for Complex Stable Diffusion Prompts https://youtu.be/_Pr7aFkkAvY?si=Xr5e_RL-rwcdL10q
    • uses an adapter for an LLM instead of CLIP

3.5.1. ATTENTION LAYOUT

3.5.2. LANGUAGE ENHANCEMENT

  • 5.4.1.2
  • Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
    • using prompt sentence structure during inference to improve the faithfulness
  • Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
    • exploits the semantic hierarchies of language sentences (lojban)
  • Structured Diffusion Guidance: language-enhanced CLIP enforced on the U-Net
  • Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
    • prompt learning to improve the match between the input text and the generated image
3.5.2.1. PROMPT EXPANSION, PROMPT AUGMENTATION
  • DanTagGen: LLaMA arch
  • superprompter: Supercharge your AI/LLM prompts
  • Capability-aware Prompt Reformulation Learning for Text-to-Image Generation
    • effectively learn diverse reformulation strategies across various user capacities to simulate high-capability user reformulation
3.5.2.2. TOKENCOMPOSE
  • TokenCompose: Grounding Diffusion with Token-level Supervision =best=
    • finetuned with token-wise grounding objectives for multi-category instance composition
    • exploiting binary segmentation maps from SAM
    • compositions that are unlikely to appear simultaneously in a natural scene

3.6. BIGGER COHERENCE

3.6.1. PANORAMAS

  • DiffCollage: Parallel Generation of Large Content with Diffusion Models (panoramas)
  • Collaborative Score Distillation for Consistent Visual Synthesis
    • consistent visual synthesis across multiple samples =best one=
    • distill generative priors over a set of images synchronously
    • zoom, video, panoramas
  • SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
    • plug-and-play module that synchronizes multiple diffusions through gradient descent on a perceptual similarity loss (see the toy window-averaging sketch after this list)
  • Taming Stable Diffusion for Text to 360° Panorama Image Generation
    • minimize distortion during the collaborative denoising process
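
A toy sketch of the simplest version of this idea (MultiDiffusion-style overlapping-window averaging, as hinted in the SyncDiffusion entry above); the papers here replace the plain average with perceptual or synchronized objectives, and `denoise_tile` is a hypothetical per-window denoiser:

    import torch

    def panorama_step(latent, t, denoise_tile, window=64, stride=48):
        # Denoise overlapping horizontal windows and average them where they overlap,
        # so neighbouring tiles stay consistent across the wide latent.
        width = latent.shape[-1]
        window = min(window, width)
        starts = list(range(0, width - window + 1, stride))
        if starts[-1] != width - window:
            starts.append(width - window)
        out = torch.zeros_like(latent)
        count = torch.zeros_like(latent)
        for x0 in starts:
            out[..., x0:x0 + window] += denoise_tile(latent[..., x0:x0 + window], t)
            count[..., x0:x0 + window] += 1
        return out / count
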
3.6.1.1. OUTPAINTING
  • 5.3.2.2.2
  • Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach
    • generate content beyond boundaries using relative positional information
  • BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
    • pre-trained SD model, useful in product exhibitions, virtual try-on, or background replacement

3.6.2. RESOLUTION

  • Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
    • training on images of unlimited sizes is infeasible
    • Fast Seamless Tiled Diffusion (FSTD)
  • ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models (video too)
    • generating images at much higher resolutions than the training image sizes
    • does not require any training or optimization
  • Matryoshka Diffusion Models
    • diffusion process that denoises inputs at multiple resolutions jointly
  • FIT TRANSFORMER
  • Upsample Guidance: Scale Up Diffusion Models without Training
    • technique that adapts pretrained model to generate higher-resolution images by adding a single term in the sampling process, without any additional training or relying on external models
    • can be applied to various models, such as pixel-space, latent space, and video diffusion models
3.6.2.1. ARBITRARY
  • ElasticDiffusion: Training-free Arbitrary Size Image Generation
    • decoding method better than MultiDiffusion
  • ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models
    • unlike post-processing approaches, directly generates images at dynamic resolutions
    • compatible with ControlNet, IP-Adapter and LCM-LoRA; can be integrated with ElasticDiffusion

4. SAMPLERS

  • GENIE: Higher-Order Denoising Diffusion Solvers
    • faster diffusion equation?
    • DDIM vs GENIE
    • 4x less expensive upsampling
  • fastest solver https://arxiv.org/abs/2301.12935
  • unipc sampler (sampling in 5 steps)
    • smea: (nai) global attention sampling
  • Karras schedule no-blurriness improvement (reddit)
  • DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
    • several coefficients efficiently computed on the pretrained model; faster
  • STABLESR novel approach
  • 5.2.3: controls intensity of style

5. IMAGE EDITING

5.1. IMAGE SCULPTING =best=

  • Image Sculpting: Precise Object Editing with 3D Geometry Control
    • enables direct interaction with their 3D geometry
      • pose editing, translation, rotation, carving, serial addition, space deformation
    • turned into a NeRF using Zero-1-to-3, then rendered back to an image, preserving features

5.2. STYLE

  • StyleDrop: Text-to-Image Generation in Any Style (muse architecture)
    • 1% of parameters (painting style)
  • PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
    • learnable style word vectors, style-content features to be located nearby
  • Zero-shot Generative Model Adaptation via Image-specific Prompt Learning
    • adapt style to concept
  • StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
    • process the prompt and style features separately
  • DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models
    • textual embedding with style guidance
  • Cross-Image Attention for Zero-Shot Appearance Transfer
    • zero-shot appearance transfer by building on the self-attention layers of image diffusion models
    • architectural transfer
  • STYLECRAFTER transfer to video
  • Style Aligned Image Generation via Shared Attention =best= (as controlnet extension)
    • color palette too
  • FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models
    • style transfer built upon sd, dual-stream encoder and single-stream decoder architecture
    • content into pixelart, origami, anime
  • PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
    • image from segmentation map and also using semantic features
  • Visual Style Prompting with Swapping Self-Attention
    • consistent style across generations
    • unlike others (IP-Adapter), disentangles other semantics (like pose) away
  • DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations =best=
    • decouple the style and semantics of reference images
    • optimal balance between the text controllability and style similarity
  • InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
    • decouples style and content from reference images within the feature space
  • DreamWalk: Style Space Exploration using Diffusion Guidance
    • decompose the text prompt into conceptual elements, apply a separate guidance for each element
  • LCM-LOOKAHEAD

5.2.1. B-LoRA

  • Implicit Style-Content Separation using B-LoRA
    • preserving its underlying objects, structures, and concepts
    • LoRA of two specific blocks
    • image style transfer, text-based stylization, consistent style generation, and style-content mixing

5.2.2. STYLE TOOLS

  • Measuring Style Similarity in Diffusion Models
    • compute similarity score

5.2.3. DIRECT CONSISTENCY OPTIMIZATION

  • DCO: Direct Consistency Optimization for Compositional Text-to-Image Personalization
    • minimally fine-tunes the pretrained model to achieve consistency
    • new sampling method that controls the tradeoff between image fidelity and prompt fidelity

5.3. REGIONS

5.3.1. REGIONS MERGE

  • MULTIPLE DIFFUSION BIGGER COHERENCE 5.3.2.1 MULTIPLE LORA
  • MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
    • blending the predicted noises of two diffusion models in a saliency-aware manner (composite)
  • Text2Layer: Layered Image Generation using Latent Diffusion Model
    • train an autoencoder to reconstruct layered images and train models on the latent representation
    • generate background, foreground, layer mask, and the composed image simultaneously
  • Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
    • bind each attachment to corresponding subjects separately with split text prompts
    • object segmentation to obtain the layouts of subjects, then isolate and resynthesize individually
  • Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
    • bounded attention, training-free method; bounding information flow in the sampling process
    • prevents leakage, promotes each subject’s individuality, even with complex multi-subject conditioning
5.3.1.1. INTERPOLATION
  • Latent Blending (interpolates latents; see the slerp sketch after this list)
  • Interpolating between Images with Diffusion Models
    • convincing interpolations across diverse subject poses, image styles, and image content
  • Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models =best=
    • steady change in the output image, plug-and-play Smooth-LoRA; best interpolation
    • perhaps for video or drag diffusion
  • OMG: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
    • integrate multiple concepts within a single image
    • combined with LoRA and InstantID
  1. DIFFMORPHER
    • DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing =best=
      • alternative to GANs; interpolates between their LoRAs (not just their latents)
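
A minimal slerp sketch for the latent-interpolation entries above; Gaussian latents concentrate near a hypersphere, so spherical interpolation usually blends more gracefully than a straight lerp:

    import torch

    def slerp(z0, z1, t, eps=1e-7):
        # Spherical linear interpolation between two latent tensors of equal shape.
        a, b = z0.flatten(), z1.flatten()
        cos = torch.clamp((a @ b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps)
        omega = torch.acos(cos)
        so = torch.sin(omega)
        return (torch.sin((1 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1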

5.3.2. MINIMAL CHANGES

5.3.2.1. HARMONIZATION
  • 5.3.1
  • SEELE: Repositioning The Subject Within Image
    • minimal changes like moving people, subject removal, subject completion and harmonization
  • Collage Diffusion (harmonize collaged images)
  • Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
    • given a coarsely edited image (cut and move blob), synthesizes a photorealistic output
  1. SWAPANYTHING
    • SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
      • keeping the context unchanged (e.g. textures, clothes)
5.3.2.2. REGION EXCHANGE
  1. SUBJECT SWAPPING
    • Photoswap: Personalized Subject Swapping in Images
    • LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping
  2. BETTER INPAINTING
    • 3.6.1.1
    • A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
      • inpainting model: context-aware image and shape-guided object inpainting, object removal, controlnet
    • ReplaceAnything as you want: Ultra-high quality content replacement
      • masked region is strictly retained
    • DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
      • good proportions, (clothes) texture quality, no limb distortions
    • StrDiffusion: Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
      • semantically sparse structure in early stage, dense texture in late stage
    1. MAPPED INPAINTING
      • Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
      1. DIFFERENTIAL DIFFUSION
        • Differential Diffusion: Giving Each Pixel Its Strength =best=
          • control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes
          • using change maps: a grayscale map of how much each region is allowed to change
    2. CLOTHES OUTFITS
      • Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
        • virtually place any e-commerce item in any setting
  3. PIX2PIX REGION
    • pix2pix-zero (prompt2prompt without a prompt)
    • plug-and-play: like pix2pix but with extracted features
  4. FORCE IT WHERE IT FITS
    • MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
      • no training or finetuning; instead force the prompt (exchange the noise)
    • PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance
      • forces input image into edited image, object-level
  5. PROMPT IS TARGET
    • Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
      • only changes where the prompt fits
    • Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
      • erasing unwanted pixels; estimates which object is to be removed
    • HIVE: Harnessing Human Feedback for Instructional Visual Editing (reward model)
      • RLHF on editing instructions, to get the output to adhere to the correct instructions
      • LIME: Localized Image Editing via Attention Regularization in Diffusion Models
        • do not require specified regions or additional text input
        • clustering technique = segmentation maps; without re-training and fine-tuning
    1. DDIM
      • MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond =best=
        • prompt redescription strategy, revised DDIM inversion
      • Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
        • better DDIM
      • ReNoise: Real Image Inversion Through Iterative Noising
        • building on reversing the diffusion sampling process to manipulate an image (see the DDIM-step sketch below)
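
For reference, the deterministic DDIM update these inversion methods build on and refine, in the epsilon-prediction parameterization; chaining it toward lower alpha-bar (more noise) inverts a real image, chaining it toward higher alpha-bar denoises again (a generic sketch, not any single paper's exact scheme):

    def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_next):
        # alpha_bar_next > alpha_bar_t moves toward the clean image (denoising);
        # alpha_bar_next < alpha_bar_t deterministically re-noises (inversion).
        x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
        return alpha_bar_next ** 0.5 * x0_pred + (1 - alpha_bar_next) ** 0.5 * eps_pred
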
5.3.2.3. SEMANTIC CHANGE - DETECTION
  • SEGA: semantic guidance (apply concept arithmetic to an existing generation)
  • EDICT: repo Exact Diffusion Inversion via Coupled Transformations
    • edits/changes object types (dog breeds)
    • adds noise and complex transformations while still achieving perfect invertibility
  • The Hidden Language of Diffusion Models
    • learning interpretable pseudotokens from interpolating unet concepts
    • useful for: single-image decomposition to tokens, bias detection, and semantic image manipulation
  1. SWAP PROMPT
    • 2.7.5 2.7.5.1.1
    • LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
      • prompt changing, minimal variations
      • LEDITS++, an efficient, versatile & precise textual image manipulator =best=
        • no tuning, no optimization, few diffusion steps, multiple simultaneous edits
        • architecture-agnostic, masking for local changes; building on SEGA
    • StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
      • preserve the object-like attention maps after editing
5.3.2.4. INSTRUCTIONS
  1. IMAGE INSTRUCTIONS
    • Instruct-Imagen: Image Generation with Multi-modal Instruction
      • example images as style, boundary, edges, sketch
    • ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
      • a pair of images as visual instructions
      • instruction learning as inpainting problem, useful for pose transfer, image translation and video inpainting
  2. IMAGE TRANSLATION
    • 2.7.6 3.4.3.1 MESH TO MESH SDXS
    • DRAG DIFFUSION dragging two points on the image
    • Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
      • zero-shot (I2I) across large domain gaps, like skeleton to dinosaur
      • prompting provides target domain
    • IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis
    • One-Step Image Translation with Text-to-Image Models
      • adapting a single-step diffusion model; preserve the input image structure
    1. INTO MANGA
      • Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models
        • turns a normal generation into manga style while fixing the light anomalies (actually looks like manga)
        • fixes the tones
    2. ARTIST EDITING
      • Re:Draw – Context Aware Translation as a Controllable Method for Artistic Production
        • context-aware (style and emotion) inpainting; e.g. the color of an eye
      • ToonAging: Face Re-Aging upon Artistic Portrait Style Transfer (including anime)
        • and portrait style transfer, single generation step
    3. SLIME
      • SLiMe: Segment Like Me
        • extract attention maps, learn about segmented region, then inference
  3. EXPLICIT REGION

5.4. SPECIFIC CONCEPTS

  • 1
  • ConceptLab: Creative Generation using Diffusion Prior Constraints
    • generates a new, imaginary concept; adaptive constraint-optimization process
  • SeedSelect: rare concept images, generation of uncommon and ill-formed concepts
    • selecting suitable generation seeds from few samples
  • E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance =best=
    • preserving the semantic structure

5.4.1. CONTEXT LEARNING

  • DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data
    • keep the relative distances between adapted samples to achieve generation diversity
  • SuTi: Subject-driven Text-to-Image Generation via Apprenticeship Learning (using examples)
    • replaces subject-specific fine-tuning with in-context learning
5.4.1.1. SEMANTIC CORRESPONDENCE
  • Unsupervised Semantic Correspondence Using Stable Diffusion =no training= =from other image=
    • find locations in multiple images that have the same semantic meaning
    • optimize prompt embeddings for maximum attention on the regions of interest
    • capture semantic information about location, which can then be transferred to another image
5.4.1.2. IMAGE RELATIONSHIPS
  • Controlling Text-to-Image Diffusion by Orthogonal Finetuning
    • preserves the hyperspherical energy of the pairwise neuron relationship
    • preserves semantic coherence (relationships)
  • TOKENCOMPOSE
  1. VERBS
    • ReVersion: Diffusion-Based Relation Inversion from Images
      • like putting images on materials
      • unlike inverting object appearance, inverting object relations
    • ADI: Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
      • learn action-specific identifiers from the exemplar images ignoring appearances
    • Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
      • concepts that can interact with other concepts, using masks to teach

5.4.2. EXTRA PRETRAINED

  • GUIDING FUNCTION 2.10.3.3.2
  • E4T-diffusion: Tuning encoder: the text embedding + offset weights (Needs a >40GB GPU ) (faces)
  • BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
    • learned in 40 steps vs Textual Inversion 3000
    • Subject-driven Style Transfer, Subject Interpolation
    • concept replacement
    • Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
5.4.2.1. UNDERSTANDING NETWORK
  • Elite: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
    • extra neural network to get text embedding, fastest text embeddings
  • ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation
    • extra network on top; does not finetune the original diffusion model; awesome quality
    • unlike ELITE: automatic mechanism to generate the object mask from cross-attentions
  • 2.10.3.3.3 faces

5.4.3. SEVERAL CONCEPTS

5.4.4. CONES

  • Cones: Concept Neurons in Diffusion Models for Customized Generation (better than Custom Diffusion)
    • index only the locations in the layers that give rise to a subject, add them together to include multiple subjects in a new context
    • Cones 2: Customizable Image Synthesis with Multiple Subjects
      • flexible composition of various subjects without any model tuning
      • learning an extra embedding on top of a regular text embedding, and using layout to compose

5.4.5. SVDIFF

  • SVDiff: Compact Parameter Space for Diffusion Fine-Tuning, code(soon)
    • multisubject learning, like D3S
    • personalized concepts, combinable; training gan out of its conv
    • Singular Value Decomposition (SVD) = gene coefficient vs expression level
    • CoSINE: Compact parameter space for SINgle image Editing (remove from the prompt after finetuning it)
    • DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
      • it's PEFT for diffusion

5.4.6. LIKE ORIGINAL ONES

  • 2 passes to make bigger: Standard High-Res fix or Deep Shrink High-Res Fix (kohya)
  • VeRA: Vector-based Random Matrix Adaptation
    • single pair of low-rank matrices shared across all layers and learning small scaling vectors instead
    • 10x less parameters
  • An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
    • Multi-Concept Prompt Learning (MCPL)
    • disentangled concepts with enhanced word-concept correlation
  • X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
    • feature remapping from SD 1.5 to SDXL for all loras and controlnets
    • so you can train at lower resources and map to higher
  • 2.9.5.1
  • 2.1 : learning text embeddings for each layer of the unet
    • PALP: Prompt Aligned Personalization of Text-to-Image Models
      • input: image and prompt
      • display ALL the tokens, not just some
  • λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
    • (personalization for Kandinsky) trained using a projection loss and a CLIP contrastive loss
  • DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
    • plug-in method that does semantic matching instead of replacement-disruption
  • UniHDA: A Unified and Versatile framework for generative Hybrid Domain Adaptation
    • blends all characteristics at once, maintains robust cross-domain consistency
5.4.6.1. TARGETING CONTEXTUAL CONSISTENCY
  • Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization
    • approach to boost identity consistency and generative diversity for personalization methods
  • Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
    • class-characterizing regularization to preserve prior knowledge of object classes, so it integrates seamlessly with existing concepts
5.4.6.2. LORA
  • lora, lycoris, loha, lokr (the base low-rank update is sketched after this list)
  • use regularization images with lora https://rentry.org/59xed3#regularization-images
  • GLORA: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
    • an individual adapter for each layer
    • superior accuracy with fewer parameters and computations
  • PEFT x Diffusers Integration
  • Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
    • 13% of LoRA's parameters; parameter efficiency
  • Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models, plug and play =best=
    • concept sliders that enable precise control over attributes
    • intuitive editing of visual concepts for which textual description is difficult
    • repair of object deformations and fixing distorted hands
  • ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
    • cheaply and effectively merge independently trained style and subject LoRAs
  • DoRA: Weight-Decomposed Low-Rank Adaptation
    • decomposes the pre-trained weight into two components, magnitude and direction; directional updates
  • DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model
    • Kronecker product-based adaptation; reduces the parameter count by up to 35% vs LoRA
  • 5.2.1
  • CAT: Contrastive Adapter Training for Personalized Image Generation
    • no loss of diversity in object generation, no token = no effect
  1. MULTIPLE LORA
    • S-LoRA: Serving Thousands of Concurrent LoRA Adapters
      • scalable serving of many LoRA adapters: keeps all adapters in main memory and fetches the ones needed for current queries
    • MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
      • changes parameter initialization of adaptation matrices to reduce parameter dependency
    • Orthogonal Adaptation for Modular Customization of Diffusion Models
      • customized models can be summed with minimal interference, and jointly synthesize
      • scalable customization of diffusion models by encouraging orthogonal weights
    • Multi-LoRA Composition for Image Generation
    • CLoRA: A Contrastive Approach to Compose Multiple LoRA Models
      • enables the creation of composite images that truly reflect the characteristics of each LoRA
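
For reference, the low-rank update all of these variants build on, as a minimal wrapper around a frozen linear layer (a sketch; real trainers patch the attention projections of the U-Net and text encoder):

    import torch.nn as nn

    class LoRALinear(nn.Module):
        # y = W x + (alpha / rank) * B(A(x)), with W frozen and only A, B trained.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.requires_grad_(False)
            self.down = nn.Linear(base.in_features, rank, bias=False)   # A
            self.up = nn.Linear(rank, base.out_features, bias=False)    # B
            nn.init.zeros_(self.up.weight)                              # start as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))
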
5.4.6.3. TEXTUAL INVERSION
  • Multiresolution Textual Inversion: better textual inversion (embedding)
  • Extended Textual Inversion (XTI)
    • P+: Extended Textual Conditioning in Text-to-Image Generation
      • different text embedding per unet layer
      • code
    • SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (llm)
      • adapter to transfer the semantic understanding of an LLM to align complex vs simple prompts
  • DREAMDISTRIBUTION is like Textual Inversion
  • CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
    • learns the gap between the personalized concept and its base class

6. USE CASES

6.1. IMAGE COMPRESSION FILE

  • Robustly overfitting latents for flexible neural image compression
    • refine the latents of pre-trained neural image compression models
  • Learned Image Compression with Text Quality Enhancement
    • text logit loss function

6.2. DIFFUSION AS ENCODER - RETRIEVE PROMPT

  • De-Diffusion Makes Text a Strong Cross-Modal Interface
    • text as a cross-modal interface
    • autoencoder uses a pre-trained text-to-image diffusion model for decoding
      • encoder is trained to transform an input image into text
  • PH2P: Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
    • projection scheme to optimize for prompts representative of the space in the model (meaningful prompts)

6.3. DIFFUSING TEXT

6.3.1. GENERATE VECTORS

  • VecFusion: Vector Font Generation with Diffusion
    • generates rasterized fonts first, then a vector model synthesizes the vector fonts
  • StarVector: Generating Scalable Vector Graphics Code from Images
    • CLIP image encoder, learning to align the visual and code tokens, generate SVGs
  • StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
    • encoding into stroke tokens, naturally compatible with LLMs
  • SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
    • creation of vector graphics depicting entire scenes from textual descriptions
    • optimized using a pre-trained encoder

6.3.2. INPAINTING TEXT

  • DiffSTE: Inpainting to edit text in images with a prompt (model)
    • Improving Diffusion Models for Scene Text Editing with Dual Encoders
6.3.2.1. DERIVED FROM SD
  • UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (with training code)
  • Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model (Diff-Text)
    • attention constraint to address unreasonable positioning, more accurate scene text, any language
    • it's just a prompt and Canny: “sign”, “billboard”, “label”, “promotions”, “notice”, “marquee”, “board”, “blackboard”, “slogan”, “whiteboard”, “logo”
  • AnyText: Multilingual Visual Text Generation And Editing =best=
    • inputs: glyph, position, and masked image to generate latent features for text generation-editing
    • text curved into shapes and textures

6.4. IMAGE RESTORATION, SUPER-RESOLUTION

  • NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
    • image signal processing pipeline; multiple blendable styles in a single network
  • FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
  • refusion: Image Restoration with Mean-Reverting Stochastic Differential Equations
  • image restoration IR, DDNM using NULL-SPACE
  • SVNR: Spatially-variant Noise Removal with Denoising Diffusion
    • real life noise fixing
  • Dense Pixel-to-Pixel Harmonization via Continuous Image Representation
    • fixes images stretched due to changes in resolution
    • Zero-Shot Image Harmonization with Generative Model Prior
  • DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
    • uses SwinIR, then refines with SD

6.4.1. SUPERRESOLUTION

  • CCSR: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
  • Swintormer: Image Deblurring based on Diffusion Models (limited memory)
  • Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
    • for videos, temporal adapter to ensure temporal coherence
  • YONOS-SR: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
    • starts by training a teacher model on a smaller magnification scale
    • one step instead of 200, with a finetuned decoder on top
  • SUPIR: Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
    • based on large-scale diffusion generative prior
  • Face to Cartoon Incremental Super-Resolution using Knowledge Distillation
    • faces and anime restoration at various levels of detail
  • APISR: Anime Production Inspired Real-World Anime Super-Resolution
  • Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model
    • pyramid latent representation
6.4.1.1. STABLESR
  • StableSR: Exploiting Diffusion Prior for Real-World Image Super-Resolution
    • develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models
6.4.1.2. DEMOFUSION
  • DemoFusion: Democratising High-Resolution Image Generation With No $$$
    • achieve higher-resolution image generation
    • Enhance This: DemoFusion SDXL
    • ComfyUI Iterative Mixing Nodes =best=
      • iterative mixing of samples to help with upscaling quality
      • SD 1.5 generating at higher resolutions
      • evolution from NNLatentUpscale
  1. PASD MAGNIFY
    • PASD Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
      • image slider custom component

6.5. DEPTH GENERATION

  • depth map from diffusion; build a 3D environment with it
    • VPD: using diffusion for depth estimation, image segmentation (better) comparable 1.1
  • ZoeDepth: Combining relative and metric depth
    • tiling ZoeDepth
    • PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
    • Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (70s inference)
  • LDM3D by intel, generates image & depth from text prompts
  • LDM3D-VR: Latent Diffusion Model for 3D VR
    • generating depth together, panoramic RGBD
  • DMD (Diffusion for Metric Depth)
    • Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model (depth from image)
  • Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (temporal coherence, no flickering)
  • GIBR

6.5.1. DEPTH DIFFUSION

  • MVDD: Multi-View Depth Diffusion Models
    • 3D shape generation, depth completion, and its potential as a 3D prior
    • enforce 3D consistency in multi-view depth
  • DepthFM: Fast Monocular Depth Estimation with Flow Matching
    • pre-trained image diffusion model can become flow matching depth model

6.5.2. NORMAL MAPS

  • DSine: Rethinking Inductive Biases for Surface Normal Estimation

Author: Tekakutli

Created: 2024-04-13 Sat 04:35