stable diffusion
Table of Contents
- 1. SD MODELS
- 2. GENERATION CONTROL
- 3. BETTER DIFFUSION
- 4. SAMPLERS
- 5. IMAGE EDITING
- 6. USE CASES
- parent: diffusion
- related: diffusion video software
- combining pipelines, creating pipelines
- generate: NOVEL VIEW
- how to apply classifier guidance to diffusion
1. SD MODELS
- CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
- CC-licensed images with BLIP-2 captions, similar performance to Stable Diffusion 2 (apache license)
- Terminus XL Gamma: simpler SDXL, for inpainting tasks, super-resolution, style transfer
- 5.2
- AnimateLCM-SVD-xt: image to video
- stable-cascade: würstchen architecture = even smaller latent space
- Stable-Cascade-FP16
- sd x8 compression (1024x1024 > 128x128) vs cascade x42 compression (1024x1024 > 24x24)
- faster inference, cheaper training
- STABLE DIFFUSION 3
1.1. DISTILLATION
- SSD-1B (distilled SDXL): 60% faster, -40% VRAM
- SDXL-Lightning: a lightning fast 1024px text-to-image generation model (few-steps generation)
- progressive adversarial diffusion distillation
1.1.1. ONE STEP DIFFUSION
- One-step Diffusion with Distribution Matching Distillation
- comparable with v1.5 while being 30x faster
- critic similar to GANs in that it is jointly trained with the generator
- differs in that it does not play an adversarial game, and can fully leverage a pretrained model
1.1.2. SDXS
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
- knowledge distillation to streamline the U-Net and image decoder architectures
- one-step DM training technique that utilizes feature matching and score distillation
- speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a GPU
- image-conditioned control, facilitating efficient image-to-image translation.
1.2. IRIS LUX
- https://civitai.com/models/201287
- model created through consensus via statistical filtering (novel consensus merge)
- https://gist.github.com/Extraltodeus/0700821a3df907914994eb48036fc23e
1.3. EMOJIS
- Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
- emojis, stickers
1.4. MERGING MODELS
1.4.1. SEGMOE
- SegMoE - The Stable Diffusion Mixture of Experts for Image Generation, Mixture of Diffusion Experts
- training free, creation of larger models on the fly, larger knowledge
2. GENERATION CONTROL
- 5.4.2 DRAG 5.3.2.2.2.1
- hyperparameters with extra network Mid-U Guidance
- block weights lora
- DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation
- using light hints to resynthesize a prompt with user-defined consistent lighting
- Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
- refines the output iteratively in the latent space
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
- explicitly optimizing pixel-level cycle consistency between generated images
2.1. MATERIAL EXTRACTION
- U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- generates images with the material or color extracted from the input image
- sentence describing the desired attribute
- learn user-specified visual attributes
- ZeST: Zero-Shot Material Transfer from a Single Image
- leverages adapters to extract implicit material representation from exemplar image
2.2. LIGHT CONTROL
- DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- render a chrome ball into the input image
- produces convincing light estimates
2.3. BACKGROUND
- BriaAI: Open-Source Background Removal (RMBG v1.4)
- LayerDiffusion: Transparent Image Layer Diffusion using Latent Transparency
- layers with alpha, generate pngs, remove backgrounds (more like generate with removable background)
- method learns a “latent transparency”
- models
2.4. EMOTIONS
- Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- face swapping and reenactment, interpolate between emotions
- EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
- clip, abstract emotions
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
- understanding and editing source images emotions cues
2.5. NOISE CONTROL
- offset noise (darkness-capable LoRAs), pyramid noise (see the sketch at the end of this subsection)
- Common Diffusion Noise Schedules and Sample Steps are Flawed (and several proposed fixes)
- native offset noise
- noisy perlin latent
- you can reinject the same noise pattern after an upscale, more coherent results and better upscaling
- Blue noise for diffusion models
- allows introducing correlation across images within a single mini-batch to improve gradient flow
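- A minimal PyTorch sketch of the two noise tricks above (offset noise and pyramid noise); the shapes, strengths and renormalization are assumptions, not the exact recipes from the linked write-ups:
#+begin_src python
import torch
import torch.nn.functional as F

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    # add one random constant per sample/channel so the model can learn
    # overall brightness shifts (very dark or very bright images)
    b, c, _, _ = latents.shape
    noise = torch.randn_like(latents)
    return noise + strength * torch.randn(b, c, 1, 1, device=latents.device)

def pyramid_noise(latents: torch.Tensor, discount: float = 0.8, levels: int = 4) -> torch.Tensor:
    # mix in noise drawn at progressively coarser resolutions, upsampled back
    b, c, h, w = latents.shape
    noise = torch.randn_like(latents)
    for i in range(1, levels):
        lh, lw = max(1, h // 2**i), max(1, w // 2**i)
        coarse = torch.randn(b, c, lh, lw, device=latents.device)
        noise = noise + discount**i * F.interpolate(coarse, size=(h, w), mode="bilinear")
    return noise / noise.std()  # back to roughly unit variance

latents = torch.randn(2, 4, 64, 64)
print(offset_noise(latents).shape, pyramid_noise(latents).std())
#+end_src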
2.6. GUIDING FUNCTION
- Universal Guided Diffusion (face and style transfer)
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- extra: repo has list of deblurring, super-resolution and restoration methods
- masks as energy function
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- Diffusion Self-Guidance for Controllable Image Generation
- steer sampling, similarly to classifier guidance, but using signals in the pretrained model itself
- instructional transformations
- MCM Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (module after denoiser) mmc
2.6.1. ADAPTIVE GUIDANCE
- Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models
- AG, an efficient variant of CFG (Classifier-Free Guidance); reduces computation by 25%
- omits network evaluations when the denoising process displays convergence
- the second half of the denoising process is largely redundant; plug-and-play alternative to Guidance Distillation (sketch below)
- LinearAG: entire neural-evaluations can be replaced by affine transformations of past estimates
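- Toy sketch of the idea behind AG: run full CFG early, then drop the unconditional pass once the trajectory has converged. `toy_unet`, the placeholder update and the 60% cut-off are assumptions, not the paper's exact criterion:
#+begin_src python
import torch

def toy_unet(x, t, cond):
    # stand-in denoiser; a real UNet forward pass would go here
    return x * 0.0 + cond.mean()

def guided_eps(unet, x, t, cond, uncond, scale=7.5, skip_uncond=False):
    eps_cond = unet(x, t, cond)
    if skip_uncond or scale == 1.0:
        return eps_cond  # single evaluation: roughly half the cost of full CFG
    eps_uncond = unet(x, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

x = torch.randn(1, 4, 64, 64)
cond, uncond = torch.randn(1, 77, 768), torch.zeros(1, 77, 768)
steps = 30
for i in range(steps):
    # keep CFG in the early steps, skip the unconditional pass in the late ones
    eps = guided_eps(toy_unet, x, i, cond, uncond, skip_uncond=(i > 0.6 * steps))
    x = x - 0.1 * eps  # placeholder update instead of a real scheduler step
#+end_src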
2.7. CONTROL NETWORKS, CONTROLNET
- REFERENCENET CONTROLNET FOR 3D 3.4.3.1 CONTROLNET VIDEO
- why controlnet, and alternatives: https://github.com/lllyasviel/ControlNet/discussions/188 (basic diffusers usage sketched at the end of this list)
- VisorGPT: Learning Visual Prior via Generative Pre-Training
- gpt that learns to transform normal prompts into controlnet primitives
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- geometric control via human pose images and appearance control via instance-level text prompts
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- alignment with guidance image: lidar, face mesh, wireframe mesh, rag doll
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- instance-specific text description, better prompt following
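- A minimal diffusers sketch of driving generation with a ControlNet (model ids, dtype and the pre-computed edge-map path are assumptions; any SD 1.5 checkpoint works):
#+begin_src python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# pre-computed canny edge map of the reference picture (placeholder path)
control = Image.open("canny_edges.png").convert("RGB")
out = pipe("a futuristic city at dusk", image=control,
           num_inference_steps=25).images[0]
out.save("controlnet_canny.png")
#+end_src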
2.7.1. SKETCH
- diffmorph: text-less image morphing with diffusion models
- sketch-to-image module
- Block and Detail: Scaffolding Sketch-to-Image Generation
- sketch-to-image, can generate coherent elements from partial sketches, generate beyond the sketch following the prompt
- CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
- one for contour, the other flow lines representing texture
2.7.2. ALTERNATIVES
- controlNet (total control of image generation, from doodles to masks)
- T2I-Adapter (lighter, composable), e.g. color palette control
- lora like (old) https://github.com/HighCWu/ControlLoRA
- ControlNet-XS: 1% of the parameters
- LooseControl: Lifting ControlNet for Generalized Depth Conditioning
- loosely specifying scenes with boxes
- controlnet-lllite by kohya
- SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- lightweight tuning module named SC-Tuner, synthesis by injecting different conditions
- reduces training parameters and memory requirements
- Integrated Into SCEPTER and SWIFT
- Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
- imposing global semantics onto targeted regions without the use of any additional localization cues
- alternative to controlnet and t2i-adapter
2.7.3. TIP: text restoration
- TIP: Text-Driven Image Processing with Semantic and Restoration Instructions
=best=
- controlnet architecture, leverages natural language as interface to control image restoration
- instruction driven, can imprint text into the image
2.7.4. HANDS
- HANDS DATASET
- HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models
- two-hand interactions, motion in-betweening and trajectory control
2.7.4.1. RESTORING HANDS
- Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images
- body pose estimation to understand hand orientation for accurate anomaly correction
- integration of ControlNet and InstructPix2Pix
- HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
- incorrect number of fingers, irregular shapes, effectively rectified
- utilize ControlNet modules to re-inject corrected information, 1.5
2.7.5. USING ATTENTION MAP
- 5.4.4 4.1 STORYTELLER DIFFUSION
- RIVAL: Real-World Image Variation by Aligning Diffusion Inversion Chain
=best=
2.7.5.1. MASA
- MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- same thing different views or poses
- by querying the attention map from another image
- better than ddim inversion, consistent SD animations; mixable with T2I-Adapter
- TI-GUIDED-EDIT
- Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
- rigid=conserve the structure
- Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
2.7.5.2. LLLYASVIEL
- reference-only preprocessor doesn't require any control models; generates variations
- can guide the diffusion directly using images as references, and generate variations
- Guess Mode / Non-Prompt Mode, now named: Control Modes, how much prompt vs controlnet; comfy node
2.7.6. SEVERAL CONTROLS IN ONE
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
- several controlnets in one, contextual understanding
- image deblurring, image colorization
- using UniControl with Stable Diffusion XL 1.0 Refiner; sketch to image tool
- In-Context Learning Unlocked for Diffusion Models
- learn translation of image to hed, depth, segmentation, outline
2.8. HUMAN PAINT
- SDEdit: guided image synthesis and editing with stochastic differential equations
- stroke-based inpainting/editing (img2img with partial noise; see the sketch at the end of this subsection)
- FOOLSDEDIT: Deceptively Steering Your Edits Towards Targeted Attribute-aware Distribution
- forcing SDEdit to generate a data distribution aligned with a specified attribute (e.g. female)
- Control Color: Multimodal Diffusion-Based Interactive Image Colorization
- paint over grayscale to recolor it
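- SDEdit in practice is just img2img with a partial-noise strength: the strokes are noised part of the way and denoised back under the prompt. Minimal diffusers sketch; model id and file paths are assumptions:
#+begin_src python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

sketch = Image.open("rough_strokes.png").convert("RGB").resize((512, 512))
# strength ~0.4-0.7: low keeps the stroke layout, high lets the model repaint more
out = pipe("oil painting of a mountain lake", image=sketch,
           strength=0.6, guidance_scale=7.5).images[0]
out.save("sdedit_result.png")
#+end_src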
2.9. LAYOUT DIFFUSION
- 3d: ROOM LAYOUT
- 3.5.1 STORYTELLER DIFFUSION
- ZestGuide: Zero-shot spatial layout conditioning for text-to-image diffusion models
- implicit segmentation maps can be extracted from cross-attention layers
- adds spatial conditioning to SD without fine-tuning
- Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints
- constraints representing design intentions
- continuous state-space design can incorporate differentiable aesthetic constraint functions in training
- by introducing conditions via masked input
- RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models
- dynamically balance the strengths of the two models in denoising process
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models
- better representing spatial relationships
- faithfully follow the spatial relationships specified in the text prompt
2.9.1. SCENES
- Generate Anything Anywhere in Any Scene
- training guides to focus on object identity, personalized concept with localization controllability
- 2.10.3.2 ALDM
2.9.2. WITH BOXES
- GLIGEN: Open-Set Grounded Text-to-Image Generation (boxes)
- Training-Free Layout Control with Cross-Attention Guidance
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
- InstanceDiffusion: Instance-level Control for Image Generation
- conditional generation, hierarchical bounding-boxes structure, feature (prompt) at point
- single points, scribbles, bounding boxes or segmentation masks
- Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
- bounding boxes with attribute(prompt) binding
2.9.3. ALDM
- ALDM: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
- layout faithfulness
2.9.4. OPEN-VOCABULARY
- Spatial-Aware Latent Initialization for Controllable Image Generation
- inverted reference image contains spatial awareness regarding positions, resulting in similar layouts
- open-vocabulary framework to customize a spatial-aware initialization
2.9.5. CARTOON
- Desigen: A Pipeline for Controllable Design Template Generation
- generating images with proper layout space for text; generating the template itself
2.9.5.1. COGCARTOON
- CogCartoon: Towards Practical Story Visualization
- plugin-guided and layout-guided inference; specific character = 316 KB plugin
2.10. IMAGE PROMPT - ONE IMAGE
2.10.1. UNET LESS
- ProFusion: Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach
- and can interpolate between two
- promptnet (embedding), encoder based, for style transform
- one image, no regularization needed
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
- using CLIP features extracted from the subject
2.10.2. IMAGE-SUGGESTION
- 5.4.1.1
- UMM-Diffusion, TIUE: Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
- takes joint texts and images
- only the image-mapping to a pseudo word embedding is learned
2.10.2.1. ZERO SHOT
- Context Diffusion: In-Context Aware Image Generation
- separates the encoding of the visual context; prompt not needed
- ReVision - Unclip https://comfyanonymous.github.io/ComfyUI_examples/sdxl/
- Revision gives the model the pooled output from CLIPVision G instead of the CLIP G text encoder
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- architecture designed for selectively capturing any subject from single or multiple reference images
- IP-ADAPTER
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
=stock SD=
- works with other controlnets (basic usage sketched at the end of this subsection)
- IP-Adapter-FaceID (face recognition model)
- LCM-LOOKAHEAD
- LCM-Lookahead for Encoder-based Text-to-Image Personalization
- LCM-based approach for propagating image-space losses to personalization model training and classifier guidance
- LCM-Lookahead for Encoder-based Text-to-Image Personalization
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
- SEECODERS
- Seecoders: Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
- Semantic Context Encoder, replaces clip with seecoder; works with
=stock SD=
- input image and controlnet
- unlike unCLIP, seecoders uses an extra model
- one image into several perspectives (MULTIVIEW DIFFUSION)
- the embeddings can be textures, effects, objects, semantics(contexts)
- Seecoders: Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
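- Minimal diffusers sketch of image prompting with IP-Adapter (needs a recent diffusers; the repo/weight names follow the h94/IP-Adapter layout and the reference path is a placeholder):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers the result

ref = Image.open("reference_style.png").convert("RGB")
out = pipe("a cat in a garden", ip_adapter_image=ref,
           num_inference_steps=30).images[0]
out.save("ip_adapter_cat.png")
#+end_src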
2.10.2.2. PERSONALIZATION
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- personalized images with only a single forward pass
- HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models; just one image
2.10.3. IDENTITY
- masked score estimation
- HiPer: Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion
- one image single thing, gets the clip
- 2.10.2.1.1
2.10.3.1. STORYTELLER DIFFUSION
- ConsiStory: Training-Free Consistent Text-to-Image Generation
- training-free approach for consistent subject (object) generation, 20x faster, multi-subject scenarios
- by sharing the internal activations of the pretrained model
2.10.3.2. ANYDOOR
- AnyDoor: Zero-shot Object-level Image Customization
- teleport target objects to new scenes at user-specified locations
- identity feature with detail feature
- moving objects, swapping them, multi-subject composition, try-on a cloth
2.10.3.3. SUBJECT
- Inserting Anybody in Diffusion Models via Celeb Basis
- one facial photograph, 1024 learnable parameters, 3 minutes; several at once
- Subject-Diffusion:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
- multi subject, single reference image
- PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models
- incorporates facial identity loss, single facial photo, single training phase
- The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
- FaceStudio: Put Your Face Everywhere in Seconds
=best=
- direct feed-forward mechanism, circumventing the need for intensive fine-tuning
- stylized images, facial images, and textual prompts to guide the image generation process
- SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
- face-wise attention loss to fit the face region
- IDENTITY IN VIDEO
- Magic-Me: Identity-Specific Video Customized Diffusion
- STABLEIDENTITY
- StableIdentity: Inserting Anybody into Anywhere at First Sight
- identity recontextualization with just one face image without finetuning
- also for into video/3D generation
- StableIdentity: Inserting Anybody into Anywhere at First Sight
- IDENTITY ZERO-SHOT
- InstantID: Zero-shot Identity-Preserving Generation in Seconds (using face encoder)
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding Paper page
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
=best=
- identity provided by the reference image while mitigating interference from textual input
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding Paper page
- CapHuman: Capture Your Moments in Parallel Universes
- encode then learn to align, identity preservation for new individuals without tuning
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
=best=
- Token-to-Patch Aligner = preserving fine features of the subjects; multiple subjects
- combinable with controlnet, and across styles
- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
- gradually narrowing to the specific subject, iteratively update the influence scope
- InstantID: Zero-shot Identity-Preserving Generation in Seconds (using face encoder)
- PHOTOMAKER
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
- encodes (via an MLP) images into an embedding which preserves ID
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
2.10.3.4. ANIME
- DreamArtist: a single image and target text (mainly works with anime)
- DreamTuner: Single Image is Enough for Subject-Driven Generation
- subject-encoder for coarse subject identity preservation, training-free
- DreamTuner: Single Image is Enough for Subject-Driven Generation
- pfg Prompt free generation; learns to interpret (anime) input-images
- old one: PaintByExample
3. BETTER DIFFUSION
- editing default of a prompt: https://github.com/bahjat-kawar/time-diffusion
- Self-Attention Guidance (SAG): SAG leverages intermediate attention maps of diffusion models at each iteration to capture essential information for the generative process and guide it accordingly
- pretty much just reimplemented the attention function without changing much else
- FreeU: Free Lunch in Diffusion U-Net (unet)
=best=
- improves diffusion model sample quality at no costs
- more color variance
- Diffusion Sampling with Momentum for Mitigating Divergence Artifacts
- incorporation of: Heavy Ball (HB) momentum = expand stability regions; Generalized HB (GHVB) = suppression of divergence artifacts
- better low step sampling
- DG: Detector Guidance for Multi-Object Text-to-Image Generation
- mid-diffusion, performs latent object detection then enhances following CAMs(cross-attention maps)
3.1. SCHEDULER
- simple diffusion: End-to-end diffusion for high resolution images
- shifted scheduled noise
- Sigmas Tools and The Golden Scheduler
3.2. QUALITY
- 3.6.2
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (dataset method)
- guide pre-trained model to exclusively generate good images
- HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
- Latent Structural Diffusion Model that simultaneously denoises depth and surface normal with RGB image
- Consistency Distilled Diff VAE
- Improved decoding for stable diffusion vaes
3.3. HUMAN FEEDBACK
- RLCM
- Aligning Text-to-Image Models using Human Feedback https://arxiv.org/abs/2302.12192
- Better Aligning Text-to-Image Models with Human Preference
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
- FABRIC: Personalizing Diffusion Models with Iterative Feedback
- training-free approach, exploits the self-attention layer
- improve the results of any Stable Diffusion model
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- Direct Preference for Denoising Diffusion Policy Optimization (D3PO)
- omits training a reward model
- Diffusion-DPO: Diffusion Model Alignment Using Direct Preference Optimization (training script)
- ALDM layout
- RL Diffusion: Large-scale Reinforcement Learning for Diffusion Models (improves pretrained)
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
=best=
- better training stability for unseen prompts
- reward difference of generated image pairs from their denoising trajectories
- MESH HUMAN FEEDBACK
3.3.1. ACTUALLY SELF-FEEDBACK
- SPIN-Diffusion: Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
=best=
- diffusion model engages in competition with its earlier versions, iterative self-improvement
- AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
- use Vision Models (VLM) to assess quality across style, coherence, and aesthetics, generating feedback
3.4. SD GENERATION OPTIMIZATION
- ONE STEP DIFFUSION 4 STABLE CASCADE
- turning off CFG when denoising sigmas below 1.1
- Tomesd: Token Merging for Stable Diffusion code
- ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
- token downsampling of key and value tokens to accelerate inference 2x-4x
- ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
- Nested Diffusion Processes for Anytime Image Generation
- can generate viable images when stopped arbitrarily before completion
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
- use sd as teacher model and train faster one using it as bootstrap; 30 fps
- Divide & Bind Your Attention for Improved Generative Semantic Nursing
- novel objective functions: can handle complex prompts with proper attribute binding
- Conditional Diffusion Distillation
- added parameters, supplementing image conditions to the diffusion priors
- super-resolution, image editing, and depth-to-image generation
- 4 2.6.1
- OneDiff: acceleration library for diffusion models, ComfyUI Nodes
- T-Stitch: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching
- improve sampling efficiency with no generation degradation
- smaller DPM in the initial steps, larger DPM at a later stage, 40% of the early timesteps
- The Missing U for Efficient Diffusion Models
- operates with approximately a quarter of the parameters, diffusion models 80% faster
3.4.1. ULTRA SPEED
- SDXL Turbo: a real-time text-to-image generation model (distillation; one-step usage sketched at the end of this subsection)
- ArtSpew: SD at 149 images per second (high volume random image generation)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation (10ms)
- transforms sequential denoising into batched denoising
- MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
- diffusion-GAN finetuning techniques to achieve 8-step and 1-step inference
- Accelerating Diffusion Sampling with Optimized Time Steps
- better image quality compared to using uniform time steps
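- Minimal diffusers sketch of one-step sampling with SDXL Turbo: guidance is disabled and a single inference step is used (model id per stabilityai; output path assumed):
#+begin_src python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16").to("cuda")

out = pipe("a cinematic photo of a lighthouse in a storm",
           num_inference_steps=1, guidance_scale=0.0).images[0]
out.save("sdxl_turbo.png")
#+end_src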
3.4.2. CACHE
- DeepCache: Accelerating Diffusion Models for Free
=best=
- exploits temporal redundancy observed in the sequential denoising steps
- superiority over existing pruning and distillation
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- reuse outputs from layer blocks of previous steps, automatically determine caching schedules
- Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models
=best=
- reuse cyclically the encoder features in the previous time-steps for the decoder
- Fast Inference Through The Reuse Of Attention Maps In Diffusion Models
- structured reuse of attention maps during sampling
- T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
- two stages: semantics-planning phase, and subsequent fidelity-improving phase
- so the cross-attention output is cached once it converges and kept fixed during the remaining inference
3.4.2.1. EXPLOITING FEATURES
- FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models
- Reusing feature maps with high temporal similarity
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- high-res features sensitive to small perturbations; low-res feature only sets semantic layout
- so reuses computation from preceding steps for low-res
3.4.3. LCM
- LCMs: Latent Consistency Models: Synthesizing High-Resolution Images with Few-step Inference
- inference with minimal steps (2-4)
- training LCM model: only 32 A100 GPU hours
- Latent Consistency Fine-tuning (LCF) custom datasets
- comfyui auto1111 the model
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
- universally applicable accelerator for diffusion models, plug-in neural PF-ODE solver (usage sketched at the end of this subsection)
- VideoLCM: Video Latent Consistency Model
- smooth video synthesis with only four sampling steps
- ANIMATELCM
- Quick Image Variations with LCM and Image Caption
- TCD: Trajectory Consistency Distillation (lora)
- accurately trace the entire trajectory of the Probability Flow ODE
- https://github.com/dfl/comfyui-tcd-scheduler
- LCM-LOOKAHEAD
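- Minimal diffusers sketch of the LCM-LoRA accelerator: swap in the LCM scheduler, load the LoRA, and sample in ~4 steps with guidance near 1 (the LoRA id is the published latent-consistency one; the base checkpoint is an assumption):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

out = pipe("a watercolor fox in a forest",
           num_inference_steps=4, guidance_scale=1.0).images[0]
out.save("lcm_lora_fox.png")
#+end_src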
3.4.3.1. CCM
- CCM: Adding Conditional Controls to Text-to-Image Consistency Models
- ControlNet-like, lightweight adapter can be jointly optimized while consistency training
3.4.3.2. PERFLOW
- PeRFlow (Piecewise Rectified Flow)
- fast generation, 4 steps, 4,000 training iterations
- multiview normal maps and textures from text prompts instantly
3.5. PROMPT CORRECTNESS
- ReCo: region control, counting donuts
- sd-webui-cutoff, hide tokens for each separated group, limits the token influence scope (color control)
- hard-prompts-made-easy
- magic prompt: amplifies-improves the prompt
- Get What You Want, Not What You Don’t: Image Content Suppression for Text-to-Image Diffusion Models
- suppress unwanted content generation of the prompt, and encourages the generation of desired content
- better than negative prompts
- Discriminative Probing and Tuning for Text-to-Image Generation
- discriminative adapter to improve their text-image alignment
- global matching and local grounding
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
- fine-tuning strategy with an image-to-text(captioning model) concept matching mechanism
- [[https://youtu.be/_Pr7aFkkAvY?si=Xr5e_RL-rwcdL10q][ELLA]] - A Powerful Adapter for Complex Stable Diffusion Prompts
- uses an adapter to plug in an LLM instead of CLIP
3.5.1. ATTENTION LAYOUT
- Attend-and-Excite (excite the ignored prompt tokens) (no retrain)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
- Directed Diffusion: Direct Control of Object Placement through Attention Guidance (no retrain) repo
- DenseDiffusion: Dense Text-to-Image Generation with Attention Modulation
- training free, layout guidance
3.5.2. LANGUAGE ENHANCEMENT
- 5.4.1.2
- Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
- using prompt sentence structure during inference to improve the faithfulness
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
- exploiting the semantic hierarchies of language sentences (lojban)
- Structured Diffusion Guidance, language enhanced clip enforces on unet
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
- prompt learning, improve the matches between the input text and the generated
3.5.2.1. PROMPT EXPANSION, PROMPT AUGMENTATION
- DanTagGen: LLaMA arch
- superprompter: Supercharge your AI/LLM prompts
- Capability-aware Prompt Reformulation Learning for Text-to-Image Generation
- effectively learn diverse reformulation strategies across various user capacities to simulate high-capability user reformulation
3.5.2.2. TOKENCOMPOSE
- TokenCompose: Grounding Diffusion with Token-level Supervision
=best=
- finetuned with token-wise grounding objectives for multi-category instance composition
- exploiting binary segmentation maps from SAM
- compositions that are unlikely to appear simultaneously in a natural scene
3.6. BIGGER COHERENCE
- 5.3.1.1 VIDEO COHERENCE
- Many-to-many Image Generation with Auto-regressive Diffusion Models
3.6.1. PANORAMAS
- DiffCollage: Parallel Generation of Large Content with Diffusion Models (panoramas)
- Collaborative Score Distillation for Consistent Visual Synthesis
- consistent visual synthesis across multiple samples
=best one=
- distill generative priors over a set of images synchronously
- zoom, video, panoramas
- consistent visual synthesis across multiple samples
- SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
- plug-and-play module that synchronizes multiple diffusions through gradient descent on a perceptual similarity loss (a toy tile-averaging sketch follows this list)
- Taming Stable Diffusion for Text to 360° Panorama Image Generation
- minimize distortion during the collaborative denoising process
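- Toy sketch of the tile-synchronization idea used by MultiDiffusion/SyncDiffusion-style panorama methods: each step denoises overlapping windows and averages them back into one wide latent (`denoise_window` is a stand-in, not a real UNet/scheduler step):
#+begin_src python
import torch

def denoise_window(window: torch.Tensor, step: int) -> torch.Tensor:
    return window * 0.98  # placeholder for one real UNet + scheduler step

def panorama_step(latent: torch.Tensor, step: int, win: int = 64, stride: int = 32):
    _, _, _, w = latent.shape
    acc, weight = torch.zeros_like(latent), torch.zeros_like(latent)
    for x0 in range(0, w - win + 1, stride):
        acc[:, :, :, x0:x0 + win] += denoise_window(latent[:, :, :, x0:x0 + win], step)
        weight[:, :, :, x0:x0 + win] += 1
    return acc / weight.clamp(min=1)  # average the overlapping predictions

latent = torch.randn(1, 4, 64, 192)  # a 3x-wide panorama latent
for step in range(30):
    latent = panorama_step(latent, step)
#+end_src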
3.6.1.1. OUTPAINTING
- 5.3.2.2.2
- Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach
- generate content beyond boundaries using relative positional information
- BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
- pre-trained SD model, useful in product exhibitions, virtual try-on, or background replacement
3.6.2. RESOLUTION
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
- training on images of unlimited sizes is unfeasible
- Fast Seamless Tiled Diffusion (FSTD)
- ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models (video too)
- generating images at much higher resolutions than the training image sizes
- does not require any training or optimization
- Matryoshka Diffusion Models
- diffusion process that denoises inputs at multiple resolutions jointly
- FIT TRANSFORMER
- Upsample Guidance: Scale Up Diffusion Models without Training
- technique that adapts pretrained model to generate higher-resolution images by adding a single term in the sampling process, without any additional training or relying on external models
- can be applied to various models, such as pixel-space, latent space, and video diffusion models
3.6.2.1. ARBITRARY
- ElasticDiffusion: Training-free Arbitrary Size Image Generation
- decoding method better than MultiDiffusion
- ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models
- unlike post-process, directly generates images with the dynamical resolution
- compatible with ControlNet, IP-Adapter and LCM-LoRA; can be integrated with ElasticDiffusion
4. SAMPLERS
- GENIE: Higher-Order Denoising Diffusion Solvers
- faster diffusion equation?
- DDIM vs GENIE
- 4x less expensive upsampling
- fastest solver https://arxiv.org/abs/2301.12935
- another accelerator: https://arxiv.org/abs/2301.11558
- unipc sampler (sampling in 5 steps; scheduler-swap sketch at the end of this section)
- smea: (nai) global attention sampling
- Karras schedule anti-blur improvement (reddit)
- DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
- several coefficients efficiently computed on the pretrained model, faster
- STABLESR novel approach
- 5.2.3: controls intensity of style
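- In diffusers, changing the sampler is just swapping the scheduler object; a minimal sketch with UniPC at a low step count (model id and step count are assumptions):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

out = pipe("an isometric castle at sunrise", num_inference_steps=8).images[0]
out.save("unipc_castle.png")
#+end_src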
5. IMAGE EDITING
- 3D-AWARE IMAGE EDITING
- null-text inversion: prompt-to-prompt but better
- imagic: editing photo with prompt
5.1. IMAGE SCULPTING =best=
- Image Sculpting: Precise Object Editing with 3D Geometry Control
- enables direct interaction with their 3D geometry
- pose editing, translation, rotation, carving, serial addition, space deformation
- turned into nerf using Zero-1-to-3, then returned to image including features
- enables direct interaction with their 3D geometry
5.2. STYLE
- StyleDrop: Text-to-Image Generation in Any Style (muse architecture)
- 1% of parameters (painting style)
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
- learnable style word vectors, style-content features to be located nearby
- Zero-shot Generative Model Adaptation via Image-specific Prompt Learning
- adapt style to concept
- StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
- process the prompt and style features separately
- DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models
- textual embedding with style guidance
- Cross-Image Attention for Zero-Shot Appearance Transfer
- zero-shot appearance transfer by building on the self-attention layers of image diffusion models
- architectural transfer
- STYLECRAFTER transfer to video
- Style Aligned Image Generation via Shared Attention
=best=
- (as controlnet extension) color palette too
- FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models
- style transfer built upon sd, dual-stream encoder and single-stream decoder architecture
- content into pixelart, origami, anime
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- image from segmentation map and also using semantic features
- Visual Style Prompting with Swapping Self-Attention
- consistent style across generations
- unlike others (ip-adapter) disentangle other semantics away (like pose)
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
=best=
- decouple the style and semantics of reference images
- optimal balance between the text controllability and style similarity
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
- decouples style and content from reference images within the feature space
- DreamWalk: Style Space Exploration using Diffusion Guidance
- decompose the text prompt into conceptual elements, apply a separate guidance for each element
- LCM-LOOKAHEAD
5.2.1. B-LoRA
- Implicit Style-Content Separation using B-LoRA
- preserving its underlying objects, structures, and concepts
- LoRA of two specific blocks
- image style transfer, text-based stylization, consistent style generation, and style-content mixing
5.2.2. STYLE TOOLS
- Measuring Style Similarity in Diffusion Models
- compute similarity score
5.2.3. DIRECT CONSISTENCY OPTIMIZATION
- DCO: Direct Consistency Optimization for Compositional Text-to-Image Personalization
- minimally fine-tuning pretrained to achieve consistency
- new sampling method that controls the tradeoff between image fidelity and prompt fidelity
5.3. REGIONS
- different inpainting ways with diffusers: https://github.com/huggingface/diffusers/pull/1585 (basic inpaint pipeline sketched at the end of this list)
- SceneComposer: paint with words but cooler
- bounding boxes instead: GLIGEN: image grounding
- better VAE and better masks: https://lipurple.github.io/Grounded_Diffusion/
- InstructGIE: Towards Generalizable Image Editing
- leveraging the VMamba Block, aligns language embeddings with editing semantics
- editing instructions dataset
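- Minimal diffusers inpainting sketch (white mask pixels are regenerated from the prompt, black ones are kept; model id and file paths are assumptions):
#+begin_src python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

image = Image.open("room.png").convert("RGB").resize((512, 512))
mask = Image.open("sofa_mask.png").convert("L").resize((512, 512))
out = pipe("a green velvet sofa", image=image, mask_image=mask,
           num_inference_steps=30).images[0]
out.save("inpainted_room.png")
#+end_src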
5.3.1. REGIONS MERGE
- MULTIPLE DIFFUSION BIGGER COHERENCE 5.3.2.1 MULTIPLE LORA
- MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
- blending the predicted noises of two diffusion models in a saliency-aware manner (composite)
- Text2Layer: Layered Image Generation using Latent Diffusion Model
- train an autoencoder to reconstruct layered images and train models on the latent representation
- generate background, foreground, layer mask, and the composed image simultaneously
- Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
- bind each attachment to corresponding subjects separately with split text prompts
- object segmentation to obtain the layouts of subjects, then isolate and resynthesize individually
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
- bounded attention, training-free method; bounding information flow in the sampling process
- prevents leakage, promotes each subject’s individuality, even with complex multi-subject conditioning
5.3.1.1. INTERPOLATION
- Latent Blending (interpolate latents; slerp sketch at the end of this subsection)
- latent couple, multidiffusion, attention couple
- comfy ui like but masks
- Interpolating between Images with Diffusion Models
- convincing interpolations across diverse subject poses, image styles, and image content
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
=best=
- steady change in the output image, plug-and-play Smooth-LoRA; best interpolation
- perhaps for video or drag diffusion
- OMG: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
- integrate multiple concepts within a single image
- combined with LoRA and InstantID
- DIFFMORPHER
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
=best=
- alternative to gan; interpolate between their loras (not just their latents)
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
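- Minimal sketch of latent interpolation: spherical interpolation (slerp) between two initial noise latents stays on the Gaussian shell and usually blends better than a straight lerp; each frame would then be denoised with the same prompt and settings (function and shapes are assumptions):
#+begin_src python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    # interpolate along the great circle between the two flattened latents
    a_flat, b_flat = a.flatten(1), b.flatten(1)
    cos = (a_flat * b_flat).sum(-1) / (a_flat.norm(dim=-1) * b_flat.norm(dim=-1) + eps)
    omega = torch.acos(cos.clamp(-1 + eps, 1 - eps))
    coef_a = torch.sin((1 - t) * omega) / torch.sin(omega)
    coef_b = torch.sin(t * omega) / torch.sin(omega)
    return coef_a.view(-1, 1, 1, 1) * a + coef_b.view(-1, 1, 1, 1) * b

z0, z1 = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
frames = [slerp(z0, z1, float(t)) for t in torch.linspace(0, 1, 9)]
#+end_src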
5.3.2. MINIMAL CHANGES
- SEMANTICALLY DEFORMED
- Delta Denoising Score: minimal modifications, keeping the image
5.3.2.1. HARMONIZATION
- 5.3.1
- SEELE: Repositioning The Subject Within Image
- minimal changes like moving people, subject removal, subject completion and harmonization
- Collage Diffusion (harmonize collaged images)
- Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
- given a coarsely edited image (cut and move blob), synthesizes a photorealistic output
- SWAPANYTHING
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
- keeping the context unchanged (e.g. textures, clothes)
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
5.3.2.2. REGION EXCHANGE
- VIDEO EXCHANGE 5.3.2.1.1
- RDM-Region-Aware-Diffusion-Model edits only the region of interest
- MagicMix: merges their noise shapes
- Blended Latent Diffusion
- input image and a mask, modifies the masked area according to a guiding text prompt
- SUBJECT SWAPPING
- Photoswap: Personalized Subject Swapping in Images
- LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping
- BETTER INPAINTING
- 3.6.1.1
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- inpainting model: context-aware image and shape-guided object inpainting, object removal, controlnet
- ReplaceAnything as you want: Ultra-high quality content replacement
- masked region is strictly retained
- 3.6.1.1
- DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
- good proportions, (clothes) texture quality, no limb distortions
- StrDiffusion: Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- semantically sparse structure in early stage, dense texture in late stage
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- MAPPED INPAINTING
- Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
- DIFFERENTIAL DIFFUSION
- Differential Diffusion: Giving Each Pixel Its Strength
=best=
- control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes
- using change maps: a grayscale map of how much each region is allowed to change
- Differential Diffusion: Giving Each Pixel Its Strength
- CLOTHES OUTFITS
- Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- virtually place any e-commerce item in any setting
- Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- PIX2PIX REGION
- FORCE IT WHERE IT FITS
- MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- no training or finetuning; instead force the prompt (exchange the noise)
- PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance
- forces input image into edited image, object-level
- MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- PROMPT IS TARGET
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
- only changes where the prompt fits
- Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
- erasing unwanted pixels; estimates which object to be removed
- HIVE: Harnessing Human Feedback for Instructional Visual Editing (reward model)
- rlhf, editing instruction, to get output to adhere to the correct instructions
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models
- do not require specified regions or additional text input
- clustering technique = segmentation maps; without re-training and fine-tuning
- DDIM
- MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
=best=
- prompt redescription strategy, revised DDIM inversion
- Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
- better DDIM
- ReNoise: Real Image Inversion Through Iterative Noising
- building on reversing the diffusion sampling process to manipulate an image
- MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
5.3.2.3. SEMANTIC CHANGE - DETECTION
- sega semantic guidance, (apply a concept arithmetic after having a generation)
- EDICT: repo Exact Diffusion Inversion via Coupled Transformations
- edits-changes object types(dog breeds)
- adds noise, complex transformations but still getting perfect invertibility
- The Hidden Language of Diffusion Models
- learning interpretable pseudotokens from interpolating unet concepts
- useful for: single-image decomposition to tokens, bias detection, and semantic image manipulation
- SWAP PROMPT
- 2.7.5 2.7.5.1.1
- LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
- prompt changing, minimal variations
- LEDITS++, an efficient, versatile & precise textual image manipulator
=best=
- no tuning, no optimization, few diffusion steps, multiple simultaneous edits
- architecture-agnostic, masking for local changes; building on SEGA
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
- preserve the object-like attention maps after editing
5.3.2.4. INSTRUCTIONS
- other: 5.3.2.2.3 GUIDING FUNCTION 2.7.3
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
- InstructPix2Pix paper
- MegaEdit: like instructPix2Pix but for any model
- based on EDICT and plug-and-play but using DDIM
- MegaEdit: like instructPix2Pix but for any model
- IMAGE INSTRUCTIONS
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- example images as style, boundary, edges, sketch
- ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
- a pair of images as visual instructions
- instruction learning as inpainting problem, useful for pose transfer, image translation and video inpainting
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- IMAGE TRANSLATION
- 2.7.6 3.4.3.1 MESH TO MESH SDXS
- DRAG DIFFUSION dragging two points on the image
- Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
- IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis
- One-Step Image Translation with Text-to-Image Models
- adapting a single-step diffusion model; preserve the input image structure
- INTO MANGA
- Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models
- normal generation into manga style while fixing the light anomalies (actually looks like manga)
- fixes the tones
- Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models
- ARTIST EDITING
- SLIME
- SLiMe: Segment Like Me
- extract attention maps, learn about segmented region, then inference
- SLiMe: Segment Like Me
- EXPLICIT REGION
- X-Decoder: instructPix2Pix per region(objects)
- comparable to 1.1
- PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models (region editing)
- X-Decoder: instructPix2Pix per region(objects)
5.4. SPECIFIC CONCEPTS
- 1
- ConceptLab: Creative Generation using Diffusion Prior Constraints
- generate a new, imaginary concept; adaptively constraints-optimization process
- SeedSelect: rare concept images, generation of uncommon and ill-formed concepts
- selecting suitable generation seeds from few samples
- E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance
=best=
- preserving the semantic structure
5.4.1. CONTEXT LEARNING
- DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data
- keep the relative distances between adapted samples to achieve generation diversity
- SuTi: Subject-driven Text-to-Image Generation via Apprenticeship Learning (using examples)
5.4.1.1. SEMANTIC CORRESPONDENCE
- Unsupervised Semantic Correspondence Using Stable Diffusion
=no training=
=from other image=
- find locations in multiple images that have the same semantic meaning
- optimize prompt embeddings for maximum attention on the regions of interest
- capture semantic information about location, which can then be transferred to another image
5.4.1.2. IMAGE RELATIONSHIPS
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning
- preserves the hyperspherical energy of the pairwise neuron relationship
- preserves semantic coherence (relationships)
- TOKENCOMPOSE
- VERBS
- ReVersion: Diffusion-Based Relation Inversion from Images
- like putting images on materials
- unlike inverting object appearance, inverting object relations
- ADI: Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
- learn action-specific identifiers from the exemplar images ignoring appearances
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
- concepts that can interact with other concepts, using masks to teach
- ReVersion: Diffusion-Based Relation Inversion from Images
5.4.2. EXTRA PRETRAINED
- GUIDING FUNCTION 2.10.3.3.2
- E4T-diffusion: Tuning encoder: the text embedding + offset weights (Needs a >40GB GPU ) (faces)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
- learned in 40 steps vs Textual Inversion 3000
- Subject-driven Style Transfer, Subject Interpolation
- concept replacement
- Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
5.4.2.1. UNDERSTANDING NETWORK
- Elite: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation
- 2.10.3.3.3 faces
5.4.3. SEVERAL CONCEPTS
- MULTIPLE DIFFUSION
- Expressive Text-to-Image Generation with Rich Text (learn concept-map from maxed averages)
- Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
- sequentially learned concepts
- Break-A-Scene: Extracting Multiple Concepts from a Single Image
- Key-Locked Rank One Editing for Text-to-Image Personalization
- combine individually learned concepts into a single generated image
- Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
- solving concept conflicts
5.4.4. CONES
- Cones: Concept Neurons in Diffusion Models for Customized Generation (better than Custom Diffusion)
- index only the locations in the layers that give rise to a subject, add them together to include multiple subjects in a new context
- Cones 2: Customizable Image Synthesis with Multiple Subjects
- flexible composition of various subjects without any model tuning
- learning an extra embedding on top of a regular text embedding, and using layout to compose
5.4.5. SVDIFF
- SVDiff: Compact Parameter Space for Diffusion Fine-Tuning, code(soon)
- multisubject learning, like D3S
- personalized concepts, combinable; training gan out of its conv
- Singular Value Decomposition (SVD) = gene coefficient vs expression level
- CoSINE: Compact parameter space for SINgle image Editing (remove from prompt after finetune it)
- DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
- its PEFT for diffusion
5.4.6. LIKE ORIGINAL ONES
- 2 passes to make bigger: Standard High-Res fix or Deep Shrink High-Res Fix (kohya)
- VeRA: Vector-based Random Matrix Adaptation
- single pair of low-rank matrices shared across all layers and learning small scaling vectors instead
- 10x less parameters
- An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
- Multi-Concept Prompt Learning (MCPL)
- disentangled concepts with enhanced word-concept correlation
- X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
- feature remapping from SD 1.5 to SDXL for all loras and controlnets
- so you can train at lower resources and map to higher
- 2.9.5.1
- 2.1 : learning text embeddings for each layer of the unet
- PALP: Prompt Aligned Personalization of Text-to-Image Models
- input: image and prompt
- display ALL the tokens, not just some
- PALP: Prompt Aligned Personalization of Text-to-Image Models
- λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image
- (Personalization for Kandinsky) trained using projection loss and clip contrastive loss
- plug-in method that does semantic matching instead of replacement-disruption
- UniHDA: A Unified and Versatile framework for generative Hybrid Domain Adaptation
- blends all characteristics at once, maintains robust cross-domain consistency
5.4.6.1. TARGETING CONTEXTUAL CONSISTENCY
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization
- approach to boost identity consistency and generative diversity for personalization methods
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
- class-characterizing regularization to preserve prior knowledge of object classes, so it integrates seamlessly with existing concepts
5.4.6.2. LORA
- lora, lycoris, loha, lokr
- loha handles multiple-concepts better
- use regularization images with lora https://rentry.org/59xed3#regularization-images
- GLORA: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
- individual adapter of each layer
- superior accuracy with fewer parameters and computations
- PEFT x Diffusers Integration (multi-LoRA loading sketched at the end of this subsection)
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
- 13% of the parameters of LoRA, parameter efficiency
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models, plug and play
=best=
- concept sliders that enable precise control over attributes
- intuitive editing of visual concepts for which textual description is difficult
- repair of object deformations and fixing distorted hands
- ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
- cheaply and effectively merge independently trained style and subject LoRAs
- DoRA: Weight-Decomposed Low-Rank Adaptation
- decomposes the pre-trained weight into two components, magnitude and direction; directional updates
- DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model
- Kronecker product-based adaptation, reduces the parameter count by up to 35% lora
- 5.2.1
- CAT: Contrastive Adapter Training for Personalized Image Generation
- no loss of diversity in object generation, no token = no effect
- MULTIPLE LORA
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- scalable serving of many LoRA adapters, all adapters in the main memory, fetches for the current queries
- MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
- changes parameter initialization of adaptation matrices to reduce parameter dependency
- Orthogonal Adaptation for Modular Customization of Diffusion Models
- customized models can be summed with minimal interference, and jointly synthesize
- scalable customization of diffusion models by encouraging orthogonal weights
- Multi-LoRA Composition for Image Generation
- CLoRA: A Contrastive Approach to Compose Multiple LoRA Models
- enables the creation of composite images that truly reflect the characteristics of each LoRA
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
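- Minimal diffusers/PEFT sketch of loading and blending independently trained LoRAs (adapter names, paths and weights are assumptions; needs the PEFT-backed LoRA support in recent diffusers):
#+begin_src python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pipe.load_lora_weights("path/to/style_lora", adapter_name="style")      # hypothetical paths
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")
pipe.set_adapters(["style", "subject"], adapter_weights=[0.8, 0.6])     # blend both

out = pipe("papercut style, a corgi astronaut", num_inference_steps=30).images[0]
out.save("multi_lora.png")
#+end_src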
5.4.6.3. TEXTUAL INVERSION
- Multiresolution Textual Inversion: better textual inversion (embedding; loading sketch at the end of this subsection)
- Extended Textual Inversion (XTI)
- P+: Extended Textual Conditioning in Text-to-Image Generation
- different text embedding per unet layer
- code
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (llm)
- adapter to transfer the semantic understanding of llm to align complex vs simple prompts
- P+: Extended Textual Conditioning in Text-to-Image Generation
- DREAMDISTRIBUTION is like Textual Inversion
- CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
- learns the gap between the personalized concept and its base class
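- Minimal diffusers sketch of using a learned textual-inversion embedding: load the concept, then use its trigger token in the prompt (the sd-concepts-library repo and <cat-toy> token follow the diffusers docs example; the base model is an assumption):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # adds the <cat-toy> token

out = pipe("a <cat-toy> sitting on a bookshelf", num_inference_steps=30).images[0]
out.save("textual_inversion.png")
#+end_src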
6. USE CASES
6.1. IMAGE COMPRESSION FILE
- Robustly overfitting latents for flexible neural image compression
- refine the latents of pre-trained neural image compression models
- Learned Image Compression with Text Quality Enhancement
- text logit loss function
6.2. DIFFUSION AS ENCODER - RETRIEVE PROMPT
- De-Diffusion Makes Text a Strong Cross-Modal Interface
- text as a cross-modal interface
- autoencoder uses a pre-trained text-to-image diffusion model for decoding
- encoder is trained to transform an input image into text
- PH2P: Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- projection scheme to optimize for prompts representative of the space in the model (meaningful prompts)
6.3. DIFFUSING TEXT
- 2.7.4.1
- DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion (fonts)
- GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently
- TextDiffuser: Diffusion Models as Text Painters
- Typographic Text Generation with Off-the-Shelf Diffusion Model
- complex effects while preserving its overall coherence
- Typographic Text Generation with Off-the-Shelf Diffusion Model
- GlyphControl: Glyph Conditional Control for Visual Text Generation
=this=
- TextDiffuser: Diffusion Models as Text Painters
- TextDiffuser-2: two language models: for layout planning and layout encoding; before the unet
- 2.7.3
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation
- training-free framework to enhance layout generator and image generator conditioned on it
- generating images with long and rare text sequences
6.3.1. GENERATE VECTORS
- VecFusion: Vector Font Generation with Diffusion
- rasterized fonts then vector model synthesizes vector fonts
- StarVector: Generating Scalable Vector Graphics Code from Images
- CLIP image encoder, learning to align the visual and code tokens, generate SVGs
- StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
- encoding into stroke tokens, naturally compatible with LLMs
- SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
- creation of vector graphics depicting entire scenes from textual descriptions
- optimized using a pre-trained encoder
6.3.2. INPAINTING TEXT
6.3.2.1. DERIVED FROM SD
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (with training code)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model (Diff-Text)
- attention constraint to address unreasonable positioning, more accurate scene text, any language
- it's just a prompt and canny: “sign”, “billboard”, “label”, “promotions”, “notice”, “marquee”, “board”, “blackboard”, “slogan”, “whiteboard”, “logo”
- AnyText: Multilingual Visual Text Generation And Editing
=best=
- inputs: glyph, position, and masked image to generate latent features for text generation-editing
- text curved into shapes and textures
6.4. IMAGE RESTORATION, SUPER-RESOLUTION
- NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
- image signal processing pipeline; multiple blendable styles in a single network
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- refusion: Image Restoration with Mean-Reverting Stochastic Differential Equations
- image restoration IR, DDNM using NULL-SPACE
- unlimited superresolution
- SVNR: Spatially-variant Noise Removal with Denoising Diffusion
- real life noise fixing
- Dense Pixel-to-Pixel Harmonization via Continuous Image Representation
- fixes images stretched due to resolution changes
- Zero-Shot Image Harmonization with Generative Model Prior
- DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
- uses SwinIR, then refines with SD (diffusion upscaler sketch below)
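- Minimal diffusers sketch of diffusion-based refinement/super-resolution with the SD x4 upscaler (file paths are placeholders):
#+begin_src python
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16).to("cuda")

low_res = Image.open("low_res_128.png").convert("RGB")
out = pipe(prompt="a detailed photo of a red fox", image=low_res,
           num_inference_steps=25).images[0]
out.save("upscaled_512.png")
#+end_src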
6.4.1. SUPERRESOLUTION
- CCSR: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
- Swintormer: Image Deblurring based on Diffusion Models (limited memory)
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
- for videos, temporal adapter to ensure temporal coherence
- YONOS-SR: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- starts by training a teacher model on a smaller magnification scale
- one step instead of 200, with a fine-tuned decoder on top
- SUPIR: Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
- based on large-scale diffusion generative prior
- Face to Cartoon Incremental Super-Resolution using Knowledge Distillation
- faces and anime restoration at various levels of detail
- APISR: Anime Production Inspired Real-World Anime Super-Resolution
- Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model
- pyramid latent representation
6.4.1.1. STABLESR
- StableSR: Exploiting Diffusion Prior for Real-World Image Super-Resolution
- develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models
6.4.1.2. DEMOFUSION
- DemoFusion: Democratising High-Resolution Image Generation With No $$$
- achieve higher-resolution image generation
- Enhance This: DemoFusion SDXL
- ComfyUI Iterative Mixing Nodes
=best=
- iterative mixing of samples to help with upscaling quality
- SD 1.5 generating at higher resolutions
- evolution from NNLatentUpscale
- PASD MAGNIFY
- PASD Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
- image slider custom component
- PASD Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
6.5. DEPTH GENERATION
- depth map from diffusion, build a 3D environment with it (depth-estimation sketch at the end of this list)
- ZoeDepth: Combining relative and metric depth
- tiling ZoeDepth
- PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (70s inference)
- LDM3D by intel, generates image & depth from text prompts
- LDM3D-VR: Latent Diffusion Model for 3D VR
- generating depth together, panoramic RGBD
- DMD (Diffusion for Metric Depth)
- Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model (depth from image)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (temporal coherence, no flickering)
- GIBR
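- Minimal sketch of getting a depth map for conditioning or 3D use via the transformers depth-estimation pipeline (the DPT model id is one working example; Depth Anything checkpoints expose the same task):
#+begin_src python
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
image = Image.open("photo.png").convert("RGB")
result = depth_estimator(image)
result["depth"].save("depth_map.png")  # PIL image of per-pixel relative depth
#+end_src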
6.5.1. DEPTH DIFFUSION
6.5.2. NORMAL MAPS
- DSine: Rethinking Inductive Biases for Surface Normal Estimation
- better than bae and midas
- preprocessor