stable diffusion
Table of Contents
- 1. SD MODELS
- 2. GENERATION CONTROL
- 3. BETTER DIFFUSION
- 4. SAMPLERS
- 5. IMAGE EDITING
- 6. USE CASES
- parent: diffusion
- related: diffusion video software
- combining pipelines, creating pipelines
- generate: NOVEL VIEW
- how to apply classifier guidance to diffusion
1. SD MODELS
- CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
- CC-licensed images with BLIP-2 captions, similar performance to Stable Diffusion 2 (apache license)
- Terminus XL Gamma: simpler SDXL, for inpainting tasks, super-resolution, style transfer
- 5.2
- AnimateLCM-SVD-xt: image to video
- stable-cascade: würstchen architecture = even smaller latent space
- Stable-Cascade-FP16
- sd x8 compression (1024x1024 > 128x128) vs cascade x42 compression (1024x1024 > 24x24)
- faster inference, cheaper training
- STABLE DIFFUSION 3
1.1. DISTILLATION
- SSD-1B (distilled SDXL): 60% faster, -40% VRAM
- SDXL-Lightning: a lightning fast 1024px text-to-image generation model (few-steps generation)
- progressive adversarial diffusion distillation
1.1.1. ONE STEP DIFFUSION
- One-step Diffusion with Distribution Matching Distillation
- comparable with v1.5 while being 30x faster
- critic similar to GANs in that it is jointly trained with the generator
- differs in that it does not play an adversarial game, and can fully leverage a pretrained model
1.1.2. SDXS
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
- knowledge distillation to streamline the U-Net and image decoder architectures
- one-step DM training technique that utilizes feature matching and score distillation
- speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a GPU
- image-conditioned control, facilitating efficient image-to-image translation.
1.2. IRIS LUX
- https://civitai.com/models/201287
- model created through consensus via statistical filtering (novel consensus merge)
- https://gist.github.com/Extraltodeus/0700821a3df907914994eb48036fc23e
1.3. EMOJIS
- Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
- emojis, stickers
1.4. MERGING MODELS
1.4.1. SEGMOE
- SegMoE - The Stable Diffusion Mixture of Experts for Image Generation, Mixture of Diffusion Experts
- training free, creation of larger models on the fly, larger knowledge
2. GENERATION CONTROL
- 5.4.2 DRAG 5.3.2.2.2.1
- hyperparameters with extra network Mid-U Guidance
- block weights lora
- DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation
- using light hints to resynthesize a prompt with user-defined consistent lighting
- Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation
- refines the output iteratively in the latent space
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
- explicitly optimizing pixel-level cycle consistency between generated images
2.1. MATERIAL EXTRACTION
- U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation
- generates images with the material or color extracted from the input image
- sentence describing the desired attribute
- learn user-specified visual attributes
- ZeST: Zero-Shot Material Transfer from a Single Image
- leverages adapters to extract implicit material representation from exemplar image
2.2. LIGHT CONTROL
- DiffusionLight: Light Probes for Free by Painting a Chrome Ball
- render a chrome ball into the input image
- produces convincing light estimates
2.3. BACKGROUND
- BriaAI: Open-Source Background Removal (RMBG v1.4)
- LayerDiffusion: Transparent Image Layer Diffusion using Latent Transparency
- layers with alpha, generate pngs, remove backgrounds (more like generate with removable background)
- method learns a “latent transparency”
- models
2.4. EMOTIONS
- Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
- face swapping and reenactment, interpolate between emotions
- EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
- clip, abstract emotions
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
- understanding and editing source images emotions cues
2.5. NOISE CONTROL
- offset noise (darkness-capable LoRAs), pyramid noise (see the sketch at the end of this subsection)
- Common Diffusion Noise Schedules and Sample Steps are Flawed (and several proposed fixes)
- native offset noise
- noisy perlin latent
- you can reinject the same noise pattern after an upscale, more coherent results and better upscaling
- Blue noise for diffusion models
- allows introducing correlation across images within a single mini-batch to improve gradient flow
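- A minimal PyTorch sketch of the two noise tricks above (offset noise and pyramid noise); the shapes, strengths and renormalization are assumptions, not the exact recipes from the linked write-ups:
#+begin_src python
import torch
import torch.nn.functional as F

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    # add one random constant per sample/channel so the model can learn
    # overall brightness shifts (very dark or very bright images)
    b, c, _, _ = latents.shape
    noise = torch.randn_like(latents)
    return noise + strength * torch.randn(b, c, 1, 1, device=latents.device)

def pyramid_noise(latents: torch.Tensor, discount: float = 0.8, levels: int = 4) -> torch.Tensor:
    # mix in noise drawn at progressively coarser resolutions, upsampled back
    b, c, h, w = latents.shape
    noise = torch.randn_like(latents)
    for i in range(1, levels):
        lh, lw = max(1, h // 2**i), max(1, w // 2**i)
        coarse = torch.randn(b, c, lh, lw, device=latents.device)
        noise = noise + discount**i * F.interpolate(coarse, size=(h, w), mode="bilinear")
    return noise / noise.std()  # back to roughly unit variance

latents = torch.randn(2, 4, 64, 64)
print(offset_noise(latents).shape, pyramid_noise(latents).std())
#+end_src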
2.6. GUIDING FUNCTION
- Universal Guided Diffusion (face and style transfer)
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- extra: repo has list of deblurring, super-resolution and restoration methods
- masks as energy function
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- Diffusion Self-Guidance for Controllable Image Generation
- steer sampling, similarly to classifier guidance, but using signals in the pretrained model itself
- instructional transformations
- MCM Modulating Pretrained Diffusion Models for Multimodal Image Synthesis (module after denoiser) mmc
2.6.1. ADAPTIVE GUIDANCE
- Adaptive Guidance: Training-free Acceleration of Conditional Diffusion Models
- AG, an efficient variant of CFG (Classifier-Free Guidance); reduces computation by 25%
- omits network evaluations when the denoising process displays convergence
- the second half of the denoising process is largely redundant; plug-and-play alternative to Guidance Distillation (sketch below)
- LinearAG: entire neural-evaluations can be replaced by affine transformations of past estimates
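- Toy sketch of the idea behind AG: run full CFG early, then drop the unconditional pass once the trajectory has converged. `toy_unet`, the placeholder update and the 60% cut-off are assumptions, not the paper's exact criterion:
#+begin_src python
import torch

def toy_unet(x, t, cond):
    # stand-in denoiser; a real UNet forward pass would go here
    return x * 0.0 + cond.mean()

def guided_eps(unet, x, t, cond, uncond, scale=7.5, skip_uncond=False):
    eps_cond = unet(x, t, cond)
    if skip_uncond or scale == 1.0:
        return eps_cond  # single evaluation: roughly half the cost of full CFG
    eps_uncond = unet(x, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

x = torch.randn(1, 4, 64, 64)
cond, uncond = torch.randn(1, 77, 768), torch.zeros(1, 77, 768)
steps = 30
for i in range(steps):
    # keep CFG in the early steps, skip the unconditional pass in the late ones
    eps = guided_eps(toy_unet, x, i, cond, uncond, skip_uncond=(i > 0.6 * steps))
    x = x - 0.1 * eps  # placeholder update instead of a real scheduler step
#+end_src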
2.7. CONTROL NETWORKS, CONTROLNET
- REFERENCENET CONTROLNET FOR 3D 3.4.3.1 CONTROLNET VIDEO
- why controlnet, and alternatives: https://github.com/lllyasviel/ControlNet/discussions/188 (basic diffusers usage sketched at the end of this list)
- VisorGPT: Learning Visual Prior via Generative Pre-Training
- gpt that learns to transform normal prompts into controlnet primitives
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- geometric control via human pose images and appearance control via instance-level text prompts
- FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
- alignment with guidance image: lidar, face mesh, wireframe mesh, rag doll
- FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection
- instance-specific text description, better prompt following
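- A minimal diffusers sketch of driving generation with a ControlNet (model ids, dtype and the pre-computed edge-map path are assumptions; any SD 1.5 checkpoint works):
#+begin_src python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# pre-computed canny edge map of the reference picture (placeholder path)
control = Image.open("canny_edges.png").convert("RGB")
out = pipe("a futuristic city at dusk", image=control,
           num_inference_steps=25).images[0]
out.save("controlnet_canny.png")
#+end_src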
2.7.1. SKETCH
- diffmorph: text-less image morphing with diffusion models
- sketch-to-image module
- Block and Detail: Scaffolding Sketch-to-Image Generation
- sketch-to-image, can generate coherent elements from partial sketches, generate beyond the sketch following the prompt
- CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing
- one for contour, the other flow lines representing texture
2.7.2. ALTERNATIVES
- controlNet (total control of image generation, from doodles to masks)
- T2I-Adapter (lighter, composable), e.g. color palette control
- lora like (old) https://github.com/HighCWu/ControlLoRA
- ControlNet-XS: 1% of the parameters
- LooseControl: Lifting ControlNet for Generalized Depth Conditioning
- loosely specifying scenes with boxes
- controlnet-lllite by kohya
- SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- lightweight tuning module named SC-Tuner, synthesis by injecting different conditions
- reduces training parameters and memory requirements
- Integrated Into SCEPTER and SWIFT
- Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis
- imposing global semantics onto targeted regions without the use of any additional localization cues
- alternative to controlnet and t2i-adapter
2.7.3. TIP: text restoration
- TIP: Text-Driven Image Processing with Semantic and Restoration Instructions
=best=
- controlnet architecture, leverages natural language as interface to control image restoration
- instruction driven, can imprint text into the image
2.7.4. HANDS
- HANDS DATASET
- HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models
- two-hand interactions, motion in-betweening and trajectory control
2.7.4.1. RESTORING HANDS
- Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images
- body pose estimation to understand hand orientation for accurate anomaly correction
- integration of ControlNet and InstructPix2Pix
- HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting
- incorrect number of fingers, irregular shapes, effectively rectified
- utilize ControlNet modules to re-inject corrected information, 1.5
2.7.5. USING ATTENTION MAP
- 5.4.4 4.1 STORYTELLER DIFFUSION
- RIVAL: Real-World Image Variation by Aligning Diffusion Inversion Chain
=best=
2.7.5.1. MASA
- MasaCtrl: Tuning-free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
- same thing different views or poses
- by querying the attention map from another image
- better than ddim inversion, consistent SD animations; mixable with T2I-Adapter
- TI-GUIDED-EDIT
- Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
- rigid=conserve the structure
- Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance
2.7.5.2. LLLYASVIEL
- reference-only preprocessor doesn't require any control models; generates variations
- can guide the diffusion directly using images as references, and generate variations
- Guess Mode / Non-Prompt Mode, now named: Control Modes, how much prompt vs controlnet; comfy node
2.7.6. SEVERAL CONTROLS IN ONE
- UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
- several controlnets in one, contextual understanding
- image deblurring, image colorization
- using UniControl with Stable Diffusion XL 1.0 Refiner; sketch to image tool
- In-Context Learning Unlocked for Diffusion Models
- learn translation of image to hed, depth, segmentation, outline
2.8. HUMAN PAINT
- SDEdit: guided image synthesis and editing with stochastic differential equations
- stroke-based inpainting/editing (img2img with partial noise; see the sketch at the end of this subsection)
- FOOLSDEDIT: Deceptively Steering Your Edits Towards Targeted Attribute-aware Distribution
- forcing SDEdit to generate a data distribution aligned with a specified attribute (e.g. female)
- Control Color: Multimodal Diffusion-Based Interactive Image Colorization
- paint over grayscale to recolor it
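- SDEdit in practice is just img2img with a partial-noise strength: the strokes are noised part of the way and denoised back under the prompt. Minimal diffusers sketch; model id and file paths are assumptions:
#+begin_src python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

sketch = Image.open("rough_strokes.png").convert("RGB").resize((512, 512))
# strength ~0.4-0.7: low keeps the stroke layout, high lets the model repaint more
out = pipe("oil painting of a mountain lake", image=sketch,
           strength=0.6, guidance_scale=7.5).images[0]
out.save("sdedit_result.png")
#+end_src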
2.9. LAYOUT DIFFUSION
- 3d: ROOM LAYOUT
- 3.5.1 STORYTELLER DIFFUSION
- ZestGuide: Zero-shot spatial layout conditioning for text-to-image diffusion models
- implicit segmentation maps can be extracted from cross-attention layers
- adds spatial conditioning to SD without fine-tuning
- Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints
- constraints representing design intentions
- continuous state-space design can incorporate differentiable aesthetic constraint functions in training
- by introducing conditions via masked input
- RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models
- dynamically balance the strengths of the two models in denoising process
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models
- better representing spatial relationships
- faithfully follow the spatial relationships specified in the text prompt
2.9.1. SCENES
- Generate Anything Anywhere in Any Scene
- training guides to focus on object identity, personalized concept with localization controllability
- 2.10.3.2 ALDM
2.9.2. WITH BOXES
- GLIGEN: Open-Set Grounded Text-to-Image Generation (boxes)
- Training-Free Layout Control with Cross-Attention Guidance
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis
- BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
- InstanceDiffusion: Instance-level Control for Image Generation
- conditional generation, hierarchical bounding-boxes structure, feature (prompt) at point
- single points, scribbles, bounding boxes or segmentation masks
- Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
- bounding boxes with attribute(prompt) binding
2.9.3. ALDM
- ALDM: Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive
- layout faithfulness
2.9.4. OPEN-VOCABULARY
- Spatial-Aware Latent Initialization for Controllable Image Generation
- inverted reference image contains spatial awareness regarding positions, resulting in similar layouts
- open-vocabulary framework to customize a spatial-aware initialization
2.9.5. CARTOON
- Desigen: A Pipeline for Controllable Design Template Generation
- generating images with proper layout space for text; generating the template itself
2.9.5.1. COGCARTOON
- CogCartoon: Towards Practical Story Visualization
- plugin-guided and layout-guided inference; specific character = 316 KB plugin
2.10. IMAGE PROMPT - ONE IMAGE
2.10.1. UNET LESS
- ProFusion: Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach
- and can interpolate between two
- promptnet (embedding), encoder based, for style transform
- one image, no regularization needed
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
- using CLIP features extracted from the subject
2.10.2. IMAGE-SUGGESTION
- 5.4.1.1
- UMM-Diffusion, TIUE: Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
- takes joint texts and images
- only the image-mapping to a pseudo word embedding is learned
2.10.2.1. ZERO SHOT
- Context Diffusion: In-Context Aware Image Generation
- separates the encoding of the visual context; prompt not needed
- ReVision - Unclip https://comfyanonymous.github.io/ComfyUI_examples/sdxl/
- Revision gives the model the pooled output from CLIPVision G instead of the CLIP G text encoder
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
- architecture designed for selectively capturing any subject from single or multiple reference images
- IP-ADAPTER
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
=stock SD=
- works with other controlnets (basic usage sketched at the end of this subsection)
- IP-Adapter-FaceID (face recognition model)
- LCM-LOOKAHEAD
- LCM-Lookahead for Encoder-based Text-to-Image Personalization
- LCM-based approach for propagating image-space losses to personalization model training and classifier guidance
- LCM-Lookahead for Encoder-based Text-to-Image Personalization
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
- SEECODERS
- Seecoders: Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
- Semantic Context Encoder, replaces clip with seecoder; works with
=stock SD=
- input image and controlnet
- unlike unCLIP, seecoders uses an extra model
- one image into several perspectives (MULTIVIEW DIFFUSION)
- the embeddings can be textures, effects, objects, semantics(contexts)
- Seecoders: Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
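- Minimal diffusers sketch of image prompting with IP-Adapter (needs a recent diffusers; the repo/weight names follow the h94/IP-Adapter layout and the reference path is a placeholder):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)  # how strongly the reference image steers the result

ref = Image.open("reference_style.png").convert("RGB")
out = pipe("a cat in a garden", ip_adapter_image=ref,
           num_inference_steps=30).images[0]
out.save("ip_adapter_cat.png")
#+end_src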
2.10.2.2. PERSONALIZATION
- InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
- personalized images with only a single forward pass
- HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models; just one image
2.10.3. IDENTITY
- masked score estimation
- HiPer: Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion
- one image single thing, gets the clip
- 2.10.2.1.1
2.10.3.1. STORYTELLER DIFFUSION
- ConsiStory: Training-Free Consistent Text-to-Image Generation
- training-free approach for consistent subject (object) generation, 20x faster, multi-subject scenarios
- by sharing the internal activations of the pretrained model
2.10.3.2. ANYDOOR
- AnyDoor: Zero-shot Object-level Image Customization
- teleport target objects to new scenes at user-specified locations
- identity feature with detail feature
- moving objects, swapping them, multi-subject composition, try-on a cloth
2.10.3.3. SUBJECT
- Inserting Anybody in Diffusion Models via Celeb Basis
- one facial photograph, 1024 learnable parameters, 3 minutes; several at once
- Subject-Diffusion:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
- multi subject, single reference image
- PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models
- incorporates facial identity loss, single facial photo, single training phase
- The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
- FaceStudio: Put Your Face Everywhere in Seconds
=best=
- direct feed-forward mechanism, circumventing the need for intensive fine-tuning
- stylized images, facial images, and textual prompts to guide the image generation process
- SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation
- face-wise attention loss to fit the face region
- IDENTITY IN VIDEO
- Magic-Me: Identity-Specific Video Customized Diffusion
- STABLEIDENTITY
- StableIdentity: Inserting Anybody into Anywhere at First Sight
- identity recontextualization with just one face image without finetuning
- also for into video/3D generation
- StableIdentity: Inserting Anybody into Anywhere at First Sight
- IDENTITY ZERO-SHOT
- InstantID: Zero-shot Identity-Preserving Generation in Seconds (using face encoder)
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding Paper page
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
=best=
- identity provided by the reference image while mitigating interference from textual input
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding Paper page
- CapHuman: Capture Your Moments in Parallel Universes
- encode then learn to align, identity preservation for new individuals without tuning
- SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
=best=
- Token-to-Patch Aligner = preserving fine features of the subjects; multiple subjects
- combinable with controlnet, and across styles
- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
- gradually narrowing to the specific subject, iteratively update the influence scope
- InstantID: Zero-shot Identity-Preserving Generation in Seconds (using face encoder)
- PHOTOMAKER
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
- encodes (via an MLP) images into an embedding which preserves ID
- PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
2.10.3.4. ANIME
- DreamArtist: a single image and target text (mainly works with anime)
- DreamTuner: Single Image is Enough for Subject-Driven Generation
- subject-encoder for coarse subject identity preservation, training-free
- DreamTuner: Single Image is Enough for Subject-Driven Generation
- pfg Prompt free generation; learns to interpret (anime) input-images
- old one: PaintByExample
3. BETTER DIFFUSION
- editing default of a prompt: https://github.com/bahjat-kawar/time-diffusion
- Self-Attention Guidance (SAG): SAG leverages intermediate attention maps of diffusion models at each iteration to capture essential information for the generative process and guide it accordingly
- pretty much just reimplemented the attention function without changing much else
- FreeU: Free Lunch in Diffusion U-Net (unet)
=best=
- improves diffusion model sample quality at no costs
- more color variance
- Diffusion Sampling with Momentum for Mitigating Divergence Artifacts
- incorporation of: Heavy Ball (HB) momentum = expand stability regions; Generalized HB (GHVB) = suppression of divergence artifacts
- better low step sampling
- DG: Detector Guidance for Multi-Object Text-to-Image Generation
- mid-diffusion, performs latent object detection then enhances following CAMs(cross-attention maps)
3.1. SCHEDULER
- simple diffusion: End-to-end diffusion for high resolution images
- shifted scheduled noise
- Sigmas Tools and The Golden Scheduler
3.2. QUALITY
- 3.6.2
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (dataset method)
- guide pre-trained model to exclusively generate good images
- HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
- Latent Structural Diffusion Model that simultaneously denoises depth and surface normal with RGB image
- Consistency Distilled Diff VAE
- Improved decoding for stable diffusion vaes
3.3. HUMAN FEEDBACK
- RLCM
- Aligning Text-to-Image Models using Human Feedback https://arxiv.org/abs/2302.12192
- Better Aligning Text-to-Image Models with Human Preference
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
- FABRIC: Personalizing Diffusion Models with Iterative Feedback
- training-free approach, exploits the self-attention layer
- improve the results of any Stable Diffusion model
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
- Direct Preference for Denoising Diffusion Policy Optimization (D3PO)
- omits training a reward model
- Diffusion-DPO: Diffusion Model Alignment Using Direct Preference Optimization (training script)
- ALDM layout
- RL Diffusion: Large-scale Reinforcement Learning for Diffusion Models (improves pretrained)
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
=best=
- better training stability for unseen prompts
- reward difference of generated image pairs from their denoising trajectories
- MESH HUMAN FEEDBACK
3.3.1. ACTUALLY SELF-FEEDBACK
- SPIN-Diffusion: Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
=best=
- diffusion model engages in competition with its earlier versions, iterative self-improvement
- AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
- use Vision Models (VLM) to assess quality across style, coherence, and aesthetics, generating feedback
3.4. SD GENERATION OPTIMIZATION
- ONE STEP DIFFUSION 4 STABLE CASCADE
- turning off CFG when denoising sigmas below 1.1
- Tomesd: Token Merging for Stable Diffusion code
- ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
- token downsampling of key and value tokens to accelerate inference 2x-4x
- ToDo: Token Downsampling for Efficient Generation of High-Resolution Images
- Nested Diffusion Processes for Anytime Image Generation
- can generate viable images when stopped arbitrarily before completion
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping
- use sd as teacher model and train faster one using it as bootstrap; 30 fps
- Divide & Bind Your Attention for Improved Generative Semantic Nursing
- novel objective functions: can handle complex prompts with proper attribute binding
- Conditional Diffusion Distillation
- added parameters, supplementing image conditions to the diffusion priors
- super-resolution, image editing, and depth-to-image generation
- 4 2.6.1
- OneDiff: acceleration library for diffusion models, ComfyUI Nodes
- T-Stitch: Accelerating Sampling in Pre-trained Diffusion Models with Trajectory Stitching
- improve sampling efficiency with no generation degradation
- smaller DPM in the initial steps, larger DPM at a later stage, 40% of the early timesteps
- The Missing U for Efficient Diffusion Models
- operates with approximately a quarter of the parameters, diffusion models 80% faster
3.4.1. ULTRA SPEED
- SDXL Turbo: a real-time text-to-image generation model (distillation; one-step usage sketched at the end of this subsection)
- ArtSpew: SD at 149 images per second (high volume random image generation)
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation (10ms)
- transforms sequential denoising into batched denoising
- MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
- diffusion-GAN finetuning techniques to achieve 8-step and 1-step inference
- Accelerating Diffusion Sampling with Optimized Time Steps
- better image quality compared to using uniform time steps
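- Minimal diffusers sketch of one-step sampling with SDXL Turbo: guidance is disabled and a single inference step is used (model id per stabilityai; output path assumed):
#+begin_src python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16").to("cuda")

out = pipe("a cinematic photo of a lighthouse in a storm",
           num_inference_steps=1, guidance_scale=0.0).images[0]
out.save("sdxl_turbo.png")
#+end_src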
3.4.2. CACHE
- DeepCache: Accelerating Diffusion Models for Free
=best=
- exploits temporal redundancy observed in the sequential denoising steps
- superiority over existing pruning and distillation
- Cache Me if You Can: Accelerating Diffusion Models through Block Caching
- reuse outputs from layer blocks of previous steps, automatically determine caching schedules
- Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models
=best=
- reuse cyclically the encoder features in the previous time-steps for the decoder
- Fast Inference Through The Reuse Of Attention Maps In Diffusion Models
- structured reuse of attention maps during sampling
- T-GATE: Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models
- two stages: semantics-planning phase, and subsequent fidelity-improving phase
- so the cross-attention output is cached once it converges and kept fixed during the remaining inference
3.4.2.1. EXPLOITING FEATURES
- FRDiff: Feature Reuse for Exquisite Zero-shot Acceleration of Diffusion Models
- Reusing feature maps with high temporal similarity
- Clockwork Diffusion: Efficient Generation With Model-Step Distillation
- high-res features sensitive to small perturbations; low-res feature only sets semantic layout
- so reuses computation from preceding steps for low-res
3.4.3. LCM
- LCMs: Latent Consistency Models: Synthesizing High-Resolution Images with Few-step Inference
- inference with minimal steps (2-4)
- training LCM model: only 32 A100 GPU hours
- Latent Consistency Fine-tuning (LCF) custom datasets
- comfyui auto1111 the model
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
- universally applicable accelerator for diffusion models, plug-in neural PF-ODE solver (usage sketched at the end of this subsection)
- VideoLCM: Video Latent Consistency Model
- smooth video synthesis with only four sampling steps
- ANIMATELCM
- Quick Image Variations with LCM and Image Caption
- TCD: Trajectory Consistency Distillation (lora)
- accurately trace the entire trajectory of the Probability Flow ODE
- https://github.com/dfl/comfyui-tcd-scheduler
- LCM-LOOKAHEAD
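- Minimal diffusers sketch of the LCM-LoRA accelerator: swap in the LCM scheduler, load the LoRA, and sample in ~4 steps with guidance near 1 (the LoRA id is the published latent-consistency one; the base checkpoint is an assumption):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

out = pipe("a watercolor fox in a forest",
           num_inference_steps=4, guidance_scale=1.0).images[0]
out.save("lcm_lora_fox.png")
#+end_src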
3.4.3.1. CCM
- CCM: Adding Conditional Controls to Text-to-Image Consistency Models
- ControlNet-like, lightweight adapter can be jointly optimized while consistency training
3.4.3.2. PERFLOW
- PeRFlow (Piecewise Rectified Flow)
- fast generation, 4 steps, 4,000 training iterations
- multiview normal maps and textures from text prompts instantly
3.5. PROMPT CORRECTNESS
- ReCo: region control, counting donuts
- sd-webui-cutoff, hide tokens for each separated group, limits the token influence scope (color control)
- hard-prompts-made-easy
- magic prompt: amplifies-improves the prompt
- Get What You Want, Not What You Don’t: Image Content Suppression for Text-to-Image Diffusion Models
- suppress unwanted content generation of the prompt, and encourages the generation of desired content
- better than negative prompts
- Discriminative Probing and Tuning for Text-to-Image Generation
- discriminative adapter to improve their text-image alignment
- global matching and local grounding
- CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
- fine-tuning strategy with an image-to-text(captioning model) concept matching mechanism
- [[https://youtu.be/_Pr7aFkkAvY?si=Xr5e_RL-rwcdL10q][ELLA]] - A Powerful Adapter for Complex Stable Diffusion Prompts
- uses an adapter to plug in an LLM instead of CLIP
3.5.1. ATTENTION LAYOUT
- Attend-and-Excite (excite the ignored prompt tokens) (no retrain)
- Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis
- Directed Diffusion: Direct Control of Object Placement through Attention Guidance (no retrain) repo
- DenseDiffusion: Dense Text-to-Image Generation with Attention Modulation
- training free, layout guidance
3.5.2. LANGUAGE ENHANCEMENT
- 5.4.1.2
- Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
- using prompt sentence structure during inference to improve the faithfulness
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
- exploiting the semantic hierarchies of language sentences (lojban)
- Structured Diffusion Guidance, language enhanced clip enforces on unet
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
- prompt learning, improve the matches between the input text and the generated
3.5.2.1. PROMPT EXPANSION, PROMPT AUGMENTATION
- DanTagGen: LLaMA arch
- superprompter: Supercharge your AI/LLM prompts
- Capability-aware Prompt Reformulation Learning for Text-to-Image Generation
- effectively learn diverse reformulation strategies across various user capacities to simulate high-capability user reformulation
3.5.2.2. TOKENCOMPOSE
- TokenCompose: Grounding Diffusion with Token-level Supervision
=best=
- finetuned with token-wise grounding objectives for multi-category instance composition
- exploiting binary segmentation maps from SAM
- compositions that are unlikely to appear simultaneously in a natural scene
3.6. BIGGER COHERENCE
- 5.3.1.1 VIDEO COHERENCE
- Many-to-many Image Generation with Auto-regressive Diffusion Models
3.6.1. PANORAMAS
- DiffCollage: Parallel Generation of Large Content with Diffusion Models (panoramas)
- Collaborative Score Distillation for Consistent Visual Synthesis
- consistent visual synthesis across multiple samples
=best one=
- distill generative priors over a set of images synchronously
- zoom, video, panoramas
- consistent visual synthesis across multiple samples
- SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
- plug-and-play module that synchronizes multiple diffusions through gradient descent on a perceptual similarity loss (a toy tile-averaging sketch follows this list)
- Taming Stable Diffusion for Text to 360° Panorama Image Generation
- minimize distortion during the collaborative denoising process
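- Toy sketch of the tile-synchronization idea used by MultiDiffusion/SyncDiffusion-style panorama methods: each step denoises overlapping windows and averages them back into one wide latent (`denoise_window` is a stand-in, not a real UNet/scheduler step):
#+begin_src python
import torch

def denoise_window(window: torch.Tensor, step: int) -> torch.Tensor:
    return window * 0.98  # placeholder for one real UNet + scheduler step

def panorama_step(latent: torch.Tensor, step: int, win: int = 64, stride: int = 32):
    _, _, _, w = latent.shape
    acc, weight = torch.zeros_like(latent), torch.zeros_like(latent)
    for x0 in range(0, w - win + 1, stride):
        acc[:, :, :, x0:x0 + win] += denoise_window(latent[:, :, :, x0:x0 + win], step)
        weight[:, :, :, x0:x0 + win] += 1
    return acc / weight.clamp(min=1)  # average the overlapping predictions

latent = torch.randn(1, 4, 64, 192)  # a 3x-wide panorama latent
for step in range(30):
    latent = panorama_step(latent, step)
#+end_src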
3.6.1.1. OUTPAINTING
- 5.3.2.2.2
- Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach
- generate content beyond boundaries using relative positional information
- BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
- pre-trained SD model, useful in product exhibitions, virtual try-on, or background replacement
3.6.2. RESOLUTION
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
- training on images of unlimited sizes is unfeasible
- Fast Seamless Tiled Diffusion (FSTD)
- ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models (video too)
- generating images at much higher resolutions than the training image sizes
- does not require any training or optimization
- Matryoshka Diffusion Models
- diffusion process that denoises inputs at multiple resolutions jointly
- FIT TRANSFORMER
- Upsample Guidance: Scale Up Diffusion Models without Training
- technique that adapts pretrained model to generate higher-resolution images by adding a single term in the sampling process, without any additional training or relying on external models
- can be applied to various models, such as pixel-space, latent space, and video diffusion models
3.6.2.1. ARBITRARY
- ElasticDiffusion: Training-free Arbitrary Size Image Generation
- decoding method better than MultiDiffusion
- ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models
- unlike post-process, directly generates images with the dynamical resolution
- compatible with ControlNet, IP-Adapter and LCM-LoRA; can be integrated with ElasticDiffusion
4. SAMPLERS
- GENIE: Higher-Order Denoising Diffusion Solvers
- faster diffusion equation?
- DDIM vs GENIE
- 4x less expensive upsampling
- fastest solver https://arxiv.org/abs/2301.12935
- another accelerator: https://arxiv.org/abs/2301.11558
- unipc sampler (sampling in 5 steps; scheduler-swap sketch at the end of this section)
- smea: (nai) global attention sampling
- Karras schedule anti-blur improvement (reddit)
- DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
- several coefficients efficiently computed on the pretrained model, faster
- STABLESR novel approach
- 5.2.3: controls intensity of style
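- In diffusers, changing the sampler is just swapping the scheduler object; a minimal sketch with UniPC at a low step count (model id and step count are assumptions):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

out = pipe("an isometric castle at sunrise", num_inference_steps=8).images[0]
out.save("unipc_castle.png")
#+end_src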
5. IMAGE EDITING
- 3D-AWARE IMAGE EDITING
- null-text inversion: prompt-to-prompt but better
- imagic: editing photo with prompt
5.1. IMAGE SCULPTING =best=
- Image Sculpting: Precise Object Editing with 3D Geometry Control
- enables direct interaction with their 3D geometry
- pose editing, translation, rotation, carving, serial addition, space deformation
- turned into nerf using Zero-1-to-3, then returned to image including features
- enables direct interaction with their 3D geometry
5.2. STYLE
- StyleDrop: Text-to-Image Generation in Any Style (muse architecture)
- 1% of parameters (painting style)
- PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
- learnable style word vectors, style-content features to be located nearby
- Zero-shot Generative Model Adaptation via Image-specific Prompt Learning
- adapt style to concept
- StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation
- process the prompt and style features separately
- DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models
- textual embedding with style guidance
- Cross-Image Attention for Zero-Shot Appearance Transfer
- zero-shot appearance transfer by building on the self-attention layers of image diffusion models
- architectural transfer
- STYLECRAFTER transfer to video
- Style Aligned Image Generation via Shared Attention
=best=
- (as controlnet extension) color palette too
- FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models
- style transfer built upon sd, dual-stream encoder and single-stream decoder architecture
- content into pixelart, origami, anime
- PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
- image from segmentation map and also using semantic features
- Visual Style Prompting with Swapping Self-Attention
- consistent style across generations
- unlike others (ip-adapter) disentangle other semantics away (like pose)
- DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
=best=
- decouple the style and semantics of reference images
- optimal balance between the text controllability and style similarity
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
- decouples style and content from reference images within the feature space
- DreamWalk: Style Space Exploration using Diffusion Guidance
- decompose the text prompt into conceptual elements, apply a separate guidance for each element
- LCM-LOOKAHEAD
5.2.1. B-LoRA
- Implicit Style-Content Separation using B-LoRA
- preserving its underlying objects, structures, and concepts
- LoRA of two specific blocks
- image style transfer, text-based stylization, consistent style generation, and style-content mixing
5.2.2. STYLE TOOLS
- Measuring Style Similarity in Diffusion Models
- compute similarity score
5.2.3. DIRECT CONSISTENCY OPTIMIZATION
- DCO: Direct Consistency Optimization for Compositional Text-to-Image Personalization
- minimally fine-tuning pretrained to achieve consistency
- new sampling method that controls the tradeoff between image fidelity and prompt fidelity
5.3. REGIONS
- different inpainting ways with diffusers: https://github.com/huggingface/diffusers/pull/1585 (basic inpaint pipeline sketched at the end of this list)
- SceneComposer: paint with words but cooler
- bounding boxes instead: GLIGEN: image grounding
- better VAE and better masks: https://lipurple.github.io/Grounded_Diffusion/
- InstructGIE: Towards Generalizable Image Editing
- leveraging the VMamba Block, aligns language embeddings with editing semantics
- editing instructions dataset
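- Minimal diffusers inpainting sketch (white mask pixels are regenerated from the prompt, black ones are kept; model id and file paths are assumptions):
#+begin_src python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

image = Image.open("room.png").convert("RGB").resize((512, 512))
mask = Image.open("sofa_mask.png").convert("L").resize((512, 512))
out = pipe("a green velvet sofa", image=image, mask_image=mask,
           num_inference_steps=30).images[0]
out.save("inpainted_room.png")
#+end_src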
5.3.1. REGIONS MERGE
- MULTIPLE DIFFUSION BIGGER COHERENCE 5.3.2.1 MULTIPLE LORA
- MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
- blending the predicted noises of two diffusion models in a saliency-aware manner (composite)
- Text2Layer: Layered Image Generation using Latent Diffusion Model
- train an autoencoder to reconstruct layered images and train models on the latent representation
- generate background, foreground, layer mask, and the composed image simultaneously
- Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
- bind each attachment to corresponding subjects separately with split text prompts
- object segmentation to obtain the layouts of subjects, then isolate and resynthesize individually
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
- bounded attention, training-free method; bounding information flow in the sampling process
- prevents leakage, promotes each subject’s individuality, even with complex multi-subject conditioning
5.3.1.1. INTERPOLATION
- Latent Blending (interpolate latents; slerp sketch at the end of this subsection)
- latent couple, multidiffusion, attention couple
- comfy ui like but masks
- Interpolating between Images with Diffusion Models
- convincing interpolations across diverse subject poses, image styles, and image content
- Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
=best=
- steady change in the output image, plug-and-play Smooth-LoRA; best interpolation
- perhaps for video or drag diffusion
- OMG: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
- integrate multiple concepts within a single image
- combined with LoRA and InstantID
- DIFFMORPHER
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
=best=
- alternative to gan; interpolate between their loras (not just their latents)
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
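- Minimal sketch of latent interpolation: spherical interpolation (slerp) between two initial noise latents stays on the Gaussian shell and usually blends better than a straight lerp; each frame would then be denoised with the same prompt and settings (function and shapes are assumptions):
#+begin_src python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    # interpolate along the great circle between the two flattened latents
    a_flat, b_flat = a.flatten(1), b.flatten(1)
    cos = (a_flat * b_flat).sum(-1) / (a_flat.norm(dim=-1) * b_flat.norm(dim=-1) + eps)
    omega = torch.acos(cos.clamp(-1 + eps, 1 - eps))
    coef_a = torch.sin((1 - t) * omega) / torch.sin(omega)
    coef_b = torch.sin(t * omega) / torch.sin(omega)
    return coef_a.view(-1, 1, 1, 1) * a + coef_b.view(-1, 1, 1, 1) * b

z0, z1 = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
frames = [slerp(z0, z1, float(t)) for t in torch.linspace(0, 1, 9)]
#+end_src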
5.3.2. MINIMAL CHANGES
- SEMANTICALLY DEFORMED
- Delta Denoising Score: minimal modifications, keeping the image
5.3.2.1. HARMONIZATION
- 5.3.1
- SEELE: Repositioning The Subject Within Image
- minimal changes like moving people, subject removal, subject completion and harmonization
- Collage Diffusion (harmonize collaged images)
- Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
- given a coarsely edited image (cut and move blob), synthesizes a photorealistic output
- SWAPANYTHING
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
- keeping the context unchanged (e.g. textures, clothes)
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing
5.3.2.2. REGION EXCHANGE
- VIDEO EXCHANGE 5.3.2.1.1
- RDM-Region-Aware-Diffusion-Model edits only the region of interest
- MagicMix: merges their noise shapes
- Blended Latent Diffusion
- input image and a mask, modifies the masked area according to a guiding text prompt
- SUBJECT SWAPPING
- Photoswap: Personalized Subject Swapping in Images
- LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping
- BETTER INPAINTING
- 3.6.1.1
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- inpainting model: context-aware image and shape-guided object inpainting, object removal, controlnet
- ReplaceAnything as you want: Ultra-high quality content replacement
- masked region is strictly retained
- 3.6.1.1
- DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior
- good proportions, (clothes) texture quality, no limb distortions
- StrDiffusion: Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting
- semantically sparse structure in early stage, dense texture in late stage
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
- MAPPED INPAINTING
- Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators
- DIFFERENTIAL DIFFUSION
- Differential Diffusion: Giving Each Pixel Its Strength
=best=
- control of the extent to which individual objects are modified, or the ability to introduce gradual spatial changes
- using change maps: a grayscale map of how much each region is allowed to change
- Differential Diffusion: Giving Each Pixel Its Strength
- CLOTHES OUTFITS
- Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- virtually place any e-commerce item in any setting
- Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All
- PIX2PIX REGION
- FORCE IT WHERE IT FITS
- MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- no training or finetuning; instead force the prompt (exchange the noise)
- PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance
- forces input image into edited image, object-level
- MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
- PROMPT IS TARGET
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
- only changes where the prompt fits
- Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
- erasing unwanted pixels; estimates which object to be removed
- HIVE: Harnessing Human Feedback for Instructional Visual Editing (reward model)
- rlhf, editing instruction, to get output to adhere to the correct instructions
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models
- do not require specified regions or additional text input
- clustering technique = segmentation maps; without re-training and fine-tuning
- DDIM
- MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
=best=
- prompt redescription strategy, revised DDIM inversion
- Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
- better DDIM
- ReNoise: Real Image Inversion Through Iterative Noising
- building on reversing the diffusion sampling process to manipulate an image
- MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
5.3.2.3. SEMANTIC CHANGE - DETECTION
- sega semantic guidance, (apply a concept arithmetic after having a generation)
- EDICT: repo Exact Diffusion Inversion via Coupled Transformations
- edits-changes object types(dog breeds)
- adds noise, complex transformations but still getting perfect invertibility
- The Hidden Language of Diffusion Models
- learning interpretable pseudotokens from interpolating unet concepts
- useful for: single-image decomposition to tokens, bias detection, and semantic image manipulation
- SWAP PROMPT
- 2.7.5 2.7.5.1.1
- LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
- prompt changing, minimal variations
- LEDITS++, an efficient, versatile & precise textual image manipulator
=best=
- no tuning, no optimization, few diffusion steps, multiple simultaneous edits
- architecture-agnostic, masking for local changes; building on SEGA
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
- preserve the object-like attention maps after editing
5.3.2.4. INSTRUCTIONS
- other: 5.3.2.2.3 GUIDING FUNCTION 2.7.3
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
- InstructPix2Pix paper
- MegaEdit: like instructPix2Pix but for any model
- based on EDICT and plug-and-play but using DDIM
- MegaEdit: like instructPix2Pix but for any model
- IMAGE INSTRUCTIONS
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- example images as style, boundary, edges, sketch
- ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
- a pair of images as visual instructions
- instruction learning as inpainting problem, useful for pose transfer, image translation and video inpainting
- Instruct-Imagen: Image Generation with Multi-modal Instruction
- IMAGE TRANSLATION
- 2.7.6 3.4.3.1 MESH TO MESH SDXS
- DRAG DIFFUSION dragging two points on the image
- Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
- IIDM: Image-to-Image Diffusion Model for Semantic Image Synthesis
- One-Step Image Translation with Text-to-Image Models
- adapting a single-step diffusion model; preserve the input image structure
- INTO MANGA
- Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models
- normal generation into manga style while fixing the light anomalies (actually looks like manga)
- fixes the tones
- Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models
- ARTIST EDITING
- SLIME
- SLiMe: Segment Like Me
- extract attention maps, learn about segmented region, then inference
- SLiMe: Segment Like Me
- EXPLICIT REGION
- X-Decoder: instructPix2Pix per region(objects)
- comparable to 1.1
- PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models (region editing)
- X-Decoder: instructPix2Pix per region(objects)
5.4. SPECIFIC CONCEPTS
- 1
- ConceptLab: Creative Generation using Diffusion Prior Constraints
- generate a new, imaginary concept; adaptively constraints-optimization process
- SeedSelect: rare concept images, generation of uncommon and ill-formed concepts
- selecting suitable generation seeds from few samples
- E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance
=best=
- preserving the semantic structure
5.4.1. CONTEXT LEARNING
- DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data
- keep the relative distances between adapted samples to achieve generation diversity
- SuTi: Subject-driven Text-to-Image Generation via Apprenticeship Learning (using examples)
5.4.1.1. SEMANTIC CORRESPONDENCE
- Unsupervised Semantic Correspondence Using Stable Diffusion
=no training=
=from other image=
- find locations in multiple images that have the same semantic meaning
- optimize prompt embeddings for maximum attention on the regions of interest
- capture semantic information about location, which can then be transferred to another image
5.4.1.2. IMAGE RELATIONSHIPS
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning
- preserves the hyperspherical energy of the pairwise neuron relationship
- preserves semantic coherence (relationships)
- TOKENCOMPOSE
- VERBS
- ReVersion: Diffusion-Based Relation Inversion from Images
- like putting images on materials
- unlike inverting object appearance, inverting object relations
- ADI: Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
- learn action-specific identifiers from the exemplar images ignoring appearances
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
- concepts that can interact with other concepts, using masks to teach
- ReVersion: Diffusion-Based Relation Inversion from Images
5.4.2. EXTRA PRETRAINED
- GUIDING FUNCTION 2.10.3.3.2
- E4T-diffusion: Tuning encoder: the text embedding + offset weights (Needs a >40GB GPU ) (faces)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
- learned in 40 steps vs Textual Inversion 3000
- Subject-driven Style Transfer, Subject Interpolation
- concept replacement
- Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
5.4.2.1. UNDERSTANDING NETWORK
- Elite: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation
- 2.10.3.3.3 faces
5.4.3. SEVERAL CONCEPTS
- MULTIPLE DIFFUSION
- Expressive Text-to-Image Generation with Rich Text (learn concept-map from maxed averages)
- Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
- sequentially learned concepts
- Break-A-Scene: Extracting Multiple Concepts from a Single Image
- Key-Locked Rank One Editing for Text-to-Image Personalization
- combine individually learned concepts into a single generated image
- Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
- solving concept conflicts
5.4.4. CONES
- Cones: Concept Neurons in Diffusion Models for Customized Generation (better than Custom Diffusion)
- index only the locations in the layers that give rise to a subject, add them together to include multiple subjects in a new context
- Cones 2: Customizable Image Synthesis with Multiple Subjects
- flexible composition of various subjects without any model tuning
- learning an extra embedding on top of a regular text embedding, and using layout to compose
5.4.5. SVDIFF
- SVDiff: Compact Parameter Space for Diffusion Fine-Tuning, code(soon)
- multisubject learning, like D3S
- personalized concepts, combinable; training gan out of its conv
- Singular Value Decomposition (SVD) = gene coefficient vs expression level
- CoSINE: Compact parameter space for SINgle image Editing (remove from prompt after finetune it)
- DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
- its PEFT for diffusion
5.4.6. LIKE ORIGINAL ONES
- 2 passes to make bigger: Standard High-Res fix or Deep Shrink High-Res Fix (kohya)
- VeRA: Vector-based Random Matrix Adaptation
- single pair of low-rank matrices shared across all layers and learning small scaling vectors instead
- 10x less parameters
- An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
- Multi-Concept Prompt Learning (MCPL)
- disentangled concepts with enhanced word-concept correlation
- X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
- feature remapping from SD 1.5 to SDXL for all loras and controlnets
- so you can train at lower resources and map to higher
- 2.9.5.1
- 2.1 : learning text embeddings for each layer of the unet
- PALP: Prompt Aligned Personalization of Text-to-Image Models
- input: image and prompt
- display ALL the tokens, not just some
- PALP: Prompt Aligned Personalization of Text-to-Image Models
- λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
- DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image
- (Personalization for Kandinsky) trained using projection loss and clip contrastive loss
- plug-in method that does semantic matching instead of replacement-disruption
- UniHDA: A Unified and Versatile framework for generative Hybrid Domain Adaptation
- blends all characteristics at once, maintains robust cross-domain consistency
5.4.6.1. TARGETING CONTEXTUAL CONSISTENCY
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization
- approach to boost identity consistency and generative diversity for personalization methods
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding
- class-characterizing regularization to preserve prior knowledge of object classes, so it integrates seamlessly with existing concepts
5.4.6.2. LORA
- lora, lycoris, loha, lokr
- loha handles multiple-concepts better
- use regularization images with lora https://rentry.org/59xed3#regularization-images
- GLORA: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning
- individual adapter of each layer
- superior accuracy with fewer parameters and computations
- PEFT x Diffusers Integration (multi-LoRA loading sketched at the end of this subsection)
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
- 13% of the parameters of LoRA, parameter efficiency
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models, plug and play
=best=
- concept sliders that enable precise control over attributes
- intuitive editing of visual concepts for which textual description is difficult
- repair of object deformations and fixing distorted hands
- ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
- cheaply and effectively merge independently trained style and subject LoRAs
- DoRA: Weight-Decomposed Low-Rank Adaptation
- decomposes the pre-trained weight into two components, magnitude and direction; directional updates
- DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model
- Kronecker product-based adaptation, reduces the parameter count by up to 35% lora
- 5.2.1
- CAT: Contrastive Adapter Training for Personalized Image Generation
- no loss of diversity in object generation, no token = no effect
- MULTIPLE LORA
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- scalable serving of many LoRA adapters, all adapters in the main memory, fetches for the current queries
- MultiLoRA: Democratizing LoRA for Better Multi-Task Learning
- changes parameter initialization of adaptation matrices to reduce parameter dependency
- Orthogonal Adaptation for Modular Customization of Diffusion Models
- customized models can be summed with minimal interference, and jointly synthesize
- scalable customization of diffusion models by encouraging orthogonal weights
- Multi-LoRA Composition for Image Generation
- CLoRA: A Contrastive Approach to Compose Multiple LoRA Models
- enables the creation of composite images that truly reflect the characteristics of each LoRA
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
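- Minimal diffusers/PEFT sketch of loading and blending independently trained LoRAs (adapter names, paths and weights are assumptions; needs the PEFT-backed LoRA support in recent diffusers):
#+begin_src python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

pipe.load_lora_weights("path/to/style_lora", adapter_name="style")      # hypothetical paths
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")
pipe.set_adapters(["style", "subject"], adapter_weights=[0.8, 0.6])     # blend both

out = pipe("papercut style, a corgi astronaut", num_inference_steps=30).images[0]
out.save("multi_lora.png")
#+end_src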
5.4.6.3. TEXTUAL INVERSION
- Multiresolution Textual Inversion: better textual inversion (embedding; loading sketch at the end of this subsection)
- Extended Textual Inversion (XTI)
- P+: Extended Textual Conditioning in Text-to-Image Generation
- different text embedding per unet layer
- code
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models (llm)
- adapter to transfer the semantic understanding of llm to align complex vs simple prompts
- P+: Extended Textual Conditioning in Text-to-Image Generation
- DREAMDISTRIBUTION is like Textual Inversion
- CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization
- learns the gap between the personalized concept and its base class
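- Minimal diffusers sketch of using a learned textual-inversion embedding: load the concept, then use its trigger token in the prompt (the sd-concepts-library repo and <cat-toy> token follow the diffusers docs example; the base model is an assumption):
#+begin_src python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")  # adds the <cat-toy> token

out = pipe("a <cat-toy> sitting on a bookshelf", num_inference_steps=30).images[0]
out.save("textual_inversion.png")
#+end_src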
6. USE CASES
6.1. IMAGE COMPRESSION FILE
- Robustly overfitting latents for flexible neural image compression
- refine the latents of pre-trained neural image compression models
- Learned Image Compression with Text Quality Enhancement
- text logit loss function
6.2. DIFFUSION AS ENCODER - RETRIEVE PROMPT
- De-Diffusion Makes Text a Strong Cross-Modal Interface
- text as a cross-modal interface
- autoencoder uses a pre-trained text-to-image diffusion model for decoding
- encoder is trained to transform an input image into text
- PH2P: Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
- projection scheme to optimize for prompts representative of the space in the model (meaningful prompts)
6.3. DIFFUSING TEXT
- 2.7.4.1
- DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion (fonts)
- GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently
- TextDiffuser: Diffusion Models as Text Painters
- Typographic Text Generation with Off-the-Shelf Diffusion Model
- complex effects while preserving its overall coherence
- Typographic Text Generation with Off-the-Shelf Diffusion Model
- GlyphControl: Glyph Conditional Control for Visual Text Generation
=this=
- TextDiffuser: Diffusion Models as Text Painters
- TextDiffuser-2: two language models: for layout planning and layout encoding; before the unet
- 2.7.3
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation
- training-free framework to enhance layout generator and image generator conditioned on it
- generating images with long and rare text sequences
6.3.1. GENERATE VECTORS
- VecFusion: Vector Font Generation with Diffusion
- rasterized fonts then vector model synthesizes vector fonts
- StarVector: Generating Scalable Vector Graphics Code from Images
- CLIP image encoder, learning to align the visual and code tokens, generate SVGs
- StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
- encoding into stroke tokens, naturally compatible with LLMs
- SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout
- creation of vector graphics depicting entire scenes from textual descriptions
- optimized using a pre-trained encoder
6.3.2. INPAINTING TEXT
6.3.2.1. DERIVED FROM SD
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models (with training code)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model (Diff-Text)
- attention constraint to address unreasonable positioning, more accurate scene text, any language
- it's just a prompt and canny: “sign”, “billboard”, “label”, “promotions”, “notice”, “marquee”, “board”, “blackboard”, “slogan”, “whiteboard”, “logo”
- AnyText: Multilingual Visual Text Generation And Editing
=best=
- inputs: glyph, position, and masked image to generate latent features for text generation-editing
- text curved into shapes and textures
6.4. IMAGE RESTORATION, SUPER-RESOLUTION
- NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
- image signal processing pipeline; multiple blendable styles in a single network
- FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
- refusion: Image Restoration with Mean-Reverting Stochastic Differential Equations
- image restoration IR, DDNM using NULL-SPACE
- unlimited superresolution
- SVNR: Spatially-variant Noise Removal with Denoising Diffusion
- real life noise fixing
- Dense Pixel-to-Pixel Harmonization via Continuous Image Representation
- fixes images stretched due to resolution changes
- Zero-Shot Image Harmonization with Generative Model Prior
- DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
- uses SwinIR, then refines with SD (diffusion upscaler sketch below)
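- Minimal diffusers sketch of diffusion-based refinement/super-resolution with the SD x4 upscaler (file paths are placeholders):
#+begin_src python
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16).to("cuda")

low_res = Image.open("low_res_128.png").convert("RGB")
out = pipe(prompt="a detailed photo of a red fox", image=low_res,
           num_inference_steps=25).images[0]
out.save("upscaled_512.png")
#+end_src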
6.4.1. SUPERRESOLUTION
- CCSR: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
- Swintormer: Image Deblurring based on Diffusion Models (limited memory)
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
- for videos, temporal adapter to ensure temporal coherence
- YONOS-SR: You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
- starts by training a teacher model on a smaller magnification scale
- one step instead of 200, with a fine-tuned decoder on top
- SUPIR: Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
- based on large-scale diffusion generative prior
- Face to Cartoon Incremental Super-Resolution using Knowledge Distillation
- faces and anime restoration at various levels of detail
- APISR: Anime Production Inspired Real-World Anime Super-Resolution
- Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model
- pyramid latent representation
6.4.1.1. STABLESR
- StableSR: Exploiting Diffusion Prior for Real-World Image Super-Resolution
- develops a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models
6.4.1.2. DEMOFUSION
- DemoFusion: Democratising High-Resolution Image Generation With No $$$
- achieve higher-resolution image generation
- Enhance This: DemoFusion SDXL
- ComfyUI Iterative Mixing Nodes
=best=
- iterative mixing of samples to help with upscaling quality
- SD 1.5 generating at higher resolutions
- evolution from NNLatentUpscale
- PASD MAGNIFY
- PASD Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
- image slider custom component
- PASD Magnify: Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
6.5. DEPTH GENERATION
- depth map from diffusion, build a 3D environment with it (depth-estimation sketch at the end of this list)
- ZoeDepth: Combining relative and metric depth
- tiling ZoeDepth
- PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
- Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (70s inference)
- LDM3D by intel, generates image & depth from text prompts
- LDM3D-VR: Latent Diffusion Model for 3D VR
- generating depth together, panoramic RGBD
- DMD (Diffusion for Metric Depth)
- Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model (depth from image)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (temporal coherence, no flickering)
- GIBR
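- Minimal sketch of getting a depth map for conditioning or 3D use via the transformers depth-estimation pipeline (the DPT model id is one working example; Depth Anything checkpoints expose the same task):
#+begin_src python
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
image = Image.open("photo.png").convert("RGB")
result = depth_estimator(image)
result["depth"].save("depth_map.png")  # PIL image of per-pixel relative depth
#+end_src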
6.5.1. DEPTH DIFFUSION
6.5.2. NORMAL MAPS
- DSine: Rethinking Inductive Biases for Surface Normal Estimation
- better than bae and midas
- preprocessor