diffusion alternative
- parent: stablediffusion
- SEED: autoregressive
- Karlo, Stable Karlo (image generation based on unCLIP)
- DeepFloyd is a Stable Diffusion-style image model that more or less replaces CLIP with a full LLM, closer to Google's Imagen model
- it’s a cascaded diffusion model conditioned on the T5 encoder
- Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
- iterative restoration from low-quality and high-quality paired examples
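A minimal sketch of the InDI idea, assuming a network model(x_t, t) and 4D image batches (both assumptions, not the paper's code): blend paired clean/degraded images at a random time, regress the clean image, then restore by walking from t=1 back toward t=0.
#+begin_src python
import torch

def indi_train_step(model, x_clean, x_degraded, opt):
    """One training step: interpolate paired examples and predict the clean image."""
    t = torch.rand(x_clean.shape[0], 1, 1, 1)       # t ~ U(0, 1)
    x_t = (1 - t) * x_clean + t * x_degraded        # direct interpolation, no Gaussian noise
    loss = ((model(x_t, t) - x_clean) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss

@torch.no_grad()
def indi_restore(model, x_degraded, steps=10):
    """Iterative restoration: start at t=1 (low quality) and step toward t=0."""
    x, delta = x_degraded, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * delta
        t_batch = torch.full((x.shape[0], 1, 1, 1), t)
        # convex step toward the current clean-image prediction
        x = (delta / t) * model(x, t_batch) + (1 - delta / t) * x
    return x
#+end_src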
1. TRANSFORMERS
- Muse: diffusion alternative; Masked Generative Transformers over discrete image tokens, conditioned on T5 text embeddings
- super-resolution
- transformers instead of a U-Net (DiT)
- StraIT: Non-autoregressive Generation with Stratified Image Transformer
1.1. GPT
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- VAR: a visual generation method that elevates GPT-style models beyond diffusion
- outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability
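A hedged sketch of next-scale prediction; the transformer and VQ-VAE interfaces (target_hw, decode) are hypothetical stand-ins, not VAR's actual API:
#+begin_src python
import torch

@torch.no_grad()
def var_generate(transformer, vqvae, scales=(1, 2, 4, 8, 16)):
    """Autoregression over scales: predict the whole token map of the next
    resolution at once, conditioned on all coarser token maps so far."""
    context = []
    for s in scales:
        logits = transformer(context, target_hw=(s, s))   # (s, s, vocab), hypothetical API
        probs = torch.softmax(logits, dim=-1)
        tokens = torch.multinomial(probs.flatten(0, -2), 1).view(s, s)
        context.append(tokens)                            # coarse-to-fine conditioning
    return vqvae.decode(context)                          # multi-scale tokens -> image
#+end_src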
1.2. DIFFUSION TRANSFORMER
- GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
- Lucas Beyer: represent videos and images as collections of units of data called patches, akin to GPT tokens
- this lets you train diffusion transformers on data of varying duration, resolution, and aspect ratio (see the patchify sketch below)
- ZigMa: Zigzag Mamba Diffusion Model
- Mamba (state-space model) instead of a transformer
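A small sketch of the patches-as-tokens point above: any resolution or aspect ratio becomes a variable-length token sequence, so one diffusion transformer can be trained on mixed data. Shapes are illustrative.
#+begin_src python
import torch

def patchify(img, patch=16):
    """Split an image into non-overlapping patch tokens, GPT-token style."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.view(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4).reshape(-1, c * patch * patch)
    return x  # (num_patches, patch_dim): sequence length varies with image size

seq_a = patchify(torch.randn(3, 256, 256))  # 256 tokens
seq_b = patchify(torch.randn(3, 192, 320))  # 240 tokens, different aspect ratio
#+end_src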
1.2.1. FIT TRANSFORMER
- FiT: Flexible Vision Transformer for Diffusion Model
- architecture designed for generating images with unrestricted resolutions and aspect ratios
- promoting resolution generalization, eliminating biases induced by image cropping
1.2.2. PIXART
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (model)
=best=
- training cost is only 10.8% of Stable Diffusion's; inference in under 8 GB VRAM
- supports ControlNet and LCM
- PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (adds LCM and ControlNet)
- PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- smaller size (0.6B parameters) than SDXL (2.6B parameters) and SD Cascade (5.1B parameters)
1.2.3. SiT
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
- Scalable Interpolant Transformers (SiT)
- design axes: discrete vs. continuous-time learning, the objective the model learns, and the interpolant connecting the two distributions
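A minimal sketch of one point in that design space, assuming a linear interpolant (alpha = 1 - t, sigma = t), continuous time, and a velocity objective; model(x_t, t) is an assumed interface:
#+begin_src python
import torch

def sit_velocity_loss(model, x0):
    """Stochastic interpolant x_t = (1-t) x0 + t eps with a velocity target."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1, 1, 1)   # continuous time in (0, 1)
    x_t = (1 - t) * x0 + t * eps
    v_target = eps - x0                    # d/dt [(1-t) x0 + t eps]
    return ((model(x_t, t) - v_target) ** 2).mean()
#+end_src
Swapping the interpolant, the time discretization, or the target (velocity vs. score) moves you to other points in the SiT design space.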
1.3. RWKV
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
- RWKV (an RNN-style architecture) instead of transformers
2. STILL DIFFUSION
- Composer: better inpainting; trains semantic components independently
- High Fidelity Image Synthesis With Deep VAEs In Latent Space
- hierarchical variational autoencoders (VAEs)
- Binary Latent Diffusion: binary latent space, binary latent diffusion model; about 1/3 of LDM's parameters
- ties the probability of the discrete representation to the probability of the dataset: variational inference itself
- Self-conditioned Image Generation via Generating Representations
=best=
- RCG: Representation-Conditioned image Generation
- does not condition on any human annotations; instead conditions on representations from a pre-trained self-supervised encoder
- a representation diffusion model (RDM) generates in that representation space (see the sketch at the end of this list)
- CLIP-VQDiffusion: Language-Free Training of Text-to-Image Generation using CLIP and a Vector-Quantized Diffusion Model
- uses the CLIP image encoder at train time, then the CLIP text encoder at test time
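Sketch of the RCG pipeline mentioned above; the num_steps/denoise_step and pixel-generator interfaces are assumptions, not the paper's code:
#+begin_src python
import torch

@torch.no_grad()
def rcg_sample(rep_diffusion, pixel_generator, n=4, rep_dim=256):
    """No human annotations: an RDM samples a self-supervised representation
    from noise, then a pixel generator is conditioned on it."""
    rep = torch.randn(n, rep_dim)
    for t in reversed(range(rep_diffusion.num_steps)):    # hypothetical API
        rep = rep_diffusion.denoise_step(rep, t)
    return pixel_generator(rep)            # representation-conditioned synthesis
#+end_src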
2.1. STABLE CASCADE
2.2. PERCEPTUAL LOSS
- Diffusion Model with Perceptual Loss
=best=
- the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
- the diffusion model itself is a perceptual network (used in the training objective)
- yields models capable of generating more realistic samples (at fewer steps)
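A hedged sketch of the self-perceptual idea: re-noise both the model's clean-image estimate and the ground truth identically, and compare features from a frozen copy of the diffusion model. The scheduler and feature-hook APIs are assumptions; the paper's exact layer choice may differ.
#+begin_src python
import torch

def self_perceptual_loss(model, frozen_model, x0, sched):
    t = torch.randint(0, sched.T, (x0.shape[0],))
    x_t = sched.add_noise(x0, torch.randn_like(x0), t)   # hypothetical scheduler API
    x0_hat = model.predict_x0(x_t, t)                    # online model's estimate

    t2 = torch.randint(0, sched.T, (x0.shape[0],))
    eps2 = torch.randn_like(x0)                          # same noise for both branches
    feats_hat = frozen_model.features(sched.add_noise(x0_hat, eps2, t2), t2)
    feats_ref = frozen_model.features(sched.add_noise(x0, eps2, t2), t2)
    return ((feats_hat - feats_ref) ** 2).mean()         # distance in diffusion features
#+end_src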
2.3. WITH LLM
- Kandinsky 2: image fusion, inpainting, open source (apache)
- uses XLM-RoBERTa-Large, an LLM (RoBERTa-style: BERT, but with a byte-level BPE tokenizer)
- maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
- trains neither the U-Net nor the LLM: the two pre-trained models are bridged by a Timestep-Aware Semantic Connector module, which adapts semantic features at different stages of denoising
- enables interpreting lengthy, intricate prompts across sampling timesteps
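A rough sketch of such a connector (the dimensions, the simplified timestep embedding, and the single attention layer are all assumptions; ELLA's module is more elaborate):
#+begin_src python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Trainable bridge between a frozen LLM and a frozen U-Net: resample LLM
    token features into a fixed set of conditioning tokens, modulated by t."""
    def __init__(self, llm_dim=2048, unet_dim=768, n_queries=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, unet_dim))
        self.proj = nn.Linear(llm_dim, unet_dim)
        self.t_embed = nn.Linear(1, unet_dim)            # simplified timestep embedding
        self.attn = nn.MultiheadAttention(unet_dim, 8, batch_first=True)

    def forward(self, llm_feats, t):                     # llm_feats: (B, L, llm_dim)
        kv = self.proj(llm_feats)
        q = self.queries + self.t_embed(t.view(-1, 1, 1).float())  # timestep-aware queries
        out, _ = self.attn(q, kv, kv)                    # cross-attend to LLM features
        return out                                       # goes to U-Net cross-attention
#+end_src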
2.4. FASTER
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
- runs on mobile devices in ~2 seconds; reduces image-decoder computation via data distillation
- Beyond U: Making Diffusion Models Faster & Lighter
- continuous dynamical systems to design a novel denoising network
- a quarter of the parameters and 30% of the FLOPs of SD, with 70% faster inference
2.4.1. ONE STEP DIFFUSION
- Consistency Models: consistency distillation vs. progressive distillation (see the sketch at the end of this list)
- Diffusion World Model (DWM)
=best=
- long-horizon predictions in a single forward pass, eliminating the need for recursive queries
- enables offline Q-learning with synthetic data
- distribution matching distillation (DMD)
- distills the multi-step process of traditional diffusion models into a single step via a teacher-student setup
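The consistency-distillation sketch referenced above: force the student to map two adjacent points of the same probability-flow ODE trajectory to the same clean image, so sampling collapses toward one step. The teacher ODE step and noise-level grid are assumed interfaces.
#+begin_src python
import torch

def consistency_distill_step(student, ema_student, teacher_ode_step, x0, sigmas):
    i = torch.randint(1, len(sigmas), (1,)).item()
    x_hi = x0 + sigmas[i] * torch.randn_like(x0)             # point at higher noise level
    x_lo = teacher_ode_step(x_hi, sigmas[i], sigmas[i - 1])  # one teacher ODE step down
    target = ema_student(x_lo, sigmas[i - 1]).detach()       # EMA network gives the target
    return ((student(x_hi, sigmas[i]) - target) ** 2).mean() # self-consistency loss
#+end_src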
2.4.1.1. RECTIFIED FLOW
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
- unified solution to generative modeling and domain transfer
- simple approach to learning models to transport between two observed distributions
- transports along increasingly straight paths, the shortest path between two points (see the sketch at the end of this section)
- uses: image generation, image-to-image translation, and domain adaptation
- ⚡InstaFlow! One-Step Stable Diffusion with Rectified Flow
- Leveraging pre-trained Stable Diffusion; one step = faster, 0.12s per image
- can quickly generate low-resolution images, serving as a fast previewer
- compatible with ControlNet and LoRA
- PERFLOW (piecewise rectified flow)
- Boosting Latent Diffusion with Flow Matching
- flow matching between the diffusion model and the convolutional decoder yields high resolution at reduced computational cost
- diffusion provides generation diversity; flow matching maps the small latent space to a high-dimensional one
- STABLE DIFFUSION 3
- Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- biasing rectified flow models towards perceptually relevant scales
- bidirectional flow of information between image and text tokens
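The rectified-flow sketch referenced above: couple source and target samples by a straight line and regress the constant velocity; with straight enough paths, one Euler step suffices (what InstaFlow exploits). model(x_t, t) is an assumed interface.
#+begin_src python
import torch

def rectified_flow_loss(model, x0, x1):
    """x0 ~ source (e.g., noise), x1 ~ target (e.g., data)."""
    t = torch.rand(x0.shape[0], 1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                 # straight-line interpolation
    return ((model(x_t, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(model, x0, steps=1):
    """Euler integration of the learned velocity field; steps=1 is one-step generation."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1, 1, 1), i * dt)
        x = x + dt * model(x, t)
    return x
#+end_src
Reflow (re-training on the model's own (x0, x1) pairs) is what makes the paths increasingly straight.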
2.5. MULTIPLE DIFFUSION composable
- Any-to-Any Generation via Composable Diffusion (audio, image, text)
- SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions (synchronizes multiple diffusion processes)
=best=
- RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
- mixture-of-experts (MoE) layers; handles prompts encompassing multiple nouns, adjectives, and verbs
- trained on 1,000 GPUs for two months
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
- uses multiple GPUs to accelerate a single diffusion model while keeping the output coherent
2.5.1. COMPOSITIONAL DIFFUSION
- Training Data Protection with Compositional Diffusion Models (CDM); parallel training
=best=
- a method to train different diffusion models on distinct data shards and compose them at inference time (see the sketch below)
- SEGMOE
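The composition sketch referenced above: each model trained on a distinct shard contributes a noise prediction, mixed into one denoising step. Uniform mixing weights are an assumption; the paper derives principled ones.
#+begin_src python
import torch

@torch.no_grad()
def composed_eps(models, weights, x_t, t):
    """Mix per-shard noise predictions into a single denoising direction."""
    eps = torch.stack([m(x_t, t) for m in models])            # (num_models, B, C, H, W)
    w = torch.tensor(weights).view(-1, *([1] * (eps.dim() - 1)))
    return (w * eps).sum(dim=0)                               # weighted combination
#+end_src
Unlearning or protecting a shard then amounts to dropping that model from the list.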
2.5.1.1. PANGU
- PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
- novel latent diffusion model designed for resource-efficient and multiple control signals
- split structure and texture generators
- cutting data preparation by 48% and reducing training resources by 51%
- cooperatively use different latent spaces within a unified denoising process
- multi-control image synthesis
2.6. MULTIMODAL DIFFUSION
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
- disentanglement of style and semantics, dual- and multi-context blending
- generate similar expressions from reference text
- UniDiffuser: marginal, conditional, and joint diffusion in one model (paper on arXiv)
- perturbs data in all modalities, each with its own diffusion timestep
- image, text, text-to-image, image-to-text, and image-text pair generation
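A hedged sketch of the UniDiffuser-style objective: every modality gets its own timestep and noise, and one network predicts all noises jointly; fixing one modality's timestep at 0 recovers conditional generation. The scheduler and the model's output convention are assumptions.
#+begin_src python
import torch

def unidiffuser_loss(model, img_lat, txt_emb, sched):
    t_img = torch.randint(0, sched.T, (img_lat.shape[0],))    # independent timesteps
    t_txt = torch.randint(0, sched.T, (txt_emb.shape[0],))
    e_img, e_txt = torch.randn_like(img_lat), torch.randn_like(txt_emb)
    z_img = sched.add_noise(img_lat, e_img, t_img)            # hypothetical scheduler API
    z_txt = sched.add_noise(txt_emb, e_txt, t_txt)
    pred_img, pred_txt = model(z_img, t_img, z_txt, t_txt)    # joint noise prediction
    return ((pred_img - e_img) ** 2).mean() + ((pred_txt - e_txt) ** 2).mean()
#+end_src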
3. GAN
- GigaGAN: adobe implementation
- StyleGAN-T: NVIDIA (text-to-image)
- diffusion as an alternative to GANs: DiffMorpher