diffusion alternative

Table of Contents

1. TRANSFORMERS

  • Muse: diffusion alternative; Masked Generative Transformers predicting discrete image tokens, conditioned on T5 text embeddings
    • separate super-resolution model
  • DiT: transformers instead of a U-Net as the diffusion backbone
  • StraIT: Non-autoregressive Generation with Stratified Image Transformer

1.1. GPT

  • Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
    • VAR: a new visual generation paradigm that elevates GPT-style autoregressive models beyond diffusion
    • outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability (a next-scale sketch follows this list)
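
A minimal sketch of the next-scale idea, assuming a hypothetical transformer(context, target_hw=...) API and an assumed VQ codebook object; VAR's real implementation differs in detail:

#+begin_src python
# Illustrative next-scale generation loop: autoregress over whole token maps,
# coarse to fine, instead of over single tokens in raster order.
import torch

scales = [1, 2, 4, 8, 16]                 # token-map side lengths, coarse -> fine

def generate(transformer, vq_codebook):   # both arguments are assumed objects
    context = []                          # token maps generated so far
    for s in scales:
        # Condition on all coarser maps; predict the whole s x s map at once.
        logits = transformer(context, target_hw=(s, s))           # (s*s, vocab)
        token_map = torch.distributions.Categorical(logits=logits).sample()
        context.append(token_map)         # (s*s,) discrete VQ indices
    return vq_codebook.decode(context[-1])  # decode the finest map to an image
#+end_src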

1.2. DIFFUSION TRANSFORMER

  • GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
  • Lucas Beyer: represent videos and images as collections of smaller units of data called patches, akin to GPT tokens
    • this lets you train diffusion transformers on data of varying durations, resolutions, and aspect ratios (see the patchify sketch after this list)
  • ZigMa: Zigzag Mamba Diffusion Model
    • Mamba (a state-space model) instead of a transformer
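
A minimal patchify sketch in plain PyTorch (the function name, patch size, and shapes are mine, not from any of the papers above):

#+begin_src python
# Turn an image of any resolution into a sequence of patch tokens, the way
# DiT-style models treat patches as their "GPT tokens".
import torch

def patchify(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """image: (C, H, W) with H and W divisible by `patch`.
    Returns a (num_patches, C * patch * patch) token sequence."""
    c, h, w = image.shape
    x = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    return x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# Different resolutions/aspect ratios just give different sequence lengths:
tokens_a = patchify(torch.randn(3, 256, 256))   # 256 tokens of dim 768
tokens_b = patchify(torch.randn(3, 192, 320))   # 240 tokens of dim 768
#+end_src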

1.2.1. FIT TRANSFORMER

  • FiT: Flexible Vision Transformer for Diffusion Model
    • architecture designed for generating images with unrestricted resolutions and aspect ratios
    • promoting resolution generalization, eliminating biases induced by image cropping

1.2.2. PIXART

  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (model) =best=
    • training takes only 10.8% of Stable Diffusion's training time; runs in under 8 GB of VRAM
    • supports ControlNet and LCM
    • PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (the LCM/ControlNet variant)
    • PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
      • smaller size (0.6B parameters) than SDXL (2.6B parameters) and SD Cascade (5.1B parameters)

1.2.3. SiT

  • SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
    • Scalable Interpolant Transformers (SiT)
    • design axes studied: discrete vs. continuous-time learning, the objective the model learns, and the interpolant connecting the two distributions (interpolant sketch below)
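
A sketch of one of those axes, the interpolant choice; x0 is noise, x1 is data, and the schedule names (linear, gvp) follow the paper's spirit rather than its code:

#+begin_src python
# Two interpolants x_t = a(t)*x1 + s(t)*x0 connecting noise and data in
# continuous time; the model then learns a velocity or score field on x_t.
import math, torch

def linear(t):                       # straight-line interpolant
    return t, 1 - t                  # a(t), s(t)

def gvp(t):                          # generalized variance-preserving interpolant
    return torch.sin(math.pi * t / 2), torch.cos(math.pi * t / 2)

def interpolate(x0, x1, t, schedule=linear):
    a, s = schedule(t)
    return a * x1 + s * x0
#+end_src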

1.3. RWKV

  • Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
    • RWKV (an RNN-like architecture) instead of transformers

2. STILL DIFFUSION

  • Composer: better inpainting; trains semantic components independently
  • High Fidelity Image Synthesis With Deep VAEs In Latent Space
    • hierarchical variational autoencoders (VAEs)
  • Binary Latent Diffusion: binary latent space and binary latent diffusion model; 1/3 of LDM's parameters
    • ties the "probability" of the discrete representation to the probability of the dataset: variational inference itself
  • Self-conditioned Image Generation via Generating Representations =best=
    • RCG: Representation-Conditioned image Generation
    • conditions on representations from a pre-trained encoder rather than on any human annotations
      • a representation diffusion model (RDM) samples that representation space
  • CLIP-VQDiffusion: Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model
    • uses the CLIP image encoder at train time, swapping in the CLIP text encoder at test time

2.1. STABLE CASCADE

  • Stable Cascade: by Stability, a new text-to-image model building upon the Würstchen architecture
    • works in a much smaller latent space: 42x spatial compression vs. Stable Diffusion's 8x
    • the smaller the latent space, the faster inference runs and the cheaper training becomes (see the arithmetic below)
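
Back-of-envelope arithmetic for that claim (the 1024px resolution and layout are mine):

#+begin_src python
# Why 42x spatial compression is so much cheaper than 8x: the number of latent
# positions the denoiser must process shrinks roughly quadratically.
H = W = 1024                             # pixel resolution
sd_latent      = (H // 8)  * (W // 8)    # Stable Diffusion:  128*128 = 16384
cascade_latent = (H // 42) * (W // 42)   # Stable Cascade:     24*24  =   576
print(sd_latent / cascade_latent)        # ~28x fewer latent positions
#+end_src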

2.2. PERCEPTUAL LOSS

  • Diffusion Model with Perceptual Loss =best=
    • the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
    • the diffusion model itself is used as a perceptual network in the training objective (sketch below)
    • yields more realistic samples (at fewer steps)
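
A loose sketch of that self-perceptual idea; `mid_features` is an assumed hook into the frozen denoiser's intermediate activations, not the paper's actual interface:

#+begin_src python
# Compare intermediate features of a frozen copy of the denoiser on the model's
# clean-image estimate vs. on the real image, instead of a plain epsilon MSE.
import torch

def self_perceptual_loss(denoiser, frozen_denoiser, x0, t, mid_features):
    noise = torch.randn_like(x0)
    xt = x0 + t * noise                                  # simplified forward noising
    x0_pred = denoiser(xt, t)                            # model's clean-image estimate
    f_pred = mid_features(frozen_denoiser, x0_pred, t)   # assumed feature hook
    f_real = mid_features(frozen_denoiser, x0, t)
    return ((f_pred - f_real) ** 2).mean()
#+end_src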

2.3. WITH LLM

  • Kandinsky 2: image fusion, inpainting, open source (Apache)
    • uses XLM-RoBERTa-Large, an LLM: BERT-like, but with a byte-level BPE tokenizer
    • maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
  • ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
    • no training of either the U-Net or the LLM: two pre-trained models are bridged with a Timestep-Aware Semantic Connector module that adapts semantic features at different stages of denoising (toy sketch after this list)
    • interprets lengthy and intricate prompts across sampling timesteps
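
A toy stand-in for such a connector; ELLA's actual module is more elaborate, and the dimensions and names here are mine:

#+begin_src python
# Adapt frozen-LLM token features differently per denoising timestep, producing
# the conditioning the (also frozen) U-Net consumes.
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    def __init__(self, llm_dim=2048, cond_dim=768):
        super().__init__()
        self.t_embed = nn.Linear(1, llm_dim)   # timestep -> feature shift
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_tokens, t):          # llm_tokens: (B, L, llm_dim); t: (B,)
        shift = self.t_embed(t.float().view(-1, 1, 1))   # (B, 1, llm_dim)
        return self.proj(llm_tokens + shift)   # timestep-modulated conditioning
#+end_src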

2.4. FASTER

  • SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
    • mobile devices = ~2 seconds; reduces image-decoder computation via data distillation
  • Beyond U: Making Diffusion Models Faster & Lighter
    • continuous dynamical systems to design a novel denoising network
    • 1/4 of the parameters and 30% of the FLOPs of SD; 70% faster inference

2.4.1. ONE STEP DIFFUSION

  • Consistency Models: consistency distillation vs. progressive distillation (a distillation-loss sketch follows this list)
  • Diffusion World Model (DWM) =best=
    • long-horizon predictions in a single forward pass, eliminating the need for recursive queries
      • enables offline Q-learning with synthetic data
  • Distribution Matching Distillation (DMD)
    • distills the multi-step process of traditional diffusion models into a single step via a teacher-student setup
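
A loose consistency-distillation sketch (after the Consistency Models recipe); student, ema_student, and teacher_ode_step are assumed callables:

#+begin_src python
# The student must map two adjacent points of the same teacher ODE trajectory
# to the same clean estimate (self-consistency).
import torch

def cd_loss(student, ema_student, teacher_ode_step, x0, t_next, t_now):
    x_next = x0 + t_next * torch.randn_like(x0)      # diffuse data to time t_next
    x_now = teacher_ode_step(x_next, t_next, t_now)  # one teacher solver step back
    target = ema_student(x_now, t_now).detach()      # EMA copy of student, no grad
    return ((student(x_next, t_next) - target) ** 2).mean()
#+end_src
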
2.4.1.1. RECTIFIED FLOW
  • Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
    • unified solution to generative modeling and domain transfer
    • a simple approach to learning transport maps between two observed distributions
    • follows increasingly straight paths, the shortest paths between two points (training/sampling sketch at the end of this list)
    • uses: image generation, image-to-image translation, and domain adaptation
  • ⚡InstaFlow! One-Step Stable Diffusion with Rectified Flow
    • leverages pre-trained Stable Diffusion; one step = 0.12 s per image
    • can quickly generate low-resolution previews to choose from: a fast previewer
    • supports ControlNet and LoRA
    • PeRFlow (piecewise rectified flow)
  • Boosting Latent Diffusion with Flow Matching
    • flow matching between the diffusion model and the convolutional decoder = high resolution at reduced computational cost
    • diffusion provides generation diversity; flow matching maps the small latent space to a high-dimensional one
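
A minimal rectified-flow sketch: learn a velocity field along straight lines between noise x0 and data x1, then integrate it with one or a few Euler steps (`model` is an assumed velocity network; data is assumed to be (B, C, H, W)):

#+begin_src python
import torch

def rf_loss(model, x1):
    x0 = torch.randn_like(x1)                  # source (noise) sample
    t = torch.rand(x1.shape[0], 1, 1, 1)       # uniform time in (0, 1)
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    return ((model(xt, t.flatten()) - (x1 - x0)) ** 2).mean()  # velocity matching

@torch.no_grad()
def sample(model, shape, steps=1):             # steps=1 is the InstaFlow regime
    x, dt = torch.randn(shape), 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * model(x, t)               # Euler step along the learned flow
    return x
#+end_src
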
  1. STABLE DIFFUSION 3
    • Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
      • biases rectified flow training towards perceptually relevant noise scales via logit-normal timestep sampling (sketch below)
      • bidirectional flow of information between image and text tokens
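
The timestep sampler behind that bias, as I read the paper (the defaults for the location/scale parameters m and s are illustrative):

#+begin_src python
# Logit-normal timestep sampling: draw u ~ N(m, s) and squash it with a
# sigmoid, concentrating training on intermediate, perceptually relevant t.
import torch

def logit_normal_t(batch, m=0.0, s=1.0):
    return torch.sigmoid(m + s * torch.randn(batch))  # t in (0, 1), peaked mid-range

t = logit_normal_t(4)   # e.g. tensor([0.62, 0.31, 0.55, 0.48]); values vary
#+end_src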

2.5. MULTIPLE DIFFUSION (composable)

  • Any-to-Any Generation via Composable Diffusion (audio, image, text)
  • SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions (synchronizes them) =best=
  • RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
    • mixture-of-experts (MoE) layers; accurately depicts prompts encompassing multiple nouns, adjectives, and verbs (a generic MoE routing sketch follows this list)
    • trained on 1,000 GPUs for two months
  • DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
    • uses multiple GPUs to accelerate diffusion inference while keeping the output coherent
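
A generic top-1 MoE routing sketch, not RAPHAEL's actual space-/time-MoE (layer sizes and the Linear experts are placeholders):

#+begin_src python
# A gate sends each token through one expert; different prompt concepts can
# thus take different "paths" through the network.
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                      # x: (tokens, dim)
        choice = self.gate(x).argmax(-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
#+end_src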

2.5.1. COMPOSITIONAL DIFFUSION

  • Training Data Protection with Compositional Diffusion Models (CDM); parallel training =best=
    • a method to train different diffusion models on distinct data and compose them at inference time (toy sketch below)
  • SegMoE: Segmind Mixture of Diffusion Experts
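
A toy composition of separately trained denoisers (a plain uniform mixture of their noise predictions; CDM's actual composition rule is more involved):

#+begin_src python
# Average the epsilon predictions of independently trained models at each
# denoising step, so no single model had to see all the training data.
def composed_eps(models, x, t):
    return sum(m(x, t) for m in models) / len(models)
#+end_src
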
2.5.1.1. PANGU
  • PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
    • a novel latent diffusion model designed for resource-efficient training and multiple control signals
      • split structure and texture generators
      • cutting data preparation by 48% and reducing training resources by 51%
    • cooperatively use different latent spaces within a unified denoising process
      • multi-control image synthesis

2.6. MULTIMODAL DIFFUSION

  • Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
    • disentanglement of style and semantics, dual- and multi-context blending
    • generate similar expressions from reference text
  • UniDiffuser: marginal, conditional, and joint diffusion (arXiv paper)
    • extra diffusion conditions; perturbs data in all modalities
    • image, text, text-to-image, image-to-text, and image-text pair generation

3. GAN

4. BETTER DECODER

  • Designing a Better Asymmetric VQGAN for StableDiffusion
    • only requires training a new asymmetric decoder for vanilla SD; renders text better
  • k-diffusion: OpenAI’s consistency decoder (HF model) as a k-diffusion v-prediction denoiser
    • supports n>2 step sampling
    • sdxl-diffusion-decoder

Author: Tekakutli

Created: 2024-04-13 Sat 04:35