diffusion alternative

Table of Contents

1. TRANSFORMERS

  • Muse: diffusion alternative; Masked Generative Transformers predicting discrete image tokens, conditioned on T5 text embeddings
    • separate super-resolution model
  • DiT: transformers instead of a U-Net as the diffusion backbone
  • StraIT: Non-autoregressive Generation with Stratified Image Transformer

1.1. GPT

  • Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
    • VAR: a new visual generation paradigm that elevates GPT-style autoregressive models beyond diffusion
    • outperforms Diffusion Transformer (DiT) in quality, inference speed, data efficiency, and scalability (a next-scale sketch follows this list)
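
A minimal sketch of the next-scale idea, assuming a hypothetical transformer(context, target_hw=...) API and an assumed VQ codebook object; VAR's real implementation differs in detail:

#+begin_src python
# Illustrative next-scale generation loop: autoregress over whole token maps,
# coarse to fine, instead of over single tokens in raster order.
import torch

scales = [1, 2, 4, 8, 16]                 # token-map side lengths, coarse -> fine

def generate(transformer, vq_codebook):   # both arguments are assumed objects
    context = []                          # token maps generated so far
    for s in scales:
        # Condition on all coarser maps; predict the whole s x s map at once.
        logits = transformer(context, target_hw=(s, s))           # (s*s, vocab)
        token_map = torch.distributions.Categorical(logits=logits).sample()
        context.append(token_map)         # (s*s,) discrete VQ indices
    return vq_codebook.decode(context[-1])  # decode the finest map to an image
#+end_src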

1.2. DIFFUSION TRANSFORMER

  • GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
  • Lucas Beyer: represent videos and images as collections of smaller units of data called patches, akin to GPT tokens
    • this lets you train diffusion transformers on data of varying durations, resolutions, and aspect ratios (see the patchify sketch after this list)
  • ZigMa: Zigzag Mamba Diffusion Model
    • Mamba (a state-space model) instead of a transformer
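
A minimal patchify sketch in plain PyTorch (the function name, patch size, and shapes are mine, not from any of the papers above):

#+begin_src python
# Turn an image of any resolution into a sequence of patch tokens, the way
# DiT-style models treat patches as their "GPT tokens".
import torch

def patchify(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """image: (C, H, W) with H and W divisible by `patch`.
    Returns a (num_patches, C * patch * patch) token sequence."""
    c, h, w = image.shape
    x = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    return x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

# Different resolutions/aspect ratios just give different sequence lengths:
tokens_a = patchify(torch.randn(3, 256, 256))   # 256 tokens of dim 768
tokens_b = patchify(torch.randn(3, 192, 320))   # 240 tokens of dim 768
#+end_src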

1.2.1. FIT TRANSFORMER

  • FiT: Flexible Vision Transformer for Diffusion Model
    • architecture designed for generating images with unrestricted resolutions and aspect ratios
    • promoting resolution generalization, eliminating biases induced by image cropping

1.2.2. PIXART

  • PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (model) =best=
    • training takes only 10.8% of Stable Diffusion's training time; runs in under 8 GB of VRAM
    • supports ControlNet and LCM
    • PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (the LCM/ControlNet variant)
    • PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
      • smaller size (0.6B parameters) than SDXL (2.6B parameters) and SD Cascade (5.1B parameters)

1.2.3. SiT

  • SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
    • Scalable Interpolant Transformers (SiT)
    • design axes studied: discrete vs. continuous-time learning, the objective the model learns, and the interpolant connecting the two distributions (interpolant sketch below)
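
A sketch of one of those axes, the interpolant choice; x0 is noise, x1 is data, and the schedule names (linear, gvp) follow the paper's spirit rather than its code:

#+begin_src python
# Two interpolants x_t = a(t)*x1 + s(t)*x0 connecting noise and data in
# continuous time; the model then learns a velocity or score field on x_t.
import math, torch

def linear(t):                       # straight-line interpolant
    return t, 1 - t                  # a(t), s(t)

def gvp(t):                          # generalized variance-preserving interpolant
    return torch.sin(math.pi * t / 2), torch.cos(math.pi * t / 2)

def interpolate(x0, x1, t, schedule=linear):
    a, s = schedule(t)
    return a * x1 + s * x0
#+end_src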

1.3. RWKV

  • Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
    • RWKV (an RNN-like architecture) instead of transformers

2. STILL DIFFUSION

  • Composer: better inpainting; trains semantic components independently
  • High Fidelity Image Synthesis With Deep VAEs In Latent Space
    • hierarchical variational autoencoders (VAEs)
  • Binary Latent Diffusion: binary latent space and binary latent diffusion model; 1/3 of LDM's parameters
    • ties the "probability" of the discrete representation to the probability of the dataset: variational inference itself
  • Self-conditioned Image Generation via Generating Representations =best=
    • RCG: Representation-Conditioned image Generation
    • conditions on representations from a pre-trained encoder rather than on any human annotations
      • a representation diffusion model (RDM) samples that representation space
  • CLIP-VQDiffusion: Language Free Training of Text To Image generation using CLIP and vector quantized diffusion model
    • uses the CLIP image encoder at train time, swapping in the CLIP text encoder at test time

2.1. STABLE CASCADE

  • Stable Cascade: by Stability, a new text-to-image model building upon the Würstchen architecture
    • works in a much smaller latent space: 42x spatial compression vs. Stable Diffusion's 8x
    • the smaller the latent space, the faster inference runs and the cheaper training becomes (see the arithmetic below)
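
Back-of-envelope arithmetic for that claim (the 1024px resolution and layout are mine):

#+begin_src python
# Why 42x spatial compression is so much cheaper than 8x: the number of latent
# positions the denoiser must process shrinks roughly quadratically.
H = W = 1024                             # pixel resolution
sd_latent      = (H // 8)  * (W // 8)    # Stable Diffusion:  128*128 = 16384
cascade_latent = (H // 42) * (W // 42)   # Stable Cascade:     24*24  =   576
print(sd_latent / cascade_latent)        # ~28x fewer latent positions
#+end_src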

2.2. PERCEPTUAL LOSS

  • Diffusion Model with Perceptual Loss =best=
    • the effectiveness of classifier-free guidance partly originates from it being a form of implicit perceptual guidance
    • the diffusion model itself is used as a perceptual network in the training objective (sketch below)
    • yields more realistic samples (at fewer steps)
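
A loose sketch of that self-perceptual idea; `mid_features` is an assumed hook into the frozen denoiser's intermediate activations, not the paper's actual interface:

#+begin_src python
# Compare intermediate features of a frozen copy of the denoiser on the model's
# clean-image estimate vs. on the real image, instead of a plain epsilon MSE.
import torch

def self_perceptual_loss(denoiser, frozen_denoiser, x0, t, mid_features):
    noise = torch.randn_like(x0)
    xt = x0 + t * noise                                  # simplified forward noising
    x0_pred = denoiser(xt, t)                            # model's clean-image estimate
    f_pred = mid_features(frozen_denoiser, x0_pred, t)   # assumed feature hook
    f_real = mid_features(frozen_denoiser, x0, t)
    return ((f_pred - f_real) ** 2).mean()
#+end_src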

2.3. WITH LLM

  • Kandinsky 2: image fusion, inpainting, open source (Apache)
    • uses XLM-RoBERTa-Large, an LLM: BERT-like, but with a byte-level BPE tokenizer
    • maps CLIP text embeddings to CLIP image embeddings; allows image mixing and blending
  • ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
    • no training of either the U-Net or the LLM: two pre-trained models are bridged with a Timestep-Aware Semantic Connector module that adapts semantic features at different stages of denoising (toy sketch after this list)
    • interprets lengthy and intricate prompts across sampling timesteps
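
A toy stand-in for such a connector; ELLA's actual module is more elaborate, and the dimensions and names here are mine:

#+begin_src python
# Adapt frozen-LLM token features differently per denoising timestep, producing
# the conditioning the (also frozen) U-Net consumes.
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    def __init__(self, llm_dim=2048, cond_dim=768):
        super().__init__()
        self.t_embed = nn.Linear(1, llm_dim)   # timestep -> feature shift
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_tokens, t):          # llm_tokens: (B, L, llm_dim); t: (B,)
        shift = self.t_embed(t.float().view(-1, 1, 1))   # (B, 1, llm_dim)
        return self.proj(llm_tokens + shift)   # timestep-modulated conditioning
#+end_src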

2.4. FASTER

  • SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
    • mobile devices = ~2 seconds; reduces image-decoder computation via data distillation
  • Beyond U: Making Diffusion Models Faster & Lighter
    • continuous dynamical systems to design a novel denoising network
    • 1/4 of the parameters and 30% of the FLOPs of SD; 70% faster inference

2.4.1. ONE STEP DIFFUSION

  • Consistency Models: consistency distillation vs. progressive distillation (a distillation-loss sketch follows this list)
  • Diffusion World Model (DWM) =best=
    • long-horizon predictions in a single forward pass, eliminating the need for recursive queries
      • enables offline Q-learning with synthetic data
  • Distribution Matching Distillation (DMD)
    • distills the multi-step process of traditional diffusion models into a single step via a teacher-student setup
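
A loose consistency-distillation sketch (after the Consistency Models recipe); student, ema_student, and teacher_ode_step are assumed callables:

#+begin_src python
# The student must map two adjacent points of the same teacher ODE trajectory
# to the same clean estimate (self-consistency).
import torch

def cd_loss(student, ema_student, teacher_ode_step, x0, t_next, t_now):
    x_next = x0 + t_next * torch.randn_like(x0)      # diffuse data to time t_next
    x_now = teacher_ode_step(x_next, t_next, t_now)  # one teacher solver step back
    target = ema_student(x_now, t_now).detach()      # EMA copy of student, no grad
    return ((student(x_next, t_next) - target) ** 2).mean()
#+end_src
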
2.4.1.1. RECTIFIED FLOW
  • Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
    • unified solution to generative modeling and domain transfer
    • a simple approach to learning transport maps between two observed distributions
    • follows increasingly straight paths, the shortest paths between two points (training/sampling sketch at the end of this list)
    • uses: image generation, image-to-image translation, and domain adaptation
  • ⚡InstaFlow! One-Step Stable Diffusion with Rectified Flow
    • leverages pre-trained Stable Diffusion; one step = 0.12 s per image
    • can quickly generate low-resolution previews to choose from: a fast previewer
    • supports ControlNet and LoRA
    • PeRFlow (piecewise rectified flow)
  • Boosting Latent Diffusion with Flow Matching
    • flow matching between the diffusion model and the convolutional decoder = high resolution at reduced computational cost
    • diffusion provides generation diversity; flow matching maps the small latent space to a high-dimensional one
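
A minimal rectified-flow sketch: learn a velocity field along straight lines between noise x0 and data x1, then integrate it with one or a few Euler steps (`model` is an assumed velocity network; data is assumed to be (B, C, H, W)):

#+begin_src python
import torch

def rf_loss(model, x1):
    x0 = torch.randn_like(x1)                  # source (noise) sample
    t = torch.rand(x1.shape[0], 1, 1, 1)       # uniform time in (0, 1)
    xt = (1 - t) * x0 + t * x1                 # point on the straight path
    return ((model(xt, t.flatten()) - (x1 - x0)) ** 2).mean()  # velocity matching

@torch.no_grad()
def sample(model, shape, steps=1):             # steps=1 is the InstaFlow regime
    x, dt = torch.randn(shape), 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * model(x, t)               # Euler step along the learned flow
    return x
#+end_src
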
  1. STABLE DIFFUSION 3
    • Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
      • biases rectified flow training towards perceptually relevant noise scales via logit-normal timestep sampling (sketch below)
      • bidirectional flow of information between image and text tokens
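
The timestep sampler behind that bias, as I read the paper (the defaults for the location/scale parameters m and s are illustrative):

#+begin_src python
# Logit-normal timestep sampling: draw u ~ N(m, s) and squash it with a
# sigmoid, concentrating training on intermediate, perceptually relevant t.
import torch

def logit_normal_t(batch, m=0.0, s=1.0):
    return torch.sigmoid(m + s * torch.randn(batch))  # t in (0, 1), peaked mid-range

t = logit_normal_t(4)   # e.g. tensor([0.62, 0.31, 0.55, 0.48]); values vary
#+end_src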

2.5. MULTIPLE DIFFUSION (composable)

  • Any-to-Any Generation via Composable Diffusion (audio, image, text)
  • SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions (synchronizes them) =best=
  • RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
    • mixture-of-experts (MoE) layers; accurately depicts prompts encompassing multiple nouns, adjectives, and verbs (a generic MoE routing sketch follows this list)
    • trained on 1,000 GPUs for two months
  • DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
    • uses multiple GPUs to accelerate diffusion inference while keeping the output coherent
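
A generic top-1 MoE routing sketch, not RAPHAEL's actual space-/time-MoE (layer sizes and the Linear experts are placeholders):

#+begin_src python
# A gate sends each token through one expert; different prompt concepts can
# thus take different "paths" through the network.
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                      # x: (tokens, dim)
        choice = self.gate(x).argmax(-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
#+end_src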

2.5.1. COMPOSITIONAL DIFFUSION

  • Training Data Protection with Compositional Diffusion Models (CDM); parallel training =best=
    • a method to train different diffusion models on distinct data and compose them at inference time (toy sketch below)
  • SegMoE: Segmind Mixture of Diffusion Experts
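
A toy composition of separately trained denoisers (a plain uniform mixture of their noise predictions; CDM's actual composition rule is more involved):

#+begin_src python
# Average the epsilon predictions of independently trained models at each
# denoising step, so no single model had to see all the training data.
def composed_eps(models, x, t):
    return sum(m(x, t) for m in models) / len(models)
#+end_src
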
2.5.1.1. PANGU
  • PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
    • a novel latent diffusion model designed for resource-efficient training and multiple control signals
      • split structure and texture generators
      • cutting data preparation by 48% and reducing training resources by 51%
    • cooperatively use different latent spaces within a unified denoising process
      • multi-control image synthesis

2.6. MULTIMODAL DIFFUSION

  • Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
    • disentanglement of style and semantics, dual- and multi-context blending
    • generate similar expressions from reference text
  • UniDiffuser: marginal, conditional, and joint diffusion (arXiv paper)
    • extra diffusion conditions; perturbs data in all modalities
    • image, text, text-to-image, image-to-text, and image-text pair generation

3. GAN

4. BETTER DECODER

  • Designing a Better Asymmetric VQGAN for StableDiffusion
    • only requires training a new asymmetric decoder for vanilla SD; renders text better
  • k-diffusion: OpenAI’s consistency decoder (HF model) as a k-diffusion v-prediction denoiser
    • supports n>2 step sampling
    • sdxl-diffusion-decoder

Author: Tekakutli

Created: 2024-04-13 Sat 04:35