diffusion train
- parent: stablediffusion train
- BETTER DECODER blue noise: NOISE CONTROL
- 400x (and use vae leafing to make big)
- Diffusers Compatible SDXL Unet Rewrite (520 lines)
- ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
- scaling the coefficients of the long skip connections (LSCs, which connect distant blocks) in the UNet to improve its training stability
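A minimal sketch of the skip-scaling idea, assuming a constant-scaling rule in which the i-th long skip connection is multiplied by kappa**i with kappa at most 1. The function names, the default kappa, and the additive merge are illustrative assumptions, not the paper's code; a real UNet usually concatenates skip features instead of adding them.

```python
def skip_scales(num_skips, kappa=0.7):
    """One coefficient per long skip connection: kappa**i for i = 0..num_skips-1.

    kappa is an assumed hyperparameter; kappa = 1 recovers the unscaled UNet.
    """
    if not 0.0 < kappa <= 1.0:
        raise ValueError("kappa must be in (0, 1]")
    return [kappa ** i for i in range(num_skips)]


def merge_skip(decoder_feat, encoder_feat, scale):
    """Add a scaled encoder feature to the decoder feature, element-wise."""
    return [d + scale * e for d, e in zip(decoder_feat, encoder_feat)]


scales = skip_scales(4, kappa=0.5)
merged = merge_skip([1.0, 1.0], [2.0, 4.0], scales[1])
```

With kappa below 1, each successive skip connection in this toy ordering contributes less to the decoder than the one before it.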
- Cas-DM: Bring Metric Functions into Diffusion Models (incorporating additional metric functions, objectives)
- Quantum Denoising Diffusion Models
- explores integrating variational quantum circuits to augment efficacy of diffusion
- MPI: Masked Pre-trained Model Enables Universal Zero-shot Denoiser
- spontaneously attains the underlying potential for strong image denoising
- Simplified Diffusion Schrödinger Bridge
- simplification of the Diffusion Schrödinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs)
1. RLCM
- RL for Consistency Models, Faster Reward Guided Text-to-Image Generation
- to optimize for task-specific rewards and enable fast training and inference, the authors propose fine-tuning consistency models via RL
- Reinforcement Learning for Consistency Model (RLCM)
- handles objectives that are challenging to express with prompting, like image compressibility and human feedback
2. IDEAS
2.1. REMEMBER
2.2. BEFORE-AFTER
- Switch EMA: A Free Lunch for Better Flatness and Sharpness
- switching the EMA parameters to the original model after each epoch, dubbed as Switch EMA (SEMA)
- free lunch by boosting convergence speeds
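The Switch EMA step can be sketched as follows, assuming plain SGD on a list of scalar parameters. The optimizer, learning rate, decay value, and function names are placeholders chosen for illustration; only the copy-back after each epoch comes from the note above.

```python
def ema_update(ema, params, decay=0.9):
    """Standard exponential moving average of the parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


def train_with_sema(params, grads_per_epoch, lr=0.1, decay=0.9):
    """Toy training loop; grads_per_epoch is a list of epochs, each a list of gradient vectors."""
    ema = list(params)
    for epoch_grads in grads_per_epoch:
        for grad in epoch_grads:
            params = [p - lr * g for p, g in zip(params, grad)]  # SGD step
            ema = ema_update(ema, params, decay)
        # The "switch": EMA weights are copied back into the model after each epoch.
        params = list(ema)
    return params, ema


final_params, final_ema = train_with_sema([0.0], [[[1.0]], [[1.0]]])
```

Regular EMA would keep `params` and `ema` separate for the whole run; the only difference here is the single assignment at the end of each epoch.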
- Rolling Diffusion Model (VIDEO)
- a sliding window denoising process
- more noise to frames that appear later in a sequence
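A rough sketch of the sliding-window schedule, assuming a linear ramp of noise levels across the window. The ramp shape and the function names are assumptions made for illustration; frames are represented by their indices only.

```python
def window_noise_levels(window_size, t=0.0):
    """Noise level in [0, 1] per frame in the window; later frames get more noise.

    t runs from 1 down to 0 during one denoising pass over the window.
    """
    return [min(1.0, max(0.0, (k + t) / window_size)) for k in range(window_size)]


def roll(window, next_frame):
    """Slide by one: emit the first (now clean) frame and append a new pure-noise frame."""
    finished = window[0]
    return finished, window[1:] + [next_frame], next_frame + 1


start = window_noise_levels(4, t=1.0)   # levels when a pass begins
end = window_noise_levels(4, t=0.0)     # levels when the pass ends
finished, window, next_frame = roll([0, 1, 2, 3], 4)
```

At the end of a pass the first frame reaches noise level 0 and leaves the window, and the frame that enters starts at level 1, so the ramp lines up again for the next pass.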
2.3. ONLY ONCE
- Fixed Point Diffusion Models
- reallocating computation across timesteps and reusing fixed point solutions between timesteps
- 87% fewer parameters, consumes 60% less training memory
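Reusing fixed point solutions between timesteps can be illustrated with a scalar toy problem. The contractive map below is a hypothetical stand-in for the implicit layer, not the model from the paper; the point is only that warm-starting from the previous timestep's solution needs fewer iterations than starting cold.

```python
import math


def solve_fixed_point(f, x0, tol=1e-8, max_iter=1000):
    """Iterate x <- f(x) until successive iterates differ by less than tol."""
    x = x0
    for i in range(1, max_iter + 1):
        x_new = f(x)
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    return x, max_iter


def layer(t):
    """Hypothetical contractive map conditioned on the timestep t."""
    return lambda x: 0.5 * math.cos(x) + t


cold_iters = 0
warm_iters = 0
x_prev = 0.0
for t in [1.0, 0.99, 0.98, 0.97]:
    _, n_cold = solve_fixed_point(layer(t), 0.0)          # start from scratch
    x_prev, n_warm = solve_fixed_point(layer(t), x_prev)  # reuse previous solution
    cold_iters += n_cold
    warm_iters += n_warm
```

Adjacent timesteps have nearby solutions, so the warm-started solver begins close to convergence.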
- Analyzing and Improving the Training Dynamics of Diffusion Models
- network layers redesigned, giving better networks at equal computational complexity
- precise tuning of EMA length without the cost of performing several training runs
2.4. CONTEXT
- ConPreDiff: Improving Diffusion-Based Image Synthesis with Context Prediction (better zeroshot)
- Any-Shift Prompting for Generalization over Distributions
- encode the distribution information and their relationships
- guide the generalization of the CLIP image-language model from training to any test distribution
3. PRIORS
3.1. STRUCTURE
3.2. VAE TRAINING
- Deconstructing Denoising Diffusion Models for Self-Supervised Learning
- gradually transforming a Denoising Diffusion Model (DDM) into a classical Denoising Autoencoder (DAE) (VAE)
- FLAWED: the VAE used for Stable Diffusion 1.x/2.x and other models (KL-F8) has a critical flaw, probably due to bad training; it needs a replacement trained from scratch, like SDXL's
=best=
- the encoder is having to do a lot of extra work to get around the bad latent space
3.3. 3D INCORPORATED
4. DISTRIBUTED TRAINING
- distributed-diffusion using hivemind (distributed training) vs Deepspeed
- COMPOSITIONAL DIFFUSION
- SiT discrete transformers
5. DIFFUSION QUANTIZATION
- 4, 8 bit models, Q-Diffusion insight reddit quantization
- Memory-Efficient Personalization using Quantized Diffusion Model (enhancing it)
- Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
- align outputs of the quantized model and the full-precision model at different network granularity
- QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
- finetuning the quantized model to better adapt to the activation distribution (mitigation)
- Task-Oriented Diffusion Model Compression
- satisfactory output quality with 39.2% and 56.4% reduction in model footprint and 81.4% and 68.7%
- applying it to InstructPix2Pix and StableSR
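For orientation, the low-bit setting these entries work in can be sketched with plain round-to-nearest uniform quantization. This is a generic symmetric per-tensor scheme assumed for illustration, not the method of any paper listed above; those methods aim to shrink the gap that the last lines measure.

```python
def quantize(values, bits=8):
    """Symmetric uniform quantization to signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to floating point."""
    return [qi * scale for qi in q]


def max_error(values, bits):
    """Largest absolute gap between full-precision and quantized values."""
    q, scale = quantize(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(q, scale)))


weights = [-1.0, -0.3, 0.05, 0.42, 0.9]
err8 = max_error(weights, bits=8)
err4 = max_error(weights, bits=4)
```

The 4-bit error is much larger than the 8-bit one on the same weights, which is why low-bit models need extra alignment or fine-tuning.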
6. ACADEMIC
- GIT RE-BASIN: MERGING MODELS MODULO PERMUTATION SYMMETRIES
- transfer knowledge from a teacher model to a student model
- Idempotent Generative Network
- f(f(z))=f(z), can generate an output in one step
- step towards a “global projector” = projecting any input into a target data distribution
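The property f(f(z))=f(z) can be checked numerically. The clamp below is a hand-written projection used only to show what idempotence means; the paper trains a neural network to behave this way, which this toy does not attempt.

```python
def project_to_unit_interval(z):
    """Clamp every coordinate to [0, 1]; applying it twice changes nothing."""
    return [min(1.0, max(0.0, v)) for v in z]


def halve(z):
    """Counterexample: not idempotent, every application shrinks the input."""
    return [v / 2.0 for v in z]


def idempotence_gap(f, z):
    """Largest difference between f(z) and f(f(z)); zero for an idempotent map."""
    once = f(z)
    twice = f(once)
    return max(abs(a - b) for a, b in zip(once, twice))


gap_projection = idempotence_gap(project_to_unit_interval, [-0.5, 0.3, 2.0])
gap_halve = idempotence_gap(halve, [2.0])
```

A gap of this kind, driven to zero during training, is the sense in which such a model can generate an output in one step.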
7. CLIP RELATED
- uform: clip not required, trained in a day
- cloneofsimo: learning from the clip
- wanna perform affordable kernel regression on l2-normalized data?
- get yourself Spherical Random Features for Polynomial Kernels
- relevant if you are aiming for large scale non-parametric regression on CLIP projected feature spaces
8. CHEAPER TRAINING
- Efficient Diffusion Training via Min-SNR Weighting Strategy
- slow convergence comes from conflicting optimization directions between timesteps; the Min-SNR weighting converges 3.4 times faster
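A small sketch of the Min-SNR weighting for a noise-prediction loss: each timestep's loss is multiplied by min(SNR, gamma) / SNR. The linear beta schedule, the 1000 steps, and gamma = 5 are assumptions chosen for illustration.

```python
def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def snr(alpha_bar):
    """Signal-to-noise ratio of a noised sample at a given timestep."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(snr_t, gamma=5.0):
    """Loss weight for an epsilon-prediction objective, clamped at gamma."""
    return min(snr_t, gamma) / snr_t


weights = [min_snr_weight(snr(a)) for a in alpha_bar_schedule(1000)]
```

Low-noise timesteps (very high SNR) are down-weighted, while high-noise timesteps keep weight 1, so no single range of timesteps dominates the gradient.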
- Imagen suggests that scaling the text encoder is much more impactful than scaling the UNet
- at least for diffusion models
- MosaicML: custom $50k stable diffusion training, reddit post
- Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
- compressed-stable-diffusion 36% reduced parameters and latency
- Wuerstchen: Efficient Pretraining of Text-to-Image Models
- 16 times faster to train, 2 times faster inference, only 9200 GPU hours (42 times compression rate vs 8 for SD)
- DREAM: Diffusion Rectification and Estimation-Adaptive Models (requiring minimal code changes)
- 2 to 3 times faster training convergence
- PERCEPTUAL LOSS
=best=
9. DIFFERENT ARCHITECTURE
- faster using electric flow-charges
https://www.assemblyai.com/blog/an-introduction-to-poisson-flow-generative-models/
- better than inference: https://twitter.com/_akhaliq/status/1620958983639924736 https://arxiv.org/pdf/2302.00482.pdf
- Spectral Diffusion: slim Standard Diffusion, 20 times smaller in size
- Wavelet diffusion code
- Wavelet Diffusion Models are fast and scalable Image Generators
- Score-Based Diffusion Models as Principled Priors for Inverse Imaging (more complex priors)
- COMPOSITIONAL DIFFUSION, DIFFUSION TRANSFORMER
10. DATASET MANIPULATION
- Shifted Diffusion (Corgi) for Text-to-image Generation: from CLIP straight to diffusion, only 1.7% of the images required captions
- Object Detection: CutLER
- D3S: Invariant Learning via Diffusion Dreamed Distribution Shifts, separating foreground-background
- disentangling foreground from background by chopping-pasting them out in the synthetic training dataset
- like SVDiff
- A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
- automatic captioning is better than crawled low quality captions
- CapsFusion: Rethinking Image-Text Data at Scale
- hindered by simplistic captioners, consolidate and refine information
10.1. BATCH STRUCTURE
- Structure-Guided Adversarial Training of Diffusion Models
- compel the model to learn manifold structures between samples in each training batch
10.1.1. ATLAS
- IMAGE CLUSTERING
- Neural Congealing: Aligning Images to a Joint Semantic Atlas
- zero-shot learning of concept shapes
- ASIC: Aligning Sparse in-the-wild Image Collections
- Ablating Concepts in Text-to-Image Diffusion Models (adobe)
10.2. MASKS
- masking to accelerate learning VQ-Diffusion https://arxiv.org/pdf/2111.14822.pdf
- DeepMIM: Deep Supervision for Masked Image Modeling
- pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme.
- MDT: Masked Diffusion Transformer (3 times faster)
- Predicting masked tokens in stochastic locations improves masked image modeling
- learning features that are more robust to location uncertainties; Masked Image Modeling (MIM)
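The mask-and-predict setup shared by these entries can be sketched as a random split of patch tokens into visible inputs and hidden targets. The 0.75 mask ratio, the seed, and the function names are assumptions for illustration; the stochastic-location variant in the last entry is not modeled.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return a list of booleans, True where the patch is hidden from the model."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def split_tokens(tokens, mask):
    """Visible tokens are fed to the encoder; masked tokens become prediction targets."""
    visible = [t for t, m in zip(tokens, mask) if not m]
    targets = [t for t, m in zip(tokens, mask) if m]
    return visible, targets


tokens = list(range(16))  # stand-ins for 16 patch embeddings
mask = random_patch_mask(len(tokens), mask_ratio=0.75, seed=0)
visible, targets = split_tokens(tokens, mask)
```

Because the encoder only processes the visible quarter of the tokens in this toy setting, each training step is cheaper, which is one source of the speedups quoted above.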
11. MATHEMATICAL (COPY PASTED COMMENT YET TO ANALYZE)
I have recently written a paper on understanding transformer learning via the lens of coinduction & Hopf algebra. https://arxiv.org/abs/2302.01834
The learning mechanism of transformer models was poorly understood; however, it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call in the paper Hopf coherence which happens within the single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I’m working on a next gen Hopf algebra based machine learning framework.
Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW