train
- multiple working-into together:
- diffusion: MULTIPLE DIFFUSION, GIT RE-BASIN
- text: RAD, EFT
- logistic: Auto-Instruct
- Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
- computationally efficient alternatives to numerical solvers
- self-supervised learning of general-purpose representations of PDEs from heterogeneous data
- Q*: New Objective: Q-Learning and Q* - Decision Making Under Uncertainty (CS238/AA228)
- Q-learning parallels biological reward neurocircuitry, reinforcement learning (RL)
- Model-Based Control with Sparse Neural Dynamics (aggressive sparsification) (distillation)
- sparsify it by removing redundant neurons, applicable to a wide variety of DNNs
- Zero Bubble Pipeline Parallelism
- algorithm that finds an optimal schedule for a given model configuration and memory limit
- Fuyou: Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
- training with low-end GPU and limited CPU memory capacity
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
- post-trains an LLM using preference feedback from a teacher model so that it iteratively improves over itself
- marries the simplicity and stability of contrastive learning with the theoretical generality from optimizing general preferences
- From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
- rivaling supervised methods such as Random Forest, Bagging, or Gradient Boosting
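The Q-learning entry in the list above rests on the tabular update Q(s, a) += alpha * (r + gamma * max Q(s', a') - Q(s, a)). A minimal sketch of that update follows, assuming a made-up three-state chain environment; the names `q_update` and `step` and the toy environment are hypothetical and do not come from the course material.

```python
def q_update(q, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place."""
    # Bootstrapped target: no future value once the episode has ended.
    target = reward if terminal else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def step(state, action):
    """Toy chain 0 - 1 - 2: action 0 moves left, 1 moves right; reaching 2 pays 1."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == 2
    return next_state, (1.0 if done else 0.0), done


# Q-table: one row per state, one column per action.
q = [[0.0, 0.0] for _ in range(3)]

# Deterministic sweeps over every (state, action) pair instead of random exploration.
for _ in range(100):
    for state in (0, 1):
        for action in (0, 1):
            next_state, reward, done = step(state, action)
            q_update(q, state, action, reward, next_state, terminal=done)
```

After the sweeps, moving right has the higher value in both non-terminal states, which is the greedy policy one would expect on this chain.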
1. RESEARCH
- Deep neural networks are robust to weight binarization and other non-linear distortions
- 0.68 effective bits per weight (below 1 bit models)
- points to the idea that a stochastic memory element can be used
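A minimal sketch of what weight binarization does, assuming the common sign-and-scale scheme in which every weight becomes +alpha or -alpha with alpha set to the mean absolute weight. This is an illustration only; the paper noted above may use a different scheme, and `binarize` is a hypothetical name.

```python
def binarize(weights):
    """Replace each weight with +alpha or -alpha, where alpha = mean(|w|).

    Assumption: sign-and-scale binarization, chosen for illustration.
    """
    if not weights:
        return []
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


weights = [0.5, -1.5, 1.0, -1.0]
binary = binarize(weights)  # only the signs and one shared scale survive
```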
2. SOFTWARE WISE
- optimizer from 32 bits to 8 bits
- faster matrix multiplication using approximations
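A rough sketch of the idea behind moving optimizer state from 32 bits to 8 bits, assuming plain block-wise absmax linear quantization. Real 8-bit optimizers typically use more elaborate quantization maps; the function names here are hypothetical.

```python
def quantize_block(values):
    """Map floats to integer codes in [-127, 127] plus one float scale per block."""
    scale = max(abs(v) for v in values) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale * 127) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the 8-bit codes and the block scale."""
    return [c / 127 * scale for c in codes]


state = [0.02, -0.5, 0.25, 0.0]          # e.g. one block of momentum values
codes, scale = quantize_block(state)     # stored form: small integers + one scale
restored = dequantize_block(codes, scale)
```

The stored state is one small integer per value plus one scale per block, and the round-trip error per value stays within one quantization step (`scale / 127`).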
3. WITH REWARD
- feedback: FEEDBACK AS TARGET HUMAN FEEDBACK PROPER-ING INSTRUCTIONS
- AlignProp: Aligning Text-to-Image Diffusion Models with Reward Backpropagation
- aligns to reward functions
- CPL: Contrastive Preference Learning: Learning from Human Feedback without RL
- learning optimal policies from preferences without learning reward functions
- regret-based model of human preferences instead of reward
3.1. CLIP AS REWARD
- Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- hand-specified reward function = often infeasible; reward model from human feedback = often very expensive
- VLMs (CLIP) as reward models: a single-sentence text prompt describes the desired task
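The CLIP-as-reward idea above can be sketched as a cosine similarity between the embedding of the current frame and the embedding of the task prompt. The sketch assumes the embeddings were already produced by a CLIP-like model; the vectors below are made-up placeholders and `clip_reward` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_reward(image_embedding, text_embedding):
    """Reward in [-1, 1]: higher when the frame matches the task prompt."""
    return cosine_similarity(image_embedding, text_embedding)


# Placeholder embeddings (assumption: in practice these come from the VLM encoders).
text_emb = [1.0, 0.0, 0.0]
matching_frame = [0.9, 0.1, 0.0]
unrelated_frame = [0.0, 1.0, 0.0]

good = clip_reward(matching_frame, text_emb)
bad = clip_reward(unrelated_frame, text_emb)
```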
3.2. REINFORCEMENT LEARNING
- TD-MPC2: Scalable, Robust World Models for Continuous Control
- a single agent trained to perform 80 tasks across multiple task domains, embodiments, and action spaces
- performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model
3.3. LLM AS REWARD
- Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning
- automates the generation of dense reward functions using an LLM
- Eureka: Human-Level Reward Design via Coding Large Language Models
- generates reward functions that outperform expert human-engineered rewards
- the resulting rewards can be used to acquire complex skills via reinforcement learning (optimization over reward code)
- aimed at sequential decision-making tasks
- in-context RLHF to incorporate feedback and steer and align the reward function
- outer loop: inference-only LLM instructs a learnable NN to refine the reward function
- inner loop: reinforcement learning to train a controller
- pen spinning
4. STRUCTURE
- LORA
- ConvNeXt (vs ViT, for image classification)
- accurate, efficient, scalable and very simple in design
- for: zero-shot image classification, image and text retrieval
- clip convnext: https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft (320 vs 320)
- CNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing
- replacing linear recurrence with a special temporal convolutional network
- permits larger receptive field size with shallower networks
- reduces the computational complexity to O(L)
- PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
- shortcut used to enhance the model nonlinearity, 10% inference speed-up
- nonlinearity is usually studied in convolutional networks for vision tasks
4.1. HYPERPARAMETER
- muP proposes a “right way to scale”: an effective weight init scheme; searching for the optimal hyperparameters
5. CLASSIFIER
5.1. GZIP VS GPT
- are LLMs just text compression algorithms?
- LLMZip: Lossless Text Compression using Large Language Models
- gzip instead of parameters for classification
- “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
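A minimal sketch of the compressor-based classifier noted above: gzip lengths give a normalized compression distance (NCD), and a nearest-neighbour lookup under that distance assigns the label. The sketch assumes k = 1 and toy training texts; the names `clen`, `ncd` and `classify` are hypothetical.

```python
import gzip


def clen(text):
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance: small when a and b share structure."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(query, labelled_examples):
    """Return the label of the training text nearest to the query (1-NN under NCD)."""
    nearest = min(labelled_examples, key=lambda example: ncd(query, example[0]))
    return nearest[1]


# Toy training set (assumption: real use needs many examples per class).
cats = "the cat sat on the mat " * 20
latin = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
examples = [(cats, "cats"), (latin, "latin")]

label = classify("the cat sat on the mat " * 10, examples)
```

No parameters are trained: the only "model" is the compressor plus the stored training texts.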
6. SMALLER
6.1. COMPRESSION
- Knowledge Translation: A New Pathway for Model Compression
- teacher-student model that receives parameters and generates compressed ones
6.2. QUANTIZATION
- DIFFUSION QUANTIZATION
- AdaLoRA adaptively allocates the parameter budget among weight matrices according to their importance (adaptive lora)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
- mixed-precision quantization, eliminates the need for retraining
7. OPTIMIZER
- Lion: optimizer reported to outperform Adam (sign-based update, tracks only momentum)
- Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions
- Kronecker-factored diagonal eigenvalues, Frequent Directions
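For the Lion entry above, a minimal sketch of one update step on plain Python lists, assuming the published update rule (sign of an interpolation between momentum and gradient, decoupled weight decay). `lion_step` is a hypothetical helper, not a library API, and the hyperparameter values are illustrative.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One Lion update; returns new parameters and new momentum."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign.
        update = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (update + weight_decay * p))
        # Momentum is tracked with a second, slower interpolation factor.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Minimise f(x) = x^2 starting from x = 1.0; the gradient is 2x.
params, momentum = [1.0], [0.0]
for _ in range(50):
    grads = [2 * p for p in params]
    params, momentum = lion_step(params, grads, momentum, lr=0.01)
```

Because only the sign is used, every step has magnitude `lr` regardless of gradient scale, and only one state value per parameter is stored.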
8. CHEAPNESS
- One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
- Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- only small subgroups of GPUs require high-bandwidth any-to-any communication within them
9. DATASET
- CAPTIONING
- dimensionality reduction algorithms
- t-SNE and UMAP had long been the favorites
- “Deep TDA” combines self-supervised learning and Topological Data Analysis (TDA)
- unlock new insights from complex datasets
- more robust to noise and outliers in the data
- Gen2Det: Generate to Detect
- directly generating scene-centric images (synthetic)
- improves the performance on rare categories
- Image classification network enhancement methods based on knowledge injection
- knowledge injection dataset to improve interpretability and classification performance of hidden layers
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
- generates a script and corresponding video as a dataset
9.1. MISTAKES
- In-Context Principle Learning from Mistakes
- induce the model to make mistakes, then reflect on these mistakes and learn explicit task-specific “principles” from them that help avoid similar mistakes
9.2. ACTUAL DATASET
- MatSynth: Physically Based Rendering (PBR) materials dataset (4,000 ultra-high resolution)
- FindingEmo: An Image Dataset for Emotion Recognition in the Wild
- annotated dimensions include: valence, arousal and emotion
- English public domain books
9.2.1. HANDS DATASET
- Annotated Hands for Generative Models
- with three additional channels that provide annotations to hands in the image, additional structure
9.3. ENHANCEMENT
- AUDIO VISION
- Learning to Identify Critical States for Reinforcement Learning from Videos
- mask-based sensitivity analysis to extract/identify important critical states
- recognizes relevant states/actions/rewards in untagged videos
- Let’s Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
- uses an LLM to extrapolate the errors made by a small model trained on the synthesized dataset
- GeNIe: Generative Hard Negative Images Through Diffusion (synthetic enhanced dataset)
- generate challenging samples for the target category
- DistDiff: Distribution-Aware Data Expansion with Diffusion Models
- dataset expansion framework based on the distribution-aware diffusion model
- hierarchical prototypes to approximate the real data distribution
9.4. SIMULATION
- madrona-engine: ECS-based game engine that runs 10,000s of environments in parallel on a single GPU
- V-IRL: Grounding Virtual Intelligence in Real Life
- tests foundation models in virtual versions of real-world cities, using geospatial data and street view imagery
10. FINETUNING
- Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- surrogate network to finetune a pretrained model with substantially reduced memory consumption
- comparable performance to conventional finetuning but with significantly less memory usage
- Data-Free Generalized Zero-Shot Learning (using only its CLIP features)
- Gradient Correlation Subspace Learning against Catastrophic Forgetting
- detects a subspace of the weights that is least affected by previous tasks and trains the new task into that subspace
- Evolutionary Optimization of Model Merging Recipes
- facilitates crossdomain merging, automated model composition
- The Unreasonable Ineffectiveness of the Deeper Layers
- identify optimal block of layers to prune by considering similarity across layers
- then, to “heal” the damage, we perform a small amount of finetuning
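The block-selection step of the layer-pruning entry above can be sketched as follows, assuming that similarity is measured by angular distance between the representation entering a block and the one leaving it. The hidden states are made-up placeholders and `best_block_to_prune` is a hypothetical name.

```python
import math


def angular_distance(a, b):
    """Angle between two vectors, scaled to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def best_block_to_prune(hidden_states, block_size):
    """Return the start index n of the block [n, n + block_size) that changes
    the representation least. hidden_states[i] is the input to layer i."""
    candidates = range(len(hidden_states) - block_size)
    return min(
        candidates,
        key=lambda n: angular_distance(hidden_states[n], hidden_states[n + block_size]),
    )


# Placeholder per-layer representations (assumption: averaged over a data sample).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.12, 1.0], [1.0, 1.0]]
start = best_block_to_prune(hidden_states, block_size=2)
```

Layers `start` to `start + block_size - 1` would then be removed, followed by the small healing finetune mentioned above.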