train

1. RESEARCH
2. SOFTWARE WISE
3. WITH REWARD
- 3.1. CLIP AS REWARD
- 3.2. REINFORCEMENT LEARNING
- 3.3. LLM AS REWARD
4. STRUCTURE
- 4.1. HYPERPARAMETER
5. CLASSIFIER
- 5.1. GZIP VS GPT
6. SMALLER
- 6.1. COMPRESSION
- 6.2. QUANTIZATION
7. OPTIMIZER
8. CHEAPNESS
9. DATASET
- 9.1. MISTAKES
- 9.2. ACTUAL DATASET
  - 9.2.1. HANDS DATASET
- 9.3. ENHANCEMENT
- 9.4. SIMULATION
10. FINETUNING
- 10.1. FINETUNES
  - 10.1.1. YOLO
11. GAN ALTERNATIVE

multiple models working together:
- diffusion: MULTIPLE DIFFUSION, GIT RE-BASIN
- text: RAD, EFT
- logistic: Auto-Instruct
Self-Supervised Learning with Lie Symmetries for Partial Differential Equations
- computationally efficient alternatives to numerical solvers
- self-supervised learning of general-purpose representations of PDEs from heterogeneous data
Q*: New Objective: Q-Learning and Q* - Decision Making Under Uncertainty (CS238/AA228)
- Q-learning parallels biological reward neurocircuitry, reinforcement learning (RL)
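A minimal sketch of the tabular Q-learning update, Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a)). The chain environment, its reward and all hyperparameters are assumptions made up for illustration; nothing here comes from the Q* material itself.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, reward 1 on reaching the right end."""
    rng = random.Random(seed)
    # q[s][a]: a=0 moves left, a=1 moves right. The terminal row is never updated and stays 0.
    q = [[0.0, 0.0] for _ in range(n_states)]
    terminal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # epsilon-greedy action choice, random on ties
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # temporal-difference target and update
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

q_table = q_learning()
# greedy action per non-terminal state (1 = move right)
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

The learned values decay by a factor of gamma per step away from the rewarding end, which is why the greedy policy walks right.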
Model-Based Control with Sparse Neural Dynamics (aggressive sparsification) (distillation)
- sparsify it by removing redundant neurons, applicable to a wide variety of DNNs
Zero Bubble Pipeline Parallelism
- algorithm that finds an optimal schedule for a given model config and memory limit
Fuyou: Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
- training with low-end GPU and limited CPU memory capacity
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
- post-trains an LLM using preference feedback from a teacher model, so it iteratively improves on itself
- marries the simplicity and stability of contrastive learning with the theoretical generality from optimizing general preferences
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
- rivaling supervised methods such as Random Forest, Bagging, or Gradient Boosting

1. RESEARCH

Deep neural networks are robust to weight binarization and other non-linear distortions
- 0.68 effective bits per weight (below 1 bit models)
  - points to the idea that a stochastic memory element can be used
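A rough sketch of what weight binarization can look like: a deterministic sign-with-scale version, and a stochastic version whose expected value equals the clipped weight (the "stochastic memory element" idea). Both functions and the example weights are illustrative assumptions, not the paper's training procedure.

```python
import random

def binarize(weights):
    """Deterministic: keep only the sign, scaled by the mean absolute value."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

def stochastic_binarize(weights, rng):
    """Stochastic: clip w to [-1, 1], then emit +1 with probability (w + 1) / 2."""
    out = []
    for w in weights:
        p_plus = (min(1.0, max(-1.0, w)) + 1.0) / 2.0
        out.append(1.0 if rng.random() < p_plus else -1.0)
    return out

weights = [0.8, -0.3, 0.1, -0.6]          # made-up example weights
binary = binarize(weights)

# Averaging many stochastic samples recovers the real-valued weights in expectation.
rng = random.Random(0)
samples = [stochastic_binarize(weights, rng) for _ in range(2000)]
mean_sample = [sum(col) / len(col) for col in zip(*samples)]
```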

2. SOFTWARE WISE

optimizer from 32 bits to 8 bits
- https://github.com/pyg-team/pytorch_geometric
faster matrix using approximations
- https://github.com/dblalock/bolt

3. WITH REWARD

feedback: FEEDBACK AS TARGET HUMAN FEEDBACK PROPER-ING INSTRUCTIONS
AlignProp: Aligning Text-to-Image Diffusion Models with Reward Backpropagation
- aligns to reward functions
CPL: Contrastive Preference Learning: Learning from Human Feedback without RL
- learning optimal policies from preferences without learning reward functions
- regret-based model of human preferences instead of reward

3.1. CLIP AS REWARD

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- reward function = often infeasible (not possible), reward model from human feedback = often very expensive
- VLMs (CLIP) as reward models: a single sentence text prompt describing the desired task
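The idea can be sketched as: reward = cosine similarity between the embedding of the current frame and the embedding of the task prompt. The vectors below are hand-written placeholders standing in for CLIP image/text encoder outputs (an assumption; no real encoder is called).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vlm_reward(image_embedding, prompt_embedding):
    """Zero-shot reward: how well the observed frame matches the task description."""
    return cosine_similarity(image_embedding, prompt_embedding)

# Placeholder embeddings (hypothetical values, not real CLIP outputs).
prompt_embedding = [0.9, 0.1, 0.0]    # e.g. text prompt describing the desired task
frame_on_task = [0.8, 0.2, 0.1]       # frame that looks like the prompt
frame_off_task = [0.0, 0.3, 0.9]      # frame that does not

reward_on = vlm_reward(frame_on_task, prompt_embedding)
reward_off = vlm_reward(frame_off_task, prompt_embedding)
```

In an RL loop this scalar would replace the hand-written or human-feedback reward at each step.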

3.2. REINFORCEMENT LEARNING

TD-MPC2: Scalable, Robust World Models for Continuous Control
- agent to perform 80 tasks across multiple task domains, embodiments, and action spaces
- performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model
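A toy sketch of planning in latent space: sample action sequences, roll them out through the latent model, keep the first action of the best sequence. It uses plain random shooting as a simpler stand-in for the paper's planner, and `latent_dynamics` / `latent_reward` are hypothetical hand-written functions in place of learned networks.

```python
import random

def latent_dynamics(z, action):
    # Hypothetical stand-in for a learned latent transition model.
    return z + action

def latent_reward(z, goal=2.0):
    # Hypothetical stand-in for a learned reward head: closer to the goal is better.
    return -abs(z - goal)

def plan_first_action(z0, horizon=3, n_samples=2000, seed=0):
    """Random shooting: no decoder is needed, everything happens on the latent z."""
    rng = random.Random(seed)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z, total = z0, 0.0
        for a in actions:
            z = latent_dynamics(z, a)
            total += latent_reward(z)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action, best_return

first_action, planned_return = plan_first_action(z0=0.0)
```

Only the first action is executed; the plan is recomputed at the next step (receding horizon).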

3.3. LLM AS REWARD

Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning
- automates the generation of dense reward functions using LLMs
Eureka: Human-Level Reward Design via Coding Large Language Models
- generates reward functions that outperform expert human-engineered rewards
  - agents can then acquire complex skills via reinforcement learning, by optimizing over the generated reward
    - applies to sequential decision-making tasks
  - in-context RLHF to incorporate human feedback, steering and aligning the reward function
- outer loop: inference-only LLM instructs a learnable NN to refine the reward function
  - inner loop: reinforcement learning to train a controller
- pen spinning

4. STRUCTURE

LORA
ConvNeXt (vs ViT, for image classification)
- accurate, efficient, scalable and very simple in design
  - for: zero-shot image classification, image and text retrieval
- clip convnext: https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft (320 vs 320)
CNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing
- replacing linear recurrence with a special temporal convolutional network
  - permits larger receptive field size with shallower networks
  - reduces the computational complexity to O(L)
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
- shortcut used to enhance the model nonlinearity, 10% inference speed-up
- nonlinearity enhancement is common in convolutional networks for vision tasks

4.1. HYPERPARAMETER

muP proposes the “right way to scale”: an effective weight init scheme that helps when searching for the optimal hyperparameters

5. CLASSIFIER

5.1. GZIP VS GPT

are LLMs just text compression algorithms?
- LLMZip: Lossless Text Compression using Large Language Models
- gzip instead of parameters for classification
  - “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
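A minimal sketch of the compressor-based classifier: normalized compression distance (NCD) computed with gzip, then nearest neighbour over labelled texts. The example sentences and labels are made up; very short texts make the distance noisy.

```python
import gzip

def clen(text):
    """Compressed length in bytes."""
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a, b):
    """Normalized compression distance: small when a and b share structure."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query, labelled, k=1):
    """k-nearest-neighbour vote over (text, label) pairs, using NCD as the distance."""
    nearest = sorted(labelled, key=lambda item: ncd(query, item[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Made-up training texts.
sports = "the striker scored a late goal and the home team won the football match in extra time"
finance = "the central bank raised interest rates as inflation and bond yields climbed again this quarter"
train = [(sports, "sports"), (finance, "finance")]

prediction = classify("the goalkeeper saved a penalty and the team won the football match", train)
```

No parameters are trained: the only "model" is the compressor plus the labelled examples.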

6. SMALLER

6.1. COMPRESSION

Knowledge Translation: A New Pathway for Model Compression
- teacher-student model that receives parameters and generates compressed ones

6.2. QUANTIZATION

DIFFUSION QUANTIZATION
AdaLoRA adaptively allocates the parameter budget among weight matrices according to their importance (adaptive lora)
FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search
- mixed-precision quantization, eliminates the need for retraining

7. OPTIMIZER

Lion: optimizer reported to beat Adam; tracks only momentum and applies sign-based updates
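A sketch of one Lion step on plain Python lists, following the update rule as published: take the sign of an interpolation between momentum and gradient, then update the momentum with a second coefficient. The function name and list-based layout are assumptions for illustration; real use would go through an optimizer library.

```python
def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update; returns new parameters and new momentum."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        c = beta1 * m + (1 - beta1) * g            # interpolate momentum and gradient
        sign = (c > 0) - (c < 0)                   # -1, 0 or +1
        new_params.append(p - lr * (sign + weight_decay * p))
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum

# Gradients of very different size produce steps of the same size (lr each).
new_params, new_momentum = lion_step(
    params=[1.0, -2.0], grads=[0.5, -0.0025], momentum=[0.0, 0.0], lr=0.1
)
```

Because the step is a sign, only the momentum buffer has to be stored, one state per parameter instead of Adam's two.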
Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions
- Kronecker-factored diagonal eigenvalues, Frequent Directions

8. CHEAPNESS

One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
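A small numerical check of the connection behind that title, under simplifying assumptions (least-squares loss, weights initialised at zero, no preconditioning): one gradient step on the in-context examples gives the same prediction as a linear-attention readout with keys x_i, values y_i and the query as x_query. It illustrates the equivalence only, not the paper's optimality proof; the data is made up.

```python
def gd_one_step_prediction(xs, ys, x_query, lr):
    """One gradient-descent step on 0.5 * sum((w.x - y)^2) from w = 0, then predict."""
    dim = len(x_query)
    w = [0.0] * dim
    grad = [0.0] * dim
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j in range(dim):
            grad[j] += residual * x[j]
    w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return sum(wi * xi for wi, xi in zip(w, x_query))

def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: sum over examples of value * (key . query), scaled by lr."""
    return lr * sum(y * sum(k * q for k, q in zip(x, x_query)) for x, y in zip(xs, ys))

xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]   # in-context inputs
ys = [1.0, -2.0, 0.5]                          # in-context targets
x_query = [0.2, 0.7]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```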
Optimized Network Architectures for Large Language Model Training with Billions of Parameters
- only small subgroups of GPUs require high-bandwidth any-to-any communication within them

9. DATASET

CAPTIONING
dimensionality reduction algorithms
- t-SNE and UMAP had long been the favorites
- “Deep TDA” combines self-supervised learning and Topological Data Analysis (TDA)
  - unlock new insights from complex datasets
  - more robust to noise and outliers in the data
Gen2Det: Generate to Detect
- directly generating scene-centric images (synthetic)
- improves the performance on rare categories
Image classification network enhancement methods based on knowledge injection
- knowledge injection dataset to improve interpretability and classification performance of hidden layers
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
- generate a script and corresponding video as dataset

9.1. MISTAKES

In-Context Principle Learning from Mistakes
- induce the model to make mistakes, then reflect on these mistakes and learn explicit task-specific “principles” from them, which help solve similar problems and avoid the same mistakes

9.2. ACTUAL DATASET

MatSynth: Physically Based Rendering (PBR) materials dataset (4,000 ultra-high resolution)
FindingEmo: An Image Dataset for Emotion Recognition in the Wild
- annotated dimensions include: valence, arousal and emotion
English public domain books

9.2.1. HANDS DATASET

Annotated Hands for Generative Models
- with three additional channels that provide annotations to hands in the image, additional structure

9.3. ENHANCEMENT

AUDIO VISION
Learning to Identify Critical States for Reinforcement Learning from Videos
- mask-based sensitivity analysis to extract/identify important critical states
- recognize relevant states/actions/rewards from untagged videos
Let’s Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models
- uses an LLM to extrapolate the errors made by a small model trained on the synthesized dataset
GeNIe: Generative Hard Negative Images Through Diffusion (synthetic enhanced dataset)
- generate challenging samples for the target category
DistDiff: Distribution-Aware Data Expansion with Diffusion Models
- dataset expansion framework based on the distribution-aware diffusion model
- hierarchical prototypes to approximate the real data distribution

9.4. SIMULATION

madrona-engine: ECS-based game engine that runs 10,000s of environments in parallel on a single GPU
V-IRL: Grounding Virtual Intelligence in Real Life
- test foundation models in virtual real world cities, geospatial data and street view imagery

10. FINETUNING

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- surrogate network to finetune a pretrained model with substantially reduced memory consumption
- comparable performance to conventional finetuning but with significantly less memory usage
Data-Free Generalized Zero-Shot Learning (using only its CLIP features)
Gradient Correlation Subspace Learning against Catastrophic Forgetting
- detects a subspace of the weights that is least affected by previous tasks, and trains the new task into said subspace
Evolutionary Optimization of Model Merging Recipes
- facilitates cross-domain merging, automated model composition
The Unreasonable Ineffectiveness of the Deeper Layers
- identify optimal block of layers to prune by considering similarity across layers
  - then, to “heal” the damage, we perform a small amount of finetuning
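The block selection can be sketched as: for each candidate start layer, measure how much the representation changes across the next n layers, and prune the block where it changes least. Angular distance is used here as the similarity measure (an assumption about the exact metric), and the hidden states are tiny made-up vectors rather than real activations.

```python
import math

def angular_distance(u, v):
    """Angle between two vectors, normalised to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos) / math.pi

def best_block_to_prune(hidden_states, block_size):
    """hidden_states[l] is the input to layer l; the last entry is the final output.

    Returns the start index l of the block [l, l + block_size) whose input and
    output representations are most similar.
    """
    candidates = range(len(hidden_states) - block_size)
    return min(
        candidates,
        key=lambda l: angular_distance(hidden_states[l], hidden_states[l + block_size]),
    )

# Made-up representations for a 4-layer model (5 states).
hidden_states = [[1.0, 0.0], [0.6, 0.8], [0.5, 0.85], [0.45, 0.9], [-0.2, 1.0]]
start = best_block_to_prune(hidden_states, block_size=2)
```

After removing layers `start` to `start + block_size - 1`, the short finetuning pass is what "heals" the remaining mismatch.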

10.1. FINETUNES

10.1.1. YOLO

https://docs.ultralytics.com/yolov5/tutorials/train_custom_data/#train-on-custom-data

11. GAN ALTERNATIVE

ONE STEP DIFFUSION
