transformer
- parent: train
- Bytes Are All You Need: Transformers Operating Directly On File Bytes
- NoPE: don't use positional encoding (PE) in Transformer decoders (GPTs)
- Meta-Transformer: A Unified Framework for Multimodal Learning
- a unified data tokenizer, a modality-shared encoder, and task-specific heads
1. IMPROVEMENTS ON
1.1. FASTER
- CoLT5: Faster Long-Range Transformers with Conditional Computation
- strong gains up to 64k input length
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
- reduces compute and memory, 4 to 8 times fewer attention matrices
- Agent Attention: balance between computational efficiency and representation power
- generalized linear attention, integrated with softmax; preserving global context modelling capability
1.2. CONTEXT
- in-context transformers (Algorithm Distillation): learn reinforcement learning from their own learning histories
- Elastic Decision Transformer
- using the full history of states as input for decisions is not optimal; a shorter history works better
- Cached Transformers: Improving Transformers with Differentiable Memory Cache
- Gated Recurrent Cache (GRC) extends the self-attention mechanism with a differentiable memory cache (see the sketch after this list)
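A rough sketch of the gated-cache idea: summaries of new tokens are written into a fixed-size memory through a learned gate, and self-attention then attends over the current tokens plus the cache. The names (`cache`, `W_gate`) and the exact update rule are illustrative assumptions, not the paper's GRC formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cache_update(cache, segment_summary, W_gate):
    """Gated write into a fixed-size differentiable memory cache (illustrative).

    cache:           (cache_len, d) persistent memory slots
    segment_summary: (cache_len, d) summary of the current segment's hidden states
    W_gate:          (2*d, d) projection producing per-slot, per-channel gates
    """
    gate = sigmoid(np.concatenate([cache, segment_summary], axis=-1) @ W_gate)
    # interpolate between old memory and the new summary, GRU-style
    return gate * cache + (1.0 - gate) * segment_summary
```

Keys and values from the cache would then be concatenated with those of the current segment before attention; that part is omitted here.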
2. ABOUT ATTENTION
- What are Q, K, V in multi-head attention? (see the sketch after this list)
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
- two-dimensional convolutions to jointly encode the source-target sequences (translation)
- On the Turing Completeness of Modern Neural Network Architectures; Attention is Turing-Complete
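To make the Q/K/V question above concrete, a minimal NumPy sketch of multi-head scaled dot-product attention; the shapes and weight names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads

    def project(W):  # project, then split into (heads, seq, d_head)
        return (x @ W).reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(Wq), project(Wk), project(Wv)      # queries, keys, values
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each query attends over all keys
    out = weights @ v                                    # weighted sum of values
    out = out.transpose(1, 0, 2).reshape(seq, d_model)   # concatenate heads
    return out @ Wo
```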
3. SHAPES
- Star-Transformer: https://arxiv.org/abs/1902.09113
- Hungry Hungry Hippos: State Space Models
- next: Hyena Hierarchy: Towards Larger Convolutional Language Models (gating plus long convolutions instead of attention)
- simpler transformer: One Wide Feedforward is All You Need
- Attention captures interdependencies between tokens; the Feed Forward Network is removed from the decoder (fewer parameters)
- Approximating Two-Layer Feedforward Networks for Efficient Transformers
- Mixtures of Experts (MoEs) vs dense transformers, more resource efficient
3.1. BEHAVIORAL TRANSFORMER
- PASTA: Pretrained Action-State Transformer Agents
- self-supervised reinforcement learning
- learns behavioral and sensor-adaptation trajectories
- no need to tailor pretraining to specific downstream applications
3.2. ALTERNATIVE
- Retentive Network: A Successor to Transformer for Large Language Models (RetNet)
- low cost inference, training parallelism, strong performance
- Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
- potential to streamline complex architectures for sequence-to-sequence tasks
- WARM: On the Benefits of Weight Averaged Reward Models
- averages the weights of multiple reward models to handle inconsistencies
3.2.1. RNN
3.2.2. STATE SPACE
- H3 (Hungry Hungry Hippos): state space layers instead of attention
- Perceiver: a small set of latents instead of full self-attention over the input
- Repeat After Me: Transformers are Better than State Space Models at Copying
- GSSMs (generalized state space models): fixed-size latent state that doesn't depend on sequence length (see the recurrence sketch after this list)
- limited compared to transformer models on tasks that require copying from the input context
- repeat after me, dilated convolutions are better than transformers?
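A minimal sketch of the fixed-size-latent-state argument: a linear state space recurrence keeps a constant-size hidden state h however long the input is, which is why exact copying of long contexts is hard. A, B, C follow the usual SSM naming, not any specific paper's parameterization.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run y_t = C h_t with h_t = A h_{t-1} + B x_t over a sequence.

    A: (n, n), B: (n, d), C: (d, n); x: (seq, d).
    The latent h always has size n, independent of sequence length.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # fixed-size state update
        ys.append(C @ h)      # readout
    return np.stack(ys)
```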
3.2.2.1. MAMBA
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- architecture without attention or MLP
- allowing the model to selectively propagate or forget information (see the sketch below)
- outperforms Transformers of the same size, matches Transformers twice its size
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- comparable performance to LLaVA with about 43% of the number of parameters
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
- data-dependent weights using a dual selection mechanism across tokens and channels
- Jamba: A Hybrid Transformer-Mamba Language Model
- interleaves blocks of Transformer and Mamba layers
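A toy illustration of the "selective" part: how much state is kept per channel depends on the current input, so the model can propagate or overwrite information token by token. This omits Mamba's actual continuous-time parameterization (Δ, A, B, C) and its hardware-aware parallel scan; the weight names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_recurrence(x, W_keep, W_in, W_out):
    """x: (seq, d); W_keep, W_in, W_out: (d, d). One state channel per feature."""
    h = np.zeros(x.shape[1])
    ys = []
    for x_t in x:
        keep = sigmoid(x_t @ W_keep)                        # input-dependent: how much state to retain
        h = keep * h + (1.0 - keep) * np.tanh(x_t @ W_in)   # selectively propagate or forget
        ys.append(h @ W_out)
    return np.stack(ys)
```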
3.3. ATTENTION FREE
- gzip vs attention: GZIP VS GPT
- SimpleTRON: Simple Transformer with O(N) Complexity (no transformer)
- vs Metaformer (poolformer, pureformer)
- maybe not the same: the github
3.4. COMPOSABLE TRANSFORMERS
- Composable Function-preserving Expansions for Transformer Architectures
=best=
- training pipelines for larger models by progressively expanding the architecture throughout training
4. TRANSFORMER VISION
- parent: computervision
- SEED
- DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
- ConvNets Match Vision Transformers at Scale
- match the performance of Vision Transformers with comparable compute budgets
- Denoising Vision Transformers: removes artifacts and improves feature quality
4.1. VIDEO
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- memory mechanism combining a rapidly updated short-term memory with a sustained long-term memory
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
- video visual recognition = computational savings
- identifying the tokens that have changed significantly
- can be converted from existing transformers
- ProPainter: Improving Propagation and Transformer for Video Inpainting
- inpainting = make mask, then remove
- image and feature warping, discard redundant tokens, attention to distant frames
4.2. TOKENS
- LongNet: Scaling Transformers to 1,000,000,000 Tokens
- SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
- codebook for video tokens, not optical flow, 49 tokens
- SeiT: Storage-efficient Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
- Token-based Storage
4.2.1. VISUAL TOKENS
- Subobject-level Image Tokenization
- subobjects are represented by semantically meaningful image segments obtained by segmentation models
4.2.1.1. SEED
- SEED: Planting a SEED of Vision in Large Language Model
- unify visual and textual representations
- Image tokens independent of 2D patch positions, only 1D causal dependency
- high-level semantics consistent with semantic abstraction of words
- SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer (autoregressive Transformer)
- comprehension and generation of images
4.3. HIERARCHICAL DETAILS
- Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
- FasterViT: Fast Vision Transformers with Hierarchical Attention
- Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- From Sparse to Soft Mixtures of Experts; soft alternative to sparse MoE Transformers
- passes different weighted combinations of all input tokens to each expert (see the sketch below)
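A compact sketch of the soft-mixture idea from the last item: each expert slot receives a softmax-weighted mixture of all tokens (dispatch), and each token gets back a softmax-weighted mixture of slot outputs (combine). Expert bodies are plain linear maps for brevity and one slot per expert is assumed; this is a reading of the Soft MoE paper, not its reference implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, expert_weights):
    """tokens: (seq, d); phi: (d, num_slots); expert_weights: (num_slots, d, d)."""
    logits = tokens @ phi                  # (seq, num_slots) token-slot affinities
    dispatch = softmax(logits, axis=0)     # per slot: a distribution over tokens
    combine = softmax(logits, axis=1)      # per token: a distribution over slots
    slot_inputs = dispatch.T @ tokens      # (num_slots, d) soft mixtures of tokens
    slot_outputs = np.einsum('sd,sde->se', slot_inputs, expert_weights)  # each expert processes its slot
    return combine @ slot_outputs          # (seq, d) mixed back per token
```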