transformer
- parent: train
- Bytes Are All You Need: Transformers Operating Directly On File Bytes
- NoPE: don't use positional encoding (PE) in Transformer decoders (GPTs)
- Meta-Transformer: A Unified Framework for Multimodal Learning
- a unified data tokenizer, a modality-shared encoder, and task-specific heads
1. IMPROVEMENTS ON
1.1. FASTER
- CoLT5: Faster Long-Range Transformers with Conditional Computation
- strong gains up to 64k input length
- SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
- reduces compute and memory, 4 to 8 times fewer attention matrices
- Agent Attention: balance between computational efficiency and representation power
- generalized linear attention, integrated with softmax; preserving global context modelling capability
1.2. CONTEXT
- in-context transformers (Algorithm Distillation): learn reinforcement learning from their own learning histories
- Elastic Decision Transformer
- using the full history of states as input for decisions is not optimal; a shorter history works better
- Cached Transformers: Improving Transformers with Differentiable Memory Cache
- Gated Recurrent Cache (GRC) extends the self-attention mechanism with a differentiable memory cache (see the sketch after this list)
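A rough sketch of the gated-cache idea: summaries of new tokens are written into a fixed-size memory through a learned gate, and self-attention then attends over the current tokens plus the cache. The names (`cache`, `W_gate`) and the exact update rule are illustrative assumptions, not the paper's GRC formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cache_update(cache, segment_summary, W_gate):
    """Gated write into a fixed-size differentiable memory cache (illustrative).

    cache:           (cache_len, d) persistent memory slots
    segment_summary: (cache_len, d) summary of the current segment's hidden states
    W_gate:          (2*d, d) projection producing per-slot, per-channel gates
    """
    gate = sigmoid(np.concatenate([cache, segment_summary], axis=-1) @ W_gate)
    # interpolate between old memory and the new summary, GRU-style
    return gate * cache + (1.0 - gate) * segment_summary
```

Keys and values from the cache would then be concatenated with those of the current segment before attention; that part is omitted here.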
2. ABOUT ATTENTION
- What are Q, K, V in multi-head attention? (see the sketch after this list)
- SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
- two-dimensional convolutions to jointly encode the source-target sequences (translation)
- On the Turing Completeness of Modern Neural Network Architectures; Attention is Turing-Complete
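To make the Q/K/V question above concrete, a minimal NumPy sketch of multi-head scaled dot-product attention; the shapes and weight names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // num_heads

    def project(W):  # project, then split into (heads, seq, d_head)
        return (x @ W).reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(Wq), project(Wk), project(Wv)      # queries, keys, values
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each query attends over all keys
    out = weights @ v                                    # weighted sum of values
    out = out.transpose(1, 0, 2).reshape(seq, d_model)   # concatenate heads
    return out @ Wo
```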
3. SHAPES
- Star-Transformer: https://arxiv.org/abs/1902.09113
- Hungry Hungry Hippos: State Space Models
- next: Hyena Hierarchy: Towards Larger Convolutional Language Models (gating plus long convolutions instead of attention)
- simpler transformer: One Wide Feedforward is All You Need
- Attention captures interdependencies between tokens; the Feed Forward Network is removed from the decoder (fewer parameters)
- Approximating Two-Layer Feedforward Networks for Efficient Transformers
- Mixtures of Experts (MoEs) vs dense transformers, more resource efficient
3.1. BEHAVIORAL TRANSFORMER
- PASTA: Pretrained Action-State Transformer Agents
- self-supervised reinforcement learning
- learns behavioral and sensor-adaptation trajectories
- no need to tailor pretraining to specific downstream applications
3.2. ALTERNATIVE
- Retentive Network: A Successor to Transformer for Large Language Models (RetNet)
- low cost inference, training parallelism, strong performance
- Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
- potential to streamline complex architectures for sequence-to-sequence tasks
- WARM: On the Benefits of Weight Averaged Reward Models
- averages the weights of multiple reward models to handle inconsistencies
3.2.1. RNN
3.2.2. STATE SPACE
- H3 (Hungry Hungry Hippos): state space layers instead of attention
- Perceiver: a small set of latents instead of full self-attention over the input
- Repeat After Me: Transformers are Better than State Space Models at Copying
- GSSMs (generalized state space models): fixed-size latent state that doesn't depend on sequence length (see the recurrence sketch after this list)
- limited compared to transformer models on tasks that require copying from the input context
- repeat after me, dilated convolutions are better than transformers?
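A minimal sketch of the fixed-size-latent-state argument: a linear state space recurrence keeps a constant-size hidden state h however long the input is, which is why exact copying of long contexts is hard. A, B, C follow the usual SSM naming, not any specific paper's parameterization.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run y_t = C h_t with h_t = A h_{t-1} + B x_t over a sequence.

    A: (n, n), B: (n, d), C: (d, n); x: (seq, d).
    The latent h always has size n, independent of sequence length.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t   # fixed-size state update
        ys.append(C @ h)      # readout
    return np.stack(ys)
```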
3.2.2.1. MAMBA
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- architecture without attention or MLP
- allowing the model to selectively propagate or forget information (see the sketch below)
- outperforms Transformers of the same size, matches Transformers twice its size
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- comparable performance to LLaVA with about 43% of the number of parameters
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
- data-dependent weights using a dual selection mechanism across tokens and channels
- Jamba: A Hybrid Transformer-Mamba Language Model
- interleaves blocks of Transformer and Mamba layers
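A toy illustration of the "selective" part: how much state is kept per channel depends on the current input, so the model can propagate or overwrite information token by token. This omits Mamba's actual continuous-time parameterization (Δ, A, B, C) and its hardware-aware parallel scan; the weight names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_recurrence(x, W_keep, W_in, W_out):
    """x: (seq, d); W_keep, W_in, W_out: (d, d). One state channel per feature."""
    h = np.zeros(x.shape[1])
    ys = []
    for x_t in x:
        keep = sigmoid(x_t @ W_keep)                        # input-dependent: how much state to retain
        h = keep * h + (1.0 - keep) * np.tanh(x_t @ W_in)   # selectively propagate or forget
        ys.append(h @ W_out)
    return np.stack(ys)
```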
3.3. ATTENTION FREE
- gzip vs attention: GZIP VS GPT
- SimpleTRON: Simple Transformer with O(N) Complexity (no transformer)
- vs Metaformer (poolformer, pureformer)
- maybe not the same: the github
3.4. COMPOSABLE TRANSFORMERS
- Composable Function-preserving Expansions for Transformer Architectures
=best=
- training pipelines for larger models by progressively expanding the architecture throughout training
4. TRANSFORMER VISION
- parent: computervision
- SEED
- DINO: self-supervised Vision Transformers https://youtu.be/h3ij3F3cPIk
- ConvNets Match Vision Transformers at Scale
- match the performance of Vision Transformers with comparable compute budgets
- Denoising Vision Transformers: removes artifacts and improves feature quality
4.1. VIDEO
- MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
- memory mechanism combining a rapidly updated short-term memory with a sustained long-term memory
- Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
- video visual recognition = computational savings
- identifying the tokens that have changed significantly
- can be converted from existing transformers
- ProPainter: Improving Propagation and Transformer for Video Inpainting
- inpainting = make mask, then remove
- image and feature warping, discard redundant tokens, attention to distant frames
4.2. TOKENS
- LongNet: Scaling Transformers to 1,000,000,000 Tokens
- SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
- codebook for video tokens, not optical flow, 49 tokens
- SeiT: Storage-efficient Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
- Token-based Storage
4.2.1. VISUAL TOKENS
- Subobject-level Image Tokenization
- subobjects are represented by semantically meaningful image segments obtained by segmentation models
4.2.1.1. SEED
- SEED: Planting a SEED of Vision in Large Language Model
- unify visual and textual representations
- Image tokens independent of 2D patch positions, only 1D causal dependency
- high-level semantics consistent with semantic abstraction of words
- SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer (autoregressive Transformer)
- comprehension and generation of images
4.3. HIERARCHICAL DETAILS
- Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
- FasterViT: Fast Vision Transformers with Hierarchical Attention
- Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
- From Sparse to Soft Mixtures of Experts; soft alternative to sparse MoE Transformers
- passes different weighted combinations of all input tokens to each expert (see the sketch below)
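A compact sketch of the soft-mixture idea from the last item: each expert slot receives a softmax-weighted mixture of all tokens (dispatch), and each token gets back a softmax-weighted mixture of slot outputs (combine). Expert bodies are plain linear maps for brevity and one slot per expert is assumed; this is a reading of the Soft MoE paper, not its reference implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, expert_weights):
    """tokens: (seq, d); phi: (d, num_slots); expert_weights: (num_slots, d, d)."""
    logits = tokens @ phi                  # (seq, num_slots) token-slot affinities
    dispatch = softmax(logits, axis=0)     # per slot: a distribution over tokens
    combine = softmax(logits, axis=1)      # per token: a distribution over slots
    slot_inputs = dispatch.T @ tokens      # (num_slots, d) soft mixtures of tokens
    slot_outputs = np.einsum('sd,sde->se', slot_inputs, expert_weights)  # each expert processes its slot
    return combine @ slot_outputs          # (seq, d) mixed back per token
```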