transformer

Table of Contents

1. IMPROVEMENTS ON

1.1. FASTER

  • CoLT5: Faster Long-Range Transformers with Conditional Computation
  • SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
    • reduces compute and memory, 4 to 8 times fewer attention matrices
  • Agent Attention: balance between computational efficiency and representation power
    • a generalized form of linear attention integrated with softmax attention; preserves global context modelling capability
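
  A minimal NumPy sketch of the linear-attention family that Agent Attention generalizes (the relu-based feature map below is an illustrative choice, not the paper's exact formulation): factorizing the kernel drops the cost from O(n²·d) to O(n·d²) and avoids the n×n matrix.

    import numpy as np

    def softmax_attention(Q, K, V):
        # standard softmax attention: O(n^2 * d) time, O(n^2) memory
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
        # kernelized/linear attention: associativity lets us precompute
        # phi(K)^T V once, so no n x n attention matrix is ever formed
        KV = phi(K).T @ V                      # (d, d_v)
        Z = phi(Q) @ phi(K).sum(axis=0)        # (n,) normalizer
        return (phi(Q) @ KV) / Z[:, None]

    n, d = 128, 16
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)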

1.2. CONTEXT

  • in-context transformers (Algorithm Distillation): learn a reinforcement-learning algorithm from their own learning histories, then improve in-context
  • Elastic Decision Transformer
    • using the full history of states as input is not optimal for decisions; a shorter history works better
  • Cached Transformers: Improving Transformers with Differentiable Memory Cache
    • Gated Recurrent Cached (GRC): extends the self-attention mechanism with a differentiable memory cache
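
  A rough sketch of the gated-memory-cache idea (the mean-pooled summary and the fixed scalar gate are my simplifications; in the paper the cache update and gating are learned): a fixed-size cache is updated recurrently, and each segment attends over [cache; tokens].

    import numpy as np

    def update_cache(cache, new_tokens, gate):
        """Gated recurrent update of a fixed-size memory cache.

        cache:      (m, d) persistent memory slots
        new_tokens: (n, d) current segment's token representations
        gate:       scalar in (0, 1); fixed here purely for illustration
        """
        # compress the segment to the cache size (simple mean pooling here)
        summary = new_tokens.mean(axis=0, keepdims=True)       # (1, d)
        summary = np.repeat(summary, cache.shape[0], axis=0)   # (m, d)
        return (1.0 - gate) * cache + gate * summary

    m, n, d = 8, 32, 64
    cache = np.zeros((m, d))
    for step in range(5):
        segment = np.random.randn(n, d)
        cache = update_cache(cache, segment, gate=0.1)
        # attention for this segment would use keys/values taken from
        # np.concatenate([cache, segment], axis=0)
    print(cache.shape)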

2. ABOUT ATTENTION

  • What are Q, K, V? multi-head attention? (sketched after this list)
  • SpectFormer: Frequency and Attention is what you need in a Vision Transformer
  • Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
    • two-dimensional convolutions to jointly encode the source-target sequences (translation)
  • On the Turing Completeness of Modern Neural Network Architectures; Attention is Turing-Complete
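
  To pin down the Q, K, V note above: Q, K and V are three linear projections of the same token embeddings, per-head attention weights are softmax(QKᵀ/√d), and the heads are concatenated and projected back. A minimal NumPy version:

    import numpy as np

    def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
        """Minimal multi-head self-attention (no masking, no dropout).

        X: (n, d_model) token embeddings; Wq, Wk, Wv, Wo: (d_model, d_model)
        """
        n, d_model = X.shape
        d_head = d_model // num_heads
        # project, then split the feature dimension into heads
        def split(W):
            return (X @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)
        Q, K, V = split(Wq), split(Wk), split(Wv)               # (h, n, d_head)

        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (h, n, n)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)               # softmax over keys
        heads = weights @ V                                     # (h, n, d_head)
        return heads.transpose(1, 0, 2).reshape(n, d_model) @ Wo

    n, d_model, h = 10, 64, 8
    X = np.random.randn(n, d_model)
    Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) / np.sqrt(d_model)
                      for _ in range(4))
    print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)     # (10, 64)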

3. SHAPES

  • Star-Transformer: https://arxiv.org/abs/1902.09113
  • simpler transformer: One Wide Feedforward is All You Need
    • Attention (captures token interdependencies) and the Feed Forward Network (now removed from the decoder, fewer params)
  • Approximating Two-Layer Feedforward Networks for Efficient Transformers
    • Mixtures of Experts (MoEs) vs dense transformers, more resource efficient
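
  A shape-level sketch of the dense-FFN vs MoE-FFN trade-off (the top-1 softmax router below is a simplification of the paper's methods): each token activates only one small expert instead of the full wide hidden layer.

    import numpy as np

    def dense_ffn(x, W1, W2):
        # standard transformer FFN: one wide hidden layer for every token
        return np.maximum(x @ W1, 0) @ W2

    def moe_ffn(x, gate_W, experts, top_k=1):
        """Token-wise top-k mixture of small expert FFNs.

        x: (n, d) tokens; gate_W: (d, n_experts) router;
        experts: list of (W1, W2) pairs, each a small FFN
        """
        logits = x @ gate_W                                     # (n, n_experts)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        out = np.zeros_like(x)
        for i, token in enumerate(x):
            top = np.argsort(probs[i])[-top_k:]                 # chosen experts
            for e in top:
                W1, W2 = experts[e]
                out[i] += probs[i, e] * (np.maximum(token @ W1, 0) @ W2)
        return out

    n, d, hidden, n_experts = 16, 32, 128, 4
    x = np.random.randn(n, d)
    W1, W2 = np.random.randn(d, hidden), np.random.randn(hidden, d)
    experts = [(np.random.randn(d, hidden // n_experts),
                np.random.randn(hidden // n_experts, d)) for _ in range(n_experts)]
    gate_W = np.random.randn(d, n_experts)
    print(dense_ffn(x, W1, W2).shape, moe_ffn(x, gate_W, experts).shape)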

3.1. BEHAVIORAL TRANSFORMER

  • PASTA: Pretrained Action-State Transformer Agents
    • self-supervised pretraining on reinforcement-learning trajectories
    • downstream tasks include behavioral cloning and sensor adaptation
    • pretraining does not need to be tailored to specific downstream applications
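
  A toy sketch of self-supervised pretraining on action-state trajectories; BERT-style masking is only one of the objectives in this family, and the interleaving/padding below is my own illustration, not the paper's exact setup.

    import numpy as np

    def masked_trajectory_batch(states, actions, mask_ratio=0.15, mask_value=0.0):
        """Interleave states and actions into one token sequence and mask a
        random subset for reconstruction (a BERT-style objective).

        states: (T, d_s), actions: (T, d_a) from one trajectory.
        Returns (inputs, targets, mask) for a prediction loss on masked tokens.
        """
        d = max(states.shape[1], actions.shape[1])
        pad = lambda x: np.pad(x, ((0, 0), (0, d - x.shape[1])))
        tokens = np.empty((2 * states.shape[0], d))
        tokens[0::2] = pad(states)          # s_0, a_0, s_1, a_1, ...
        tokens[1::2] = pad(actions)

        mask = np.random.rand(tokens.shape[0]) < mask_ratio
        inputs = tokens.copy()
        inputs[mask] = mask_value           # hide the masked tokens
        return inputs, tokens, mask         # the model predicts tokens[mask]

    T, d_s, d_a = 50, 8, 2
    inputs, targets, mask = masked_trajectory_batch(
        np.random.randn(T, d_s), np.random.randn(T, d_a))
    print(inputs.shape, mask.sum(), "tokens masked")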

3.2. ALTERNATIVE

  • Retentive Network: A Successor to Transformer for Large Language Models (RetNet)
    • low cost inference, training parallelism, strong performance
  • Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
    • potential to streamline complex architectures for sequence-to-sequence tasks
  • WARM: On the Benefits of Weight Averaged Reward Models
    • averages the weights of several fine-tuned reward models for robustness to inconsistent preference labels and reward hacking
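
  The averaging itself is simple to show; a sketch assuming the reward models were fine-tuned from the same initialization and share a parameter layout (the toy `models` below are stand-ins):

    import numpy as np

    def average_weights(state_dicts):
        """Uniformly average parameters of models fine-tuned from the same init.

        state_dicts: list of {param_name: np.ndarray} with identical keys/shapes.
        Returns one averaged state dict (the weight-averaged reward model).
        """
        keys = state_dicts[0].keys()
        return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

    # three toy 'reward models' with the same two-parameter layout
    rng = np.random.default_rng(0)
    models = [{"W": rng.normal(size=(4, 4)), "b": rng.normal(size=4)}
              for _ in range(3)]
    warm = average_weights(models)
    print(warm["W"].shape, warm["b"].shape)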

3.2.1. RNN

  • RNNs, Gated recurrent neural networks discover attention
  • RWKV: RNN instead of transformers
    • nanoRWKV: minGPT-like, does not require a custom CUDA kernel to train
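
  A stripped-down sketch of the RNN-as-attention idea behind RWKV: a fixed-size running (numerator, denominator) state replaces the n×n attention matrix. The real wkv recurrence has learned per-channel decay and a bonus term for the current token; both are simplified away here.

    import numpy as np

    def wkv_recurrence(keys, values, decay=0.95):
        """Simplified RWKV-style recurrence, computed in O(T * d)."""
        num = np.zeros(values.shape[1])     # running sum of exp(k) * v
        den = np.zeros(values.shape[1])     # running sum of exp(k)
        outputs = []
        for k, v in zip(keys, values):
            num = decay * num + np.exp(k) * v
            den = decay * den + np.exp(k)
            outputs.append(num / (den + 1e-8))
        return np.stack(outputs)            # (T, d)

    T, d = 64, 16
    out = wkv_recurrence(np.random.randn(T, d), np.random.randn(T, d))
    print(out.shape)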

3.2.2. STATE SPACE

  • H3 (Hungry Hungry Hippos): state space model instead of transformers
    • Perceiver: cross-attends a small set of latents to the input instead of full self-attention
  • Repeat After Me: Transformers are Better than State Space Models at Copying
    • GSSMs (generalized state space models): fixed-size latent state that doesn't depend on sequence length
      • limited compared to transformer models on tasks that require copying from the input context
        • repeat after me, dilated convolutions are better than transformers?
3.2.2.1. MAMBA
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    • architecture without attention or MLP
    • allowing the model to selectively propagate or forget information (see the sketch after this list)
    • outperforms Transformers of the same size, matches Transformers twice its size
  • Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
    • comparable performance to LLaVA with about 43% of the number of parameters
  • MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
    • data-dependent weights that uses a dual selection mechanism across tokens and channels
  • Jamba: A Hybrid Transformer-Mamba Language Model
    • interleaves blocks of Transformer and Mamba layers
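
  A shape-level sketch of a diagonal selective state-space scan (not Mamba's exact discretization; the projections are random stand-ins): the hidden state has a fixed size regardless of sequence length, and B, C and the step size depend on the input, which is the "selective" part.

    import numpy as np

    def selective_ssm(x, A, W_B, W_C, W_dt):
        """Diagonal selective state-space scan (shape sketch, not Mamba itself).

        x: (T, d) inputs. The state h is (d, n) however long T is.
        """
        T, d = x.shape
        n = A.shape[1]
        h = np.zeros((d, n))
        ys = []
        for t in range(T):
            dt = np.log1p(np.exp(x[t] @ W_dt))[:, None]      # (d, 1) softplus step
            B = x[t] @ W_B                                   # (n,) input-dependent
            C = x[t] @ W_C                                   # (n,) input-dependent
            h = np.exp(dt * A) * h + dt * np.outer(x[t], B)  # decay + write
            ys.append(h @ C)                                 # (d,) read-out
        return np.stack(ys)                                  # (T, d)

    T, d, n = 32, 8, 16
    A = -np.abs(np.random.randn(d, n))        # stable (negative) diagonal
    W_B, W_C, W_dt = (np.random.randn(d, n), np.random.randn(d, n),
                      np.random.randn(d, d))
    print(selective_ssm(np.random.randn(T, d), A, W_B, W_C, W_dt).shape)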

3.3. ATTENTION FREE

3.4. COMPOSABLE TRANSFORMERS

  • Composable Function-preserving Expansions for Transformer Architectures =best=
    • training pipelines for larger models by progressively expanding the architecture throughout training
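
  One simple function-preserving expansion, in the spirit of the paper (its exact schemes differ): widen the FFN but zero-initialize the outgoing weights of the new hidden units, so the expanded network computes the same function and training can continue from there.

    import numpy as np

    def widen_ffn(W1, b1, W2, extra):
        """Add `extra` hidden units to a two-layer FFN without changing its output.

        Incoming weights of the new units can be anything; their outgoing
        weights are zero, so y = relu(x W1 + b1) W2 is unchanged.
        """
        d_in, hidden = W1.shape
        d_out = W2.shape[1]
        W1_new = np.concatenate([W1, np.random.randn(d_in, extra) * 0.02], axis=1)
        b1_new = np.concatenate([b1, np.zeros(extra)])
        W2_new = np.concatenate([W2, np.zeros((extra, d_out))], axis=0)
        return W1_new, b1_new, W2_new

    d_in, hidden, d_out = 8, 16, 8
    W1, b1, W2 = np.random.randn(d_in, hidden), np.zeros(hidden), np.random.randn(hidden, d_out)
    x = np.random.randn(4, d_in)
    before = np.maximum(x @ W1 + b1, 0) @ W2
    W1n, b1n, W2n = widen_ffn(W1, b1, W2, extra=16)
    after = np.maximum(x @ W1n + b1n, 0) @ W2n
    print(np.allclose(before, after))   # True: the function is preserved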

4. TRANSFORMER VISION

4.1. VIDEO

  • MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
    • memory mechanism: a rapidly updated short-term memory consolidated into a sustained long-term memory
  • Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
    • computational savings for video recognition
      • recomputes only the tokens that have changed significantly between frames (sketched after this list)
    • existing transformers can be converted (often without retraining)
  • ProPainter: Improving Propagation and Transformer for Video Inpainting
    • inpainting = make mask, then remove
    • image and feature warping, discard redundant tokens, attention to distant frames
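
  A toy sketch of the temporal-redundancy gate referenced above (the thresholding rule and stand-in module are simplifications): compare the current frame's tokens to a reference, recompute an expensive module only for tokens that changed, and reuse cached outputs for the rest.

    import numpy as np

    def eventful_update(tokens, ref_tokens, cached_out, expensive_fn, threshold):
        """Recompute only tokens that changed significantly since the reference.

        tokens, ref_tokens: (n, d) current / reference frame token embeddings
        cached_out:         (n, d) outputs computed for the reference frame
        """
        delta = np.linalg.norm(tokens - ref_tokens, axis=1)   # per-token change
        active = delta > threshold                            # tokens to update
        out = cached_out.copy()
        if active.any():
            out[active] = expensive_fn(tokens[active])        # sparse recompute
        return out, active

    expensive_fn = lambda x: np.tanh(x @ np.eye(x.shape[1]))  # stand-in module
    n, d = 100, 32
    ref = np.random.randn(n, d)
    cached, _ = eventful_update(ref, np.zeros((n, d)), np.zeros((n, d)),
                                expensive_fn, threshold=-1.0)  # full first pass
    frame = ref.copy()
    frame[:5] += 2.0                                           # only 5 tokens move
    out, active = eventful_update(frame, ref, cached, expensive_fn, threshold=1.0)
    print(active.sum(), "of", n, "tokens recomputed")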

4.2. TOKENS

  • LongNet: Scaling Transformers to 1,000,000,000 Tokens
  • SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
    • codebook for video tokens, not optical flow; 49 tokens
  • SeiT: Storage-efficient Vision Training with Tokens Using 1% of Pixel Storage, <1% of JPEG images
    • Token-based Storage
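
  A sketch of the token-storage idea (the codebook, sizes and nearest-neighbour tokenizer below are illustrative): store each image as a short vector of integer codebook indices instead of pixels.

    import numpy as np

    def tokenize(patches, codebook):
        """Map each patch feature to the index of its nearest codebook vector."""
        # squared euclidean distances, (n_patches, vocab), without a 3D array
        d2 = ((patches ** 2).sum(1, keepdims=True)
              - 2.0 * patches @ codebook.T
              + (codebook ** 2).sum(1))
        return d2.argmin(axis=1).astype(np.uint16)             # (n_patches,)

    n_patches, dim, vocab = 196, 64, 4096                       # 14x14 patches
    codebook = np.random.randn(vocab, dim).astype(np.float32)
    patches = np.random.randn(n_patches, dim).astype(np.float32)

    indices = tokenize(patches, codebook)                       # what gets stored
    stored_bytes = indices.nbytes                               # 196 * 2 = 392 B
    pixel_bytes = 224 * 224 * 3                                 # raw 224x224 RGB
    print(stored_bytes, "bytes of tokens vs", pixel_bytes, "bytes of pixels")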

4.2.1. VISUAL TOKENS

  • Subobject-level Image Tokenization
    • subobjects are represented by semantically meaningful image segments obtained by segmentation models
4.2.1.1. SEED
  • SEED: Planting a SEED of Vision in Large Language Model
    • unify visual and textual representations
    • Image tokens independent of 2D patch positions, only a 1D causal dependency (see the sketch after this list)
    • high-level semantics consistent with semantic abstraction of words
    • SEED-LLaMA: Making LLaMA SEE and Draw with SEED Tokenizer (autoregressive Transformer)
      • comprehension and generation of images
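
  A tiny sketch of the 1D-causal image-token idea referenced above (the special tokens and vocabulary offsets are hypothetical): discrete image codes are flattened into the language model's token stream, so a single autoregressive model can both read and emit images.

    # Hypothetical special tokens and vocab offsets, for illustration only.
    TEXT_VOCAB = 32000
    BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1          # begin/end-of-image markers
    IMAGE_VOCAB_OFFSET = TEXT_VOCAB + 2            # image codes come after text ids

    def build_sequence(text_ids, image_codes):
        """Interleave text token ids and discrete image codes into one 1D
        causal sequence for next-token prediction."""
        image_ids = [IMAGE_VOCAB_OFFSET + c for c in image_codes]
        return text_ids + [BOI] + image_ids + [EOI]

    seq = build_sequence(text_ids=[5, 17, 923], image_codes=[3, 1012, 77, 40])
    print(seq)   # the LM is trained to predict each id from the ones before it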

4.3. HIERARCHICAL DETAILS

  • Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
  • FasterViT: Fast Vision Transformers with Hierarchical Attention
  • Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
  • From Sparse to Soft Mixtures of Experts; sparse Transformer
    • passing different weighted combinations of all input tokens to each expert
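
  A shape-level sketch of Soft MoE dispatch/combine (details simplified): each expert slot receives a softmax-over-tokens mixture of all inputs, and each token's output is a softmax-over-slots mixture of the expert outputs, so routing stays fully differentiable.

    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def soft_moe(X, Phi, experts, slots_per_expert):
        """Soft mixture of experts over tokens (shape sketch).

        X: (n, d) tokens; Phi: (d, e * s) routing parameters;
        experts: list of e functions, each mapping (s, d) slot inputs to (s, d)
        """
        n, d = X.shape
        e, s = len(experts), slots_per_expert
        logits = X @ Phi                              # (n, e*s)
        dispatch = softmax(logits, axis=0)            # softmax over tokens
        combine = softmax(logits, axis=1)             # softmax over slots
        slot_in = dispatch.T @ X                      # (e*s, d): weighted mixes
        slot_out = np.concatenate(
            [experts[i](slot_in[i * s:(i + 1) * s]) for i in range(e)], axis=0)
        return combine @ slot_out                     # (n, d)

    n, d, e, s = 32, 16, 4, 2
    X = np.random.randn(n, d)
    Phi = np.random.randn(d, e * s)
    experts = [lambda z, W=np.random.randn(d, d): np.tanh(z @ W) for _ in range(e)]
    print(soft_moe(X, Phi, experts, s).shape)         # (32, 16)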

Author: Tekakutli

Created: 2024-04-07 Sun 13:56