1. ADDED - EXTRAS TO LLM

1.1. VECTOR DB

  • langchain, and https://github.com/srush/MiniChain
    • PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
  • MemGPT: manages memory tiers to effectively provide extended context within the LLM's limited context window
    • the LLM is taught to manage its own memory, resembling paging in an OS (main context, external context) =best=
    • trained to generate function calls
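
A minimal sketch of the paging idea behind MemGPT: a bounded main context plus an external store, with helper functions the model could be trained to call. The class and method names here (MemoryManager, archive_insert, archive_search) are illustrative stand-ins, not the actual MemGPT API.

# Bounded "main context" plus unbounded external storage, with paging-style
# eviction and simple retrieval functions the model could invoke.
class MemoryManager:
    def __init__(self, max_main_tokens=2000):
        self.main_context = []        # what actually fits in the LLM prompt
        self.external_store = []      # unbounded archival storage
        self.max_main_tokens = max_main_tokens

    def _tokens(self):
        return sum(len(m.split()) for m in self.main_context)

    def add(self, message):
        self.main_context.append(message)
        # On overflow, evict the oldest messages to the external store,
        # analogous to paging memory out to disk.
        while self._tokens() > self.max_main_tokens and len(self.main_context) > 1:
            self.external_store.append(self.main_context.pop(0))

    # Functions the model would be trained to call:
    def archive_insert(self, note):
        self.external_store.append(note)

    def archive_search(self, query):
        # Naive keyword search standing in for embedding-based retrieval.
        return [m for m in self.external_store if query.lower() in m.lower()]

mm = MemoryManager(max_main_tokens=10)
mm.add("user: my favorite color is teal")
mm.add("assistant: noted, teal it is, a fine choice of color indeed")
print(mm.archive_search("teal"))   # the paged-out fact is still retrievable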

2. SPECIALIZED USES

  • QUERYING MODELS - MULTIMODAL
  • Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding; medical, doctor
  • Personality Traits in Large Language Models, quantifying personalities
  • ChipNeMo: Domain-Adapted LLMs for Chip Design
  • LARP: Language-Agent Role Play for Open-World Games
    • decision-making assistant, framework refines interactions between users and agents

2.1. LAYOUT LLM

  • PosterLlama: Bridging Design Ability of Language Model to Contents-Aware Layout Generation
    • reformatting layout elements into HTML code
    • unconditional layout generation, element conditional layout generation, layout completion

2.2. PLOT

  • Pix2Struct: plot/screenshot to text (visually-situated language understanding)
    • DePlot: plot-to-text model helping LLMs understand plots
    • MatCha: great chart & math capabilities by plot deconstruction & numerical reasoning objectives
  • StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
    • based on the Code-LLaMA architecture

2.3. LEGAL

  • SaulLM-7B: A pioneering Large Language Model for Law
    • designed explicitly for legal text comprehension and generation

2.4. VISUAL

  • Pixel Aligned Language Models
    • can take locations (set of points, boxes) as inputs or outputs
    • location-aware vision-language tasks

2.5. CODE ASSISTANT

2.5.1. MATH

  • Llemma: An Open Language Model For Mathematics
    • capable of tool use and formal theorem proving
  • Large Language Models for Mathematicians (academic)
    • mathematical description of the transformer model used in all modern language models
  • Chronos: Learning the Language of Time Series
    • improve zero-shot accuracy on unseen forecasting tasks; forecasting pipeline
  • MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
    • extract crucial reasoning steps, to reveal the intermediate reasoning quality
    • MLLMs

2.5.2. CODE COMPLETION

  • DeciCoder: decoder-only code completion model
    • approach of grouping tokens into clusters and having each token attend to others only within its cluster
  • Magicoder: Source Code Is All You Need
    • MagicoderS-CL-7B based on CodeLlama
  • StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
    • breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks
      • while masking segments to properly optimize
2.5.2.1. OPERATOR
  • Enhancing Network Management Using Code Generated by Large Language Models
    • program synthesis: generate task-specific code from natural language queries
      • analyzing network topologies and communication graphs

2.5.3. DIFFUSION

  • CodeFusion: A Pre-trained Diffusion Model for Code Generation =diffusion= (75M vs 1B auto-regressive)
    • iterative denoising, no need to start from scratch
  • Text Rendering Strategies for Pixel Language Models
    • characters as images, handle any script; PIXEL model

2.5.4. TOOLS-USE TOOLS

  • Grammar Prompting for Domain-Specific Language Generation with Large Language Models
    • like programming languages
    • predicts a BNF grammar given an input, then generates the output according to the rules of that grammar (a two-stage sketch follows this list)
  • Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
    • zero-shot prompts with only documentation are sufficient for tool usage
    • tool documentation > demonstrations
  • ControlLLM: Augment Language Models with Tools by Searching on Graphs
    • breaks down a complex task into clear subtasks, then optimal solution path
  • Fay: integrating language models and digital characters
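
A minimal sketch of the two-stage grammar-prompting flow described above: first ask the model for a BNF grammar specialized to the query, then generate an answer under that grammar. call_llm is a placeholder for any completion API, and the regex check is a toy stand-in for real grammar-constrained decoding.

import re

def call_llm(prompt):
    # Placeholder: canned responses so the sketch runs end to end.
    if "Write a BNF grammar" in prompt:
        return '<cmd> ::= "ping " <host>\n<host> ::= [a-z.]+'
    return "ping example.com"

def grammar_prompting(query):
    # Stage 1: have the model predict a small BNF grammar for this query.
    grammar = call_llm(f"Write a BNF grammar for valid answers to: {query}")
    # Stage 2: generate an answer conditioned on that grammar.
    answer = call_llm(f"Grammar:\n{grammar}\nUsing only this grammar, answer: {query}")
    # Toy validity check standing in for constrained decoding against the grammar.
    assert re.fullmatch(r"ping [a-z.]+", answer), "answer violates the grammar"
    return answer

print(grammar_prompting("check that example.com is reachable"))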

2.6. TRANSLATION

2.7. OPTIMIZATION

  • OPRO: Optimization by PROmpting, Large Language Models as Optimizers
    • each step = generate new solutions from previously generated (solution, score) pairs; see the loop sketch after this list
  • Large Language Models for Compiler Optimization
    • reducing instruction counts beyond the compiler's own optimizations
  • EvoPrompt: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
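
A toy sketch of an OPRO-style loop: keep a trajectory of (solution, score) pairs, put the best ones in a meta-prompt, and ask the optimizer model for a new candidate. propose_with_llm is a stub that perturbs the current best solution so the loop runs without an API; a real setup would send meta_prompt to a model and parse its reply.

import random

random.seed(0)

def objective(x):
    # Toy task: maximize -(x - 3)^2, optimum at x = 3.
    return -(x - 3.0) ** 2

def propose_with_llm(meta_prompt, best_x):
    # Stub for the optimizer-LLM call.
    return best_x + random.uniform(-1.0, 1.0)

trajectory = [(0.0, objective(0.0))]
for step in range(30):
    trajectory.sort(key=lambda p: p[1])                    # ascending by score
    meta_prompt = "Previous solutions and scores:\n" + "\n".join(
        f"x={x:.3f} score={s:.3f}" for x, s in trajectory[-5:]
    ) + "\nPropose a new x with a higher score."
    x_new = propose_with_llm(meta_prompt, trajectory[-1][0])
    trajectory.append((x_new, objective(x_new)))

print("best:", max(trajectory, key=lambda p: p[1]))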

2.7.1. CACHE

  • SparQ Attention: Bandwidth-Efficient LLM Inference
    • reducing memory bandwidth requirements within the attention blocks through selective fetching of the cached history (up to eight times)
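
A simplified numpy sketch of the selective-fetch idea: approximate the attention scores using only the largest-magnitude components of the query, then read full key/value rows for just the top-k positions. Illustrative only; it omits the paper's corrections for the positions that are not fetched.

import numpy as np

def sparq_attention(q, K, V, r=8, k=16):
    d = q.shape[0]
    # Step 1: cheap score approximation using only the r largest-|q| components.
    idx = np.argsort(-np.abs(q))[:r]
    approx = K[:, idx] @ q[idx] / np.sqrt(d)
    # Step 2: fetch full K/V rows only for the top-k positions; this is where
    # the memory-bandwidth saving comes from.
    top = np.argsort(-approx)[:k]
    scores = K[top] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
print(sparq_attention(q, K, V).shape)   # (64,)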

2.8. SUMMARIZATION

  • thread summarizer https://labs.kagi.com/ai/sum?url=%3E%3E248633369
  • LLM Use Case: Summarization (using langchain)
  • From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
    • iteratively incorporating missing salient entities without increasing the length (sketched after this list)
  • LMDX: Language Model-based Document Information Extraction and Localization
    • methodology to adapt arbitrary LLMs for document information extraction (without hallucination)
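
A minimal sketch of the chain-of-density loop: repeatedly ask the model to fold missing salient entities into the summary while keeping its length fixed. call_llm is a placeholder stub for any completion API.

def call_llm(prompt):
    return "placeholder summary of the article"   # stub so the sketch runs

def chain_of_density(article, rounds=5):
    summary = call_llm(f"Write a short, sparse summary of:\n{article}")
    for _ in range(rounds):
        summary = call_llm(
            "Identify 1-3 informative entities from the article that are missing "
            "from the summary, then rewrite the summary to include them WITHOUT "
            f"increasing its length.\nArticle:\n{article}\nCurrent summary:\n{summary}"
        )
    return summary

print(chain_of_density("..."))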

3. TEXT DIFFUSION

  • parent: diffusion
  • GENIE: Large Scale Pre-training for Text Generation with Diffusion Model
  • TESS: Text-to-Text Self-Conditioned Simplex Diffusion
    • AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
  • PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
  • DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space
    • enhances the diversity of dialog responses while maintaining coherence

4. TEXT GENERATION

4.1. INFERENCE

4.1.1. BETTER

4.1.1.1. FOCUS THE ATTENTION
  • PASTA: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
    • identifies a small subset of attention heads, then applies precise attention reweighting on them
    • applied in addition to prompting; see the steering sketch after this list
  • S2A: System 2 Attention (is something you might need too)
    • regenerates context to only include the relevant portions before responding
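
A toy numpy sketch of post-hoc attention steering in the spirit of PASTA: in a chosen subset of heads, scale down the attention paid to tokens outside a user-highlighted span and renormalize the rows.

import numpy as np

def steer_attention(attn, highlight, heads, alpha=0.01):
    """attn: (n_heads, seq, seq) post-softmax weights;
    highlight: boolean mask (seq,) of tokens to emphasize."""
    attn = attn.copy()
    for h in heads:
        attn[h][:, ~highlight] *= alpha                       # de-emphasize the rest
        attn[h] /= attn[h].sum(axis=-1, keepdims=True)        # renormalize rows
    return attn

rng = np.random.default_rng(0)
raw = rng.random((4, 6, 6))
attn = raw / raw.sum(-1, keepdims=True)
highlight = np.array([False, False, True, True, False, False])
steered = steer_attention(attn, highlight, heads=[0, 2])
print(steered[0].sum(-1))   # each row still sums to 1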

4.1.2. FASTER

  • Accelerating LLM Inference with Staged Speculative Decoding
    • restructures the speculative batch as a tree (a baseline draft-and-verify sketch follows this list)
  • MobileNMT: Enabling Translation in 15MB and 30ms
  • FlashDecoding++: Faster Large Language Model Inference on GPUs
    • inference engine, 2-4x speedup; optimizes flat GEMM operations
  • Exponentially Faster Language Modelling
    • replacing feedforward networks with fast feedforward networks (FFFs)
    • engages just 12 out of 4095 neurons for each layer inference, 78x speedup
  • EAGLE: LLM decoding based on compression (and others with comparison: Medusa, Lookahead, Vanilla)
    • sequence of second-top-layer features is compressible, making the prediction of subsequent feature vectors from previous ones easy by a small model
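
A minimal sketch of the greedy draft-and-verify skeleton that the staged/tree variants above build on: a cheap draft model proposes a run of tokens, the target model checks them, and the longest agreeing prefix is accepted. The two "models" here are toy next-token functions, not real LLMs.

def draft_model(prefix):
    # Cheap but imperfect: usually predicts last+1, sometimes repeats the last token.
    return (prefix[-1] + 1) % 10 if len(prefix) % 3 else prefix[-1]

def target_model(prefix):
    # The model whose (greedy) output must be reproduced exactly: always last+1.
    return (prefix[-1] + 1) % 10

def speculative_decode(prefix, n_new=12, k=4):
    out = list(prefix)
    while len(out) < len(prefix) + n_new:
        # 1. draft k tokens autoregressively with the cheap model
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        # 2. verify: accept the longest prefix where the target model agrees;
        #    in a real system this check is a single batched forward pass
        for i, t in enumerate(draft):
            correct = target_model(out + draft[:i])
            out.append(t if t == correct else correct)
            if t != correct:
                break
    return out[:len(prefix) + n_new]

print(speculative_decode([1]))   # matches greedy decoding of target_model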

4.1.3. MODELS

4.1.3.1. QWEN
  • Qwen-7B: surpasses both LLaMA 2 7B and 13B on MMLU score, math and code
  • Qwen-1.5 space
4.1.3.2. LLAMA
  1. ALTERNATIVES
    • OpenLLaMA: open-source reproduction, permissively licensed; Lit-LLaMA, RedPajama dataset
    • Falcon: new family, open-source =instruct finetuned too=
    • LLaMA Pro: Progressive LLaMA with Block Expansion
      • take a pretrained model and freeze its parameters, then add new blocks
      • adapt the model to new data without forgetting the old
    • LiteLlama: 460M parameters trained on 1T tokens
    • MobiLlama: small language models (SLMs), open-source 0.5 billion (0.5B) parameter model
4.1.3.3. MISTRAL

4.2. TRAINING

  • Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
  • Training Large Language Models Efficiently with Sparsity and Dataflow
  • Randomized Positional Encodings Boost Length Generalization of Transformers
  • MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
    • mixes in reverse cross-entropy RATHER THAN relying on maximum likelihood estimation (MLE) alone
  • Neurons in Large Language Models: Dead, N-gram, Positional
    • study: in some layers over 70% of neurons are dead; some neurons specialize in removing information from the input
  • Backpack Language Models: non-contextual sense vectors, which specialize in encoding different aspects of a word
  • In-Context Learning Creates Task Vectors
    • In-context learning = compressing the training set into a single task vector, then using it to modulate the transformer to produce the output
  • Efficient Streaming Language Models with Attention Sinks (=better inference or training=)
    • =window-only KV caching is bad=, just keep the first tokens around (as is)
      • or it is better to have a static null token at the beginning of the window
    • related to the “Vision Transformers Need Registers” paper
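
A toy sketch of the attention-sink cache policy described above: always keep the first few "sink" tokens plus a recent window in the KV cache and evict from the middle, shown here as plain list bookkeeping over token ids.

def evict(cache, n_sink=4, window=8):
    """Keep the first n_sink entries plus the most recent `window` entries."""
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = []
for token_id in range(20):
    cache.append(token_id)
    cache = evict(cache)

print(cache)   # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]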

4.2.1. CHEAPNESS

  • JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars
    • and can be finetuned with a very limited computing budget

4.2.2. STRUCTURE

  • From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
    • probabilistic programming language = commonsense reasoning, linguistics
  • Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
    • learn to generate rationales at each token to explain future text, improving their predictions
4.2.2.1. MERGING
  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
    • (specialized) text model merging (using rankings)
  • FuseChat: Knowledge Fusion of Chat Models
    • knowledge fusion for LLMs of structurally diverse architectures and scales
4.2.2.2. SKELETON
  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
    • first skeleton, then parallel filling; faster and better
  • ART: Automatic multi-step reasoning and tool-use for large language models
    • bubbles of logic
  • Orca 2: Teaching Small Language Models How to Reason
    • reasoning techniques: step-by-step, recall then generate, recall-reason-generate, direct answer
  • PathFinder: Guided Search over Multi-Step Reasoning Paths
    • tree-search-based reasoning path generation approach (beam search algorithm)
    • improved commonsense reasoning tasks and complex arithmetic
  • Stream of Search (SoS): Learning to Search in Language
    • models can be taught to search by representing the process of search in language, as a flattened string
  1. META-PROCESS TOKENS
    • Teach LLMs to Personalize – An Approach inspired by Writing Education
      • retrieval, ranking, summarization, synthesis, and generation
    • Link-Context Learning for Multimodal LLMs
      • causal associations between data points = cause and effect
      • In-Context Learning (ICL) = learn to learn
      • from limited tasks (providing demonstrations) and generalize to unseen tasks
    • LoGiPT: Language Models can be Logical Solvers
      • parse natural language logical questions into symbolic representations, emulates logical solvers
4.2.2.3. CORPUS STRUCTURE, RETRIEVAL
  1. LLM AS ENCODER
    • GZIP VS GPT
      • Copy Is All You Need
        • task of text generation decomposed into a series of copy-and-paste operations
        • text spans rather than vocabulary
        • learning = text compression algorithm ?
        • Decoding the ACL Paper: Gzip and KNN Rival BERT in Text Classification (a runnable mini-version follows this list)
    • LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
      • LLMs can be effectively transformed into universal text encoders without the need for expensive adaptation
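
A runnable mini-version of the gzip + kNN text classification trick from the linked write-up: normalized compression distance as the similarity measure and nearest neighbours as the classifier, standard library only.

import gzip

def clen(s):
    return len(gzip.compress(s.encode()))

def ncd(a, b):
    # Normalized compression distance between two strings.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

train = [
    ("the team won the championship game", "sports"),
    ("the striker scored twice in the final", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stocks fell after the earnings report", "finance"),
]

def classify(text, k=3):
    neighbours = sorted(train, key=lambda ex: ncd(text, ex[0]))[:k]
    labels = [lab for _, lab in neighbours]
    return max(set(labels), key=labels.count)

print(classify("the goalkeeper saved a penalty in the match"))   # likely "sports"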

4.2.3. QUANTIZATION

  • int-3 quantization: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and twitter
  • llama.cpp quantization
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
    • outperforms GPTQ in 4-bit and 3-bit with 1.45x speedup and works with multimodal LLMs
    • SpQR method for LLM compression: highly sensitive parameters are not quantized
  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
    • no more hand-crafted quantization parameters
  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers, 5.8% lower on reasoning than the full-precision model
  • BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
    • identifies and structurally selects salient weights
      • 7 billion weights within 0.5 hours
  • EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
    • leave the outliers (less than 1%) unchanged, implemented in parallel
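
A numpy sketch of the outlier-preserving round-to-nearest idea shared by SpQR and EasyQuant above: quantize each weight row to 4 bits but keep the ~1% largest-magnitude weights in full precision. Illustrative only, not the papers' implementations.

import numpy as np

def quantize_row(w, bits=4, outlier_frac=0.01):
    n_out = max(1, int(len(w) * outlier_frac))
    mask = np.zeros(len(w), dtype=bool)
    mask[np.argsort(-np.abs(w))[:n_out]] = True        # mark the outliers

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w[~mask]).max() / qmax              # scale ignores the outliers
    deq = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
    deq[mask] = w[mask]                                # outliers stay in full precision
    return deq

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
w[::200] *= 20                                         # inject a few outliers
print(f"mean abs error: {np.abs(quantize_row(w) - w).mean():.4f}")
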
4.2.3.1. 1-BIT
  • BitNet: Scaling 1-bit Transformers for Large Language Models
    • vs 8-bit quantization architectures
  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
    • can compress 1.6 trillion parameter model to less than 160GB (20x compression, 0.8 bits per parameter)
4.2.3.2. LORA WITH QUANTIZATION
  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
  • QLoRA: Efficient Finetuning of Quantized LLMs, 24 hours 1 gpu 48g
    • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
      • outperforms QLoRA
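
A minimal PyTorch sketch of the QLoRA-style setup: a frozen, fake-quantized base weight plus a trainable low-rank A/B adapter, so only the adapter carries gradients. Illustrative; real QLoRA uses 4-bit NormalFloat storage and paged optimizers.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, r=8, alpha=16):
        super().__init__()
        base = torch.randn(out_dim, in_dim)
        # Stand-in for 4-bit quantization of the frozen base weight.
        scale = base.abs().max() / 7
        self.register_buffer("w_q", torch.round(base / scale).clamp(-8, 7) * scale)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)    # trainable
        self.B = nn.Parameter(torch.zeros(out_dim, r))          # trainable, init 0
        self.scaling = alpha / r

    def forward(self, x):
        return x @ self.w_q.T + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(64, 64)
x = torch.randn(2, 64)
print(layer(x).shape)                                              # torch.Size([2, 64])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # adapter params only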

4.2.4. FINETUNING

4.2.4.1. FEEDBACK AS TARGET
  • 4.2.4.2.1
  • RLHF = Reinforcement Learning from Human Feedback
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
    • can fine-tune LMs to align with human preferences, better than RLHF; see the loss sketch at the end of this subsection
  • RAD: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
    • generation that uses an extra reward model to produce text with certain properties
  • ReFT: Reasoning with Reinforced Fine-Tuning
    • learn from multiple annotated reasoning paths
    • rewards are naturally derived from the ground-truth answers (like math)
  1. SELF TRAIN
    • TriPosT: Teaching Language Models to Self-Improve through Interactive Demonstrations
      • self-improvement ability for small models: revise their own outputs, correcting their own mistakes
    • Self-Refine: Iterative Refinement with Self-Feedback
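
A minimal sketch of the DPO loss referenced above, assuming the (summed) log-likelihoods of the chosen and rejected responses under the policy and under a frozen reference model are already computed.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # -log sigmoid(beta * ((policy margin) - (reference margin)))
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy tensors standing in for per-response log-likelihoods.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-13.5]))
print(loss.item())
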
4.2.4.2. CHEAPNESS
  1. MULTIPLE LLM
    • EFT: An Emulator for Fine-Tuning Large Language Models using Small Language Models
      • avoid resource-intensive fine-tuning of llm by ensembling them with small fine-tuned models
      • also: scaling up finetuning improves helpfulness, scaling up pre-training improves factuality
    • Tuna: Instruction Tuning using Feedback from Large Language Models
      • finetuning with contextual ranking
    • AutoMix: Automatically Mixing Language Models
      • strategically routes queries to a larger LLM, based on the outputs from a smaller LM
4.2.4.3. ADDITIVE METHODS
  • 4.2.3.2
  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
    • LoRA composability for cross-task generalization; neither more parameters nor gradients
  • Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
  1. LORA

4.2.5. MEMORY

  • Memorizing Transformers repo
    • the Memorizing Transformer does not need to be pre-trained from scratch; it is possible to add memory to an existing pre-trained model and then fine-tune it
  • Think Before You Act: Decision Transformers with Internal Working Memory, task specialized memory
  • Memory Augmented Language Models through Mixture of Word Experts
    • Mixture of Word Experts (MoWE) (Mixture-of-Experts (MoE))
    • a set of word-specific experts plays the role of a sparse memory; similar performance to more complex memory-augmented models
  • Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
    • minimize the data movement between the CPU and GPU.
    • Mixtral-8x7B model (90GB of parameters), over 3 tokens per second on a single GPU with 24GB memory
  • GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection =best=
    • feasibility of pre-training a 7B model on GPUs with 24GB memory; unlike lora
      • 82.5% reduction in memory
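
A simplified sketch of GaLore's gradient low-rank projection: refresh a projection matrix from the gradient's SVD every so often, take the optimizer step in the small projected space, and project back onto the weights. Plain SGD is used here instead of Adam, purely to show the projection.

import torch

def galore_sgd_step(W, grad, P, lr=1e-2):
    R = P.T @ grad          # (r, n): low-rank representation of the gradient
    W -= lr * (P @ R)       # project back up and apply the update
    return W

torch.manual_seed(0)
W = torch.randn(256, 128)
rank, refresh_every = 8, 50
P = None
for step in range(200):
    grad = torch.randn_like(W)                 # stand-in for a real gradient
    if P is None or step % refresh_every == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                        # (m, r) projection matrix
    W = galore_sgd_step(W, grad, P)
print(W.shape)
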
4.2.5.1. CONTEXT LENGTH
  • 1.1
  • Augmenting Language Models with Long-Term Memory (unlimited context)
  • YaRN: Efficient Context Window Extension of Large Language Models
  • Efficient Memory Management for Large Language Model Serving with PagedAttention
    • vLLM: near-zero waste in KV cache memory, and flexible
  • Flash-Decoding: make long-context LLM inference up to 8x faster
    • load the KV cache in parallel as fast as possible, then separately rescale to combine the results (see the combination sketch after this list)
  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
    • LLM serving system dynamically managing KV Cache, orchestrates across the data center
  • Extending LLMs’ Context Window with 100 Samples
    • introduce a novel extension to RoPE so that it can adapt to larger context windows (efficiently)
    • demonstrated on LLaMA
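
A numpy sketch of the split-then-rescale trick used by Flash-Decoding (referenced above): compute attention over KV-cache chunks with a per-chunk max and sum-of-exp, then combine the partial results exactly by rescaling; checked against the naive single-pass softmax.

import numpy as np

def chunked_attention(q, K, V, n_chunks=4):
    d = q.shape[0]
    partials = []
    for Kc, Vc in zip(np.array_split(K, n_chunks), np.array_split(V, n_chunks)):
        s = Kc @ q / np.sqrt(d)
        m = s.max()
        e = np.exp(s - m)
        partials.append((m, e.sum(), e @ Vc))   # per-chunk: max, sum-of-exp, weighted values
    m_glob = max(m for m, _, _ in partials)
    denom = sum(z * np.exp(m - m_glob) for m, z, _ in partials)
    numer = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return numer / denom

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
s = K @ q / np.sqrt(64)
reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(chunked_attention(q, K, V), reference))   # True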

4.2.6. DATASET

  • 4.2.2.2
  • LIMA: Less Is More for Alignment
    • trained only 1,000 carefully curated prompts and responses
  • q2d: Turning Questions into Dialogs to Teach Models How to Search
    • synthetically-generated data achieve 90%–97% of the performance of training on human-generated data
  • Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
    • high-quality model and dataset from a low-quality teacher model
  • Simple synthetic data reduces sycophancy in large language models
    • sycophancy = the model adapting its answers to a user's stated views, even when those views are objectively incorrect
    • lightweight finetuning step
  • GPT Can Solve Mathematical Problems Without a Calculator; with training data = multi-digit arithmetic
  • TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
    • annotating the dataset with “why” instead of only “what”
    • Lema: Learning From Mistakes Makes LLM Better Reasoner
      • identify, explain, and correct mistakes using the LLM itself, then finetune on them (learn from mistakes)
  • Ziya2: Data-centric Learning is All LLMs Need
    • focus on pre-training techniques and data-centric optimization to enhance learning process

Author: Tekakutli

Created: 2024-04-13 Sat 04:35