Table of Contents
- OpenSource Model but for new Hardware
- C++ generation library and list of supported models (GPT, RWKV): ggml
- Language Model Inversion
- given the output, reconstruct the original prompt
- LoRA or QLoRA
1. ADDED - EXTRAS TO LLM
- llama plugins: https://twitter.com/algo_diver/status/1639681733468753925
- llama tools: https://github.com/OpenBMB/ToolBench
- streaming vs non-streaming generation
1.1. VECTOR DB
- langchain, and https://github.com/srush/MiniChain
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
- MemGPT: manages memory tiers to effectively provide extended context within the LLM's limited context window
- LLM taught to manage its own memory, resembles paging in an OS (main context, external context); see the sketch below
=best=
- trained to generate function calls
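A minimal sketch (not the official MemGPT implementation) of the paging idea above: the model emits function calls that move facts between a small "main context" kept in the prompt and an unbounded "external context". The llm-emitted call format, the tool names, and the helper functions are assumptions for illustration.

import json

MAIN_CONTEXT_LIMIT = 10          # max number of memory lines kept in the prompt

main_context: list[str] = []     # analogous to RAM / the context window
external_context: list[str] = [] # analogous to disk / archival storage


def archive_oldest() -> None:
    """Page the oldest main-context entry out to external storage."""
    if main_context:
        external_context.append(main_context.pop(0))


def recall(query: str) -> list[str]:
    """Naive retrieval from external storage (a real system would use a vector DB)."""
    return [m for m in external_context if query.lower() in m.lower()]


def handle_tool_call(call: dict) -> None:
    """Dispatch the function calls the model was trained to emit."""
    if call["name"] == "memory_append":
        main_context.append(call["arguments"]["text"])
        while len(main_context) > MAIN_CONTEXT_LIMIT:
            archive_oldest()              # eviction == paging out
    elif call["name"] == "memory_recall":
        for hit in recall(call["arguments"]["query"]):
            main_context.append(hit)      # page relevant facts back in


# Example: pretend the model emitted these calls during a conversation.
handle_tool_call({"name": "memory_append",
                  "arguments": {"text": "User's dog is named Bruno."}})
handle_tool_call({"name": "memory_recall", "arguments": {"query": "dog"}})
print(json.dumps({"main": main_context, "external": external_context}, indent=2))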
2. SPECIALIZED USES
- QUERYING MODELS - MULTIMODAL
- Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding; medical, doctor
- Personality Traits in Large Language Models, quantifying personalities
- ChipNeMo: Domain-Adapted LLMs for Chip Design
- LARP: Language-Agent Role Play for Open-World Games
- decision-making assistant, framework refines interactions between users and agents
2.1. LAYOUT LLM
- PosterLlama: Bridging Design Ability of Language Model to Contents-Aware Layout Generation
- reformatting layout elements into HTML code
- unconditional layout generation, element conditional layout generation, layout completion
2.2. PLOT
- Pix2Struct: text to plot
- DePlot: plot-to-text model helping LLMs understand plots
- MatCha: great chart & math capabilities by plot deconstruction & numerical reasoning objectives
- StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
- based on the Code-LLaMA architecture
2.3. LEGAL
- SaulLM-7B: A pioneering Large Language Model for Law
- designed explicitly for legal text comprehension and generation
2.4. VISUAL
- Pixel Aligned Language Models
- can take locations (set of points, boxes) as inputs or outputs
- location-aware vision-language tasks
2.5. CODE ASSISTANT
- ROBOTS WEB MOCKING
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- cross-file contextual understanding
- Mixtral-8x7B > CodeLlama-34B (on HumanEval)
2.5.1. MATH
- Llemma: An Open Language Model For Mathematics
- capable of tool use and formal theorem proving
- Large Language Models for Mathematicians (academic)
- mathematical description of the transformer model used in all modern language models
- Chronos: Learning the Language of Time Series
- improve zero-shot accuracy on unseen forecasting tasks; forecasting pipeline
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- extract crucial reasoning steps, to reveal the intermediate reasoning quality
- MLLMs
2.5.2. CODE COMPLETION
- DeciCoder: decoder-only code completion model
- approach of grouping tokens into clusters and having each token attend to others only within its cluster
- Magicoder: Source Code Is All You Need
- MagicoderS-CL-7B based on CodeLlama
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
- breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks
- while masking segments to properly optimize
2.5.2.1. OPERATOR
- Enhancing Network Management Using Code Generated by Large Language Models
- program synthesis: generate task-specific code from natural language queries
- analyzing network topologies and communication graphs
2.5.3. DIFFUSION
- CodeFusion: A Pre-trained Diffusion Model for Code Generation
=diffusion=
- (75M params vs 1B auto-regressive); iterative denoising, no need to start from scratch
- Text Rendering Strategies for Pixel Language Models
- characters as images, handle any script; PIXEL model
2.5.4. TOOLS-USE TOOLS
- Grammar Prompting for Domain-Specific Language Generation with Large Language Models
- like programming languages
- predict a BNF grammar given an input, then generates the output according to the rules of that grammar
- Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
- zero-shot prompts with only documentation are sufficient for tool usage
- tool documentation > demonstrations
- ControlLLM: Augment Language Models with Tools by Searching on Graphs
- breaks down a complex task into clear subtasks, then finds the optimal solution path
- Fay: integrating language models and digital characters
2.6. TRANSLATION
- elit: NLP toolkit providing tokenization, tagging, and recognition for multiple languages
- translation prompt: https://boards.4channel.org/g/thread/92468569#p92470651
- EMMA: Efficient Monotonic Multihead Attention
- simultaneous speech-to-text translation on the Spanish and English translation task
2.7. OPTIMIZATION
- OPRO: Optimization by PROmpting, Large Language Models as Optimizers
- each step = generates new solutions from previously generated solutions
- Large Language Models for Compiler Optimization
- reducing instruction counts compared to the compiler's own optimizations
- EvoPrompt: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
2.7.1. CACHE
- SparQ Attention: Bandwidth-Efficient LLM Inference
- reducing memory bandwidth requirements within the attention blocks through selective fetching of the cached history (up to eight times)
2.8. SUMMARIZATION
- thread summarizer https://labs.kagi.com/ai/sum?url=%3E%3E248633369
- LLM Use Case: Summarization (using langchain)
- From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
- iteratively incorporating missing salient entities without increasing the length (see the sketch at the end of this subsection)
- LMDX: Language Model-based Document Information Extraction and Localization
- methodology to adapt arbitrary LLMs for document information extraction (without hallucination)
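A rough sketch of a Chain-of-Density style summarization loop as described above. The `llm` callable is a placeholder for any chat-completion client, and the prompt wording is illustrative, not the paper's exact template.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def chain_of_density(article: str, rounds: int = 5) -> str:
    summary = llm(f"Write a short, entity-sparse summary of:\n\n{article}")
    for _ in range(rounds):
        summary = llm(
            "Identify 1-3 salient entities from the article that are missing "
            "from the current summary, then rewrite the summary to include them "
            "WITHOUT increasing its length.\n\n"
            f"Article:\n{article}\n\nCurrent summary:\n{summary}"
        )
    return summary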
3. TEXT DIFFUSION
- parent: diffusion
- GENIE: Large Scale Pre-training for Text Generation with Diffusion Model
- TESS: Text-to-Text Self-Conditioned Simplex Diffusion
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
- PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
- DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space
- enhances the diversity of dialog responses while maintaining coherence
4. TEXT GENERATION
4.1. INFERENCE
4.1.1. BETTER
4.1.1.1. FOCUS THE ATTENTION
- PASTA: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
- identifies a small subset of attention heads, then applies precise attention reweighting on them
- applied in addition to prompting
- S2A: System 2 Attention (is something you might need too)
- regenerates context to only include the relevant portions before responding
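A rough sketch of System 2 Attention (S2A) as the two-pass prompting scheme described above: first ask the model to rewrite the context keeping only relevant parts, then answer using the cleaned context. The `llm` callable and the prompt wording are assumptions, not the paper's exact templates.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def system2_attention(context: str, question: str) -> str:
    cleaned = llm(
        "Rewrite the following context so that it contains only the parts "
        "relevant to answering the question, removing opinions and distractions.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    return llm(f"Context:\n{cleaned}\n\nQuestion: {question}\nAnswer:")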
4.1.2. FASTER
- Accelerating LLM Inference with Staged Speculative Decoding
- restructure the speculative batch as a tree (a plain speculative-decoding baseline is sketched at the end of this subsection)
- MobileNMT: Enabling Translation in 15MB and 30ms
- FlashDecoding++: Faster Large Language Model Inference on GPUs
- inference engine, 2-4x speedup; optimizes flat GEMM shapes
- Exponentially Faster Language Modelling
- replacing feedforward networks with fast feedforward networks (FFFs)
- engages just 12 out of 4095 neurons for each layer inference, 78x speedup
- EAGLE: LLM decoding based on compression (and others with comparison: Medusa, Lookahead, Vanilla)
- sequence of second-top-layer features is compressible, making the prediction of subsequent feature vectors from previous ones easy by a small model
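For context on the entries above, a baseline (greedy) speculative-decoding sketch that staged/tree variants build on. `draft_next` and `target_next` are placeholder callables (assumptions) returning the argmax next token for a prefix; a real implementation would verify all draft tokens with the target model in one batched forward pass, while here the verification loop is written out for clarity.

from typing import Callable, List

Token = int


def speculative_decode(prefix: List[Token],
                       draft_next: Callable[[List[Token]], Token],
                       target_next: Callable[[List[Token]], Token],
                       k: int = 4,
                       max_new: int = 64) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) the cheap draft model proposes k tokens
        draft = []
        ctx = list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) the target model verifies them; keep the longest agreeing prefix
        accepted = 0
        for t in draft:
            if target_next(out) == t:
                out.append(t)
                accepted += 1
            else:
                break
        # 3) if a draft token was rejected, take the target's own token instead
        if accepted < k:
            out.append(target_next(out))
    return out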
4.1.3. MODELS
4.1.3.1. QWEN
4.1.3.2. LLAMA
- LLaMa ipfs
- in browser (there is also the cpp one)
- train all Llama-2 models on your own data
- ALTERNATIVES
- Open LLama, Open-Source Reproduction, permissively licensed; Lit-LLaMA, RedPajama dataset
- Falcon: new family, open-source
=instruct finetuned too=
- LLaMA Pro: Progressive LLaMA with Block Expansion
- take a pretrained model, freeze its params, then add new blocks
- adapt the model to new data without forgetting the old
- LiteLlama: has 460M parameters trained with 1T tokens.
- MobiLlama: Small Language Models (SLMs), open-source 0.5 billion (0.5B) parameter
4.1.3.3. MISTRAL
- Mistral-7B: outperforms Llama 2 13B, Apache 2.0 licensed
- BakLLaVA: mistral + vision model
- zephyr: fine-tuned using Direct Preference Optimization (DPO)
- dataset ranked by a teacher model for intent alignment; smaller: 7B vs 70B llama
- OpenHermes-2: roleplay, gpt4 dataset
- https://huggingface.co/TheBloke/openchat_3.5-GGUF
- notux: chat data
4.2. TRAINING
- Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
- Training Large Language Models Efficiently with Sparsity and Dataflow
- Randomized Positional Encodings Boost Length Generalization of Transformers
- MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
- mixes forward and reverse cross-entropy RATHER THAN plain maximum likelihood estimation (MLE)
- Neurons in Large Language Models: Dead, N-gram, Positional
- study: in some layers over 70% of neurons are dead; some neurons specialize in removing information from the input
- Backpack Language Models: non-contextual sense vectors, which specialize in encoding different aspects of a word
- In-Context Learning Creates Task Vectors
- In-context learning = compressing training set into a single task vector, then using it to modulate transformer to produce the output
- Efficient Streaming Language Models with Attention Sinks
=better inference or training=
- =context window cache is bad=: just keep the first tokens around (as is), or better, have a static null token at the beginning of the window (see the sketch below)
- related to the “Vision Transformers need registers” paper
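A sketch of the StreamingLLM-style cache policy described above: always keep the first few "attention sink" tokens plus a sliding window of recent tokens, evicting everything in between. Cache entries are opaque objects here (assumption); a real implementation would evict per-layer key/value tensors.

from collections import deque


class SinkCache:
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks: list = []               # first tokens, never evicted
        self.recent = deque(maxlen=window)  # sliding window of recent tokens

    def append(self, kv_entry) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)    # deque drops the oldest automatically

    def entries(self) -> list:
        return self.sinks + list(self.recent)


cache = SinkCache(n_sink=4, window=8)
for position in range(20):
    cache.append(f"kv@{position}")
print(cache.entries())  # kv@0..kv@3 plus the 8 most recent positions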
4.2.1. CHEAPNESS
4.2.2. STRUCTURE
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- probabilistic programming language = commonsense reasoning, linguistics
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- learn to generate rationales at each token to explain future text, improving their predictions
4.2.2.1. MERGING
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
- (specialized) text model merging (using rankings)
- FuseChat: Knowledge Fusion of Chat Models
- knowledge fusion across chat LLMs of structurally diverse architectures and scales
4.2.2.2. SKELETON
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- first skeleton, then parallel filling; faster and better (a sketch appears at the end of this subsection)
- ART: Automatic multi-step reasoning and tool-use for large language models
- bubbles of logic
- Orca 2: Teaching Small Language Models How to Reason
- reasoning techniques: step-by-step, recall then generate, recall-reason-generate, direct answer
- PathFinder: Guided Search over Multi-Step Reasoning Paths
- tree-search-based reasoning path generation approach (beam search algorithm)
- improved commonsense reasoning tasks and complex arithmetic
- Stream of Search (SoS): Learning to Search in Language
- models can be taught to search by representing the process of search in language, as a flattened string
- META-PROCESS TOKENS
- Teach LLMs to Personalize – An Approach inspired by Writing Education
- retrieval, ranking, summarization, synthesis, and generation
- Link-Context Learning for Multimodal LLMs
- causal associations between data points = cause and effect
- In-Context Learning (ICL) = learn to learn
- from limited tasks (providing demonstrations) and generalize to unseen tasks
- LoGiPT: Language Models can be Logical Solvers
- parse natural language logical questions into symbolic representations, emulates logical solvers
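A sketch of Skeleton-of-Thought decoding (referenced above): first ask for a short skeleton of bullet points, then expand each point in parallel and stitch the answers together. The `llm` callable and the prompts are illustrative assumptions, not the paper's exact templates.

from concurrent.futures import ThreadPoolExecutor


def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def skeleton_of_thought(question: str, max_points: int = 5) -> str:
    skeleton = llm(
        f"Give a skeleton of at most {max_points} short bullet points "
        f"(3-5 words each) for answering:\n{question}"
    )
    points = [p.strip("-* ").strip() for p in skeleton.splitlines() if p.strip()]

    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this skeleton point into "
                   f"1-2 sentences: {point}")

    with ThreadPoolExecutor() as pool:      # point expansions are independent
        expansions = list(pool.map(expand, points))
    return "\n".join(expansions)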
4.2.2.3. CORPUS STRUCTURE, RETRIEVAL
- NPM: Nonparametric Masked Language Modeling, vs GPT-3, text-corpus based
- other code implementations https://www.catalyzex.com/paper/arxiv:2212.01349/code
- RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
- in-context learning in retrieval-augmented language models
- LLM AS ENCODER
- GZIP VS GPT
- Copy Is All You Need
- task of text generation decomposed into a series of copy-and-paste operations
- text spans rather than vocabulary
- learning = text compression algorithm ?
- Decoding the ACL Paper: Gzip and KNN Rival BERT in Text Classification (see the sketch at the end of this subsection)
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- LLMs can be effectively transformed into universal text encoders without the need for expensive adaptation
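A sketch of the gzip + kNN text classification idea above ("GZIP VS GPT"): classify by normalized compression distance (NCD) with a k-nearest-neighbour vote, standard library only. The tiny training set and labels are made up for illustration.

import gzip


def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))


def ncd(a: str, b: str) -> float:
    # normalized compression distance between two strings
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # train is a list of (text, label) pairs; majority vote over k nearest
    neighbours = sorted(train, key=lambda ex: ncd(text, ex[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)


train = [("the team won the championship game", "sports"),
         ("stocks fell as the market reacted to rates", "finance"),
         ("the striker scored a late goal", "sports"),
         ("the central bank raised interest rates", "finance")]
print(classify("a last-minute goal decided the match", train, k=3))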
4.2.3. QUANTIZATION
- int-3 quantization: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and twitter
- llama.cpp quantization
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
- no more hand-crafted quantization parameters
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers, 5.8% lower on reasoning than the full-precision model
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
- identifies and structurally selects salient weights
- quantizes 7 billion weights within 0.5 hours
- EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- leave the outliers (less than 1%) unchanged, implemented in parallel
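A toy round-to-nearest quantizer in the spirit of the notes above: symmetric int4 quantization, keeping the ~1% largest-magnitude weights ("outliers") unchanged in full precision. An illustrative baseline only, not the actual algorithm of AWQ, OmniQuant, BiLLM, or EasyQuant.

import numpy as np


def quantize_with_outliers(w: np.ndarray, bits: int = 4, outlier_frac: float = 0.01):
    flat = w.ravel()
    cutoff = np.quantile(np.abs(flat), 1.0 - outlier_frac)
    outlier_mask = np.abs(flat) > cutoff

    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for int4
    scale = np.abs(flat[~outlier_mask]).max() / qmax
    q = np.clip(np.round(flat / scale), -qmax, qmax)

    deq = q * scale
    deq[outlier_mask] = flat[outlier_mask]          # outliers stay full precision
    return deq.reshape(w.shape), scale, outlier_mask.mean()


w = np.random.randn(256, 256).astype(np.float32)
deq, scale, kept = quantize_with_outliers(w)
print(f"scale={scale:.4f}, outliers kept={kept:.2%}, "
      f"mean abs error={np.abs(w - deq).mean():.4f}")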
4.2.3.1. 1-BIT
4.2.4. FINETUNNING
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- In-Context Instruction Learning (ICIL)
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
- distillation
4.2.4.1. FEEDBACK AS TARGET
- 4.2.4.2.1
- RLHF = Reinforcement Learning from Human Feedback
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
- can fine-tune LMs to align with human preferences, better than RLHF (loss sketched at the end of this subsection)
- RAD: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
- generation which uses extra reward model to generate text with certain properties
- ReFT: Reasoning with Reinforced Fine-Tuning
- learn from multiple annotated reasoning paths
- rewards are naturally derived from the ground-truth answers (like math)
- SELF TRAIN
- TriPosT: Teaching Language Models to Self-Improve through Interactive Demonstrations
- self-improvement ability for small models: revise their own outputs, correcting their own mistakes
- Self-Refine: Iterative Refinement with Self-Feedback
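A sketch of the DPO loss referenced above, written against sequence log-probabilities that the caller supplies (summed token log-probs of the chosen/rejected responses under the policy and a frozen reference model). The function name, tensor shapes, and example numbers are assumptions for illustration.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # implicit rewards: beta * (log pi_theta - log pi_ref)
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between chosen and rejected via a logistic loss
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# tiny usage example with fake log-probabilities for a batch of 2 pairs
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-13.0, -15.2]), torch.tensor([-13.5, -15.0]))
print(loss.item())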
4.2.4.2. CHEAPNESS
- Fine-Tuning Language Models with Just Forward Passes, less ram
- Full Parameter Fine-tuning for Large Language Models with Limited Resources, low-memory optimizer
- MULTIPLE LLM
- EFT: An Emulator for Fine-Tuning Large Language Models using Small Language Models
- avoid resource-intensive fine-tuning of llm by ensembling them with small fine-tuned models
- also: scaling up finetuning improves helpfulness, scaling up pre-training improves factuality
- Tuna: Instruction Tuning using Feedback from Large Language Models
- finetuning with contextual ranking
- AutoMix: Automatically Mixing Language Models
- strategically routes queries to larger llm, based on the outputs from a smaller LM
- EFT: An Emulator for Fine-Tuning Large Language Models using Small Language Models
4.2.4.3. ADDITIVE METHODS
- 4.2.3.2
- LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
- LoRA composability for cross-task generalization; neither more parameters nor gradients (basic LoRA layer sketched below)
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
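A minimal LoRA-style adapter, the building block these composition methods work with: the frozen weight gets a trainable low-rank update scaled by alpha/r. A generic sketch, not the code of any specific library.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                  # frozen pretrained layer
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(64, 64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])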
4.2.5. MEMORY
- Memorizing Transformers repo
- Memorizing Transformer does not need to be pre-trained from scratch; possible adding memory to an existing pre-trained model, and then fine-tuning it
- Think Before You Act: Decision Transformers with Internal Working Memory, task specialized memory
- Memory Augmented Language Models through Mixture of Word Experts
- Mixture of Word Experts (MoWE) (Mixture-of-Experts (MoE))
- set of word-specific experts plays the role of a sparse memory, with similar performance to more complex memory-augmented approaches
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- minimize the data movement between the CPU and GPU.
- Mixtral-8x7B model, 90GB parameters, over 3 tokens per second on a single GPU with 24GB memory
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
=best=
- feasibility of pre-training a 7B model on GPUs with 24GB memory; unlike lora
- 82.5% reduction in memory
4.2.5.1. CONTEXT LENGTH
- 1.1
- Augmenting Language Models with Long-Term Memory (unlimited context)
- YaRN: Efficient Context Window Extension of Large Language Models
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- vLLM: near-zero waste in KV cache memory, and flexible
- Flash-Decoding: make long-context LLM inference up to 8x faster
- load the KV cache in parallel as fast as possible, then separately rescale to combine the results (combine step sketched at the end of this subsection)
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- LLM serving system dynamically managing KV Cache, orchestrates across the data center
- Extending LLMs’ Context Window with 100 Samples
- introduce a novel extension to RoPE so that it can adapt to larger context windows (efficiently)
- demonstrated on LLaMA
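A numerical sketch of the "split the KV cache, then rescale and combine" idea behind Flash-Decoding, noted above: attention over each chunk is computed independently with its own softmax statistics (local max and sum of exponentials), then the partial outputs are merged exactly. Single head, single query, numpy only; names and sizes are illustrative.

import numpy as np


def partial_attention(q, k_chunk, v_chunk):
    scores = k_chunk @ q / np.sqrt(q.shape[0])
    m = scores.max()                        # chunk-local max for stability
    w = np.exp(scores - m)
    return w @ v_chunk, w.sum(), m          # unnormalized output, sum, max


def combine(parts):
    # rescale each chunk's statistics to a common max, then normalize once
    m_global = max(m for _, _, m in parts)
    num = sum(out * np.exp(m - m_global) for out, _, m in parts)
    den = sum(s * np.exp(m - m_global) for _, s, m in parts)
    return num / den


rng = np.random.default_rng(0)
d, n = 16, 32
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

chunked = combine([partial_attention(q, k[:n // 2], v[:n // 2]),
                   partial_attention(q, k[n // 2:], v[n // 2:])])
scores = k @ q / np.sqrt(d)
full = np.exp(scores - scores.max()) @ v / np.exp(scores - scores.max()).sum()
print(np.allclose(chunked, full))  # True: chunked combine matches full attention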
4.2.6. DATASET
- 4.2.2.2
- LIMA: Less Is More for Alignment
- trained on only 1,000 carefully curated prompts and responses
- q2d: Turning Questions into Dialogs to Teach Models How to Search
- synthetically-generated data achieve 90%–97% of the performance of training on human-generated data
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
- high-quality model and dataset from a low-quality teacher model
- Simple synthetic data reduces sycophancy in large language models
- sycophancy = adapting answers to match a user's revealed views, even toward statements that are objectively incorrect
- lightweight finetuning step
- GPT Can Solve Mathematical Problems Without a Calculator; with training data = multi-digit arithmetic
- TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
- annotating the dataset with “why” instead of only “what”
- Lema: Learning From Mistakes Makes LLM Better Reasoner
- identify, explain, and correct mistakes, using the LLM itself for finetuning (learn from mistakes)
- Ziya2: Data-centric Learning is All LLMs Need
- focus on pre-training techniques and data-centric optimization to enhance learning process