Table of Contents
- OpenSource Model but for new Hardware
- C++ generation library and list of supported models (GPT, RWKV): ggml
- Language Model Inversion
- given the output, reconstruct the original prompt
- LoRA or QLoRA
1. ADDED - EXTRAS TO LLM
- llama plugins: https://twitter.com/algo_diver/status/1639681733468753925
- llama tools: https://github.com/OpenBMB/ToolBench
- streaming vs non-streaming generation
1.1. VECTOR DB
- langchain, and https://github.com/srush/MiniChain
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
- MemGPT: manages memory tiers to effectively provide extended context within the LLM's limited context window
- LLM taught to manage its own memory, resembles paging in an OS (main context, external context); see the sketch below
=best=
- trained to generate function calls
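A minimal sketch (not the official MemGPT implementation) of the paging idea above: the model emits function calls that move facts between a small "main context" kept in the prompt and an unbounded "external context". The llm-emitted call format, the tool names, and the helper functions are assumptions for illustration.

import json

MAIN_CONTEXT_LIMIT = 10          # max number of memory lines kept in the prompt

main_context: list[str] = []     # analogous to RAM / the context window
external_context: list[str] = [] # analogous to disk / archival storage


def archive_oldest() -> None:
    """Page the oldest main-context entry out to external storage."""
    if main_context:
        external_context.append(main_context.pop(0))


def recall(query: str) -> list[str]:
    """Naive retrieval from external storage (a real system would use a vector DB)."""
    return [m for m in external_context if query.lower() in m.lower()]


def handle_tool_call(call: dict) -> None:
    """Dispatch the function calls the model was trained to emit."""
    if call["name"] == "memory_append":
        main_context.append(call["arguments"]["text"])
        while len(main_context) > MAIN_CONTEXT_LIMIT:
            archive_oldest()              # eviction == paging out
    elif call["name"] == "memory_recall":
        for hit in recall(call["arguments"]["query"]):
            main_context.append(hit)      # page relevant facts back in


# Example: pretend the model emitted these calls during a conversation.
handle_tool_call({"name": "memory_append",
                  "arguments": {"text": "User's dog is named Bruno."}})
handle_tool_call({"name": "memory_recall", "arguments": {"query": "dog"}})
print(json.dumps({"main": main_context, "external": external_context}, indent=2))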
2. SPECIALIZED USES
- QUERYING MODELS - MULTIMODAL
- Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding; medical, doctor
- Personality Traits in Large Language Models, quantifying personalities
- ChipNeMo: Domain-Adapted LLMs for Chip Design
- LARP: Language-Agent Role Play for Open-World Games
- decision-making assistant, framework refines interactions between users and agents
2.1. LAYOUT LLM
- PosterLlama: Bridging Design Ability of Language Model to Contents-Aware Layout Generation
- reformatting layout elements into HTML code
- unconditional layout generation, element conditional layout generation, layout completion
2.2. PLOT
- Pix2Struct: text to plot
- DePlot: plot-to-text model helping LLMs understand plots
- MatCha: great chart & math capabilities by plot deconstruction & numerical reasoning objectives
- StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
- based on the Code-LLaMA architecture
2.3. LEGAL
- SaulLM-7B: A pioneering Large Language Model for Law
- designed explicitly for legal text comprehension and generation
2.4. VISUAL
- Pixel Aligned Language Models
- can take locations (set of points, boxes) as inputs or outputs
- location-aware vision-language tasks
2.5. CODE ASSISTANT
- ROBOTS WEB MOCKING
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
- cross-file contextual understanding
- Mixtral-8x7B > CodeLlama-34B (on HumanEval)
2.5.1. MATH
- Llemma: An Open Language Model For Mathematics
- capable of tool use and formal theorem proving
- Large Language Models for Mathematicians (academic)
- mathematical description of the transformer model used in all modern language models
- Chronos: Learning the Language of Time Series
- improve zero-shot accuracy on unseen forecasting tasks; forecasting pipeline
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- extract crucial reasoning steps, to reveal the intermediate reasoning quality
- MLLMs
2.5.2. CODE COMPLETION
- DeciCoder: decoder-only code completion model
- approach of grouping tokens into clusters and having each token attend to others only within its cluster
- Magicoder: Source Code Is All You Need
- MagicoderS-CL-7B based on CodeLlama
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
- breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks
- while masking segments to properly optimize
2.5.2.1. OPERATOR
- Enhancing Network Management Using Code Generated by Large Language Models
- program synthesis: generate task-specific code from natural language queries
- analyzing network topologies and communication graphs
2.5.3. DIFFUSION
- CodeFusion: A Pre-trained Diffusion Model for Code Generation
=diffusion=
- (75M params vs 1B auto-regressive); iterative denoising, no need to start from scratch
- Text Rendering Strategies for Pixel Language Models
- characters as images, handle any script; PIXEL model
2.5.4. TOOLS-USE TOOLS
- Grammar Prompting for Domain-Specific Language Generation with Large Language Models
- like programming languages
- predict a BNF grammar given an input, then generates the output according to the rules of that grammar
- Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
- zero-shot prompts with only documentation are sufficient for tool usage
- tool documentation > demonstrations
- ControlLLM: Augment Language Models with Tools by Searching on Graphs
- breaks down a complex task into clear subtasks, then finds the optimal solution path
- Fay: integrating language models and digital characters
2.6. TRANSLATION
- elit: NLP toolkit providing tokenization, tagging, and recognition for multiple languages
- translation prompt: https://boards.4channel.org/g/thread/92468569#p92470651
- EMMA: Efficient Monotonic Multihead Attention
- simultaneous speech-to-text translation on the Spanish and English translation task
2.7. OPTIMIZATION
- OPRO: Optimization by PROmpting, Large Language Models as Optimizers
- each step = generates new solutions from previously generated solutions
- Large Language Models for Compiler Optimization
- reducing instruction counts compared to the compiler's own optimizations
- EvoPrompt: Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
2.7.1. CACHE
- SparQ Attention: Bandwidth-Efficient LLM Inference
- reducing memory bandwidth requirements within the attention blocks through selective fetching of the cached history (up to eight times)
2.8. SUMMARIZATION
- thread summarizer https://labs.kagi.com/ai/sum?url=%3E%3E248633369
- LLM Use Case: Summarization (using langchain)
- From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
- iteratively incorporating missing salient entities without increasing the length (see the sketch at the end of this subsection)
- LMDX: Language Model-based Document Information Extraction and Localization
- methodology to adapt arbitrary LLMs for document information extraction (without hallucination)
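A rough sketch of a Chain-of-Density style summarization loop as described above. The `llm` callable is a placeholder for any chat-completion client, and the prompt wording is illustrative, not the paper's exact template.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def chain_of_density(article: str, rounds: int = 5) -> str:
    summary = llm(f"Write a short, entity-sparse summary of:\n\n{article}")
    for _ in range(rounds):
        summary = llm(
            "Identify 1-3 salient entities from the article that are missing "
            "from the current summary, then rewrite the summary to include them "
            "WITHOUT increasing its length.\n\n"
            f"Article:\n{article}\n\nCurrent summary:\n{summary}"
        )
    return summary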
3. TEXT DIFFUSION
- parent: diffusion
- GENIE: Large Scale Pre-training for Text Generation with Diffusion Model
- TESS: Text-to-Text Self-Conditioned Simplex Diffusion
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
- PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
- DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space
- enhances the diversity of dialog responses while maintaining coherence
4. TEXT GENERATION
4.1. INFERENCE
4.1.1. BETTER
4.1.1.1. FOCUS THE ATTENTION
- PASTA: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
- identifies a small subset of attention heads, then applies precise attention reweighting on them
- applied in addition to prompting
- S2A: System 2 Attention (is something you might need too)
- regenerates context to only include the relevant portions before responding
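A rough sketch of System 2 Attention (S2A) as the two-pass prompting scheme described above: first ask the model to rewrite the context keeping only relevant parts, then answer using the cleaned context. The `llm` callable and the prompt wording are assumptions, not the paper's exact templates.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def system2_attention(context: str, question: str) -> str:
    cleaned = llm(
        "Rewrite the following context so that it contains only the parts "
        "relevant to answering the question, removing opinions and distractions.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    return llm(f"Context:\n{cleaned}\n\nQuestion: {question}\nAnswer:")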
4.1.2. FASTER
- Accelerating LLM Inference with Staged Speculative Decoding
- restructure the speculative batch as a tree (a plain speculative-decoding baseline is sketched at the end of this subsection)
- MobileNMT: Enabling Translation in 15MB and 30ms
- FlashDecoding++: Faster Large Language Model Inference on GPUs
- inference engine, 2-4x speedup; optimizes flat GEMM shapes
- Exponentially Faster Language Modelling
- replacing feedforward networks with fast feedforward networks (FFFs)
- engages just 12 out of 4095 neurons for each layer inference, 78x speedup
- EAGLE: LLM decoding based on compression (and others with comparison: Medusa, Lookahead, Vanilla)
- sequence of second-top-layer features is compressible, making the prediction of subsequent feature vectors from previous ones easy by a small model
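For context on the entries above, a baseline (greedy) speculative-decoding sketch that staged/tree variants build on. `draft_next` and `target_next` are placeholder callables (assumptions) returning the argmax next token for a prefix; a real implementation would verify all draft tokens with the target model in one batched forward pass, while here the verification loop is written out for clarity.

from typing import Callable, List

Token = int


def speculative_decode(prefix: List[Token],
                       draft_next: Callable[[List[Token]], Token],
                       target_next: Callable[[List[Token]], Token],
                       k: int = 4,
                       max_new: int = 64) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) the cheap draft model proposes k tokens
        draft = []
        ctx = list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) the target model verifies them; keep the longest agreeing prefix
        accepted = 0
        for t in draft:
            if target_next(out) == t:
                out.append(t)
                accepted += 1
            else:
                break
        # 3) if a draft token was rejected, take the target's own token instead
        if accepted < k:
            out.append(target_next(out))
    return out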
4.1.3. MODELS
4.1.3.1. QWEN
4.1.3.2. LLAMA
- LLaMa ipfs
- in browser (there is also the cpp one)
- train all Llama-2 models on your own data
- ALTERNATIVES
- Open LLama, Open-Source Reproduction, permissively licensed; Lit-LLaMA, RedPajama dataset
- Falcon: new family, open-source
=instruct finetuned too=
- LLaMA Pro: Progressive LLaMA with Block Expansion
- take a pretrained model, freeze its params, then add new blocks
- adapt the model to new data without forgetting the old
- LiteLlama: has 460M parameters trained with 1T tokens.
- MobiLlama: Small Language Models (SLMs), open-source 0.5 billion (0.5B) parameter
4.1.3.3. MISTRAL
- Mistral-7B: outperforms Llama 2 13B, Apache 2.0 licensed
- BakLLaVA: mistral + vision model
- zephyr: fine-tuned using Direct Preference Optimization (DPO)
- dataset ranked by a teacher model for intent alignment; smaller: 7B vs 70B llama
- OpenHermes-2: roleplay, gpt4 dataset
- https://huggingface.co/TheBloke/openchat_3.5-GGUF
- notux: chat data
4.2. TRAINING
- Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
- Training Large Language Models Efficiently with Sparsity and Dataflow
- Randomized Positional Encodings Boost Length Generalization of Transformers
- MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
- mixes forward and reverse cross-entropy RATHER THAN plain maximum likelihood estimation (MLE)
- Neurons in Large Language Models: Dead, N-gram, Positional
- study: in some layers over 70% of neurons are dead; some neurons specialize in removing information from the input
- Backpack Language Models: non-contextual sense vectors, which specialize in encoding different aspects of a word
- In-Context Learning Creates Task Vectors
- In-context learning = compressing training set into a single task vector, then using it to modulate transformer to produce the output
- Efficient Streaming Language Models with Attention Sinks
=better inference or training=
- =context window cache is bad=: just keep the first tokens around (as is), or better, have a static null token at the beginning of the window (see the sketch below)
- related to the “Vision Transformers need registers” paper
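A sketch of the StreamingLLM-style cache policy described above: always keep the first few "attention sink" tokens plus a sliding window of recent tokens, evicting everything in between. Cache entries are opaque objects here (assumption); a real implementation would evict per-layer key/value tensors.

from collections import deque


class SinkCache:
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks: list = []               # first tokens, never evicted
        self.recent = deque(maxlen=window)  # sliding window of recent tokens

    def append(self, kv_entry) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)    # deque drops the oldest automatically

    def entries(self) -> list:
        return self.sinks + list(self.recent)


cache = SinkCache(n_sink=4, window=8)
for position in range(20):
    cache.append(f"kv@{position}")
print(cache.entries())  # kv@0..kv@3 plus the 8 most recent positions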
4.2.1. CHEAPNESS
4.2.2. STRUCTURE
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- probabilistic programming language = commonsense reasoning, linguistics
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- learn to generate rationales at each token to explain future text, improving their predictions
4.2.2.1. MERGING
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
- (specialized) text model merging (using rankings)
- FuseChat: Knowledge Fusion of Chat Models
- knowledge fusion across chat LLMs of structurally diverse architectures and scales
4.2.2.2. SKELETON
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- first skeleton, then parallel filling; faster and better (a sketch appears at the end of this subsection)
- ART: Automatic multi-step reasoning and tool-use for large language models
- bubbles of logic
- Orca 2: Teaching Small Language Models How to Reason
- reasoning techniques: step-by-step, recall then generate, recall-reason-generate, direct answer
- PathFinder: Guided Search over Multi-Step Reasoning Paths
- tree-search-based reasoning path generation approach (beam search algorithm)
- improved commonsense reasoning tasks and complex arithmetic
- Stream of Search (SoS): Learning to Search in Language
- models can be taught to search by representing the process of search in language, as a flattened string
- META-PROCESS TOKENS
- Teach LLMs to Personalize – An Approach inspired by Writing Education
- retrieval, ranking, summarization, synthesis, and generation
- Link-Context Learning for Multimodal LLMs
- causal associations between data points = cause and effect
- In-Context Learning (ICL) = learn to learn
- from limited tasks (providing demonstrations) and generalize to unseen tasks
- LoGiPT: Language Models can be Logical Solvers
- parse natural language logical questions into symbolic representations, emulates logical solvers
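A sketch of Skeleton-of-Thought decoding (referenced above): first ask for a short skeleton of bullet points, then expand each point in parallel and stitch the answers together. The `llm` callable and the prompts are illustrative assumptions, not the paper's exact templates.

from concurrent.futures import ThreadPoolExecutor


def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def skeleton_of_thought(question: str, max_points: int = 5) -> str:
    skeleton = llm(
        f"Give a skeleton of at most {max_points} short bullet points "
        f"(3-5 words each) for answering:\n{question}"
    )
    points = [p.strip("-* ").strip() for p in skeleton.splitlines() if p.strip()]

    def expand(point: str) -> str:
        return llm(f"Question: {question}\nExpand this skeleton point into "
                   f"1-2 sentences: {point}")

    with ThreadPoolExecutor() as pool:      # point expansions are independent
        expansions = list(pool.map(expand, points))
    return "\n".join(expansions)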
4.2.2.3. CORPUS STRUCTURE, RETRIEVAL
- NPM: Nonparametric Masked Language Modeling, vs GPT-3, text-corpus based
- other code implementations https://www.catalyzex.com/paper/arxiv:2212.01349/code
- RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models
- in-context learning in retrieval-augmented language models
- LLM AS ENCODER
- GZIP VS GPT
- Copy Is All You Need
- task of text generation decomposed into a series of copy-and-paste operations
- text spans rather than vocabulary
- learning = text compression algorithm ?
- Decoding the ACL Paper: Gzip and KNN Rival BERT in Text Classification (see the sketch at the end of this subsection)
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- LLMs can be effectively transformed into universal text encoders without the need for expensive adaptation
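A sketch of the gzip + kNN text classification idea above ("GZIP VS GPT"): classify by normalized compression distance (NCD) with a k-nearest-neighbour vote, standard library only. The tiny training set and labels are made up for illustration.

import gzip


def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))


def ncd(a: str, b: str) -> float:
    # normalized compression distance between two strings
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # train is a list of (text, label) pairs; majority vote over k nearest
    neighbours = sorted(train, key=lambda ex: ncd(text, ex[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)


train = [("the team won the championship game", "sports"),
         ("stocks fell as the market reacted to rates", "finance"),
         ("the striker scored a late goal", "sports"),
         ("the central bank raised interest rates", "finance")]
print(classify("a last-minute goal decided the match", train, k=3))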
4.2.3. QUANTIZATION
- int-3 quantization: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and twitter
- llama.cpp quantization
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
- no more hand-crafted quantization parameters
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers, 5.8% lower on reasoning than the full-precision model
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
- identifies and structurally selects salient weights
- quantizes 7 billion weights within 0.5 hours
- EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
- leave the outliers (less than 1%) unchanged, implemented in parallel
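A toy round-to-nearest quantizer in the spirit of the notes above: symmetric int4 quantization, keeping the ~1% largest-magnitude weights ("outliers") unchanged in full precision. An illustrative baseline only, not the actual algorithm of AWQ, OmniQuant, BiLLM, or EasyQuant.

import numpy as np


def quantize_with_outliers(w: np.ndarray, bits: int = 4, outlier_frac: float = 0.01):
    flat = w.ravel()
    cutoff = np.quantile(np.abs(flat), 1.0 - outlier_frac)
    outlier_mask = np.abs(flat) > cutoff

    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for int4
    scale = np.abs(flat[~outlier_mask]).max() / qmax
    q = np.clip(np.round(flat / scale), -qmax, qmax)

    deq = q * scale
    deq[outlier_mask] = flat[outlier_mask]          # outliers stay full precision
    return deq.reshape(w.shape), scale, outlier_mask.mean()


w = np.random.randn(256, 256).astype(np.float32)
deq, scale, kept = quantize_with_outliers(w)
print(f"scale={scale:.4f}, outliers kept={kept:.2%}, "
      f"mean abs error={np.abs(w - deq).mean():.4f}")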
4.2.3.1. 1-BIT
4.2.4. FINETUNNING
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- In-Context Instruction Learning (ICIL)
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
- distillation
4.2.4.1. FEEDBACK AS TARGET
- 4.2.4.2.1
- RLHF = Reinforcement Learning from Human Feedback
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (DPO)
- can fine-tune LMs to align with human preferences, better than RLHF (loss sketched at the end of this subsection)
- RAD: Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
- generation which uses extra reward model to generate text with certain properties
- ReFT: Reasoning with Reinforced Fine-Tuning
- learn from multiple annotated reasoning paths
- rewards are naturally derived from the ground-truth answers (like math)
- SELF TRAIN
- TriPosT: Teaching Language Models to Self-Improve through Interactive Demonstrations
- self-improvement ability for small models: revise their own outputs, correcting their own mistakes
- Self-Refine: Iterative Refinement with Self-Feedback
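A sketch of the DPO loss referenced above, written against sequence log-probabilities that the caller supplies (summed token log-probs of the chosen/rejected responses under the policy and a frozen reference model). The function name, tensor shapes, and example numbers are assumptions for illustration.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # implicit rewards: beta * (log pi_theta - log pi_ref)
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between chosen and rejected via a logistic loss
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# tiny usage example with fake log-probabilities for a batch of 2 pairs
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -15.5]),
                torch.tensor([-13.0, -15.2]), torch.tensor([-13.5, -15.0]))
print(loss.item())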
4.2.4.2. CHEAPNESS
- Fine-Tuning Language Models with Just Forward Passes, less ram
- Full Parameter Fine-tuning for Large Language Models with Limited Resources, low-memory optimizer
- MULTIPLE LLM
- EFT: An Emulator for Fine-Tuning Large Language Models using Small Language Models
- avoid resource-intensive fine-tuning of llm by ensembling them with small fine-tuned models
- also: scaling up finetuning improves helpfulness, scaling up pre-training improves factuality
- Tuna: Instruction Tuning using Feedback from Large Language Models
- finetuning with contextual ranking
- AutoMix: Automatically Mixing Language Models
- strategically routes queries to larger llm, based on the outputs from a smaller LM
- EFT: An Emulator for Fine-Tuning Large Language Models using Small Language Models
4.2.4.3. ADDITIVE METHODS
- 4.2.3.2
- LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
- LoRA composability for cross-task generalization; neither more parameters nor gradients (basic LoRA layer sketched below)
- Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
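A minimal LoRA-style adapter, the building block these composition methods work with: the frozen weight gets a trainable low-rank update scaled by alpha/r. A generic sketch, not the code of any specific library.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)                  # frozen pretrained layer
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(64, 64)
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])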
4.2.5. MEMORY
- Memorizing Transformers repo
- Memorizing Transformer does not need to be pre-trained from scratch; possible adding memory to an existing pre-trained model, and then fine-tuning it
- Think Before You Act: Decision Transformers with Internal Working Memory, task specialized memory
- Memory Augmented Language Models through Mixture of Word Experts
- Mixture of Word Experts (MoWE) (Mixture-of-Experts (MoE))
- set of word-specific experts plays the role of a sparse memory, with similar performance to more complex memory-augmented approaches
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- minimize the data movement between the CPU and GPU.
- Mixtral-8x7B model, 90GB parameters, over 3 tokens per second on a single GPU with 24GB memory
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
=best=
- feasibility of pre-training a 7B model on GPUs with 24GB memory; unlike lora
- 82.5% reduction in memory
4.2.5.1. CONTEXT LENGTH
- 1.1
- Augmenting Language Models with Long-Term Memory (unlimited context)
- YaRN: Efficient Context Window Extension of Large Language Models
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- vLLM: near-zero waste in KV cache memory, and flexible
- Flash-Decoding: make long-context LLM inference up to 8x faster
- load the KV cache in parallel as fast as possible, then separately rescale to combine the results (combine step sketched at the end of this subsection)
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- LLM serving system dynamically managing KV Cache, orchestrates across the data center
- Extending LLMs’ Context Window with 100 Samples
- introduce a novel extension to RoPE so that it can adapt to larger context windows (efficiently)
- demonstrated on LLaMA
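A numerical sketch of the "split the KV cache, then rescale and combine" idea behind Flash-Decoding, noted above: attention over each chunk is computed independently with its own softmax statistics (local max and sum of exponentials), then the partial outputs are merged exactly. Single head, single query, numpy only; names and sizes are illustrative.

import numpy as np


def partial_attention(q, k_chunk, v_chunk):
    scores = k_chunk @ q / np.sqrt(q.shape[0])
    m = scores.max()                        # chunk-local max for stability
    w = np.exp(scores - m)
    return w @ v_chunk, w.sum(), m          # unnormalized output, sum, max


def combine(parts):
    # rescale each chunk's statistics to a common max, then normalize once
    m_global = max(m for _, _, m in parts)
    num = sum(out * np.exp(m - m_global) for out, _, m in parts)
    den = sum(s * np.exp(m - m_global) for _, s, m in parts)
    return num / den


rng = np.random.default_rng(0)
d, n = 16, 32
q, k, v = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))

chunked = combine([partial_attention(q, k[:n // 2], v[:n // 2]),
                   partial_attention(q, k[n // 2:], v[n // 2:])])
scores = k @ q / np.sqrt(d)
full = np.exp(scores - scores.max()) @ v / np.exp(scores - scores.max()).sum()
print(np.allclose(chunked, full))  # True: chunked combine matches full attention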
4.2.6. DATASET
- 4.2.2.2
- LIMA: Less Is More for Alignment
- trained on only 1,000 carefully curated prompts and responses
- q2d: Turning Questions into Dialogs to Teach Models How to Search
- synthetically-generated data achieve 90%–97% of the performance of training on human-generated data
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
- high-quality model and dataset from a low-quality teacher model
- Simple synthetic data reduces sycophancy in large language models
- sycophancy = adapting answers to match a user's revealed views, even toward statements that are objectively incorrect
- lightweight finetuning step
- GPT Can Solve Mathematical Problems Without a Calculator; with training data = multi-digit arithmetic
- TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise
- annotating the dataset with “why” instead of only “what”
- Lema: Learning From Mistakes Makes LLM Better Reasoner
- identify, explain, and correct mistakes, using the LLM itself for finetuning (learn from mistakes)
- Ziya2: Data-centric Learning is All LLMs Need
- focus on pre-training techniques and data-centric optimization to enhance learning process