Computer Vision
Table of Contents
- UNDERSTANDING
- VISSL: computer VIsion library for Self-Supervised Learning
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- FastV: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- plug-and-play inference acceleration: prunes redundant visual tokens after an early layer (sketch below)
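A minimal sketch of the FastV idea, assuming attention weights from an early layer are available: rank visual tokens by the attention they receive and keep only the top half. Function and argument names are illustrative, not the paper's code.

```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """After an early layer, keep only the most-attended visual tokens.

    hidden: (batch, seq, dim) hidden states
    attn:   (batch, heads, seq, seq) attention weights from that layer
    vis_start/vis_end: slice of the sequence holding image tokens
    """
    # Average attention each visual token receives, over heads and queries.
    scores = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_end]  # (batch, n_vis)
    n_keep = int(scores.shape[1] * keep_ratio)
    keep = scores.topk(n_keep, dim=1).indices.sort(dim=1).values  # preserve order
    vis = hidden[:, vis_start:vis_end, :]
    kept = torch.gather(vis, 1, keep.unsqueeze(-1).expand(-1, -1, vis.shape[-1]))
    # Re-assemble: text before, pruned visual tokens, text after.
    return torch.cat([hidden[:, :vis_start], kept, hidden[:, vis_end:]], dim=1)
```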
1. CUSTOMIZATION
2. MAP AS OUTPUT
3. LEARNING FROM VIDEO
4. DOCUMENTS
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan)
- taking into account both textual semantics and spatial layout
- learns to infill text segments
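A hedged sketch of layout-aware attention in the DocLLM spirit: bounding-box embeddings get their own projections, and their scores are added to the text scores. Layer names and the (x1, y1, x2, y2) box format are assumptions.

```python
import torch
import torch.nn as nn

class LayoutAwareScores(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_t, self.k_t = nn.Linear(dim, dim), nn.Linear(dim, dim)  # text path
        self.q_s, self.k_s = nn.Linear(4, dim), nn.Linear(4, dim)      # box path

    def forward(self, text_emb, boxes):
        # Disentangled attention: semantic score + spatial score.
        score_text = self.q_t(text_emb) @ self.k_t(text_emb).transpose(-1, -2)
        score_box = self.q_s(boxes) @ self.k_s(boxes).transpose(-1, -2)
        return (score_text + score_box) / text_emb.shape[-1] ** 0.5
```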
5. TOKENIZER
- MiniGPT-4: image tokenizer with an image detector https://github.com/Vision-CAIR/MiniGPT-4
- MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
- dynamic tokenizer for ViTs, where the scale at which an image region is processed varies with its semantic detail (sketch below)
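A rough sketch, assuming a tiny learned gate that scores each coarse patch: semantically rich regions are split into finer patches, flat regions stay coarse. The gate interface and the 32/16 patch sizes are illustrative, not the paper's implementation.

```python
def mixed_scale_tokens(img, gate, coarse=32, fine=16):
    # img: (C, H, W) tensor; gate: module scoring a flattened patch in [0, 1].
    C, H, W = img.shape
    tokens = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            patch = img[:, y:y + coarse, x:x + coarse]
            if gate(patch.flatten()).item() < 0.5:
                tokens.append(patch.flatten())  # flat region: one coarse token
            else:
                # semantically rich region: split into finer tokens
                for fy in range(0, coarse, fine):
                    for fx in range(0, coarse, fine):
                        tokens.append(patch[:, fy:fy + fine, fx:fx + fine].flatten())
    return tokens  # variable-length token list, later projected to model dim
```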
- DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
- fuses local information from a convolutional (CNN) branch with global information from a self-attention (ViT) branch (sketch below)
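A hedged sketch of the dual-token idea: a depthwise-conv branch produces local tokens, a self-attention branch produces global tokens, and a linear layer fuses them. Shapes and layer choices are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):  # x: (B, dim, H, W)
        local = self.local(x).flatten(2).transpose(1, 2)   # local tokens (B, HW, D)
        tokens = x.flatten(2).transpose(1, 2)              # (B, HW, D)
        glob, _ = self.attn(tokens, tokens, tokens)        # global tokens (B, HW, D)
        return self.fuse(torch.cat([local, glob], dim=-1))  # fused tokens
```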
- DIFFUSION AS ENCODER
6. QUERYING MODELS - MULTIMODAL
- models:
=LLaVA, Qwen-VL=
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- image understanding
- Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
- reasoning over independent vision modules
- NeVA: NeMo Vision and Language Assistant, informative responses (wiki-like answers)
- Fuyu-8B twitter
- no separate image encoder; interleaves text and image patches at arbitrary image resolutions (sketch after this list)
- understanding diagrams, charts, and graphs
=best=
- Answering UI-based questions, bounding boxes, OCR
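A minimal sketch of the Fuyu design under stated assumptions: there is no vision encoder, just a linear projection of raw image patches into the decoder's token stream ahead of the text. `patchify`, the patch size, and `FuyuLikeInput` are illustrative names, not Adept's implementation.

```python
import torch
import torch.nn as nn

def patchify(img, patch=30):  # patch size here is an assumption
    C, H, W = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)

class FuyuLikeInput(nn.Module):
    def __init__(self, patch_dim, d_model):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)  # the only "vision" component

    def forward(self, text_emb, img):
        img_tokens = self.proj(patchify(img))  # (n_patches, d_model)
        # Interleave: image tokens simply precede the text tokens.
        return torch.cat([img_tokens, text_emb], dim=0)
```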
- OtterHD: A High-Resolution Multi-modality Model
- to interpret high-resolution visual inputs with granular precision
- CogVLM: Visual Expert for Pretrained Language Models
- frozen LLM and image encoder connected by a trainable visual expert module (sketch below)
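A hedged sketch of the visual-expert idea: image tokens are routed through a trainable QKV projection while text tokens keep the frozen LLM projection. Module names are illustrative.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)  # the LLM's own projection
        self.vis_qkv = nn.Linear(dim, 3 * dim)   # trainable visual expert
        for p in self.text_qkv.parameters():
            p.requires_grad = False              # LLM stays frozen

    def forward(self, x, is_image):  # x: (seq, dim), is_image: (seq,) bool mask
        mixed = torch.where(is_image.unsqueeze(-1),
                            self.vis_qkv(x),     # expert path for image tokens
                            self.text_qkv(x))    # frozen path for text tokens
        return mixed.chunk(3, dim=-1)            # q, k, v
```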
- MoE-LLaVA: Mixture-of-Experts for Large Vision-Language Models
- sparse model with a very large number of parameters but constant computational cost (sketch below)
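A minimal sparse-MoE sketch showing why parameters scale with expert count while per-token compute stays roughly constant: each token only runs through its top-k experts. Expert sizes and k are assumptions.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only tokens routed to expert e reach it
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```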
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
- training on a synthetic dataset generated by diffusion models
- self-training scheme on (novel) categories not covered by the detector (sketch below)
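A hedged sketch of an InstaGen-style loop, assuming placeholder `generate_image` and `detector` interfaces: synthesize images for novel categories with a text-to-image model, pseudo-label them with the current detector, and self-train on the confident labels.

```python
def self_train_on_synthetic(novel_classes, generate_image, detector,
                            rounds=3, per_class=100):
    for _ in range(rounds):
        batch = []
        for cls in novel_classes:
            for _ in range(per_class):
                img = generate_image(f"a photo of a {cls}")
                boxes = detector.predict(img, min_score=0.5)  # pseudo-labels
                if boxes:
                    batch.append((img, boxes))
        detector.train_on(batch)  # self-training round on confident pseudo-labels
    return detector
```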
6.1. WITH OTHER VISUAL CAPABILITIES
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
- accepts and can output bounding boxes
- Qwen-Audio: audio-based QA
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- follows image transformation instructions; includes image segmentation
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
- can also use numbered marks
6.1.1. WEB MOCKING
- WebSight: Web Screenshots into HTML Code
6.1.2. GROUNDING
- PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- better localization and visually-situated text understanding
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
- with architectural comparison to previous ones
- query embeddings mapped to visual-patch embeddings
- GLaMM: Pixel Grounding Large Multimodal Model (generates masks for objects)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity
- both boxes and free-form shapes
- GEOMETRY INTERACTIONS
6.2. 2D+ VISION
6.2.1. 3D VISION
- CoDA for 3D object detection: discovers and classifies novel objects without a 2D model
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
- infers scene layout (boxes) from a 3D view of the scene (dynamic, in-game)
6.2.2. VIDEO VISION
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- existing approaches encode images and videos into separate feature spaces
- mixed dataset of images and videos
- Soft Video Understanding
- audio is crucial for overall understanding, helping the LLM generate a summary
7. CLASSIFICATION
- GeneCIS: A Benchmark for General Conditional Image Similarity
- models should adapt to the notion of similarity dynamically
- Vocabulary-free Image Classification
7.1. CAPTIONING CLIPREGION
- SPEECH RECOGNITION STORYTELLING CAPTIONING
- SITTA: A Semantic Image-Text Alignment for Image Captioning
- linear semantic mappings enable image captioning without access to gradient information and with less computation (sketch below)
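A sketch of the linear-mapping idea, assuming paired image features and LM token embeddings are available: a closed-form ridge regression fits a map from the vision encoder's space into the language model's token-embedding space, so no gradients through either model are needed.

```python
import numpy as np

def fit_linear_mapping(img_feats, txt_embs, lam=1e-3):
    """Ridge-regression map W: image space -> LM token-embedding space.

    img_feats: (N, d_img) paired image features
    txt_embs:  (N, d_lm) token embeddings of the matching words
    """
    d = img_feats.shape[1]
    W = np.linalg.solve(
        img_feats.T @ img_feats + lam * np.eye(d),  # (d_img, d_img)
        img_feats.T @ txt_embs,                     # (d_img, d_lm)
    )
    return W  # apply as img_feat @ W, then feed to the frozen LM
```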
- Guiding Image Captioning Models Toward More Specific Captions
- CIC: A framework for Culturally-aware Image Captioning
- extracts cultural visual elements using Visual Question Answering (VQA)
7.1.1. CAPTIONING VIDEO
- Video ReCap: Recursive Captioning of Hour-Long Videos
- video has hierarchical structure spanning different temporal granularities
- exploit the synergy between different video hierarchies
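A minimal sketch of recursive captioning under stated assumptions: caption short clips first, then repeatedly summarize groups of captions at coarser temporal levels until one description of the whole video remains. `caption_clip` and `summarize` are placeholder models, not the paper's.

```python
def recap(clips, caption_clip, summarize, group=8):
    captions = [caption_clip(c) for c in clips]  # finest temporal granularity
    while len(captions) > 1:
        # merge each group of captions into one higher-level description
        captions = [
            summarize(captions[i:i + group])
            for i in range(0, len(captions), group)
        ]
    return captions[0]  # single hour-long-video summary
```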
- LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
- classifying action snippets in an untrimmed video
- memory-and-parameter-efficient backbone
7.2. IMAGE CLUSTERING
- ATLAS
- Text-Guided Image Clustering
- VQA-obtained text representations often outperform image features (sketch below)
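A minimal sketch, assuming placeholder `vqa_answer` and `embed_text` interfaces: answer a guiding question per image with a VQA model, embed the answers, and cluster the text embeddings instead of the image features.

```python
import numpy as np
from sklearn.cluster import KMeans

def text_guided_clusters(images, question, vqa_answer, embed_text, k=10):
    answers = [vqa_answer(img, question) for img in images]  # one text per image
    feats = np.stack([embed_text(a) for a in answers])
    return KMeans(n_clusters=k, n_init=10).fit_predict(feats)
```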
7.2.1. DIFFUSION FEATURES
- DIffusion FeaTures (DIFT): Emergent Correspondence from Image Diffusion
- Your Diffusion Model is Secretly a Zero-Shot Classifier
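A sketch of the zero-shot-classifier trick: noise the image, ask the conditional denoiser to predict that noise under each class prompt, and pick the class with the lowest prediction error. `eps_model` and the linear noise schedule below are stand-ins for the actual model and its trained schedule.

```python
import torch

def add_noise(x, noise, t, T=1000):
    # stand-in linear noise schedule; real models use their trained schedule
    abar = 1.0 - t.float() / T
    return abar.sqrt() * x + (1.0 - abar).sqrt() * noise

@torch.no_grad()
def diffusion_classify(x, class_prompts, eps_model, n_samples=16):
    errs = []
    for prompt in class_prompts:
        err = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, 1000, (1,))  # random timestep
            noise = torch.randn_like(x)
            x_t = add_noise(x, noise, t)
            # better-matching class prompt -> lower noise-prediction error
            err += (eps_model(x_t, t, prompt) - noise).pow(2).mean().item()
        errs.append(err / n_samples)
    return min(range(len(errs)), key=errs.__getitem__)
```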
8. AUDIO VISION
- CLaMP: CLIP-style contrastive pretraining for music
- CLAP (Contrastive Language-Audio Pretraining)
- FLAP: Fast Language-Audio Pre-training
- learns to reconstruct the masked portion of audio tokens
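A toy sketch of the shared contrastive objective behind CLAP/FLAP, assuming precomputed audio and text embeddings: a symmetric InfoNCE loss pulls matched pairs together. (FLAP additionally reconstructs masked audio tokens, not shown here.)

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temp=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temp          # (batch, batch) similarity matrix
    labels = torch.arange(len(a))    # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```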
- Pengi: An Audio Language Model for Audio Tasks
- audio understanding
- MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
- decompose user requests into multiple sub-tasks and invoke corresponding music tools
8.1. WHISPER
- fast Whisper-based translator
- Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
- audio representation is actually not noise-invariant
- audio tagging model on top: <1% extra computation, a single forward pass (sketch below)
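A hedged sketch of the Whisper-AT recipe: keep Whisper frozen, reuse its intermediate encoder features from the same forward pass, and train a tiny tagging head on a learned mix of layers. Names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TaggingHead(nn.Module):
    def __init__(self, n_layers, dim, n_events):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # learned layer mix
        self.clf = nn.Linear(dim, n_events)

    def forward(self, layer_feats):  # list of (time, dim) from frozen Whisper
        stack = torch.stack(layer_feats)                  # (layers, time, dim)
        w = self.layer_weights.softmax(0)[:, None, None]
        pooled = (w * stack).sum(0).mean(0)               # mix layers, pool over time
        return self.clf(pooled)                           # audio event logits
```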
- Distil-Whisper: Distilled Whisper 6x faster, 50% smaller
- Whisper Large-v3
- word-level timestamps w/ whisper (usage sketch below)
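A usage sketch with the openai-whisper package; `word_timestamps=True` is a real flag in recent releases, but the model size and file name here are placeholders.

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("talk.wav", word_timestamps=True)
for seg in result["segments"]:
    for w in seg["words"]:  # each word carries its own start/end time
        print(f'{w["start"]:6.2f}s {w["end"]:6.2f}s {w["word"]}')
```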
- faster-whisper now with speaker diarisation
- SeamlessM4T: Speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
- Inverted Whisper = WhisperSpeech (everything open-sourced)
=best=