Computer Vision
Table of Contents
- UNDERSTANDING
- VISSL: computer VIsion library for Self-Supervised Learning
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
- FastV: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
- plug-and-play inference acceleration: prunes redundant visual tokens after an early layer (sketch below)
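A minimal sketch of the FastV idea, assuming attention weights from an early layer are available: rank visual tokens by the attention they receive and keep only the top half. Function and argument names are illustrative, not the paper's code.

```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_end, keep_ratio=0.5):
    """After an early layer, keep only the most-attended visual tokens.

    hidden: (batch, seq, dim) hidden states
    attn:   (batch, heads, seq, seq) attention weights from that layer
    vis_start/vis_end: slice of the sequence holding image tokens
    """
    # Average attention each visual token receives, over heads and queries.
    scores = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_end]  # (batch, n_vis)
    n_keep = int(scores.shape[1] * keep_ratio)
    keep = scores.topk(n_keep, dim=1).indices.sort(dim=1).values  # preserve order
    vis = hidden[:, vis_start:vis_end, :]
    kept = torch.gather(vis, 1, keep.unsqueeze(-1).expand(-1, -1, vis.shape[-1]))
    # Re-assemble: text before, pruned visual tokens, text after.
    return torch.cat([hidden[:, :vis_start], kept, hidden[:, vis_end:]], dim=1)
```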
1. CUSTOMIZATION
2. MAP AS OUTPUT
3. LEARNING FROM VIDEO
4. DOCUMENTS
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan)
- taking into account both textual semantics and spatial layout
- learns to infill text segments
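A hedged sketch of layout-aware attention in the DocLLM spirit: bounding-box embeddings get their own projections, and their scores are added to the text scores. Layer names and the (x1, y1, x2, y2) box format are assumptions.

```python
import torch
import torch.nn as nn

class LayoutAwareScores(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_t, self.k_t = nn.Linear(dim, dim), nn.Linear(dim, dim)  # text path
        self.q_s, self.k_s = nn.Linear(4, dim), nn.Linear(4, dim)      # box path

    def forward(self, text_emb, boxes):
        # Disentangled attention: semantic score + spatial score.
        score_text = self.q_t(text_emb) @ self.k_t(text_emb).transpose(-1, -2)
        score_box = self.q_s(boxes) @ self.k_s(boxes).transpose(-1, -2)
        return (score_text + score_box) / text_emb.shape[-1] ** 0.5
```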
5. TOKENIZER
- MiniGPT-4: image tokenizer with an image detector https://github.com/Vision-CAIR/MiniGPT-4
- MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
- dynamic tokenizer for ViTs, where the scale at which an image region is processed varies with its semantic detail (sketch below)
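A rough sketch, assuming a tiny learned gate that scores each coarse patch: semantically rich regions are split into finer patches, flat regions stay coarse. The gate interface and the 32/16 patch sizes are illustrative, not the paper's implementation.

```python
def mixed_scale_tokens(img, gate, coarse=32, fine=16):
    # img: (C, H, W) tensor; gate: module scoring a flattened patch in [0, 1].
    C, H, W = img.shape
    tokens = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            patch = img[:, y:y + coarse, x:x + coarse]
            if gate(patch.flatten()).item() < 0.5:
                tokens.append(patch.flatten())  # flat region: one coarse token
            else:
                # semantically rich region: split into finer tokens
                for fy in range(0, coarse, fine):
                    for fx in range(0, coarse, fine):
                        tokens.append(patch[:, fy:fy + fine, fx:fx + fine].flatten())
    return tokens  # variable-length token list, later projected to model dim
```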
- DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion
- fuses local information from a convolutional (CNN) branch with global information from a self-attention (ViT) branch (sketch below)
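A hedged sketch of the dual-token idea: a depthwise-conv branch produces local tokens, a self-attention branch produces global tokens, and a linear layer fuses them. Shapes and layer choices are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class DualTokenBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):  # x: (B, dim, H, W)
        local = self.local(x).flatten(2).transpose(1, 2)   # local tokens (B, HW, D)
        tokens = x.flatten(2).transpose(1, 2)              # (B, HW, D)
        glob, _ = self.attn(tokens, tokens, tokens)        # global tokens (B, HW, D)
        return self.fuse(torch.cat([local, glob], dim=-1))  # fused tokens
```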
- DIFFUSION AS ENCODER
6. QUERYING MODELS - MULTIMODAL
- models:
=LLaVA, Qwen-VL=
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- image understanding
- Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
- reasoning over independent vision modules
- NeVA: NeMo Vision and Language Assistant, informative responses (wiki-like answers)
- Fuyu-8B twitter
- no separate image encoder; interleaves text and image patches at arbitrary image resolutions (sketch after this list)
- understanding diagrams, charts, and graphs
=best=
- Answering UI-based questions, bounding boxes, OCR
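A minimal sketch of the Fuyu design under stated assumptions: there is no vision encoder, just a linear projection of raw image patches into the decoder's token stream ahead of the text. `patchify`, the patch size, and `FuyuLikeInput` are illustrative names, not Adept's implementation.

```python
import torch
import torch.nn as nn

def patchify(img, patch=30):  # patch size here is an assumption
    C, H, W = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)

class FuyuLikeInput(nn.Module):
    def __init__(self, patch_dim, d_model):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)  # the only "vision" component

    def forward(self, text_emb, img):
        img_tokens = self.proj(patchify(img))  # (n_patches, d_model)
        # Interleave: image tokens simply precede the text tokens.
        return torch.cat([img_tokens, text_emb], dim=0)
```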
- OtterHD: A High-Resolution Multi-modality Model
- to interpret high-resolution visual inputs with granular precision
- CogVLM: Visual Expert for Pretrained Language Models
- frozen LLM and image encoder connected by a trainable visual expert module (sketch below)
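A hedged sketch of the visual-expert idea: image tokens are routed through a trainable QKV projection while text tokens keep the frozen LLM projection. Module names are illustrative.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)  # the LLM's own projection
        self.vis_qkv = nn.Linear(dim, 3 * dim)   # trainable visual expert
        for p in self.text_qkv.parameters():
            p.requires_grad = False              # LLM stays frozen

    def forward(self, x, is_image):  # x: (seq, dim), is_image: (seq,) bool mask
        mixed = torch.where(is_image.unsqueeze(-1),
                            self.vis_qkv(x),     # expert path for image tokens
                            self.text_qkv(x))    # frozen path for text tokens
        return mixed.chunk(3, dim=-1)            # q, k, v
```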
- MoE-LLaVA: Mixture-of-Experts for Large Vision-Language Models
- sparse model with a very large number of parameters but constant computational cost (sketch below)
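A minimal sparse-MoE sketch showing why parameters scale with expert count while per-token compute stays roughly constant: each token only runs through its top-k experts. Expert sizes and k are assumptions.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only tokens routed to expert e reach it
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```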
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
- InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
- training on a synthetic dataset generated by diffusion models
- self-training scheme on (novel) categories not covered by the detector (sketch below)
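A hedged sketch of an InstaGen-style loop, assuming placeholder `generate_image` and `detector` interfaces: synthesize images for novel categories with a text-to-image model, pseudo-label them with the current detector, and self-train on the confident labels.

```python
def self_train_on_synthetic(novel_classes, generate_image, detector,
                            rounds=3, per_class=100):
    for _ in range(rounds):
        batch = []
        for cls in novel_classes:
            for _ in range(per_class):
                img = generate_image(f"a photo of a {cls}")
                boxes = detector.predict(img, min_score=0.5)  # pseudo-labels
                if boxes:
                    batch.append((img, boxes))
        detector.train_on(batch)  # self-training round on confident pseudo-labels
    return detector
```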
6.1. WITH OTHER VISUAL CAPABILITIES
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
- accepts and can output bounding boxes
- Qwen-Audio: audio-based QA
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
- follows image transformation instructions; includes image segmentation
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
- can also use numbered marks
6.1.1. WEB MOCKING
- WebSight: Web Screenshots into HTML Code
6.1.2. GROUNDING
- PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- better localization and visually-situated text understanding
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
- with architectural comparison to previous ones
- query embeddings mapped to visual-patch embeddings
- GLaMM: Pixel Grounding Large Multimodal Model (generates masks for objects)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity
- both boxes and free-form shapes
- GEOMETRY INTERACTIONS
6.2. 2D+ VISION
6.2.1. 3D VISION
- CoDA for 3D object detection: discovers and classifies novel objects without a 2D model
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
- infers scene layout (boxes) from a 3D view of the scene (dynamic, in-game)
6.2.2. VIDEO VISION
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
- existing approaches encode images and videos into separate feature spaces
- mixed dataset of images and videos
- Soft Video Understanding
- audio is crucial for overall understanding, helping the LLM generate a summary
7. CLASSIFICATION
- GeneCIS: A Benchmark for General Conditional Image Similarity
- models should adapt to the notion of similarity dynamically
- Vocabulary-free Image Classification
7.1. CAPTIONING CLIPREGION
- SPEECH RECOGNITION STORYTELLING CAPTIONING
- SITTA: A Semantic Image-Text Alignment for Image Captioning
- linear semantic mappings enable image captioning without access to gradient information and with less computation (sketch below)
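A sketch of the linear-mapping idea, assuming paired image features and LM token embeddings are available: a closed-form ridge regression fits a map from the vision encoder's space into the language model's token-embedding space, so no gradients through either model are needed.

```python
import numpy as np

def fit_linear_mapping(img_feats, txt_embs, lam=1e-3):
    """Ridge-regression map W: image space -> LM token-embedding space.

    img_feats: (N, d_img) paired image features
    txt_embs:  (N, d_lm) token embeddings of the matching words
    """
    d = img_feats.shape[1]
    W = np.linalg.solve(
        img_feats.T @ img_feats + lam * np.eye(d),  # (d_img, d_img)
        img_feats.T @ txt_embs,                     # (d_img, d_lm)
    )
    return W  # apply as img_feat @ W, then feed to the frozen LM
```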
- Guiding Image Captioning Models Toward More Specific Captions
- CIC: A framework for Culturally-aware Image Captioning
- extracts cultural visual elements using Visual Question Answering (VQA)
7.1.1. CAPTIONING VIDEO
- Video ReCap: Recursive Captioning of Hour-Long Videos
- video has hierarchical structure spanning different temporal granularities
- exploit the synergy between different video hierarchies
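A minimal sketch of recursive captioning under stated assumptions: caption short clips first, then repeatedly summarize groups of captions at coarser temporal levels until one description of the whole video remains. `caption_clip` and `summarize` are placeholder models, not the paper's.

```python
def recap(clips, caption_clip, summarize, group=8):
    captions = [caption_clip(c) for c in clips]  # finest temporal granularity
    while len(captions) > 1:
        # merge each group of captions into one higher-level description
        captions = [
            summarize(captions[i:i + group])
            for i in range(0, len(captions), group)
        ]
    return captions[0]  # single hour-long-video summary
```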
- LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
- classifying action snippets in an untrimmed video
- memory-and-parameter-efficient backbone
7.2. IMAGE CLUSTERING
- ATLAS
- Text-Guided Image Clustering
- VQA-obtained text representations often outperform image features (sketch below)
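A minimal sketch, assuming placeholder `vqa_answer` and `embed_text` interfaces: answer a guiding question per image with a VQA model, embed the answers, and cluster the text embeddings instead of the image features.

```python
import numpy as np
from sklearn.cluster import KMeans

def text_guided_clusters(images, question, vqa_answer, embed_text, k=10):
    answers = [vqa_answer(img, question) for img in images]  # one text per image
    feats = np.stack([embed_text(a) for a in answers])
    return KMeans(n_clusters=k, n_init=10).fit_predict(feats)
```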
7.2.1. DIFFUSION FEATURES
- DIffusion FeaTures (DIFT): Emergent Correspondence from Image Diffusion
- Your Diffusion Model is Secretly a Zero-Shot Classifier
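A sketch of the zero-shot-classifier trick: noise the image, ask the conditional denoiser to predict that noise under each class prompt, and pick the class with the lowest prediction error. `eps_model` and the linear noise schedule below are stand-ins for the actual model and its trained schedule.

```python
import torch

def add_noise(x, noise, t, T=1000):
    # stand-in linear noise schedule; real models use their trained schedule
    abar = 1.0 - t.float() / T
    return abar.sqrt() * x + (1.0 - abar).sqrt() * noise

@torch.no_grad()
def diffusion_classify(x, class_prompts, eps_model, n_samples=16):
    errs = []
    for prompt in class_prompts:
        err = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, 1000, (1,))  # random timestep
            noise = torch.randn_like(x)
            x_t = add_noise(x, noise, t)
            # better-matching class prompt -> lower noise-prediction error
            err += (eps_model(x_t, t, prompt) - noise).pow(2).mean().item()
        errs.append(err / n_samples)
    return min(range(len(errs)), key=errs.__getitem__)
```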
8. AUDIO VISION
- CLaMP: CLIP-style contrastive pretraining for music
- CLAP (Contrastive Language-Audio Pretraining)
- FLAP: Fast Language-Audio Pre-training
- learns to reconstruct the masked portion of audio tokens
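A toy sketch of the shared contrastive objective behind CLAP/FLAP, assuming precomputed audio and text embeddings: a symmetric InfoNCE loss pulls matched pairs together. (FLAP additionally reconstructs masked audio tokens, not shown here.)

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temp=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temp          # (batch, batch) similarity matrix
    labels = torch.arange(len(a))    # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```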
- Pengi: An Audio Language Model for Audio Tasks
- audio understanding
- MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
- decompose user requests into multiple sub-tasks and invoke corresponding music tools
8.1. WHISPER
- fast Whisper-based translator
- Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
- audio representation is actually not noise-invariant
- audio tagging model on top: <1% extra computation, a single forward pass (sketch below)
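A hedged sketch of the Whisper-AT recipe: keep Whisper frozen, reuse its intermediate encoder features from the same forward pass, and train a tiny tagging head on a learned mix of layers. Names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TaggingHead(nn.Module):
    def __init__(self, n_layers, dim, n_events):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # learned layer mix
        self.clf = nn.Linear(dim, n_events)

    def forward(self, layer_feats):  # list of (time, dim) from frozen Whisper
        stack = torch.stack(layer_feats)                  # (layers, time, dim)
        w = self.layer_weights.softmax(0)[:, None, None]
        pooled = (w * stack).sum(0).mean(0)               # mix layers, pool over time
        return self.clf(pooled)                           # audio event logits
```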
- Distil-Whisper: Distilled Whisper 6x faster, 50% smaller
- Whisper Large-v3
- word-level timestamps w/ whisper (usage sketch below)
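A usage sketch with the openai-whisper package; `word_timestamps=True` is a real flag in recent releases, but the model size and file name here are placeholders.

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("talk.wav", word_timestamps=True)
for seg in result["segments"]:
    for w in seg["words"]:  # each word carries its own start/end time
        print(f'{w["start"]:6.2f}s {w["end"]:6.2f}s {w["word"]}')
```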
- faster-whisper now with speaker diarisation
- SeamlessM4T: Speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
- Inverted Whisper = WhisperSpeech (everything open-sourced)
=best=