Computer Vision

Table of Contents

1. CUSTOMIZATION

  • Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance (a minimal LoRA sketch follows this list)
  • MyVLM: Personalizing VLMs for User-Specific Queries
    • personalization of VLMs, enabling them to learn and reason over user-provided concepts
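
The LoRA building block the tracker item above relies on is the standard low-rank adapter parameterization. A minimal sketch of that generic technique, not the paper's tracker code:

  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      """Frozen base linear layer plus a trainable low-rank update:
      y = W x + (alpha / r) * B A x. Only A and B receive gradients."""
      def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
          super().__init__()
          self.base = base.requires_grad_(False)   # freeze pretrained weights
          self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
          self.scale = alpha / r

      def forward(self, x):
          return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

  layer = LoRALinear(nn.Linear(768, 768))   # drop-in replacement for a frozen linear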

2. MAP AS OUTPUT

  • PolyMaX: General Dense Prediction with Mask Transformer
    • cluster prediction instead of per-pixel prediction
    • segmentation, depth, and surface normals from a single image
  • DSINE: Rethinking Inductive Biases for Surface Normal Estimation (single image)
    • per-pixel ray direction as an additional input to the network (sketch after this list)
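
For the DSINE item: a minimal sketch, assuming a standard 3x3 pinhole intrinsics matrix K, of computing the per-pixel ray-direction map that would be concatenated with the RGB input:

  import numpy as np

  def pixel_ray_directions(h: int, w: int, K: np.ndarray) -> np.ndarray:
      """Unit ray direction for every pixel, from pinhole intrinsics K."""
      u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)   # pixel centers
      pix = np.stack([u, v, np.ones_like(u)], axis=-1)             # homogeneous coords, (h, w, 3)
      rays = pix @ np.linalg.inv(K).T                              # back-project to camera space
      return rays / np.linalg.norm(rays, axis=-1, keepdims=True)   # (h, w, 3) unit vectors

  K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])      # example intrinsics
  rays = pixel_ray_directions(480, 640, K)                         # concat with RGB -> 6-channel input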

3. LEARNING FROM VIDEO

  • V-JEPA: teaching models to understand and model the physical world by watching videos
    • learns to predict masked parts of the video, i.e. to inpaint (a generic masking sketch follows this list)
  • World Model on Million-Length Video And Language With RingAttention
    • gradually increase context size from 4K to 1M tokens
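
For the V-JEPA item: below is a generic random-masking sketch over video patch tokens. V-JEPA itself uses spatiotemporal tube masks and predicts in latent space, so this only illustrates the masked-prediction setup:

  import torch

  def mask_video_tokens(tokens: torch.Tensor, mask_ratio: float = 0.75):
      """Split patch tokens into a visible set (fed to the encoder) and a
      masked set (whose content the model learns to predict).
      tokens: (batch, num_patches, dim)"""
      b, n, d = tokens.shape
      keep = int(n * (1 - mask_ratio))
      perm = torch.rand(b, n).argsort(dim=1)        # random permutation per sample
      visible_idx, masked_idx = perm[:, :keep], perm[:, keep:]
      visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, d))
      return visible, visible_idx, masked_idx

  visible, vis_idx, mask_idx = mask_video_tokens(torch.randn(2, 196, 768))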

4. DOCUMENTS

  • DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan)
    • takes into account both textual semantics and spatial layout (a generic layout-fusion sketch follows this list)
    • learns to infill text segments
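
A minimal sketch of the layout-fusion idea: give each text token an embedding of its OCR bounding box. DocLLM itself uses disentangled spatial attention rather than this additive fusion, so the module below is only illustrative:

  import torch
  import torch.nn as nn

  class LayoutAwareEmbedding(nn.Module):
      """Token embedding plus a projection of each token's bounding box,
      so the model sees textual semantics and spatial layout together."""
      def __init__(self, vocab_size: int, dim: int):
          super().__init__()
          self.tok = nn.Embedding(vocab_size, dim)
          self.box = nn.Linear(4, dim)        # (x0, y0, x1, y1), normalized to [0, 1]

      def forward(self, ids: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
          return self.tok(ids) + self.box(boxes)

  emb = LayoutAwareEmbedding(32000, 512)
  out = emb(torch.randint(0, 32000, (1, 4)), torch.rand(1, 4, 4))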

5. TOKENIZER

6. QUERYING MODELS - MULTIMODAL

  • models: =LLaVA, Qwen-VL=
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
    • image understanding
  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
    • reasoning over independent vision modules
  • NeVA: NeMo Vision and Language Assistant, informative responses (wiki-like answers)
  • Fuyu-8B (Twitter)
    • has no separate image encoder; interleaves text and images at arbitrary image resolutions
    • understanding diagrams, charts, and graphs =best=
    • Answering UI-based questions, bounding boxes, OCR
    • OtterHD: A High-Resolution Multi-modality Model
      • to interpret high-resolution visual inputs with granular precision
  • CogVLM: Visual Expert for Pretrained Language Models
    • frozen LLM and image encoder connected by a trainable visual expert module
  • MoE-LLaVA: Mixture-of-Experts for Large Vision-Language Models
    • sparse model with an outrageous number of parameters but a constant computational cost (top-k routing sketch after this list)
  • InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
  • InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
    • training on a synthetic dataset generated by diffusion models
    • self-training scheme on (novel) categories not covered by the detector
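
MoE-LLaVA's "many parameters, constant compute" property comes from sparse routing: each token is processed by only k experts. A minimal top-k routing sketch of the generic mechanism, not the paper's exact architecture:

  import torch
  import torch.nn as nn

  class TopKMoE(nn.Module):
      """Sparse mixture-of-experts layer: parameter count grows with the
      number of experts, but each token only pays for k expert forwards."""
      def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
          super().__init__()
          self.router = nn.Linear(dim, num_experts)
          self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
          self.k = k

      def forward(self, x):                              # x: (tokens, dim)
          weights, idx = self.router(x).topk(self.k, dim=-1)
          weights = weights.softmax(dim=-1)
          out = torch.zeros_like(x)
          for slot in range(self.k):                     # dispatch tokens to chosen experts
              for e, expert in enumerate(self.experts):
                  sel = idx[:, slot] == e
                  if sel.any():
                      out[sel] += weights[sel, slot, None] * expert(x[sel])
          return out

  moe = TopKMoE(64)
  y = moe(torch.randn(10, 64))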

6.1. WITH OTHER VISUAL INPUTS AND OUTPUTS

  • Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
    • accepts and may output bounding boxes (box-serialization sketch after this list)
    • Qwen-Audio: audio-based QA
  • Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
    • follows image transformation instructions; supports image segmentation
  • Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
    • can also use numbered marks
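
For the Qwen-VL box note: boxes travel as plain text in the token stream. A sketch of that serialization, assuming Qwen-VL's published convention of integer coordinates normalized to a 0-1000 grid inside <box> tags:

  def box_to_text(x0: float, y0: float, x1: float, y1: float,
                  img_w: int, img_h: int) -> str:
      """Serialize a pixel-space box as text the LLM can read and emit:
      integer coordinates on a 0-1000 normalized grid, Qwen-VL style."""
      nx0, ny0 = round(1000 * x0 / img_w), round(1000 * y0 / img_h)
      nx1, ny1 = round(1000 * x1 / img_w), round(1000 * y1 / img_h)
      return f"<box>({nx0},{ny0}),({nx1},{ny1})</box>"

  print(box_to_text(64, 48, 320, 240, 640, 480))   # <box>(100,100),(500,500)</box>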

6.1.1. WEB MOCKING

  • WebSight: Web Screenshots into HTML Code

6.1.2. GROUNDING

  • PaLI-3 Vision Language Models: Smaller, Faster, Stronger
    • better localization and visually-situated text understanding
  • BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
    • with an architectural comparison to previous models
    • query embeddings mapped to visual-patch embeddings
  • GLaMM: Pixel Grounding Large Multimodal Model (generates masks for objects)
  • Ferret: Refer and Ground Anything Anywhere at Any Granularity
    • both boxes and free-form shapes
  • GEOMETRY INTERACTIONS

6.2. 2D+ VISION

6.2.1. 3D VISION

  • CoDA for 3D object detection: discovers and classifies novel objects without a 2D model
  • SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
    • gets the layout (boxes) from a 3D view of the scene (dynamic, in-game); a toy parser for such output follows this list
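
SceneScript emits the scene as a structured command language. Here is a toy parser for output in that spirit; the command names and fields below are illustrative, not the paper's exact schema:

  import re

  LINE = re.compile(r"(\w+)\((.*)\)")

  def parse_script(script: str):
      """Yield (command, {field: value}) pairs from SceneScript-like text."""
      for line in script.strip().splitlines():
          cmd, args = LINE.match(line).groups()
          fields = dict(kv.split("=") for kv in args.split(", "))
          yield cmd, {k: float(v) for k, v in fields.items()}

  demo = ("make_wall(x0=0, y0=0, x1=4, y1=0, height=2.7)\n"
          "make_door(x=2, y=0, width=0.9)")
  print(list(parse_script(demo)))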

6.2.2. VIDEO VISION

  • Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
    • existing approaches encode images and videos into separate feature spaces
    • mixed dataset of images and videos
  • Soft Video Understanding
    • audio is crucial for overall understanding, helping the LLM generate a summary

7. CLASSIFICATION

  • GeneCIS: A Benchmark for General Conditional Image Similarity
    • models should adapt to the notion of similarity dynamically
  • Vocabulary-free Image Classification

7.1. CAPTIONING   CLIPREGION

  • SPEECH RECOGNITION STORYTELLING CAPTIONING
  • SITTA: A Semantic Image-Text Alignment for Image Captioning
    • linear semantic mappings enable image captioning without access to gradient information; less computation (least-squares sketch after this list)
  • Guiding Image Captioning Models Toward More Specific Captions
  • CIC: A framework for Culturally-aware Image Captioning
    • extracts cultural visual elements from Visual Question Answering (VQA)
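
The appeal of SITTA-style linear mappings is that they can be fit in closed form from paired embeddings, with no backprop through either model. A least-squares sketch of that idea; the paper's exact objective may differ:

  import numpy as np

  def fit_semantic_mapping(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
      """Fit W minimizing ||img_emb @ W - txt_emb||^2 in closed form:
      maps image-encoder space into the LLM's token-embedding space
      without any gradient access to either model."""
      W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)
      return W

  # toy shapes: 1000 paired samples, image dim 512 -> LLM dim 768
  W = fit_semantic_mapping(np.random.randn(1000, 512), np.random.randn(1000, 768))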

7.1.1. CAPTIONING VIDEO

  • Video ReCap: Recursive Captioning of Hour-Long Videos
    • video has hierarchical structure spanning different temporal granularities
    • exploits the synergy between the different levels of the video hierarchy (recursion sketch after this list)
  • LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization
    • classifying action snippets in an untrimmed video
    • memory- and parameter-efficient backbone
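
Video ReCap's hierarchy can be pictured as recursive summarization: caption short clips, then summarize groups of captions, up to the whole hour. A toy sketch, where the summarize callable stands in for a hypothetical LLM call:

  def recap(captions, summarize, chunk: int = 8):
      """Recursively collapse clip captions into higher-level descriptions
      until a single caption for the whole video remains."""
      while len(captions) > 1:
          captions = [summarize(captions[i:i + chunk])
                      for i in range(0, len(captions), chunk)]
      return captions[0]

  # toy usage: a "summarizer" that just joins text; in practice, an LLM call
  print(recap([f"clip {i}" for i in range(20)], lambda xs: " / ".join(xs)))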

7.1.2. REGIONS

  • Segment and Caption Anything
    • generate regional captions
  • RegionGPT: Towards Region Understanding Vision Language Model
    • region-level captions, description, reasoning, object classification, and referring expressions comprehension

7.2. IMAGE CLUSTERING

  • ATLAS
  • Text-Guided Image Clustering
    • VQA-derived text representations often outperform image features (clustering sketch after this list)
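
A minimal sketch of the text-guided clustering idea: describe each image with VQA answers, then cluster the text. TF-IDF plus k-means here; the paper's pipeline uses stronger text representations:

  from sklearn.cluster import KMeans
  from sklearn.feature_extraction.text import TfidfVectorizer

  # hypothetical VQA answers, one string per image
  answers = ["a red fox in the snow", "a kitchen with marble counters",
             "a fox running through a field", "a modern kitchen interior"]

  X = TfidfVectorizer().fit_transform(answers)
  labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
  print(labels)   # images grouped by their textual descriptions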

7.2.1. DIFFUSION FEATURES

8. AUDIO VISION

  • CLaMP: CLIP for music
    • CLAP (Contrastive Language-Audio Pretraining; the contrastive objective is sketched after this list)
      • FLAP: Fast Language-Audio Pre-training
        • learns to reconstruct the masked portion of audio tokens
  • Pengi: An Audio Language Model for Audio Tasks
    • audio understanding
  • MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
    • decompose user requests into multiple sub-tasks and invoke corresponding music tools
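
The shared backbone of CLaMP/CLAP-style models is a CLIP-like symmetric contrastive objective over paired audio and text embeddings. A minimal sketch of that generic loss:

  import torch
  import torch.nn.functional as F

  def clip_style_loss(audio_emb, text_emb, temperature: float = 0.07):
      """Symmetric InfoNCE: matching audio/text pairs along the diagonal
      are pulled together, all other pairs in the batch pushed apart."""
      a = F.normalize(audio_emb, dim=-1)
      t = F.normalize(text_emb, dim=-1)
      logits = a @ t.T / temperature
      labels = torch.arange(len(a), device=a.device)
      return (F.cross_entropy(logits, labels) +
              F.cross_entropy(logits.T, labels)) / 2

  loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))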

8.1. WHISPER

  • fast Whisper translator
    • Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
      • audio representation is actually not noise-invariant
        • an audio tagging model on top; <1% extra computation, a single forward pass
      • Distil-Whisper: distilled Whisper, 6x faster, 50% smaller
      • Whisper Large-v3
    • word-level timestamps with Whisper (usage sketch after this list)
    • fast Whisper, now with speaker diarisation
  • SeamlessM4T: Speech-to-speech, speech-to-text, text-to-speech, text-to-text translation, and automatic speech recognition
  • Inverted Whisper = WhisperSpeech (everything open-sourced) =best=
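
For the word-level timestamps note above: the openai-whisper package exposes them directly via word_timestamps=True. The sketch assumes that package and a local speech.mp3:

  import whisper

  model = whisper.load_model("large-v3")
  result = model.transcribe("speech.mp3", word_timestamps=True)
  for segment in result["segments"]:
      for w in segment["words"]:
          print(f'{w["start"]:7.2f} - {w["end"]:7.2f}  {w["word"]}')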

Author: Tekakutli

Created: 2024-04-08 Mon 12:57