CLIP

Table of Contents

1. NOT TO WORDS

  • CLIP Can Understand Depth
    • extends CLIP to depth estimation with non-human-language (learned) prompt embeddings
  • CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
    • quick, text-driven adjustment of image tone and style

2. TRAIN CLIP

  • SpLiCE: decomposes CLIP embeddings into sparse combinations of human-interpretable, semantic concepts
    • can be used for concept bottleneck models and spurious correlation detection (sketch after this list)
  • Any-Shift Prompting for Generalization over Distributions
    • encodes information about the training and test distributions and the relationships between them
      • guides the generalization of the CLIP image-language model from the training distribution to any test distribution
      • faster at test time: the test-specific prompt comes from a single feedforward pass, with no per-sample fine-tuning
  • CoN-CLIP: Learn “No” to Say “Yes” Better: Improving Vision-Language Models via Negations
    • highlights limitations of popular VLMs such as CLIP at understanding the implications of negations
    • showcases emergent compositional understanding of objects, relations, and attributes in text
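
A minimal sketch of the SpLiCE idea above: approximate a normalized CLIP image embedding as a sparse, nonnegative combination of concept text embeddings. The Lasso solver, its alpha, and the random placeholder embeddings are assumptions, not the paper's exact setup.

  # SpLiCE-style decomposition (sketch): express a CLIP image embedding as a
  # sparse, nonnegative mix of concept text embeddings. Solver and placeholder
  # data are assumptions, not the paper's exact setup.
  import numpy as np
  from sklearn.linear_model import Lasso

  def sparse_concept_decomposition(image_emb, concept_embs, alpha=0.05):
      # concept_embs: (n_concepts, d), one normalized text embedding per concept
      # solve image_emb ~= concept_embs.T @ w with w sparse and nonnegative
      lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=5000)
      lasso.fit(concept_embs.T, image_emb)
      return lasso.coef_                            # one (mostly zero) weight per concept

  # usage with random placeholders standing in for real CLIP embeddings
  d, n_concepts = 512, 1000
  concepts = np.random.randn(n_concepts, d)
  concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
  image = np.random.randn(d)
  image /= np.linalg.norm(image)
  w = sparse_concept_decomposition(image, concepts)
  print(np.argsort(-w)[:5])                         # indices of the most active concepts

The nonzero weights name the concepts an embedding is "made of", which is what makes the decomposition usable as a concept bottleneck or for spotting spurious correlations.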

3. 3D+ CLIP

3.1. LIFT3D

  • Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
    • the method is trained to predict unseen views in the feature spaces of 2D vision models (e.g., DINO, CLIP)
      • but then generalizes zero-shot to novel vision operators and tasks, such as style transfer, super-resolution, open-vocabulary segmentation, and image colorization

4. VIDEO CLIP

  • EZ-CLIP: Efficient Zeroshot Video Action Recognition
    • no fundamental alterations to CLIP; guides visual prompts to focus on capturing motion
  • VideoCLIP
    • computes video-text similarity and performs vector retrieval
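
A sketch of the retrieval recipe described above, not the actual VideoCLIP code: embed sampled frames with a stock CLIP, mean-pool them into one video vector, and rank videos by cosine similarity against a text query. It assumes the OpenAI clip package and that frames is a list of PIL images.

  # Frame-pooled video-text retrieval with a stock CLIP (sketch, not VideoCLIP's
  # implementation). Assumes the OpenAI `clip` package and PIL frames.
  import torch
  import clip

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  @torch.no_grad()
  def embed_video(frames):
      batch = torch.stack([preprocess(f) for f in frames]).to(device)
      feats = model.encode_image(batch).float()
      feats = feats / feats.norm(dim=-1, keepdim=True)
      video = feats.mean(dim=0)                     # mean-pool frames into one vector
      return video / video.norm()

  @torch.no_grad()
  def rank_videos(query, video_embs):
      # video_embs: stacked outputs of embed_video, shape (num_videos, d)
      tokens = clip.tokenize([query]).to(device)
      text = model.encode_text(tokens).float().squeeze(0)
      text = text / text.norm()
      sims = video_embs @ text                      # cosine similarity per video
      return sims.argsort(descending=True)          # best matches first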

5. PRIOR ALTERNATIVES

  • SEECODERS
  • better CLIP, nearest neighbor
  • CLIPPO: Image-and-Language Understanding from Pixels Only (no strings; text is rendered as images)
    • ByT5: token-free, no tokenizer
      • character-aware models can spell, like ByT5
        • maybe hands-aware models?
  • Hyperbolic Contrastive Learning for Visual Representations beyond Objects
  • Retrieval-Enhanced Contrastive Vision-Text Models
    • augments a frozen CLIP with knowledge retrieved from an external memory (first sketch after this list)
  • Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
    • a single frozen convolutional CLIP provides both open-vocabulary classification and strong mask generation
  • ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
    • learns pseudo-labels, then iteratively refines them
    • source-free domain adaptation; mitigates misaligned visual-text embeddings
  • SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
    • merges CLIP and SAM (good at identifying object positions) into a single model
  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
    • an auxiliary alpha channel suggests attentive regions, giving control over the emphasis (second sketch below)
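
First sketch, for the retrieval-enhanced entry above: look up nearest neighbors of a frozen CLIP embedding in an external memory and fold them back into the query. The paper trains a light fusion module for this step; the fixed weighted average, k, and mix below are placeholders that only show the data flow.

  # Retrieval-enhanced embedding, simplified: fetch top-k neighbors from an
  # external memory of normalized embeddings and fuse them with the query.
  # The real method learns the fusion; the fixed average here is a placeholder.
  import numpy as np

  def retrieve_and_fuse(query_emb, memory_embs, k=5, mix=0.5):
      sims = memory_embs @ query_emb                # cosine similarity to each memory item
      idx = np.argpartition(-sims, k)[:k]           # indices of the top-k neighbors
      retrieved = memory_embs[idx].mean(axis=0)     # aggregate the retrieved knowledge
      fused = (1 - mix) * query_emb + mix * retrieved
      return fused / np.linalg.norm(fused)          # renormalize to the unit sphere

Second sketch, for Alpha-CLIP: one way to graft an alpha channel onto a ViT patch embedding is a parallel, zero-initialized conv over the alpha mask whose output is added to the RGB patch embedding, so a blank mask reproduces vanilla CLIP behavior. This mirrors the paper's idea but is not the authors' code; shapes and names are illustrative.

  # Alpha-aware patch embedding in the spirit of Alpha-CLIP (illustrative only):
  # a zero-initialized conv consumes the extra alpha channel, so the module
  # starts out identical to the original RGB patch embedding.
  import torch
  import torch.nn as nn

  class AlphaPatchEmbed(nn.Module):
      def __init__(self, rgb_conv: nn.Conv2d):
          super().__init__()
          self.rgb_conv = rgb_conv                  # pretrained ViT stem conv
          self.alpha_conv = nn.Conv2d(
              1, rgb_conv.out_channels,
              kernel_size=rgb_conv.kernel_size,
              stride=rgb_conv.stride,
              padding=rgb_conv.padding,
              bias=False,
          )
          nn.init.zeros_(self.alpha_conv.weight)    # start as a no-op

      def forward(self, rgba):                      # rgba: (B, 4, H, W)
          rgb, alpha = rgba[:, :3], rgba[:, 3:]
          return self.rgb_conv(rgb) + self.alpha_conv(alpha)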

5.1. TEXT MANIPULATION

  • Improving CLIP Training with Language Rewrites
    • rewrites the text descriptions associated with each image using an LLM, then trains on a mix of original and rewritten captions (sketch after this list)
  • Language models are weak learners
    • achieves better-than-random performance; can serve as the weak-learner / boosting component for other models
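
Data-side sketch of the language-rewrites recipe above: each image keeps its original caption plus LLM paraphrases generated offline, and training samples one of them per example as text augmentation. Class and field names are assumptions; the LLM rewriting step itself is not shown.

  # Sample among the original caption and its pre-generated LLM paraphrases as
  # text augmentation during CLIP training. Names are placeholders; the offline
  # LLM rewriting pass is not shown.
  import random
  from torch.utils.data import Dataset

  class RewrittenCaptionDataset(Dataset):
      def __init__(self, images, captions, rewrites):
          # captions[i]: original caption; rewrites[i]: list of LLM paraphrases of it
          self.images, self.captions, self.rewrites = images, captions, rewrites

      def __len__(self):
          return len(self.images)

      def __getitem__(self, i):
          text = random.choice([self.captions[i]] + self.rewrites[i])
          return self.images[i], text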

5.2. CLIP YET BETTER

  • MetaCLIP: a fully open-source replication of CLIP
  • TiC-CLIP: Continual Training of CLIP Models
    • continues training from the last checkpoint while replaying old data; reduces compute by ~2.5× vs. training from scratch (sketch after this list)
  • ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
    • performance on par with bigger models
    • distills CLIP knowledge into the prior model
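
A sketch of the replay loop behind the TiC-CLIP entry above: resume from the latest checkpoint and mix a buffer of old data into each batch of new data. The loader names, the 1:1 mix, and the clip_loss callable are placeholders; the paper's replay ratios and schedules differ.

  # Replay-based continual training, simplified: each step concatenates a batch
  # of new data with a batch replayed from older data. Loaders, the 1:1 mix, and
  # clip_loss are placeholders for whatever the training stack provides.
  import itertools
  import torch

  def continual_steps(model, optimizer, new_loader, replay_loader, clip_loss, steps):
      replay_iter = itertools.cycle(replay_loader)  # endlessly re-serve old data
      for _, (new_imgs, new_txts) in zip(range(steps), new_loader):
          old_imgs, old_txts = next(replay_iter)
          images = torch.cat([new_imgs, old_imgs])  # mixed batch: new + replayed
          texts = torch.cat([new_txts, old_txts])   # tokenized text tensors
          loss = clip_loss(model, images, texts)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()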

6. CHEAPNESS

  • RECLIP: Resource-efficient CLIP by Training with Small Images
  • CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget
  • AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models (fully unsupervised; sketch below)
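
A simplified sketch of the AutoCLIP idea: instead of uniformly averaging prompt-template embeddings, tune per-image softmax weights over the templates at inference time by ascending a logsumexp over the class scores. The step size, step count, and beta here are arbitrary, and the paper's exact objective and update differ in details.

  # Per-image weighting of prompt templates, AutoCLIP-style but simplified.
  # image_emb: (d,) normalized; text_embs: (T, C, d) normalized, one embedding
  # per (template, class) pair. Hyperparameters are arbitrary placeholders.
  import torch

  def autoclip_logits(image_emb, text_embs, steps=10, lr=0.1, beta=10.0):
      sims = text_embs @ image_emb                  # (T, C) class scores per template
      w = torch.zeros(sims.shape[0], requires_grad=True)  # template weight logits
      for _ in range(steps):
          weights = torch.softmax(w, dim=0)[:, None]
          logits = (weights * sims).sum(dim=0)      # weighted average over templates
          objective = torch.logsumexp(beta * logits, dim=0)  # unsupervised confidence proxy
          (grad,) = torch.autograd.grad(objective, w)
          with torch.no_grad():
              w += lr * grad                        # gradient ascent on the weights
      weights = torch.softmax(w, dim=0)[:, None]
      return (weights * sims).sum(dim=0).detach()   # final per-class logits

With uniform weights this reduces to the usual multi-template zero-shot classifier; the per-image weights are what the "auto-tuning" refers to.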

7. SCALENESS

8. FASTNESS

Author: Tekakutli

Created: 2024-04-08 Mon 12:57