CLIP

Table of Contents

1. NOT TO WORDS

  • CLIP Can Understand Depth
    • extends CLIP to depth estimation with non-human-language (learned) prompt embeddings
  • CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
    • quick, text-driven adjustment of image tone and style

2. TRAIN CLIP

  • SpLiCE: decomposes CLIP embeddings into sparse combinations of human-interpretable, semantic concepts
    • can be used for concept bottleneck models and spurious correlation detection (sketch after this list)
  • Any-Shift Prompting for Generalization over Distributions
    • encodes information about the training and test distributions and the relationships between them
      • guides the generalization of the CLIP image-language model from the training distribution to any test distribution
      • faster at test time: the test-specific prompt comes from a single feedforward pass, with no per-sample fine-tuning
  • CoN-CLIP: Learn “No” to Say “Yes” Better: Improving Vision-Language Models via Negations
    • highlights limitations of popular VLMs such as CLIP at understanding the implications of negations
    • showcases emergent compositional understanding of objects, relations, and attributes in text
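
A minimal sketch of the SpLiCE idea above: approximate a normalized CLIP image embedding as a sparse, nonnegative combination of concept text embeddings. The Lasso solver, its alpha, and the random placeholder embeddings are assumptions, not the paper's exact setup.

  # SpLiCE-style decomposition (sketch): express a CLIP image embedding as a
  # sparse, nonnegative mix of concept text embeddings. Solver and placeholder
  # data are assumptions, not the paper's exact setup.
  import numpy as np
  from sklearn.linear_model import Lasso

  def sparse_concept_decomposition(image_emb, concept_embs, alpha=0.05):
      # concept_embs: (n_concepts, d), one normalized text embedding per concept
      # solve image_emb ~= concept_embs.T @ w with w sparse and nonnegative
      lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=5000)
      lasso.fit(concept_embs.T, image_emb)
      return lasso.coef_                            # one (mostly zero) weight per concept

  # usage with random placeholders standing in for real CLIP embeddings
  d, n_concepts = 512, 1000
  concepts = np.random.randn(n_concepts, d)
  concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
  image = np.random.randn(d)
  image /= np.linalg.norm(image)
  w = sparse_concept_decomposition(image, concepts)
  print(np.argsort(-w)[:5])                         # indices of the most active concepts

The nonzero weights name the concepts an embedding is "made of", which is what makes the decomposition usable as a concept bottleneck or for spotting spurious correlations.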

3. 3D+ CLIP

3.1. LIFT3D

  • Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
    • the method is trained to predict unseen views in the feature spaces of 2D vision models (e.g., DINO, CLIP)
      • but then generalizes zero-shot to novel vision operators and tasks, such as style transfer, super-resolution, open-vocabulary segmentation, and image colorization

4. VIDEO CLIP

  • EZ-CLIP: Efficient Zeroshot Video Action Recognition
    • no fundamental alterations to CLIP; guides visual prompts to focus on capturing motion
  • VideoCLIP
    • computes video-text similarity and performs vector retrieval
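
A sketch of the retrieval recipe described above, not the actual VideoCLIP code: embed sampled frames with a stock CLIP, mean-pool them into one video vector, and rank videos by cosine similarity against a text query. It assumes the OpenAI clip package and that frames is a list of PIL images.

  # Frame-pooled video-text retrieval with a stock CLIP (sketch, not VideoCLIP's
  # implementation). Assumes the OpenAI `clip` package and PIL frames.
  import torch
  import clip

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  @torch.no_grad()
  def embed_video(frames):
      batch = torch.stack([preprocess(f) for f in frames]).to(device)
      feats = model.encode_image(batch).float()
      feats = feats / feats.norm(dim=-1, keepdim=True)
      video = feats.mean(dim=0)                     # mean-pool frames into one vector
      return video / video.norm()

  @torch.no_grad()
  def rank_videos(query, video_embs):
      # video_embs: stacked outputs of embed_video, shape (num_videos, d)
      tokens = clip.tokenize([query]).to(device)
      text = model.encode_text(tokens).float().squeeze(0)
      text = text / text.norm()
      sims = video_embs @ text                      # cosine similarity per video
      return sims.argsort(descending=True)          # best matches first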

5. PRIOR ALTERNATIVES

  • SEECODERS
  • better CLIP, nearest neighbor
  • CLIPPO: Image-and-Language Understanding from Pixels Only (no strings; text is rendered as images)
    • ByT5: token-free, no tokenizer
      • character-aware models can spell, like ByT5
        • maybe hands-aware models?
  • Hyperbolic Contrastive Learning for Visual Representations beyond Objects
  • Retrieval-Enhanced Contrastive Vision-Text Models
    • augments a frozen CLIP with knowledge retrieved from an external memory (first sketch after this list)
  • Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
    • a single frozen convolutional CLIP provides both open-vocabulary classification and strong mask generation
  • ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
    • learns pseudo-labels, then iteratively refines them
    • source-free domain adaptation; mitigates misaligned visual-text embeddings
  • SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
    • merges CLIP and SAM (good at identifying object positions) into a single model
  • Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
    • an auxiliary alpha channel suggests attentive regions, giving control over the emphasis (second sketch below)
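
First sketch, for the retrieval-enhanced entry above: look up nearest neighbors of a frozen CLIP embedding in an external memory and fold them back into the query. The paper trains a light fusion module for this step; the fixed weighted average, k, and mix below are placeholders that only show the data flow.

  # Retrieval-enhanced embedding, simplified: fetch top-k neighbors from an
  # external memory of normalized embeddings and fuse them with the query.
  # The real method learns the fusion; the fixed average here is a placeholder.
  import numpy as np

  def retrieve_and_fuse(query_emb, memory_embs, k=5, mix=0.5):
      sims = memory_embs @ query_emb                # cosine similarity to each memory item
      idx = np.argpartition(-sims, k)[:k]           # indices of the top-k neighbors
      retrieved = memory_embs[idx].mean(axis=0)     # aggregate the retrieved knowledge
      fused = (1 - mix) * query_emb + mix * retrieved
      return fused / np.linalg.norm(fused)          # renormalize to the unit sphere

Second sketch, for Alpha-CLIP: one way to graft an alpha channel onto a ViT patch embedding is a parallel, zero-initialized conv over the alpha mask whose output is added to the RGB patch embedding, so a blank mask reproduces vanilla CLIP behavior. This mirrors the paper's idea but is not the authors' code; shapes and names are illustrative.

  # Alpha-aware patch embedding in the spirit of Alpha-CLIP (illustrative only):
  # a zero-initialized conv consumes the extra alpha channel, so the module
  # starts out identical to the original RGB patch embedding.
  import torch
  import torch.nn as nn

  class AlphaPatchEmbed(nn.Module):
      def __init__(self, rgb_conv: nn.Conv2d):
          super().__init__()
          self.rgb_conv = rgb_conv                  # pretrained ViT stem conv
          self.alpha_conv = nn.Conv2d(
              1, rgb_conv.out_channels,
              kernel_size=rgb_conv.kernel_size,
              stride=rgb_conv.stride,
              padding=rgb_conv.padding,
              bias=False,
          )
          nn.init.zeros_(self.alpha_conv.weight)    # start as a no-op

      def forward(self, rgba):                      # rgba: (B, 4, H, W)
          rgb, alpha = rgba[:, :3], rgba[:, 3:]
          return self.rgb_conv(rgb) + self.alpha_conv(alpha)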

5.1. TEXT MANIPULATION

  • Improving CLIP Training with Language Rewrites
    • rewrites the text descriptions associated with each image using an LLM, then trains on a mix of original and rewritten captions (sketch after this list)
  • Language models are weak learners
    • achieves better-than-random performance; can serve as the weak-learner / boosting component for other models
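
Data-side sketch of the language-rewrites recipe above: each image keeps its original caption plus LLM paraphrases generated offline, and training samples one of them per example as text augmentation. Class and field names are assumptions; the LLM rewriting step itself is not shown.

  # Sample among the original caption and its pre-generated LLM paraphrases as
  # text augmentation during CLIP training. Names are placeholders; the offline
  # LLM rewriting pass is not shown.
  import random
  from torch.utils.data import Dataset

  class RewrittenCaptionDataset(Dataset):
      def __init__(self, images, captions, rewrites):
          # captions[i]: original caption; rewrites[i]: list of LLM paraphrases of it
          self.images, self.captions, self.rewrites = images, captions, rewrites

      def __len__(self):
          return len(self.images)

      def __getitem__(self, i):
          text = random.choice([self.captions[i]] + self.rewrites[i])
          return self.images[i], text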

5.2. CLIP YET BETTER

  • MetaCLIP: a fully open-source replication of CLIP
  • TiC-CLIP: Continual Training of CLIP Models
    • continues training from the last checkpoint while replaying old data; reduces compute by ~2.5× vs. training from scratch (sketch after this list)
  • ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
    • performance on par with bigger models
    • distills CLIP knowledge into the prior model
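
A sketch of the replay loop behind the TiC-CLIP entry above: resume from the latest checkpoint and mix a buffer of old data into each batch of new data. The loader names, the 1:1 mix, and the clip_loss callable are placeholders; the paper's replay ratios and schedules differ.

  # Replay-based continual training, simplified: each step concatenates a batch
  # of new data with a batch replayed from older data. Loaders, the 1:1 mix, and
  # clip_loss are placeholders for whatever the training stack provides.
  import itertools
  import torch

  def continual_steps(model, optimizer, new_loader, replay_loader, clip_loss, steps):
      replay_iter = itertools.cycle(replay_loader)  # endlessly re-serve old data
      for _, (new_imgs, new_txts) in zip(range(steps), new_loader):
          old_imgs, old_txts = next(replay_iter)
          images = torch.cat([new_imgs, old_imgs])  # mixed batch: new + replayed
          texts = torch.cat([new_txts, old_txts])   # tokenized text tensors
          loss = clip_loss(model, images, texts)
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()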

6. CHEAPNESS

  • RECLIP: Resource-efficient CLIP by Training with Small Images
  • CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget
  • AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models (fully unsupervised; sketch below)
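
A simplified sketch of the AutoCLIP idea: instead of uniformly averaging prompt-template embeddings, tune per-image softmax weights over the templates at inference time by ascending a logsumexp over the class scores. The step size, step count, and beta here are arbitrary, and the paper's exact objective and update differ in details.

  # Per-image weighting of prompt templates, AutoCLIP-style but simplified.
  # image_emb: (d,) normalized; text_embs: (T, C, d) normalized, one embedding
  # per (template, class) pair. Hyperparameters are arbitrary placeholders.
  import torch

  def autoclip_logits(image_emb, text_embs, steps=10, lr=0.1, beta=10.0):
      sims = text_embs @ image_emb                  # (T, C) class scores per template
      w = torch.zeros(sims.shape[0], requires_grad=True)  # template weight logits
      for _ in range(steps):
          weights = torch.softmax(w, dim=0)[:, None]
          logits = (weights * sims).sum(dim=0)      # weighted average over templates
          objective = torch.logsumexp(beta * logits, dim=0)  # unsupervised confidence proxy
          (grad,) = torch.autograd.grad(objective, w)
          with torch.no_grad():
              w += lr * grad                        # gradient ascent on the weights
      weights = torch.softmax(w, dim=0)[:, None]
      return (weights * sims).sum(dim=0).detach()   # final per-class logits

With uniform weights this reduces to the usual multi-template zero-shot classifier; the per-image weights are what the "auto-tuning" refers to.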

7. SCALENESS

8. FASTNESS

Author: Tekakutli

Created: 2024-04-08 Mon 12:57