clip
Table of Contents
- parent: computervision
- CLIP AS REWARD
- Simo's take on better CLIP papers: https://twitter.com/cloneofsimo/status/1666086583005769728
- COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?
- attributes (adjectives) properly bound to the objects they describe
- What does CLIP know about a red circle? Visual prompt engineering for VLMs
- directs the model's attention to that region while preserving global image information (see the red-circle sketch after this list)
- Parrot Captions Teach CLIP to Spot Text
- captions that merely "parrot" text rendered in the image teach CLIP to spot text instead of visual semantics; urgent to redesign CLIP-like models or their training data to avoid this shortcut
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
- image-level, region-level, and pixel-level captions/tags
- CLIP Model for Images to Textual Prompts Based on Top-k Neighbors
- pairs the CLIP model with a k-nearest-neighbors (KNN) lookup over a caption bank (see the sketch after this list)
- ParaCLIP: Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
- fine-tunes the text encoder on paraphrases in two steps while freezing the image encoder (a training-loop sketch follows this list)
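A minimal sketch of the red-circle visual prompt, assuming the HF transformers CLIP API; the checkpoint name, image path, and region box are placeholders, not the paper's setup:

```python
from PIL import Image, ImageDraw
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def red_circle(image, box, width=4):
    """Draw a red ellipse around box=(x0, y0, x1, y1); the rest of the
    image is untouched, so global context is preserved."""
    marked = image.copy()
    ImageDraw.Draw(marked).ellipse(box, outline=(255, 0, 0), width=width)
    return marked

image = Image.open("scene.jpg").convert("RGB")    # placeholder image
prompted = red_circle(image, (80, 60, 220, 200))  # placeholder region

texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=prompted, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))  # scores now reflect the circled region
```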
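For the top-k-neighbors paper, a rough sketch of the core retrieval step: embed the image and return the k nearest captions from a pre-embedded bank. The toy caption list stands in for a real prompt database:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog on grass", "a city skyline at night", "a bowl of ramen"]
with torch.no_grad():
    toks = processor(text=captions, return_tensors="pt", padding=True)
    text_bank = F.normalize(model.get_text_features(**toks), dim=-1)

def topk_prompts(image, k=2):
    """Return the k captions whose CLIP embeddings are nearest to the image."""
    with torch.no_grad():
        feats = model.get_image_features(**processor(images=image, return_tensors="pt"))
    q = F.normalize(feats, dim=-1)
    scores = (q @ text_bank.T).squeeze(0)   # cosine similarities
    return [captions[i] for i in scores.topk(k).indices.tolist()]
```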
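And a hedged sketch of the ParaCLIP recipe: freeze the image tower and update only the text encoder so paraphrases land close together. The toy paraphrase pair, learning rate, and simple cosine loss are assumptions; the paper's actual two-step objective differs:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

for p in model.vision_model.parameters():   # freeze the image encoder
    p.requires_grad_(False)

optim = torch.optim.AdamW(model.text_model.parameters(), lr=1e-6)

pairs = [("a man riding a horse", "a person on horseback")]  # toy paraphrase pair
for original, paraphrase in pairs:
    toks = processor(text=[original, paraphrase], return_tensors="pt", padding=True)
    emb = F.normalize(model.get_text_features(**toks), dim=-1)
    loss = 1 - (emb[0] * emb[1]).sum()      # pull paraphrase embeddings together
    loss.backward()
    optim.step()
    optim.zero_grad()
```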
1. NOT TO WORDS
2. TRAIN CLIP
- SpLiCE: decomposes CLIP embeddings into sparse combinations of human-interpretable, semantic concepts
- can be used for concept bottleneck models and spurious correlation detection (see the decomposition sketch after this list)
- Any-Shift Prompting for Generalization over Distributions
- encode the distribution information and their relationships
- guide the generalization of the CLIP image-language model from training to any test distribution
- cheap at test time: a single feedforward pass, no test-time fine-tuning
- CoN-CLIP: Learn “No” to Say “Yes” Better: Improving Vision-Language Models via Negations
- highlights the limitations of popular VLMs such as CLIP at understanding the implications of negations
- showcases emergent compositional understanding of objects, relations, and attributes in text (a quick negation probe follows this list)
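A rough sketch of the SpLiCE-style decomposition, using an off-the-shelf Lasso solver as a stand-in for the paper's optimization; the concept vocabulary, alpha, and random vectors are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

def splice_decompose(image_emb, concept_embs, concepts, alpha=0.01):
    """Solve min ||C^T w - x||^2 + alpha*|w|_1 with w >= 0, where rows of
    concept_embs are unit-norm concept text embeddings."""
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=10_000)
    lasso.fit(concept_embs.T, image_emb)    # each embedding dim acts as one sample
    w = lasso.coef_
    return sorted(((concepts[i], float(w[i])) for i in np.flatnonzero(w)),
                  key=lambda t: -t[1])

# toy run with random unit vectors standing in for real CLIP embeddings
rng = np.random.default_rng(0)
concepts = ["dog", "grass", "frisbee", "beach"]
C = rng.normal(size=(4, 512))
C /= np.linalg.norm(C, axis=1, keepdims=True)
x = 0.7 * C[0] + 0.3 * C[1]                 # an "image" built from two concepts
print(splice_decompose(x, C, concepts))     # recovers dog and grass, sparsely
```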
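And a quick probe of the negation weakness CoN-CLIP targets: score an affirmative caption against its negated form on the same image; vanilla CLIP often ranks them surprisingly close. Checkpoint and image path are placeholders:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a dog", "a photo without a dog"]
inputs = processor(text=texts, images=Image.open("dog.jpg").convert("RGB"),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]
print(dict(zip(texts, logits.tolist())))    # the two scores are often close
```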
3. 3D+ CLIP
3.1. LIFT3D
- Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D
- trains to predict unseen views in the feature spaces of 2D vision models (e.g., DINO, CLIP)
- then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open-vocabulary segmentation, and image colorization
4. VIDEO CLIP
5. PRIOR ALTERNATIVES
- SEECODERS
- better clip, nearest neighbor
- https://arxiv.org/pdf/2110.05208.pdf
- nearest neighbor, contrastive
- https://arxiv.org/abs/2111.07783
- CLIPPO: Image-and-Language understanding from pixels only (no strings)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects
- Retrieval-Enhanced Contrastive Vision-Text Models
- keeps CLIP frozen and learns to fuse knowledge retrieved from an external memory (see the fusion sketch after this list)
- Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
- a single frozen convolutional CLIP provides both open-vocabulary classification and strong mask generation
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
- learns pseudo labels, then refines them
- source-free domain adaptation, mitigates misaligned embeddings
- SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
- merges CLIP and SAM (strong at localizing objects) into a single model
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
- auxiliary alpha channel to suggest attentive regions, giving control over the emphasis (see the RGBA input sketch after this list)
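A minimal sketch of the retrieval-enhanced idea: with CLIP frozen, pull the nearest entries from an external embedding memory and fold them into the query. The mean-plus-interpolation fusion is a stand-in for the paper's learned fusion module:

```python
import torch
import torch.nn.functional as F

def retrieve_and_fuse(query_emb, memory, k=4, lam=0.5):
    """query_emb: (d,) unit vector; memory: (n, d) unit rows of stored knowledge."""
    sims = memory @ query_emb                  # cosine similarities to the memory
    retrieved = memory[sims.topk(k).indices].mean(dim=0)
    return F.normalize((1 - lam) * query_emb + lam * retrieved, dim=0)

memory = F.normalize(torch.randn(1000, 512), dim=-1)  # toy external memory
query = F.normalize(torch.randn(512), dim=0)          # e.g. a CLIP image embedding
fused = retrieve_and_fuse(query, memory)
```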
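And a hedged sketch of the Alpha-CLIP input change: widen the ViT patch-embedding convolution from 3 (RGB) to 4 (RGBA) channels, copying pretrained RGB weights and zero-initializing the alpha kernel so an all-zero alpha leaves behavior unchanged. Attribute paths assume the HF transformers CLIP implementation; the real Alpha-CLIP also fine-tunes on region-text pairs:

```python
import torch
import torch.nn as nn
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
old = model.vision_model.embeddings.patch_embedding      # Conv2d(3, d, k, stride=k)
new = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                stride=old.stride, bias=False)
with torch.no_grad():
    new.weight.zero_()                 # alpha kernel starts at zero
    new.weight[:, :3] = old.weight     # RGB behavior is preserved exactly
model.vision_model.embeddings.patch_embedding = new

rgba = torch.randn(1, 4, 224, 224)             # 4th channel marks the focus region
tokens = model.vision_model.embeddings(rgba)   # embeddings now accept RGBA input
```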
5.1. TEXT MANIPULATION
5.2. CLIP YET BETTER
- MetaCLIP: a fully open-source replication of CLIP
- TiC-CLIP: Continual Training of CLIP Models
- continues training from the last checkpoint while replaying old data; cuts compute by 2.5x versus retraining from scratch
- ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
- performance on par with bigger priors
- distills CLIP knowledge into the prior model
6. CHEAPNESS
7. SCALENESS
- federated clip (a FedAvg sketch follows this list)
- FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning
- https://arxiv.org/abs/2302.13485
- EVA-CLIP: Improved Training Techniques for CLIP at Scale
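A hedged sketch of the federated pattern FedCLIP follows: each client trains a small adapter on top of a frozen CLIP, and the server averages adapter weights (FedAvg). The adapter shape and equal client weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_adapter(dim=512):
    """Small trainable head on top of frozen CLIP features."""
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

def fedavg(client_adapters):
    """Average equally weighted client adapter parameters into a global adapter."""
    global_adapter = make_adapter()
    with torch.no_grad():
        for name, param in global_adapter.named_parameters():
            stacked = torch.stack([dict(c.named_parameters())[name]
                                   for c in client_adapters])
            param.copy_(stacked.mean(dim=0))
    return global_adapter

clients = [make_adapter() for _ in range(3)]   # stand-ins for locally trained adapters
server_adapter = fedavg(clients)
```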
8. FASTNESS
- unum: CLIP trained in a day
- An Inverse Scaling Law for CLIP Training: training CLIP cheaply in two days