segmentation
Table of Contents
- parent: computervision
- ArtLine: GAN that extracts line art from an image; maybe usable instead of Canny for ControlNet?
- Emergence of Segmentation with Minimalistic White-Box Transformers
- Boundary Attention: Learning to Find Faint Boundaries at Any Resolution
=best=
- infers boundary structure, including contours, corners, and junctions
- MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
- leverages token similarity so fewer tokens are needed while maintaining multi-scale token interaction
1. TARGETING
- Materialistic: Selecting Similar Materials in Images
- Background Prompting for Improved Object Depth
- learned background prompt, so the network focuses on the object
- LISA: Reasoning Segmentation via Large Language Model
- Language Instructed Segmentation Assistant: speak to it and it segments
- SegGPT: Segmenting Everything In Context
- Painter & SegGPT Series: Vision Foundation Models from BAAI (radiography components, top of box)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
- CLIP can perform zero-shot open-vocabulary segmentation; similarity maps behave like probability scores (patch-level sketch after this list)
- CartoonSegmentation: Instance-guided Cartoon Editing with a Large-scale Dataset (anime fine details)
=best=
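A minimal sketch of the underlying idea only, not the paper's attention rework: score vanilla CLIP patch tokens against text prompts through the visual projection head (a known MaskCLIP-style trick). The checkpoint name, the placeholder image, and applying the projection to patch tokens are all assumptions.

#+begin_src python
# Coarse open-vocabulary segmentation from vanilla CLIP patch tokens (sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; use a real photo
labels = ["a cat", "a sofa", "background"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    vis = model.vision_model(pixel_values=inputs["pixel_values"])
    tokens = vis.last_hidden_state[:, 1:, :]            # drop CLS -> (1, 49, 768)
    tokens = model.vision_model.post_layernorm(tokens)  # normally applied to CLS only
    patch_emb = model.visual_projection(tokens)         # (1, 49, 512)
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )                                                   # (3, 512)

patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
sim = patch_emb[0] @ text_emb.T     # (49, 3) patch/label similarities
seg = sim.argmax(-1).reshape(7, 7)  # 7x7 label map; upsample for display
#+end_src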
1.1. OBJECT DETECTION
- Tracking Any Object Amodally
- infers complete objects from partial visibility; predicts boxes for occluded objects
1.1.1. CUTLER
- CutLER: unsupervised object detection and segmentation
- Detecting censors with deep learning and computer vision; finds their location (to later inpaint over them)
- U2Seg: Unsupervised Universal Image Segmentation (vs CutLER)
=best=
- clustering of pseudo semantic labels (minimal feature-clustering sketch after this list)
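A minimal stand-in for the pseudo-label idea, assuming DINO ViT-S/16 patch features and plain k-means; U2Seg's actual pipeline is more involved, and the cluster count here is arbitrary.

#+begin_src python
# Pseudo semantic labels by clustering self-supervised ViT patch features.
import torch
from sklearn.cluster import KMeans

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

img = torch.rand(1, 3, 224, 224)  # placeholder; use a normalized real image
with torch.no_grad():
    tokens = dino.get_intermediate_layers(img, n=1)[0]  # (1, 1+196, 384)
patches = tokens[0, 1:].numpy()                         # drop CLS -> (196, 384)

k = 5  # assumed number of pseudo-classes
labels = KMeans(n_clusters=k, n_init=10).fit_predict(patches)
pseudo_seg = labels.reshape(14, 14)  # 14x14 patch grid for ViT-S/16 at 224
#+end_src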
1.1.2. CONTROLNET FOR 3D
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
- fine-tunes a 2D diffusion model (ControlNet-style) to perform novel view synthesis from a single image (using an epipolar warp operator)
=best=
- 3D detection and identification of cross-view point correspondences
1.1.3. NERF SEGMENTATION
- NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
- indoor 3D detection(and depth) with images as input; unseen scenes, without requiring per-scene optimization
- EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
- captures scene geometry, appearance, and motion; represents highly dynamic scenes self-sufficiently
- SAGA: Segment Any 3D Gaussians
- multi-granularity segmentation, instantaneous (unlike SA3D)
- GARField: Group Anything with Radiance Fields
- uses SAM 2D masks, coarse-to-fine hierarchy (mask-harvesting sketch after this list)
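Only the mask-harvesting side of a GARField-style setup, using the real segment_anything automatic generator; sorting by area as a crude coarse-to-fine ordering is an assumption here, and the checkpoint path is a placeholder.

#+begin_src python
# Harvest SAM masks and order them coarse-to-fine by area (2D input side only;
# the radiance-field grouping itself is not shown).
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed path
mask_gen = SamAutomaticMaskGenerator(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder HWC RGB frame
masks = mask_gen.generate(image)  # list of dicts with 'segmentation', 'area', ...
coarse_to_fine = sorted(masks, key=lambda m: m["area"], reverse=True)
for m in coarse_to_fine[:3]:
    print(m["area"], m["segmentation"].shape)  # one binary mask per region
#+end_src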
2. SAM
- SAM + DINO: segment anything, image region editing (basic prompting sketched at the end of this list)
- HQ-SAM: Segment Anything in High Quality
- Recognize Anything: A Strong Image Tagging Model
- Segment-Everything-Everywhere-All-At-Once: https://arxiv.org/abs/2304.06718
- inpainting
- Semantic-SAM: Segment and Recognize Anything at Any Granularity
- generate masks at multiple levels
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
- CLIP-like real-world recognition
- Learning to Prompt Segment Anything Models
- optimizing the prompts using few shot data
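Basic point-prompted usage of the reference segment_anything API; the checkpoint path and the placeholder image are assumptions.

#+begin_src python
# Point-prompted segmentation with the reference SAM API.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder HWC RGB image
predictor.set_image(image)                       # one-time image embedding

masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),  # a positive click at image center
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return 3 candidate masks
)
best = masks[np.argmax(scores)]           # pick the highest-scoring mask
#+end_src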
2.1. FASTER
- FastSAM: Fast Segment Anything, ~40 ms per image; available on PyPI
- EfficientSAM: 20x fewer parameters and 20x faster runtime
- SlimSam: 0.1% Data Makes Segment Anything Slim
- 0.9% (5.7M) of the parameters, 0.1% of the data
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model
- uses knowledge distillation to train a lightweight student model (generic loop sketched below)
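A generic mask-logit distillation loop in the spirit of TinySAM/EfficientSAM; teacher, student, and loader are hypothetical stand-ins, not the papers' code.

#+begin_src python
# Distill a frozen teacher's dense mask logits into a small student.
import torch
import torch.nn.functional as F

def distill(teacher, student, loader, steps=1000):
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    for step, images in zip(range(steps), loader):
        with torch.no_grad():
            t_logits = teacher(images)         # (B, 1, H, W) mask logits
        s_logits = student(images)
        loss = F.mse_loss(s_logits, t_logits)  # match teacher's dense predictions
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# demo with toy stand-ins (assumed shapes)
teacher = torch.nn.Conv2d(3, 1, 3, padding=1)
student = torch.nn.Conv2d(3, 1, 1)
loader = [torch.rand(2, 3, 64, 64)] * 4
distill(teacher, student, loader, steps=4)
#+end_src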
2.2. VIDEOS
- segment videos: https://github.com/gaomingqi/Track-Anything
- Tracking Anything with Decoupled Video Segmentation
- Video Instance Matting
- estimates a matte for each instance at each frame of a video sequence
- UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
- unifies four reference-based object segmentation tasks with a single architecture (box or region prompts)
- Lester: rotoscope animation through video object segmentation and tracking
- masks and tracks objects across frames
2.3. USE CASES
- Matting Anything Model (MAM): green-screen-style matting
- TokenCompose: segmentation-supervised training for enhanced prompt following
2.3.1. UNDERSTANDING
- RelateAnything: infers relationships between segmented objects
- Osprey: Pixel Understanding with Visual Instruction Tuning; pixel-level understanding on top of SAM
- click on a cluster of pixels and get a description of it
2.3.2. FOLLOW AREA
- Segment Anything Meets Point Tracking: follows pixels via point tracking / optical flow (propagation sketch after this list)
- DreamTeacher: Pretraining Image Backbones with Deep Generative Models
- distills features from deep generative models into image backbones
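A rough sketch of the SAM-PT idea: pick query points on the first frame, track them with CoTracker (via torch.hub), then re-prompt SAM per frame. The checkpoint path, the placeholder clip, and the fixed query points are assumptions, not the paper's code.

#+begin_src python
# SAM-PT-style video mask propagation: track points, re-prompt SAM per frame.
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed path
predictor = SamPredictor(sam)
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")

video = torch.zeros(1, 8, 3, 384, 512)  # (B, T, C, H, W) placeholder clip in [0, 255]

# Queries: (t, x, y) points that would normally be sampled from a first-frame mask.
queries = torch.tensor([[[0.0, 256.0, 192.0], [0.0, 280.0, 200.0]]])
with torch.no_grad():
    tracks, visibility = cotracker(video, queries=queries)  # (B, T, N, 2), (B, T, N)

for t in range(video.shape[1]):
    frame = video[0, t].permute(1, 2, 0).byte().numpy()  # HWC uint8 frame for SAM
    pts = tracks[0, t].numpy()
    vis = visibility[0, t].numpy().astype(bool)
    if not vis.any():
        continue  # all points occluded in this frame
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(
        point_coords=pts[vis],
        point_labels=np.ones(int(vis.sum())),  # all tracked points are positive
        multimask_output=False,
    )
#+end_src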
3. DIFFUSION SEGMENTATION
- parent: stablediffusion
- SLiMe: Segment Like Me (one-shot segmentation via Stable Diffusion attention)
- Diffusion Models as Masked Autoencoders
- ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
- Diffusion Models for Zero-Shot Open-Vocabulary Segmentation (considers the contextual background)
- MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
- generates synthetic labeled data for rare and novel categories, then uses it to train segmentation
- FIND: Interface Foundation Models’ Embeddings
- segments and correlates regions to prompt tokens
- SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process
=best=
- improves segmentation accuracy by denoising the mask (recovers exceedingly fine details)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
- identifies correspondences between pixels and latent space features
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
- through a diffusion model and an image captioner model, both frozen (a generic diffusion-feature sketch follows this list)
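A generic "diffusion features for segmentation" recipe, close in spirit to EmerDiff/FreeSeg-Diff but not either paper's method: hook a frozen SD U-Net up-block at a mild noise level and k-means the activations into regions. The checkpoint, block index, timestep, and cluster count are all arbitrary choices.

#+begin_src python
# Cluster frozen U-Net up-block features into coarse segments (sketch).
import torch
from diffusers import StableDiffusionPipeline
from sklearn.cluster import KMeans

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")  # assumed

feats = []
hook = pipe.unet.up_blocks[1].register_forward_hook(
    lambda module, args, output: feats.append(output)
)

img = torch.rand(1, 3, 512, 512) * 2 - 1  # placeholder image in [-1, 1]
with torch.no_grad():
    latents = pipe.vae.encode(img).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    t = torch.tensor([100])  # mild noise level (assumed)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    tok = pipe.tokenizer([""], padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt")
    text_emb = pipe.text_encoder(tok.input_ids)[0]  # unconditional embedding
    pipe.unet(noisy, t, encoder_hidden_states=text_emb)
hook.remove()

f = feats[0][0]          # (C, h, w) captured feature map
h, w = f.shape[1:]
labels = KMeans(n_clusters=6, n_init=10).fit_predict(
    f.permute(1, 2, 0).reshape(h * w, -1).numpy()
)
seg = labels.reshape(h, w)  # coarse region map; upsample back to 512x512
#+end_src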
3.1. 3D SD SEG
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
- novel view synthesis conditioned on a single image using an epipolar warp operator
- 3D-aware features for 3D detection, identifying cross-view point correspondences
4. AUDIO
- AudioSep: Separate Anything You Describe, Separate Anything Audio Model
5. 3D SEGMENTATION
- see 3.1 and 1.1.3; LIFT3D
- Segment Anything in 3D with NeRFs (SA3D)
- SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
- SAD can perform 3D segmentation (segment out any 3D object) with RGBD inputs
- VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking (convnext)
- predicts objects directly from sparse voxel features
- no sparse-to-dense conversion, anchors, or center proxies needed
- use: lifting a 2D segmentation mask into 3D boxes: code
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception
- segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects
- iSeg: Interactive 3D Segmentation via Interactive Attention
- click-based: positive and negative clicks directly on the shape’s surface
5.1. SUPERPRIMITIVE
- into point cloud:
- SuperPrimitive: Scene Reconstruction at a Primitive Level
- splits images into semantically correlated local regions, then augments them with surface normals
- for tasks: depth completion (per pixel), few-view structure from motion, and monocular dense visual odometry (recovering camera poses)
5.2. GAUSSIAN
- LangSplat: 3D Language Gaussian Splatting
- ground CLIP features into 3D language Gaussians, faster than LERF
- SA-GS: Segment Anything in 3D Gaussians
- without any training process or learned parameters
6. OPTICAL FLOW
- RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (video optical flow; usage sketch after this list)
- OmniMotion: Tracking Everything Everywhere All at Once (following pixels, optical flow)
- INVE: Interactive Neural Video Editing; painting pixels, then following them
- Tracking Anything in High Quality
- a pretrained mask-refinement (MR) model is employed to refine the tracking result
- CoTracker: models correlations between points over time using attention
- can track every pixel or a selected subset
- generates rainbow visualizations from a set of point tracks
- SpatialTracker: Tracking Any 2D Pixels in 3D Space
- handles occlusions and discontinuities in 2D, mitigating issues caused by image projection
- using monocular depth estimators
- see 2.3.2
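Two-frame flow with torchvision's RAFT implementation; the placeholder frames are assumptions, and input sizes must be divisible by 8.

#+begin_src python
# Two-frame optical flow with torchvision's RAFT.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

img1 = torch.rand(1, 3, 360, 640)  # placeholder frames; real ones come from a video
img2 = torch.rand(1, 3, 360, 640)
img1, img2 = weights.transforms()(img1, img2)  # normalize to the expected range

with torch.no_grad():
    flows = model(img1, img2)  # list of per-iteration flow estimates
flow = flows[-1]               # (1, 2, H, W): final (dx, dy) per pixel
#+end_src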
6.1. DIFFUSION OPTICAL FLOW
- parent: diffusion
- The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation
7. FINETUNING
- ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
- freezes model parameters, fine-tuning only a small set of prompt embeddings (skeleton sketched below)
- addresses both catastrophic forgetting and plasticity
- significantly reduces the trainable parameters
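A visual-prompt-tuning skeleton under the same premise (frozen backbone, a handful of trainable prompt tokens). Generic sketch, not ECLIPSE's code; the Identity backbone is a placeholder.

#+begin_src python
# Freeze the backbone, learn only prompt tokens prepended to the patch sequence.
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, n_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                  # frozen base model
        self.prompts = nn.Parameter(torch.zeros(1, n_prompts, embed_dim))
        nn.init.normal_(self.prompts, std=0.02)      # the only trainable weights

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings from a frozen tokenizer/stem
        B = tokens.shape[0]
        x = torch.cat([self.prompts.expand(B, -1, -1), tokens], dim=1)
        return self.backbone(x)                      # e.g. frozen transformer blocks

# Trainable parameters stay tiny: n_prompts * embed_dim per task.
encoder = PromptTunedEncoder(nn.Identity(), embed_dim=768)
out = encoder(torch.rand(2, 196, 768))  # (2, 204, 768)
#+end_src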