segmentation

1. TARGETING

  • Materialistic: Selecting Similar Materials in Images
  • Background Prompting for Improved Object Depth
    • learned background prompt, thus focusing on the object
  • LISA: Reasoning Segmentation via Large Language Model
    • Language Instructed Segmentation Assistant, speak to it and it segments
  • SegGPT: Segmenting Everything In Context
    • Painter & SegGPT Series: Vision Foundation Models from BAAI (radiography components, top of box)
  • Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
    • CLIP can perform zero-shot open-vocabulary segmentation; probability-like experience
  • CartoonSegmentation: Instance-guided Cartoon Editing with a Large-scale Dataset (anime fine details) =best=
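The GEM observation above — CLIP patch tokens already localize — reduces, at its simplest, to assigning each patch its nearest text embedding. A minimal sketch with random arrays standing in for CLIP's vision and text features (function name and shapes are illustrative, not CLIP's API):

```python
import numpy as np

def zero_shot_segment(patch_feats, text_feats):
    """Label each image patch with its most similar text prompt.

    patch_feats: (H, W, D) array, stand-in for CLIP vision tokens
    text_feats:  (C, D) array, one embedding per class prompt
    returns:     (H, W) integer label map
    """
    # L2-normalise so dot products become cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sims = p @ t.T                 # (H, W, C) patch-vs-prompt similarity
    return sims.argmax(axis=-1)    # per-patch class index

rng = np.random.default_rng(0)
labels = zero_shot_segment(rng.normal(size=(14, 14, 64)),
                           rng.normal(size=(3, 64)))
print(labels.shape)  # (14, 14)
```

Real open-vocabulary segmentation then upsamples this coarse patch map to pixel resolution.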

1.1. OBJECT DETECTION

  • Tracking Any Object Amodally
    • comprehend complete objects from partial visibility; boxes for occluded objects

1.1.1. CUTLER

  • CutLER: object detection and segmentation
    • detecting censor bars with deep learning and computer vision; gives their location (to later inpaint over them)
  • U2Seg: Unsupervised Universal Image Segmentation (vs CutLER) =best=
    • clustering of pseudo semantic labels
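U2Seg's pseudo semantic labels come from clustering features; as a stand-in for its actual pipeline, a toy k-means over per-pixel features (shapes and hyper-parameters made up):

```python
import numpy as np

def pseudo_semantic_labels(feats, k=4, iters=10, seed=0):
    """Toy k-means: cluster per-pixel features into k pseudo labels."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # assign every feature vector to its nearest center
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        # move each center to the mean of its members (skip empty clusters)
        for c in range(k):
            if (assign == c).any():
                centers[c] = feats[assign == c].mean(axis=0)
    return assign

feats = np.random.default_rng(1).normal(size=(256, 8))  # 256 "pixels"
labels = pseudo_semantic_labels(feats)
print(labels.shape)  # (256,)
```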

1.1.2. CONTROLNET FOR 3D

  • 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
    • finetunes a 2D diffusion model (ControlNet-style) to perform novel view synthesis from a single image (using an epipolar warp operator) =best=
    • 3D detection and identifying cross-view point correspondences

1.1.3. NERF SEGMENTATION

  • NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection
    • indoor 3D detection (and depth) with images as input; generalizes to unseen scenes without requiring per-scene optimization
  • EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
    • captures scene geometry, appearance, and motion; represents highly dynamic scenes via self-supervision
  • SAGA: Segment Any 3D Gaussians
    • multi-granularity segmentation, instantaneous (unlike SA3D)
  • GARField: Group Anything with Radiance Fields
    • uses SAM 2D masks, coarse-to-fine hierarchy

2. SAM

2.1. FASTER

  • Fast Segment Anything: 40 ms per image; on PyPI
  • EfficientSAM: 20x fewer parameters and 20x faster runtime
  • SlimSam: 0.1% Data Makes Segment Anything Slim
    • 0.9% (5.7M) of the parameters, 0.1% of the data
  • TinySAM: Pushing the Envelope for Efficient Segment Anything Model
    • knowledge distillation to distill a lightweight student model
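The TinySAM entry above relies on knowledge distillation; the generic soft-target recipe (not TinySAM's exact objective) is the KL divergence between temperature-softened teacher and student distributions:

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened softmax outputs."""
    def soft(x):
        # numerically stable softmax at temperature T
        e = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
        return e / e.sum(axis=-1, keepdims=True)
    p_t, p_s = soft(teacher_logits), soft(student_logits)
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))

z = np.array([[2.0, 0.5, -1.0]])
print(distill_loss(z, z))  # 0.0 — identical logits, nothing left to learn
```

The student is trained to drive this loss toward zero, matching the teacher's soft predictions rather than hard labels.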

2.2. VIDEOS

  • segment videos: https://github.com/gaomingqi/Track-Anything
  • Tracking Anything with Decoupled Video Segmentation
  • Video Instance Matting
    • estimates an alpha matte for each instance at each frame of a video sequence
  • UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
    • unify four reference-based object segmentation tasks with a single architecture (box, area from prompt)
  • Lester: rotoscope animation through video object segmentation and tracking
    • mask and track across frames

2.3. USE CASES

2.3.1. UNDERSTANDING

  • RelateAnything: see relationships between segmented objects
  • Osprey: Pixel Understanding with Visual Instruction Tuning; understand everything for SAM
    • click on a cluster of pixels and get a description of it

2.3.2. FOLLOW AREA

  • Segment Anything Meets Point Tracking, follow pixels, OPTICAL FLOW
  • DreamTeacher: Pretraining Image Backbones with Deep Generative Models
    • distills generative features into image backbones, endowing them with 3D-aware understanding

3. DIFFUSION SEGMENTATION

  • parent: stablediffusion
  • SLiMe: Segment Like Me
  • Diffusion Models as Masked Autoencoders
  • ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
  • Diffusion Models for Zero-Shot Open-Vocabulary Segmentation (considers the contextual background)
  • MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
    • generates synthetic labeled data for rare and novel categories, then uses it to train segmentation
  • FIND: Interface Foundation Models’ Embeddings
    • segment and correlate to prompt token
  • SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process =best=
    • improves segmentation accuracy by denoising the mask (exceedingly fine details)
  • EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models
    • identifies correspondences between pixels and latent space features
  • FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
    • through a diffusion model and an image captioner model
      • both frozen
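SegRefiner casts mask refinement as a discrete denoising process; as a crude stand-in (a 3×3 majority vote, not the paper's learned diffusion), each step pulls every pixel toward its neighbourhood:

```python
import numpy as np

def refine_mask(mask, steps=5):
    """Toy denoising of a binary mask: repeated 3x3 majority vote."""
    m = mask.astype(float)
    for _ in range(steps):
        padded = np.pad(m, 1, mode="edge")
        # sum each pixel's 3x3 neighbourhood by shifting the padded array
        votes = sum(padded[i:i + m.shape[0], j:j + m.shape[1]]
                    for i in range(3) for j in range(3))
        m = (votes >= 5).astype(float)  # majority of the 9 votes wins
    return m.astype(int)

noisy = np.zeros((16, 16), int)
noisy[4:12, 4:12] = 1   # square object
noisy[0, 0] = 1         # speckle outside
noisy[8, 8] = 0         # hole inside
clean = refine_mask(noisy)
print(clean[0, 0], clean[8, 8])  # 0 1 — speckle removed, hole filled
```

The learned diffusion model plays the role of the majority vote here, which is what lets it recover fine structure instead of merely smoothing.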

3.1. 3D SD SEG

  • 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
    • novel view synthesis conditioned on a single image, using an epipolar warp operator
    • 3D-aware features for 3D detection, identifying cross-view point correspondences

4. AUDIO

  • AudioSep: Separate Anything You Describe, Separate Anything Audio Model

5. 3D SEGMENTATION

  • 3.1 1.1.3 LIFT3D
  • Segment Anything in 3D with NeRFs (SA3D)
    • SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
  • SAD is able to perform 3D segmentation (segment out any 3D object) with RGBD inputs
  • VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking (convnext)
    • predict objects directly upon sparse voxel features
      • no sparse-to-dense conversion, anchors, or center proxies needed anymore
    • use: 2D segmentation mask into 3D boxes: code
  • EgoLifter: Open-world 3D Segmentation for Egocentric Perception
    • segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects
  • iSeg: Interactive 3D Segmentation via Interactive Attention
    • based on clicking, positive and negative clicks directly on the shape’s surface
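VoxelNeXt's "fully sparse" idea — operate only on occupied voxels, never a dense grid — starts from sparse voxelisation. A minimal numpy sketch (mean-pooled point features per voxel; sizes and values made up):

```python
import numpy as np

def voxelize_sparse(points, voxel_size=0.5):
    """Return (coords, feats): occupied voxel coordinates and the mean of
    the points falling into each one. No dense grid is ever allocated."""
    coords = np.floor(points / voxel_size).astype(int)
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    inv = inv.ravel()  # guard against numpy-version shape differences
    counts = np.bincount(inv, minlength=len(uniq))
    feats = np.zeros((len(uniq), points.shape[1]))
    for d in range(points.shape[1]):
        feats[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return uniq, feats

pts = np.array([[0.1, 0.2, 0.0],
                [0.2, 0.1, 0.1],   # lands in the same voxel as the first
                [3.0, 3.0, 3.0]])
coords, feats = voxelize_sparse(pts)
print(len(coords))  # 2 occupied voxels out of a huge implicit grid
```

A detector in this style then predicts objects directly from `feats`, skipping the sparse-to-dense conversion the bullet above mentions.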

5.1. SUPERPRIMITIVE

  • into point cloud:
    • SuperPrimitive: Scene Reconstruction at a Primitive Level
      • splitting images into semantically correlated local regions, then enhancing with normals
      • for tasks: depth completion (per pixel), few-view structure from motion, and monocular dense visual odometry (recovering POV angles)

5.2. GAUSSIAN

  • LangSplat: 3D Language Gaussian Splatting
    • ground CLIP features into 3D language Gaussians, faster than LERF
  • SA-GS: Segment Anything in 3D Gaussians
    • without any training process or learned parameters

6. OPTICAL FLOW

  • RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (video optical flow)
    • OmniMotion: Tracking Everything Everywhere All at Once (following pixels, optical flow)
    • INVE: Interactive Neural Video Editing; painting pixels, then following them
  • Tracking Anything in High Quality
    • a pretrained mask refiner (MR) model is employed to refine the tracking result
  • CoTracker: models correlation of the points in time, using attention
    • can track every pixel or selected
  • generate rainbow visualizations from a set of point tracks
  • SpatialTracker: Tracking Any 2D Pixels in 3D Space
    • deals with occlusions and discontinuities by tracking 2D pixels in 3D, mitigating the issues caused by image projection
      • using monocular depth estimators
  • 2.3.2
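The SpatialTracker entry hinges on lifting tracked pixels into 3D using monocular depth; the lifting step is just pinhole backprojection (the intrinsics below are made-up values):

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Lift 2D pixel tracks into 3D camera coordinates.

    uv: (N, 2) pixel positions; depth: (N,) metric depth per pixel.
    """
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)

uv = np.array([[320.0, 240.0], [400.0, 240.0]])   # two tracked pixels
pts3d = backproject(uv, np.array([2.0, 2.0]),
                    fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(pts3d[0])  # [0. 0. 2.] — the principal-point pixel lies on the optical axis
```

Tracking in this 3D space is what sidesteps the occlusion and discontinuity artifacts that image projection introduces.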

6.1. DIFFUSION OPTICAL FLOW

  • parent: diffusion
  • The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

7. FINETUNING

  • ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
    • freezes model parameters, fine-tuning a small set of prompt embeddings
      • addresses both catastrophic forgetting and plasticity
        • significantly reduces the trainable parameters
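The ECLIPSE recipe — frozen weights, trainable prompts — fits in a few lines. A linear toy model (not the paper's panoptic architecture) where gradient steps touch only the prompt vector:

```python
import numpy as np

# Toy visual prompt tuning: the backbone weight W stays frozen; only the
# prompt vector p, added to the input, receives gradient updates.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))     # frozen backbone weights
x = rng.normal(size=4)          # input features
y = rng.normal(size=4)          # target for the new task
p = np.zeros(4)                 # trainable prompt embedding

loss = lambda p: 0.5 * np.sum((W @ (x + p) - y) ** 2)
W0, l0 = W.copy(), loss(p)
for _ in range(200):
    err = W @ (x + p) - y       # residual of the frozen forward pass
    p -= 0.01 * (W.T @ err)     # gradient step on the prompt only

print(np.allclose(W, W0), loss(p) < l0)  # True True
```

Because only `p` moves, the old task's behaviour (encoded in `W`) cannot be catastrophically overwritten, yet the model still adapts — the forgetting/plasticity trade-off the bullet describes.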

Author: Tekakutli

Created: 2024-04-13 Sat 04:35