logistic
Table of Contents
- parent: domain
- SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
1. BEHAVIORAL
- BEHAVIORAL TRANSFORMER
- Building Cooperative Embodied Agents Modularly with Large Language Models
1.1. PLANNING
- Planning with Diffusion for Flexible Behavior Synthesis
- PlaSma: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning
- uses an LLM to revise a plan to cope with a counterfactual situation
- ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
- formulates the entire action space as a decision tree, then identifies the lowest-cost valid path (=cheapest cost decision=) as the solution
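The ToolChain* idea above, treating the action space as a tree and searching for the cheapest valid path, is essentially A* search. A minimal sketch (the toy action tree and the trivial heuristic are made up for illustration, not from the paper):

```python
import heapq

def a_star(start, successors, is_goal, heuristic):
    """Generic A* over an action tree: always expand the frontier node with
    the lowest f = g (path cost so far) + h (heuristic estimate to a goal)."""
    frontier = [(heuristic(start), 0, start, [start])]
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path, g
        for nxt, cost in successors(node):
            heapq.heappush(
                frontier,
                (g + cost + heuristic(nxt), g + cost, nxt, path + [nxt]),
            )
    return None, float("inf")

# Toy action tree: each state maps to (next_state, step_cost) pairs.
tree = {
    "start": [("plan_a", 2), ("plan_b", 1)],
    "plan_a": [("goal", 1)],
    "plan_b": [("goal", 5)],
}
path, cost = a_star(
    "start",
    successors=lambda s: tree.get(s, []),
    is_goal=lambda s: s == "goal",
    heuristic=lambda s: 0,  # trivial (admissible) heuristic
)
# path is the cheapest valid route through the tree, cost its total cost
```

With the zero heuristic this degenerates to uniform-cost search; in ToolChain*'s setting the heuristic would come from an LLM's estimate of how promising a partial action sequence is.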
- K-Level Reasoning with Large Language Models
- decision-making in evolving environments, dynamic reasoning
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- LLMs achieve a success rate of only 0.6% on travel planning
1.1.1. CODE PLANNING
- CodePlan: Repository-level Coding using LLMs and Planning
- context derived from the entire repository, previous code changes
- package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications
1.2. ROBOTS
- LLM AS REWARD 3.2.2.1 MOTION SYNTHESIS
- Generative Agents: Interactive Simulacra of Human Behavior, sims
- Discovering Adaptable Symbolic Algorithms from Scratch
- evolves (activates) safe control policies that avoid falling when individual limbs suddenly break
- Dynalang: Learning to Model the World with Language
- agents that leverage diverse language describing the state of the world, with feedback
- Diffusion-CCSP: Compositional Diffusion-Based Continuous Constraint Solvers
- novel combinations of known constraints
- Dolphins: Multimodal Language Model for Driving
- holistic understanding of intricate driving scenarios and multimodal instructions
- PhotoBot: Reference-Guided Interactive Photography via Natural Language
- takes photos at the best poses (cinematography), best perspectives, and points of view
1.2.1. MINECRAFT
- STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
- unCLIP is effective for creating instruction-following sequential decision-making agents
- built on pretrained models like VPT and MineCLIP, STEVE-1 costs just $60 to train
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
- multimodal memory, planning using both pre-trained knowledge and actual game experiences
- BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
- without requiring hand-designed reward functions
1.2.2. DAG
- DAG Amendment for Inverse Control of Parametric Shapes
- infers the intention depending on the brush size and location
- and modifies the hyperparameters, not along just one axis but across whole arm-mechanisms
2. ECONOMY
- BloombergGPT: A Large Language Model for Finance (economy)
3. SCENE
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
- 2D foundation models then fusing their output to 3D by multi-view association
- complex reasoning over spatial and semantic concepts.
- LangNav: Language as a Perceptual Representation for Navigation
- selects an action (from the instruction) based on the current view and the trajectory history
3.1. SCENE SYNTHESIS
- SCENE TEXTURES
- 3D-GPT: Procedural 3D Modeling with Large Language Models
- instruction-driven 3D modeling
- evolving (and enhancing) their detailed forms while dynamically adapting to subsequent instructions
- Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
- leverages CLIP scene understanding instead of layouts, diffusion- rather than GAN-based
- Text2Street: Controllable Text-to-image Generation for Street Views
- text-to-map generation integrating road topology, object layout, and weather description
- SemCity: Semantic Scene Generation with Triplane Diffusion (refinement and inpainting)
- Procedural terrain generation with style transfer
- draws style from real-world height maps onto Perlin noise
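One simple way to transfer a real height map's "style" onto procedural noise is histogram matching. A minimal sketch, with cheap value noise standing in for Perlin noise and a synthetic gamma-distributed array standing in for a real-world DEM (all names and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def value_noise(size, grid=8):
    """Cheap stand-in for Perlin noise: bilinearly upsample a coarse random grid."""
    coarse = rng.random((grid + 1, grid + 1))
    xs = np.linspace(0, grid, size)
    x0 = np.floor(xs).astype(int).clip(0, grid - 1)
    tx = xs - x0  # fractional position inside each coarse cell
    rows = coarse[x0, :] * (1 - tx)[:, None] + coarse[x0 + 1, :] * tx[:, None]
    return rows[:, x0] * (1 - tx)[None, :] + rows[:, x0 + 1] * tx[None, :]

def match_histogram(src, ref):
    """Transfer ref's height distribution (its 'style') onto src by mapping
    src's rank-ordered values onto ref's sorted values."""
    order = np.argsort(src, axis=None)
    out = np.empty(src.size, dtype=ref.dtype)
    out[order] = np.sort(ref, axis=None)
    return out.reshape(src.shape)

noise = value_noise(64)                               # smooth procedural terrain
real_dem = rng.gamma(2.0, 200.0, size=(64, 64))       # hypothetical real height map
styled = match_histogram(noise, real_dem)             # noise shape, DEM statistics
```

The result keeps the noise's spatial layout (ridges and valleys stay where they were) while taking on exactly the real map's height distribution.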
- RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
- optimizes a 3D Gaussian Splatting, allows 3D synthesis from a single image
3.1.1. ROOM
- DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
- multi-timestep sampling strategy guided by the formation patterns of 3D objects
- enables targeted adjustments
3.1.1.1. ROOM LAYOUT
- BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
- semantically and geometrically meaningful transitions that harmoniously blend with the existing scene
- 2D layout conditioning-control
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
- with semantic graph prior and a layout decoder
- GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
- uses an LLM to generate an initial layout as a geometric constraint
- Sketch-to-Architecture: Generative AI-aided Architectural Design
- generate conceptual floorplans and 3D models from simple sketches
3.1.1.2. ROOMDREAMER
- RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
- uses a cubemap, with a depth map (object plane to screen plane) and a distance map
- Holodeck: a promptable system that can generate diverse, customized, and interactive 3D environments
3.2. INTERACTIONS
3.2.1. MOTION SYNTHESIS
- motion
- AI4Animation: Neural State Machine for Character-Scene Interactions
- NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
- neural interaction field attached to a specific object
- guided diffusion model trained on generated synthetic data
- Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text
- text-to-motion across various locations and with specific motions
- motion semantic trajectory constraint
- CHOIS: Controllable Human-Object Interaction Synthesis
- diffusion with constraints
- language informs style and intent
- waypoints ground the motion and can be effectively extracted using high-level planning methods
- diffusion with constraints
- ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors
- method for human-object interaction synthesis
- given an unseen object, optimises for the closest one in feature space
- TRUMANS: Scaling Up Dynamic Human-Scene Interaction Modeling
- 15 hours of human interactions across 100 indoor scenes
- diffusion-based autoregressive model that efficiently generates HSI sequences of any length
3.2.2. LLM
- 3D-LLM: Injecting the 3D World into Large Language Models
- LLM with 3D understanding such as spatial relationships, affordances, physics, and layout
- can take 3D point clouds and their features as input
- Physically Grounded Vision-Language Models for Robotic Manipulation
- planning on tasks that require reasoning about physical object concepts
- Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
- long-sequence and efficient motion
3.2.2.1. PROPER-ING INSTRUCTIONS
- Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
- method to automatically improve the quality of LLM instructions
- Creative Robot Tool Use with Large Language Models
- takes input instructions and outputs executable code for controlling robots (tools)
3.2.3. INSIDE COMPUTER
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (website is scene)
- LLM-driven agent to complete instruction tasks on real websites
- A Zero-Shot Language Agent for Computer Control with Structured Reflection
- partially observed environment, iteratively learning from its mistakes, structured thought management
3.2.4. GENERATE BLENDER
- SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
- models a scene graph as a blueprint, detailing spatial relationships among assets in the scene
- then writes Blender Python scripts based on this graph, translating relationships into numerical constraints for asset layout
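A toy sketch of that graph-to-constraints step: (subject, relation, object) triples are resolved into numeric placements, which are then emitted as a Blender script. The objects, relations, and constraint rules are invented for illustration, and the script is produced as plain text rather than run inside Blender:

```python
# Hypothetical objects with [x, y, z] positions and heights (names made up).
objects = {
    "table": {"pos": [0.0, 0.0, 0.0], "height": 0.8},
    "lamp":  {"pos": [0.0, 0.0, 0.0], "height": 0.4},
    "chair": {"pos": [0.0, 0.0, 0.0], "height": 1.0},
}

# Scene graph as blueprint: (subject, relation, object) triples.
graph = [("lamp", "on", "table"), ("chair", "next_to", "table")]

def solve_layout(objects, graph, gap=0.1):
    """Translate each relation into a numerical placement constraint."""
    for subj, rel, obj in graph:
        s, o = objects[subj], objects[obj]
        if rel == "on":          # rest subject on top of object
            s["pos"][:2] = o["pos"][:2]
            s["pos"][2] = o["pos"][2] + o["height"]
        elif rel == "next_to":   # offset subject along x with a small gap
            s["pos"][0] = o["pos"][0] + 1.0 + gap
            s["pos"][1:] = o["pos"][1:]
    return objects

def to_blender_script(objects):
    """Emit Blender Python (as text) placing a proxy cube per object."""
    lines = ["import bpy"]
    for name, o in objects.items():
        x, y, z = o["pos"]
        lines.append(
            f"bpy.ops.mesh.primitive_cube_add(location=({x:.2f}, {y:.2f}, {z:.2f}))"
        )
    return "\n".join(lines)

layout = solve_layout(objects, graph)
script = to_blender_script(layout)
```

The point is the two-stage split SceneCraft describes: a symbolic graph is solved into numbers first, and only then serialized into executable Blender code.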
4. UNDERSTANDING
- INTERACTIONS
- Pixel-Wise Color Constancy via Smoothness Techniques in Multi-Illuminant Scenes
- anti-abnormal-light filter by learning pixel-wise illumination maps caused by multiple light sources
4.1. POSE - POSITION
- DETECTING HUMAN MOTION SYNTHESIS
- PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
- modelling the distribution of camera poses given input images
- Detector-Free Structure from Motion
- Effective Whole-body Pose Estimation with Two-stages Distillation
- used instead of the OpenPose preprocessor
- DECO: Dense Estimation of 3D Human-Scene Contact In The Wild
- recognize 3D contact between body and objects
- Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation
- people, animals, furniture, faces
- Reconstructing Close Human Interactions from Multiple Views
- input multi-view 2D keypoint heatmaps and reconstruct the pose of each individual
- Extreme Two-View Geometry From Object Poses with Diffusion Models
- extreme viewpoint changes, with no co-visible regions in the images
4.1.1. MONOCULAR
- Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
- body tracking, only one view needed
- D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement
- refine the output of any existing 3D pose estimator (monocular camera-based 3D pose estimation)
- SMPLer: monocular 3D human motion capture; motion capture from any video
4.1.2. GET HEAD POSE
- GPAvatar: Generalizable and Precise Head Avatar from Image(s)
- recreate the head avatar and precisely control expressions and postures
- IMUSIC: IMU-based Facial Expression Capture
- facial expression capture using purely IMU signals
- privacy-protecting, hybrid capture against occlusions, detecting movements often invisible
4.1.3. SKELETON
- Coverage Axis++: Efficient Inner Point Selection for 3D Shape Skeletonization
- strategy that considers both shape coverage and uniformity to derive skeletal points
4.2. GEOMETRY INTERACTIONS
- Understanding 3D Object Interaction from a Single Image
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
- 3D geometry understanding (tokens) with 2D rich semantics
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
- scene and object caption, object referral
- Learning Generalizable Feature Fields for Mobile Manipulation
- GeFF (Generalizable Feature Fields)
- for both navigation and manipulation in real time