logistic
Table of Contents
- parent: domain
- SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
1. BEHAVIORAL
- BEHAVIORAL TRANSFORMER
- Building Cooperative Embodied Agents Modularly with Large Language Models
1.1. PLANNING
- Planning with Diffusion for Flexible Behavior Synthesis
- PlaSma: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning
- uses an LLM to revise a plan to cope with a counterfactual situation
- ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
- formulates the entire action space as a decision tree, then identifies the lowest-cost valid path (=cheapest cost decision=) as the solution
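The ToolChain* idea above, treating the action space as a tree and searching for the cheapest valid path, is essentially A* search. A minimal sketch (the toy action tree and the trivial heuristic are made up for illustration, not from the paper):

```python
import heapq

def a_star(start, successors, is_goal, heuristic):
    """Generic A* over an action tree: always expand the frontier node with
    the lowest f = g (path cost so far) + h (heuristic estimate to a goal)."""
    frontier = [(heuristic(start), 0, start, [start])]
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path, g
        for nxt, cost in successors(node):
            heapq.heappush(
                frontier,
                (g + cost + heuristic(nxt), g + cost, nxt, path + [nxt]),
            )
    return None, float("inf")

# Toy action tree: each state maps to (next_state, step_cost) pairs.
tree = {
    "start": [("plan_a", 2), ("plan_b", 1)],
    "plan_a": [("goal", 1)],
    "plan_b": [("goal", 5)],
}
path, cost = a_star(
    "start",
    successors=lambda s: tree.get(s, []),
    is_goal=lambda s: s == "goal",
    heuristic=lambda s: 0,  # trivial (admissible) heuristic
)
# path is the cheapest valid route through the tree, cost its total cost
```

With the zero heuristic this degenerates to uniform-cost search; in ToolChain*'s setting the heuristic would come from an LLM's estimate of how promising a partial action sequence is.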
- K-Level Reasoning with Large Language Models
- decision-making in evolving environments, dynamic reasoning
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents
- LLMs achieve a success rate of only 0.6% on travel planning
1.1.1. CODE PLANNING
- CodePlan: Repository-level Coding using LLMs and Planning
- context derived from the entire repository, previous code changes
- package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications
1.2. ROBOTS
- LLM AS REWARD 3.2.2.1 MOTION SYNTHESIS
- Generative Agents: Interactive Simulacra of Human Behavior, sims
- Discovering Adaptable Symbolic Algorithms from Scratch
- evolves (activates) safe control policies that avoid falling when individual limbs suddenly break
- Dynalang: Learning to Model the World with Language
- agents that leverage diverse language describing the state of the world, with feedback
- Diffusion-CCSP: Compositional Diffusion-Based Continuous Constraint Solvers
- novel combinations of known constraints
- Dolphins: Multimodal Language Model for Driving
- holistic understanding of intricate driving scenarios and multimodal instructions
- PhotoBot: Reference-Guided Interactive Photography via Natural Language
- takes photos at the best poses (cinematography), best perspectives, and points of view
1.2.1. MINECRAFT
- STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
- unCLIP is effective for creating instruction-following sequential decision-making agents
- built on pretrained models like VPT and MineCLIP, STEVE-1 costs just $60 to train
- JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
- multimodal memory, planning using both pre-trained knowledge and actual game experiences
- BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
- without requiring hand-designed reward functions
1.2.2. DAG
- DAG Amendment for Inverse Control of Parametric Shapes
- infers the intention depending on the brush size and location
- and modifies the hyperparameters, not along just one axis but across whole arm-mechanisms
2. ECONOMY
- BloombergGPT: A Large Language Model for Finance (economy)
3. SCENE
- Diffusion-based Generation, Optimization, and Planning in 3D Scenes
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
- 2D foundation models then fusing their output to 3D by multi-view association
- complex reasoning over spatial and semantic concepts.
- LangNav: Language as a Perceptual Representation for Navigation
- selects an action (from the instruction) based on the current view and the trajectory history
3.1. SCENE SYNTHESIS
- SCENE TEXTURES
- 3D-GPT: Procedural 3D Modeling with Large Language Models
- instruction-driven 3D modeling
- evolving (and enhancing) their detailed forms while dynamically adapting to subsequent instructions
- Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
- leverages CLIP scene understanding instead of layouts, diffusion- rather than GAN-based
- Text2Street: Controllable Text-to-image Generation for Street Views
- text-to-map generation integrating road topology, object layout, and weather description
- SemCity: Semantic Scene Generation with Triplane Diffusion (refinement and inpainting)
- Procedural terrain generation with style transfer
- draws style from real-world height maps onto Perlin noise
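One simple way to transfer a real height map's "style" onto procedural noise is histogram matching. A minimal sketch, with cheap value noise standing in for Perlin noise and a synthetic gamma-distributed array standing in for a real-world DEM (all names and data here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def value_noise(size, grid=8):
    """Cheap stand-in for Perlin noise: bilinearly upsample a coarse random grid."""
    coarse = rng.random((grid + 1, grid + 1))
    xs = np.linspace(0, grid, size)
    x0 = np.floor(xs).astype(int).clip(0, grid - 1)
    tx = xs - x0  # fractional position inside each coarse cell
    rows = coarse[x0, :] * (1 - tx)[:, None] + coarse[x0 + 1, :] * tx[:, None]
    return rows[:, x0] * (1 - tx)[None, :] + rows[:, x0 + 1] * tx[None, :]

def match_histogram(src, ref):
    """Transfer ref's height distribution (its 'style') onto src by mapping
    src's rank-ordered values onto ref's sorted values."""
    order = np.argsort(src, axis=None)
    out = np.empty(src.size, dtype=ref.dtype)
    out[order] = np.sort(ref, axis=None)
    return out.reshape(src.shape)

noise = value_noise(64)                               # smooth procedural terrain
real_dem = rng.gamma(2.0, 200.0, size=(64, 64))       # hypothetical real height map
styled = match_histogram(noise, real_dem)             # noise shape, DEM statistics
```

The result keeps the noise's spatial layout (ridges and valleys stay where they were) while taking on exactly the real map's height distribution.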
- RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
- optimizes a 3D Gaussian Splatting, allows 3D synthesis from a single image
3.1.1. ROOM
- DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
- multi-timestep sampling strategy guided by the formation patterns of 3D objects
- enables targeted adjustments
3.1.1.1. ROOM LAYOUT
- BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
- semantically and geometrically meaningful transitions that harmoniously blend with the existing scene
- 2D layout conditioning-control
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
- with semantic graph prior and a layout decoder
- GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
- uses an LLM to generate an initial layout as a geometric constraint
- Sketch-to-Architecture: Generative AI-aided Architectural Design
- generate conceptual floorplans and 3D models from simple sketches
3.1.1.2. ROOMDREAMER
- RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
- uses a cubemap, with a depth map (object plane to screen plane) and a distance map
- Holodeck: a promptable system that can generate diverse, customized, and interactive 3D environments
3.2. INTERACTIONS
3.2.1. MOTION SYNTHESIS
- motion
- AI4Animation: Neural State Machine for Character-Scene Interactions
- NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
- neural interaction field attached to a specific object
- guided diffusion model trained on generated synthetic data
- Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text
- text-to-motion across various locations and with specific motions
- motion semantic trajectory constraint
- CHOIS: Controllable Human-Object Interaction Synthesis
- diffusion with constraints
- language informs style and intent
- waypoints ground the motion and can be effectively extracted using high-level planning methods
- diffusion with constraints
- ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors
- method for human-object interaction synthesis
- given an unseen object, optimises for the closest one in feature space
- TRUMANS: Scaling Up Dynamic Human-Scene Interaction Modeling
- 15 hours of human interactions across 100 indoor scenes
- diffusion-based autoregressive model that efficiently generates HSI sequences of any length
3.2.2. LLM
- 3D-LLM: Injecting the 3D World into Large Language Models
- LLM with 3D understanding such as spatial relationships, affordances, physics, and layout
- can take 3D point clouds and their features as input
- Physically Grounded Vision-Language Models for Robotic Manipulation
- planning on tasks that require reasoning about physical object concepts
- Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
- long-sequence and efficient motion
3.2.2.1. PROPER-ING INSTRUCTIONS
- Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
- method to automatically improve the quality of LLM instructions
- Creative Robot Tool Use with Large Language Models
- takes input instructions and outputs executable code for controlling robots (tools)
3.2.3. INSIDE COMPUTER
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (website is scene)
- LLM-driven agent to complete instruction tasks on real websites
- A Zero-Shot Language Agent for Computer Control with Structured Reflection
- partially observed environment, iteratively learning from its mistakes, structured thought management
3.2.4. GENERATE BLENDER
- SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
- models a scene graph as a blueprint, detailing spatial relationships among assets in the scene
- then writes Blender Python scripts based on this graph, translating relationships into numerical constraints for asset layout
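A toy sketch of that graph-to-constraints step: (subject, relation, object) triples are resolved into numeric placements, which are then emitted as a Blender script. The objects, relations, and constraint rules are invented for illustration, and the script is produced as plain text rather than run inside Blender:

```python
# Hypothetical objects with [x, y, z] positions and heights (names made up).
objects = {
    "table": {"pos": [0.0, 0.0, 0.0], "height": 0.8},
    "lamp":  {"pos": [0.0, 0.0, 0.0], "height": 0.4},
    "chair": {"pos": [0.0, 0.0, 0.0], "height": 1.0},
}

# Scene graph as blueprint: (subject, relation, object) triples.
graph = [("lamp", "on", "table"), ("chair", "next_to", "table")]

def solve_layout(objects, graph, gap=0.1):
    """Translate each relation into a numerical placement constraint."""
    for subj, rel, obj in graph:
        s, o = objects[subj], objects[obj]
        if rel == "on":          # rest subject on top of object
            s["pos"][:2] = o["pos"][:2]
            s["pos"][2] = o["pos"][2] + o["height"]
        elif rel == "next_to":   # offset subject along x with a small gap
            s["pos"][0] = o["pos"][0] + 1.0 + gap
            s["pos"][1:] = o["pos"][1:]
    return objects

def to_blender_script(objects):
    """Emit Blender Python (as text) placing a proxy cube per object."""
    lines = ["import bpy"]
    for name, o in objects.items():
        x, y, z = o["pos"]
        lines.append(
            f"bpy.ops.mesh.primitive_cube_add(location=({x:.2f}, {y:.2f}, {z:.2f}))"
        )
    return "\n".join(lines)

layout = solve_layout(objects, graph)
script = to_blender_script(layout)
```

The point is the two-stage split SceneCraft describes: a symbolic graph is solved into numbers first, and only then serialized into executable Blender code.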
4. UNDERSTANDING
- INTERACTIONS
- Pixel-Wise Color Constancy via Smoothness Techniques in Multi-Illuminant Scenes
- anti-abnormal-light filter by learning pixel-wise illumination maps caused by multiple light sources
4.1. POSE - POSITION
- DETECTING HUMAN MOTION SYNTHESIS
- PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
- modelling the distribution of camera poses given input images
- Detector-Free Structure from Motion
- Effective Whole-body Pose Estimation with Two-stages Distillation
- used instead of the OpenPose preprocessor
- DECO: Dense Estimation of 3D Human-Scene Contact In The Wild
- recognize 3D contact between body and objects
- Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation
- people, animals, furniture, faces
- Reconstructing Close Human Interactions from Multiple Views
- input multi-view 2D keypoint heatmaps and reconstruct the pose of each individual
- Extreme Two-View Geometry From Object Poses with Diffusion Models
- extreme viewpoint changes, with no co-visible regions in the images
4.1.1. MONOCULAR
- Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
- body tracking, only one view needed
- D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement
- refine the output of any existing 3D pose estimator (monocular camera-based 3D pose estimation)
- SMPLer: monocular 3D human motion capture; motion capture from any video
4.1.2. GET HEAD POSE
- GPAvatar: Generalizable and Precise Head Avatar from Image(s)
- recreate the head avatar and precisely control expressions and postures
- IMUSIC: IMU-based Facial Expression Capture
- facial expression capture using purely IMU signals
- privacy-protecting, hybrid capture against occlusions, detecting movements often invisible
4.1.3. SKELETON
- Coverage Axis++: Efficient Inner Point Selection for 3D Shape Skeletonization
- strategy that considers both shape coverage and uniformity to derive skeletal points
4.2. GEOMETRY INTERACTIONS
- Understanding 3D Object Interaction from a Single Image
- Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
- 3D geometry understanding (tokens) with 2D rich semantics
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
- scene and object caption, object referral
- Learning Generalizable Feature Fields for Mobile Manipulation
- GeFF (Generalizable Feature Fields)
- for both navigation and manipulation in real time