logistic

Table of Contents

1. BEHAVIOURAL

1.1. PLANNING

  • Planning with Diffusion for Flexible Behavior Synthesis
    • PlaSma: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning
      • an LLM revises the plan to cope with a counterfactual situation
  • ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
    • formulates the entire action space as a decision tree, then identifies the lowest-cost valid path as the solution
      • =cheapest-cost decision=
  • K-Level Reasoning with Large Language Models
    • decision-making in evolving environments, dynamic reasoning
  • TravelPlanner: A Benchmark for Real-World Planning with Language Agents
    • LLMs achieve a success rate of only 0.6% on travel planning
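Of the planning methods above, ToolChain*'s cheapest-cost search is the most mechanical. A minimal generic A* sketch, with plain callables standing in for ToolChain*'s learned cost and heuristic functions (the function names and the toy action tree in the test are hypothetical, not from the paper):

```python
import heapq
from itertools import count

def a_star(start, successors, heuristic, is_goal):
    """Generic A*: repeatedly expands the frontier node with the
    lowest f = g + h, where g is cost so far and h the heuristic.

    successors(state) yields (action, next_state, step_cost) triples;
    heuristic(state) should never overestimate the remaining cost.
    Returns (action_path, total_cost), or (None, inf) if no goal is reachable.
    """
    tie = count()  # tiebreaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    best_g = {}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        if best_g.get(state, float("inf")) <= g:
            continue  # already expanded via a cheaper route
        best_g[state] = g
        for action, nxt, cost in successors(state):
            heapq.heappush(frontier, (g + cost + heuristic(nxt),
                                      next(tie), g + cost, nxt, path + [action]))
    return None, float("inf")
```

With a zero heuristic this degrades to Dijkstra; ToolChain*'s contribution is supplying non-trivial cost and heuristic estimates for LLM action sequences.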

1.1.1. CODE PLANNING

  • CodePlan: Repository-level Coding using LLMs and Planning
    • context derived from the entire repository, previous code changes
    • package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications

1.2. ROBOTS

  • LLM AS REWARD (see 3.2.2.1 MOTION SYNTHESIS)
  • Generative Agents: Interactive Simulacra of Human Behavior, sims
  • Discovering Adaptable Symbolic Algorithms from Scratch
    • evolves (activates) safe control policies that avoid falling when individual limbs suddenly break
  • Dynalang: Learning to Model the World with Language
    • agents that leverage diverse language that describes state of the world with feedback
  • Diffusion-CCSP: Compositional Diffusion-Based Continuous Constraint Solvers
    • novel combinations of known constraints
  • Dolphins: Multimodal Language Model for Driving
    • holistic understanding of intricate driving scenarios and multimodal instructions
  • PhotoBot: Reference-Guided Interactive Photography via Natural Language
    • takes photos at the best poses, perspectives, and points of view (cinematography)

1.2.1. MINECRAFT

  • STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
    • unCLIP is effective for creating instruction-following sequential decision-making agents
    • built on pretrained models like VPT and MineCLIP; STEVE-1 costs just $60 to train
  • JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
    • multimodal memory, planning using both pre-trained knowledge and actual game experiences
  • BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay
    • without requiring hand-designed reward functions

1.2.2. DAG

  • DAG Amendment for Inverse Control of Parametric Shapes
    • infers the intention from the size of the brush and its location
      • and modifies the hyperparameters, not just one axis but whole arm mechanisms

1.2.3. DATASET

  • RT-X: the largest open-source robot dataset
  • MimicPlay: imitation learning algorithm that extracts the most signals from unlabeled human motions

2. ECONOMY

  • BloombergGPT: A Large Language Model for Finance (economy)

3. SCENE

  • Diffusion-based Generation, Optimization, and Planning in 3D Scenes
  • ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
    • runs 2D foundation models, then fuses their outputs into 3D via multi-view association
    • complex reasoning over spatial and semantic concepts
  • LangNav: Language as a Perceptual Representation for Navigation
    • selects an action (from the instruction) based on the current view and the trajectory history

3.1. SCENE SYNTHESIS

  • SCENE TEXTURES
  • 3D-GPT: Procedural 3D Modeling with Large Language Models
    • instruction-driven 3D modeling
      • evolving (and enhancing) their detailed forms while dynamically adapting to subsequent instructions
  • Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
    • leverages CLIP scene understanding instead of layouts, GAN-based
  • Text2Street: Controllable Text-to-image Generation for Street Views
    • text-to-map generation integrating road structure/topology, object layout, and weather description
  • SemCity: Semantic Scene Generation with Triplane Diffusion (refinement and inpainting)
  • Procedural terrain generation with style transfer
    • drawing style from real-world height maps onto Perlin noise
  • RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion
    • optimizes a 3D Gaussian Splatting, allows 3D synthesis from a single image

3.1.1. ROOM

  • DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
    • multi-timestep sampling strategy guided by the formation patterns of 3D objects
    • enables targeted adjustments
3.1.1.1. ROOM LAYOUT
  • BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
    • semantically and geometrically meaningful transitions that harmoniously blend with the existing scene
    • 2D layout conditioning-control
  • InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
    • with semantic graph prior and a layout decoder
  • GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
    • uses an LLM to generate the initial layout as a geometric constraint
  • Sketch-to-Architecture: Generative AI-aided Architectural Design
    • generate conceptual floorplans and 3D models from simple sketches
3.1.1.2. ROOMDREAMER
  • RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture
    • uses a cubemap, with a depth map (object plane to screen plane) and a distance map
  • Holodeck: promptable system that can generate diverse, customized, and interactive 3D environments

3.2. INTERACTIONS

3.2.1. MOTION SYNTHESIS

  • motion
  • AI4Animation: Neural State Machine for Character-Scene Interactions
  • NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
    • neural interaction field attached to a specific object
    • guided diffusion model trained on generated synthetic data
  • Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text
    • text-to-motion: characters traverse various locations and perform specific motions
    • motion semantic trajectory constraint
  • CHOIS: Controllable Human-Object Interaction Synthesis
    • diffusion with constraints
      1. language informs style and intent
      2. waypoints ground the motion and can be effectively extracted using high-level planning methods
  • ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors
    • method for human-object interaction synthesis
    • given an unseen object, optimises for the closest in the feature space
  • TRUMANS: Scaling Up Dynamic Human-Scene Interaction Modeling
    • 15 hours of human interactions across 100 indoor scenes
    • diffusion-based autoregressive model that efficiently generates HSI sequences of any length

3.2.2. LLM

  • 3D-LLM: Injecting the 3D World into Large Language Models
    • LLM with 3D understanding such as spatial relationships, affordances, physics, layout
    • can take 3D point clouds and their features as input
  • Physically Grounded Vision-Language Models for Robotic Manipulation
    • planning on tasks that require reasoning about physical object concepts
  • Motion Mamba: Efficient and Long Sequence Motion Generation with Hierarchical and Bidirectional Selective SSM
    • long-sequence and efficient motion
3.2.2.1. PROPER-ING INSTRUCTIONS
  • Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models
    • method to automatically improve the quality of LLM instructions
  • Creative Robot Tool Use with Large Language Models
    • inputs instructions and outputs executable code for controlling robots (tools)

3.2.3. INSIDE COMPUTER

  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (website is scene)
    • LLM-driven agent to complete instruction tasks on real websites
  • A Zero-Shot Language Agent for Computer Control with Structured Reflection
    • partially observed environment, iteratively learning from its mistakes, structured thought management

3.2.4. GENERATE BLENDER

  • SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
    • models a scene graph as a blueprint, detailing spatial relationships among assets in the scene
    • then writes blender Python scripts based on this graph, translating relationships into numerical constraints for asset layout
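A toy sketch of that graph-to-script pipeline, assuming invented relation names (`on`, `left_of`) and a naive placement rule; the real SceneCraft solves the constraints numerically and has an LLM write the Blender script:

```python
def layout_from_graph(graph):
    """Turn (subject, relation, object) triples into 3D positions.

    Toy rules: 'left_of' shifts the subject along -x relative to the
    object; 'on' stacks the subject one unit above it along z.
    Returns {asset_name: (x, y, z)}.
    """
    pos = {}
    x = 0.0
    for a, rel, b in graph:
        if b not in pos:           # anchor unseen objects along +x
            pos[b] = (x, 0.0, 0.0)
            x += 2.0
        bx, by, bz = pos[b]
        if rel == "left_of":
            pos.setdefault(a, (bx - 2.0, by, bz))
        elif rel == "on":
            pos.setdefault(a, (bx, by, bz + 1.0))
    return pos

def to_blender_script(positions):
    """Emit a Blender Python script as text, one placeholder cube per asset."""
    lines = ["import bpy"]
    for name, (px, py, pz) in positions.items():
        lines.append(
            f"bpy.ops.mesh.primitive_cube_add(location=({px}, {py}, {pz}))  # {name}"
        )
    return "\n".join(lines)
```

The generated text targets Blender's `bpy` API but is only emitted as a string here, so the sketch runs without Blender installed.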

4. UNDERSTANDING

  • INTERACTIONS
  • Pixel-Wise Color Constancy via Smoothness Techniques in Multi-Illuminant Scenes
    • corrects abnormal lighting by learning pixel-wise illumination maps caused by multiple light sources

4.1. POSE - POSITION

  • DETECTING HUMAN MOTION SYNTHESIS
  • PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
    • modelling the distribution of camera poses given input images
    • Detector-Free Structure from Motion
  • Effective Whole-body Pose Estimation with Two-stages Distillation
    • a replacement for the OpenPose preprocessor
  • DECO: Dense Estimation of 3D Human-Scene Contact In The Wild
    • recognize 3D contact between body and objects
  • Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation
    • people, animals, furniture, faces
  • Reconstructing Close Human Interactions from Multiple Views
    • inputs multi-view 2D keypoint heatmaps and reconstructs the pose of each individual
  • Extreme Two-View Geometry From Object Poses with Diffusion Models
    • extreme viewpoint changes, with no co-visible regions in the images

4.1.1. MONOCULAR

  • Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
    • body tracking, only one view needed
  • D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement
    • refine the output of any existing 3D pose estimator (monocular camera-based 3D pose estimation)
  • SMPLer: monocular 3D human motion capture (Motion Capture from Any Video)

4.1.2. GET HEAD POSE

  • GPAvatar: Generalizable and Precise Head Avatar from Image(s)
    • recreates the head avatar and precisely controls expressions and postures
  • IMUSIC: IMU-based Facial Expression Capture
    • facial expression capture using purely IMU signals
    • privacy-protecting, hybrid capture against occlusions, detecting movements often invisible

4.1.3. SKELETON

  • Coverage Axis++: Efficient Inner Point Selection for 3D Shape Skeletonization
    • strategy that considers both shape coverage and uniformity to derive skeletal points
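The coverage-plus-uniformity trade-off can be illustrated with a greedy toy version (the radius, penalty weight, and candidate set below are invented; Coverage Axis++'s actual selection works on medial-axis candidates with a different objective):

```python
from math import dist

def select_skeletal_points(candidates, surface, radius, lam=1.0, k=3):
    """Greedily pick up to k inner points.

    Each step scores a candidate by how many still-uncovered surface
    samples fall within `radius` of it (coverage), minus a penalty that
    grows as it crowds already-selected points (uniformity).
    """
    uncovered = set(range(len(surface)))
    selected = []
    for _ in range(k):
        best, best_score, best_cov = None, float("-inf"), None
        for c in candidates:
            if c in selected:
                continue
            cov = {i for i in uncovered if dist(c, surface[i]) <= radius}
            penalty = sum(max(0.0, radius - dist(c, s)) for s in selected)
            score = len(cov) - lam * penalty
            if score > best_score:
                best, best_score, best_cov = c, score, cov
        if best is None or not best_cov:
            break  # nothing left that adds coverage
        selected.append(best)
        uncovered -= best_cov
    return selected
```

On two well-separated clusters of surface samples, the greedy loop picks one inner point per cluster rather than two crowded ones, which is the behaviour the paper's objective is after.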

4.2. GEOMETRY INTERACTIONS

  • Understanding 3D Object Interaction from a Single Image
  • Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
    • 3D geometry understanding (tokens) with 2D rich semantics
  • SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
    • scene and object captioning, object referral
  • Learning Generalizable Feature Fields for Mobile Manipulation
    • GeFF (Generalizable Feature Fields)
    • for both navigation and manipulation in real time

Author: Tekakutli

Created: 2024-04-13 Sat 04:35