Diffusion Model Scene Representation

Project Overview

Latent diffusion models (LDMs) have advanced remarkably in generating realistic images from textual descriptions. Even when trained purely on images, with no explicit depth information, they typically produce detailed, coherent pictures of 3D scenes.

However, how well they represent 3D information (depth, saliency, shading) in the images they generate remains unclear. Existing research focuses primarily on what these models output, leaving a gap in our understanding of how they process a scene internally. Our project examines the diffusion process of LDMs to determine how they internally represent and process 3D scenes.

This investigation matters for two reasons: it improves our understanding of how AI models interpret scenes, and it paves the way for further advances in image synthesis.