We present a method that adds geometric details to an input (coarse) 3D mesh through text guidance. Our method handles a variety of input conditions. From left to right, the input meshes are an assembly of six primitive shapes, two low-poly meshes, and a mesh initialized by silhouette carving.
We propose a novel technique for adding geometric details to an input coarse 3D mesh guided by a text prompt. Our method is composed of three stages. First, we generate a single-view RGB image conditioned on the input coarse geometry and the input text prompt. This single-view image generation step allows the user to pre-visualize the result and provides stronger conditioning for the subsequent multi-view generation. Second, we use our novel multi-view normal generation architecture to jointly generate normal maps from six different views. The joint view generation reduces inconsistencies and leads to sharper details. Third, we optimize the mesh with respect to all views to produce a fine, detailed geometry as output. The resulting method produces an output within seconds and offers explicit user control over the coarse structure, pose, and desired details of the resulting 3D mesh.
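To make the first stage concrete, here is a minimal sketch of geometry-conditioned single-view RGB generation. The paper does not specify its base model; a depth ControlNet on Stable Diffusion stands in as an illustration, and the model IDs, the prompt, and the pre-rendered depth image of the coarse mesh are all assumptions of this sketch.

```python
# Illustrative stage-1 sketch: generate a single RGB view conditioned on the
# coarse mesh and a text prompt. A depth ControlNet stands in for the paper's
# geometry-conditioned diffusion model (the model choice is an assumption).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth rendering of the coarse input mesh (any renderer works; assumed here
# to already exist as an image file).
coarse_depth = Image.open("coarse_mesh_depth.png").convert("RGB")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The generated image lets the user pre-visualize the result and later serves
# as conditioning for the multi-view normal generation stage.
rgb_view = pipe(
    "a dragon with detailed scales",  # example text prompt
    image=coarse_depth,
    num_inference_steps=30,
).images[0]
rgb_view.save("single_view_rgb.png")
```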
Method overview. Our method consists of three stages: single-view generation, multi-view generation, and mesh refinement/optimization. Given an input mesh and an input text prompt, we first use a large-scale pre-trained diffusion model (highlighted in red) to generate an RGB image that respects the input conditions. Next, we use a multi-view diffusion model (highlighted in blue), which takes the generated RGB image and normal renderings of the input mesh as input and produces multi-view normal maps. Finally, we use the generated multi-view normals to supervise the refinement of the input mesh.
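The refinement stage can be illustrated with a minimal PyTorch sketch: optimize per-vertex displacements so that the mesh's face normals match target normals derived from the generated multi-view normal maps. The L2 loss, the plain Adam loop, and the placeholder mesh and targets are assumptions of this sketch; the paper's refinement presumably renders normals differentiably per view rather than supervising face normals directly.

```python
# Illustrative stage-3 sketch: normal-supervised mesh refinement.
import torch

def face_normals(verts: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    """Unit normal per triangle, computed from vertex positions."""
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = torch.cross(v1 - v0, v2 - v0, dim=-1)
    return torch.nn.functional.normalize(n, dim=-1)

# Placeholder coarse mesh: two triangles (a real coarse mesh goes here).
verts = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
faces = torch.tensor([[0, 1, 2], [2, 1, 3]])

# Placeholder targets standing in for normals lifted from the generated views.
target_n = torch.nn.functional.normalize(torch.randn(faces.shape[0], 3), dim=-1)

# Optimizing per-vertex offsets (rather than raw positions) keeps the refined
# mesh anchored to the coarse input geometry.
offsets = torch.zeros_like(verts, requires_grad=True)
opt = torch.optim.Adam([offsets], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = ((face_normals(verts + offsets, faces) - target_n) ** 2).mean()
    loss.backward()
    opt.step()
print("final normal loss:", loss.item())
```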
Qualitative results. Our method generates 3D meshes with finer geometric detail and higher visual quality than state-of-the-art methods.