LATTE3D - obtaining a 3D model from a textual description
LATTE3D (Large-scale Amortized Text-To-Enhanced 3D Synthesis) is NVIDIA's latest artificial intelligence model for transforming text into 3D, the third in a line that includes Magic3D and ATT3D. Each successor has improved on the previous model in both training speed and output quality. With ATT3D, NVIDIA began training a single model on many text prompts and many 3D assets at once, to account for the various ways a user might describe an object; this amortized approach trains faster than optimizing each prompt individually, as Magic3D does. LATTE3D also trains on multiple prompts - NVIDIA generated a set of 100,000 candidate prompts using ChatGPT - while improving the visual quality of the generated objects.
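The amortization idea described above, one shared model serving many prompts instead of a separate optimization per prompt, can be sketched with a toy example. This is a hypothetical illustration using a linear model and random embeddings, not NVIDIA's actual method or code:

```python
import numpy as np

# Toy sketch of amortized training: instead of optimizing one 3D asset per
# prompt, a single network maps a prompt embedding to asset parameters,
# so many prompts share one set of weights. (Hypothetical illustration only.)

rng = np.random.default_rng(0)
n_prompts, embed_dim, asset_dim = 50, 16, 8

# Stand-ins: prompt embeddings and per-prompt "ideal" asset parameters
prompts = rng.normal(size=(n_prompts, embed_dim))
true_map = rng.normal(size=(embed_dim, asset_dim))
targets = prompts @ true_map

W = np.zeros((embed_dim, asset_dim))  # shared (amortized) weights
lr = 0.01

def loss(W):
    # mean squared error over the whole prompt set at once
    return float(np.mean((prompts @ W - targets) ** 2))

initial = loss(W)
for _ in range(200):
    # one gradient step updates the weights for all 50 prompts jointly
    grad = 2 * prompts.T @ (prompts @ W - targets) / n_prompts
    W -= lr * grad
final = loss(W)

# After training, a new prompt needs only a forward pass (prompt @ W),
# not a fresh per-prompt optimization run.
```

The payoff of amortization is the last line: at inference time, generating for an unseen prompt is a single forward pass, which is why generation drops from an hour-long per-prompt optimization to seconds.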
Comparing demo assets from ATT3D and LATTE3D, the LATTE3D results are noticeably sharper and more detailed. They are still relatively low-resolution, but they have reached a level where they could serve as scene embellishments or even background assets.
LATTE3D is primarily a proof of concept: NVIDIA has not released the source code, and the model has been trained on only two categories of assets, animals and everyday objects. Its real significance is in demonstrating the pace of progress in text-to-3D technology and, by extension, how soon publicly available text-to-3D services might appear.
At the NVIDIA GTC 2024 conference, Sanja Fidler, the company's Vice President of AI Research, acknowledged that the quality "hasn't come close to what an artist could create," but she noted how far things have progressed since Google announced its innovative DreamFusion model at the end of 2022.
"A year ago, AI models took an hour to create 3D images of this quality, and now it takes between 10 and 12 seconds," she said. "Now we can get results orders of magnitude faster, making almost real-time text-to-3D generation accessible to creators across different industries."