
The AI CAD Frontier
The current state of generative modeling for 3D CAD, and how SGS-1 compares
We recently announced SGS-1, our state-of-the-art generative model for 3D CAD. Existing generative CAD/3D architectures lie on a frontier, trading off geometric complexity against editability. In this post, we'll briefly touch on the strengths and weaknesses of various existing methods, and show how SGS-1 achieves results beyond the current frontier.
Chart showing the frontier for image-to-3D generation. SGS-1 is a large step towards greater editability and complexity for CAD generation.
In the chart above, we show the performance of various image-to-3D methods along two axes: geometric complexity and parametric editability - how editable (and therefore useful) the outputs are in a traditional CAD workflow. We focus specifically on image-to-3D methods for a few reasons: CAD generation isn't useful unless it is controllable; 3D design is a visual process; text is an imprecise medium; sketches and dimensioned drawings are common in engineering design; and models that take image input compose easily with other models and methods (text-to-image, renders of 3D scans or meshes). Three main classes emerge: two classes of CAD generative models, and a third class of shape-based methods:
- Autoregressive sketch/extrude models (or VLMs) where, given an image of an object, the model produces sketch and extrude commands (or corresponding code) that CAD software can convert into geometry (see the illustrative snippet after this list)
- Image- or sketch-conditioned B-rep diffusion models, which directly produce B-rep surfaces and edges that are then stitched and postprocessed into a valid B-rep solid
- 3D shape diffusion models, which produce implicit SDFs or occupancy grids which are then extracted into meshes
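To make the first category concrete, here is a minimal sketch of the kind of sketch/extrude program such a model might emit, written as CadQuery-style Python. The part and its dimensions are invented for illustration; this is not output from any model discussed in this post.

```python
# Hypothetical sketch/extrude sequence for a simple mounting plate (CadQuery);
# an autoregressive model would emit a program like this token by token.
import cadquery as cq

plate = (
    cq.Workplane("XY")
    .rect(60.0, 40.0)                        # sketch: outer rectangle on the XY plane
    .extrude(8.0)                            # extrude the sketch into a solid plate
    .faces(">Z")                             # select the top face
    .workplane()                             # start a new sketch on it
    .rect(44.0, 24.0, forConstruction=True)  # construction rectangle locates the holes
    .vertices()                              # hole centers at its four corners
    .hole(5.0)                               # drill four through-holes
)

cq.exporters.export(plate, "plate.step")     # a B-rep solid, editable in CAD software
```

Because the program itself is the output, a user can rerun it with a different plate thickness or hole diameter - exactly the kind of editability that mesh-only outputs lack.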
The first category of model produces the most useful outputs for engineering. Because the output is the full sequence of CAD commands used to produce the geometry, these models make "autocomplete" straightforward to support, and they respect input dimensions the most faithfully. A relevant recent work here is CAD-Coder [1].
However, these models must establish the most difficult correspondence: given an image, they need to produce a valid sequence of CAD commands - operations that correspond to a spatial output after processing by the CAD kernel, but that are not themselves natively spatial. Recent work has shown that VLMs have poor spatial understanding and "cheat" by memorizing relationships in their training data [2, 3]. Additionally, data in this domain is severely limited: most works use the DeepCAD dataset [4] or some augmented version of it. DeepCAD supports a very limited set of CAD operations, contains a large number of duplicates, and has very little data diversity. Newer datasets with more complex operations will be needed to make progress with these approaches - WHUCAD [5] is a good start, and we at Spectral Labs plan to release our own dataset to the research community in the near future.
Architecture diagram from CAD-Coder, a VLM-based method that predicts CAD command operations.
The second category of model directly predicts B-rep geometry (surfaces and curves represented by parametric equations) - these are mainly diffusion models. The canonical work here is BrepGen [6], with many successor works that improve on it along various dimensions: fewer models, better conditioning, better B-rep encodings, and other changes [7, 8, 9].
These models can generate more complex geometry than autoregressive methods, but they still generalize badly at test time. In our tests we only got valid results for modes of the data distribution (hex nuts, etc.); outputs are still very simple, with complete failures on inputs of even medium complexity. Data here is also quite limited: most works start from the ABC dataset [10], deduplicate it, and then filter to minimum and maximum face and edge counts, roughly as sketched below.
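As a rough illustration of that preprocessing step (not the exact pipeline of any cited work), the filter might look like the following, assuming ABC-style STEP files and using the open-source CadQuery library to parse them; the thresholds are arbitrary examples.

```python
# Illustrative only: keep an ABC-style STEP file if its face/edge counts fall
# inside chosen complexity bounds.
import cadquery as cq

def keep_for_training(step_path: str,
                      min_faces: int = 3,
                      max_faces: int = 50,
                      max_edges: int = 300) -> bool:
    """Return True if the first solid in the STEP file is within the bounds."""
    shape = cq.importers.importStep(step_path).val()
    n_faces = len(shape.Faces())
    n_edges = len(shape.Edges())
    return min_faces <= n_faces <= max_faces and n_edges <= max_edges
```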
Architecture diagram from HoLa-Brep, which uses a VAE to encode parametric B-rep surfaces and curves, and then uses a latent diffusion model to denoise noised latents conditioned on image input
The last category of model predicts implicit signed distance fields or occupancy grids - these are also mainly diffusion models. These models have seen great improvements over the last year, and the SOTA models are able to generate very complex 3D geometry very fast [11, 12].
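The mesh-extraction step these models share can be sketched as follows; the analytic sphere SDF below is a stand-in for the learned latent a real model would decode.

```python
# Minimal sketch: extract a triangle mesh from an implicit SDF sampled on a grid.
# Real shape diffusion models decode a learned latent; here we use a toy sphere.
import numpy as np
from skimage import measure

grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5          # signed distance to a sphere

# Marching cubes at the zero level set yields vertices and triangle faces,
# i.e. a plain mesh with no parametric structure attached.
verts, faces, normals, values = measure.marching_cubes(sdf, level=0.0)
```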
However, the outputs are not parametric at all and cannot be used by engineers in their workflows. They are "dumb" meshes rather than B-reps or command sequences: they can be 3D printed, but they cannot be edited in traditional engineering workflows or manufactured - a good work outlining this is the position paper "You Can't Manufacture a NeRF" [13]. Data here is the most abundant, with the Objaverse-XL dataset [14] serving as the starting point for most models.
Architecture diagram from a shape diffusion model, which uses a shape VAE to encode the shape geometry and then uses a latent diffusion model to denoise latents conditioned on image input
SGS-1 is a breakthrough research contribution because it offers a better complexity/editability tradeoff than was possible before.
SGS-1 generalizes much more effectively at test time than existing B-rep generative models, leading to greater success rates on more diverse and complex geometry. In the graphic below, we show comparisons against the best available image-conditioned B-rep generative model, HoLa-Brep. To be as fair as possible, we only show examples where HoLa-Brep produced a valid, closed, watertight solid. Our outputs are constructed with a combination of an in-house implementation and an open-source geometry kernel, making it straightforward to convert them for use in any commercial CAD software such as Fusion 360.
Unlike existing B-rep generative models, SGS-1 does not represent every surface with B-spline geometry - it produces analytic 3D primitives (planes, cylinders, cones, spheres, tori) where appropriate, meaningfully improving the direct-modeling editability of its outputs in CAD software. For especially complex surfaces and edges we do use B-splines as the representation, and doing so accurately and correctly is an ongoing area of research for us.
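As a loose illustration of why analytic primitives help direct editing (this toy representation is not SGS-1's internal format), compare a cylindrical face, where the radius is a single parameter a CAD user can change, with an equivalent B-spline patch, which exposes only a grid of control points:

```python
# Toy face representations for illustration only - not SGS-1's internal format.
from dataclasses import dataclass
import numpy as np

@dataclass
class CylindricalFace:
    # Analytic primitive: direct modeling can simply edit `radius`.
    axis_origin: tuple
    axis_direction: tuple
    radius: float

@dataclass
class BSplineFace:
    # Freeform patch: the same cylinder becomes a control-point grid with
    # knots and weights, and no single parameter that means "the radius".
    control_points: np.ndarray  # shape (n_u, n_v, 3)
    knots_u: np.ndarray
    knots_v: np.ndarray
    weights: np.ndarray
```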
Research is a collective process of pushing the frontier forward. We contribute SGS-1 to this effort as a step beyond existing capabilities, and plan to make additional contributions in the form of evaluation benchmarks, datasets, and additional models. We are actively working on SGS-2, which will continue to improve complexity and editability while introducing new capabilities. This is a significant research effort, and we are growing the team with more researchers who are excited to work on this problem - an important step towards engineering AGI. If you would like to push the frontier forward with us, contact us through this form or at jobs@spectrallabs.ai!
Citations
1. Doris, Anna C., et al. "CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation." arXiv preprint arXiv:2505.14646 (2025).
2. Vo, An, et al. "Vision Language Models are Biased." arXiv preprint arXiv:2505.23941 (2025).
3. Qi, Jianing, et al. "Beyond semantics: Rediscovering spatial awareness in vision-language models." arXiv preprint arXiv:2503.17349 (2025).
4. Wu, Rundi, Chang Xiao, and Changxi Zheng. "Deepcad: A deep generative network for computer-aided design models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
5. Fan, Rubin, et al. "A parametric and feature-based CAD dataset to support human-computer interaction for advanced 3D shape learning." Integrated Computer-Aided Engineering 32.1 (2025): 75-96.
6. Xu, Xiang, et al. "Brepgen: A b-rep generative diffusion model with structured latent geometry." ACM Transactions on Graphics (TOG) 43.4 (2024): 1-14.
7. Lee, Mingi, et al. "BrepDiff: Single-Stage B-rep Diffusion Model." Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 2025.
8. Liu, Yilin, et al. "Hola: B-rep generation using a holistic latent representation." ACM Transactions on Graphics (TOG) 44.4 (2025): 1-25.
9. Fan, Jiajie, et al. "NeuroNURBS: Learning Efficient Surface Representations for 3D Solids." arXiv preprint arXiv:2411.10848 (2024).
10. Koch, Sebastian, et al. "Abc: A big cad model dataset for geometric deep learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
11. Zhao, Zibo, et al. "Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation." arXiv preprint arXiv:2501.12202 (2025).
12. Lai, Zeqiang, et al. "Unleashing vecset diffusion model for fast shape generation." arXiv preprint arXiv:2503.16302 (2025).
13. Kimmel, M. A., et al. "Position: You Can't Manufacture a NeRF." Forty-second International Conference on Machine Learning Position Paper Track.
14. Deitke, Matt, et al. "Objaverse-xl: A universe of 10m+ 3d objects." Advances in Neural Information Processing Systems 36 (2023): 35799-35813.