SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion

¹Beihang University   ²The University of Hong Kong
[Figure: SVGFusion teaser]
Noteworthy characteristics of the SVGs generated by our new method include:
(a) primitive ordering aligned with human design principles, (b) a clear and systematic layering structure of vector primitives, and (c) high editability.
TL;DR: A native SVG generative model built on a Vector-Pixel Fusion Latent representation and Vector Space Diffusion Transformer, enabling versatile and high-quality SVG asset creation.

Abstract

Generating Scalable Vector Graphics (SVG) assets from text remains a significant challenge, largely because high-quality vector datasets are scarce and because of limitations in the scalable vector representations required to model intricate graphic distributions. This work introduces SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without relying on a text-based discrete language model or prolonged SDS optimization. The essence of SVGFusion is to learn a continuous latent space for vector graphics using a popular Text-to-Image framework. Specifically, SVGFusion consists of two modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). VP-VAE takes both SVGs and their corresponding rasterizations as input and learns a continuous latent space, whereas VS-DiT learns to generate a latent code within this space conditioned on the text prompt. Building on VP-VAE, a novel rendering sequence modeling strategy is proposed so that the latent space embeds knowledge of the construction logic of SVGs. This gives the model human-like design capabilities in vector graphics while systematically preventing occlusion in complex graphic compositions. Moreover, SVGFusion's capability can be continuously improved by exploiting the scalability of VS-DiT, that is, by adding more VS-DiT blocks. A large-scale SVG dataset is collected to evaluate the effectiveness of the proposed method. Extensive experiments confirm that SVGFusion outperforms existing SVG generation methods, achieving higher quality and better generalizability, and thereby establishing a novel framework for SVG content creation.


Methodology


[Figure: SVGFusion method overview]
An overview of our SVGFusion pipeline.
(a) Our pipeline begins with the neural representation of SVGs, where XML-defined SVG tensors are transformed by a learnable matrix to derive an SVG embedding (Sec. 3.1). (b) We propose the Vector-Pixel Fusion Variational Autoencoder (VP-VAE, Sec. 3.2), a transformer-based architecture that encodes vector embeddings together with pixel-level features into a latent vector space. The resulting latent vectors are then decoded by a transformer decoder, which mirrors the encoder, to reconstruct the vector graphics. (c) The Vector Space Diffusion Transformer (VS-DiT, Sec. 3.3) is then trained within the latent space constructed by the VP-VAE. Textual features extracted from the text prompt with CLIP are incorporated into each VS-DiT block. The generative capability of SVGFusion can be continuously enhanced by stacking additional VS-DiT blocks.
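To make step (a) more concrete, the following is a minimal sketch of how an XML-defined SVG might be flattened into a fixed-size numeric matrix before any learnable embedding is applied. It is not the authors' implementation: the command vocabulary (`COMMANDS`), the row layout (one command index followed by six coordinate slots), the padding value, and the function names `path_to_rows` and `svg_to_matrix` are all assumptions made for illustration. Only absolute `M`, `L`, `C`, and `Z` path commands with explicit command letters are handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical command vocabulary and argument counts (assumed, not from the paper).
COMMANDS = {"M": 0, "L": 1, "C": 2, "Z": 3}
ARG_COUNT = {"M": 2, "L": 2, "C": 6, "Z": 0}
MAX_ARGS = 6
PAD_ROW = [-1] + [0.0] * MAX_ARGS  # assumed padding row: index -1, zero coordinates

# Matches a supported command letter or a (possibly signed, decimal) number.
TOKEN_RE = re.compile(r"[MLCZ]|-?\d*\.?\d+(?:[eE][-+]?\d+)?")


def path_to_rows(d):
    """Convert a path 'd' string into rows of [command_index, 6 coordinate slots]."""
    tokens = TOKEN_RE.findall(d)
    rows, i = [], 0
    while i < len(tokens):
        cmd = tokens[i]
        if cmd not in COMMANDS:
            # Implicit repeated commands are not supported in this sketch.
            raise ValueError(f"expected a command letter, got {cmd!r}")
        n = ARG_COUNT[cmd]
        args = [float(t) for t in tokens[i + 1:i + 1 + n]]
        if len(args) != n:
            raise ValueError(f"command {cmd} needs {n} arguments")
        # Unused coordinate slots are zero-filled so every row has the same width.
        rows.append([COMMANDS[cmd]] + args + [0.0] * (MAX_ARGS - n))
        i += 1 + n
    return rows


def svg_to_matrix(svg_text, max_len=16):
    """Flatten all <path> elements into a max_len x 7 matrix, truncating or padding."""
    root = ET.fromstring(svg_text)
    rows = []
    # Document order is also SVG paint order: later paths are drawn on top.
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "path":
            rows.extend(path_to_rows(el.get("d", "")))
    rows = rows[:max_len]
    rows += [list(PAD_ROW) for _ in range(max_len - len(rows))]
    return rows


EXAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
  <path d="M 0 0 L 10 0 L 10 10 Z" fill="#ff0000"/>
  <path d="M 2 2 C 3 1 7 1 8 2" fill="none" stroke="#000"/>
</svg>"""

matrix = svg_to_matrix(EXAMPLE_SVG, max_len=8)
for row in matrix:
    print(row)
```

In this sketch the row order follows the document order of the paths, which is also the order in which an SVG renderer paints them. A learnable embedding layer would then map each row to a vector; that step, along with attributes such as fill and stroke, is omitted here.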

Experiments


[Figure: qualitative comparison]
Qualitative Comparison of SVGFusion and Existing Text-to-SVG Methods. The target SVGs are in the emoji style. For the optimization-based approaches, we use prompt modifiers to encourage the appropriate style: "minimal flat 2D vector icon, emoji icon, lineal color, on a white background, trending on ArtStation." Note that although the results generated by optimization-based methods have high visual quality, their SVGs are difficult to decompose for further editing.

Citation

@article{xing2024svgfusion,
  title={SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion},
  author={Xing, Ximing and Hu, Juncheng and Zhang, Jing and Xu, Dong and Yu, Qian},
  journal={arXiv preprint arXiv:2412.10437},
  year={2024}
}