StarNexus Publishing

Cross-Modal Digital Asset Generation Based on Large-Scale Language Models

1. Department of Computer Science and Technology, Tsinghua University, Beijing
2. Department of Electronic and Computer Engineering, HKUST, Hong Kong

Received: 2026-01-10; Revised: 2026-02-20; Accepted: 2026-03-01; Published: 2026-03-15

Keywords: Large Language Models; Cross-Modal Generation; Digital Assets; Diffusion Models; Text-to-Image

Abstract

This paper proposes a method for cross-modal digital asset generation based on large language models (LLMs). The method addresses the limitations of traditional generative models in cross-modal semantic alignment.

Full Text

1. Introduction

With the rapid development of generative AI, cross-modal content generation has become an active research topic in computer vision and natural language processing (NLP).

2. Method

Our framework consists of three core modules: (1) a Transformer-based text encoder; (2) a cross-modal alignment module; and (3) a conditional diffusion model.
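The data flow through the three modules can be illustrated with a minimal toy sketch. It is not the authors' implementation: every function name, dimension, and parameter below is a hypothetical stand-in. The text encoder is replaced by deterministic pseudo-embeddings, the alignment module by a fixed random linear projection, and the conditional diffusion model by a placeholder iterative refinement loop whose "noise predictor" is a hand-written stub rather than a trained network.

```python
import math
import random


def encode_text(prompt, dim=8):
    """Stand-in for a Transformer text encoder (assumption, not the paper's model).

    Each token gets a deterministic pseudo-embedding; the embeddings are mean-pooled.
    """
    tokens = prompt.lower().split()
    vec = [0.0] * dim
    for tok in tokens:
        rng = random.Random(tok)  # string seeds give reproducible values
        for i in range(dim):
            vec[i] += rng.gauss(0.0, 1.0)
    n = max(len(tokens), 1)
    return [v / n for v in vec]


def align(text_vec, latent_dim=4, seed=0):
    """Stand-in for the cross-modal alignment module.

    Projects the text vector into a hypothetical image latent space with a
    fixed random matrix, then L2-normalizes the result.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in text_vec] for _ in range(latent_dim)]
    proj = [sum(w * x for w, x in zip(row, text_vec)) for row in weights]
    norm = math.sqrt(sum(p * p for p in proj)) or 1.0
    return [p / norm for p in proj]


def generate(cond, steps=20, step_size=0.3, seed=0):
    """Placeholder for conditional diffusion sampling.

    Starts from Gaussian noise and repeatedly subtracts a stubbed noise
    estimate. A real model would use a trained, timestep-aware network here.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for _ in range(steps):
        eps_hat = [xi - ci for xi, ci in zip(x, cond)]  # stub noise prediction
        x = [xi - step_size * e for xi, e in zip(x, eps_hat)]
    return x


def run_pipeline(prompt):
    """Chain the three stages: encode, align, then conditionally generate."""
    return generate(align(encode_text(prompt)))
```

Calling `run_pipeline("a red bird on a branch")` returns a small latent vector. With this stub the output simply converges toward the conditioning vector, which is enough to show how the condition steers generation but says nothing about image quality.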

3. Experimental Results

Extensive experiments on MS-COCO, CUB-200, and ImageNet show that our method improves FID by 12.3% over the baselines.
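The text does not state how the 12.3% figure is computed. One common reading, assumed here, is a relative reduction in FID, where lower is better. The helper below is a hypothetical sketch of that calculation, and the input values in the usage line are made up for illustration; they are not results from this paper.

```python
def relative_fid_improvement(baseline_fid, method_fid):
    """Percentage reduction in FID relative to a baseline (lower FID is better).

    Assumes "improvement" means a relative reduction; the source does not
    confirm this definition.
    """
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return 100.0 * (baseline_fid - method_fid) / baseline_fid


# Illustrative inputs only: a baseline FID of 20.0 and a method FID of 17.54
# give an improvement of about 12.3 percent under this definition.
improvement = relative_fid_improvement(20.0, 17.54)
```

Under this definition a positive value means the method has a lower FID than the baseline, and a negative value means it is worse.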

4. Conclusion

This paper combines the semantic understanding of LLMs with high-quality diffusion-based generation in a cross-modal digital asset generation method.

Funding Information

National Natural Science Foundation of China (No. 62236005), RGC General Research Fund (No. 14201223)

Conflict of Interest Statement

All authors declare no conflict of interest.

Data Availability Statement

Data and code supporting this research are publicly available on Figshare: https://figshare.com/xxx
