1. Introduction

With the rapid development of generative AI, cross-modal content generation has become an active research topic in computer vision and natural language processing (NLP).

2. Method

Our framework consists of three core modules: (1) a Transformer-based text encoder; (2) a cross-modal alignment module; and (3) a conditional diffusion model.
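The data flow between the three modules can be sketched roughly as follows. This is only an illustrative toy, not the framework's implementation: every function name (`encode_text`, `align`, `sample`, `generate`), every dimension, and every update rule is an assumption made for the example. Each module is replaced by a simple stand-in (hashed token vectors instead of a Transformer, a fixed random projection instead of a learned alignment, and a hand-written update instead of a trained denoiser) so that only the composition of the stages is shown.

```python
import math
import random
from typing import List

Vector = List[float]


def encode_text(prompt: str, dim: int = 8) -> Vector:
    """Stand-in for module (1), the Transformer-based text encoder.

    Assumption: each token gets a deterministic pseudo-random vector,
    and the vectors are mean-pooled. A real encoder would use attention.
    """
    tokens = prompt.lower().split()
    pooled = [0.0] * dim
    for token in tokens:
        rng = random.Random(token)  # same token -> same vector
        for i in range(dim):
            pooled[i] += rng.gauss(0.0, 1.0)
    count = max(len(tokens), 1)
    return [value / count for value in pooled]


def align(text_vec: Vector, out_dim: int = 4) -> Vector:
    """Stand-in for module (2), the cross-modal alignment module.

    Assumption: a fixed random linear projection into a hypothetical
    conditioning space, followed by L2 normalization.
    """
    rng = random.Random(0)
    scale = 1.0 / math.sqrt(len(text_vec))
    weights = [[rng.gauss(0.0, scale) for _ in text_vec] for _ in range(out_dim)]
    projected = [sum(w * x for w, x in zip(row, text_vec)) for row in weights]
    norm = math.sqrt(sum(p * p for p in projected)) or 1.0
    return [p / norm for p in projected]


def sample(cond: Vector, steps: int = 10, seed: int = 0) -> Vector:
    """Stand-in for module (3), the conditional diffusion model.

    Starts from Gaussian noise and iteratively refines the sample.
    Assumption: a trained model would predict noise from (x, t, cond);
    here the update simply moves x toward the conditioning vector.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in range(steps, 0, -1):
        x = [xi + (ci - xi) / t for xi, ci in zip(x, cond)]
    return x


def generate(prompt: str) -> Vector:
    """Chains the three stages: text -> embedding -> condition -> sample."""
    return sample(align(encode_text(prompt)))
```

Calling `generate("a red bird on a branch")` returns a small vector that stands in for the generated asset. In a real system each stand-in would be a learned network, but the interfaces between the stages would look similar.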

3. Experimental Results

Extensive experiments on MS-COCO, CUB-200, and ImageNet show that our method improves the Fréchet Inception Distance (FID) by 12.3% over the baselines.
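For reference, FID compares Gaussian fits to Inception features of real and generated images, and lower values are better. The standard definition is given below, where the subscripts r and g denote real and generated image statistics. The second formula is one common way to read a percentage improvement, namely as a relative reduction; whether the 12.3% figure is computed this way is an assumption, since the text does not say.

```latex
% Standard FID between feature statistics (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g)
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% Assumed reading of "improvement": relative reduction against a baseline
\text{improvement} =
  \frac{\mathrm{FID}_{\text{baseline}} - \mathrm{FID}_{\text{ours}}}{\mathrm{FID}_{\text{baseline}}}
  \times 100\%
```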

4. Conclusion

This paper combines the semantic understanding of large language models (LLMs) with high-quality diffusion-based generation and proposes a cross-modal method for generating digital assets.