2024, issue 4, p. 90-109

Received 12.11.2024; Revised 26.11.2024; Accepted 03.12.2024

Published 18.12.2024; First Online 23.12.2024

https://doi.org/10.34229/2707-451X.24.4.9

Previous  |  FULL TEXT (in Ukrainian)  |  Next

 

UDC 519.688

Conceptual Model and NLP-System "Text to Image"

Pavlo Maslianko *,   Kate Pavlovska

Igor Sikorsky Kyiv Polytechnic Institute, Ukraine

* Corresponding author.

 

Introduction. Developing theoretical tools and instrumental means for transforming textual information into images is a pressing problem for many fields of human activity and for organizational systems of various purposes. The article proposes a conceptual model and an NLP system "Text to Image" based on the methodology of system engineering of Data Science systems, together with the architecture and software of an image generation system built on the latent diffusion model. It is proposed to improve the basic architecture of the latent diffusion model by using a diffusion transformer (DiT). Unlike approaches based on the U-Net architecture, DiTs operate on latent patches, which provides better scalability and increased performance.
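As an illustration (not the authors' implementation), the patchify step a diffusion transformer applies to a latent feature map can be sketched in NumPy; the latent shape (4, 32, 32) and patch size 2 are assumed example values:

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a latent feature map (C, H, W) into a sequence of flattened
    patch tokens, as a DiT does before applying self-attention."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    ph, pw = h // patch_size, w // patch_size
    # (C, ph, p, pw, p) -> (ph*pw, p*p*C): one token per spatial patch
    return (latent
            .reshape(c, ph, patch_size, pw, patch_size)
            .transpose(1, 3, 2, 4, 0)
            .reshape(ph * pw, patch_size * patch_size * c))

# A 4-channel 32x32 latent (e.g. from a VAE encoder), patch size 2
latent = np.random.rand(4, 32, 32)
tokens = patchify(latent, 2)
print(tokens.shape)  # (256, 16): 16x16 patches, each a 2*2*4 vector
```

Halving the patch size quadruples the token count, which is the scalability knob the DiT architecture exposes.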

The purpose of the work is to develop a scientifically grounded conceptual model and a system for transforming text descriptions into images, based on the methodology of system engineering, modern deep learning methods, and the Eriksson-Penker business profile.

Results. Estimation problems whose properties are regulated by a parameter have been constructed for the problem of placing objects in Euclidean space. The properties of the estimation problem are studied as a function of the parameter, and the limits on its value are shown within which the estimates remain adequate to the initial problem. Verification and validation of the developed NLP system "Text to Image" for converting text data into images were carried out. The generation results demonstrate accurate reproduction of key elements, which indicates a high-quality correspondence between the image and the text description. A comparative analysis of model performance showed that the TransformerLD system, although inferior to the Stable Diffusion and DALL-E 2 models in terms of FID and IS, remains competitive.
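For context, the FID metric used in the comparison above measures the Fréchet distance between Gaussians fitted to Inception features of real and generated images. A minimal NumPy sketch (the eigendecomposition-based matrix square root assumes well-conditioned positive-definite covariances; feature statistics here are illustrative):

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    # Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(s1 + s2 - 2 * (s1 @ s2)^{1/2})
    diff = mu1 - mu2
    # matrix square root of sigma1 @ sigma2 via eigendecomposition;
    # a product of two SPD matrices has real positive eigenvalues
    vals, vecs = np.linalg.eig(sigma1 @ sigma2)
    covmean = (vecs * np.sqrt(vals.astype(complex))) @ np.linalg.inv(vecs)
    return float((diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)).real)

mu = np.zeros(3)
cov = np.eye(3)
print(fid(mu, cov, mu, cov))                     # 0.0 for identical statistics
print(fid(mu, cov, np.array([1.0, 0, 0]), cov))  # 1.0: squared mean shift
```

Lower FID indicates generated images whose feature statistics are closer to the real data; IS, by contrast, rewards confident and diverse class predictions on the generated images alone.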

Conclusions. Constructing a dynamic branching tree and nonlinear estimates speeds up the search for the optimal solution, but depends significantly on the initial problem, which complicates the development of a general algorithm. The developed conceptual model and NLP system "Text to Image" enable an effective transformation of text data into images, which is a topical issue in the field of data visualization.

 

Keywords: system engineering, Data Science, NLP systems "Text to Image".

 

Cite as: Maslianko P., Pavlovska K. Conceptual Model and NLP-System "Text to Image". Cybernetics and Computer Technologies. 2024. 4. P. 90–109. (in Ukrainian) https://doi.org/10.34229/2707-451X.24.4.9

 


 

 

ISSN 2707-451X (Online)

ISSN 2707-4501 (Print)


 

 


 


V.M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine.