Zakraoui et al. propose a pipeline for the automatic visualization of stories written in natural language \cite{zakraoui2023pipeline}. The goal is to extract the semantic structure of a text and transform it into a sequence of images, specifically to support language learning and literacy development; producing visualizations that remain consistent and faithful to the text is, however, a challenging problem. The proposed pipeline consists of four main stages. First, the text is linguistically analyzed to extract the characters, scene objects, and events of the story, and complex sentences are simplified into structures that are easier to visualize. This step relies on scene graphs: graphical structures whose nodes represent objects and whose edges encode the relationships between them. Next, the visuals are generated. An event may be rendered as a single image or, for a more detailed representation, as multiple frames; particularly complex and rare actions are visualized more clearly with multiple images. The scene graph governs how events and characters are placed consistently across frames. Finally, a CLIP-like model checks the semantic match between text and images: it computes text-image similarity scores, ranks the candidate images, and selects the most appropriate ones to assemble the final story sequence. The authors report that the method is effective for story visualization and can be further improved.
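The final ranking stage described above can be illustrated with a minimal sketch. The cited paper does not publish its code, so the scene-graph triples, the embedding dimensionality, and the random stand-in vectors below are all hypothetical; a real system would obtain the embeddings from a CLIP-style text and image encoder. The sketch only shows the scoring logic: cosine similarity between a text embedding and each candidate image embedding, sorted best-first.

```python
import numpy as np

def rank_images(text_emb, image_embs):
    """Rank candidate images by cosine similarity to a text embedding,
    mimicking a CLIP-style text-image scoring step."""
    text = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ text          # cosine similarity per candidate image
    order = np.argsort(-scores)   # indices of images, best match first
    return order, scores[order]

# Hypothetical scene graph for one sentence: (subject, relation, object).
scene_graph = [("girl", "holds", "kite"), ("kite", "above", "beach")]

# Stand-in embeddings (assumption): 3 candidate images, 8-dim vectors.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)
image_embs = rng.normal(size=(3, 8))

order, scores = rank_images(text_emb, image_embs)
```

The returned `order` would drive the selection of the top-ranked image for each sentence when assembling the story sequence.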