Evaluating Vision Language Models' Understanding of Sequential Visual Storytelling
Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion—core requirements for coherent story comprehension.
We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations.
Our methodology includes:
(i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text,
(ii) comprehensive evaluation across multiple reasoning paradigms including direct inference and retrieval-augmented generation, and
(iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations.
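The cross-modal similarity analysis in (iii) can be sketched as follows: embed each panel and its aligned narrative text in a shared space, then check how often the aligned text is the panel's nearest neighbour by cosine similarity. This is a minimal illustration, not the paper's actual pipeline; the random embeddings and the `alignment_accuracy` helper are hypothetical stand-ins for features from a VLM's joint image/text encoder (e.g., a CLIP-style model).

```python
# Sketch of cross-modal similarity analysis: row i of each matrix is an
# aligned (panel, text) pair; we measure retrieval-style alignment.
import numpy as np

def alignment_accuracy(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Fraction of panels whose aligned text is their nearest neighbour
    under cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = img @ txt.T                         # pairwise cosine similarities
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(sims))))

rng = np.random.default_rng(0)
panels = rng.normal(size=(8, 64))              # hypothetical panel embeddings
texts = panels + 0.1 * rng.normal(size=(8, 64))  # lightly perturbed aligned texts
print(alignment_accuracy(panels, texts))
```

A score well below 1.0 on real model embeddings would indicate the kind of joint-representation misalignment the framework is designed to surface.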
Applying this framework to Re:Zero manga across 11 chapters with 308 annotated pages, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.
Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and a practical methodology for evaluating narrative intelligence, while providing actionable insights into the capacity of multimodal models for deep sequential understanding of discrete visual narratives beyond basic recognition.
Story & Summary Generation.
Text-Box Detection, Classification and Association.
Page Predictions and Visual Question Answering.
Data Preparation and Annotation Pipeline.
Key insights into Vision Language Models' narrative understanding capabilities
"models lack story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference"
First comprehensive evaluation of VLMs on manga narrative understanding across 11 chapters with 308 annotated pages
Fine-grained multimodal annotation protocol linking visual elements to narrative structure through aligned light novel text
Systematic characterization of VLM limitations through cross-modal embedding analysis and retrieval-augmented assessment
Generative storytelling, contextual dialogue grounding, and temporal reasoning assessment
Reveals fundamental gaps in temporal causality, cross-panel cohesion, and character consistency
Bringing traction to the fun and under-served domain of comics and manga in VLM and AI research
Complete Re:Zero manga coverage
Fine-grained annotations
Light novel correspondence
Re:Zero manga chapters with corresponding light novel text alignment
Fine-grained multimodal annotation linking visual elements to narrative structure
Three-axis assessment: storytelling, dialogue grounding, temporal reasoning
Embedding analysis revealing misalignments in VLM joint representations
Systematic evaluation reveals critical limitations in current VLM narrative understanding
Inconsistent narrative generation across panels
Limited contextual dialogue understanding
Weak causal inference capabilities
Difficulty maintaining character identity
VLMs excel at individual panel interpretation but fail at temporal causality and cross-panel cohesion
Models struggle with character identity tracking across extended manga sequences
Poor performance on complex storytelling techniques like flashbacks and parallel timelines
RAG approaches show improvements but don't solve core reasoning limitations
Current VLMs demonstrate a critical gap between surface-level recognition and deep narrative reasoning.
While they excel at individual panel interpretation, they systematically fail at the temporal causality and cross-panel cohesion that are core requirements for coherent story comprehension.
If you use our work, please cite our paper with the following BibTeX entry:
@inproceedings{baranwal2025reverse,
  title={Re:Verse - Can your VLM read a Manga?},
  author={Aaditya Baranwal and Madhav Kataria and Naitik Agrawal and Shruti Vyas and Yogesh S Rawat},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision - Workshops (ICCV-W)},
  year={2025},
  url={https://arxiv.org/abs/2508.08508}
}