
Re:Verse - Can Your VLM Read a Manga?

🌟 Oral Presentation
🏆 Best Paper Recommendation

Evaluating Vision Language Models' Understanding of Sequential Visual Storytelling

Figure: overview of the Re:Verse VLM manga reading evaluation framework.

Aaditya Baranwal¹†, Madhav Kataria², Naitik Agrawal³

Yogesh Singh Rawat¹, Shruti Vyas¹

¹University of Central Florida | ²Indian Institute of Technology Jodhpur | ³Indian Institute of Technology Varanasi

Abstract

Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion—core requirements for coherent story comprehension.

We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations.

Our methodology includes:
(i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text,
(ii) comprehensive evaluation across multiple reasoning paradigms including direct inference and retrieval-augmented generation, and
(iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations (see the sketch below).
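As a minimal sketch of what the cross-modal similarity analysis in (iii) could look like, the snippet below scores manga panel images against their aligned light-novel passages in a CLIP-style joint embedding space. The checkpoint, the panel/passage pairing, and the matched-vs-shuffled comparison are illustrative assumptions, not the paper's exact setup.

# Sketch: score panel images against aligned text in a joint embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def panel_text_similarity(panel_paths, aligned_texts):
    """Cosine similarity between each panel image and its aligned passage."""
    images = [Image.open(p).convert("RGB") for p in panel_paths]
    inputs = processor(text=aligned_texts, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # one score per (panel, passage) pair

Comparing scores for matched pairs against shuffled pairs gives a simple read on how (mis)aligned the joint representations are for narrative content.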

Applying this framework to the Re:Zero manga across 11 chapters with 308 annotated pages, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.

Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and a practical methodology for evaluating narrative intelligence, and provides actionable insights into how far multimodal models can move beyond basic recognition toward deep sequential understanding of discrete visual narratives.

Experiments and Curation

📊

Experiment 1 Results

Story & Summary Generation.

📈

Experiment 2 Results

Text-Box Detection, Classification and Association.

📉

Experiment 3 Results

Page Predictions and Visual Question Answering.

⚙️

Dataset Procurement

Data Preparation and Annotation Pipeline.

Key Highlights

Breakthrough insights into Vision Language Models' narrative understanding capabilities

" models lack story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference "

📚

Sequential Visual Storytelling

First comprehensive evaluation of VLMs on manga narrative understanding across 11 chapters with 308 annotated pages

🎯

Novel Evaluation Framework

Fine-grained multimodal annotation protocol linking visual elements to narrative structure through aligned light novel text

🔍

Cross-Modal Analysis

Systematic characterization of VLM limitations through cross-modal embedding analysis and retrieval-augmented assessment

Three Core Evaluation Axes

Generative storytelling, contextual dialogue grounding, and temporal reasoning assessment

💡

Critical Insights

Reveals fundamental gaps in temporal causality, cross-panel cohesion, and character consistency

😎

Cool Work

Bringing traction to the super fun and under-served domain of comics and manga with VLMs and AI

Re:Verse Dataset

📖

11 Chapters

Complete Re:Zero manga coverage

🎯

308 Pages

Fine-grained annotations

📝

Aligned Text

Light novel correspondence

1
📚

Data Collection

Re:Zero manga chapters with corresponding light novel text alignment

2
🎯

Annotation Protocol

Fine-grained multimodal annotation linking visual elements to narrative structure (one possible record shape is sketched after this pipeline)

3
🔍

Evaluation Framework

Three-axis assessment: storytelling, dialogue grounding, temporal reasoning

4

Cross-Modal Analysis

Embedding analysis revealing misalignments in VLM joint representations
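To make the annotation step above concrete, here is a hypothetical sketch of a single page-level annotation record. Every field name and type is an illustrative assumption, not the dataset's actual schema.

# Hypothetical shape of one page-level annotation record (illustrative only).
from dataclasses import dataclass, field

@dataclass
class TextBox:
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coords
    kind: str      # e.g. "dialogue", "thought", "narration", "sfx"
    speaker: str   # character the box is associated with
    text: str      # transcribed content

@dataclass
class PageAnnotation:
    chapter: int
    page: int
    panels: list[tuple[float, float, float, float]]  # panel bounding boxes
    text_boxes: list[TextBox] = field(default_factory=list)
    novel_span: str = ""  # aligned passage from the light novel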

Key Findings

Systematic evaluation reveals critical limitations in current VLM narrative understanding

📝

Generative Storytelling

Low

Inconsistent narrative generation across panels

💬

Dialogue Grounding

Moderate

Limited contextual dialogue understanding

Temporal Reasoning

Poor

Weak causal inference capabilities

👥

Character Consistency

Low

Difficulty maintaining character identity

Surface vs. Deep Understanding

VLMs excel at individual panel interpretation but fail at temporal causality and cross-panel cohesion

Critical Gap Identified

Character Consistency Issues

Models struggle with character identity tracking across extended manga sequences

Fundamental Limitation

Non-Linear Narrative Challenges

Poor performance on complex storytelling techniques like flashbacks and parallel timelines

Story-Level Intelligence Missing

Retrieval-Augmented Benefits

RAG approaches show improvements but don't solve core reasoning limitations

Partial Solution
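For intuition about the retrieval-augmented setup, a minimal sketch follows: rank stored light-novel passages by cosine similarity to a query embedding, then prepend the top hits to the VLM prompt. The embedding source and the prompt template are placeholder assumptions, not the paper's wording.

# Sketch: retrieval-augmented prompting over aligned light-novel passages.
import numpy as np

def top_k_passages(query_vec, passage_vecs, passages, k=3):
    """Return the k passages whose embeddings best match the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    best = np.argsort(p @ q)[::-1][:k]
    return [passages[i] for i in best]

def build_rag_prompt(question, retrieved):
    context = "\n".join(f"- {p}" for p in retrieved)
    return (f"Context from the aligned light novel:\n{context}\n\n"
            f"Using the manga pages and the context above, answer: {question}")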

Current VLMs demonstrate a critical gap between surface-level recognition and deep narrative reasoning.
While they excel at individual panel interpretation, they systematically fail at the temporal causality and cross-panel cohesion that are core requirements for coherent story comprehension.

Citation

If you use our work, please cite our paper with the following BibTeX entry:

📝 BibTeX Citation

@inproceedings{baranwal2025reverse,
  title     = {Re:Verse - Can your VLM read a Manga?},
  author    = {Aaditya Baranwal and Madhav Kataria and Naitik Agrawal and Shruti Vyas and Yogesh S Rawat},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision - Workshops (ICCV-W)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2508.08508}
}