Articles related to the keyword

Vision Transformer


1.

Unlocking Book Genre from Covers: A Multimodal Approach to Book Genre Prediction (Scientific article, Ministry of Science)

Keywords: Book Cover Analysis, Book Genre Prediction, Multimodal Learning, Vision Transformer, Mamba

Views: 20 | Downloads: 19
In today's visually driven market, book cover design plays a crucial role in conveying a work's narrative and thematic essence. A book cover is a multimodal entity that combines visual and textual elements. Conventional recommendation systems have often overlooked the semantic richness of cover imagery, and prior work that incorporated textual information relied on optical character recognition (OCR) to extract text from covers. These raw tokens capture only a fraction of a cover's meaning and often miss deeper thematic and narrative cues. To address these limitations, we use the knowledge accumulated in vision-language models (VLMs) to derive a more comprehensive representation: VLM-generated descriptions of each cover are integrated into the system as an additional textual feature. Our enhanced corpus comprises 57,000 book covers across 30 genres (1,900 per genre), each paired with its raw image and a VLM-generated narrative summary. We fuse two state-of-the-art vision encoders, ViT and VisionMamba, with a text encoder that processes the VLM descriptions. Experimental results show a Top-1 accuracy of 63.31% and a Top-3 accuracy of 83.03%, a substantial improvement over the state-of-the-art baseline, underscoring the value of VLM-derived context in multimodal genre classification.
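The abstract reports Top-1 and Top-3 accuracy for a model that fuses image and text features. As a rough illustration only, the sketch below shows one common way such a pipeline could be wired and scored. The abstract does not say how the encoders are fused, so the concatenation in `fuse_features` is an assumption, and both function names are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence

# From the abstract: 30 genres with 1,900 covers each (57,000 covers in total).
NUM_GENRES = 30
COVERS_PER_GENRE = 1900


def fuse_features(
    vit_vec: Sequence[float],
    mamba_vec: Sequence[float],
    text_vec: Sequence[float],
) -> List[float]:
    """Late fusion by simple concatenation.

    ASSUMPTION: the abstract does not state the fusion strategy; concatenation
    is used here only because it is the simplest common choice.
    """
    return list(vit_vec) + list(mamba_vec) + list(text_vec)


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy per-class scores for three covers and three genres (made-up numbers).
    toy_scores = [
        [0.1, 0.7, 0.2],
        [0.5, 0.3, 0.2],
        [0.2, 0.3, 0.5],
    ]
    toy_labels = [1, 1, 0]
    print("Top-1:", top_k_accuracy(toy_scores, toy_labels, k=1))
    print("Top-2:", top_k_accuracy(toy_scores, toy_labels, k=2))
    print("Fused length:", len(fuse_features([0.0] * 4, [0.0] * 4, [0.0] * 2)))
```

With 30 genres, a Top-3 score counts a prediction as correct when the true genre is among the three highest-ranked classes, which is why the reported Top-3 figure is higher than the Top-1 figure.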