Abstract

Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple daily tasks, such as reading the time from a clock, understanding a flowchart, or planning a route with a road map. In light of this, we design a multi-modal self-instruct strategy that utilizes large language models and their code capabilities to synthesize massive abstract images and visual reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark, constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs, such as GPT-4V and Llava, in abstract image understanding, spatial relation reasoning, and visual element induction. Besides, to verify the quality of our synthetic data, we fine-tune an LMM using 62,476 synthetic chart, table, and road map instructions. The results demonstrate improved chart understanding and map navigation performance, and also show potential benefits for other visual reasoning tasks.

Leaderboard on Our Benchmark-11k

We evaluate the performance of many representative LMMs on our benchmark, which contains all tasks. We observe that for these abstract images, even advanced LMMs like GPT-4V and Claude 3 achieve only 49.5% and 50.1% average accuracy across all tasks, leaving a significant gap to human-level performance (82.1%).

Rank | Name | Chart | Table | Road Map | Dashboard | Relation Graph | Flowchart | Visual Puzzles | Layout | Avg
- | Human* | 93.5 | 95.1 | 75.0 | 85.3 | 82.5 | 65.5 | 62.5 | 97.6 | 82.1

Introduction

We identify a significant gap between current LMMs and humans in understanding and visually reasoning about abstract images, such as maps, charts, and layouts. Utilizing LLMs and code, we design a multimodal self-instruct strategy to synthesize a diverse set of abstract images and reasoning instructions, providing valuable data for LMMs. We synthesize a benchmark of 11,193 high-quality abstract images covering eight common scenarios. Our benchmark reveals significant deficiencies even in advanced LMMs. Besides, we synthesize 62,476 chart, table, and road map instructions for fine-tuning, verifying the effectiveness of the synthesized data.

Approach Overview

Our multi-modal self-instruct is an LLM-driven data synthesis strategy capable of producing abstract images and aligned reasoning instructions for various daily scenarios, including road maps, dashboards, 2D planar layouts, charts, relation graphs, flowcharts, and visual puzzles.
Firstly, our strategy autonomously proposes a creative idea for a visual scenario, e.g., using a step-by-step flowchart to demonstrate how to attend an academic conference, or designing a road map. Then it generates detailed code to visualize this idea. After synthesizing the desired image, the LLM self-instructs multiple high-quality Q&A pairs for this visual content. The entire process is completed by the LLM with only a few demonstrations.
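As a rough illustration of this three-step loop, the sketch below strings the steps together. Here `call_llm` is a hypothetical wrapper around an LLM API, and the prompts and file handling are assumptions for clarity, not the paper's actual implementation.

```python
# Minimal sketch of the idea -> code -> Q&A synthesis loop (assumptions noted above).
import subprocess
import tempfile
from pathlib import Path


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError


def synthesize_example(scenario: str, demonstrations: str) -> dict:
    # Step 1: propose a creative idea for the visual scenario.
    idea = call_llm(
        f"{demonstrations}\nPropose one concrete idea for a {scenario} image."
    )

    # Step 2: generate plotting code that renders the idea, then execute it
    # to obtain the abstract image (the generated script saves the file itself).
    code = call_llm(f"Write Python code that draws this {scenario}: {idea}")
    script = Path(tempfile.mkdtemp()) / "render.py"
    script.write_text(code)
    subprocess.run(["python", str(script)], check=True)

    # Step 3: self-instruct Q&A pairs grounded in the idea and the code,
    # since together they fully specify the image content.
    qa_pairs = call_llm(
        f"Given this {scenario} ({idea}) drawn by:\n{code}\n"
        "Write several question-answer pairs with step-by-step rationales."
    )
    return {"idea": idea, "code": code, "qa": qa_pairs}
```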
We illustrate the entire image-text synthesis process with examples that include navigating with road maps, interpreting pie charts, solving visual puzzles, and following an operating workflow. For each scenario, we synthesize multiple questions, annotated answers, and rationales. For example, in the pie chart case, the LLM designs a multi-step math question about the difference between the largest and smallest categories; a worked sketch of that case follows.
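The snippet below is a minimal sketch of the pie chart case with made-up data: a short matplotlib script renders the chart, and the multi-step answer is computed directly from the same data that drew it, which is what makes the annotation reliable.

```python
# Illustrative pie-chart case; the category values are invented, not benchmark data.
import matplotlib.pyplot as plt

budget = {"Rent": 1200, "Food": 650, "Transport": 300, "Savings": 450}

plt.pie(list(budget.values()), labels=list(budget.keys()), autopct="%1.1f%%")
plt.title("Monthly Budget Breakdown")
plt.savefig("pie_chart.png")

# A multi-step question like the one described above, with its annotated
# answer computed from the underlying data rather than read off the image.
largest, smallest = max(budget.values()), min(budget.values())
question = "How much larger is the largest category than the smallest one?"
answer = largest - smallest  # 1200 - 300 = 900
print(question, answer)
```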

Fine-Tuning Results

In addition to constructing the benchmark, we fine-tune the Llava-1.5-7B model on the training sets of the chart, table, and map tasks, and compare its performance with other baselines.
We also evaluate whether Llava-our-62k can generalize to other benchmarks, especially tasks that differ significantly from the training data. The results show that our model generalizes to other types of visual reasoning tasks rather than merely fitting to the training scenarios.
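For illustration, a synthesized Q&A pair could be packaged as a visual-instruction record roughly like the sketch below; the field names follow a common LLaVA-style conversation format, but they are assumptions here rather than the paper's released data schema.

```python
# Hypothetical example of one fine-tuning record; field names and paths are illustrative.
import json

record = {
    "id": "chart_000001",
    "image": "charts/pie_chart.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nHow much larger is the largest category than the smallest one?",
        },
        {
            "from": "gpt",
            "value": "The largest category is Rent (1200) and the smallest is "
                     "Transport (300), so the difference is 900.",
        },
    ],
}

with open("train_chart.json", "w") as f:
    json.dump([record], f, indent=2)
```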

Case Study