Anatomy-CoT: Teaching MLLMs to Reason in Radiology

Tongji University & ByteDance

A demonstration of the Anatomy-CoT framework in action, showcasing visually-grounded reasoning on a radiology image.

Abstract

Chain-of-Thought (CoT) has shown promise in enabling multimodal large language models to solve complex problems. However, CoT suffers from an over-reliance on textual cues and struggles to adapt general reasoning capabilities to highly specialized domains such as radiology. Radiologists, in real clinical practice, develop their expertise through structured pedagogical training that demands not only accurate predictions but also a clear, transparent alignment between textual reasoning and the underlying visual evidence. In this paper, we introduce Anatomy-CoT, a multi-step reasoning framework that follows real-world radiology pedagogical practices and incorporates visual grounding to enhance interpretability for radiologists. Anatomy-CoT is built on two core principles: (1) it employs structured, anatomy-centric reasoning paths that organize the entire thought process in a pedagogically coherent manner, and (2) it enforces visual grounding by representing anatomical concepts in an interleaved format that directly links textual reasoning to corresponding image regions. This approach not only bridges the modality gap but also ensures faithfulness to the visual evidence. To enable MLLMs to adopt this framework, we construct a large-scale instruction-tuning dataset, Anatomy-CoT-200K, comprising over 200,000 examples. We further introduce GRPO-MR, a reinforcement learning algorithm that enhances structured reasoning by supervising both the accuracy of anatomical grounding and the coherence of the specialized reasoning process. This reduces reasoning-path ambiguity and substantially improves the model’s final performance. Comprehensive evaluations show that Anatomy-CoT delivers transparent reasoning, achieves an 11.7% improvement over vanilla CoT, and generalizes robustly to out-of-domain radiology images.

Approach

Our approach is designed to systematically teach Multimodal Large Language Models (MLLMs) the Anatomy-CoT reasoning paradigm, transforming them into verifiable and transparent reasoners for radiology. This is achieved through a two-stage training strategy: (1) We first perform Interleaved CoT Supervised Fine-Tuning (SFT) on our custom-built Anatomy-CoT-200K dataset. This initial stage aligns the base model with our interleaved reasoning format, teaching it to generate structured analyses (<think>) and reflective reasoning steps (<rethink>) grounded in anatomical evidence. (2) We then apply GRPO-MR, our novel reinforcement learning algorithm, to refine the model's capabilities. This stage optimizes the policy using a composite reward signal that jointly evaluates the final answer's accuracy and the structural integrity of the reasoning process, including the textual fidelity and visual-grounding precision of its anatomical findings.
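To make the composite reward concrete, the sketch below shows one way such a signal could be computed. It is a minimal illustration under stated assumptions, not the paper's implementation: the <box> and <answer> markup, the exact-match answer check, the IoU-based grounding score, and the weights w_acc, w_fmt, and w_grd are all hypothetical; only the <think>/<rethink> tags come from the description above.

    import re

    def iou(a, b):
        # Intersection-over-union of two boxes in (x1, y1, x2, y2) format.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def composite_reward(response, ref_boxes, ref_answer,
                         w_acc=0.6, w_fmt=0.2, w_grd=0.2):
        # (a) Structural integrity: well-formed <think>/<rethink>/<answer> spans.
        r_fmt = float(all(re.search(fr"<{t}>.*?</{t}>", response, re.DOTALL)
                          for t in ("think", "rethink", "answer")))

        # (b) Grounding precision: match each predicted <box> against the
        # best-overlapping reference box by IoU.
        pred_boxes = []
        for span in re.findall(r"<box>(.*?)</box>", response):
            nums = [float(v) for v in re.findall(r"-?\d+\.?\d*", span)]
            if len(nums) == 4:
                pred_boxes.append(nums)
        if pred_boxes and ref_boxes:
            r_grd = sum(max(iou(p, r) for r in ref_boxes)
                        for p in pred_boxes) / len(pred_boxes)
        else:
            r_grd = 0.0

        # (c) Final-answer accuracy: exact match on the <answer> span.
        m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        r_acc = float(bool(m) and m.group(1).strip() == ref_answer.strip())

        return w_acc * r_acc + w_fmt * r_fmt + w_grd * r_grd

    # Example of an interleaved response (markup is illustrative):
    response = ("<think>The right lower zone shows increased opacity.</think> "
                "Finding: right lower lobe consolidation "
                "<box>[412, 508, 596, 684]</box> "
                "<rethink>The opacity obscures the hemidiaphragm, consistent "
                "with consolidation rather than effusion.</rethink> "
                "<answer>pneumonia</answer>")
    print(composite_reward(response,
                           ref_boxes=[[400, 500, 600, 690]],
                           ref_answer="pneumonia"))

Scoring every grounded finding rather than only the final answer is the point of this sketch: it mirrors the stated goal of supervising both anatomical-grounding accuracy and the coherence of the reasoning process.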

Why Does Grounding Matter?

Our ablation studies reveal that interleaving visual grounding is not just an auxiliary task, but a fundamental driver of reasoning quality. The resulting model exhibits distinct cognitive behaviors:

  • Hyper-Efficient Learning: As shown in the left figure, Anatomy-CoT surpasses vanilla GRPO after just 10 iterations of reinforcement learning (RL), demonstrating markedly higher data efficiency and a higher performance ceiling.
  • Sustained Visual Attention: The attention score analysis (right figure) confirms that generating explicit bounding boxes acts as a persistent anchor. It forces the model to maintain significantly higher attention on visual tokens throughout the generation process compared to vanilla GRPO.
  • Reduced Cognitive Drift: While vanilla GRPO rapidly loses focus on the image as the response grows, Anatomy-CoT's attention decays at a much slower rate. This ensures that the final conclusion is derived from active visual verification rather than language priors (a minimal measurement sketch follows the figure captions below).
RL Performance Comparison (left figure): Learning efficiency of Anatomy-CoT vs. vanilla GRPO under few-shot RL settings.

Attention Score Analysis (right figure): Average attention scores on image tokens during the reasoning phase.
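As a rough illustration of how such an attention analysis can be reproduced, the sketch below averages, at each decoding step, the attention that the newly generated token pays to the image tokens. It assumes a Hugging Face-style decoder whose generate() exposes per-step attentions; the image-token positions, the choice of the last layer, and averaging over heads are illustrative choices, not the paper's protocol.

    import torch

    @torch.no_grad()
    def attention_on_image_tokens(model, inputs, image_token_range,
                                  max_new_tokens=256):
        # Returns one averaged image-attention score per generated token.
        start, end = image_token_range  # positions of image tokens in the prompt
        out = model.generate(**inputs,
                             max_new_tokens=max_new_tokens,
                             output_attentions=True,
                             return_dict_in_generate=True)
        scores = []
        # out.attentions holds one tuple per decoding step; each element is a
        # per-layer tuple of tensors shaped [batch, heads, q_len, k_len].
        for step_attns in out.attentions:
            last_layer = step_attns[-1]                 # [1, heads, q, k]
            to_image = last_layer[0, :, -1, start:end]  # new token -> image tokens
            scores.append(to_image.mean().item())
        return scores

Plotting these per-token scores against response length is one way to visualize the slower attention decay described in the bullets above.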