MedGRIM: Enhanced Zero-shot Medical VQA using prompt-embedded Multimodal GraphRAG

Shiv Nadar University Chennai

Submitted to CVPR 2025
Links will be updated on acceptance


Med-GRIM is an interactive medical AI agent with a transparent thought process.

Abstract

An ensemble of trained multimodal encoders and Vision-Language Models (VLMs) has become a standard approach for Visual Question Answering (VQA) tasks. However, such naive models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings, inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset will be released on acceptance.


Contributions

  • We propose a novel multimodal representation learning architecture that advances performance on state-of-the-art datasets in multimodal learning.
  • We introduce DermaGraph, a multimodal, graph-structured dataset designed for RAG tasks in dermatology, which also supports unimodal input scenarios.
  • We develop a zero-shot learning pipeline specifically tailored for medical query processing, enabling accurate and efficient responses to medical inquiries.
  • We design a graph filtering mechanism for improved diagnostic accuracy in medical applications.
Model Architecture

Med-GRIM: Method

Med-GRIM integrates multimodal inputs—such as images and descriptions—through a series of specialized modules, including BIND, graph retrieval layers, and prompt injection. The model first assesses possible conditions and ranks them by probability, then dynamically retrieves relevant data and refines its responses iteratively. This approach allows it to present condition-agnostic insights and tailor responses based on user feedback. Through iterative filtering, Med-GRIM engages users with clarifying questions, adapting its answers to specific input cues, as shown in the step-by-step reasoning for diagnosing conditions.
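The retrieval stage above can be sketched in a few lines. The snippet below is an illustrative toy, not the released implementation: candidate conditions are ranked by cosine similarity between a joint (image + text) embedding and per-condition embeddings, and graph context is then collected for the top-ranked candidates. All data, embeddings, and function names here are hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_conditions(query_emb, condition_embs, top_k=2):
    """Rank candidate conditions by similarity to the joint query embedding."""
    scored = sorted(
        condition_embs.items(),
        key=lambda kv: cosine(query_emb, kv[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]

def retrieve_context(graph, conditions):
    """Collect the attribute nodes linked to each ranked condition."""
    return {c: graph.get(c, []) for c in conditions}

# Toy knowledge graph: condition -> linked attribute nodes (illustrative).
graph = {
    "eczema":    ["redness", "itching", "topical steroids"],
    "psoriasis": ["scaling", "redness", "topical steroids"],
    "urticaria": ["wheals", "itching", "antihistamines"],
}
# Toy 2-d embeddings standing in for BIND's joint representations.
embs = {"eczema": [0.9, 0.1], "psoriasis": [0.7, 0.4], "urticaria": [0.1, 0.9]}

top = rank_conditions([1.0, 0.2], embs)       # -> ["eczema", "psoriasis"]
context = retrieve_context(graph, top)
```

In the full pipeline, the retrieved context would then be injected into the prompt of a small language model, and the ranking refined as the user answers clarifying questions.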




Prompt Templates Used

Prompt engineering emphasizes structuring and framing instructional prompts to align closely with the patterns and context the model was trained on. Leveraging this approach, we employ a custom-designed prompt template tailored specifically to our application.

Some essential considerations when creating an effective prompt template include:

  • Be as specific as possible: Providing clear, specific prompts reduces ambiguity, enabling the AI to better understand the query and its context.
  • Specify the level of detail required: Indicate preferences for the response format, such as using bullet points, specifying the number of questions, or defining paragraph limits.
  • Offer positive guidance: Focus on explaining what the LLM should do, rather than what it should not do, to maintain clarity and direction.
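A template following the three guidelines above might look like the sketch below. This is a hypothetical example for illustration; the exact template used by Med-GRIM is not reproduced here.

```python
# Illustrative prompt template (not the one shipped with Med-GRIM).
PROMPT_TEMPLATE = (
    "You are a dermatology assistant.\n"
    "Candidate conditions (ranked): {conditions}\n"
    "Retrieved facts: {facts}\n"
    "Task: answer the patient's question below.\n"
    "- Respond in at most 3 bullet points.\n"    # level of detail specified
    "- Base your answer only on the retrieved facts.\n"  # specificity
    "- If key details are missing, ask one clarifying question.\n"  # positive guidance
    "Question: {question}\n"
)

prompt = PROMPT_TEMPLATE.format(
    conditions="eczema, psoriasis",
    facts="redness; itching; topical steroids",
    question="Why is my skin red and itchy?",
)
```

Note that every instruction states what the model should do (bullet points, grounding, one clarifying question) rather than what it should avoid.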


Demo Video



Dataset Overview

The edges in DermaGraph illustrate the semantic relationships across nodes, linking dermatological conditions with overlapping attributes such as shared symptoms, similar underlying causes, or common treatment strategies. For instance, conditions with symptoms like redness or inflammation may be grouped together, creating meaningful clusters that aid in understanding correlations between different conditions. Similarly, treatment strategies that apply to multiple conditions, such as the use of topical steroids or antihistamines, establish additional connections between nodes. This interconnected design transforms the dataset into a dynamic knowledge graph, where the relationships between nodes provide rich context for various tasks.
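The shared-attribute linking described above can be sketched as follows. This is a minimal toy, assuming a simple record schema (`symptoms`, `treatments`) that is illustrative rather than the actual DermaGraph format: two condition nodes receive an edge whenever they share at least one symptom or treatment, and the shared attributes label the edge.

```python
from itertools import combinations

# Toy condition records; field names and data are illustrative only.
records = {
    "eczema":    {"symptoms": {"redness", "itching"}, "treatments": {"topical steroids"}},
    "psoriasis": {"symptoms": {"redness", "scaling"}, "treatments": {"topical steroids"}},
    "urticaria": {"symptoms": {"wheals", "itching"},  "treatments": {"antihistamines"}},
}

def build_edges(records):
    """Link two condition nodes whenever they share a symptom or treatment.

    Returns a dict mapping (condition_a, condition_b) -> sorted shared attributes,
    which serve as the edge labels of the knowledge graph.
    """
    edges = {}
    for a, b in combinations(records, 2):
        shared = ((records[a]["symptoms"] & records[b]["symptoms"])
                  | (records[a]["treatments"] & records[b]["treatments"]))
        if shared:
            edges[(a, b)] = sorted(shared)
    return edges

edges = build_edges(records)
# eczema–psoriasis share "redness" and "topical steroids";
# eczema–urticaria share "itching"; psoriasis–urticaria share nothing.
```

Clusters then emerge naturally: conditions with many labeled edges between them form the semantically related groups the dataset overview describes.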





Qualitative Result

