Celebrity Recognition and VQA

The RC-MLLM model is built on Qwen2-VL using a novel method called RCVIT (Region-level Context-aware Visual Instruction Tuning) and is trained on the specially constructed RCMU dataset. Its core feature is Region-level Context-aware Multimodal Understanding (RCMU): the model understands the visual content of specific regions or objects within an image together with the textual information associated with them, where each region is identified by its bounding-box coordinates. This lets it respond to user instructions in a more context-aware manner. Simply put, RC-MLLM not only understands images but also integrates the text linked to specific objects in the image. It achieves strong performance on RCMU tasks and is suitable for applications such as personalized conversation.
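
To make the idea of pairing regions with text concrete, here is a minimal Python sketch of how bounding boxes and their associated context might be combined into a single prompt. The `Region` class, the `build_prompt` helper, the bracketed coordinate format, and the 0-1000 coordinate scale are all assumptions for illustration; the actual prompt format used by RC-MLLM may differ.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical container for one annotated region of an image."""
    name: str
    box: tuple[int, int, int, int]  # pixel coordinates (x1, y1, x2, y2)
    context: str                    # text associated with this region


def normalize_box(box, width, height, scale=1000):
    """Map pixel coordinates onto a resolution-independent grid.

    The 0..scale grid is an assumption made for this sketch.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    )


def build_prompt(regions, width, height, question):
    """Emit one line per region (box, name, context), then the question."""
    lines = []
    for region in regions:
        x1, y1, x2, y2 = normalize_box(region.box, width, height)
        lines.append(f"[{x1}, {y1}, {x2}, {y2}] {region.name}: {region.context}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

For example, calling `build_prompt` with one region per recognized person and the user's question yields a text block in which every piece of context is tied to a specific box, which is the kind of input region-level context-aware understanding relies on.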

📑 Region-Level Context-Aware Multimodal Understanding | 🤗 Models: RC-Qwen2VL-2b, RC-Qwen2VL-7b | 📁 Dataset | GitHub | 🚀 Personalized Conversation Demo

📌 Upload an image containing celebrities. The system recognizes them and answers your questions about them (VQA) with the RC-Qwen2-VL model, using Wikipedia-based information.

Upload Image

Question

Confidence Threshold (%)

Adjust the minimum confidence level for celebrity recognition

Recognition Result

RC-Qwen2-VL Answer

Person Information

Instructions

1. Upload an image containing celebrities.
2. Enter your question, for example:
   - "Who are the people in the image?"
   - "What achievements does the person on the left have?"
   - "What is the relationship between these people?"
3. Adjust the confidence threshold slider if needed (lower values will recognize more faces but might be less accurate).
4. Click the submit button to get the answer.
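
The effect of the confidence threshold can be sketched in a few lines of Python. This is an illustration only, not the demo's actual code: the `Detection` class, the `filter_detections` helper, and the default of 50 are hypothetical, and confidence is assumed to be expressed as a percentage.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical result of recognizing one face."""
    name: str
    confidence: float  # percent, 0-100 (assumed scale)


def filter_detections(detections, threshold_percent=50.0):
    """Keep detections at or above the threshold, most confident first."""
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    kept = [d for d in detections if d.confidence >= threshold_percent]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)
```

Lowering `threshold_percent` lets more candidates through, which matches the trade-off described above: more faces are recognized, but low-confidence matches are more likely to be wrong.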