UAVBench and UAVIT-1M:

A Low-Altitude UAV Vision-Language Benchmark and Instruction-Tuning Dataset for MLLMs

Institution Name

πŸŽ‰πŸŽ‰πŸŽ‰[NEW!] This is the first vision-language benchmark, instruction-tuning dataset, and multi-modal large language model baseline specifically tailored to low-altitude UAV scenarios.

The project web page is under construction...



πŸ“’What's New
  • This is an ongoing project; we will keep improving it.
  • [2025......] The complete evaluation code is coming soon!
  • [2025......] The detailed low-altitude UAV MLLMs model inference tutorial is coming soon!
  • [2025.05.13] πŸ”₯ 3 low-altitude UAV Multi-modal Large Language Model baselines are released!
  • [2025.05.13] The GeoChat-UAV model is released!
  • [2025.05.13] The MiniGPTv2-UAV model is released!
  • [2025.05.13] The LLaVA1.5-UAV model is released!
  • [2025.05.13] πŸ”₯ The UAVBench benchmark and the UAVIT-1M instruction-tuning dataset are released!

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides on natural images and satellite remote sensing images, allowing humans to hold a meaningful dialogue about given visual content. However, understanding low-altitude drone scenarios remains a challenge, even for advanced MLLMs. Existing benchmarks focus primarily on a few specific low-altitude visual tasks and therefore cannot fully assess the abilities of MLLMs in real-world low-altitude UAV applications. We introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction-tuning dataset, designed to evaluate and improve MLLMs' abilities on low-altitude UAV visual and vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image and region levels. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene low-altitude UAV images and about 2,000 distinct spatial resolutions across 11 tasks, e.g., image/region classification, image/region captioning, VQA, object detection, and visual grounding. UAVBench and UAVIT-1M feature purely real-world imagery under rich weather conditions, with manual sampling verification to ensure high quality. Our in-depth analysis of 10 state-of-the-art MLLMs on UAVBench reveals that existing MLLMs cannot generate accurate conversations about low-altitude visual content. Extensive experiments demonstrate that fine-tuning MLLMs on UAVIT-1M substantially narrows this gap. Our contributions pave the way for bridging the gap between current MLLMs and the demands of real-world low-altitude UAV applications.
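To make the instruction-tuning format concrete, the sketch below shows what a single UAVIT-1M-style sample could look like, along with a minimal structural check. This is purely illustrative: all field names ("image", "task", "bbox", "conversations") and the conversation layout are assumptions, not the released schema.

```python
# Hypothetical sketch of one UAVIT-1M-style instruction sample.
# Field names and layout are assumptions for illustration only;
# the released dataset may use a different schema.
sample = {
    "image": "uav/00001.jpg",        # path to a low-altitude UAV image
    "task": "region_vqa",            # one of the 11 task types, e.g. region-level VQA
    "bbox": [120, 48, 310, 255],     # region of interest in pixels (x1, y1, x2, y2)
    "conversations": [
        {"from": "human", "value": "How many vehicles are inside the marked region?"},
        {"from": "gpt", "value": "There are three vehicles in the region."},
    ],
}

def validate_sample(s):
    """Minimal structural check: required keys are present and the
    dialogue strictly alternates human turn -> model turn."""
    required = {"image", "task", "conversations"}
    if not required.issubset(s):
        return False
    turns = [t["from"] for t in s["conversations"]]
    return len(turns) > 0 and turns == ["human", "gpt"] * (len(turns) // 2)

print(validate_sample(sample))  # True for the well-formed sample above
```

A loader for the real dataset would apply a check like this per record before feeding conversations into an MLLM fine-tuning pipeline.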


Figure 1: UAVIT-1M supports 11 distinct tasks, spanning visual comprehension to vision-language reasoning, from image level to region level.

UAVBench and UAVIT-1M


β€’ Object Category


Figure 2: Category distribution in each task. Zoom in to view the specific categories and corresponding quantities.

β€’ Question Types of VQA Task


Figure 3: Distribution of question types in image-level and region-level VQA tasks of UAVBench and UAVIT-1M.

β€’ Distribution of Object Size and Target Counting Difficulty


Figure 4: (a) Distribution of object sizes in all region-level tasks. (b) Distribution of difficulty in the target counting task.

β€’ Spatial Resolution of Image and Distribution of Object Position


Figure 5: Image resolution and target position distributions in UAVIT-1M. Best viewed by zooming in.


Qualitative Comparisons


πŸ“ƒ BibTeX


@article{2025uavbench,
  title={{UAVBench and UAVIT-1M}: A Low-Altitude UAV Vision-Language Benchmark and Instruction-Tuning Dataset for MLLMs},
  author={},
  journal={},
  year={2025}
}
      

Acknowledgement

This website is adapted from the Nerfies project page, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.