ECCV 2026

DART

Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

Problem

Zero-shot grounding breaks when a query needs temporal reasoning.

Existing zero-shot video temporal grounding (VTG) methods often score each frame against the query independently. This works for simple events but misses event order and causal structure in multi-stage queries.

DART couples query-conditioned keyframe selection with a routing signal derived from the spectrum of a determinantal point process (DPP) kernel. Simple queries take a fast direct path, while complex queries use Temporal Markup Prompting for structured reasoning.

Query: "He dismounts after flipping and lands on the mat." The correct segment covers flipping, dismounting, and landing.

Method

DART decides when to reason.

The method first builds compact temporal evidence, then routes each query by estimated difficulty, and finally localizes the event with either a fast path or a slow reasoning path.

Figure 2 from the DART paper, showing query-conditioned DPP, spectral difficulty routing, and dual-path temporal localization.

Query-Conditioned DPP

DART selects frames that are both relevant to the query and diverse across time, so the large vision-language model (LVLM) sees the key stages of the event rather than redundant frames.
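To make the relevance-plus-diversity idea concrete, here is a minimal sketch of one common way to condition a DPP on a query: a quality-diversity kernel with greedy log-determinant selection. The function names, the exponential relevance term, and the greedy inference are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def query_conditioned_kernel(frame_feats, query_feat):
    """Build L = diag(q) S diag(q), with q = query relevance and S = frame similarity.

    Assumption: relevance is exp(cosine similarity to the query); the paper may
    define the quality term differently.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = query_feat / np.linalg.norm(query_feat)
    relevance = np.exp(f @ t)      # per-frame quality score
    similarity = f @ f.T           # cosine similarity between frames
    return relevance[:, None] * similarity * relevance[None, :]

def greedy_dpp_select(L, k):
    """Greedily add the frame that most increases log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            # A non-positive determinant means frame i is redundant with the set.
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return sorted(selected)  # return indices in temporal order
```

Near-duplicate frames make the submatrix almost singular, so the greedy step prefers a frame from a different stage of the event even when its relevance is similar.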

Spectral Routing

The spectral entropy of the DPP kernel estimates how structured the selected temporal evidence is, and DART routes hard queries to structured reasoning.
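A hedged sketch of how such a routing signal could be computed: take the eigenvalues of the kernel, normalize them into a distribution, and measure its Shannon entropy. The threshold value, the normalization by log n, and the rule that higher entropy means a harder query are all assumptions made for this example.

```python
import numpy as np

def spectral_entropy(kernel):
    """Shannon entropy of the normalized eigenvalue spectrum of a symmetric PSD kernel."""
    eigvals = np.linalg.eigvalsh(kernel)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negatives from round-off
    total = eigvals.sum()
    if total <= 0:
        return 0.0
    p = eigvals / total
    p = p[p > 0]                             # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def route(kernel, threshold=0.5):
    """Return 'slow' or 'fast'.

    Assumed rule: entropy normalized to [0, 1] above `threshold` sends the
    query to the slow reasoning path. The threshold is a hypothetical value.
    """
    n = kernel.shape[0]
    h = spectral_entropy(kernel) / np.log(n) if n > 1 else 0.0
    return "slow" if h > threshold else "fast"
```

Under this rule, a kernel dominated by one eigenvalue (the selected frames all look alike) has entropy near zero and takes the fast path, while a flat spectrum (many distinct stages) takes the slow path.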

Temporal Markup Prompting

The slow path asks the LVLM to analyze the event, tag frame roles, and extract temporal boundaries from the marked sequence.
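The analyze, tag, and extract steps can be sketched as a prompt builder plus a parser for the marked sequence. The tag vocabulary (`BEFORE`, `EVENT`, `AFTER`), the bracketed frame format, and both function names are hypothetical; the LVLM call itself is left out, and the paper's actual markup may differ.

```python
import re

def build_markup_prompt(query, timestamps):
    """Compose a prompt asking the model to analyze the event and tag each frame."""
    frames = "\n".join(f"[frame {i}] at {t:.1f}s" for i, t in enumerate(timestamps))
    return (
        f"Query: {query}\n"
        "Step 1: Describe the stages of the event in order.\n"
        "Step 2: Tag every frame as BEFORE, EVENT, or AFTER, "
        "one per line, in the form '[frame i] ROLE'.\n"
        f"Frames:\n{frames}\n"
    )

# Matches lines such as "[frame 3] EVENT" in the model's marked output.
TAG_PATTERN = re.compile(r"\[frame (\d+)\]\s*(BEFORE|EVENT|AFTER)")

def extract_boundaries(marked_text, timestamps):
    """Return (start, end) in seconds spanned by EVENT frames, or None if none are tagged."""
    event_times = [
        timestamps[int(i)]
        for i, role in TAG_PATTERN.findall(marked_text)
        if role == "EVENT" and int(i) < len(timestamps)
    ]
    if not event_times:
        return None
    return min(event_times), max(event_times)
```

For example, if the model tags frames 1 and 2 as `EVENT` and their timestamps are 4.0 s and 8.0 s, `extract_boundaries` returns `(4.0, 8.0)`.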

Results

DART improves zero-shot VTG on IID and OOD splits.

The tables below report the main IID and OOD mIoU numbers from the paper.

Method    Charades mIoU    ActivityNet mIoU
TFVTG     44.51            34.10
TAG       45.69            36.55
DART      48.93            39.89

Split                TAG mIoU    DART mIoU
Charades OOD-1       44.7        48.1
Charades OOD-2       44.6        47.8
ActivityNet OOD-1    36.2        39.7
ActivityNet OOD-2    36.1        39.5
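For readers new to the metric, mIoU is conventionally the mean temporal intersection-over-union between predicted and ground-truth segments, reported as a percentage. The sketch below shows that standard definition; the function names are illustrative, and the paper's exact evaluation protocol is not described on this page.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean temporal IoU over all queries, scaled to a percentage."""
    if not preds:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(preds)
```

A prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 s out of a 15 s union, giving an IoU of 1/3.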

Citation

BibTeX

@inproceedings{zhang2026dart,
  title     = {DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding},
  author    = {Zhang, Zhengbo and Huang, Mark He and Tu, Zhigang and Yang, Ming-Hsuan},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}