
柯炜

Associate Professor

Basic Information

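  • Email:
  • Affiliation: School of Software
  • Education: Combined master's and doctoral program
  • Office:
  • Gender: Male
  • Contact:
  • Degree: Doctorate
  • Doctoral supervisor: Yes
  • Master's supervisor: Yes
  • Department: School of Software
  • Discipline: Computer Science and Technology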

My News


Two papers accepted to CVPR. Congratulations to 琮培 and 彭达!

Published: 2026-03-01

FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding

Abstract: Natural videos exhibit heterogeneous temporal dynamics, with certain segments undergoing high-dynamic scene transitions and others dominated by low-dynamic visual changes. However, treating all frames identically, a common practice in most MLLMs, leads to redundant visual encoding, which results in significant computational overhead. The recent state-of-the-art model, i.e., Qwen2.5-VL, adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it encounters a visual confusion problem under high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics leveraging visual variation. FlexiVideo first employs an adaptive temporal segmentation module to estimate inter-frame differences, grouping consecutive frames into scene segments with subtle visual changes. Subsequently, a dynamical spatio-temporal embedding module adjusts the temporal window for scene-level encoding. By restructuring scene-level visual representations within a structured temporal organization, our approach models dynamics more effectively and reduces the encoding burden while preserving fine-grained visual variations. Extensive experiments show that FlexiVideo-3B consistently outperforms Qwen2.5-VL-3B across 6 general video benchmarks. Notably, when evaluated on MotionBench at 10 FPS, FlexiVideo-3B reduces visual tokens by 43.5% compared with Qwen2.5-VL-3B while achieving a 1.3% performance gain, striking a significantly better balance between efficiency and effectiveness. Code and checkpoints will be released soon.
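The segmentation idea in the abstract (estimate inter-frame differences, then group consecutive frames with subtle changes into one scene segment) can be illustrated with a minimal sketch. This is not the FlexiVideo module: the mean-absolute-difference metric, the fixed `threshold`, the flattened-list frame representation, and the function names `frame_difference` and `segment_frames` are all assumptions made for illustration.

```python
from typing import List, Sequence

# Hypothetical frame representation: a flattened list of pixel intensities.
Frame = Sequence[float]


def frame_difference(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equally sized frames (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def segment_frames(frames: Sequence[Frame], threshold: float) -> List[List[int]]:
    """Group consecutive frame indices into segments.

    A new segment starts whenever the difference to the previous frame
    exceeds `threshold`; otherwise the frame joins the current segment.
    """
    segments: List[List[int]] = []
    for i, frame in enumerate(frames):
        if i == 0 or frame_difference(frames[i - 1], frame) > threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


# Toy clip: two near-static frames, an abrupt change, then three near-static frames.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.9, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
segments = segment_frames(frames, threshold=0.2)
print(segments)  # [[0, 1], [2, 3, 4]]
```

In this toy setting, a longer low-dynamic segment could share one temporal window while a high-dynamic boundary forces a new one, which is the intuition behind adjusting the encoding window per scene.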

 

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Abstract: Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
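The dual objective described in the abstract can be pictured as the sum of two alignment terms. The sketch below is only a simplified illustration, not the UniRefiner loss: the abstract does not specify the form of the alignment, so cosine distance is an assumption here, as are the `weight` parameter and the function names `alignment_loss` and `dual_objective`.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(tokens: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Mean cosine distance (1 - cosine similarity) over paired vectors (assumed form)."""
    distances = [1.0 - cosine_similarity(t, g) for t, g in zip(tokens, targets)]
    return sum(distances) / len(distances)


def dual_objective(
    image_tokens: Sequence[Vector],
    regular_targets: Sequence[Vector],
    register_tokens: Sequence[Vector],
    spurious_targets: Sequence[Vector],
    weight: float = 1.0,  # hypothetical balancing factor, not from the abstract
) -> float:
    """Term (i): image tokens vs. filtered regular tokens.
    Term (ii): register tokens vs. detected spurious tokens."""
    semantic_term = alignment_loss(image_tokens, regular_targets)
    spurious_term = alignment_loss(register_tokens, spurious_targets)
    return semantic_term + weight * spurious_term


# Toy example: image tokens already point along their targets (term (i) is 0),
# while the single register token is orthogonal to its target (term (ii) is 1).
loss = dual_objective(
    image_tokens=[[1.0, 0.0], [0.0, 1.0]],
    regular_targets=[[2.0, 0.0], [0.0, 3.0]],
    register_tokens=[[1.0, 0.0]],
    spurious_targets=[[0.0, 1.0]],
)
print(loss)
```

Minimizing such a sum would pull image tokens toward clean semantics and registers toward the spurious signal; how the regular and spurious targets are detected is outside the scope of this sketch.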