
柯炜

Associate Professor

Basic Information

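  • Email:
  • Affiliation: School of Software
  • Education: Combined master's and doctoral program
  • Office:
  • Gender: Male
  • Contact:
  • Degree: Doctorate
  • Doctoral supervisor: Yes
  • Master's supervisor: Yes
  • Department: School of Software
  • Discipline: Computer Science and Technology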

My News


One paper accepted to ICLR. Congratulations to 琮培!

Published: 2025-02-07
Content:

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective

 

Refining CLIP's Spatial Awareness: A Visual-Centric Perspective | OpenReview

 

Abstract:

Contrastive Language-Image Pre-training (CLIP) excels in global alignment with language but exhibits limited sensitivity to spatial information, leading to strong performance in zero-shot classification tasks but underperformance in tasks requiring precise spatial understanding. Recent approaches have introduced Region-Language Alignment (RLA) to enhance CLIP's performance in dense multimodal tasks by aligning regional visual representations with corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA suffer from a notable loss in spatial awareness, which is crucial for dense prediction tasks. To address this, we propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation. To further enhance spatial correlations, we introduce a lightweight Refiner that extracts refined correlations directly from CLIP before feeding them into SCD, based on an intriguing finding that CLIP naturally captures high-quality dense features. Together, these components form a robust distillation framework that enables CLIP ViTs to integrate both visual-language and visual-centric improvements, achieving state-of-the-art results across various open-vocabulary dense prediction benchmarks.
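
The abstract does not give the SCD objective, so the following is only a rough sketch of what distilling spatial correlations could look like, not the paper's method. It assumes that "spatial correlation" means pairwise cosine similarity between patch features and that the student is matched to a frozen teacher with a KL divergence over softmax-normalized rows. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def spatial_correlation(features):
    """Cosine similarity between every pair of patch features.

    features: array of shape (num_patches, dim).
    Returns an array of shape (num_patches, num_patches).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)  # guard against zero vectors
    return unit @ unit.T


def row_softmax(logits, temperature):
    """Turn each row of a correlation map into a probability distribution."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)


def correlation_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Assumed loss: mean KL(teacher || student) over rows of the correlation maps."""
    p_teacher = row_softmax(spatial_correlation(teacher_feats), temperature)
    p_student = row_softmax(spatial_correlation(student_feats), temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=1
    )
    return float(kl.mean())


# Toy data standing in for patch features: 16 patches, 8 channels each.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8))
student = rng.normal(size=(16, 8))

loss_same = correlation_distillation_loss(teacher, teacher)
loss_diff = correlation_distillation_loss(student, teacher)
```

Under these assumptions the loss is zero when the student reproduces the teacher's patch-to-patch similarity structure, even if the features themselves differ by a uniform scale, and it is positive otherwise. That is the sense in which such a term would penalize loss of spatial structure rather than drift in the raw features.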