Data Synthesis · Data Generation · Image Generation · Auto-Annotation · Detection & Segmentation
Data source: arXiv (author-tagged CVPR 2026) | Last updated: 2026-03-30
This report aggregates 97 arXiv papers carrying a CVPR 2026 acceptance tag, covering five directions: data synthesis, data generation, image generation, auto-annotation, and detection & segmentation. It will be supplemented once the official CVPR 2026 paper list is published, expected in April–May 2026.
The data synthesis direction uses generative models (diffusion models, GANs, etc.) to automatically create annotated datasets, easing the bottleneck of costly manual labeling. CVPR 2026 saw a surge of work in this direction; core trends include (1) fine-grained text-conditioned synthesis, (2) sim-to-real transfer, and (3) task-driven data quality filtering.
The data generation direction focuses on automated, large-scale construction of training data, including multimodal data generation, dataset distillation and compression, and data augmentation with large models.
Image generation is among the most active directions at CVPR 2026. Diffusion models (DiT, LDM) are the dominant framework, with substantial advances in text-to-image, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer steps, token compression), controllability, and safety alignment.
The auto-annotation direction reduces reliance on manual labeling through semi-supervised learning, pseudo-labels, and large-model assistance; it is a core enabling technology for data-efficient learning.
In detection and segmentation, CVPR 2026 emphasizes multimodal fusion (RGB + thermal infrared / point clouds), medical image segmentation, and vision-language grounding, with a variety of efficient backbones and cross-modal interaction mechanisms.
Abstract (excerpt): As multimodal models like CLIP become integral to downstream systems, the need to remove sensit… …evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean mod…
Abstract (excerpt): Cinematic Audio Source Separation (CASS) aims to decomp… …using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…
Abstract (excerpt): Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic data…
Abstract (excerpt): Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research r… …depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…
Abstract (excerpt): We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract (excerpt): High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nu… …with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downs…
Abstract (excerpt): Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract (excerpt): Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…
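Pseudo-label quality is the crux of such self-training pipelines. As a minimal illustration (not this paper's actual criterion), a common baseline keeps only trajectories whose mean per-frame confidence clears a threshold; the function name and the 0.8 threshold below are hypothetical:

```python
import numpy as np

def filter_pseudo_labels(tracks, confidences, threshold=0.8):
    """Keep pseudo-labeled tracks whose mean confidence exceeds a threshold.

    tracks: (N, T, 2) predicted point trajectories.
    confidences: (N, T) per-frame confidence scores in [0, 1].
    Returns the surviving tracks and the boolean keep-mask. Illustrative
    sketch only; the scoring rule is an assumption, not the paper's method.
    """
    keep = confidences.mean(axis=1) >= threshold
    return tracks[keep], keep

# Toy example: 3 trajectories over 4 frames.
tracks = np.zeros((3, 4, 2))
conf = np.array([[0.90, 0.95, 0.90, 0.92],   # reliable track
                 [0.40, 0.50, 0.30, 0.20],   # unreliable track
                 [0.85, 0.80, 0.90, 0.88]])  # reliable track
kept, mask = filter_pseudo_labels(tracks, conf)
print(kept.shape[0])  # 2 tracks survive the filter
```

In practice such filters are combined with cycle-consistency or occlusion checks, since confidence alone is a weak proxy for label correctness.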
Abstract (excerpt): Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract (excerpt): Advances in vision-language models (VLMs) have achieved rem… …our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such…
Abstract (excerpt): We introduce FaceCam, a system that generates video under customizable camera trajectories for… …without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…
Abstract (excerpt): Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and…
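For intuition, dataset distillation can be caricatured as compressing many samples into a few synthetic ones that still train a model reasonably well. The class-mean sketch below is a didactic stand-in, not an actual DD method such as gradient or trajectory matching:

```python
import numpy as np

def distill_by_class_means(X, y, n_classes):
    """Naive dataset 'distillation': one synthetic sample per class (its mean).

    Real DD methods instead *optimize* the synthetic samples so that a model
    trained on them matches one trained on the full data; the class mean is
    only the simplest imaginable baseline.
    """
    X_syn = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    y_syn = np.arange(n_classes)
    return X_syn, y_syn

# 4 real samples in 2 classes, compressed to 2 synthetic samples.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
y = np.array([0, 0, 1, 1])
X_syn, y_syn = distill_by_class_means(X, y, n_classes=2)
print(X_syn)  # [[0., 1.], [10., 11.]]
```

Even this toy version shows the trade-off the abstract raises: sample count drops, but per-sample information content (here, within-class variance) is discarded.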
Abstract (excerpt): Fingerspelling is a component of sign languages in which… …we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…
Abstract (excerpt): We present Lumosaic, a compact active hyperspectral video system designed for real-time capture… …then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…
Abstract (excerpt): Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and u… …calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on…
Abstract (excerpt): Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…
Abstract (excerpt): Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its…
Abstract (excerpt): Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-sty…
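Producing paired 3D joints and 2D projections, as such rendering pipelines do, reduces to a pinhole projection per joint. The intrinsics below are illustrative values, not those of the actual UnrealPose-Gen cameras:

```python
import numpy as np

def project_joints(joints_cam, fx, fy, cx, cy):
    """Project camera-space 3D joints (N, 3) to 2D pixel coordinates (N, 2)
    with a pinhole camera model: u = fx*x/z + cx, v = fy*y/z + cy.
    Intrinsics here are illustrative, not the pipeline's real calibration."""
    x, y, z = joints_cam[:, 0], joints_cam[:, 1], joints_cam[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Two joints 2 m in front of a VGA-ish camera.
joints = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]])
uv = project_joints(joints, fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
print(uv)  # [[320. 240.] [570. 115.]]
```

The same projection, applied to every rendered frame, is what makes synthetic 2D keypoint labels "free" once the 3D pose and camera are known.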
This paper proposes Unified Primitive Proxies for Structured Shape Completion (UniCo).
Abstract (excerpt): Structured shape completion recovers missing geometry a… …are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal…
This paper proposes Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation.
Abstract (excerpt): With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
Abstract (excerpt): VR sketching lets users explore and iterate on ideas dir… …from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher…
Abstract (excerpt): Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…
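A procedural stereo generator of the kind whose design space is varied here can be reduced to: sample a texture, choose a disparity map, and resample the right view so that x_R = x_L - d. The constant-disparity plane below is an assumed simplification; real generators also vary texture statistics, disparity ranges, and occlusions:

```python
import numpy as np

def make_stereo_pair(h=32, w=48, disp=3, seed=0):
    """Toy procedural stereo pair with exact ground-truth disparity.

    For a rectified pair, a point at column x_L in the left view appears at
    x_R = x_L - disp in the right view, i.e. right[:, x] = left[:, x + disp].
    Uses a single fronto-parallel plane (an assumption for clarity).
    """
    rng = np.random.default_rng(seed)
    left = rng.random((h, w))
    right = np.empty_like(left)
    right[:, :w - disp] = left[:, disp:]   # shift texture by the disparity
    right[:, w - disp:] = left[:, -1:]     # pad the dis-occluded right border
    gt_disp = np.full((h, w), disp)
    return left, right, gt_disp

left, right, gt = make_stereo_pair()
```

The "design space" knobs the paper studies correspond to the parameters of such a generator (texture distribution, disparity range, scene composition), which is why zero-shot performance can be attributed to specific dataset properties.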
Abstract (excerpt): Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spect…
Abstract (excerpt): Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract (excerpt): Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited…
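A classical (non-learned) baseline for such noise synthesis is the heteroscedastic shot/read-noise model, where noise variance grows with signal intensity. The coefficients below are arbitrary illustrative values, and this is a textbook simplification of the learned generative noise models the abstract refers to:

```python
import numpy as np

def add_signal_dependent_noise(clean, shot=0.01, read=0.0005, seed=0):
    """Synthesize a noisy image from a clean one (values in [0, 1]) under a
    heteroscedastic Gaussian model: var(x) = shot * x + read.
    'shot' scales the signal-dependent term, 'read' is the floor."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(shot * clean + read)
    noisy = clean + rng.normal(0.0, 1.0, clean.shape) * sigma
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((8, 8), 0.5)       # flat mid-gray patch
noisy = add_signal_dependent_noise(clean)
```

Pairing each clean image with such a synthesized noisy counterpart yields unlimited training pairs, at the cost of a domain gap to real sensor noise, which is exactly the gap generative noise models try to close.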
Abstract (excerpt): Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…
Abstract (excerpt): Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…
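The core idea of Jacobi-style decoding can be sketched with a toy deterministic "model": re-predict all drafted positions in parallel until the sequence becomes a fixed point. A better draft means more tokens accepted per iteration and fewer iterations overall, which is exactly the acceptance-rate bottleneck the abstract describes. Everything below (the toy next-token rule, the drafts) is hypothetical:

```python
def next_token(prefix):
    """Toy deterministic stand-in for an autoregressive model."""
    return (sum(prefix) * 3 + len(prefix)) % 10

def jacobi_decode(prompt, draft):
    """Iterate parallel re-prediction over the draft until it is a fixed point."""
    seq = list(prompt) + list(draft)
    iters = 0
    while True:
        iters += 1
        new = list(prompt) + [next_token(seq[:len(prompt) + i])
                              for i in range(len(draft))]
        if new == seq:                      # every draft token verified
            return seq[len(prompt):], iters
        seq = new

tokens_bad, it_bad = jacobi_decode([1, 2], [0, 0, 0, 0, 0])    # poor draft
tokens_good, it_good = jacobi_decode([1, 2], [1, 5, 1, 0, 0])  # partly correct draft
# Both converge to the same sequence [1, 5, 1, 5, 1]; the better draft
# needs fewer iterations because a longer prefix is accepted each round.
```

In real SJD the verification is probabilistic and per-token, but the mechanism is the same: low acceptance in high-entropy (complex) regions forces more refinement rounds.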
Abstract (excerpt): Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…
Abstract (excerpt): We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…
Abstract (excerpt): We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…
Abstract (excerpt): Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…
Abstract (excerpt): Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction pri…
Abstract (excerpt): Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still r…
Abstract (excerpt): Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…
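Token merging, one of the compression families mentioned, can be illustrated by greedily averaging the most cosine-similar token pair; BiGain's actual joint generative/discriminative criterion is not reproduced here, so treat this as a generic ToMe-style sketch:

```python
import numpy as np

def merge_most_similar(tokens, n_merge):
    """Greedy token merging: repeatedly average the most cosine-similar pair.

    tokens: (N, D) array. Returns (N - n_merge, D). O(n_merge * N^2) brute
    force, fine for a demo; real methods use bipartite matching for speed.
    """
    toks = [t.astype(float) for t in tokens]
    for _ in range(n_merge):
        best, pair = -2.0, (0, 1)
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                a, b = toks[i], toks[j]
                sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        merged = (toks[i] + toks[j]) / 2
        toks = [t for k, t in enumerate(toks) if k not in (i, j)] + [merged]
    return np.stack(toks)

# Two near-parallel tokens and one orthogonal token: the parallel pair merges.
tokens = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
out = merge_most_similar(tokens, n_merge=1)
print(out.shape)  # (2, 2)
```

The tension BiGain targets is visible even here: averaging similar tokens preserves reconstruction well but can blur exactly the features a discriminative head relies on.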
Abstract (excerpt): Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that dec…
This paper proposes X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection.
Abstract (excerpt): The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…
Abstract: Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms gove…
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…
Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient t…
The data generation direction focuses on automated, large-scale construction of training data, covering multimodal data generation, dataset distillation and compression, and data augmentation with large models.
Abstract: Cinematic Audio Source Separation (CASS) aims to decomp… …using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for… …without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…
This paper proposes the method/framework Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation.
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
This paper proposes the method/framework Watch and Learn: Learning to Use Computers from Online Videos.
Abstract: Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while…
Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…
Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by… …fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually…
Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingua…
Image generation is one of the most active directions at CVPR 2026, with diffusion models (DiT, LDM) as the mainstream framework and numerous breakthroughs across subtasks including text-to-image, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer steps, token compression), generation controllability, and safety alignment.
Abstract: Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract: Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…
Abstract: Images can be viewed as layered compositions, foregroun… …greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightw…
Abstract: Autoregressive (AR) vision-language models (VLMs) have… …grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…
Abstract: Text-guided diffusion models have advanced image editin… …during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We argue that diffusion-based editing capabilities aren't lost but merely hidden from text. The door to cost-efficient visual editing remains…
Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability ma…
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity…
Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract: Virtual Try-on (VTON) has become a core capability for online retail, wher… …sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is…
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkab… …models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I…
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle t…
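The "simple reusing" caching baseline this abstract contrasts against amounts to recomputing an expensive block only every few denoising steps and reusing the cached feature in between. A toy sketch, where the block, step count, and update rule are all placeholders rather than any paper's architecture:

```python
import numpy as np

def heavy_block(x: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for an expensive DiT transformer block."""
    return np.tanh(x + 0.1 * t)

def denoise_with_cache(x: np.ndarray, steps: int, cache_interval: int):
    """Toy denoising loop: run the full block only every `cache_interval`
    steps; otherwise reuse the cached feature for a cheap update."""
    cache = None
    recomputes = 0
    for t in range(steps):
        if cache is None or t % cache_interval == 0:
            cache = heavy_block(x, t)   # full (expensive) forward pass
            recomputes += 1
        x = x - 0.05 * cache            # cheap update reusing cached feature
    return x, recomputes

x0 = np.ones(4)
_, n_full = denoise_with_cache(x0, steps=20, cache_interval=4)
```

With 20 steps and an interval of 4, only 5 full block evaluations run instead of 20; the trade-off is the staleness of the cached feature between recomputes, which is exactly what forecasting-based caching methods try to correct.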
Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by…
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic…
Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using ver…
This paper proposes the method/framework Self-Corrected Image Generation with Explainable Latent Rewards.
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tract…
Abstract: Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, t…
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…
Abstract: Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training h…
Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we obs…
Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long…
Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…
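The Jacobi-style parallel decoding this abstract builds on can be illustrated with a toy example. Below is a minimal sketch of plain Jacobi fixed-point decoding (not SJD itself), where a deterministic `next_token` function stands in for the greedy argmax of an autoregressive model; all names are hypothetical.

```python
def next_token(prefix):
    # toy deterministic "model": stands in for greedy argmax decoding
    return (sum(prefix) * 31 + len(prefix)) % 50

def sequential_decode(prompt, n):
    # baseline: generate n tokens one at a time
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, max_iters=100):
    # start from an arbitrary draft for all n positions
    prefix = list(prompt)
    draft = [0] * n
    for _ in range(max_iters):
        # parallel refresh: position i is re-predicted from prompt + draft[:i]
        preds = [next_token(prefix + draft[:i]) for i in range(n)]
        if preds == draft:
            return draft  # fixed point: every draft token is self-consistent
        draft = preds
    return draft
```

Because position i depends only on tokens before it, each Jacobi sweep fixes at least one more prefix position, so the iteration provably matches sequential decoding; the speedup comes from verifying many positions per model pass.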
Abstract: Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result…
Abstract: Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides b…
Abstract: With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques.
Abstract: Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such a…
Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the inte…
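The feature-caching idea recurring in these acceleration papers can be sketched in a few lines: evaluate an expensive block only every few denoising steps and reuse the cached output in between. The sketch below is a toy under assumed names (`run_block` and the fixed `reuse` interval are illustrative, not any specific paper's scheme).

```python
import math

def run_block(x, t):
    # stand-in for one expensive transformer-block forward pass
    return [v + math.sin(t) for v in x]

class CachedBlock:
    """Evaluate the block every `reuse` steps; return the cache in between."""

    def __init__(self, reuse=2):
        self.reuse = reuse
        self.cache = None
        self.age = reuse  # force a real evaluation on the first call
        self.evals = 0

    def __call__(self, x, t):
        if self.age >= self.reuse:
            self.cache = run_block(x, t)  # full evaluation, refresh cache
            self.age = 0
            self.evals += 1
        self.age += 1
        return self.cache
```

With `reuse=2` over 50 sampling steps the block runs only 25 times; the abstract's point is that at 20-30 steps the error introduced by each reuse is amortized over far fewer remaining steps, so fixed schedules like this degrade quality.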
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at…
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) methods, which suffer from low training efficiency and limited generalization. In this work, we aim to e…
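For context, the offline DPO objective this abstract contrasts against penalizes, per preference pair, -log sigmoid of a scaled margin between the policy's and a frozen reference model's log-ratios of the chosen over the rejected sample. A minimal scalar sketch (function and argument names are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Offline DPO loss for one (chosen, rejected) preference pair.

    logp_*     : policy log-probabilities of the chosen (w) / rejected (l) sample
    ref_logp_* : the same quantities under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the chosen sample
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log 2 when the policy mirrors the reference and shrinks as the policy shifts probability toward the chosen sample; because the pairs are collected offline, the policy never samples fresh data during training, which is the efficiency/generalization limitation the abstract points to.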
Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial und…
Abstract: We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…
Abstract: Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-…
Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…
Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or…
Abstract: Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than f…
Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the di…
Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two ou…
Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often r…
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models incurs substantial computational and memory costs, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memo…
Abstract: Spatial transcriptomics (ST) enables spot-level in situ e… …profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing…
Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete…
Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…
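Token merging, one of the acceleration techniques named above, reduces compute by fusing the most similar token pair into their average. A simplified single-reduction-step sketch follows (pure-Python lists stand in for tensors; names are hypothetical, and real implementations use bipartite matching rather than this O(n²) pair search):

```python
import math

def cosine(a, b):
    # cosine similarity between two token vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_once(tokens):
    """Merge the most similar token pair by averaging (one reduction step)."""
    best, pair = -2.0, None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, pair = s, (i, j)
    i, j = pair
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    # keep the unmerged tokens, append the fused one
    return [t for k, t in enumerate(tokens) if k not in pair] + [merged]
```

Averaging redundant tokens preserves generation quality fairly well but blurs exactly the fine distinctions a discriminative head relies on, which is the gap the joint objective above targets.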
This paper proposes the SODA (Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformers) method/framework.
Abstract: Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruni…
Abstract: Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…
Abstract: Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to comp…
Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time…
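For context, standard CFG combines an unconditional and a conditional prediction at each sampling step as pred = uncond + w·(cond − uncond). A minimal sketch (lists stand in for prediction tensors; the function name is illustrative):

```python
def cfg_combine(eps_uncond, eps_cond, w):
    # w = 1 recovers the conditional prediction; w > 1 extrapolates
    # past it, strengthening alignment with the condition
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Viewing this per-step extrapolation as a control signal on the underlying flow ODE is the reinterpretation the abstract builds on.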
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…
Abstract: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are ave…
Abstract: While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized detail…
Abstract: Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address it, we introduce CoD, the first Compression-oriented Diff…
The automatic annotation direction leverages semi-supervised learning, pseudo-labeling, and large-model assistance to reduce reliance on manual annotation, and serves as a core enabling technology for data-efficient learning.
Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…
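The pseudo-label filtering that self-training pipelines like this depend on is often just a confidence threshold over model predictions. A minimal sketch (the function name and the 0.9 default are illustrative, not this paper's criterion):

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep model predictions confident enough to serve as pseudo-labels.

    predictions: list of (label, confidence) pairs, one per unlabeled sample.
    Returns (sample_index, label) pairs retained for the next training round.
    """
    return [(i, label)
            for i, (label, conf) in enumerate(predictions)
            if conf >= threshold]
```

The threshold trades pseudo-label quantity against quality: too low and noisy labels reinforce the model's errors, too high and little unlabeled data is used, which is why pseudo-label quality is the crux these works attack.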
This paper proposes a Task-Oriented Data Synthesis and Control-Rectify Sampling method/framework for remote sensing semantic segmentation.
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
Abstract: Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings…
…consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based…
Abstract: …the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. The RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…
In detection and segmentation, CVPR 2026 focuses on multimodal fusion (RGB + thermal infrared / point clouds), medical image segmentation, and vision-language grounding, proposing a range of efficient backbones and cross-modal interaction mechanisms.
Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research…
…depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban…
…poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract: Following major advances in text and image generation, …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract: Fingerspelling is a component of sign languages in which…
…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness…
Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality…
Abstract: Autoregressive (AR) vision-language models (VLMs) have…
…grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…
Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding a…
Abstract: Referring Expression Comprehension (REC) aims to localize…
…reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning…
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…
Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications, especially on resource-constrained edge devices, requires…
Abstract: Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental…
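The reconstruction-based UAD paradigm this abstract refers to can be sketched minimally: score each pixel by its reconstruction error (a model trained only on normal data should fail to reproduce anomalous regions) and threshold the resulting map. The identity "reconstruction" and threshold below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def anomaly_map(test_image, reconstruction):
    """Per-pixel anomaly score: squared error between the test image
    and its reconstruction."""
    return (test_image - reconstruction) ** 2

def detect(test_image, reconstruction, threshold=0.05):
    """Binary anomaly mask from the thresholded error map."""
    return anomaly_map(test_image, reconstruction) > threshold

# Toy example: a flat "normal" reconstruction vs. a test image with
# one injected defect pixel.
recon = np.full((4, 4), 0.5)
test = recon.copy()
test[2, 3] = 1.0  # injected anomaly
mask = detect(test, recon)
print(int(mask.sum()), bool(mask[2, 3]))  # 1 True
```

The "fundamental" limitation the abstract alludes to is precisely that real reconstructions are imperfect even on normal regions, which motivates the retrieval- and correspondence-based alternatives it surveys.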
This paper proposes PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation.
Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks…
Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a high…
Abstract: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and…
…compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient…
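The contrastive distillation mentioned in this abstract generally means aligning a student's embeddings with a frozen teacher's via an InfoNCE-style objective: the matched sample pair is the positive, other samples in the batch are negatives. A minimal numpy sketch (temperature, shapes, and the loss form are illustrative assumptions, not the paper's actual recipe):

```python
import numpy as np

def contrastive_distill_loss(student, teacher, temperature=0.1):
    """InfoNCE-style distillation loss: each student embedding should
    match the teacher embedding of the same sample (diagonal positives)
    against all other samples in the batch (negatives)."""
    # L2-normalize both embedding sets so logits are scaled cosines.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature               # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on diagonal

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
aligned = contrastive_distill_loss(teacher, teacher)  # student == teacher
mismatched = contrastive_distill_loss(rng.normal(size=(8, 16)), teacher)
assert aligned < mismatched  # aligned student scores lower loss
```

The appeal for multi-sensor EO is that the student can consume a different modality than the teacher while still inheriting its representation geometry, which is consistent with the optical-only results the abstract reports.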
This paper proposes 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion.
Abstract: The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution…
…in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-su…
This paper proposes X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection.
Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…