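🎓 CVPR 2026 Paper Analysis Report

Data Synthesis · Data Generation · Image Generation · Auto-Annotation · Detection & Segmentation

Data source: arXiv (papers self-labeled by their authors as CVPR 2026) | Last updated: 2026-03-30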

97 related papers
5 research directions
25.4% CVPR 2026 acceptance rate
4090 total accepted papers

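📊 CVPR 2026 Overview by Direction

This report collects the 97 arXiv papers whose authors marked them as accepted to CVPR 2026. They cover five directions: data synthesis, data generation, image generation, auto-annotation, and detection and segmentation. The report will be supplemented once the official CVPR 2026 paper list goes online, expected in April-May 2026.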

40 Data Synthesis
10 Data Generation
55 Image Generation
5 Auto-Annotation
20 Detection & Segmentation
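
The per-direction counts above add up to 130, more than the 97 papers in the report, which is consistent with individual papers carrying several direction tags at once (the tag lines in the paper listing show such overlaps). A minimal Python sketch of how overlapping counts arise; the three sample papers and the `count_directions` helper are hypothetical illustrations, while `report_counts` copies the five numbers shown above:

```python
from collections import Counter

# Per-direction counts as shown in the report overview.
report_counts = {
    "data synthesis": 40,
    "data generation": 10,
    "image generation": 55,
    "auto-annotation": 5,
    "detection & segmentation": 20,
}
report_total = sum(report_counts.values())

# Hypothetical tag sets for three papers; a paper may carry several tags.
papers = [
    {"data synthesis", "detection & segmentation"},
    {"data synthesis", "data generation", "image generation"},
    {"image generation"},
]


def count_directions(tagged_papers):
    """Count how many papers carry each direction tag."""
    counts = Counter()
    for tags in tagged_papers:
        counts.update(tags)
    return counts


counts = count_directions(papers)
total_tags = sum(counts.values())

# With overlapping tags, the sum of per-direction counts exceeds the paper count.
print(report_total, len(papers), total_tags)
```

In the toy data, three papers yield six tag assignments, the same effect that lets five direction counts exceed the number of distinct papers.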

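Data Synthesis (40 papers)

The data synthesis direction focuses on using generative models (diffusion models, GANs, etc.) to create annotated datasets automatically, easing the bottleneck of costly manual annotation. CVPR 2026 has a large body of work in this direction, with three core trends: (1) fine-grained controllable synthesis based on text conditioning; (2) Sim-to-Real transfer; (3) task-driven data quality filtering.

Diffusion-controlled synthesis · Sim-to-Real transfer · Task-driven quality filtering · Multimodal data generation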

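Data Generation (10 papers)

The data generation direction focuses on building training data automatically and at scale, including multimodal data generation, dataset distillation and compression, and data augmentation with large models.

Multimodal data generation · Dataset distillation · LLM-assisted data construction · Cross-domain data generation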

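Image Generation (55 papers)

Image generation is one of the most active directions at CVPR 2026. Diffusion models (DiT, LDM) are the mainstream framework, with a large number of advances across sub-tasks such as text-to-image generation, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer sampling steps, token compression), controllable generation, and safety alignment.

DiT inference acceleration · Text-to-image controllability · Video generation · 3D generation · Safety alignment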

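Auto-Annotation (5 papers)

The auto-annotation direction uses semi-supervised learning, pseudo-labeling, and large-model assistance to reduce reliance on manual annotation. It is a core supporting technology for data-efficient learning.

Pseudo-label quality control · Semi-supervised learning · VLM-assisted annotation · Self-training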
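
To make the pseudo-labeling idea concrete, here is a minimal sketch of confidence-thresholded pseudo-label selection, a common quality-control step in self-training. It is a generic illustration rather than the method of any paper in this report; the function name `select_pseudo_labels`, the 0.9 threshold, and the sample probabilities are all assumptions:

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Keep (index, label) pairs whose top class probability meets the threshold.

    probabilities: list of per-class probability lists, one per unlabeled sample.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        # Predicted class = index of the largest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


# Hypothetical model outputs for four unlabeled images over three classes.
predictions = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.92, 0.03],
    [0.10, 0.10, 0.80],
]
pseudo_labels = select_pseudo_labels(predictions)
print(pseudo_labels)
```

Raising the threshold trades coverage for label quality: fewer samples are kept, but the retained pseudo-labels are the ones the model is most confident about.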

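Detection & Segmentation (20 papers)

At CVPR 2026, the detection and segmentation direction concentrates on sub-areas such as multimodal fusion (RGB + thermal infrared / point cloud), medical image segmentation, and visual-language grounding, and proposes a variety of efficient backbones and cross-modal interaction mechanisms.

Multimodal fusion · Medical image segmentation · Visual-language grounding · Real-time inference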

Data Synthesis (40 papers)


🎯 Research Motivation

…evaluations fail to diagnose fine-grained, association-level forgetting.

🔧 Core Method

We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean mod…

✨ Main Contributions

We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised…

📄 Abstract

As multimodal models like CLIP become integral to downstream systems, the need to remove sensit… […] evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean mod…

synthetic · multimodal · data synthesis · CLIP

🎯 Research Motivation

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, sc…

🔧 Core Method

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated vis…

✨ Main Contributions

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, sc…

📄 Abstract

Cinematic Audio Source Separation (CASS) aims to decomp… […] using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…

data synthesis · multimodal · video · data generation

🎯 Research Motivation

Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data.

🔧 Core Method

Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data.

✨ Main Contributions

Existing approaches rely heavily on synthetic data…

📄 Abstract

Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic data…

data synthesis · RL · synthetic

🎯 Research Motivation

Yet, despite rapid progress, existing HOI generation research r…

🔧 Core Method

(4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…

✨ Main Contributions

Yet, despite rapid progress, existing HOI generation research r…

📄 Abstract

Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research r… […] depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…

detection & segmentation · data synthesis · video · generation · segmentation · synthetic

🎯 Research Motivation

Our method addresses these challenges through two key innovations.

🔧 Core Method

Our method addresses these challenges through two key innovations.

✨ Main Contributions

First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

📄 Abstract

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… […] poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

detection & segmentation · data synthesis · generation · data generation · segmentation · synthetic

🎯 Research Motivation

…with disordered structures and realistic HRTEM image noises.

🔧 Core Method

It can ensure the denoising performance of models on real images for nucleation observation.

✨ Main Contributions

Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downs…

📄 Abstract

High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nu… […] with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downs…

data synthesis · synthetic

🎯 Research Motivation

…producing highly realistic and controllable sequences.

🔧 Core Method

Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.

✨ Main Contributions

Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

📄 Abstract

Following major advances in text and image generation, t… […] producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

detection & segmentation · data synthesis · video · detection · generation · image generation

🎯 Research Motivation

Models for long-term point tracking are typically trained on large synthetic datasets.

🔧 Core Method

Models for long-term point tracking are typically trained on large synthetic datasets.

✨ Main Contributions

Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…

📄 Abstract

Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…

annotation · data synthesis · video · auto-annotation · RL · synthetic

🎯 Research Motivation

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

🔧 Core Method

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

✨ Main Contributions

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

📄 Abstract

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

detection & segmentation · data synthesis · detection · image generation · GAN · synthetic

🎯 Research Motivation

…our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs.

🔧 Core Method

As a result, VLMs trained on such…

✨ Main Contributions

…our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. As a result, VLMs trained on such…

📄 Abstract

Advances in vision-language models (VLMs) have achieved rem… […] our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such…

data synthesis · RL · synthetic

🎯 Research Motivation

…without relying on 3D priors.

🔧 Core Method

We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training camer…

✨ Main Contributions

We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-sh…

📄 Abstract

We introduce FaceCam, a system that generates video under customizable camera trajectories for… […] without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…

synthetic · data synthesis · video · generation · data generation · 3D

🎯 Research Motivation

However, current methods mainly target sample reduction, with limited consideration of data precision and…

🔧 Core Method

However, current methods mainly target sample reduction, with limited consideration of data precision and…

✨ Main Contributions

However, current methods mainly target sample reduction, with limited consideration of data precision and…

📄 Abstract

Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and…

data synthesis · synthetic

🎯 Research Motivation

…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words.

🔧 Core Method

…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words.

✨ Main Contributions

This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…

📄 Abstract

Fingerspelling is a component of sign languages in which… […] we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…

detection & segmentation · data synthesis · detection · synthetic

🎯 Research Motivation

Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…

🔧 Core Method

Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…

✨ Main Contributions

Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…

📄 Abstract

We present Lumosaic, a compact active hyperspectral video system designed for real-time capture… […] then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…

data synthesis · video · synthetic

🎯 Research Motivation

…calibrated sky illumination, together with per-pixel confidence masks.

🔧 Core Method

We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on…

✨ Main Contributions

We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly…

📄 Abstract

Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and u… […] calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on…

data synthesis · diffusion · synthetic

🎯 Research Motivation

However, many of these models suffer from limited intr…

🔧 Core Method

Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images.

✨ Main Contributions

However, many of these models suffer from limited intr…

📄 Abstract

Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…

data synthesis · diffusion · generation · image generation · synthetic

🎯 Research Motivation

Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data.

🔧 Core Method

Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data.

✨ Main Contributions

Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data.

📄 Abstract

Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its…

data synthesis · synthetic

🎯 Research Motivation

Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth.

🔧 Core Method

We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering.

✨ Main Contributions

Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-sty…

📄 Abstract

Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-sty…

3D · data synthesis · RL · synthetic

🎯 Research Motivation

…are contextualized to produce assembly-ready outputs.

🔧 Core Method

This paper proposes the Unified Primitive Proxies for Structured Shape Completion method/framework.

✨ Main Contributions

Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal…

📄 Abstract

Structured shape completion recovers missing geometry a… […] are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal…

data synthesis · RL · synthetic

🎯 Research Motivation

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

🔧 Core Method

This paper proposes the Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation method/framework.

✨ Main Contributions

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

📄 Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

annotation · data synthesis · detection & segmentation · auto-annotation · generation · data generation

🎯 Research Motivation

…from sequential VR sketches.

🔧 Core Method

Our approach yields higher…

✨ Main Contributions

Our approach yields higher…

📄 Abstract

VR sketching lets users explore and iterate on ideas dir… […] from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher…

synthetic · data synthesis · diffusion · generation · 3D

🎯 Research Motivation

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored.

🔧 Core Method

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored.

✨ Main Contributions

We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…

📄 Abstract

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…

zero-shot · data synthesis · data generation · synthetic

🎯 Research Motivation

Document generation has gained growing attention in the field of AI-driven content creation.

🔧 Core Method

In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spect…

✨ Main Contributions

In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spect…

📄 Abstract

Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spect…

data synthesis · generation

🎯 Research Motivation

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data.

🔧 Core Method

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data.

✨ Main Contributions

While recent approaches, such as R2F, address this challenge by utilizing LLM…

📄 Abstract

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…

data synthesis · diffusion · generation · data generation · image generation · RL

🎯 Research Motivation

…in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect.

🔧 Core Method

To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited…

✨ Main Contributions

Although end-to-end methods perform well, their effectiveness in real-world scenarios…

📄 Abstract

Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited…

data synthesis · generation · RL · diffusion

🎯 Research Motivation

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging.

🔧 Core Method

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging.

✨ Main Contributions

Diffusion-based models rely on stochastic noise-to-data tran…

📄 Abstract

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…

data synthesis · diffusion · image generation

🎯 Research Motivation

However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…

🔧 Core Method

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis.

✨ Main Contributions

However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…

📄 Abstract

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…

data synthesis · generation · image generation

🎯 Research Motivation

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction.

🔧 Core Method

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction.

✨ Main Contributions

This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…

📄 Abstract

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…

data synthesis · generation · LoRA · image generation

🎯 Research Motivation

Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity.

🔧 Core Method

We present a method for generating a full 360° orbit video around a person from a single input image.

✨ Main Contributions

In contrast, recent video diffusion models have demonstrated th…

📄 Abstract

We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…

data synthesis · diffusion · video · generation · image generation · 3D

🎯 Research Motivation

While existing methods can…

🔧 Core Method

While existing methods can…

✨ Main Contributions

It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale.

📄 Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…

data synthesis · generation · image generation · RL · 3D

🎯 Research Motivation

Existing methods predominantly employ optical flow-based image warping.

🔧 Core Method

Existing methods predominantly employ optical flow-based image warping.

✨ Main Contributions

However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…

📄 Abstract

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…

data synthesis · diffusion · image generation

🎯 Research Motivation

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation.

🔧 Core Method

To address this limitation, we propose a new paradigm: extracting rich interaction pri…

✨ Main Contributions

To address this limitation, we propose a new paradigm: extracting rich interaction pri…

📄 Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction pri…

data synthesis · generation · 3D

🎯 Research Motivation

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments.

🔧 Core Method

Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment.

✨ Main Contributions

Few-shot approaches improve scalability across rooms but still r…

📄 Abstract

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still r…

few-shot · multimodal · data synthesis

🎯 Research Motivation

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity.

🔧 Core Method

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity.

✨ Main Contributions

We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…

📄 Abstract

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…

data synthesis · generation · diffusion · image generation

🎯 Research Motivation

However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts.

🔧 Core Method

However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts.

✨ Main Contributions

In this paper, we propose a motion factorization framework that dec…

📄 Abstract

Abstract: Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that dec…

Data Synthesis · generation · RL · video

🎯 Research Motivation

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors.

🔧 Core Method

This paper proposes the method/framework X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection.

✨ Main Contributions

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…

📄 Abstract

Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…

Detection & Segmentation · Data Synthesis · video · detection · synthetic

🎯 Research Motivation

However, the underlying mechanisms gove…

🔧 Core Method

To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment.

✨ Main Contributions

However, the underlying mechanisms gove…

📄 Abstract

Abstract: Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms gove…

Data Synthesis · RL · diffusion · transformer

🎯 Research Motivation

However, current models are predominantly optimized for single-event…

🔧 Core Method

Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis.

✨ Main Contributions

When handling multi-event prompts, w…

📄 Abstract

Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…

Data Synthesis · diffusion · video · generation · Image Generation

🎯 Research Motivation

However, existing approaches either rely on full-sequence…

🔧 Core Method

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis.

✨ Main Contributions

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence…

📄 Abstract

Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…

Data Synthesis · generation · diffusion · Image Generation

🎯 Research Motivation

Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse…

🔧 Core Method

Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse…

✨ Main Contributions

Existing safety evaluation methods, which focus on static image and text generation, are insufficient t…

📄 Abstract

Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient t…

Data Synthesis · generation · video

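Data Generation (10 papers)

The data generation track focuses on building training data automatically and at scale, including multimodal data generation, dataset distillation and compression, and data augmentation with large models.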

🎯 Research Motivation

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…

🔧 Core Method

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…

✨ Main Contributions

To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…

📄 Abstract

Abstract: Cinematic Audio Source Separation (CASS) aims to decomp… …using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…

Data Synthesis · multimodal · video · Data Generation

🎯 Research Motivation

Our method addresses these challenges through two key innovations.

🔧 Core Method

Our method addresses these challenges through two key innovations.

✨ Main Contributions

First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

📄 Abstract

Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

Detection & Segmentation · Data Synthesis · generation · Data Generation · segmentation · synthetic

🎯 Research Motivation

…without relying on 3D priors.

🔧 Core Method

We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…

✨ Main Contributions

We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…

📄 Abstract

Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for… …without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…

synthetic · Data Synthesis · video · generation · Data Generation · 3D

🎯 Research Motivation

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

🔧 Core Method

This paper proposes the method/framework Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation.

✨ Main Contributions

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

📄 Abstract

Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

annotation · Data Synthesis · Detection & Segmentation · Auto-Annotation · generation · Data Generation

🎯 Research Motivation

Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data.

🔧 Core Method

This paper proposes the method/framework Watch and Learn: Learning to Use Computers from Online Videos.

✨ Main Contributions

Existing datasets are narrow, static, and costly to annotate, while…

📄 Abstract

Abstract: Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while…

Data Generation · video

🎯 Research Motivation

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored.

🔧 Core Method

Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored.

✨ Main Contributions

We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…

📄 Abstract

Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…

zero-shot · Data Synthesis · Data Generation · synthetic

🎯 Research Motivation

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training…

🔧 Core Method

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training…

✨ Main Contributions

While recent approaches, such as R2F, address this challenge by utilizing LLM…

📄 Abstract

Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…

Data Synthesis · diffusion · generation · Data Generation · Image Generation · RL

🎯 Research Motivation

Text-to-image generation powers content creation across design, media, and…

🔧 Core Method

Text-to-image generation powers content creation across design, media, and data augmentation.

✨ Main Contributions

Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics.

📄 Abstract

Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…

Data Generation · generation · Image Generation

🎯 Research Motivation

…fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning.

🔧 Core Method

While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization.

✨ Main Contributions

To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually…

📄 Abstract

Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by… …fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually…

Data Generation · generation · RL

🎯 Research Motivation

Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingua…

🔧 Core Method

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data.

✨ Main Contributions

Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingua…

📄 Abstract

Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingua…

Data Generation · generation · multimodal

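Image Generation (55 papers)

Image generation is one of the most active directions at CVPR 2026. Diffusion models (DiT, LDM) are the dominant framework, with many breakthroughs across text-to-image, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer sampling steps, token compression), controllable generation, and safety alignment.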

🎯 Research Motivation

…producing highly realistic and controllable sequences.

🔧 Core Method

Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.

✨ Main Contributions

Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

📄 Abstract

Abstract: Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

Detection & Segmentation · Data Synthesis · video · detection · generation · Image Generation

🎯 Research Motivation

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

🔧 Core Method

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

✨ Main Contributions

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

📄 Abstract

Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

Detection & Segmentation · Data Synthesis · detection · Image Generation · GAN · synthetic

🎯 Research Motivation

However, many of these models suffer from limited intr…

🔧 Core Method

Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images.

✨ Main Contributions

However, many of these models suffer from limited intr…

📄 Abstract

Abstract: Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…

Data Synthesis · diffusion · generation · Image Generation · synthetic

🎯 Research Motivation

Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data.

🔧 Core Method

Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data.

✨ Main Contributions

We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightw…

📄 Abstract

Abstract: Images can be viewed as layered compositions, foregroun… …greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightw…

diffusion · Image Generation

🎯 Research Motivation

However, their potential for GUI grounding remains unexplored.

🔧 Core Method

Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement.

✨ Main Contributions

Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement.

📄 Abstract

Abstract: Autoregressive (AR) vision-language models (VLMs) have… …grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…

Detection & Segmentation · multimodal · diffusion · generation · Image Generation

🎯 Research Motivation

Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements.

🔧 Core Method

The door to cost-efficient visual editing remains…

✨ Main Contributions

The door to cost-efficient visual editing remains…

📄 Abstract

Abstract: Text-guided diffusion models have advanced image editin… …during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We argue that diffusion-based editing capabilities aren't lost but merely hidden from text. The door to cost-efficient visual editing remains…

diffusion · Image Generation

🎯 Research Motivation

However, this enhanced semantic ability ma…

🔧 Core Method

Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation.

✨ Main Contributions

However, this enhanced semantic ability ma…

📄 Abstract

Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability ma…

generation · diffusion · Image Generation · multimodal

🎯 Research Motivation

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles.

🔧 Core Method

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles.

✨ Main Contributions

Yet, they share a fundamental property: latent space Gaussianity.

📄 Abstract

Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. …

Image Generation

🎯 Research Motivation

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training…

🔧 Core Method

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training…

✨ Main Contributions

While recent approaches, such as R2F, address this challenge by utilizing LLM…

📄 Abstract

Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…

Data Synthesis · diffusion · generation · Data Generation · Image Generation · RL

🎯 Research Motivation

…sampling, making the trade-off between fidelity and efficiency a persistent challenge.

🔧 Core Method

We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization.

✨ Main Contributions

Under this perspective, our training framework is…

📄 Abstract

Abstract: Virtual Try-on (VTON) has become a core capability for online retail, wher… …sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is…

generation · Image Generation

🎯 Research Motivation

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging.

🔧 Core Method

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging.

✨ Main Contributions

Diffusion-based models rely on stochastic noise-to-data tran…

📄 Abstract

Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…

Data Synthesis · diffusion · Image Generation

🎯 Research Motivation

Text-to-image generation powers content creation across design, media, and…

🔧 Core Method

Text-to-image generation powers content creation across design, media, and data augmentation.

✨ Main Contributions

Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics.

📄 Abstract

Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…

Data Generation · generation · Image Generation

🎯 Research Motivation

…models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models.

🔧 Core Method

…models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models.

✨ Main Contributions

We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I…

📄 Abstract

Abstract: Recent text-to-image (T2I) diffusion models achieve remarkab… …models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I…

diffusion · Image Generation

🎯 Research Motivation

However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment.

🔧 Core Method

Diffusion models have achieved remarkable success in image and video generation tasks.

✨ Main Contributions

Diffusion models have achieved remarkable success in image and video generation tasks. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle t…

📄 Abstract

Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle t…

diffusion · video · transformer · generation · Image Generation

🎯 Research Motivation

In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels.

🔧 Core Method

In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas…

✨ Main Contributions

Progress remains hindered by…

📄 Abstract

Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by…

generation · Image Generation

🎯 Research Motivation

While existing generation and unified models excel at…

🔧 Core Method

While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios.

✨ Main Contributions

To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic…

📄 Abstract

Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic…

generation · Image Generation

🎯 Research Motivation

However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim.

🔧 Core Method

As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation.

✨ Main Contributions

Existing methods address this by using ver…

📄 Abstract

Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using ver…

generation · RL · Image Generation

🎯 Research Motivation

Despite significant progress in text-to-image generation, aligning outputs with…

🔧 Core Method

This paper proposes the Self-Corrected Image Generation with Explainable Latent Rewards method/framework.

✨ Main Contributions

In contrast, evaluating generated images is more tract…

📄 Abstract

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tract…

generation RL Image Generation

🎯 Research Motivation

Diffusion models have demonstrated remarkable performance…

🔧 Core Method

Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain…

✨ Main Contributions

Diffusion models have demonstrated remarkable performance…

📄 Abstract

Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, t…

generation RL diffusion Image Generation

🎯 Research Motivation

This implicit use suffers from a fundamental misalignment in representation.

🔧 Core Method

Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval.

✨ Main Contributions

It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…

📄 Abstract

Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…

segmentation Detection & Segmentation generation Image Generation

🎯 Research Motivation

However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training h…

🔧 Core Method

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution.

✨ Main Contributions

However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training h…

📄 Abstract

Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training h…

diffusion VAE Image Generation

🎯 Research Motivation

However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we obs…

🔧 Core Method

Latent diffusion models have emerged as the dominant framework for high-fidelity…

✨ Main Contributions

However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we obs…

📄 Abstract

Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we obs…

generation diffusion Image Generation

🎯 Research Motivation

However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results.

🔧 Core Method

Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic…

✨ Main Contributions

This limitation arises from the long…

📄 Abstract

Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long…

generation diffusion Image Generation

🎯 Research Motivation

However, the high-entropy nature of visual…

🔧 Core Method

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis.

✨ Main Contributions

However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…

📄 Abstract

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…

Data Synthesis generation Image Generation
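
The idea behind Jacobi-style decoding, which the abstract above builds on, can be illustrated with a toy sketch. This is plain greedy Jacobi decoding only, not the probabilistic acceptance rule of SJD or the paper's method (which the truncated abstract does not describe). The `next_token` function is a hypothetical deterministic stand-in for an autoregressive model.

```python
def next_token(prefix):
    """Toy deterministic 'model' (hypothetical): next token depends on the last two tokens."""
    a = prefix[-1] if prefix else 1
    b = prefix[-2] if len(prefix) > 1 else 2
    return (3 * a + b + 1) % 7


def greedy_decode(prompt, n):
    """Ordinary autoregressive decoding: one sequential model call per new token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n):
    """Jacobi decoding: refine all n draft tokens together until a fixed point is reached."""
    draft = [0] * n  # arbitrary initial guess for every position
    iterations = 0
    while True:
        iterations += 1
        # Each position is recomputed from the previous draft, so the n calls
        # are independent and could run as one batched forward pass.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:  # fixed point: identical to the greedy sequence
            return draft, iterations
        draft = new
```

For a deterministic model the fixed point equals the sequential greedy output, and it is reached in at most n + 1 parallel iterations because at least one more leading token becomes correct on every pass. With sampled (high-entropy) outputs, drafts are accepted less often, which is the bottleneck the abstract refers to.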

🎯 Research Motivation

Conditional image generation methods are increasingly used in human-centric applications, yet existing human…

🔧 Core Method

Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited…

✨ Main Contributions

Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent.

📄 Abstract

Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result…

generation Image Generation

🎯 Research Motivation

Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple…

🔧 Core Method

Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in…

✨ Main Contributions

We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation.

📄 Abstract

Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation. The Interpreter adaptively decides b…

generation Image Generation

🎯 Research Motivation

With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges.

🔧 Core Method

Traditional methods are incapable of dealing with increasingly realistic…

✨ Main Contributions

Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques.

📄 Abstract

With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques.

generation Image Generation

🎯 Research Motivation

Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk).

🔧 Core Method

Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content.

✨ Main Contributions

However, their performance degrades on broad concepts such a…

📄 Abstract

Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such a…

generation diffusion Image Generation

🎯 Research Motivation

Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process.

🔧 Core Method

Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process.

✨ Main Contributions

Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps.

📄 Abstract

Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the inte…

generation diffusion Image Generation

🎯 Research Motivation

Diffusion models have become the dominant tool…

🔧 Core Method

Diffusion models have become the dominant tool for high-fidelity image and video generation…

✨ Main Contributions

To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at…

📄 Abstract

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at…

diffusion video transformer generation Image Generation
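
The feature caching and reusing scheme mentioned in the abstract above can be sketched in a few lines. This is a generic fixed-interval cache under stated assumptions, not the paper's method: `expensive_block`, the `interval` parameter, and the toy update rule are all hypothetical stand-ins.

```python
def expensive_block(x, t):
    """Stand-in for a costly network block (hypothetical)."""
    return 0.5 * x + 0.01 * t


class CachedBlock:
    """Recompute the block every `interval` steps; reuse the cached output otherwise."""

    def __init__(self, block, interval):
        self.block = block
        self.interval = interval
        self.cache = None
        self.calls = 0  # number of real (non-cached) evaluations

    def __call__(self, x, t, step):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.block(x, t)
            self.calls += 1
        return self.cache


def sample(num_steps, interval):
    """Toy iterative sampler that counts how many block evaluations were needed."""
    block = CachedBlock(expensive_block, interval)
    x = 1.0
    for step in range(num_steps):
        t = num_steps - step
        feat = block(x, t, step)
        x = x - 0.1 * feat  # toy update rule, not a real diffusion sampler
    return x, block.calls
```

With 20 steps and an interval of 4, only 5 of the 20 block evaluations are actually computed. The reused features are stale, which is where the quality loss of caching methods comes from.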

🎯 Research Motivation

Visual autoregressive (VAR) models…

🔧 Core Method

Visual autoregressive (VAR) models have recently emerged as a promising alternative…

✨ Main Contributions

This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…

📄 Abstract

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…

Data Synthesis generation LoRA Image Generation

🎯 Research Motivation

Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization.

🔧 Core Method

Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization.

✨ Main Contributions

Reinforcement learning (RL) has demonstrated…

📄 Abstract

Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) method, which suffers from low training efficiency and limited generalization. In this work, we aim to e…

generation RL Image Generation 3D
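
For readers unfamiliar with the DPO baseline named in the abstract above, here is a minimal sketch of the standard per-pair DPO loss. It assumes scalar sequence log-probabilities as inputs. The argument names and the default `beta` are illustrative, and how existing work applies DPO to 3D generation is not specified in the source.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen sample than the
    # frozen reference model does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss is computed from a fixed set of preference pairs, which is what makes the method offline: no new samples are generated during training.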

🎯 Research Motivation

To address this challenge, we introduce a novel method that strengthens the spatial und…

🔧 Core Method

To address this challenge, we introduce a novel method that strengthens the spatial und…

✨ Main Contributions

To address this challenge, we introduce a novel method that strengthens the spatial und…

📄 Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial und…

generation RL Image Generation ViT

🎯 Research Motivation

Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity.

🔧 Core Method

We present a method for generating a full 360° orbit video around a person from a single input image.

✨ Main Contributions

In contrast, recent video diffusion models have demonstrated th…

📄 Abstract

We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…

Data Synthesis diffusion video generation Image Generation 3D

🎯 Research Motivation

Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation…

🔧 Core Method

Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation…

✨ Main Contributions

We reveal a strong correlation between early diffusion cross-…

📄 Abstract

Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-…

generation RL diffusion Image Generation

🎯 Research Motivation

While existing methods can…

🔧 Core Method

While existing methods can…

✨ Main Contributions

It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale.

📄 Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…

Data Synthesis generation Image Generation RL 3D

🎯 Research Motivation

Existing methods predominantly employ optical flow-based image warping.

🔧 Core Method

Existing methods predominantly employ optical flow-based image warping.

✨ Main Contributions

However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…

📄 Abstract

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…

Data Synthesis diffusion Image Generation

🎯 Research Motivation

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling.

🔧 Core Method

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling.

✨ Main Contributions

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or…

📄 Abstract

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or…

generation diffusion video Image Generation

🎯 Research Motivation

Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often…

🔧 Core Method

Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution.

✨ Main Contributions

Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often…

📄 Abstract

Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than f…

generation diffusion Image Generation

🎯 Research Motivation

Existing lightweight variants predominantly compress the denoising U-Net or reduce the di…

🔧 Core Method

Latent diffusion models such as Stable Diffusion 1.5 offer strong…

✨ Main Contributions

Existing lightweight variants predominantly compress the denoising U-Net or reduce the di…

📄 Abstract

Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the di…

diffusion Image Generation

🎯 Research Motivation

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge.

🔧 Core Method

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge.

✨ Main Contributions

This paper identifies that this issue primarily stems from two ou…

📄 Abstract

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two ou…

diffusion video generation Image Generation CLIP

🎯 Research Motivation

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs.

🔧 Core Method

While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often r…

✨ Main Contributions

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often r…

📄 Abstract

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often r…

generation diffusion video Image Generation

🎯 Research Motivation

However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints.

🔧 Core Method

However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints.

✨ Main Contributions

Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I)…

📄 Abstract

Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models requires substantial computational complexity and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memo…

generation diffusion Image Generation transformer

🎯 Research Motivation

However, most existing…

🔧 Core Method

Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches.

✨ Main Contributions

However, most existing…

📄 Abstract

Spatial transcriptomics (ST) enables spot-level in situ e… profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing…

generation Image Generation

🎯 Research Motivation

However, current discrete…

🔧 Core Method

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures.

✨ Main Contributions

However, current discrete…

📄 Abstract

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete…

generation diffusion Image Generation multimodal

🎯 Research Motivation

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity.

🔧 Core Method

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity.

✨ Main Contributions

We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…

📄 Abstract

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…

Data Synthesis generation diffusion Image Generation
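
As a rough illustration of the token compression discussed in the abstract above, the sketch below halves a token sequence by averaging adjacent pairs. This is a deliberately simplified stand-in, not BiGain: practical token merging typically chooses which tokens to merge by similarity, and the function name here is hypothetical.

```python
def merge_adjacent_tokens(tokens):
    """Halve the token count by averaging each adjacent pair of token vectors."""
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        a, b = tokens[i], tokens[i + 1]
        merged.append([(x + y) / 2 for x, y in zip(a, b)])
    if len(tokens) % 2 == 1:
        # An odd leftover token has no partner and is kept unchanged.
        merged.append(list(tokens[-1]))
    return merged
```

Since self-attention cost grows quadratically with the number of tokens, halving the count cuts that cost to roughly a quarter, at the price of losing per-token detail.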

🎯 Research Motivation

Diffusion Transformers have become a dominant paradigm in visual…

🔧 Core Method

This paper proposes the SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer method/framework.

✨ Main Contributions

Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off.

📄 Abstract

Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruni…

diffusion transformer ViT generation Image Generation

🎯 Research Motivation

Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…

🔧 Core Method

Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…

✨ Main Contributions

Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…

📄 Abstract

Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…

generation diffusion video Image Generation

🎯 Research Motivation

Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes.

🔧 Core Method

Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes.

✨ Main Contributions

Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes.

📄 Abstract

Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to comp…

diffusion Image Generation

🎯 Research Motivation

Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models.

🔧 Core Method

Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models.

✨ Main Contributions

In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time…

📄 Abstract

Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time…

RL · diffusion · image generation
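For readers unfamiliar with the guidance rule that the CFG-Ctrl abstract above builds on, the following is a minimal sketch of the standard classifier-free guidance combination. It is not the CFG-Ctrl method itself, whose details are elided in the excerpt. The function name `cfg_combine` and the use of plain Python lists for model predictions are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(
    v_uncond: Sequence[float],
    v_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Standard classifier-free guidance (illustrative helper, not from the paper).

    Starts from the unconditional prediction and moves along the direction
    (conditional - unconditional), scaled by guidance_scale.
    """
    if len(v_uncond) != len(v_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


# A scale of 1.0 reproduces the conditional prediction; larger scales extrapolate.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

With a scale of 0 the result equals the unconditional prediction, and with a scale of 1 it equals the conditional one. Values above 1 push further in the conditional direction.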

🎯 Research Motivation

However, current models are predominantly optimized for single-event generation.

🔧 Core Method

Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis.

✨ Main Contributions

When handling multi-event prompts, w…

📄 Abstract

Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…

data synthesis · diffusion · video · generation · image generation

🎯 Research Motivation

However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…

🔧 Core Method

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis.

✨ Main Contributions

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence…

📄 Abstract

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…

data synthesis · generation · diffusion · image generation

🎯 Research Motivation

Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation.

🔧 Core Method

Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation.

✨ Main Contributions

Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation.

📄 Abstract

Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are ave…

generation · diffusion · image generation

🎯 Research Motivation

While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge.

🔧 Core Method

While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge.

✨ Main Contributions

While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge.

📄 Abstract

While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized detail…

generation · diffusion · video · image generation

🎯 Research Motivation

Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion.

🔧 Core Method

Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion.

✨ Main Contributions

To address it, we introduce CoD, the first Compression-oriented Diff…

📄 Abstract

Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address it, we introduce CoD, the first Compression-oriented Diff…

RL · diffusion · image generation

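Automatic Annotation (5 papers)

The automatic annotation direction uses semi-supervised learning, pseudo-labels, and large-model assistance to reduce reliance on manual annotation. It is a core enabling technology for data-efficient learning.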

🎯 Research Motivation

Models for long-term point tracking are typically trained on large synthetic datasets.

🔧 Core Method

Models for long-term point tracking are typically trained on large synthetic datasets.

✨ Main Contributions

Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…

📄 Abstract

Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…

annotation · data synthesis · video · automatic annotation · RL · synthetic
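The abstract above notes that self-training hinges on pseudo-label quality. As a generic illustration only, a common baseline is to keep just the pseudo-labels whose confidence clears a threshold. The excerpt does not say how this paper selects pseudo-labels, so the function name `filter_pseudo_labels`, the `(label, confidence)` tuple layout, and the default threshold of 0.9 are all assumptions.

```python
from typing import List, Tuple

# Each pseudo-label is a (label, confidence) pair; this layout is hypothetical.
PseudoLabel = Tuple[str, float]


def filter_pseudo_labels(
    predictions: List[PseudoLabel],
    min_confidence: float = 0.9,
) -> List[PseudoLabel]:
    """Keep only pseudo-labels whose confidence is at least min_confidence."""
    return [(label, conf) for label, conf in predictions if conf >= min_confidence]


candidates = [("track_a", 0.97), ("track_b", 0.42), ("track_c", 0.90)]
kept = filter_pseudo_labels(candidates)
```

Raising the threshold trades pseudo-label coverage for precision. In practice the cut-off would have to be tuned per dataset.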

🎯 Research Motivation

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

🔧 Core Method

This paper proposes the "Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation" method/framework.

✨ Main Contributions

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

📄 Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

annotation · data synthesis · detection and segmentation · automatic annotation · generation · data generation

🎯 Research Motivation

However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog.

🔧 Core Method

As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

✨ Main Contributions

As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

📄 Abstract

Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

annotation · detection and segmentation · automatic annotation · detection · generation

🎯 Research Motivation

…consistency -- a critical failure point in navigation, robotics, and autonomous driving.

🔧 Core Method

ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based…

✨ Main Contributions

ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based…

📄 Abstract

We present ReMoT, a unified training paradigm to systematically address the fundamental shortc… …consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based…

annotation · automatic annotation · video

🎯 Research Motivation

…the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks.

🔧 Core Method

…the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks.

✨ Main Contributions

RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…

📄 Abstract

Visual-language grounding aims to establish semantic correspon… …the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…

annotation · detection and segmentation · automatic annotation · segmentation · RL

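Detection and Segmentation (20 papers)

The detection and segmentation direction at CVPR 2026 emphasizes multimodal fusion (RGB + thermal infrared / point cloud), medical image segmentation, and vision-language grounding, and proposes a variety of efficient backbones and cross-modal interaction mechanisms.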

🎯 Research Motivation

Yet, despite rapid progress, existing HOI generation research r…

🔧 Core Method

(4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…

✨ Main Contributions

Yet, despite rapid progress, existing HOI generation research r…

📄 Abstract

Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research r… …depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…

detection and segmentation · data synthesis · video · generation · segmentation · synthetic

🎯 Research Motivation

Our method addresses these challenges through two key innovations.

🔧 Core Method

Our method addresses these challenges through two key innovations.

✨ Main Contributions

First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

📄 Abstract

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…

detection and segmentation · data synthesis · generation · data generation · segmentation · synthetic

🎯 Research Motivation

…producing highly realistic and controllable sequences.

🔧 Core Method

Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial.

✨ Main Contributions

Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

📄 Abstract

Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…

detection and segmentation · data synthesis · video · detection · generation · image generation

🎯 Research Motivation

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

🔧 Core Method

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images.

✨ Main Contributions

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

📄 Abstract

Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…

detection and segmentation · data synthesis · detection · image generation · GAN · synthetic

🎯 Research Motivation

…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words.

🔧 Core Method

…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words.

✨ Main Contributions

This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…

📄 Abstract

Fingerspelling is a component of sign languages in which… …we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…

detection and segmentation · data synthesis · detection · synthetic

🎯 Research Motivation

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

🔧 Core Method

This paper proposes the "Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation" method/framework.

✨ Main Contributions

However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

📄 Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…

annotation · data synthesis · detection and segmentation · automatic annotation · generation · data generation

🎯 Research Motivation

Video matting remains limited by the scale and realism of existing datasets.

🔧 Core Method

To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality…

✨ Main Contributions

To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality…

📄 Abstract

Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality…

segmentation · detection and segmentation · video

🎯 Research Motivation

However, their potential for GUI grounding remains unexplored.

🔧 Core Method

Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement.

✨ Main Contributions

Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement.

📄 Abstract

Autoregressive (AR) vision-language models (VLMs) have… …grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…

detection and segmentation · multimodal · diffusion · generation · image generation

🎯 Research Motivation

However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog.

🔧 Core Method

As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

✨ Main Contributions

As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

📄 Abstract

Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…

annotation · detection and segmentation · automatic annotation · detection · generation

🎯 Research Motivation

Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions.

🔧 Core Method

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency.

✨ Main Contributions

We present CURE, an error-aware curriculum learning framework that improves grounding a…

📄 Abstract

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding a…

detection and segmentation · generation

🎯 Research Motivation

However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning…

🔧 Core Method

While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate.

✨ Main Contributions

While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate.

📄 Abstract

Referring Expression Comprehension (REC) aims to locali… …reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning…

detection and segmentation · detection · zero-shot

🎯 Research Motivation

This implicit use suffers from a fundamental misalignment in representation.

🔧 Core Method

Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval.

✨ Main Contributions

It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…

📄 Abstract

Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…

segmentation · detection and segmentation · generation · image generation

🎯 Research Motivation

However, practical deployment in real-world applications - especially on resource constrained edge devices - requi…

🔧 Core Method

Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents.

✨ Main Contributions

However, practical deployment in real-world applications - especially on resource constrained edge devices - requi…

📄 Abstract

Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requi…

detection and segmentation · generation · RL

🎯 Research Motivation

Existing methods primarily rely on image reconstruction or template retrieval but face a fundamenta…

🔧 Core Method

Existing methods primarily rely on image reconstruction or template retrieval but face a fundamenta…

✨ Main Contributions

Existing methods primarily rely on image reconstruction or template retrieval but face a fundamenta…

📄 Abstract

Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamenta…

detection and segmentation · detection
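The abstract above mentions template retrieval as one of the existing UAD strategies. A minimal sketch of that baseline idea, under the assumption that images are already encoded as feature vectors, is to score a test feature by its distance to the nearest normal template. This is not the paper's proposed method, and the names `anomaly_score`, `test_feature`, and `normal_templates` are hypothetical.

```python
import math
from typing import List, Sequence


def anomaly_score(
    test_feature: Sequence[float],
    normal_templates: List[Sequence[float]],
) -> float:
    """Euclidean distance from the test feature to its nearest normal template.

    A larger score means the test feature is further from anything seen
    in the normal set, so it is more likely to be anomalous.
    """
    if not normal_templates:
        raise ValueError("at least one normal template is required")
    return min(math.dist(test_feature, template) for template in normal_templates)


templates = [[0.0, 0.0], [1.0, 0.0]]
score_normal = anomaly_score([0.0, 0.0], templates)
score_anomalous = anomaly_score([4.0, 4.0], templates)
```

A threshold on the score then separates normal from anomalous regions, and choosing that threshold is left to the application.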

🎯 Research Motivation

However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networ…

🔧 Core Method

This paper proposes the "PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation" method/framework.

✨ Main Contributions

However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networ…

📄 Abstract

Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networ…

segmentation · detection and segmentation · detection

🎯 Research Motivation

…the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks.

🔧 Core Method

…the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks.

✨ Main Contributions

RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…

📄 Abstract

Visual-language grounding aims to establish semantic correspon… …the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…

annotation · detection and segmentation · automatic annotation · segmentation · RL

🎯 Research Motivation

Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries.

🔧 Core Method

Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries.

✨ Main Contributions

Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries.

📄 Abstract

Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a high…

segmentation · detection and segmentation · RL · video

🎯 Research Motivation

…compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks.

🔧 Core Method

This demonstrates that contrastive distillation provides a principled and effic…

✨ Main Contributions

…compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and effic…

📄 Abstract

Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and… …compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and effic…

segmentation · detection and segmentation · detection · contrastive

🎯 Research Motivation

…in both visual quality and quantitative metrics.

🔧 Core Method

This paper proposes the "3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion" method/framework.

✨ Main Contributions

More materials: https://github.com/work-su…

📄 Abstract

The miniaturization of thermal sensors for mobile platforms inherently limits their spatial res… …in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-su…

segmentation · detection and segmentation · diffusion · detection

🎯 Research Motivation

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors.

🔧 Core Method

This paper proposes the "X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection" method/framework.

✨ Main Contributions

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…

📄 Abstract

The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…

detection and segmentation · data synthesis · video · detection · synthetic