Data Synthesis · Data Generation · Image Generation · Auto-Annotation · Detection & Segmentation
Data source: arXiv (author-tagged CVPR 2026) | Last updated: 2026-03-30
This report aggregates 97 arXiv papers carrying a CVPR 2026 acceptance tag, covering five directions: data synthesis, data generation, image generation, auto-annotation, and detection & segmentation. It will be supplemented once the official CVPR 2026 paper list is published, expected in April–May 2026.
The data synthesis direction uses generative models (diffusion models, GANs, etc.) to automatically create annotated datasets, easing the bottleneck of costly manual labeling. CVPR 2026 saw a surge of work in this direction; core trends include (1) fine-grained text-conditioned synthesis, (2) sim-to-real transfer, and (3) task-driven data quality filtering.
The data generation direction focuses on automated, large-scale construction of training data, including multimodal data generation, dataset distillation and compression, and data augmentation with large models.
Image generation is among the most active directions at CVPR 2026. Diffusion models (DiT, LDM) are the dominant framework, with substantial advances in text-to-image, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer steps, token compression), controllability, and safety alignment.
The auto-annotation direction reduces reliance on manual labeling through semi-supervised learning, pseudo-labels, and large-model assistance; it is a core enabling technology for data-efficient learning.
In detection and segmentation, CVPR 2026 emphasizes multimodal fusion (RGB + thermal infrared / point clouds), medical image segmentation, and vision-language grounding, with a variety of efficient backbones and cross-modal interaction mechanisms.
Abstract (excerpt): As multimodal models like CLIP become integral to downstream systems, the need to remove sensit… …evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean mod…
Abstract (excerpt): Cinematic Audio Source Separation (CASS) aims to decomp… …using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…
Abstract (excerpt): Accurately modeling how real-world materials reflect light remains a core challenge in inverse rendering, largely due to the scarcity of real measured reflectance data. Existing approaches rely heavily on synthetic data…
Abstract (excerpt): Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research r… …depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…
Abstract (excerpt): We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract (excerpt): High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nu… …with disordered structures and realistic HRTEM image noises. It can ensure the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downs…
Abstract (excerpt): Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract (excerpt): Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…
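Pseudo-label quality is the crux of such self-training pipelines. As a minimal illustration (not this paper's actual criterion), a common baseline keeps only trajectories whose mean per-frame confidence clears a threshold; the function name and the 0.8 threshold below are hypothetical:

```python
import numpy as np

def filter_pseudo_labels(tracks, confidences, threshold=0.8):
    """Keep pseudo-labeled tracks whose mean confidence exceeds a threshold.

    tracks: (N, T, 2) predicted point trajectories.
    confidences: (N, T) per-frame confidence scores in [0, 1].
    Returns the surviving tracks and the boolean keep-mask. Illustrative
    sketch only; the scoring rule is an assumption, not the paper's method.
    """
    keep = confidences.mean(axis=1) >= threshold
    return tracks[keep], keep

# Toy example: 3 trajectories over 4 frames.
tracks = np.zeros((3, 4, 2))
conf = np.array([[0.90, 0.95, 0.90, 0.92],   # reliable track
                 [0.40, 0.50, 0.30, 0.20],   # unreliable track
                 [0.85, 0.80, 0.90, 0.88]])  # reliable track
kept, mask = filter_pseudo_labels(tracks, conf)
print(kept.shape[0])  # 2 tracks survive the filter
```

In practice such filters are combined with cycle-consistency or occlusion checks, since confidence alone is a weak proxy for label correctness.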
Abstract (excerpt): Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract (excerpt): Advances in vision-language models (VLMs) have achieved rem… …our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such…
Abstract (excerpt): We introduce FaceCam, a system that generates video under customizable camera trajectories for… …without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…
Abstract (excerpt): Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and…
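For intuition, dataset distillation can be caricatured as compressing many samples into a few synthetic ones that still train a model reasonably well. The class-mean sketch below is a didactic stand-in, not an actual DD method such as gradient or trajectory matching:

```python
import numpy as np

def distill_by_class_means(X, y, n_classes):
    """Naive dataset 'distillation': one synthetic sample per class (its mean).

    Real DD methods instead *optimize* the synthetic samples so that a model
    trained on them matches one trained on the full data; the class mean is
    only the simplest imaginable baseline.
    """
    X_syn = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    y_syn = np.arange(n_classes)
    return X_syn, y_syn

# 4 real samples in 2 classes, compressed to 2 synthetic samples.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
y = np.array([0, 0, 1, 1])
X_syn, y_syn = distill_by_class_means(X, y, n_classes=2)
print(X_syn)  # [[0., 1.], [10., 11.]]
```

Even this toy version shows the trade-off the abstract raises: sample count drops, but per-sample information content (here, within-class variance) is discarded.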
Abstract (excerpt): Fingerspelling is a component of sign languages in which… …we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectivenes…
Abstract (excerpt): We present Lumosaic, a compact active hyperspectral video system designed for real-time capture… …then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyp…
Abstract (excerpt): Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and u… …calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on…
Abstract (excerpt): Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…
Abstract (excerpt): Large-scale and categorical-balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its…
Abstract (excerpt): Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-sty…
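Producing paired 3D joints and 2D projections, as such rendering pipelines do, reduces to a pinhole projection per joint. The intrinsics below are illustrative values, not those of the actual UnrealPose-Gen cameras:

```python
import numpy as np

def project_joints(joints_cam, fx, fy, cx, cy):
    """Project camera-space 3D joints (N, 3) to 2D pixel coordinates (N, 2)
    with a pinhole camera model: u = fx*x/z + cx, v = fy*y/z + cy.
    Intrinsics here are illustrative, not the pipeline's real calibration."""
    x, y, z = joints_cam[:, 0], joints_cam[:, 1], joints_cam[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)

# Two joints 2 m in front of a VGA-ish camera.
joints = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 2.0]])
uv = project_joints(joints, fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
print(uv)  # [[320. 240.] [570. 115.]]
```

The same projection, applied to every rendered frame, is what makes synthetic 2D keypoint labels "free" once the 3D pose and camera are known.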
This paper proposes Unified Primitive Proxies for Structured Shape Completion (UniCo).
Abstract (excerpt): Structured shape completion recovers missing geometry a… …are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal…
This paper proposes Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation.
Abstract (excerpt): With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
Abstract (excerpt): VR sketching lets users explore and iterate on ideas dir… …from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher…
Abstract (excerpt): Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…
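A procedural stereo generator of the kind whose design space is varied here can be reduced to: sample a texture, choose a disparity map, and resample the right view so that x_R = x_L - d. The constant-disparity plane below is an assumed simplification; real generators also vary texture statistics, disparity ranges, and occlusions:

```python
import numpy as np

def make_stereo_pair(h=32, w=48, disp=3, seed=0):
    """Toy procedural stereo pair with exact ground-truth disparity.

    For a rectified pair, a point at column x_L in the left view appears at
    x_R = x_L - disp in the right view, i.e. right[:, x] = left[:, x + disp].
    Uses a single fronto-parallel plane (an assumption for clarity).
    """
    rng = np.random.default_rng(seed)
    left = rng.random((h, w))
    right = np.empty_like(left)
    right[:, :w - disp] = left[:, disp:]   # shift texture by the disparity
    right[:, w - disp:] = left[:, -1:]     # pad the dis-occluded right border
    gt_disp = np.full((h, w), disp)
    return left, right, gt_disp

left, right, gt = make_stereo_pair()
```

The "design space" knobs the paper studies correspond to the parameters of such a generator (texture distribution, disparity range, scene composition), which is why zero-shot performance can be attributed to specific dataset properties.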
Abstract (excerpt): Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spect…
Abstract (excerpt): Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract (excerpt): Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited…
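A classical (non-learned) baseline for such noise synthesis is the heteroscedastic shot/read-noise model, where noise variance grows with signal intensity. The coefficients below are arbitrary illustrative values, and this is a textbook simplification of the learned generative noise models the abstract refers to:

```python
import numpy as np

def add_signal_dependent_noise(clean, shot=0.01, read=0.0005, seed=0):
    """Synthesize a noisy image from a clean one (values in [0, 1]) under a
    heteroscedastic Gaussian model: var(x) = shot * x + read.
    'shot' scales the signal-dependent term, 'read' is the floor."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(shot * clean + read)
    noisy = clean + rng.normal(0.0, 1.0, clean.shape) * sigma
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((8, 8), 0.5)       # flat mid-gray patch
noisy = add_signal_dependent_noise(clean)
```

Pairing each clean image with such a synthesized noisy counterpart yields unlimited training pairs, at the cost of a domain gap to real sensor noise, which is exactly the gap generative noise models try to close.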
Abstract (excerpt): Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…
Abstract (excerpt): Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…
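The core idea of Jacobi-style decoding can be sketched with a toy deterministic "model": re-predict all drafted positions in parallel until the sequence becomes a fixed point. A better draft means more tokens accepted per iteration and fewer iterations overall, which is exactly the acceptance-rate bottleneck the abstract describes. Everything below (the toy next-token rule, the drafts) is hypothetical:

```python
def next_token(prefix):
    """Toy deterministic stand-in for an autoregressive model."""
    return (sum(prefix) * 3 + len(prefix)) % 10

def jacobi_decode(prompt, draft):
    """Iterate parallel re-prediction over the draft until it is a fixed point."""
    seq = list(prompt) + list(draft)
    iters = 0
    while True:
        iters += 1
        new = list(prompt) + [next_token(seq[:len(prompt) + i])
                              for i in range(len(draft))]
        if new == seq:                      # every draft token verified
            return seq[len(prompt):], iters
        seq = new

tokens_bad, it_bad = jacobi_decode([1, 2], [0, 0, 0, 0, 0])    # poor draft
tokens_good, it_good = jacobi_decode([1, 2], [1, 5, 1, 0, 0])  # partly correct draft
# Both converge to the same sequence [1, 5, 1, 5, 1]; the better draft
# needs fewer iterations because a longer prefix is accepted each round.
```

In real SJD the verification is probabilistic and per-token, but the mechanism is the same: low acceptance in high-entropy (complex) regions forces more refinement rounds.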
Abstract (excerpt): Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…
Abstract (excerpt): We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…
Abstract (excerpt): We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…
Abstract (excerpt): Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…
Abstract (excerpt): Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction pri…
Abstract (excerpt): Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still r…
Abstract (excerpt): Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…
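Token merging, one of the compression families mentioned, can be illustrated by greedily averaging the most cosine-similar token pair; BiGain's actual joint generative/discriminative criterion is not reproduced here, so treat this as a generic ToMe-style sketch:

```python
import numpy as np

def merge_most_similar(tokens, n_merge):
    """Greedy token merging: repeatedly average the most cosine-similar pair.

    tokens: (N, D) array. Returns (N - n_merge, D). O(n_merge * N^2) brute
    force, fine for a demo; real methods use bipartite matching for speed.
    """
    toks = [t.astype(float) for t in tokens]
    for _ in range(n_merge):
        best, pair = -2.0, (0, 1)
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                a, b = toks[i], toks[j]
                sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        merged = (toks[i] + toks[j]) / 2
        toks = [t for k, t in enumerate(toks) if k not in (i, j)] + [merged]
    return np.stack(toks)

# Two near-parallel tokens and one orthogonal token: the parallel pair merges.
tokens = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
out = merge_most_similar(tokens, n_merge=1)
print(out.shape)  # (2, 2)
```

The tension BiGain targets is visible even here: averaging similar tokens preserves reconstruction well but can blur exactly the features a discriminative head relies on.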
Abstract (excerpt): Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that dec…
This paper proposes X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection.
Abstract (excerpt): The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…
Abstract: Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms gove…
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…
Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods, which focus on static image and text generation, are insufficient t…
The data generation direction focuses on automated, large-scale construction of training data, covering multimodal data generation, dataset distillation and compression, and data augmentation with large models.
Abstract: Cinematic Audio Source Separation (CASS) aims to decomp… …using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream…
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban… …poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for… …without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, contin…
This paper proposes the method/framework Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation.
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
This paper proposes the method/framework Watch and Learn: Learning to Use Computers from Online Videos.
Abstract: Computer-using agents (CUAs) must plan task workflows across diverse and evolving applications, yet progress is limited by the lack of large-scale, high-quality training data. Existing datasets are narrow, static, and costly to annotate, while…
Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains underexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stere…
Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…
Abstract: Advances in large reasoning models have shown strong performance on complex reasoning tasks by… …fine-tuning to route visual tokens or employ refocusing mechanisms during reasoning. While effective, these methods are computationally expensive, requiring large-scale data generation and policy optimization. To leverage the benefits of test-time compute without additional RL fine-tuning, we propose VisRef, a visually…
Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingua…
Image generation is one of the most active directions at CVPR 2026, with diffusion models (DiT, LDM) as the mainstream framework and numerous breakthroughs across subtasks including text-to-image, image editing, video generation, and 3D generation. Key trends: inference acceleration (fewer steps, token compression), generation controllability, and safety alignment.
Abstract: Following major advances in text and image generation, t… …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract: Synthetic data has emerged as a practical alternative to authentic face datasets for training face recognition (FR) systems, especially as privacy and legal concerns increasingly restrict the use of real biometric data. Recent advances in identity-conditional diffusion models have enabled the generation of photorealistic and identity-consistent face images. However, many of these models suffer from limited intr…
Abstract: Images can be viewed as layered compositions, foregroun… …greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightw…
Abstract: Autoregressive (AR) vision-language models (VLMs) have… …grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…
Abstract: Text-guided diffusion models have advanced image editin… …during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements. We argue that diffusion-based editing capabilities aren't lost but merely hidden from text. The door to cost-efficient visual editing remains…
Abstract: Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for semantic understanding, enabling them to process more complex textual inputs and comprehend richer contextual meanings. However, this enhanced semantic ability ma…
Abstract: Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity…
Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM…
Abstract: Virtual Try-on (VTON) has become a core capability for online retail, wher… …sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is…
Abstract: Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data tran…
Abstract: Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to better match human preferences, factuality, and improved aesthetics. We introduce SOLACE (Adaptive Rewarding by self-Confidence), a post-training framework that replaces external rewa…
Abstract: Recent text-to-image (T2I) diffusion models achieve remarkab… …models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection paths data and lacking transferability across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I…
Abstract: Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle t…
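The "simple reusing" caching baseline this abstract contrasts against amounts to recomputing an expensive block only every few denoising steps and reusing the cached feature in between. A toy sketch, where the block, step count, and update rule are all placeholders rather than any paper's architecture:

```python
import numpy as np

def heavy_block(x: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for an expensive DiT transformer block."""
    return np.tanh(x + 0.1 * t)

def denoise_with_cache(x: np.ndarray, steps: int, cache_interval: int):
    """Toy denoising loop: run the full block only every `cache_interval`
    steps; otherwise reuse the cached feature for a cheap update."""
    cache = None
    recomputes = 0
    for t in range(steps):
        if cache is None or t % cache_interval == 0:
            cache = heavy_block(x, t)   # full (expensive) forward pass
            recomputes += 1
        x = x - 0.05 * cache            # cheap update reusing cached feature
    return x, recomputes

x0 = np.ones(4)
_, n_full = denoise_with_cache(x0, steps=20, cache_interval=4)
```

With 20 steps and an interval of 4, only 5 full block evaluations run instead of 20; the trade-off is the staleness of the cached feature between recomputes, which is exactly what forecasting-based caching methods try to correct.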
Abstract: In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by…
Abstract: While existing generation and unified models excel at general image generation, they struggle with tasks requiring deep reasoning, planning, and precise data-to-visual mapping abilities beyond general scenarios. To push beyond the existing limitations, we introduce a new and challenging task: creative table visualization, requiring the model to generate an infographic…
Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using ver…
This paper proposes the method/framework Self-Corrected Image Generation with Explainable Latent Rewards.
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tract…
Abstract: Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, t…
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…
Abstract: Reducing token count is crucial for efficient training and inference of latent diffusion models, especially at high resolution. A common strategy is to build high-compression image tokenizers with more channels per token. However, when trained only for reconstruction, high-dimensional latent spaces often lose meaningful structure, making diffusion training h…
Abstract: Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we obs…
Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long…
Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck…
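The Jacobi-style parallel decoding this abstract builds on can be illustrated with a toy example. Below is a minimal sketch of plain Jacobi fixed-point decoding (not SJD itself), where a deterministic `next_token` function stands in for the greedy argmax of an autoregressive model; all names are hypothetical.

```python
def next_token(prefix):
    # toy deterministic "model": stands in for greedy argmax decoding
    return (sum(prefix) * 31 + len(prefix)) % 50

def sequential_decode(prompt, n):
    # baseline: generate n tokens one at a time
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n, max_iters=100):
    # start from an arbitrary draft for all n positions
    prefix = list(prompt)
    draft = [0] * n
    for _ in range(max_iters):
        # parallel refresh: position i is re-predicted from prompt + draft[:i]
        preds = [next_token(prefix + draft[:i]) for i in range(n)]
        if preds == draft:
            return draft  # fixed point: every draft token is self-consistent
        draft = preds
    return draft
```

Because position i depends only on tokens before it, each Jacobi sweep fixes at least one more prefix position, so the iteration provably matches sequential decoding; the speedup comes from verifying many positions per model pass.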
Abstract: Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result…
Abstract: Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides b…
Abstract: With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques.
Abstract: Concept erasure is extensively utilized in image generation to prevent text-to-image models from generating undesired content. Existing methods can effectively erase narrow concepts that are specific and concrete, such as distinct intellectual properties (e.g. Pikachu) or recognizable characters (e.g. Elon Musk). However, their performance degrades on broad concepts such a…
Abstract: Despite achieving state-of-the-art generation quality, diffusion models are hindered by the substantial computational burden of their iterative sampling process. While feature caching techniques achieve effective acceleration at higher step counts (e.g., 50 steps), they exhibit critical limitations in the practical low-step regime of 20-30 steps. As the inte…
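The feature-caching idea recurring in these acceleration papers can be sketched in a few lines: evaluate an expensive block only every few denoising steps and reuse the cached output in between. The sketch below is a toy under assumed names (`run_block` and the fixed `reuse` interval are illustrative, not any specific paper's scheme).

```python
import math

def run_block(x, t):
    # stand-in for one expensive transformer-block forward pass
    return [v + math.sin(t) for v in x]

class CachedBlock:
    """Evaluate the block every `reuse` steps; return the cache in between."""

    def __init__(self, reuse=2):
        self.reuse = reuse
        self.cache = None
        self.age = reuse  # force a real evaluation on the first call
        self.evals = 0

    def __call__(self, x, t):
        if self.age >= self.reuse:
            self.cache = run_block(x, t)  # full evaluation, refresh cache
            self.age = 0
            self.evals += 1
        self.age += 1
        return self.cache
```

With `reuse=2` over 50 sampling steps the block runs only 25 times; the abstract's point is that at 20-30 steps the error introduced by each reuse is amortized over far fewer remaining steps, so fixed schedules like this degrade quality.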
Abstract: Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at…
Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored…
Abstract: Reinforcement learning (RL) has demonstrated remarkable success in text and image generation, yet its potential in 3D generation remains largely unexplored. Existing attempts typically rely on offline direct preference optimization (DPO) methods, which suffer from low training efficiency and limited generalization. In this work, we aim to e…
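For context, the offline DPO objective this abstract contrasts against penalizes, per preference pair, -log sigmoid of a scaled margin between the policy's and a frozen reference model's log-ratios of the chosen over the rejected sample. A minimal scalar sketch (function and argument names are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Offline DPO loss for one (chosen, rejected) preference pair.

    logp_*     : policy log-probabilities of the chosen (w) / rejected (l) sample
    ref_logp_* : the same quantities under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the chosen sample
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log 2 when the policy mirrors the reference and shrinks as the policy shifts probability toward the chosen sample; because the pairs are collected offline, the policy never samples fresh data during training, which is the efficiency/generalization limitation the abstract points to.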
Abstract: Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial und…
Abstract: We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated th…
Abstract: Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-…
Abstract: We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can…
Abstract: Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded ali…
Abstract: Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or…
Abstract: Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than f…
Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the di…
Abstract: Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two ou…
Abstract: Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often r…
Abstract: Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models incurs substantial computational and memory costs, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memo…
Abstract: Spatial transcriptomics (ST) enables spot-level in situ e… …profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing…
Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete…
Abstract: Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves…
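Token merging, one of the acceleration techniques named above, reduces compute by fusing the most similar token pair into their average. A simplified single-reduction-step sketch follows (pure-Python lists stand in for tensors; names are hypothetical, and real implementations use bipartite matching rather than this O(n²) pair search):

```python
import math

def cosine(a, b):
    # cosine similarity between two token vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_once(tokens):
    """Merge the most similar token pair by averaging (one reduction step)."""
    best, pair = -2.0, None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, pair = s, (i, j)
    i, j = pair
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    # keep the unmerged tokens, append the fused one
    return [t for k, t in enumerate(tokens) if k not in pair] + [merged]
```

Averaging redundant tokens preserves generation quality fairly well but blurs exactly the fine distinctions a discriminative head relies on, which is the gap the joint objective above targets.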
This paper proposes the SODA (Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformers) method/framework.
Abstract: Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruni…
Abstract: Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting…
Abstract: Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to comp…
Abstract: Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time…
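For context, standard CFG combines an unconditional and a conditional prediction at each sampling step as pred = uncond + w·(cond − uncond). A minimal sketch (lists stand in for prediction tensors; the function name is illustrative):

```python
def cfg_combine(eps_uncond, eps_cond, w):
    # w = 1 recovers the conditional prediction; w > 1 extrapolates
    # past it, strengthening alignment with the condition
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Viewing this per-step extrapolation as a control signal on the underlying flow ODE is the reinterpretation the abstract builds on.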
Abstract: Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, w…
Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real…
Abstract: Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are ave…
Abstract: While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized detail…
Abstract: Existing diffusion codecs typically build on text-to-image diffusion foundation models like Stable Diffusion. However, text conditioning is suboptimal from a compression perspective, hindering the potential of downstream diffusion codecs, particularly at ultra-low bitrates. To address it, we introduce CoD, the first Compression-oriented Diff…
The automatic annotation direction leverages semi-supervised learning, pseudo-labeling, and large-model assistance to reduce reliance on manual annotation, and serves as a core enabling technology for data-efficient learning.
Abstract: Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends…
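The pseudo-label filtering that self-training pipelines like this depend on is often just a confidence threshold over model predictions. A minimal sketch (the function name and the 0.9 default are illustrative, not this paper's criterion):

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Keep model predictions confident enough to serve as pseudo-labels.

    predictions: list of (label, confidence) pairs, one per unlabeled sample.
    Returns (sample_index, label) pairs retained for the next training round.
    """
    return [(i, label)
            for i, (label, conf) in enumerate(predictions)
            if conf >= threshold]
```

The threshold trades pseudo-label quantity against quality: too low and noisy labels reinforce the model's errors, too high and little unlabeled data is used, which is why pseudo-label quality is the crux these works attack.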
This paper proposes a Task-Oriented Data Synthesis and Control-Rectify Sampling method/framework for remote sensing semantic segmentation.
Abstract: With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of…
Abstract: Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead…
Abstract: We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings…
…consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based…
Abstract: …the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. The RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, over 800 object instances, 1.3 million…
In detection and segmentation, CVPR 2026 focuses on multimodal fusion (RGB + thermal infrared / point clouds), medical image segmentation, and vision-language grounding, proposing a range of efficient backbones and cross-modal interaction mechanisms.
Abstract: Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research…
…depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real…
Abstract: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban…
…poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k im…
Abstract: Following major advances in text and image generation, …producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors gener…
Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these…
Abstract: Fingerspelling is a component of sign languages in which…
…we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness…
Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality…
Abstract: Autoregressive (AR) vision-language models (VLMs) have…
…grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative t…
Abstract: Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding a…
Abstract: Referring Expression Comprehension (REC) aims to localize…
…reasoning, decomposing queries into structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning…
Abstract: Recent approaches for segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow…
Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications, especially on resource-constrained edge devices, requires…
Abstract: Unsupervised Anomaly Detection (UAD) aims to identify abnormal regions by establishing correspondences between test images and normal templates. Existing methods primarily rely on image reconstruction or template retrieval but face a fundamental…
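The reconstruction-based UAD paradigm this abstract refers to can be sketched minimally: score each pixel by its reconstruction error (a model trained only on normal data should fail to reproduce anomalous regions) and threshold the resulting map. The identity "reconstruction" and threshold below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def anomaly_map(test_image, reconstruction):
    """Per-pixel anomaly score: squared error between the test image
    and its reconstruction."""
    return (test_image - reconstruction) ** 2

def detect(test_image, reconstruction, threshold=0.05):
    """Binary anomaly mask from the thresholded error map."""
    return anomaly_map(test_image, reconstruction) > threshold

# Toy example: a flat "normal" reconstruction vs. a test image with
# one injected defect pixel.
recon = np.full((4, 4), 0.5)
test = recon.copy()
test[2, 3] = 1.0  # injected anomaly
mask = detect(test, recon)
print(int(mask.sum()), bool(mask[2, 3]))  # 1 True
```

The "fundamental" limitation the abstract alludes to is precisely that real reconstructions are imperfect even on normal regions, which motivates the retrieval- and correspondence-based alternatives it surveys.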
This paper proposes PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation.
Abstract: Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks…
Abstract: Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a high…
Abstract: Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and…
…compromising performance on optical-only inputs, achieving state-of-the-art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient…
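The contrastive distillation mentioned in this abstract generally means aligning a student's embeddings with a frozen teacher's via an InfoNCE-style objective: the matched sample pair is the positive, other samples in the batch are negatives. A minimal numpy sketch (temperature, shapes, and the loss form are illustrative assumptions, not the paper's actual recipe):

```python
import numpy as np

def contrastive_distill_loss(student, teacher, temperature=0.1):
    """InfoNCE-style distillation loss: each student embedding should
    match the teacher embedding of the same sample (diagonal positives)
    against all other samples in the batch (negatives)."""
    # L2-normalize both embedding sets so logits are scaled cosines.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature               # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on diagonal

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))
aligned = contrastive_distill_loss(teacher, teacher)  # student == teacher
mismatched = contrastive_distill_loss(rng.normal(size=(8, 16)), teacher)
assert aligned < mismatched  # aligned student scores lower loss
```

The appeal for multi-sensor EO is that the student can consume a different modality than the teacher while still inheriting its representation geometry, which is consistent with the optical-only results the abstract reports.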
This paper proposes 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion.
Abstract: The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution…
…in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-su…
This paper proposes X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection.
Abstract: The surge of highly realistic synthetic videos produced by contemporary generative systems has significantly increased the risk of malicious use, challenging both humans and existing detectors. Against this backdrop, we take a…