Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics, and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method that improves text-to-image alignment by progressively refining the input prompt in a coarse-to-fine manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts that evolve from describing the broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts, progressively introducing finer-grained details into the generated image. Our training-free, plug-and-play approach significantly enhances prompt alignment, achieving an average Visual Question Answering (VQA) score improvement of more than +8 over Stable Diffusion baselines on 83% of the prompts from the GenAI-Bench dataset.
SCoPE improves prompt-image alignment by dynamically breaking down the input prompt into a sequence of sub-prompts, evolving from a broad scene description to increasingly fine-grained details.
During inference, SCoPE interpolates between these sub-prompts throughout the denoising process, allowing the model to first establish the global structure and then refine finer details. A Gaussian weighting mechanism smoothly shifts each sub-prompt's influence across denoising timesteps.
Starting from the original prompt, SCoPE generates a sequence of sub-prompts using GPT-4o, each describing the same scene with increasing levels of detail. We obtain embeddings for each sub-prompt using a frozen text encoder.
Each sub-prompt is assigned a timestep that determines when it should maximally influence the denoising process. Early sub-prompts guide broad structure, while later ones refine finer semantic details. The final detailed prompt takes over fully after a predefined interpolation period.
To smoothly transition between sub-prompts, SCoPE applies a Gaussian weighting at each timestep, centered around the assigned timestep of each sub-prompt. The standard deviation \( \sigma \) controls the sharpness of this transition — smaller values produce sharper switches, while larger values enable a gradual blend.
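Under the scheme above, one plausible way to write the weight of sub-prompt \( i \) at timestep \( t \) (the normalization over sub-prompts is an assumption of this sketch, consistent with the weighted combination described below) is

\[
w_i(t) = \frac{\exp\!\left(-\dfrac{(t - t_i)^2}{2\sigma^2}\right)}{\sum_j \exp\!\left(-\dfrac{(t - t_j)^2}{2\sigma^2}\right)},
\]

where \( t_i \) is the timestep assigned to sub-prompt \( i \). The weight peaks at \( t = t_i \) and decays with distance from it, so a small \( \sigma \) approaches a hard switch between sub-prompts while a large \( \sigma \) blends several of them at once.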
The final text-conditioning at each timestep is computed as a weighted combination of all sub-prompts, allowing the model to progressively refine the image from global structure to intricate detail.
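As a minimal sketch of this scheduled interpolation (not the authors' implementation; the evenly spaced center timesteps, the weight normalization, and the embedding array shapes are assumptions made here for illustration):

```python
import numpy as np

def gaussian_weights(t, centers, sigma):
    """Weight of each sub-prompt at denoising timestep t.

    Each sub-prompt's influence peaks at its assigned center timestep and
    falls off as a Gaussian with standard deviation sigma. Normalizing the
    weights to sum to 1 is an assumption of this sketch.
    """
    w = np.exp(-((t - centers) ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()

def blended_conditioning(t, centers, sigma, embeddings):
    """Weighted combination of sub-prompt embeddings at timestep t.

    embeddings: array of shape (num_sub_prompts, seq_len, dim), e.g. the
    frozen text-encoder outputs for each sub-prompt. The result has shape
    (seq_len, dim) and serves as the text conditioning at this timestep.
    """
    w = gaussian_weights(t, np.asarray(centers, dtype=float), sigma)
    return np.tensordot(w, embeddings, axes=1)  # weighted sum over sub-prompts

# Hypothetical schedule: 3 sub-prompts whose influence peaks at evenly spaced
# timesteps within the interpolation window; past the window, the final
# (most detailed) sub-prompt's Gaussian dominates the blend.
centers = np.linspace(0.0, 10.0, 3)
```

A larger `sigma` blends neighboring sub-prompts more gradually, mirroring the sharp-switch versus gradual-blend behavior described above.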
SCoPE against baseline models
SCoPE for different tags from the GenAI-Bench dataset
More Visuals (SD-2.1)
@misc{saichandran2025progressivepromptdetailingimproved,
title={Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models},
author={Ketan Suhaas Saichandran and Xavier Thomas and Prakhar Kaushik and Deepti Ghadiyaram},
year={2025},
eprint={2503.17794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.17794},
}