ThinkDrag: Semantic Drag-Based Image Editing with Visual Reasoning

Armando Fortes, Tianyi Wei, Chindanai Trakantannarong, Shangchen Zhou, Xingang Pan

S-Lab, Nanyang Technological University



TL;DR ThinkDrag turns sparse drag handles into semantic image edits by first inferring the intended transformation, then using that visual thinking trace to guide generation. Users can also add text prompts to disambiguate edit intent.

Teaser figure

Abstract

Drag-based image editing provides an intuitive interface for spatially controlled image manipulation by allowing users to specify handle and target points. Existing methods have made substantial progress through optimization-based, guidance-based, and feed-forward formulations, but they often interpret drag instructions as geometric displacements. This limits their effectiveness when the desired edit depends on interpreting image content and inferring a plausible semantic transformation from sparse point constraints.

We introduce ThinkDrag, a unified multimodal framework for drag-based image editing with visual reasoning. ThinkDrag is trained to associate point constraints with meaningful object- and scene-level transformations, and can optionally follow an explicit reasoning path that interprets the intended edit before image generation. This reasoning path improves interpretability and is especially useful in ambiguous cases, where the same drag instruction admits multiple plausible edits.

To support this framework, we construct a supervised dataset of semantic drag transformations paired with reasoning traces and introduce DragBench++, a benchmark targeting challenging drag-based editing scenarios with reference edit solutions. Experiments show that ThinkDrag achieves state-of-the-art performance, improving generation quality and plausibility while maintaining competitive point-following precision.


ThinkDrag Dataset

Starting from synthetic image pairs, the pipeline filters implausible edits, matches semantic keypoints, and creates reasoning chains that explain how each drag maps to the intended transformation.

ThinkDrag dataset construction pipeline
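The three stages described above (plausibility filtering, keypoint matching, reasoning-chain creation) can be sketched as a small driver loop. This is only an illustrative skeleton, not the authors' implementation: `Sample`, `build_dataset`, the three callables, and the `min_shift` threshold are all hypothetical names and assumptions introduced here, and the actual filters, matchers, and reasoning generators are not specified on this page.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, float]
DragPair = Tuple[Point, Point]  # (handle, target)


@dataclass
class Sample:
    """One training example (hypothetical layout)."""
    source: str                      # identifier of the input image
    edited: str                      # identifier of the edited image
    drags: List[DragPair] = field(default_factory=list)
    reasoning: str = ""              # text explaining how the drags map to the edit


def build_dataset(
    pairs: Iterable[Tuple[str, str]],
    is_plausible: Callable[[str, str], bool],
    match_keypoints: Callable[[str, str], List[DragPair]],
    explain: Callable[[str, str, List[DragPair]], str],
    min_shift: float = 5.0,          # assumed threshold, in pixels
) -> List[Sample]:
    samples = []
    for src, dst in pairs:
        # Stage 1: drop image pairs whose edit is judged implausible.
        if not is_plausible(src, dst):
            continue
        # Stage 2: matched keypoints become candidate (handle, target) drags.
        # Assumption: keypoints that barely move carry no drag signal.
        drags = [(h, t) for h, t in match_keypoints(src, dst)
                 if math.dist(h, t) >= min_shift]
        if not drags:
            continue
        # Stage 3: attach a reasoning chain for the surviving drags.
        samples.append(Sample(src, dst, drags, explain(src, dst, drags)))
    return samples
```

Passing the stages in as callables keeps the sketch independent of any particular filter, keypoint matcher, or language model.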

The resulting dataset covers diverse transformation types, including pose, gaze, object motion, resizing, deformation, and rotation.

Example input/output pairs are shown for eight categories:

Full-Body Pose
Head & Gaze
Open / Close
Object Position Shift
Object Resize
Jointed Motion
Scene Shape Deformation
Rigid Rotation

ThinkDrag Model

ThinkDrag represents drag instructions through structured text tokens and spatial endpoint markers in a unified multimodal model, then generates edits directly or after producing a reasoning trace that interprets the edit intent.
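To make the idea of encoding drags as structured text concrete, here is a minimal sketch of one possible serialization. Everything in it is an assumption for illustration: the tag names (`<drag_i>`, `<text>`, `<think>`, `<edit>`), the 1000-bin coordinate quantization, and the `Drag` and `serialize_drags` names are hypothetical and do not come from the ThinkDrag model, whose actual token format is not given on this page. The spatial endpoint markers mentioned above are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Drag:
    """One drag: a handle point and its target, in pixel coordinates."""
    handle: Tuple[int, int]
    target: Tuple[int, int]


def normalize(point: Tuple[int, int], width: int, height: int,
              bins: int = 1000) -> Tuple[int, int]:
    """Quantize a pixel coordinate to integer bins in [0, bins - 1]."""
    x, y = point
    nx = min(bins - 1, max(0, int(x / width * bins)))
    ny = min(bins - 1, max(0, int(y / height * bins)))
    return nx, ny


def serialize_drags(drags: List[Drag], width: int, height: int,
                    prompt: Optional[str] = None,
                    think: bool = False) -> str:
    """Render drags (and an optional user prompt) as one instruction string.

    The trailing tag selects the hypothetical mode: "<think>" asks for a
    reasoning trace before generation, "<edit>" asks for a direct edit.
    """
    parts = []
    for i, d in enumerate(drags, start=1):
        hx, hy = normalize(d.handle, width, height)
        tx, ty = normalize(d.target, width, height)
        parts.append(
            f"<drag_{i}> handle=({hx},{hy}) target=({tx},{ty}) </drag_{i}>"
        )
    if prompt:
        parts.append(f"<text> {prompt} </text>")
    parts.append("<think>" if think else "<edit>")
    return " ".join(parts)
```

For a 512×512 image, `serialize_drags([Drag((256, 128), (384, 128))], 512, 512)` returns `<drag_1> handle=(500,250) target=(750,250) </drag_1> <edit>`. Normalizing coordinates keeps the instruction independent of image resolution, which is one common design choice for point inputs to multimodal models.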

ThinkDrag unified multimodal model overview

Results

Semantic Transformations

Each row compares the same input and drag handles across competing drag-based image editing methods.

Comparison figure. Columns: Input, DragDiffusion, DragLoRA, GeoDrag, LightningDrag, ThinkDrag (Ours). Rows: a dog, Michelangelo's David, a parrot on a branch, feet, a window, a hippo.

Thinking-Guided Edits

ThinkDrag first infers the intended edit from the handles, then uses that thinking trace to guide generation.

Text-Guided Drag Variations

The same handles can lead to different edits when the drag instruction is paired with different user text prompts.

Citation

TBD