ThinkDrag

Abstract

Drag-based image editing provides an intuitive interface for spatially controlled image manipulation by allowing users to specify handle and target points. Existing methods have made substantial progress through optimization-based, guidance-based, and feed-forward formulations, but they often interpret drag instructions as purely geometric displacements. This limits their effectiveness when the desired edit depends on interpreting image content and inferring a plausible semantic transformation from sparse point constraints.

We introduce ThinkDrag, a unified multimodal framework for drag-based image editing with visual reasoning. ThinkDrag is trained to associate point constraints with meaningful object- and scene-level transformations, and can optionally follow an explicit reasoning path that interprets the intended edit before image generation. This reasoning path improves interpretability and is especially useful in challenging, ambiguous cases where the same drag instruction admits multiple plausible edits.

To support this framework, we construct a supervised dataset of semantic drag transformations paired with reasoning traces and introduce DragBench++, a benchmark targeting challenging drag-based editing scenarios with reference edit solutions. Experiments show that ThinkDrag achieves state-of-the-art performance, improving generation quality and plausibility while maintaining competitive point-following precision.
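This overview does not define how point-following precision is measured. One measure commonly used in drag-editing benchmarks is the mean Euclidean distance between where each handle's content ends up in the edited image and its target point. The sketch below assumes that definition; the function name and the idea that the final point locations are already available (for example from a point tracker or feature matcher) are illustrative assumptions, not details taken from ThinkDrag or DragBench++.

```python
import math


def mean_point_distance(final_points, target_points):
    """Mean Euclidean distance (in pixels) between final and target points.

    final_points:  where each handle's content landed in the edited image
                   (assumed to come from an external tracker; hypothetical input).
    target_points: where the user asked each handle to go.
    Lower is better; 0.0 means every handle reached its target exactly.
    """
    if len(final_points) != len(target_points):
        raise ValueError("final_points and target_points must have equal length")
    if not final_points:
        raise ValueError("at least one point pair is required")
    total = sum(math.dist(p, q) for p, q in zip(final_points, target_points))
    return total / len(final_points)


# One handle reached its target, the other stopped 5 pixels short.
score = mean_point_distance([(10, 10), (3, 4)], [(10, 10), (0, 0)])
print(score)
```

A distance of this kind only captures geometric accuracy, which is why the text reports it alongside generation quality and plausibility rather than on its own.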

ThinkDrag Dataset

The data pipeline starts from synthetic image pairs, filters out implausible edits, matches semantic keypoints between the two images of each pair, and creates reasoning chains that explain how each drag maps to the intended transformation.
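The three stages described above can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the plausibility score and its threshold, the keypoint dictionaries, the minimum-shift cutoff, and the templated sentences that stand in for however the real reasoning chains are produced.

```python
import math
from dataclasses import dataclass


@dataclass
class Drag:
    part: str                # semantic keypoint name, e.g. "door_top_right"
    handle: tuple[int, int]  # (x, y) in the source image
    target: tuple[int, int]  # (x, y) in the edited image


def is_plausible(score: float, threshold: float = 0.5) -> bool:
    """Stage 1: keep a pair only if its (hypothetical) plausibility score is high enough."""
    return score >= threshold


def match_keypoints(src: dict, dst: dict, min_shift: float = 5.0) -> list[Drag]:
    """Stage 2: turn keypoints present in both images into drags, skipping static ones."""
    drags = []
    for name in sorted(src.keys() & dst.keys()):
        if math.dist(src[name], dst[name]) >= min_shift:
            drags.append(Drag(name, src[name], dst[name]))
    return drags


def direction(drag: Drag) -> str:
    """Dominant direction of a drag; image y coordinates grow downward."""
    dx = drag.target[0] - drag.handle[0]
    dy = drag.target[1] - drag.handle[1]
    if abs(dx) >= abs(dy):
        return "to the left" if dx < 0 else "to the right"
    return "upward" if dy < 0 else "downward"


def reasoning_chain(drags: list[Drag], intent: str) -> list[str]:
    """Stage 3: one templated sentence per drag, then the shared intent."""
    steps = [
        f"Drag #{i} moves the {d.part.replace('_', ' ')} point {direction(d)}."
        for i, d in enumerate(drags, start=1)
    ]
    steps.append(f"Together, they {intent}.")
    return steps


# Toy refrigerator-door pair: the right edge moves left, the hinge stays put.
src = {"door_top_right": (200, 50), "door_bottom_right": (200, 300), "hinge": (100, 50)}
dst = {"door_top_right": (140, 50), "door_bottom_right": (140, 300), "hinge": (100, 50)}
if is_plausible(0.9):
    for line in reasoning_chain(match_keypoints(src, dst), "swing the door open"):
        print(line)
```

Skipping keypoints that barely move keeps the drag set sparse, which matches the sparse point constraints the model is meant to interpret.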

The resulting dataset covers diverse transformation types, including pose, gaze, object motion, resizing, deformation, and rotation.

Full-Body Pose

Head & Gaze

Open / Close

Input example for an open and close edit

Output example for an open and close edit

Object Position Shift

Object Resize

Jointed Motion

Scene Shape Deformation

Rigid Rotation

ThinkDrag Model

ThinkDrag represents drag instructions through structured text tokens and spatial endpoint markers in a unified multimodal model, then generates edits directly or after producing a reasoning trace that interprets the edit intent.

ThinkDrag unified multimodal model overview
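To make the idea of structured text tokens concrete, here is a minimal sketch of how drags might be serialized into a text sequence. The token format (`<drag_i>`, `handle=`, `target=`, `<think>`), the 1000-bin coordinate quantization, and the optional prompt prefix are all hypothetical; the actual ThinkDrag encoding is not specified in this overview, and the spatial endpoint markers drawn on the image are not modeled here.

```python
def quantize(value: float, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate in [0, size) to an integer bin in [0, bins)."""
    if not 0 <= value < size:
        raise ValueError(f"coordinate {value} is outside [0, {size})")
    return int(value * bins / size)


def serialize_drags(drags, width, height, prompt=None, think=False):
    """Build a text instruction from (handle, target) pixel pairs.

    drags:  list of ((hx, hy), (tx, ty)) tuples in pixel coordinates.
    prompt: optional user text that disambiguates the edit.
    think:  if True, append a marker asking for a reasoning trace first
            (hypothetical token; the real switch may differ).
    """
    parts = []
    if prompt:
        parts.append(prompt.strip())
    for i, ((hx, hy), (tx, ty)) in enumerate(drags, start=1):
        handle = f"({quantize(hx, width)},{quantize(hy, height)})"
        target = f"({quantize(tx, width)},{quantize(ty, height)})"
        parts.append(f"<drag_{i}> handle={handle} target={target} </drag_{i}>")
    if think:
        parts.append("<think>")
    return " ".join(parts)


# Two drags on a 512x512 image, with a text prompt and reasoning enabled.
text = serialize_drags(
    [((256, 128), (128, 128)), ((256, 384), (128, 384))],
    width=512,
    height=512,
    prompt="open the door",
    think=True,
)
print(text)
```

Quantizing to a fixed number of bins makes the token sequence independent of image resolution, which is one plausible reason to prefer normalized coordinates over raw pixels.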

Results

Semantic Transformations

Each row compares the same input and drag handles across competing drag-based image editing methods.

Input

DragDiffusion

DragLoRA

GeoDrag

LightningDrag

ThinkDrag (Ours)

Thinking-Guided Edits

ThinkDrag first infers the intended edit from the handles, then uses that thinking trace to guide generation.

ThinkDrag refrigerator edit without explicit reasoning — W/o thinking

Model thinking

Drag #1 pulls the top-right corner of the refrigerator door horizontally to the left.

Drag #2 shifts the bottom-right corner of the door to the left.

Because both act on the right edge of the same door, they work together to swing the door open along its left-side hinge.

ThinkDrag refrigerator edit with reasoning — With thinking

ThinkDrag taxi edit without explicit reasoning — W/o thinking

Model thinking

Drag #1 lifts the front of the hood from its resting position up to an open angle.

Drag #2 pulls the upper edge of the hood upward, ensuring the entire panel tilts naturally.

Because both act on the taxi hood, they work together to transition the vehicle from a closed state to an open one.

ThinkDrag taxi edit with reasoning — With thinking

ThinkDrag painting edit without explicit reasoning — W/o thinking

Model thinking

Drag #1 pulls the crest of a wave on the left side of the painting upward.

Drag #2 shifts a prominent wave on the right side of the sea upward toward the top.

Because both move foam in an ascending direction, they create the appearance of a taller, more dramatic breaking wave.

ThinkDrag painting edit with reasoning — With thinking

Text-Guided Drag Variations

The same handles can lead to different edits when the drag instruction is paired with different user text prompts.

Candle input with drag handles — Input with drags

First candle text-guided drag output — *make the candle taller*

Second candle text-guided drag output — *make the flame taller*

Tulip input with drag handles — Input with drags

First tulip text-guided drag output — *make the flower wider*

Second tulip text-guided drag output — *open the petals*

Citation

TBD

ThinkDrag: Semantic Drag-Based Image Editing with Visual Reasoning
