Ten articles. Seven about vocabulary. Two about workflow. One about a dimension most people forgot existed. All of them share a quiet assumption: you are starting from text.

Most of the time, you should not be.

Image-to-video changed the math. A single reference frame carries palette, composition, depth of field, lighting direction, wardrobe, set design, and facial identity simultaneously. That is forty words you no longer need to type and forty words the model no longer needs to guess. The prompt's job shrinks from "describe everything" to "describe what changes." Motion, camera behavior, timing, audio. The things a still image cannot show you.

This is what Frame to Motion was built for. Two prompts instead of one. A still image prompt that builds the frame. A motion prompt that brings it to life. The weird tab nobody else has.

Why two prompts exist

A text-to-video prompt carries every instruction in one breath. Subject, environment, lighting, camera, color, mood, action. You have maybe eighty productive words before the model starts averaging your intent. That is the architectural limit from article eight, and it has not changed. If your vision is specific enough to need a hundred and fifty words, the model will address none of them strongly and all of them weakly.

Frame to Motion splits the burden. The image prompt handles the static elements: what the frame looks like before anything moves. The motion prompt handles the dynamic elements: what happens after the first frame. Two separate generations, each with its own attention budget, each focused on what it does best.
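
If it helps to see that split as structure, here is a minimal sketch in Python. The class names and fields are mine, not CinePrompt's actual schema; they only make the division of labor explicit.

    from dataclasses import dataclass, fields

    @dataclass
    class FramePrompt:
        """Static elements: what the frame looks like before anything moves."""
        subject: str      # "woman in a navy peacoat at a rain-streaked window"
        environment: str  # "warm interior, cool blue city glow outside"
        lighting: str     # "warm key light from the left"
        color: str        # "muted earth tones, one red umbrella"
        lens: str         # "85mm equivalent, shallow depth of field"

        def render(self) -> str:
            # One dense, descriptive prompt for the image model.
            return ", ".join(getattr(self, f.name) for f in fields(self))

    @dataclass
    class MotionPrompt:
        """Dynamic elements: what happens after the first frame."""
        camera: str       # "slow push in toward her face"
        action: str       # "she turns from the window"
        environment: str  # "rain intensifies on glass"
        audio: str        # "quiet ambient rain, muffled traffic below"

        def render(self) -> str:
            # Short sentences about time and change, nothing about looks.
            return ". ".join(getattr(self, f.name) for f in fields(self)) + "."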

Image generation models are extraordinary at composition, detail, and visual fidelity. They have been good at this for longer than video models have existed. A Midjourney or DALL-E or Flux frame gives you a starting point that no amount of text description to a video model could match. The reference image is not a crutch. It is a higher-resolution input than language.

Video models, in turn, are better at motion when they are not simultaneously inventing the world. Hand them a frame and say "slow dolly forward, her hair moves in the wind, camera pulls focus to the background at the two-second mark", and they commit to the motion. Hand them nothing and ask for the same shot from text, and half their processing budget goes to deciding what the hair looks like, what the background contains, how the light falls. Less budget left for the dolly.

What the image prompt does

The still image prompt in Frame to Motion uses CinePrompt's full vocabulary. Every panel from the Single Shot tab applies: subject, environment, lighting, color, lens language. This is where you spend your descriptive budget. Be specific. Be precise. The image generator has no motion to worry about, so it gives all its attention to getting the frame right.

Think of it like setting up a shot on a physical set. Before the director calls action, the DP lights it, the production designer dresses it, the wardrobe department pins every fold, the focus puller marks their distances. None of that is motion. All of it determines what the motion looks like when it starts.

The specificity that video models struggle with in text-to-video, image models handle routinely. "85mm equivalent, shallow depth of field, woman in a navy peacoat standing at a rain-streaked window, warm interior light from the left, cool blue city glow outside, muted earth tones with one red umbrella in the background." A good image model gives you exactly that. A video model given the same text starts negotiating which details to honor and which to approximate.

What the motion prompt does

The motion prompt is a different animal. It does not describe what the frame looks like. The frame already exists. It describes what happens to the frame over time.

Camera movement, subject action, environmental changes, audio cues, timing. These are the motion prompt's domain. And because the reference frame has already locked down the visual identity, the motion prompt can be shorter and more focused than any text-to-video prompt needs to be.

A strong motion prompt for the window scene above: "Slow push in toward her face. She turns from the window. Rain intensifies on glass. Soft focus shift from her profile to the city lights outside. Quiet ambient rain, muffled traffic below."

That is thirty-two words. Every one of them is about time and change. Zero of them are about what the scene looks like, because the image already said that.

The efficiency gain is not just about token count. It is about attention allocation. The motion model receives one clear job instead of two competing ones. In testing, the same camera movement lands noticeably more reliably when requested as an img2vid motion prompt than as a text-to-video prompt. The model that does not have to invent the world is better at moving through it.

Model behavior in img2vid mode

Kling 3.0 thrives on reference frames. Its img2vid mode preserves the input image's visual DNA with high fidelity while executing motion instructions precisely. Camera movements are cleaner. Character consistency across the clip is dramatically better than text-to-video. The motion prompt can be concise. Kling's first-forty-words attention pattern still applies, but forty words of pure motion go further than forty words splitting duties between description and action.

Runway Gen-4.5 was built for this workflow. The entire Director Mode architecture assumes a reference image or prior generation as input. Upload a frame, describe the motion, and Runway executes your specific instruction with the highest fidelity of any model in the lineup. Where Runway lacks native multi-shot, it compensates with per-shot reference precision. This is the model for people who want control.

Veo 3.1 has an interesting relationship with reference frames. It respects them but it also has opinions. Hand Veo a carefully constructed image and a motion prompt, and it will honor the composition while gently adjusting the lighting toward its own aesthetic preferences. This can be beautiful or infuriating depending on whether you wanted the model's taste or your own. Veo's scene extension feature (continue from last frame) is a natural complement to Frame to Motion for building longer sequences.

Seedance 2.0 preserves input frame color science better than any other model. If your reference image has a specific palette, Seedance carries it through the video without the drift that plagues other models. Motion execution is reliable within its vocabulary. Lip sync from reference frames is tight. The trade-off: less improvisational. Seedance does what you ask, rarely adds what you did not.

Sora 2 interprets reference frames more loosely. It uses the image as a narrative anchor rather than a pixel-level constraint. This means Sora will shift framing and lighting if it decides the story calls for it. For some workflows this is creative liberation. For others it is an unwanted co-director. If frame-level fidelity matters, Sora is the riskier choice. If you want the model to riff on your image like a jazz musician riffs on a melody, Sora does that better than anyone.

Where Frame to Motion changes the game

Consistency. The problem from article seven, the one that makes sequences feel like five strangers in a lineup. Frame to Motion addresses it structurally.

Generate one reference image. Use it as the anchor for every shot in a sequence. Each motion prompt describes a different camera angle, a different action, a different moment in the scene. The visual identity stays locked because every clip starts from the same DNA. You are not asking five separate video generations to independently invent the same character, the same location, the same palette. You are handing them the answer and asking only for the motion.
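
As a sketch, the anchored loop looks something like this. The generate_image and generate_video functions are placeholders for whatever image and video models you wire in; they are not any real API.

    # Anchored sequence: one reference frame, many motion prompts.
    # generate_image() and generate_video() are placeholders, not any
    # real model's API; wire them to your tools of choice.

    def generate_image(prompt: str) -> bytes:
        raise NotImplementedError("image model call goes here")

    def generate_video(frame: bytes, motion_prompt: str) -> bytes:
        raise NotImplementedError("img2vid model call goes here")

    image_prompt = (
        "85mm equivalent, shallow depth of field, woman in a navy peacoat "
        "at a rain-streaked window, warm interior light from the left"
    )

    motion_prompts = [
        "Slow push in toward her face. She turns from the window.",
        "Static wide. Rain intensifies on glass.",
        "Focus shift from her profile to the city lights outside.",
    ]

    # The visual identity is decided once; every shot inherits it.
    anchor = generate_image(image_prompt)
    clips = [generate_video(anchor, m) for m in motion_prompts]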

This is what CinePrompt's Multi-Shot tab does when combined with Frame to Motion. Global settings define the world. Per-shot prompts define what happens in each cut. The reference frame is the bridge between the two. It is not a hack or a workaround. It is how professional cinematographers think: lock the look, then choreograph the movement.

The other place it changes things is speed. A text-to-video prompt that fails means you adjust wording and regenerate from zero. A Frame to Motion workflow that fails at the motion stage means you keep the image and retry just the motion. Half the pipeline survives. The iteration loop tightens because the expensive visual description only needs to succeed once.
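
Continuing the sketch above, the tightened loop is just a retry on the second stage. Here passes_review is another placeholder, standing in for your own accept or reject call:

    # Retry only the motion stage; the anchor frame is paid for once.
    def passes_review(clip: bytes) -> bool:
        raise NotImplementedError("human eyeball or automated check")

    anchor = generate_image(image_prompt)  # expensive step, succeeds once

    clip = generate_video(anchor, motion_prompts[0])
    attempts = 1
    while not passes_review(clip) and attempts < 5:
        # A failed clip costs one motion generation, not the whole pipeline.
        clip = generate_video(anchor, motion_prompts[0])
        attempts += 1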

When not to use it

Quick exploration. If you are noodling, trying to find a mood, throwing ideas at the wall, text-to-video is faster. The overhead of generating a reference image first only pays off when you know what you want the frame to look like. If you do not know yet, the text box is still the fastest path to surprise.

Highly dynamic scenes where the first frame is not representative. A car chase that starts with a parked car. An explosion that begins with stillness. The reference frame anchors the beginning, but if the beginning is not the visual identity of the scene, the anchor works against you.

Model-specific features that bypass the workflow entirely. Kling's storyboard, Seedance's native multi-shot, Sora's narrative threading. These produce multi-shot sequences from text alone and their consistency comes from internal mechanisms, not external reference frames. They are a different approach to the same problem and for some use cases they are simply better.

The dual-prompt thesis

Ten articles of vocabulary and workflow, and the underlying pattern is always the same: give the model fewer things to figure out simultaneously and it figures out each thing better. Decompose "cinematic" into six components. Split a prompt into structured layers. Separate image description from motion description.

Frame to Motion is that principle taken to its structural limit. Two generations, each with a focused job, producing a result that neither could achieve alone. The image model builds the world. The video model moves through it. Each one does what it is best at.

Nobody else has built this as a dedicated workflow tab. CinePrompt did, because this is where production-quality AI video actually lives: not in a single magical prompt, but in a pipeline where every stage carries only the weight it was designed for.


Bruce Belafonte is an AI filmmaker at Light Owl. He has generated the same reference frame eleven times to get the light right and considers the motion prompt the easy part.