A filmmaker controls two things. What the audience sees and when they see it. The first twelve articles in this series have been entirely about the first one. Color, light, lens, movement, sound, structure. All spatial. All about the frame at any given instant.

Time is the other half. And it is the half these models barely speak.

What we mean when we say timing

On a real set, time is everywhere. The editor controls pacing across cuts. The DP controls it within the shot: slow motion at 120fps stretches a two-second fall into an eight-second event. A time-lapse compresses fourteen hours of cloud movement into six seconds. Speed ramping shifts mid-shot from real time to slow to fast, creating emphasis the way a musician uses a fermata. A held beat before action. The half-second delay before a character turns. The lingering close-up that outlasts comfort.
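The arithmetic behind those two numbers is simple, and worth seeing once. A rough sketch, assuming a 30fps playback rate for the slow-motion shot and 24fps delivery for the time-lapse, since neither is stated above:

```python
# Rough arithmetic behind the two examples above.
# Assumptions (not stated in the text): 30 fps playback for the slow-motion
# shot, 24 fps delivery for the time-lapse.

capture_fps = 120                                   # overcranked capture
playback_fps = 30
event_seconds = 2                                   # real duration of the fall

frames_captured = capture_fps * event_seconds       # 240 frames
screen_seconds = frames_captured / playback_fps     # 8.0 seconds on screen

delivery_fps = 24
real_hours = 14
clip_seconds = 6
clip_frames = clip_seconds * delivery_fps           # 144 frames
capture_interval = real_hours * 3600 / clip_frames  # 350 s: one frame roughly every six minutes

print(screen_seconds, capture_interval)
```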

These are choices. They communicate. A punch in slow motion tells a different story than a punch at full speed. A sunrise in time-lapse tells a different story than a sunrise in real time. The content is identical. The timing is what makes it mean something.

In a prompt, you have almost no vocabulary for any of this.

What "slow motion" actually does

"Slow motion" is the most commonly attempted temporal keyword. It is also the most inconsistently handled. What you want: the physics of normal-speed action rendered at a slower playback rate, preserving the weight and fluidity of real movement. What you often get: normal-speed action that just kind of... happens slowly. Characters move like they are underwater. Objects fall with the wrong weight. Hair drifts rather than arcs.

The distinction matters. Real slow motion is captured fast and played back slow. The inter-frame information is dense. AI slow motion is generated at the output frame rate with slower motion baked in. The inter-frame information is identical to everything else the model produces. It looks close enough to fool a thumbnail. It does not fool anyone watching for more than two seconds.

Kling 3.0 handles slow motion better than anyone right now. Its physics engine produces convincing weight in slow-motion water splashes, fabric movement, particle dispersal. Still wrong sometimes, but wrong in ways that feel like artistic license rather than broken simulation. The keyword "slow motion" works. "120fps slow motion" works marginally better. "Overcranked" does nothing.

Veo 3.1 produces gorgeous slow motion when the subject cooperates. Flowing materials, natural environments, atmospheric effects. It falls apart with fast mechanical action. A slow-motion car crash on Veo is a car gently folding. The beauty engine does not turn off for violence and physics.

Runway Gen-4.5 treats "slow motion" as a speed modifier on its existing motion engine. Reliable for simple subjects. Inconsistent for complex multi-element scenes. Director Mode helps if you pair it with specific timing cues like "action unfolds over four seconds."

Sora 2 occasionally produces something that genuinely feels like high-speed capture. A coin spinning, a drop hitting a surface, a bird's wing mid-beat. More often it produces a vaguely sluggish version of its standard output. Coin flip. The narrative engine seems to understand the dramatic purpose of slow motion better than the physics engine can deliver it.

Seedance 2.0 responds to "slow motion" but applies it uniformly. Everything slows down, including elements that should not. Background crowds, ambient wind, secondary motion. In real high-speed footage the environment slows by the same factor as the subject, but motion at different scales reads differently on screen, and that difference is what makes the subject feel temporally isolated. Seedance does not make that distinction.

Time-lapse is a harder problem

Slow motion asks the model to stretch time. Time-lapse asks it to compress hours, days, seasons into seconds. The model has to understand not just that clouds move, but how they move across hours. Not just that shadows exist, but that they rotate as the sun arcs overhead. Not just that flowers exist, but that they open and close on a cycle measured in half-days.

Most models have seen enough time-lapse footage in training to produce something recognizable. Clouds streaking. City lights flickering on as the sky darkens. Construction sites progressing. The signature look is there. But the physics are usually a hallucination dressed in the right costume.

"Time-lapse of a sunset" will give you moving clouds and changing sky colors on every model. It will not give you accurate shadow rotation, correct star trail arcs, or the specific way light changes color temperature in the minutes before and after the sun crosses the horizon. It will give you the poster. Not the photograph.

Veo produces the most visually convincing time-lapses because its aesthetic engine fills in the gaps with beauty. Kling produces the most physically plausible ones because its simulation engine attempts real shadow and light behavior. Neither is actually compressing real time. Both are generating what compressed time looks like.

There is a meaningful difference between those two things, and it is the same difference that runs through this entire series: the model learned the appearance, not the mechanism.

Speed changes do not exist

Speed ramping is one of the most expressive tools in modern cinematography and post-production. Real-time action accelerates into slow motion at the moment of impact, then snaps back. A fight scene breathes. A sports replay emphasizes contact. A music video rides the beat.

No current AI video model can do this from a text prompt.

You cannot write "normal speed then slow motion at the moment of impact." The models do not parse temporal phase transitions within a single generation. They commit to a speed at the start and hold it. The concept of variable time within one clip is not in the vocabulary.

This is not a prompting failure. It is an architecture limitation. These models generate a clip as one fixed grid of frames with uniform temporal spacing. Asking for variable temporal spacing within a fixed-length output is asking the model to change its own frame rate mid-generation. Current diffusion transformer architectures do not support that.
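A minimal sketch of the constraint, not any model's internals: the generator commits to one mapping from frames to scene time, while a ramp needs that mapping to bend mid-clip. The frame counts and the four-times slowdown below are arbitrary placeholders.

```python
# One fixed mapping from frames to scene time (what current models produce)
# versus the bent mapping a speed ramp requires. A sketch, not any model's
# internals; the numbers are arbitrary.

fps = 24
n_frames = 6 * fps                          # a six-second clip: 144 frames
uniform_step = 1 / fps                      # every frame advances scene time equally

def ramped_step(i):
    # Frames 60-99 cover scene time at quarter speed (the "impact" stretch).
    return uniform_step / 4 if 60 <= i < 100 else uniform_step

uniform_scene_time = (n_frames - 1) * uniform_step
ramped_scene_time = sum(ramped_step(i) for i in range(1, n_frames))

# Same 144 frames, same six seconds of screen time, but the ramped clip covers
# less scene time because the slow stretch spends frames on a shorter moment.
print(uniform_scene_time, ramped_scene_time)   # ~5.96 vs ~4.71 seconds of scene time
```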

Kling's motion brush gets closest to a workaround, not because it changes speed but because you can define which elements move and how much, creating the visual illusion of temporal variation even though the actual frame rate is constant. It is clever. It is not the same thing.

The hold and the beat

Here is where it gets genuinely frustrating. Forget slow motion and time-lapse. Those are at least partially addressable. The real temporal gap is dramatic timing.

A character hears a noise. They freeze. Two beats. Then they turn. That two-beat hold is the scene. Without it, the turn is a mechanical rotation. With it, the turn carries dread, or surprise, or recognition. The hold is where emotion lives.

Try prompting for it. "The woman pauses for a moment before turning." What you get: a woman who turns. Maybe slightly slower than usual. The pause, the held stillness that precedes the action, is almost never rendered. The model reads "pauses" and "turning" and splits the difference. Continuous motion, slightly delayed onset. That is not a pause. A pause is a choice to do nothing, held long enough for the audience to feel why.

Sora 2 comes closest, occasionally. Its narrative comprehension sometimes produces a genuine beat of held stillness before an action. But it is not reliable and it is not promptable. You cannot consistently produce a two-beat hold by asking for one. It shows up when Sora decides the story needs it, which is sometimes exactly right and sometimes completely absent.

The other four models treat stillness as a bug. If you say "static camera, woman standing still," you get micro-movements: breathing, eye shifts, subtle sway. That is good. That is lifelike. But it is not the same as held dramatic stillness that precedes intentional action. These models do not understand the difference between idle animation and dramatic restraint.

Duration as a creative decision

Even the length of a generation is only crudely controllable. Most models offer fixed duration options: four seconds, six seconds, ten. Some offer ranges. None let you specify that this particular shot needs exactly 7.2 seconds because that is how long it takes for the action to land with the right weight.

In traditional filmmaking, shot duration is one of the most debated decisions in the edit bay. A two-second cutaway is information. A ten-second cutaway is meditation. The same shot at two different durations tells two different stories.

AI video gives you a dropdown menu. Pick your duration bracket and hope the action fits inside it. If it does not, the model either rushes or pads. Rushing looks frantic. Padding looks like the model forgot what it was doing.

Veo's scene extension is the most interesting partial solution. You can extend a generation, effectively letting the shot breathe past its initial boundary. But extension is additive, not architectural. You cannot say "this shot should unfold over exactly eight seconds with the key moment at second five." You generate, evaluate, extend if needed. Temporal structure emerges through iteration, not intention.

Why this matters for CinePrompt

CinePrompt's motion prompt field is where temporal language lives. "Slow push in." "She turns to face the camera." "Rain intensifies on the glass." These are all temporal instructions even when they read as spatial ones. The motion prompt is, by definition, a description of change over time.

But CinePrompt cannot expose controls the underlying models do not support yet. The speed panel says "slow" or "dynamic" because those are the dials that exist. There is no keyframe timeline. There is no speed curve. There is no "hold for two beats then act" button. Not because nobody thought of it. Because no model would know what to do with it.

That will change. Duration control will get granular. Speed variation within clips will arrive, probably through keyframe interfaces rather than text prompts. Dramatic timing will improve as models train on more precisely annotated footage where the beats and holds are labeled, not just the actions. The gap between spatial control and temporal control will narrow the way all the other gaps have narrowed across this series.

For now, the honest advice is: keep temporal ambitions modest in generation and handle the rest in post. Speed ramping in DaVinci Resolve or After Effects is mature, precise, and does exactly what you tell it. AI generation handles the raw material. Traditional tools handle the clock.
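The retiming those tools perform reduces to a time-remap curve: a function from output time to source time. A minimal sketch of that math, not DaVinci Resolve's or After Effects' actual interface; the ramp boundaries and slow factor below are placeholders.

```python
# Speed ramping as a time-remap curve: each output time is mapped to a source
# time in normal-speed footage. A sketch of the math, not DaVinci Resolve's or
# After Effects' interface; the boundaries and slow factor are placeholders.

FPS = 24

def source_time(t_out, ramp_start=2.0, ramp_end=3.0, slow_factor=0.25):
    """Real speed, then quarter speed between ramp_start and ramp_end
    (in output seconds), then real speed again."""
    if t_out <= ramp_start:
        return t_out
    if t_out <= ramp_end:
        return ramp_start + (t_out - ramp_start) * slow_factor
    return ramp_start + (ramp_end - ramp_start) * slow_factor + (t_out - ramp_end)

# Pick the nearest source frame for each output frame of a six-second result.
remap = [round(source_time(i / FPS) * FPS) for i in range(6 * FPS)]
```

Real tools ease the transitions rather than switching speed instantly, and they interpolate between source frames instead of snapping to the nearest one, but the underlying curve is the same idea.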

It is not elegant. But it works. And knowing which part of the pipeline handles which job is, at this point, the entire game.


Bruce Belafonte is an AI filmmaker at Light Owl. He once spent four days capturing a time-lapse of fog rolling through a valley and now watches AI approximate it in four seconds with mixed feelings.