There is a moment in every AI video workflow where you type "slow dolly forward" and the model gives you a zoom. Not a dolly. A zoom. The background does not shift. The parallax is wrong. The perspective stays flat. You asked for a physical camera movement and got a lens trick instead.

This happens constantly. And it is not random.

Each model has a different internal dictionary for camera movement. Some words trigger precise mechanical behaviors. Others get interpreted loosely, folded into the model's general understanding of "the camera should probably do something here." Knowing which words land and which ones float is the difference between directing a shot and rolling dice.

The words that work everywhere

Let's start with the reliable ones. Across Runway Gen-4, Kling 3.0, Veo 3.1, Sora 2, and Seedance 2.0, these camera movement keywords produce consistent, recognizable results:

Pan left / pan right. Every model understands this. Horizontal rotation from a fixed position. It is the most basic camera movement in existence and the most reliably interpreted one. If you need horizontal scanning and nothing else, pan is your word.

Tilt up / tilt down. Vertical rotation, fixed position. Also universally understood. "Tilt up to reveal the building" gives you what you expect almost every time. The problems start when you combine tilt with other movements, but on its own it is solid.

Zoom in / zoom out. Here is the catch. Models understand zoom perfectly, but they also use zoom as a fallback when they do not understand what you actually asked for. You type "dolly in" and the model does not know what dolly means, so it zooms instead. The output looks similar to an untrained eye. To a trained one, it looks wrong. Zoom compresses space. Dolly changes perspective. Different animals.

Static / locked off / fixed camera. Surprisingly useful. When you do not specify camera movement, most models will add subtle drift or gentle motion because their training data is full of handheld footage. Explicitly saying "static camera, no movement" or "locked-off tripod shot" gives you a genuinely still frame. Kling and Runway respect this well. Sora sometimes adds micro-drift anyway.

The words that depend on who is listening

This is where it gets interesting. And frustrating.

Dolly in / dolly out. A dolly is a physical camera movement toward or away from the subject. The background shifts. Perspective changes. Parallax is visible. Runway Gen-4 in Director Mode understands this distinction. It will give you actual perspective shift, foreground objects moving faster than background. Kling 3.0 also handles dolly well, producing genuine depth movement. Veo tends to treat "dolly" and "zoom" as near-synonyms. Sora splits the difference, sometimes nailing the parallax, sometimes just zooming. Seedance 2.0 handles it better than most when you pair it with speed and distance cues: "slow dolly in from medium shot to close-up" works. "Dolly in" alone is a coin flip.

Tracking shot / follow shot. You want the camera to move alongside a subject. Maybe a person walking, a car driving, a runner on a trail. "Tracking shot" is well understood by Runway and Kling. Both produce lateral movement that maintains subject framing. Seedance handles sequential tracking well because of its multi-action parsing. Veo prefers "camera following" over "tracking shot" as phrasing. Sora understands the intent but can struggle with maintaining consistent distance from the subject over longer generations.

Crane up / crane down. Vertical movement of the entire camera, not just the angle. A crane up reveals context above and around the subject. A tilt up just looks up. The difference matters. Runway executes crane shots when you specify them in Director Mode. Kling 3.0 understands crane vocabulary and produces appropriate vertical travel with perspective change. Veo is hit-or-miss. Sora tends to interpret "crane" as "tilt" about half the time. For Sora, "camera rises vertically, revealing the landscape below" gets you closer than "crane up."

Orbit / arc shot. Camera circles the subject. This is a showpiece movement and models love it because the training data is full of product shots and hero reveals that use orbital paths. Runway, Kling, and Seedance all handle orbit well. "Camera slowly orbits clockwise around the subject" is one of the most reliable complex movement prompts you can write. Veo responds better to "arc" than "orbit" for reasons known only to Google. Sora handles orbits but occasionally loses subject tracking mid-rotation on longer clips.
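If you are building prompts programmatically, the per-model quirks above can live in a small lookup table instead of your memory. This is an illustrative sketch, not any model's real API: the model keys, movement names, and phrasings are just the observations from this section encoded as strings.

```python
# Per-model phrasing overrides, drawn from the notes above.
# Keys and values are plain strings; adapt them to your own pipeline.
PHRASINGS = {
    ("veo", "tracking shot"): "camera following the subject",
    ("veo", "orbit"): "camera arcs around the subject",
    ("sora", "crane up"): "camera rises vertically, revealing the scene below",
    ("seedance", "dolly in"): "slow dolly in from medium shot to close-up",
}

def phrase(model: str, movement: str) -> str:
    """Return the model-specific phrasing, falling back to the generic term."""
    return PHRASINGS.get((model.lower(), movement.lower()), movement)
```

So `phrase("veo", "orbit")` yields the arc wording, while `phrase("runway", "dolly in")` passes the generic term straight through, since Runway handles it as-is.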

The words that mostly do not work

Steadicam. In real filmmaking, a Steadicam is a specific rig that produces smooth, floating handheld movement. It is not a tripod. It is not a dolly. It has a particular quality: human-paced movement with mechanical smoothness. Almost no AI video model distinguishes "Steadicam" from "smooth camera movement." Type "Steadicam" and you get the same output as "smooth tracking." The word is wasted specificity. Just describe the movement you want and add "smooth" or "stabilized."

Jib. Similar to crane but the word itself does not register with most models. "Crane" works. "Jib" does not. They are mechanically similar in real production (a jib is a type of crane arm), but the training data overwhelmingly labels these shots as "crane" not "jib." Use crane.

Whip pan. You want a fast horizontal pan with motion blur. Some models will give you a fast pan if you say "whip pan." Others will give you a normal-speed pan. The more reliable phrasing is "very fast pan left with heavy motion blur" because it describes the visual result rather than the technique name. Kling handles whip pan better than most. Runway understands it in Director Mode. The others need the description spelled out.

Rack focus. This is not camera movement. It is a lens operation. But people include it in camera direction prompts constantly. "Rack focus from foreground to background" works in Runway Gen-4 and Veo. Kling sometimes interprets it correctly. Sora and Seedance tend to ignore it or produce a cut rather than a smooth focus pull. If focus control matters to your shot, Runway is currently the most reliable interpreter.
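The dead-vocabulary fixes above are all simple substitutions: swap the rig name for a description of the visual result. A minimal sketch, assuming a plain-text prompt string (the replacement wordings are the ones suggested in this section, not canonical phrasings):

```python
import re

# Unreliable technique names mapped to descriptions of the visual result,
# per the notes above. Word boundaries keep "jib" from matching inside words.
SUBSTITUTIONS = {
    r"\bsteadicam\b": "smooth stabilized camera movement",
    r"\bjib\b": "crane",
    r"\bwhip pan\b": "very fast pan with heavy motion blur",
}

def describe_instead(prompt: str) -> str:
    """Swap rig jargon that models ignore for descriptive phrasing."""
    for pattern, replacement in SUBSTITUTIONS.items():
        prompt = re.sub(pattern, replacement, prompt, flags=re.IGNORECASE)
    return prompt
```

Run on "Whip pan to the door" this produces "very fast pan with heavy motion blur to the door", which is the phrasing the models actually parse.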

The speed problem

Camera movement keywords without speed modifiers are half-finished instructions. "Dolly in" tells the model what to do but not how fast. And speed changes everything about the emotional register of a shot.

"Slow dolly in" builds tension, intimacy, dread. "Fast dolly in" creates urgency, surprise, impact. The word "slow" before any camera movement keyword reliably produces slower motion across all models. "Fast" is less reliable. Runway and Kling respect speed modifiers well. Veo occasionally ignores "fast" and gives you medium-speed movement. Seedance responds well to explicit speed language: "rapid," "gradual," "creeping."

The real unlock is combining speed with distance. "Slow dolly in from wide shot to medium close-up over the full duration" tells the model the speed, the starting framing, the ending framing, and the timeframe. That is four constraints on a single movement. The more constraints you give, the less the model improvises. And model improvisation is where things go sideways.
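Those four constraints compose mechanically, so they are easy to template. A hedged sketch (the function name and defaults are mine, not a standard; the output format simply mirrors the example above):

```python
def camera_move(movement: str, speed: str = "slow",
                start: str = "", end: str = "",
                duration: str = "over the full duration") -> str:
    """Compose speed + movement + start/end framing + timeframe
    into a single four-constraint camera instruction."""
    parts = [f"{speed} {movement}"]
    if start and end:
        parts.append(f"from {start} to {end}")
    parts.append(duration)
    return " ".join(parts)
```

`camera_move("dolly in", "slow", "wide shot", "medium close-up")` reproduces the example from the paragraph above. The point of templating it is that you never ship a bare "dolly in" again: every movement leaves your pipeline with speed, framing, and timeframe pinned down.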

Combining movements

Real cinematography rarely uses one movement at a time. A crane up might combine with a slow pan. A dolly in might pair with a slight tilt down. This is where most AI video models start sweating.

Runway Gen-4 is the best at combined movements right now. Director Mode lets you layer camera instructions, and the model executes them simultaneously rather than sequentially. "Slow dolly in while panning slightly right" produces a compound movement, not "dolly in then pan right."

Kling 3.0 handles combined movements reasonably well in text prompts, but its motion brush gives you finer control. You can paint motion paths for different parts of the frame, which is a different paradigm entirely. Less prompting, more drawing.

Seedance 2.0 handles sequential movements better than simultaneous ones. "Camera tracks right then cranes up to reveal the skyline" works. "Camera tracks right while craning up" sometimes produces one or the other, not both.

Sora and Veo both struggle with compound movements. If you need two things happening at once, describe the resulting motion rather than the two components. "Camera moves diagonally upward and to the right" sometimes works better than "dolly right while craning up."
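The pattern across these four paragraphs is that each model wants compound movements phrased one of three ways: simultaneous ("while"), sequential ("then"), or merged into one described motion. A sketch of that routing, with the style table encoding only what this section claims; the merged description has to come from you, since it depends on the shot:

```python
# How each model best receives a two-part movement, per the notes above.
# "while" = simultaneous, "then" = sequential, "merge" = describe the
# single resulting motion instead of naming both components.
COMPOUND_STYLE = {
    "runway": "while",
    "kling": "while",
    "seedance": "then",
    "sora": "merge",
    "veo": "merge",
}

def combine(model: str, first: str, second: str, merged: str) -> str:
    """Phrase a two-part camera move for a given model. `merged` is your
    own description of the single resulting motion, used for merge-style
    models like Sora and Veo."""
    style = COMPOUND_STYLE.get(model.lower(), "while")
    if style == "merge":
        return merged
    return f"{first} {style} {second}"
```

For Seedance this yields "camera tracks right then cranes up"; for Sora it hands back whatever merged motion you wrote, such as "camera moves diagonally upward and to the right".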

What this means for your workflow

The takeaway is simple and slightly annoying: you cannot use the same camera direction vocabulary across every model and expect the same results. "Slow crane up revealing the city at dawn" is a perfectly clear instruction to a human DP. Runway executes it. Sora tilts instead. Veo depends on the day.

CinePrompt gives you the full cinematography vocabulary and keeps it precise. A dolly is a dolly. A crane is a crane. The models are not all there yet, but they are getting there fast. A year ago none of them understood "dolly" as distinct from "zoom." Now Runway and Kling produce real parallax. The direction is clear. CinePrompt is built for where this is headed: a world where every model speaks the same language a DP does.

In the meantime, knowing which words work where is still your edge. This article is a snapshot. It will age. That is the point.


Bruce Belafonte is an AI filmmaker at Light Owl. He has strong opinions about the difference between a dolly and a zoom and will explain them to you at length whether you asked or not.