Seventeen articles. Camera movement, lens language, lighting, color, sound, time, performance, environment, prompt structure. The vocabulary of what to generate. Not once have I mentioned where any of it should sit inside the rectangle.

Composition is the arrangement of visual elements within a frame. It is the first decision a cinematographer makes and the last thing anyone prompts for. The result is predictable: every AI-generated video puts the subject dead center, shoots from eye level, frames at medium distance, and calls it a day.

That is not composition. That is the absence of composition. It is what happens when nobody decides.

The center of the frame is a default, not a choice

Put a person in a prompt without positional language and every model will center them. Not because centering is correct. Because centering is the average of every composition in the training data. Wide shots, close-ups, portraits, landscapes, action scenes. Average them all together and you land on: center frame, eye level, medium distance.

A filmmaker picks where the subject sits for a reason. Left third, looking into negative space on the right. That empty space is the scene. It is where the character is going, or what they are watching, or what is about to arrive. Dead center communicates nothing except "the thing is here."

Specifying position in a prompt works better than you would expect and worse than you would hope. "Subject positioned at left third of frame" lands more reliably than "rule of thirds composition." The first is a spatial instruction. The second is a textbook reference. Same distinction that has run through this entire series: describe the visual result, not the name of the technique.

What composition language does

Subject placement. "Left of frame," "far right," "lower third," "upper corner." Direct positional instructions. Runway Gen-4.5 and Kling 3.0 handle these most literally. Veo 3.1 interprets them as suggestions and reserves the right to disagree. Sora 2 treats them as narrative cues, placing the subject where the story feels right, which is sometimes where you asked and sometimes not. Seedance 2.0 holds placement better in img2vid than text-to-video, consistent with its pattern of trusting the reference frame over the text.

Negative space. "Empty hallway stretching behind her." "Open sky filling the upper two thirds of the frame." Models understand emptiness when you describe what fills it. "Negative space" as a standalone term is less reliable than describing the actual space. Veo builds gorgeous emptiness when prompted for it. Kling resists vacancy, preferring to fill the frame with detail because its 4K resolution has pixels to spend.

Foreground layers. "Out of focus leaves in the immediate foreground." "Rain-streaked window between the camera and the subject." Foreground elements create depth, separate the frame into planes, and give the eye something to travel through. All five models respond to foreground descriptions, though the rendering varies. Runway produces the most convincing foreground bokeh. Kling makes foreground objects feel physically present at the pixel level. Veo occasionally decides foreground clutter is distracting and quietly removes it.

Camera height. Low angle, high angle, overhead, ground level. These are compositional tools wearing camera-direction costumes. A low angle makes a person powerful. A high angle makes them small. Ground level puts you in the dirt with whatever crawls there. Every model understands basic angle directions, though Sora 2 occasionally translates angle into narrative mood rather than geometric position. "Low angle" to Sora might mean "this character has authority," producing a standard-height shot of someone who simply looks confident.

Aspect ratio is not a format. It is a canvas.

Most people select 16:9 because it is the default, or 9:16 because the video is going vertical. Those are distribution decisions. They are not compositional ones.

2.39:1 widescreen forces horizontal thinking. Characters occupy lateral space. The frame rewards wide shots and eye-level compositions. Vertical information gets cropped, which means less headroom, less sky, less ceiling. Everything is about the horizon line.

9:16 vertical is the opposite architecture entirely. Vertical depth, vertical movement, subjects stacked above environments. A close-up in vertical has a fundamentally different relationship to the viewer than a close-up in scope. It is intimate in a way that widescreen physically cannot replicate.

4:3 is the canvas nobody picks anymore and probably should. It balances vertical and horizontal space. Portraits breathe. Rooms have height. It is the aspect ratio of undivided attention. Kubrick understood this. The independent film world keeps rediscovering it every five years.

Here is the problem: the models do not adjust their compositional logic based on aspect ratio. They crop. A 2.39:1 output is typically a 16:9 generation with the top and bottom removed. If you are generating in scope, your subject placement needs to account for that lost vertical space. Put important visual information in the center band. Specify wider horizontal spacing. Otherwise the model composes for the full frame and the delivery format crops away roughly a quarter of your vertical pixels.
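The arithmetic of that lost band is worth doing once. A minimal sketch, assuming a 16:9 render cropped to scope at delivery; the 1920x1080 numbers are illustrative, not tied to any particular model's output resolution:

```python
# Compute the 2.39:1 "safe band" inside a 16:9 generation.
# Assumption: the model renders the full 16:9 frame and the delivery
# format crops equal strips off the top and bottom.

def scope_safe_band(width: int = 1920, height: int = 1080,
                    target_ratio: float = 2.39) -> tuple[int, int, int]:
    """Return (band_height, top_edge, bottom_edge) of the surviving rows."""
    band_height = round(width / target_ratio)   # rows that survive the crop
    lost = height - band_height                 # total vertical rows removed
    top_edge = lost // 2                        # first surviving row
    return band_height, top_edge, top_edge + band_height

band, top, bottom = scope_safe_band()
print(band, top, bottom)   # 803 138 941
```

At 1080p, roughly 277 rows vanish: anything above row 138 or below row 941 never reaches the audience, which is why "important visual information in the center band" is a pixel budget, not a vibe.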

CinePrompt lets you set aspect ratio per shot. The compositional implications of that choice are on you. The tool can accommodate 2.39:1. It cannot tell you what to do about the headroom it ate.

Symmetry is easy. Asymmetry is the job.

Models produce symmetry beautifully. "Centered symmetrical composition, long corridor, single figure in the middle, vanishing point perspective." Every model nails this because the visual concept is unambiguous and training data is dense with it.

The interesting compositions are asymmetrical. A face at the far edge of frame, looking out. A crowd filling one side while the other holds nothing but a wall. A figure in the deep background, small, framed by a doorway in the midground. These require specific positional language because there is no single training concept to trigger. You have to build the geometry with words.

The more compositional elements you specify, the more the model juggles simultaneously. This is the attention budget problem from article eight. A prompt that declares subject, environment, lighting, color, camera movement, AND precise composition is requesting six commitments from a system that can hold maybe four with conviction. Something will slip. Composition, being abstract and spatial, usually goes first.

Practical advice: if composition matters for this shot, sacrifice something else. Let the model choose its own lighting. Accept its default color palette. Spend the attention budget on where things sit inside the frame.

Or use Frame to Motion and bake the composition into the reference image, where it becomes visual data instead of a textual instruction the model has to parse. The image already has your subject at left third. The motion prompt just needs to say what happens next.

The rectangle is not neutral

Every frame has a top, bottom, left, right, and center. Most prompts address none of them. The default composition is the statistical average of everything the model has seen, which is visually functional and emotionally vacant.

A filmmaker's first instinct walking onto a set is not "what am I shooting?" It is "where am I standing?" Position before content. Framing before action. The physical relationship between camera and subject determines what the audience feels before a single word of dialogue or note of music.

Seventeen articles in, and I keep finding dimensions of filmmaking that the prompt field does not naturally invite. Composition might be the most fundamental one I have missed. Not because it is the flashiest. Because it is the one that shapes everything else.

The frame is not a container. It is the argument.


Bruce Belafonte is an AI filmmaker at Light Owl. He once spent an hour nudging a digital subject three pixels to the left and considers this entirely reasonable behavior.