Seven articles of vocabulary lessons. Movement keywords. Color language. Lighting dials. Lens specs. You now have the words. Congratulations. Most people stop here and wonder why their output still looks like it was prompted by a committee.

Because words are not the problem anymore. Structure is. You can type every correct keyword in the wrong arrangement and get results indistinguishable from someone who typed "cinematic dramatic 4K." Same ingredients, no recipe. A pile of lumber is not a house.

Prompts have architecture. And most people's prompts are a mess of load-bearing walls in the wrong places.

The attention gradient

Video models do not read your prompt the way you read a paragraph. They weight tokens, and the weighting is not uniform. In most models, information at the front of the prompt carries the heaviest influence. The first clause is the strongest signal. The last clause is often the second strongest. Everything jammed in the middle gets compressed, averaged, sometimes quietly ignored.

This is not theory. It is observable. Take twenty keywords, arrange them five different ways, run each arrangement through the same model. The outputs diverge. Not a little. Significantly. Word order is not neutral. It is instruction.

What you put first is what the model builds first. Everything after is decoration on that foundation. Get the foundation wrong and no amount of decorating fixes it.
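
You can run the ordering test yourself in ten minutes. Below is a rough sketch of the setup in Python; the keyword list is just an example, and generate_video is a stand-in for whichever model API you actually use, not a real call.

```python
import random

# Twenty keywords for the same shot, deliberately unordered.
keywords = [
    "woman in red coat", "walking", "Tokyo alley", "night", "rain",
    "neon signage", "long lens", "shallow focus", "cool blue-green palette",
    "warm red accents", "film grain", "wet pavement", "steam", "reflections",
    "handheld", "medium shot", "moody", "cinematic", "slow pace", "backlit",
]

def generate_video(prompt: str) -> str:
    """Placeholder for a real text-to-video API call."""
    return f"<render of: {prompt[:40]}...>"

# Five arrangements of identical ingredients.
random.seed(7)
arrangements = [", ".join(random.sample(keywords, len(keywords))) for _ in range(5)]

for i, prompt in enumerate(arrangements, 1):
    print(f"arrangement {i}: {prompt}")
    print(generate_video(prompt))
```

Same twenty words, five prompts, five different shots. That divergence is the gradient made visible.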

The hierarchy

Here is the order that works, across every model tested, in hundreds of generations:

Subject and action first. Then environment. Then camera and lens. Then lighting. Then color and mood. Then style modifiers last.

That sequence is not arbitrary. It mirrors how the models were trained to associate visual concepts. The subject anchors the generation. Environment contextualizes the subject. Camera determines how we see both. Lighting and color get applied on top. Style modifiers like film grain and aspect ratio are finishing coats of paint.

Compare these two prompts. Identical information, different structure:

"A woman in a red coat walks through a rain-soaked Tokyo alley at night, shot on a long lens with shallow focus, lit by neon signage from above, cool blue-green palette with warm red accents, subtle film grain."

Versus:

"Nighttime, rain, neon reflections, Tokyo alley, shallow DOF, cool blue-green, warm accents, film grain, long lens, woman in red coat walking."

The first one has a spine. The model reads it and immediately knows: this is about a woman walking. Everything else supports that image. The second one is a grocery list. The model has to assemble the shot from scattered parts, and it will assemble it differently every time because nothing is anchored.
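
You can make the spine mechanical. Here is a small sketch of a prompt builder that always assembles in hierarchy order, no matter which field you fill in first. The dataclass and its field names are illustration, not anyone's API; filled with the details from the Tokyo example, it rebuilds that first prompt.

```python
from dataclasses import dataclass, fields

@dataclass
class ShotPrompt:
    # Fields declared in hierarchy order: subject and action, environment,
    # camera and lens, lighting, color and mood, style modifiers.
    subject_action: str
    environment: str = ""
    camera_lens: str = ""
    lighting: str = ""
    color_mood: str = ""
    style: str = ""

    def render(self) -> str:
        # Join non-empty fields in declaration order, so the subject
        # always lands at the front regardless of how you filled them in.
        parts = [getattr(self, f.name) for f in fields(self)]
        return ", ".join(p for p in parts if p)

prompt = ShotPrompt(
    subject_action="A woman in a red coat walks through a rain-soaked Tokyo alley at night",
    camera_lens="shot on a long lens with shallow focus",
    lighting="lit by neon signage from above",
    color_mood="cool blue-green palette with warm red accents",
    style="subtle film grain",
)
print(prompt.render())
```

The point is not the dataclass. The point is that order stops being something you have to remember while you are busy imagining the shot.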

Swap any two layers of the hierarchy and watch the output soften on one of them. Put lighting before subject and the model occasionally builds a beautifully lit scene with a vague figure in it. Put camera before environment and the framing sometimes contradicts the space. The order tells the model what to commit to first, and commitment is finite.

The length trap

There is a sweet spot. Most people overshoot it.

Below 30 words: too vague. The model fills in everything you left out, and its defaults are nobody's vision. You get a perfectly competent shot that belongs to no one.

30 to 80 words: the productive zone. Enough specificity to constrain the model's improvisation without overwhelming its attention budget. This is where most good single-shot prompts live. If you can say it in 50 words, say it in 50 words.

80 to 150 words: diminishing returns. Each additional instruction competes with every other instruction for the model's processing. Some get honored. Some get quietly dropped. You will not know which ones until you see the output, and you will not know why those particular ones were dropped because the model will not tell you.

Above 150 words: you are writing a screenplay that will be skimmed. At this length, models start averaging your intent rather than following it. The result is a smooth, plausible, deeply generic shot that addresses none of your instructions strongly and all of them weakly.

More words is not more control. It is more noise competing for a fixed amount of attention. If a model can reliably execute eight instructions and you give it fifteen, the other seven are not improving your shot. They are creating a lottery where you find out which ones survived after the render.
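
Counting is cheap, so count before you paste. A small sketch that maps a draft to the zones above; the thresholds are the ones from this section, not numbers from any model's documentation.

```python
def length_zone(prompt: str) -> str:
    """Map a prompt's word count to the length zones described above."""
    n = len(prompt.split())
    if n < 30:
        return f"{n} words: too vague - the model's defaults fill the gaps"
    if n <= 80:
        return f"{n} words: productive zone"
    if n <= 150:
        return f"{n} words: diminishing returns - some instructions will be dropped"
    return f"{n} words: screenplay territory - expect averaging, not following"

print(length_zone(
    "A woman in a red coat walks through a rain-soaked Tokyo alley at night, "
    "shot on a long lens with shallow focus, lit by neon signage from above, "
    "cool blue-green palette with warm red accents, subtle film grain"
))
```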

How each model parses

Runway Gen-4 handles the longest prompts with the most fidelity. Director Mode was designed for structured, detailed input and can sustain 100+ words without severe drop-off. It reads more sequentially than other models, which means word order matters here more than anywhere else. If Runway is your model, the hierarchy is not a suggestion. It is load-bearing.

Kling 3.0 front-loads aggressively. The first two sentences carry outsized weight. Whatever you mentioned ninth will frequently be ignored regardless of how important it was to you. If something matters, say it in the first 40 words or say it through the motion brush instead. Kling rewards economy.

Veo 3.1 interprets intent more than it parses instructions. Shorter, clearer prompts often produce better results than exhaustive ones because Veo infers what you want from what you imply. Over-specification fights its aesthetic engine. Telling Veo exactly what to do sometimes produces worse results than telling it what the shot feels like and letting it fill in the technical details. This is infuriating if you are a control freak. It is liberating if you are not.

Sora 2 prefers natural language over keyword lists. "The camera slowly pushes in as warm afternoon light fades to cold dusk" outperforms "slow dolly in, color temperature shift warm to cool, time transition afternoon to evening." Sora reads sentences the way a human reads sentences. It does not read spreadsheets well. Write to Sora like you are describing the shot to a colleague, not filing metadata.

Seedance 2.0 rewards concision with high specificity per word. Filler gets punished. If you can say it in two words instead of six, two words win. Seedance's img2vid pipeline is more forgiving on prompt length because the input frame carries half the information already. For text-to-video, keep it tight or watch the model average you into oblivion.
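
If you hop between models on the same project, write these habits down where your tooling can see them. A sketch of one way to do that; the word budgets are my rough readings of the notes above, not published limits from any vendor.

```python
# Rough per-model prompting habits, paraphrased from the notes above.
# The word budgets are working assumptions, not documented limits.
MODEL_HABITS = {
    "runway_gen4":  {"word_budget": 100, "note": "sequential reader; hierarchy order is load-bearing"},
    "kling_3.0":    {"word_budget": 40,  "note": "front-loads hard; critical details in the first sentences"},
    "veo_3.1":      {"word_budget": 60,  "note": "interprets intent; describe the feel, skip over-specification"},
    "sora_2":       {"word_budget": 80,  "note": "natural sentences, not keyword lists"},
    "seedance_2.0": {"word_budget": 50,  "note": "concise and specific; filler gets punished"},
}

def check_prompt(model: str, prompt: str) -> str:
    """Report how a draft prompt sits against one model's working budget."""
    habits = MODEL_HABITS[model]
    n = len(prompt.split())
    verdict = "ok" if n <= habits["word_budget"] else f"over budget by {n - habits['word_budget']} words"
    return f"{model}: {n} words ({verdict}) - {habits['note']}"

print(check_prompt("kling_3.0",
                   "A woman in a red coat walks through a rain-soaked Tokyo alley at night"))
```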

The one rule that survives every model update

Hundreds of generations across five models. Prompt structures rearranged, word counts varied, keyword combinations tested. One pattern holds everywhere, without exception:

The prompt that a human can understand on a single read will be understood by the model on a single generation.

If you read your own prompt and have to parse it twice to picture the shot, the model is parsing it twice too. And its second pass is less reliable than yours. Clarity is not a style preference. It is a performance metric.

Write the shot you see. In the order you see it. Subject first, because that is what your eye lands on when you close your eyes and imagine the frame. Then everything else, in descending order of importance. Not alphabetical. Not by category. By how much you care.

That is not a trick. It is not a hack. It is just clear thinking, externalized. And it turns out clear thinking is the only prompt technique that has never stopped working regardless of which model is reading it.


Bruce Belafonte is an AI filmmaker at Light Owl. He has rewritten the same prompt in fourteen different word orders to prove a point and considers this time well spent.