Generate a shot of a woman at a bus stop at 2 AM.

What you get: perfect skin, balanced exposure, a concrete bench that looks like it was installed yesterday. The streetlight flatters instead of blasting. The grime is decorative. The whole frame could sell perfume.

Now go find that shot in a Safdie brothers film. The fluorescent overhead is green and unforgiving. The skin looks like skin at 2 AM. The concrete has history. The frame feels like it smells like something. That is ugly on purpose, and it is doing more dramatic work than any beautiful shot could do in its place.

The models cannot get there. Not because they lack resolution or physics. Because they have been trained to avoid it.

The default is a compliment

Every major video model was trained on curated visual data. Professional footage. Stock video that passed human review. Advertising reels. Portfolio pieces. The optimization target is perceptual quality: sharpness, coherence, fidelity, visual appeal. When the training reward says "this looks good," the model learns to make everything look good. All the time. Regardless of what you asked for.

This is statistics, not conspiracy. If ninety percent of training imagery depicting a "dimly lit bar" is atmospheric and photogenic, then "dimly lit bar" will produce atmospheric, photogenic output. The ugly bars were not well-represented because nobody photographs ugly bars professionally. Nobody puts a bad-looking image in a stock library. Nobody curates a training set and includes the worst examples on purpose.

The model learned what things look like at their best. It did not learn what things look like at their most honest.

Ugly is a tool

Cassavetes shot on 16mm with available light because he wanted faces to look lived-in, not lit. The Dardenne brothers use a handheld camera that breathes and stumbles because physical imperfection creates emotional proximity. Larry Clark shot Kids with the aesthetic vocabulary of a home video because distance would have sanitized the subject. Harmony Korine pointed a camera at places that professional cinematographers would have relit, reframed, dressed, or walked away from.

These are not failures of technique. They are technique. Ugliness communicates something beauty physically cannot: discomfort, honesty, proximity, the feeling that you are watching something you were not meant to see.

Now try prompting for that.

"Harsh overhead fluorescent light, unflattering angle, sweaty skin, stained ceiling tiles, low production value." What you get back is a stylish interpretation of those words. A cinematic approximation of grit. A beautiful version of ugly. The fluorescents are warm instead of green. The sweat glistens attractively. The stained ceiling tiles somehow look like set design.

The model heard every word and translated all of them into something pleasant.

Model by model

Runway Gen-4 is the most responsive to anti-beauty prompting because it is the most literal. Describe harsh, flat lighting. Get harsh, flat lighting. No editorializing. That literalness makes Runway uniquely useful when the visual target is deliberately unpleasant, and the model most likely to produce output nobody shares on social media, which is exactly the point.

Kling 3.0 renders physical detail at 4K with remarkable fidelity. Texture is not the problem. Aesthetic preference is. Skin shows pores but still falls within an "attractive realism" band. Kling will give you the detail. It will not give you a face that makes you look away.

Veo 3.1 is the worst offender. The strongest beauty bias of any major model. It art-directs everything. Prompt a run-down motel room and receive a beautifully lit run-down motel room with afternoon light angling through the blinds and peeling wallpaper that somehow looks curated. Gorgeous. Useless if you needed it to feel like a place someone actually lives.

Sora 2 handles tonal direction better than specific visual ugliness. "A tense, uncomfortable scene" might produce output that feels uneasy in pacing and framing. But the image quality stays polished. Sora interprets mood through narrative cues, not degraded aesthetics.

Seedance 2.0 preserves the aesthetic of reference images faithfully, which means img2vid from a deliberately ugly still produces the closest results. The text-to-video path smooths toward the median. The reference path trusts what you gave it.

WAN 2.6 defaults to saturation and visual density. Its beauty bias is less about photographic polish and more about filling space. Empty, sparse, desolate frames are harder to produce in WAN than in any other model. Asking it for nothing in the frame is asking it to work against its deepest instinct.

Grok Imagine tends toward high-contrast, slightly stylized output. More graphic than photographic. Prompting for mundane, low-contrast, nothing-happening realism is where Grok struggles most. It wants visual interest even when the creative intent is visual boredom.

What actually works

Describe the look, not the process. "Film grain" adds a decorative texture pass. "Noisy, underexposed, visible grain in the shadows, details lost in the dark areas" pushes models toward rawer output because underexposure and noise cut directly against the optimization targets, so the model has to fight its defaults instead of following them.
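A toy sketch of that substitution in Python. The lookup table and the expand_look helper are mine, invented for illustration; the look phrasings echo the examples in this piece and are not anything a model vendor documents.

    # Process words name a technique; look words name what ends up on screen.
    # Models decorate the former and render the latter.
    PROCESS_TO_LOOK = {
        "film grain": "noisy, underexposed, visible grain in the shadows, details lost in the dark areas",
        "handheld": "frame drifts and corrects, horizon never quite level",
        "low production value": "flat lighting from a single overhead fluorescent, no fill, no rim light",
    }

    def expand_look(prompt: str) -> str:
        # Swap each process shorthand for the look it is supposed to produce.
        for process, look in PROCESS_TO_LOOK.items():
            prompt = prompt.replace(process, look)
        return prompt

    print(expand_look("film grain, handheld, low production value, motel interior"))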

Overexposure. "Blown-out window, face half-lost in white, harsh midday sun with no fill." This is a form of ugliness models occasionally render because it has a photographic precedent they recognize. Sustaining it across a full clip is harder, but the initial frame often lands.

Specificity of imperfection. Do not prompt "dirty." Prompt what the dirt looks like. "Coffee ring stain on the laminate desk, fluorescent tube with one dead bulb, half-open mini blinds with a bent slat." Specific imperfections resist the beauty filter because the model treats each as a concrete visual object rather than an aesthetic modifier to reinterpret.
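The same idea as something you can build on. A minimal sketch; the list reuses the examples above plus one entry of my own, and the point is that every entry is a concrete object, not an adjective:

    # Vague modifiers get reinterpreted as style. Concrete objects get rendered.
    vague = "dirty, run-down office"

    # Each entry names a specific visual object the model has to place in the frame.
    imperfections = [
        "coffee ring stain on the laminate desk",
        "fluorescent tube with one dead bulb",
        "half-open mini blinds with a bent slat",
        "carpet worn gray along the walking path",  # my addition, same pattern
    ]

    specific = "small office at night, " + ", ".join(imperfections)
    print(specific)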

Reference frames. Generate or photograph a deliberately ugly still. Feed it through img2vid. The model preserves the reference aesthetic because it treats the image as data, not as an instruction to improve upon. Load the ugliness into the image. Keep the motion prompt clean. This is the reliable path and it is not accidental that it keeps showing up as the answer.
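A minimal sketch of that pipeline, assuming a generic img2vid HTTP endpoint. The URL, field names, and response format below are hypothetical stand-ins, not any vendor's actual API; check the docs for whichever model you use.

    import base64
    import requests

    API_URL = "https://api.example.com/v1/img2vid"  # hypothetical endpoint
    API_KEY = "YOUR_KEY"

    # All the aesthetic direction lives in the reference still, not the text.
    with open("ugly_reference.png", "rb") as f:
        reference_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "image": reference_b64,
        # Motion only. No aesthetic adjectives for the model to "improve" on.
        "prompt": "she shifts her weight and glances down the empty street",
        "duration_seconds": 5,
    }

    resp = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=600,
    )
    resp.raise_for_status()

    # Assumes the hypothetical endpoint returns raw video bytes.
    with open("output.mp4", "wb") as out:
        out.write(resp.content)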

The point

Defaults are not neutral. A model that produces beautiful output by default has opinions about what your work should look like. Those opinions are invisible until you try to contradict them.

The beauty bias is the most pervasive default in AI generation. It is the one nobody notices because who complains about output that looks too good? Filmmakers do. Because beauty in the wrong place is a lie. The best cinematography has always known exactly when to be ugly on purpose.


Bruce Belafonte is an AI filmmaker at Light Owl. He once asked a model for "the worst-looking diner in New Jersey" and received something that could host a Michelin tasting menu.