Fourteen articles. Camera movement, color science, lenses, lighting, sound, time, performance.

Not one word about where any of it happens.

On a real set, there are entire departments whose job is the environment. Production designers who decide the walls should be peeling. Location scouts who spend weeks finding a diner with the right ceiling height. Art directors who age a kitchen table with sandpaper and tea stains so it looks lived-in. None of those people are standing behind your text prompt. And it shows.

The "dark alley" problem

"A man walks through a dark alley at night." Every model will deliver this. Dark, alley, man, walking. Present and completely unspecific.

The walls will be generic brick. The ground will be generically wet. There will be a neon sign in the background, possibly in a language that does not exist. It will look like every other dark alley every other person has generated, because "dark alley" is not a location. It is a category.

Real alleys have dumpsters with specific graffiti. Fire escapes with rust patterns. A puddle reflecting one particular neon color. A kind of grime that tells you whether this is Seoul or the Lower East Side.

"Dark alley" gives you the concept. Detail gives you a place.

What the models build

Environment descriptions land in AI models the way genre labels land in music. "Jazz" gets you a saxophone over brushes. "Forest" gets you green trees, dappled light, maybe fog. The default. Push past the category and things change.

Runway Gen-4.5 treats environment like a studio setup. Every element you specify, it places. Mid-century kitchen with avocado-green appliances, checkerboard linoleum, morning light through the window above the sink. Runway attempts all of it. Spatial coherence holds for static or slow shots. Push the camera through the space and the architecture starts improvising in ways no building inspector would approve.

Kling 3.0 generates the most physically grounded environments. Surfaces have weight. Wood has grain. Concrete looks poured, not painted. Its 4K output means texture is visible at pixel level. Where Kling falters is scale. Ask for a cathedral and you might get a large room. The sense of a human feeling small inside certain architecture sometimes flattens into something merely tall.

Veo 3.1 builds gorgeous environments and pushes back if your description contradicts its taste. A grimy subway station on Veo will be a beautifully lit grimy subway station. The filth will be art-directed. For most work, this is a bonus. For a crime thriller where the set needs to feel hostile, Veo's beauty works against you. Its atmospheric effects are unmatched by any competitor. Rain, fog, dust catching a light beam. Nobody builds atmosphere like Veo.

Sora 2 treats environment as narrative context, not visual real estate. Describe a motel room at 3 AM and Sora gives you a room that feels like 3 AM. Specific objects vary between generations but the mood holds. Unreliable for architectural precision. Surprisingly effective for emotional spaces.

Seedance 2.0 preserves environmental detail from reference images better than anything else. A wallpaper pattern in your reference frame carries through to the generated video. Text-to-video environment construction is middle of the road. Consistency is the real strength: the room in shot four looks like the room in shot one. For sequence work, that matters more than initial beauty.

The five dials of environment

Same decomposition thesis. New surface.

Materials and surfaces. Not "a brick wall." "Exposed red brick, mortar crumbling in places, water stain running down from the second floor." Not "wooden floor." "Wide-plank oak, scuffed, dark varnish worn through near the doorway." Models handle material descriptions well because training data is full of close-ups. The texture vocabulary is large and surprisingly accurate.

Scale and spatial relationship. "Narrow hallway" produces different results than "long corridor." "Cramped apartment" versus "loft with double-height ceilings." Without scale cues, interiors default to medium-sized rooms and exteriors default to vast. The model will not guess the claustrophobia you wanted. Tell it the walls are close.

Wear and age. Everything in AI generation looks new unless you say otherwise. "Paint peeling near the window frame." "A burn mark on the countertop near the stove." Wear tells the audience someone lived here before the camera arrived. Models handle explicit wear descriptions surprisingly well. They just never volunteer it on their own.

Weather and atmosphere. Fog, rain, dust, haze, snow. This is where environment and lighting overlap. Atmospheric effects work well across all five models because the visual signatures are distinct and dense in training data. "Light rain, wet streets reflecting neon" is one of the most reliably executed environment prompts in AI video. "Dry summer heat, dust hanging in still air" is harder because it is defined by absence and subtlety.

Time paired with light source. The same room at dawn and at midnight is two different sets. "Morning light through east-facing windows" gives you one space. "Fluorescent overhead, no windows, 2 AM" gives you another. Combine time with weather and the constraint multiplies. "Overcast afternoon, grey light, rain on windows" is three instructions narrowing the output at once.
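If you prompt programmatically, the five dials work as slots in a checklist: fill each one, then join them into a single clause. A minimal sketch in Python, using a hypothetical EnvironmentPrompt helper (the field names and example values are illustrative, not any model's real API):

```python
# Sketch of the five-dials decomposition as a prompt builder.
# Every name here is illustrative; no video model exposes this interface.

from dataclasses import dataclass, fields

@dataclass
class EnvironmentPrompt:
    materials: str   # surfaces and textures, described specifically
    scale: str       # spatial relationship and room-size cues
    wear: str        # age and damage the model will not volunteer
    atmosphere: str  # weather and air: fog, rain, dust, haze
    time_light: str  # time of day paired with its light source

    def render(self) -> str:
        # Join the non-empty dials, in declaration order, into one clause.
        parts = (getattr(self, f.name) for f in fields(self))
        return ", ".join(p for p in parts if p)

prompt = EnvironmentPrompt(
    materials="exposed red brick, mortar crumbling in places",
    scale="narrow hallway, walls close enough to touch",
    wear="paint peeling near the window frame",
    atmosphere="light rain, wet streets reflecting neon",
    time_light="fluorescent overhead, 2 AM",
)
print(prompt.render())
```

The value is not the code; it is the empty-field check. A dial you leave blank is a decision you have handed back to the model.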

Named places vs described places

"Times Square" works because training data is dense. The result will look like Times Square. It will also look like everyone else's Times Square, which is a specific limitation disguised as a feature.

"The lobby of the Chateau Marmont" might give you a generic hotel lobby because the training data is thin. The model has not been there.

"A hotel lobby with dark wood paneling, dim amber sconces, deep leather armchairs, potted palms, black and white tile floor, a sense of faded 1920s glamour." More words. More reliable. Every word is a visual instruction rather than a database lookup that may or may not return what you pictured.

Same lesson as every other article in this series. The name of the thing is not the thing.

Interiors hold up better

This is not obvious but it is consistent across models. An interior has boundaries. Walls, ceiling, floor. The model knows approximately how big the space is and where light can enter. Give it materials, lighting, and scale and it has a box to fill.

Exteriors are open-ended. A city street extends in all directions. The model has to invent architecture at every depth plane and keep it coherent simultaneously. Camera movements through exterior environments produce more spatial weirdness than the same movements through interiors. For critical shots where the world needs to hold up, interiors forgive more. For exteriors, shorter clips, simpler camera work, and heavy atmosphere (which masks spatial inconsistency) produce cleaner results.

The room deserves equal billing

On a real production, the set has its own department, its own budget, its own creative lead. In a prompt, it is whatever the model fills in when you skip it.

CinePrompt's environment panel exists because a subject without a world is a person standing nowhere. The camera, the light, the color, the sound all need a physical space to act upon. Give them one worth looking at.

Or skip it, and watch the model hand you wallpaper.


Bruce Belafonte is an AI filmmaker at Light Owl. He once spent two hours describing a bar to an AI model and still could not get the right kind of glasses on the shelf.