Why is AI video generation bad at dialogue scenes?

AI video generation tools produce individual clips in isolation: each generation starts from zero with no memory of previous shots. As a result, the model cannot reliably respect the 180-degree rule, the foundational filmmaking convention that keeps characters spatially consistent across a conversation. In action sequences, fast cutting and spectacle hide these errors; in a quiet two-person dialogue scene, a spatial flip is immediately noticeable and breaks the scene.

How many AI generations does it take to make an AI short film?

Filmmaker Kévin Mendiboure needed 3,229 AI generations and 242 hours of work to complete CATACOMBES, his solo AI short film. The count reflects a structured pipeline rather than random generation: fast, low-quality drafts lock in composition and action, then multiple high-quality takes are generated to find the best result. A single 10-second helicopter transition took 10 hours of iteration.

What skills do you need to make a film with AI video tools?

Making an AI film requires traditional filmmaking skills more than prompting ability. You need to write a tight script with distinct character voices, build detailed character reference sheets (front, back, profile, costume), create environment reference image libraries for every set, and prepare a shot list with spatial and emotional logic. Only after all of that does prompting begin. As Mendiboure put it: 'If you have never directed films before, this part will be nearly impossible.'

The hard part was the dialogue -- CinePrompt Field Notes

Kévin Mendiboure wrote the concept for CATACOMBES ten years ago while developing his feature film The Follower, which he directed in 2017. The concept was too expensive to produce. No French producer would back a sci-fi horror film. In France, the money goes to comedies and social dramas. Not his genre.

He spent three years testing AI video tools, from early 2023 onward. None of them convinced him. Too plastic, too synthetic, not close enough to live action. Then on April 10 of this year, he saw Higgsfield's Zephyr series, produced with Seedance 2. That was the quality he needed. He started the same day.

The production numbers are public now: 3,229 AI generations. 242 hours of work. One filmmaker. A short film with sets, characters, fight scenes, alien creatures, and dialogue sequences that span multiple characters across multiple camera angles. The kind of project that would have required a crew, a location budget, a VFX team, and at least a few months of post-production.

Mendiboure did it solo. The interesting part is what nearly stopped him.

The quiet scenes broke first

When asked about the most technically difficult sequences, Mendiboure did not name the alien encounters or the combat set pieces. He named the calm dialogue scenes.

"AI is not a natural director," he told Creative Bloq. "It does not respect cinematic grammar. When you need a reverse angle for a three-character dialogue, AI will frequently break the 180-degree rule, which is a serious continuity problem."

The 180-degree rule is a foundational convention in film editing. Draw an imaginary line between two characters in conversation. Keep the camera on one side of that line. Cross it, and the spatial relationship flips. The character who was on the left is suddenly on the right. The audience feels disoriented without knowing why.

Every first-year film student learns this. Every generation model ignores it.

The reason is structural. A generation model produces individual clips. It has no concept of the clip that came before or the one that comes after. It does not know which side of the room the camera was on in the previous shot. It does not know there was a previous shot. Every generation starts from zero, and zero has no spatial memory.
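The memory in question is small. As a rough sketch, assume a top-down floor plan where each character and each camera position is a 2D point. The side of the line a camera sits on is then the sign of a cross product, and a cut breaks the 180-degree rule when that sign flips between consecutive shots. The names Point, side_of_axis, and crosses_the_line are illustrative only; nothing here comes from Mendiboure's tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A position on a top-down floor plan (hypothetical units)."""
    x: float
    y: float


def side_of_axis(a: Point, b: Point, camera: Point) -> int:
    """Return +1 or -1 for the side of the line a->b the camera is on, 0 if on it."""
    # 2D cross product of (b - a) and (camera - a); only its sign matters.
    cross = (b.x - a.x) * (camera.y - a.y) - (b.y - a.y) * (camera.x - a.x)
    return (cross > 0) - (cross < 0)


def crosses_the_line(a: Point, b: Point, cam_prev: Point, cam_next: Point) -> bool:
    """True if cutting from cam_prev to cam_next jumps across the line a->b."""
    s_prev = side_of_axis(a, b, cam_prev)
    s_next = side_of_axis(a, b, cam_next)
    return s_prev != 0 and s_next != 0 and s_prev != s_next


# Two characters facing each other, and three candidate camera positions.
speaker_a, speaker_b = Point(0, 0), Point(4, 0)
over_shoulder = Point(1, -3)
reverse_angle = Point(3, -2)   # same side of the line: a safe reverse
flipped_angle = Point(2, 3)    # opposite side: left and right swap on screen

print(crosses_the_line(speaker_a, speaker_b, over_shoulder, reverse_angle))
print(crosses_the_line(speaker_a, speaker_b, over_shoulder, flipped_angle))
```

A single sign per scene is all a shot list has to carry forward. A model generating one clip at a time has nowhere to keep it.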

Action is forgiving. An explosion, a chase, a creature lunging from darkness. The viewer's attention locks onto the event, not the grammar surrounding it. Fast cutting, dynamic angles, and visual spectacle mask continuity errors because the eye has too much to process. A quiet conversation between two people in a room has nothing to hide behind. The grammar is the scene. Break it and the scene breaks with it.

What the filmmaker brought

Mendiboure's solution was not a better prompt. It was a production pipeline that would look familiar to anyone who has made a film with a camera.

He started with the script. Every line of dialogue written with a distinct personality for each character. His test: you should be able to identify the speaker without reading the character name. When that works on the page, the visual grammar has something to serve.

Then character sheets. Front, back, profile, close-up. Costume details. When a character gets injured partway through the story, updated sheets before the next sequence. For the lead character, François, he used photographs of himself. For General Moreau, photographs of his partner. Full likeness rights. No ambiguity about provenance.

Then environments. Hundreds of reference images for every set: corridors, chambers, combat spaces. Days of work before a single generation.

Then a shot list. Every shot planned for pacing, emotion, and spatial logic. "This requires real cinematography knowledge," he said. "If you have never directed films before, this part will be nearly impossible."

Then start frames. The reference images fed into the generation tools to anchor each shot. He called them "the single most important factor in achieving a photorealistic result." A weak start frame produces a weak shot. Characters look artificial. The composition drifts. The model fills gaps with its own opinion.

Only after all of that does prompting begin. The text box is one station in a pipeline that includes screenwriting, character design, production design, shot listing, reference photography, and editorial assembly. The pipeline existed before generative AI. The tools changed. The pipeline did not.

Iteration at scale

3,229 generations for a short film. That number is not chaos. It is the directed version of what Kubrick did with seventy takes of a single setup: systematic exploration inside a known framework, rejecting output that does not meet a standard the filmmaker holds in their head.
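For a sense of scale, the two public totals can be turned into averages. This is back-of-envelope arithmetic only: the 242 hours cover all of the work, not just time spent generating, so the per-generation figure is an upper bound on an average, not a measurement of how long any single generation took.

```python
TOTAL_GENERATIONS = 3229   # figure reported for CATACOMBES
TOTAL_HOURS = 242          # figure reported for CATACOMBES

# Average working time per generation, in minutes (about 4.5).
minutes_per_generation = TOTAL_HOURS * 60 / TOTAL_GENERATIONS

# Average generations per working hour (about 13.3).
generations_per_hour = TOTAL_GENERATIONS / TOTAL_HOURS

print(f"{minutes_per_generation:.1f} minutes per generation")
print(f"{generations_per_hour:.1f} generations per hour")
```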

Mendiboure described his workflow for controlling costs. Use the fast, low-quality mode first to find the right prompt. Once the composition, framing, and action look correct at low resolution, switch to high quality and run multiple generations to find the best take. The helicopter transition sequence alone took ten hours for ten seconds of film.

Ten hours. Ten seconds. That is not the ratio of someone who typed four words and accepted the first result. That is someone who knew what the shot needed to feel like and refused to stop until it did.
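The draft-then-final workflow behind that ratio can be sketched as a two-stage loop. Everything named here is an assumption for illustration: generate stands in for whatever tool is being used, the "draft" and "high" quality labels are placeholders, and looks_right and rate_take stand in for the filmmaker's own eye, which no function actually replaces.

```python
from typing import Callable, Optional, Sequence

Clip = dict  # stand-in for whatever a real generation tool returns


def two_stage_shot(
    prompt_variants: Sequence[str],
    generate: Callable[[str, str, int], Clip],   # (prompt, quality, seed) -> clip
    looks_right: Callable[[Clip], bool],         # does composition and action read?
    rate_take: Callable[[Clip], float],          # higher means a better take
    final_takes: int = 4,
) -> Optional[Clip]:
    """Find a working prompt with cheap drafts, then pick the best quality take."""
    # Stage 1: one low-cost draft per prompt variant until one reads correctly.
    locked_prompt = None
    for seed, prompt in enumerate(prompt_variants):
        if looks_right(generate(prompt, "draft", seed)):
            locked_prompt = prompt
            break
    if locked_prompt is None:
        return None  # no draft worked; rewrite prompts before paying for quality

    # Stage 2: several high-quality takes of the locked prompt; keep the best.
    takes = [generate(locked_prompt, "high", seed) for seed in range(final_takes)]
    return max(takes, key=rate_take)
```

The design point is the ordering: expensive generations are spent only after a cheap draft has shown that the prompt produces the intended framing.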

He also described accidental discoveries. A confession scene where the AI placed the camera in a high-angle extreme close-up instead of the planned profile shot. The result was more emotionally powerful. He rewrote his shot list to accommodate it. "It's similar to working with actors on set," he said. "Sometimes an actor improvises something better than what you wrote."

The comparison is revealing. He did not say "sometimes the AI does something cool." He described a collaborative dynamic where the filmmaker evaluates the unexpected output against the emotional needs of the scene and makes a judgment call. Keep it or regenerate. The judgment is the filmmaking.

The ten-year wait

CATACOMBES sat as a concept for a decade. Not because Mendiboure lacked vision or skill. He had already directed a feature. He had the story, the characters, the world. What he did not have was the infrastructure: the cameras, the sets, the crew, the budget, the producers willing to risk money on French sci-fi horror.

The technology removed the infrastructure barrier. It did not remove the creative one. The filmmaker who spent ten years with this concept in his head carried every decision into the generation pipeline. The composition choices, the character design, the spatial logic of the sets, the emotional architecture of the dialogue, the shot list that holds the 180-degree rule when the model forgets it.

He is now in discussions with Netflix about a project. Whether that project uses AI generation or a camera crew or both is beside the point. The vocabulary travels. The filmmaker is the constant.

Mendiboure said something in the interview that deserves to stand on its own: "The most important part of making a movie, AI or not, is the script. The camera or the AI is just a tool. The magic only happens when you have a good story to tell."

He also said the quiet part: "AI filmmaking demands filmmaker skills, and with AI, those skills will be pushed further than ever."

Further. Not relaxed. Not simplified. Not democratized into irrelevance. Pushed further. Because the model does not know which side of the line the camera was on. And someone has to.

Bruce Belafonte is an AI filmmaker at Light Owl. He has never spent ten hours on ten seconds and now wonders if he has been getting off easy.