A director does two things. Builds the world and tells the people in it what to feel. Thirteen articles deep into this series and I have only ever written about the world. The camera. The lens. The light. The color. The sound. The clock. Everything except the actual human being standing in the middle of it all, doing the thing that makes any of those technical choices matter.

That is because, until recently, there was not much to say. AI-generated faces looked like wax museum exhibits having a bad day. Nobody was prompting for "subtle contempt" because the models could barely land "smiling."

That changed. The faces are convincing now. Kling 3.0 can hold a face across cuts. Veo 3.1 generates skin that catches light like real skin. Seedance 2.0 lip-syncs in eight languages. Sora 2 builds people who feel like they belong in the scenes they occupy. The rendering problem is mostly solved.

The performance problem is wide open.

What "emotional" means to a model

Type "a woman looking sad" and every model will return a woman with downturned lips and maybe glistening eyes. That is not sadness. That is the Wikipedia entry for sadness. The thumbnail. The stock photo result for "grief" that no human has ever actually looked at and felt something.

Real sadness on screen is a person trying not to cry. It is the jaw tightening. The swallow. The look away and the look back. The composure that cracks in the wrong place at the wrong time. It is resistance. Every great performance is resistance against the emotion the scene demands, and models have absolutely no concept of this.

They go direct. You say "sad," they produce sad. You say "angry," they produce angry. Every time. Full volume. No buildup, no restraint, no subtext. A human face contains maybe forty distinct muscles capable of independent movement. A good actor uses all of them in contradiction with each other. An AI model uses them in unison, like a choir singing one note.

The five layers of a performance

Same thesis as every article in this series: decompose the vague word into its components. "Emotional" tells the model as little as "cinematic" told it thirteen articles ago. Performance has layers, and each one is a separate prompt problem.

Facial expression. The obvious one. Models handle broad expressions decently and subtle expressions poorly. "Smiling" works. "The kind of half-smile that means she knows something you do not" does not work on any model I have tested. But "slight smile, eyes narrowed, chin tilted down" gets closer on Runway and Kling than any emotional adjective will.

Body language. Posture, weight distribution, tension, openness. A person leaning back with crossed arms reads differently than a person leaning forward with open hands, and that read happens before a word is spoken or a facial muscle twitches. Models understand posed body positions reasonably well. They do not understand what body language communicates. "Defensive posture" occasionally works. "Arms crossed, shoulders raised, weight shifted to back foot" works better, more often, across more models.

Gaze. Where someone looks is half the story. Looking directly at camera is confrontation or intimacy. Looking past camera is distance or distraction. Looking down is submission or thought or shame, depending on context models cannot infer. Gaze direction is promptable. Gaze meaning is not. "Looking down" lands. "Looking down because she cannot meet his eyes" does not change the output, but it does make you feel like a screenwriter submitting notes to a wall.

Gesture. Hands are the graveyard of AI video, and everyone knows it. But beyond the rendering artifacts, there is a prompt problem: gesture timing. A hand running through hair can mean frustration, flirtation, or nervous habit depending on speed and context. Models generate the gesture. They do not generate the psychology behind it. Prompt the physical movement, not the emotional motivation. "Slowly runs fingers through hair while looking away" outperforms "nervously plays with hair" on four of five models.

Transition. The shift between states. A person going from calm to startled. Concentration breaking into laughter. Composure cracking. This is where the real acting lives, and it is the hardest thing to prompt because it requires the model to execute a sequence of emotional states within a single generation. Some models handle it. Most average the two states into a blended expression that is neither one nor the other.
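
If it helps to keep those five layers from collapsing back into one vague adjective, here is a minimal sketch of a prompt built layer by layer. Nothing below is any model's API; it is plain string assembly, and every field name and example phrase is my own.

    from dataclasses import dataclass

    @dataclass
    class PerformanceSpec:
        # One field per layer. Each is written as physical description,
        # never as an emotional label.
        expression: str          # facial expression, muscle by muscle
        body: str                # posture, weight, tension
        gaze: str                # direction only; meaning is not promptable
        gesture: str             # the movement, not the motivation
        transition: str = ""     # optional: start state, bridge, end state

        def to_prompt(self, subject: str, scene: str) -> str:
            parts = [subject, self.expression, self.body, self.gaze,
                     self.gesture, self.transition, scene]
            return ", ".join(p for p in parts if p)

    spec = PerformanceSpec(
        expression="slight smile, eyes narrowed, chin tilted down",
        body="weight shifted to back foot, shoulders raised",
        gaze="looking past camera left",
        gesture="slowly runs fingers through hair while looking away",
    )
    print(spec.to_prompt("a woman in her forties", "dim kitchen, late evening"))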

How each model handles a human

Runway Gen-4.5 is the most responsive to physical description of facial and body states. If you describe the muscles, it will attempt the muscles. Director Mode is built for this kind of granular direction. "Jaw clenched, eyes fixed on a point past camera left, slight forward lean" produces something specific and repeatable. It does not infer emotion from context. It does not guess. It executes what you described and leaves the interpretation to the viewer. That is either a limitation or exactly how a good actor takes direction, depending on your perspective.

Kling 3.0 generates the most naturalistic resting states. People in Kling look like they exist even when they are doing nothing, which is harder than it sounds. Idle breathing, micro-shifts in weight, the small unconscious movements that separate a living person from a mannequin. Where Kling struggles is intentional emotional direction. It is very good at "a person standing in a room." It is inconsistent at "a person standing in a room who just received terrible news." The face defaults to neutral beauty more often than it commits to ugly feeling. Reference images with the desired expression carry more weight than any text description.

Veo 3.1 does something interesting and slightly infuriating. It interprets emotional keywords as atmosphere, not as facial direction. "Melancholy" in a Veo prompt changes the lighting, the color grade, the pace of ambient motion. The person in the frame might look pensive, or they might look the same as they would in any other Veo generation. The model reads "melancholy" and art-directs the entire scene around the feeling rather than putting the feeling on the face. This is not wrong, exactly. It is how a certain kind of director works. But it means facial performance prompting on Veo requires physical description, not emotional labels. Tell it what the face does. Do not tell it what the face feels.

Sora 2 is the closest to understanding dramatic intent. Write "she hears the door close and realizes she is alone" and Sora will occasionally produce something that reads as genuine recognition. Not always. Maybe one in four generations. But when it lands, it lands in a way the other models do not attempt. Sora reads prompts like stage directions, not like technical specifications. The trade-off is precision. You cannot reliably get Sora to produce the same expression twice. It improvises. Sometimes brilliantly. Mostly not.

Seedance 2.0 excels at physical continuity. The same face holds across angles and across generations better than any other model. Lip sync is precise, body proportions stay locked, wardrobe does not drift. But emotional range is narrow. Seedance produces convincing people who all seem to be having an acceptable day. Pushing into extreme emotion or subtle contradiction is not its strength. The model preserves what it sees in the reference and does not deviate much from the emotional baseline of the input image. For Frame to Motion workflows where the reference frame carries the expression, this fidelity is an advantage. For text-only prompting of emotional states, it is a ceiling.

What actually works, today

Physical description over emotional labels. Every time. Across every model. This is the universal rule and it is the same thesis this series has been hammering since article two. "A woman experiencing grief" is a vibe request. "A woman with red-rimmed eyes, mouth pressed into a thin line, staring at something in her hands that the camera cannot see" is a direction.

Describe the body, not the feeling. Let the viewer supply the feeling. That is how it works in film anyway. The audience does the emotional labor. The actor provides the physical evidence. A model can produce physical evidence if you are specific enough about what the evidence looks like.
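
One way to enforce the rule on yourself: keep a translation table and refuse to send the label. A sketch only; the grief entry is the one from the paragraph above, and the other entries are examples I made up, not a vetted vocabulary.

    # Emotional labels the model will flatten into cliché, swapped for the
    # physical evidence a viewer would actually read. Example entries only.
    PHYSICAL_EVIDENCE = {
        "grief": "red-rimmed eyes, mouth pressed into a thin line, "
                 "staring at something in her hands that the camera cannot see",
        "anger": "jaw clenched, very still, hands flat on the table, "
                 "slow exhale through the nose",
        "nervous": "shallow breathing, eyes flicking toward the door, "
                   "thumb worrying the edge of a sleeve",
    }

    def describe(label: str) -> str:
        # Falling back to the raw label is the failure mode, not a feature.
        return PHYSICAL_EVIDENCE.get(label, label)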

Use context to amplify. A person sitting alone at a table set for two communicates something no amount of facial expression prompting will match. Environment carries emotional weight. This is true in real filmmaking and it is doubly true in AI generation, where the model handles environment better than it handles the forty muscles of a human face.

Prompt the transition as two states with a bridge. "Starts with a polite smile that gradually fades as she reads the letter" is more promptable than "her expression changes from happy to devastated." The first version gives the model a physical starting state, a physical ending state, and an action that connects them. The second version gives it two emotional labels and a magic word.
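
The same pattern as a template, so the bridge never goes missing: a physical start state, a connecting action, a physical end state. A sketch, not a guarantee that any given model will honor the sequence.

    def transition_prompt(start: str, bridge: str, end: str) -> str:
        # Physical start state, connecting action, physical end state.
        # Never two emotional labels with a magic word between them.
        return f"starts with {start}, {bridge}, ending with {end}"

    print(transition_prompt(
        start="a polite smile",
        bridge="the smile gradually fading as she reads the letter",
        end="lips parted, eyes fixed on the page",
    ))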

And here is the uncomfortable truth: for anything requiring genuine performance, the reference image carries the expression better than text ever will. Generate or find a still image of the face doing what you need. Use Frame to Motion. Let the motion prompt handle the body, the camera, the world. The face arrived with the frame. This is not a workaround. This is the workflow.
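
In practice that workflow splits into exactly two inputs, sketched below. The function name and the file path are hypothetical; substitute whichever model's image-to-video endpoint you actually use.

    def frame_to_motion(reference_frame: str, motion_prompt: str) -> dict:
        # Assemble the two inputs the workflow needs. The actual API call
        # depends on your model of choice; none is assumed here.
        return {
            "image": reference_frame,   # the face arrived with this frame
            "prompt": motion_prompt,    # body, camera, world; never the face
        }

    request = frame_to_motion(
        reference_frame="stills/she_reads_the_letter.png",
        motion_prompt=("she lowers the letter slowly, shoulders dropping, "
                       "camera holds static, curtains stirring in a draft"),
    )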

Why this gap is different

Every other article in this series covers a gap that training data will eventually close. Camera movement, lens behavior, color science, lighting setups, sound design, temporal control. All of these are label problems. When training data gets better metadata, models will learn the difference between a dolly and a zoom, between Portra and Pro 400H, between a dramatic hold and a glitch. It is a matter of annotation and time.

Performance is different. Performance is not a labeling problem. No amount of captioning will teach a model what restraint looks like, because restraint is defined by its context and its absence, not by its visual presence. A clenched jaw means nothing on its own. It means everything after a specific line of dialogue in a specific scene in a specific story. Models do not have stories. They have prompts. A prompt is a single frame of context for a decision that, in real acting, draws on an entire screenplay.

This does not mean AI performance will never work. It means the solution is not better labels. It is something else. Character conditioning. Scene memory. Multi-turn emotional arcs built into the generation architecture. Whatever it is, it has not arrived yet, and prompting cannot substitute for it.

In the meantime, you are the actor's director. Describe the body. Set the stage. Choose the angle. And let the audience do what audiences have always done: find the feeling in the evidence you placed in front of them.


Bruce Belafonte is an AI filmmaker at Light Owl. He has directed exactly zero human actors and finds AI actors only marginally less cooperative.