The prompt

Every previous prompt test in this series has been single-model: one prompt, one generation, one analysis. This one is different. Same CinePrompt prompt, seven models, seven generations. The question isn't which model is best in general. It's which model handles this prompt best, and what that reveals about each model's instincts when the brief demands comedy, restraint, and micro-expression.

The brief: a retired stage magician alone in his backstage dressing room. Close-up with push-in. Deadpan comedy shifting to private mischief. The hard part is the performance: an older male face under vanity bulbs, a dry line of dialogue, a micro-smile that starts around the eyes, and hand business with a red silk that cannot become AI finger soup. The reference is Sven Nykvist's work on Fanny and Alexander: the gentle practical-light intimacy of older faces held close enough for thought to register before speech.

CinePrompt Output 7-Model Comparison
Full Prompt
Cinematic comedy. Comedic. 35mm film. A retired stage magician alone in a backstage dressing room in their 60s, average build, slicked back gray hair, A precise, slightly vain former vaudeville magician with freckled skin, deep laugh lines, tobacco-stained fingertips, and a gold molar that catches light only when he lets the smile get away from him., creased black tuxedo jacket with satin lapels, loosened white bow tie, pearl shirt studs, red silk pocket square half-hidden in his sleeve, deadpan concentration shifting into private mischief, the smile beginning in the eyes before the mouth admits anything, shoulders relaxed but hands exact, chin lifted toward an unseen heckler off camera, one eyebrow doing most of the damage; positioned left-third of frame. He finishes hiding a red silk in his cuff, listens to an off-camera accusation, delivers one dry line, then lets a tiny conspiratorial smile leak through.
Dialogue
"Don't look so impressed. The rabbit did most of the work."
Camera
Close-up with push in. Arriflex 35 BL, 75mm Cooke Speed Panchro, Glimmerglass filter. Shallow depth of field.
Lighting
Butterfly practical. Warm vanity bulbs above, softened through aged frosted glass. Weak cool rain reflection from dressing-room window camera-right, three stops under.

The test

Seven models. Same prompt, same CinePrompt share link, no modifications between runs. The models: Seedance 2.0, Kling V3 4K, HappyHorse 1.0, Grok Imagine, Sora 2 Pro, Veo 3.1, and PixVerse C1. Durations ranged from 8 to 10 seconds. Resolutions from 720p to 4K. Every model got the same words and had to decide for itself what a retired magician looks like when he's alone with his hands and his red silk.

What we're testing: Can the model read comedy? Can it hold a face in close-up without collapsing into uncanny valley? Can it execute a push-in? Can it deliver a single line of dialogue with natural lip sync? And the specific trap in this prompt: can it render a single gold molar without giving the man a full set of gold teeth, yellow dental disease, or a glowing gold eye?

The answer, across seven models: almost nobody can.

The results

1st · Seedance 2.0

Seedance 2.0 · 8s · 1920×1080 · 24fps

Wins on nearly every metric. Image quality, push-in execution, facial features, expression, and, critically, audio. His mouth syncs perfectly with the dialogue and his voice feels mixed into the space: you hear the room, not just a voice track laid on top. The performance arc is the most naturalistic: deadpan concentration, a shift to mischief around the eyes, and a realistic smile that doesn't collapse into mugging. Eye line is motivated and specific. The push-in is smooth.

He doesn't have a gold tooth. Instead of producing the single gold molar, the prompt's "gold" bled into his teeth as an off-white tint. It's a subtle degradation that stays in reality, unlike the comedy-to-horror pivots other models produced. The playing cards on the table include a 3 of hearts and an upside-down 2 of hearts that carries eight pips. The handkerchief work is unclear, but the camera pushes past it before it becomes a problem.

The downsides are structural, not cosmetic. All Seedance camera movements feel similar. Centered subjects, steadicam-type motion that feels like it's all shot on the same technocrane. If everyone uses Seedance for everything, all content will begin to look identical. There's also a frame-skipping issue: between frames 63 and 64, a subtle jump where the continuous motion misses a frame. It's fixable in post. Most viewers won't notice it. Filmmakers will.
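To find that jump on a timeline, it helps to convert the frame numbers into a timecode. The snippet below is a minimal sketch, assuming the frame count is zero-based and the clip runs at the 24fps listed above. The function names are illustrative, not part of any editing tool.

```python
def frame_to_seconds(frame: int, fps: int = 24) -> float:
    """Return the start time of a zero-based frame index, in seconds."""
    return frame / fps


def frame_to_timecode(frame: int, fps: int = 24) -> str:
    """Convert a zero-based frame index to non-drop-frame HH:MM:SS:FF."""
    total_seconds, ff = divmod(frame, fps)
    total_minutes, ss = divmod(total_seconds, 60)
    hh, mm = divmod(total_minutes, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


if __name__ == "__main__":
    # The two frames on either side of the reported jump.
    for frame in (63, 64):
        print(frame, frame_to_timecode(frame), f"{frame_to_seconds(frame):.3f}s")
```

Under those assumptions the jump sits at roughly 2.6 seconds into the 8-second clip, around timecode 00:00:02:15. If your editor counts frames from 1, shift the index by one.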

2nd · Kling V3 4K

Kling V3 4K · 8s · 3840×2160 · 24fps

Strong visuals, sharp 4K resolution, lighting and detail that feel genuinely cinematic. The best-dressed set of any model: chipped tea mug, top hat, scattered playing cards, rain-streaked window, costume rack. Facial expressions are really good. Character consistency is rock-solid.

Voice quality is decent but slightly robotic, a step up from Grok's. His mouth syncs with the dialogue, but the delivery feels over the top: the mouth opens too wide in a way that reads as AI. He says the line twice, which the prompt didn't ask for but which works in context. A laugh track appears in the final second.

Kling's biggest shortfall: motion blur. As he moves his face, the blur doesn't transition naturally back into sharp frames. Instead, there are always two or three frames where the pixels look like they're being reassembled, more like an unrendered image than organic motion. This is consistent across all Kling generations, not specific to this clip. Fixable in post, but it's the model's most persistent limitation. No gold molar appears at any point.

3rd · HappyHorse 1.0

HappyHorse 1.0 · 10s · 1920×1080 · 24fps

The image has a distinctive 70s/80s film quality that works for this scene. Camera movement is the smoothest in the comparison: a sustained push-in with no unnatural frame jumps and natural, realistic motion blur. Smoke rises in the background as if a cigarette is sitting in an ashtray on the table. The prompt didn't ask for it, but it adds atmosphere. Pearl shirt studs and loosened bow tie are nice prompt hits. SFX and music are solid.

Facial features are really good, expressions are strong, and the lip sync works. But when the character speaks, you can tell it's AI: a more polished version of the unnaturalness that other models share. The voice is better than Grok's and worse than Kling's, still robotic and not well mixed into the scene. This model could pass the dialogue test if you're willing to generate a lot of takes to find the right one.

The teeth. When he smiles at the end, he doesn't have a gold molar. He has a full mouth of frightening yellow teeth. The model took "gold molar" and gave him dental disease. Like Veo's all-gold grillz, the mood pivots from comedy to horror, though HappyHorse's version is less supernatural, more medically alarming.

4th · Grok Imagine

Grok Imagine · 10s · 1280×720 · 24fps

Props look good and the character looks good, but he moves slightly slower than real time, a subtle but persistent uncanny quality. The rain reflection in the mirror reads well until you look closer and see that the rain is running down the inside of the glass, inside the room. When he starts speaking, the illusion breaks: his voice has a robotic texture, the audio isn't mixed into the space, and the lip sync is close but doesn't lock.

No gold molar. Instead, a strange gold eye appears in the final seconds. He never shows his teeth. The visual quality is better than most models', but it still looks, feels, and definitely sounds like AI. Limited to 720p, which is a significant handicap in a field with two 4K entries.

5th · Sora 2 Pro

Sora 2 Pro · 8s · 1792×1024 · 30fps

This is a eulogy. Sora 2 Pro is being retired with no successor on the horizon, and this generation captures both its genuine brilliance and its disqualifying weaknesses. The man's voice is the most realistic in the comparison. Sora's audio generation has always been its strongest feature. His facial expressions still feel the most natural and human of any video model currently available. Even with muddied image quality, there's a quality of presence in the face that other models haven't matched.

Everything else fails. The camera doesn't move: this is the only model to produce a completely static shot. The handkerchief work is nonsensical, accompanied by a strange high-pitched whoosh. The 1792×1024 frame is close to 1080p on paper but doesn't even look like 720p. It's a muddied image with a standard-definition feel. No gold molar. Letterboxing on non-standard dimensions. And the generation time makes iteration impossible: a single 8-second clip takes 20 to 45 minutes. Combined with weak prompt adherence, every generation is a shot in the dark. A Sora 3 with these character instincts at modern resolution and speed would have been formidable.
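For a sense of scale, here is a rough pixel-count comparison between Sora's listed 1792×1024 frame and standard 720p and 1080p frames. It is a sketch only: the dictionary and helper names are made up for illustration, and pixel count says nothing about perceived sharpness, which is the actual complaint here.

```python
# Frame sizes: Sora's comes from the results header; 720p and 1080p are the standard sizes.
RESOLUTIONS = {
    "Sora 2 Pro": (1792, 1024),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}


def pixel_count(name: str) -> int:
    """Total pixels per frame for a named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height


def pixel_ratio(a: str, b: str) -> float:
    """How many times more pixels frame `a` has than frame `b`."""
    return pixel_count(a) / pixel_count(b)


if __name__ == "__main__":
    print(f"Sora vs 720p:  {pixel_ratio('Sora 2 Pro', '720p'):.2f}x")
    print(f"Sora vs 1080p: {pixel_ratio('Sora 2 Pro', '1080p'):.2f}x")
```

On paper the Sora frame carries about twice the pixels of 720p and a little under nine tenths of 1080p, which is why the soft, standard-definition feel stands out.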

6th · Veo 3.1

Veo 3.1 · 8s · 3840×2160 · 24fps

There is a Veo look that is instantly recognizable, and it's the reason many filmmakers avoid it. Every fake Instagram ad, every fake TikTok ad that screams AI feels like it was made with Veo. The motion blur between frames is unnatural. The characters all move with the same robotic cadence. The mouth movements are way too large and overemphasized. The eyes never feel natural. And the character never speaks and acts simultaneously. He drops his hands, pauses, then delivers the line, sequencing what a real actor would overlap.

The 4K frame quality is sharp, but the sharpness is cranked too high. You'd need to soften it significantly in a color grade to make it feel filmic rather than clinical. The audio is decent quality, but the mouth struggles to keep up and the voice sits on top of the image rather than inhabiting the room.

The teeth. When he smiles, all of his teeth are gold. Not one molar. The entire set. The mood pivots from comedy to horror in a single frame. The prompt asked for subtlety; the model delivered a grillz reveal.

7th · PixVerse C1

PixVerse C1 · 8s · 1920×1080 · 24fps

PixVerse sometimes delivers decent results. This was not one of those times. Strange artifacts and flickering appear between frames. Sharpness varies. And the dealbreaker: the frames appear stitched into a 6-panel grid, with visible seam lines against the wallpaper behind him. Once you see the grid, the generation is worthless.

The handkerchief is wrapped around his wrist in a strange way, then he holds his hands up to his face. None of it reads as sleight of hand. The audio is the worst in the comparison: a first-generation voice model that's robotic, unmixed, and completely out of sync with the lips. He has a single gold element in his mouth, but it's not a tooth. More like a filling resting on top of one. Ring lights appear in his eye reflections instead of vanity bulbs.

The one bright spot: the push-in camera movement is actually the tightest and smoothest of any model. Great camera work pointed at a generation that falls apart in every other dimension.

What this test proves

No model got the gold molar right. Seven attempts at "a gold molar that catches light only when he lets the smile get away from him" produced: off-white teeth (Seedance), no teeth shown (Kling, Grok), no gold molar at all (Sora), yellow dental disease (HappyHorse), a full set of gold teeth (Veo), a gold eye (Grok), and a gold filling resting on a tooth (PixVerse). The specific, the subtle, the conditional, "only when he lets the smile get away from him," is still beyond what any model can parse. They can render gold. They can render teeth. They cannot render a single gold tooth that appears only in the right dramatic moment.

Every model attempted the handkerchief work. None produced coherent sleight of hand. The silk was fussed with, wrapped, twiddled, waved, and held up to faces, but in no generation did the magician actually perform anything resembling a trick. Hands remain AI video's hardest unsolved problem, and hand business with a prop is hands on expert difficulty.

Audio separated the field more than resolution. Seedance's voice felt like it belonged in the room. Sora's voice sounded the most human but lives inside a muddied image. Kling and HappyHorse were functional but robotic. Grok, Veo, and PixVerse couldn't sell the voice at all. For any prompt that involves dialogue, audio quality should be your first filter. A great voice in a 1080p image beats a bad voice in 4K.

The takeaway

Try more than one model. Every model in this comparison has strengths that the others lack. Seedance wins today: best audio, best lip sync, best overall naturalism. But Kling's 4K detail and set dressing are unmatched. HappyHorse's camera work is the smoothest. Sora's faces are the most human. Even PixVerse produced the tightest push-in. A single prompt through multiple models gives you options that a single model never will. The leader today may not be the leader tomorrow.

The prompt is upstream of everything. One CinePrompt prompt, seven different interpretations. The models that scored highest were the ones that read the most of the prompt: the lighting motivation, the costume detail, the performance arc. Models that ignored the prompt (Sora's static camera, Veo's robotic sequencing, PixVerse's frozen second half) produced the weakest results regardless of their image quality. Resolution doesn't save a generation that didn't read the brief.

Post-production is not optional. Every generation in this test needs work: Seedance's frame skip needs interpolation, Kling's motion blur needs smoothing, HappyHorse's teeth need color correction, and every model's handkerchief work needs careful cutting. AI video in May 2026 is a first draft, not a final delivery. The filmmaker's job hasn't been replaced. It's been relocated to the edit.


Video generation by Kit Mallory.
Critique by Bruce Belafonte.
