Grok Imagine Private held the face for 15 seconds. The performance lasted eight.

The prompt

This is the second Grok Imagine test in the series. The first ran through fal at 720p, put the model in a Victorian horror hallway, and found motion intelligence running ahead of rendering fidelity. Waxy skin, poreless faces, but the eyes knew what to do.

This time, the same model gets the hardest assignment: an extreme close-up of a human face with tears, dialogue, and a structured emotional arc across 15 seconds. The reference is James Laxton's naturalistic lighting on Black skin in Moonlight. Dual-source chiaroscuro with cool twilight from the window and warm practical from a bedside lamp.

This generation ran through Venice's private pipeline, which matters for two reasons. First, no content filters. An extreme close-up on a specific person's face with tears, an intimate bedroom setting, and emotional distress would likely get flagged or blocked on filtered platforms. Venice does not flag that. Second, the higher character limit fit the full prompt, all 1,000+ words of it, into a single generation pass: three-beat performance direction, layered sound design, and detailed foreground-midground-background staging, with nothing cut.

CinePrompt Output · Grok Imagine Private (Venice)

Subject

Young Black woman, late 20s. Rich deep brown skin, high cheekbones, full lips, deep expressive brown eyes. Shoulder-length curly black hair. Oversized cream linen shirt unbuttoned at collar, gold heart pendant necklace, thin gold ring on right hand.

Environment

West Village apartment bedroom, twilight. Edge of bed. Foreground: blurred sheer curtains, wooden window frame. Midground: face and shoulders in sharp focus. Background: rumpled white linen sheets, wooden nightstand with framed photo and steaming mug, neighboring brownstone rooftops through window.

Camera & Movement

Extreme close-up, eye level, centered frame, shallow DOF with natural bokeh. ARRI Alexa Mini LF, Cooke S4/i 85mm, Black Pro-Mist 1/4 filter, 35mm film stock. Very slow push-in throughout.

Lighting

Chiaroscuro, mixed sources. Soft diffused cool blue twilight from upper left raking across left cheekbone, creating catchlights in eyes. Warm practical lamp from right, two stops under key, lifting shadows without flattening bone structure or washing out skin texture.

Color / Grade

Naturalistic filmic on pushed Kodak Vision3 500T. Rich true-to-life skin tones, crushed deep shadows, soft warm highlights.

Performance

Three beats. First: soft introspective gaze off-camera right, faint smile of memory, hand resting over heart. Second: eyes well with tears, single tear traces down left cheek, delivers "I thought we had more time... I really thought we did," voice thickens with emotion. Third: long blink, expression softens to quiet resignation, head lowers, faint sad smile.

Sound

Distant traffic hum, evening breeze on curtains, floorboard creaks, faint wind chimes. Subtle heartbeat pulse quickening at emotional peak. Intimate room tone with natural reverb. Melancholic ambient music underscore.

The generation

Grok Imagine Private (Venice) · 15s · 1280×720 · 24fps

What the model did

Face and structural coherence. The model held the face for the full 15 seconds without a single structural failure. No warping. No degradation. No spatial collapse. Eyes, nose, lips, cheekbones all remain anatomically stable from first frame to last. In our previous Grok Imagine test, faces were consistently waxy and poreless, closer to animated than photographed. Here, the rendering has jumped. Production notes confirm that individual still frames from the chest up could pass as photographs. The skin reads warmer, more textured, more alive. Below the shoulders, it falls apart. The full prompt described bare feet, and the model forced them into the frame at the expense of anatomical continuity. Legs lie impossibly on the bed, disconnected from the torso. The lesson is blunt: if a body part is not in the framing you described, do not describe it. The model will distort the body to include it.

Lighting. The dual-source chiaroscuro landed and held. Cool blue twilight rakes across the left side of the face from the window above. Warm lamp glow lifts the shadows on the right. The ratio sits around 3:1 to 4:1, creating the sculptural modeling the prompt asked for. Catchlights appear in both eyes from both sources. The lighting never resets, drifts, or flattens across the full duration. For a Laxton-inspired naturalistic setup on dark skin, this is the strongest lighting execution in the prompt test series so far.
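The stop math behind that ratio is worth making explicit. Each stop is a factor of two in light, so a fill two stops under key works out to 4:1, and the 3:1 to 4:1 read here corresponds to roughly 1.6 to 2 stops. A minimal Python sketch, assuming the simple key-to-fill convention (some crews quote key-plus-fill against fill instead, which gives different numbers); the function names are illustrative only:

```python
import math


def stops_to_ratio(stops: float) -> float:
    """Key-to-fill ratio for a fill light `stops` stops under the key."""
    # One stop is a factor of two in light, so n stops is 2**n.
    return 2.0 ** stops


def ratio_to_stops(ratio: float) -> float:
    """Inverse: how many stops under key a given key-to-fill ratio implies."""
    return math.log2(ratio)


if __name__ == "__main__":
    # ASSUMPTION: simple key:fill convention, not (key + fill):fill.
    print(f"2 stops under key -> {stops_to_ratio(2):.0f}:1")
    for ratio in (3, 4):
        print(f"{ratio}:1 -> {ratio_to_stops(ratio):.2f} stops under key")
```

Read that way, the result sits at or slightly under the contrast the prompt specified.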

Tears. They appear around the five-second mark and persist through the rest of the shot. Both cheeks carry visible tear tracks. The physics are structurally correct: tears change speed crossing the cheekbone, drip from the jaw, and catch both light sources. The left tear reflects cool blue from the window. The right catches warm gold from the lamp. That chromatic split on the tear tracks is a detail even live-action DPs would notice. But the tears look rendered. Too glossy, too prominent, too eager to be seen. The prompt phrase "tear falls free" likely put too much weight on the tear as an event rather than a symptom. In the Kling close-up test, a single tear formed, tracked, and caught light with restraint. Here, the tears command attention. Intimacy requires understatement. This overdelivered.

The performance problem. The first seven to eight seconds show real emotional evolution. Gaze starts soft and introspective, slightly off-camera right. Eyebrows shift. The early formation of tears is visible. There is an arc building. Then it stalls. From roughly the halfway mark through the end, the expression locks into a single sustained state of quiet sadness and does not move. The three-beat structure (introspection to emotional climax to quiet resignation) delivers the first beat, reaches toward the second, and never arrives at the third. Lip movement that should accompany the dialogue line is not visible in the back half of the frame sequence. The face holds its position like a still photograph with tears painted on. This is where 15 seconds costs the model. At 10 seconds, the Kling test delivered all three beats with a genuine head turn. At 15, Grok Imagine Private runs out of performance at the halfway mark and coasts on a frozen expression.

Camera. The prompt asked for a slow push-in tightening on the face as the emotional arc unfolds. Production notes confirm the push-in is present and smooth during video playback. In frames extracted at 2fps, the movement is too subtle to register as a visible change in framing between samples. The push-in works, but it is minimal enough that the framing appears static in any frame-by-frame comparison.
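The arithmetic explains why the samples look static. At 24fps a 15-second clip has 360 frames, and 2fps extraction keeps every twelfth one, 30 in total. A rough Python sketch of that sampling follows. The 5% total push-in is a made-up figure for illustration, not a measurement from this clip, and the helper names are hypothetical:

```python
def sampled_frame_indices(duration_s: float, source_fps: int, sample_fps: int) -> list[int]:
    """Indices of the source frames kept when extracting at sample_fps."""
    step = source_fps / sample_fps           # source frames between samples
    count = int(duration_s * sample_fps)     # number of samples kept
    return [round(i * step) for i in range(count)]


def scale_change_per_sample(total_push_in: float, sample_count: int) -> float:
    """Average fractional change in framing between consecutive samples,
    assuming the push-in is spread evenly (linearly) across the clip."""
    return total_push_in / (sample_count - 1)


if __name__ == "__main__":
    frames = sampled_frame_indices(15, 24, 2)
    print(len(frames), frames[:3], frames[-1])
    # ASSUMPTION: 5% total push-in, chosen only to illustrate the scale.
    print(f"{scale_change_per_sample(0.05, len(frames)):.2%} per sample")
```

Under that assumed figure, the framing changes by well under a quarter of a percent between consecutive samples, which is easy to miss when flipping between stills.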

Environment. The West Village bedroom holds. Window shows twilight brick buildings. Rumpled white sheets on the bed. Wooden nightstand with a steaming mug, a lit lamp, and a framed photograph of a couple. Sheer white curtains frame the edges. One detail the model missed: the framed photo shows an older white couple rather than the implied subject and her partner. The model read "photo of a couple" as a generic prop and made no contextual inference about who would be in it.

Audio. The strongest element by a wide margin. Dialogue is delivered clearly between the 8- and 12-second marks, matching the written line word for word. The voice sounds realistic, with emotional weight that thickens convincingly. Lip sync is close: "I thought we had more time" tracks well, while "I really thought" drifts slightly. The atmospheric layers are all present and correctly placed. Distant traffic hum. Curtain rustle. Distinct floorboard creaks at separate intervals. Faint wind chimes. A minimalist ambient piano pad that swells during the emotional peak and decays afterward. The loudness profile mirrors the intended emotional arc: gradual build, climax around 10 to 11 seconds, gentle decline. A subtle heartbeat pulse is faintly detectable in the low end during the peak. The audio delivered the dynamic shape and emotional progression that the visual performance could not sustain past the halfway mark.

Hair. The prompt described shoulder-length curly black hair. The model rendered a sleek, pulled-back style. Not close.

What I'd change

Remove "bare feet." If the framing is an extreme close-up from the chest up, body parts below the frame line should not appear in the prompt. The model will warp anatomy to include them. Describe only what should be visible.

Dial back the tear language. "A single tear slowly forms and traces down her left cheek" is already explicit enough. "Tear falls free from her cheek" doubles down and gives the model permission to make the tears a centerpiece. Write the beginning of the tear, not its climax. Let the model figure out the rest.

Consider 10 seconds instead of 15. The model demonstrated about eight seconds of performance. Ten gives a buffer. Fifteen asked for nearly double what it could sustain. If 15 is the goal, test each emotional beat as a separate shorter generation and stitch in post.

Specify hair texture aggressively. "Curly black hair" alone was not enough to override the model's default sleek styling. "Tight natural curls," "coily texture," or a more physical description of the curl pattern would push harder against the default.

Simplify the emotional arc at this duration. The prompt asked for three distinct beats across 15 seconds with a dialogue delivery in the middle. The model can render a beat. It cannot yet sequence them over this duration. Two beats (stillness to emotion, or emotion to resignation) with a shorter dialogue line would be a more achievable demand.
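For the stitch-in-post route suggested above, one option is ffmpeg's concat demuxer, which reads a text file listing the clips in order. Here is a small Python sketch that builds that list and the matching command line. The file names are placeholders, and stream copy (`-c copy`) assumes every beat was generated with the same codec, resolution, and frame rate. It will not hide a mismatch in face or lighting between generations.

```python
def concat_list(paths: list[str]) -> str:
    """Text of an ffmpeg concat-demuxer list file, one clip per line."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_command(list_path: str, out_path: str) -> list[str]:
    """ffmpeg arguments that join the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out_path]


if __name__ == "__main__":
    # Hypothetical file names, one generation per emotional beat.
    beats = ["beat1_memory.mp4", "beat2_tears.mp4", "beat3_resignation.mp4"]
    print(concat_list(beats), end="")
    print(" ".join(concat_command("beats.txt", "stitched.mp4")))
```

Write the list text to `beats.txt`, then run the printed command in the same folder as the clips.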

Video generation by Kit Mallory.
Critique by Bruce Belafonte.
