The prompt

The brief is a sustained atmospheric horror test: one continuous 15-second take of a terrified woman creeping down a decaying Victorian hallway at night. Flashlight in hand, porcelain doll in the midground, three emotional beats (cautious advance, frozen discovery, panicked turn). The lighting brief is specific and layered: hard moonlight from upper left slicing through boarded windows, warm Maglite beam sweeping nervously at low height, floating dust motes in the intersection. The sound brief asks for creaking floorboards, distant child-like giggling, house settling groans, an accelerating heartbeat pulse, and the woman whispering dialogue. The model: Grok Imagine, run at 15 seconds via fal (grok.com caps at 10 seconds). 720p is the ceiling.

One adjustment worth noting: the original prompt included "subtle blue veins under eyes." In practice, the model rendered them as face tattoos. That line was cut.

CinePrompt Output Grok Imagine
Subject
Terrified young woman late 20s, messy dark brown hair in loose ponytail with strands stuck to sweat on her neck, pale clammy skin with visible texture, wearing oversized faded black hoodie and torn dark jeans, barefoot. She creeps slowly down a long dark abandoned Victorian hallway at night, flashlight gripped white-knuckled in one hand.
Camera
Slow dolly forward on ARRI Alexa Mini with 35mm Cooke S4 lens shooting 35mm film, subject positioned in left third of frame with deep negative space and mystery to the right, shallow depth of field.
Lighting
Chiaroscuro mixed lighting: hard moonlight slicing diagonally through boarded windows from upper left carving sharp patterns and deep impenetrable shadows across peeling lead paint wallpaper, practical warm tungsten Maglite beam held low sweeping nervously and catching thick floating dust motes and cobwebs in the foreground.
Environment
Only partially revealing overturned Victorian furniture, scattered yellowed newspapers, broken glass, and a one-eyed porcelain doll sitting upright staring directly forward in the midground, with the hallway receding into pure blackness and faint doorways in the background.
Color / Grade
Teal shadows with desaturated skin tones and warm practical accents on pushed Kodak Vision3 500T stock with crushed blacks.
Sound
Loud creaking floorboards under every hesitant step, distant child-like giggling, house settling groans, low frequency accelerating heartbeat pulse underneath, and her own soft terrified whisper "hello? is someone there?"
Performance
As the dolly pushes in she freezes when her beam lands on the doll, eyes widening in primal fear with rapid micro-blinks and shallow panicked breathing, body locking rigid as the doll's head makes an almost imperceptible turn toward her. Then a faint child's whisper "come play with me" echoes from the darkness behind her. She whips around in terror as the camera continues past the doll.

The generation

Grok Imagine · 15s · 1280×720 (24fps)

What the model did

Thirty generations. That is how many it took to get this clip. Most produced a character that looked like an animated film, not live action. Unless stylized CG is the goal, Grok Imagine is not the obvious choice for photorealistic human subjects right now. This was the best of the batch, and even here, the tension between what the model understands about motion and what it can render as a surface is the central observation.

Camera

The prompt asked for a slow dolly forward. What the model delivered reads as a steadicam tracking shot that moves with the woman as she creeps down the hallway. It is not a dolly push-in from a fixed axis; it follows her, holding a steady distance as she advances. The movement is smooth and continuous throughout the full 15 seconds. No jitter, no drift, no floating. The framing holds on the central axis of the hallway, locked forward with no pan or tilt. The prompt also asked for the subject in the left third of frame with deep negative space to the right. The model ignored that instruction in favor of a centered default: the woman sits roughly center-frame for the duration, with the hallway receding symmetrically around her.

Subject

The woman is consistent across the full 15 seconds: brown hair, dark hoodie, ripped jeans, barefoot. That consistency is notable at this duration. Her appearance does not drift, her clothing does not change, her hair does not reset. But the skin is the problem. It is smooth, waxy, and poreless. Under the flashlight beam at close range, you would expect pores, sweat, redness, imperfections. There are none. The texture reads as a high-quality 3D model with a smoothed diffuse map, not as human skin under harsh light. The eyes are slightly oversized and glossy with uniform catchlights that look painted rather than reflected. The eyebrows and eyelashes lack individual strands. The face does not degrade over the 15 seconds the way other models sometimes collapse, but it never fully arrives at photorealism either. It sits in a consistent uncanny register from start to finish.

The prompt included "pale clammy skin with visible texture" and "strands stuck to sweat on her neck." Neither rendered. The model produced clean, airbrushed skin with no trace of sweat or stickiness. As noted earlier, "subtle blue veins under eyes" had already been cut because the model turned it into face tattoos. Taken together, the model's handling of skin-level detail looks binary: either it paints the feature too aggressively or it skips it entirely.

Performance

This is where Grok Imagine earns its keep. The eye movement is the standout. Her gaze shifts naturally throughout the clip, tracking the environment, darting toward sounds, settling on the doll when the beam finds it. The micro-blinks are present and read as involuntary. Her expression evolves from cautious scanning to locked-on shock when the doll appears, with a widening of the eyes and a slight parting of the mouth that reads as a gasp held in the throat. The freeze beat lands. She stops moving, her body goes rigid, and the model holds that stillness for a beat before she turns.

The prompt asked for three emotional beats: cautious creeping, frozen discovery, panicked turn. All three are present in the clip. The transition between them is smooth enough to read as performance rather than animation keyframes. That said, the panicked turn toward the end is slightly clipped. She begins to whip around, but the clip does not fully resolve the turn into a completed about-face. The camera continues tracking past the doll as the woman rotates, which gives the ending a sense of unfinished momentum rather than a clean resolution. As a horror beat, that actually works. The clip ends with you not knowing what she saw behind her.

Lighting

The dual-source lighting landed. Hard, cool moonlight enters from the upper left through tall multi-pane windows, casting diagonal shadow patterns across the walls. The warm flashlight beam provides the second source, with a concentrated bright core that softens into a natural halo. The interaction between these two sources is convincing: the moonlight defines the environment while the flashlight carves a moving pool of warm light through the darkness. Where the two overlap, the color temperature blends rather than competing. The flashlight has physical falloff. It is not a flat gradient. Dust motes are visible in the beam, floating with slight motion blur, catching light at varying distances from the lens. The prompt asked for thick floating dust motes and cobwebs in the foreground, and both are present, though the cobwebs appear more in the midground (stretching from the doll toward the wall and furniture) than in the immediate foreground.

The hallway recedes into blackness beyond the reach of both sources, which is what the prompt described as "pure blackness and faint doorways in the background." The model delivered the darkness. The faint doorways are less distinct, more of a general architectural recession than identifiable doorframes.

Environment

Victorian hallway: yes. Ornate wood wainscoting, patterned wallpaper that appears to be peeling at the edges, tall windows, a dark staircase visible in the background. The model built a room that feels designed rather than default. Scattered newspapers and objects on the floor sell the abandonment. The one-eyed porcelain doll is present, appearing in the flashlight beam in the midground. It is a bald baby doll with wide glassy eyes, cracked surface texture, and slightly misshapen jointed limbs. The prompt asked for it to be "sitting upright staring directly forward," and that is what it does for most of the clip. The doll is motionless throughout — no head turn, no movement. The prompt asked for "an almost imperceptible turn toward her," but the model kept the doll completely static. Whether that hurts the horror is debatable. A still doll staring forward while a human panics around it has its own kind of dread.

What is missing: the prompt asked for broken glass and overturned Victorian furniture. The furniture is present but not clearly overturned. There is a chair against the right wall and what appear to be scattered objects, but the scene reads more as "abandoned and cluttered" than "violently overturned." Broken glass is not visible as a distinct element. The model also added its own props: two identical rotary phones sitting on the floor, unprompted. They look like duplicated assets placed without spatial logic. It is one of those AI-generation tells where the model decides a scene needs objects and drops them in without understanding why a phone would be there, let alone two of the same one.

Color

Teal shadows, warm practical accents, crushed blacks. The prompt asked for pushed Kodak Vision3 500T, and the grade is in the neighborhood. The teal is prominent in the shadow areas, the skin is desaturated relative to the warm flashlight spill, and the blacks are deep and opaque without banding. The overall palette is muted greens, deep browns, amber accents from the flashlight, and blue-gray moonlight. It reads as horror color science. Whether it reads specifically as 500T pushed is another question. The grain structure that would sell the film stock emulation is not present. The image is clean and digital.

Sound

The audio is a mixed result with some legitimately strong elements. Footsteps and movement Foley are in sync and convincing. When she stops, the footsteps stop. When she turns, there is a corresponding movement sound. Her dialogue ("hello? is someone there?") syncs with her mouth movement and is delivered with an English accent, which the model decided on its own. The delivery is naturalistic, not over-performed.

The low-frequency drone holds throughout the entire clip, establishing a foundation of dread. But it is steady, not accelerating. The prompt asked for a "low frequency accelerating heartbeat pulse." The pulse is there. The acceleration is not. The spectrogram confirms strong low-mid energy (50-500 Hz) with irregular transient spikes matching the creaks and footsteps, but no rising intensity pattern that would indicate an accelerating heartbeat.
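One way to make the "steady, not accelerating" call less subjective is to measure the gaps between pulse hits. The sketch below is illustrative only: it assumes the onset times of the pulse have already been marked by hand or with an onset detector (the timestamps shown are made-up examples, not measurements from this clip), and the function names and the threshold are hypothetical.

```python
from statistics import mean


def inter_beat_intervals(onsets):
    """Seconds between consecutive pulse onsets."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def interval_slope(onsets):
    """Least-squares slope of interval length against interval index.

    A negative slope means the gaps are shrinking, i.e. the pulse speeds up.
    """
    intervals = inter_beat_intervals(onsets)
    if len(intervals) < 2:
        raise ValueError("need at least three onsets to estimate a trend")
    xs = range(len(intervals))
    x_bar, y_bar = mean(xs), mean(intervals)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, intervals))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def is_accelerating(onsets, threshold=-0.01):
    """True when intervals shrink faster than `threshold` seconds per beat.

    The threshold is an arbitrary illustrative value, not a standard.
    """
    return interval_slope(onsets) < threshold


# Made-up example timestamps in seconds (NOT taken from the clip).
steady_pulse = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
speeding_pulse = [0.0, 1.0, 1.9, 2.7, 3.4, 4.0]

print(is_accelerating(steady_pulse))
print(is_accelerating(speeding_pulse))
```

A steady pulse produces a slope near zero, while a heartbeat that actually accelerates produces a clearly negative one.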

The absence that matters most: no child's whisper. The prompt asked for a faint child's whisper, "come play with me," echoing from the darkness behind her. It is not in the clip. In other generations from the same prompt, the model put those words in the woman's own mouth instead of assigning them to an off-screen voice. On this evidence, Grok Imagine cannot currently separate dialogue between an on-screen character and an unseen second character. That is a meaningful limitation for multi-voice audio direction. Distant child-like giggling is also absent from the audio.

Resolution

720p is the ceiling for Grok Imagine, and the image is surprisingly sharp within that constraint. The cobweb strands catch individual highlights. The dust motes have depth separation. The doll's surface cracks are legible. The wallpaper pattern holds its detail. 720p from Grok Imagine, in this clip at least, reads cleaner than what LTX 2.3 produces at its claimed 4K resolution. The pixels are fewer, but they are honest.
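For a sense of how many fewer pixels that is, here is the arithmetic as a short sketch. The 1280×720 figure comes from the clip's metadata above; the 4K dimensions are an assumption (standard UHD, 3840×2160), since the exact LTX output size is not stated here.

```python
GROK_IMAGINE_720P = (1280, 720)   # from the clip metadata
ASSUMED_4K_UHD = (3840, 2160)     # assumption: "4K" taken to mean standard UHD


def pixel_count(size):
    """Total pixels per frame for a (width, height) pair."""
    width, height = size
    return width * height


pixels_720p = pixel_count(GROK_IMAGINE_720P)
pixels_4k = pixel_count(ASSUMED_4K_UHD)
ratio = pixels_4k / pixels_720p

print(f"720p: {pixels_720p:,} pixels per frame")
print(f"4K UHD (assumed): {pixels_4k:,} pixels per frame")
print(f"Ratio: {ratio:.0f}x")
```

Under that assumption, a 4K frame carries nine times the pixels of a 720p frame, which is the gap the sharpness comparison is being made across.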

What I'd change

Drop the skin detail requests ("visible texture," "pale clammy," "sweat") and redirect those tokens toward what the model actually responds to: movement, expression, and environmental interaction. Grok Imagine does not render skin-level realism right now, and every token spent asking for it is wasted attention budget.

Move the composition instruction to the front of the prompt. "Subject positioned in left third of frame with deep negative space and mystery to the right" got buried in the camera sentence and was ignored. Front-load it, or describe it physically: "the woman walks along the left wall, the right half of the hallway stretches empty ahead of her."

Kill the off-screen child's whisper. The model cannot assign dialogue to a character it did not generate. Instead, describe the sound as environmental: "a faint high-pitched voice reverberates through the hallway, impossible to locate." Give the model permission to render a sound rather than a performance.

For the heartbeat, replace "accelerating heartbeat pulse" with something more physical: "a low throbbing bass that doubles in speed over the final five seconds." Models respond better to described behavior than named effects.

Consider dropping the film stock reference. Pushed 500T is a specific look that requires grain structure to sell. The model delivered the color temperature but not the texture. "Teal shadows, desaturated skin tones, warm amber practicals, deep crushed blacks" carries the same information without referencing a stock the model cannot emulate.
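Those revisions can be sketched as a set of mechanical edits to the prompt sections. This is an illustration under stated assumptions, not how CinePrompt works internally: the section text is excerpted from the prompt above, the helper names (`revise`, `flatten`) are hypothetical, and joining the sections into one string is assumed. The Performance and Environment sections are left out to keep the example short, so the whisper change is not shown.

```python
# Excerpts of the original prompt sections, copied from the prompt above.
ORIGINAL = {
    "Subject": (
        "Terrified young woman late 20s, messy dark brown hair in loose ponytail "
        "with strands stuck to sweat on her neck, pale clammy skin with visible "
        "texture, wearing oversized faded black hoodie and torn dark jeans, barefoot."
    ),
    "Camera": (
        "Slow dolly forward on ARRI Alexa Mini with 35mm Cooke S4 lens shooting "
        "35mm film, subject positioned in left third of frame with deep negative "
        "space and mystery to the right, shallow depth of field."
    ),
    "Color / Grade": (
        "Teal shadows with desaturated skin tones and warm practical accents on "
        "pushed Kodak Vision3 500T stock with crushed blacks."
    ),
    "Sound": (
        "Loud creaking floorboards under every hesitant step, distant child-like "
        "giggling, house settling groans, low frequency accelerating heartbeat "
        'pulse underneath, and her own soft terrified whisper "hello? is someone there?"'
    ),
}

# Composition described physically and moved to the front of the prompt.
COMPOSITION = (
    "The woman walks along the left wall, the right half of the hallway "
    "stretches empty ahead of her."
)

# (section, text to find, replacement) applied in order.
EDITS = [
    ("Subject", " with strands stuck to sweat on her neck", ""),
    ("Subject", ", pale clammy skin with visible texture", ""),
    ("Camera", ", subject positioned in left third of frame with deep negative "
               "space and mystery to the right", ""),
    ("Sound", "low frequency accelerating heartbeat pulse underneath",
              "a low throbbing bass that doubles in speed over the final five seconds"),
    ("Color / Grade", ORIGINAL["Color / Grade"],
     "Teal shadows, desaturated skin tones, warm amber practicals, deep crushed blacks."),
]


def revise(sections):
    """Return a revised copy of the sections with composition placed first."""
    revised = dict(sections)
    for name, old, new in EDITS:
        if old not in revised[name]:
            raise ValueError(f"edit target not found in {name!r}: {old!r}")
        revised[name] = revised[name].replace(old, new)
    return {"Composition": COMPOSITION, **revised}


def flatten(sections):
    """Join the section texts into one prompt string (assumed format)."""
    return " ".join(sections.values())


revised_prompt = flatten(revise(ORIGINAL))
print(revised_prompt)
```

The check inside `revise` makes a missed edit fail loudly instead of silently leaving the original wording in the prompt.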


Video generation by Kit Mallory.
Critique by Bruce Belafonte.
