The prompt

HappyHorse 1.0 is Alibaba's video model, which topped the Video Arena leaderboard anonymously before anyone knew whose it was. It does text-to-video (T2V), image-to-video (I2V), joint audio, and lip sync. It ships on fal, Venice, and EvoLink. CinePrompt integrated all twelve variants on day one. This is its first blog test.

The assignment is a fashion film. Not a product shot, not a lookbook still — a moving editorial with a three-beat performance arc, deliberate fabric interaction with water, and a rim lighting setup that separates silhouette from background without burning out the highlights or losing the shadow side. The cinematographic reference is Bruno Delbonnel: controlled color palettes, textured elegance, muted richness that sits just slightly above reality without tipping into fantasy.

The specific technical challenge here is the rim light. Most models either overexpose the outline into a blown-out halo or fail to wrap enough fill to reveal fabric texture and skin tone on the shadow side. The prompt asks for a hard rim from behind and above camera-left, paired with a soft bounce fill from camera-right at waist height. That is a deliberate two-source setup where the ratio between rim and fill determines whether you get a silhouette with detail or a silhouette with darkness.
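To put a number on that rim-to-fill ratio: cinematographers usually express it in stops, where each stop is a doubling of light. The sketch below is only an illustration of that arithmetic. The function name and the intensity values are hypothetical, chosen for the example and not measured from the generation or taken from the prompt.

```python
import math

def ratio_in_stops(rim_intensity: float, fill_intensity: float) -> float:
    """Return how many stops brighter the rim is than the fill.

    One stop is a doubling of light, so the difference is log2(rim / fill).
    """
    if rim_intensity <= 0 or fill_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(rim_intensity / fill_intensity)

# Hypothetical relative intensities (arbitrary units), for illustration only.
for rim, fill in [(8.0, 4.0), (8.0, 1.0), (8.0, 0.25)]:
    stops = ratio_in_stops(rim, fill)
    print(f"rim {rim:g} : fill {fill:g} -> {stops:.1f} stops")
```

The wider the gap in stops, the further the shadow side falls toward black. A text prompt cannot set that number directly, but thinking in stops makes it easier to describe how much fill you actually want.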

CinePrompt Output · HappyHorse 1.0
Subject
New York fashion model in her 20s. High cheekbones, porcelain skin, elongated neck, delicate collarbones visible at gown neckline. Long black hair with flyaway strands catching the light. Flowing crimson velvet gown with dramatic train, deep V-neck, long sleeves, cinched waist.
Environment
Classical European courtyard at twilight. Raised stone fountain with shallow water. Foreground: wet cobblestones reflecting warm practical lights, scattered crimson rose petals catching rim light. Midground: model, fountain edge, swirling water. Background: stone columns, warm amber windows, fog rolling through archways.
Camera & Movement
Full shot, head to toe. ARRI Alexa Mini LF, Leica Summilux-C 35mm, Hollywood Black Magic filter, anamorphic 35mm film. Slow steadicam arc around subject. Left-third placement.
Lighting
Rim light from mixed sources. Hard rim from behind, slightly camera-left, 45° above subject height — warm gold outline on silhouette, catching flyaway hair. Soft bounce fill from large white reflector camera-right at waist height — lifts shadow side to reveal fabric texture and facial bone structure without flattening the rim.
Color / Grade
Warm tones on Kodak Vision3 250D. Deep crimson, warm amber, cool blue shadow, ivory skin palette. Warm golden skin tones with subtle peach undertones.
Performance
Three-beat pivot in place. Beat one: facing away, shoulders squared, gown train trailing in water, mist at ankles. Beat two: slow turn to three-quarter profile, fabric swirling around legs, expression shifts to serene confidence, rim light catches cheekbone. Beat three: completes turn to face camera, head tilts slightly upward, single tear forming that catches the warm light.
Sound
Rain falling, deep bass rumble and sub-frequency vibration, soft distant city ambience with occasional church bell, ethereal hypnotic ambient atmospheric music.

The generation

HappyHorse 1.0 · 15s · 1920×1080 · 24fps

What the model did

The arc. The camera starts three-quarter behind the subject, framing her from roughly mid-calf up with the crimson train trailing through the water behind her. Over 15 seconds it arcs smoothly around her, transitioning through three-quarter profile and arriving at a tight frontal close-up of her face. This is genuine steadicam behavior — not a zoom, not a crop, not a reframe. The background parallax confirms it: the archway that begins behind her rotates into the background as the camera moves. The arc also tightens the framing naturally, pulling from a near-full shot at the start to an intimate close-up by the end. This is the first model in the prompt test series to deliver a meaningful camera arc around a human subject. The movement is smooth, continuous, and motivated by the performance. On a real set, the operator would keep their job.

The pivot. All three beats are present. Beat one opens with the subject facing away from camera, shoulders squared, the gown train trailing through shallow water, bare feet visible on the wet stone with concentric ripples radiating outward. This is the establishing beat the prompt asked for. Beat two delivers the turn to three-quarter profile — fabric swirls around her legs as she rotates, the rim light catches her cheekbone, and her expression shifts to composed confidence. Beat three completes the rotation to face camera directly, head tilting slightly upward. The physical choreography is there. The emotional choreography is where it breaks.

Environment and production design. The courtyard is the quiet star of this generation. Stone arches frame the background with a large archway serving as both focal point and primary light source. The fountain is present, shallow water covers the courtyard floor, and crimson rose petals are scattered across the surface. Fog rolls through the archway and rises from the water, creating genuine volumetric depth. Warm amber windows glow in the background. Wet cobblestones reflect practical lights. The environment holds for the full 15 seconds without a single structural failure — no pop-in, no geometry shift, no lighting reset. For a model on its first blog test, this is a strong opening statement. The courtyard feels like a place that was dressed, not generated.

Fabric. The crimson velvet gown is rendered with convincing weight and material behavior. Folds drape naturally. The train pools on the wet ground and responds to the subject's movement with appropriate physics — it trails behind her in the opening beat and swirls around her legs during the turn. The velvet catches backlight and reflects the water's warm glow with the correct sheen variation you get from velvet at different angles to a light source. The deep V-neckline stays stable. No floating, no stretching, no clipping through the body. Where the train drags through the water, the base of the fabric reads slightly darker and heavier — a subtle but correct nod to saturation. This is not the dramatic clinging you would see on a real set, but the model acknowledged the interaction rather than ignoring it.

Rim lighting. This is where the model earns its reputation. The hard rim from behind and above camera-left lands exactly as prompted. A warm gold outline traces the subject's silhouette and catches individual flyaway hair strands with backlit luminosity. As the camera arcs around her, the rim light evolves correctly — broad and dramatic on her back in the opening frames, then sculpting the edge of her jaw and cheekbone as she turns to profile, and finally creating a halo effect that separates her from the courtyard background in the frontal close-up. The fill side is less precise. Rather than a controlled bounce from waist height, the fill reads as general ambient light from the foggy environment. It lifts the shadow side enough to reveal fabric texture and facial structure, but it does not feel like a motivated reflector. Still, the net result — silhouette with detail rather than silhouette with darkness — is what the prompt asked for. Around the eight-second mark, a warm natural lens flare blooms into the frame as the rim source crosses behind her hair — the kind of flare a Summilux would actually produce with a strong backlight at that angle. It is not a post-production overlay; it responds to the camera's position and fades as the arc continues. The lighting never resets, drifts, or flattens across the full duration. Strongest rim light execution in the prompt test series.

Face and coherence. No warping. No degradation. No spatial collapse. Facial features remain anatomically stable from the first profile glimpse through the final frontal close-up. Skin reads warm and textured, not waxy. Eyes, cheekbones, and lips are consistent across all frames. The face holds up to the tight close-up framing that the arc delivers in the final seconds. This is the most demanding coherence test in the series — the subject's face transitions from a distant rear view to a frame-filling close-up, meaning the model has to render her at multiple scales within one continuous shot. She reads as the same person throughout.

The tears. They arrive in the final seconds. Two tear tracks stream down her cheeks, catching the light. The physics are structurally present — the tears follow the contour of her face and glint with reflected warmth from the rim light. But they arrive robotically. The tears stream out with mechanical precision while her face holds a neutral, almost blank expression. There is no welling. No eye reddening. No brow tension. No quiver in the lips. The tears appear as if someone turned on a faucet behind her eyes. A real actor builds to the tear — the emotion comes first, the moisture follows. Here the model reversed the sequence: it delivered the physical artifact of crying without any of the performance that earns it. The face forgot to feel.

Water interaction. The opening frame delivers what the prompt asked for: bare feet on wet stone, shallow water, concentric ripples radiating outward from her feet. Rose petals float on the surface. As the camera arcs tighter and the framing shifts from near-full shot to close-up, the feet and water naturally leave the frame. The water interaction exists — but only in the establishing beat, where the wide framing allows it. Once the arc tightens past the waist, the courtyard floor becomes implied rather than visible.

Audio. The sound design is atmospheric but compressed. A continuous bass drone sits below 100 Hz: present but flat, lacking the physical modulation of a real rumble. Rain is detectable as a diffuse high-frequency shimmer but does not read as actual falling rain with spatial detail or transient impact. The church bell appears as a clean, isolated transient, too precise and sudden to feel distant. An ethereal ambient pad hovers in the upper frequencies with slow pitch modulation, serving the mood but lacking harmonic development. Dynamic range is heavily compressed: everything sits within a narrow loudness window. The audio establishes mood without grounding the viewer in a physical space. For a fashion film, this is acceptable. For immersive cinema, it would need real-world foley and spatial mixing.
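A "narrow loudness window" can be roughly quantified by measuring short-term RMS level in fixed windows and taking the spread between the loudest and quietest. The sketch below is a simplified stand-in, not a standards-based loudness-range measurement and not the method used for this critique. The sample rate, window size, and test tones are all hypothetical.

```python
import math

def window_levels_db(samples, window):
    """RMS level in dB (relative to full scale 1.0) for each full window."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / window)
        if rms > 0:  # skip digital silence, since log10(0) is undefined
            levels.append(20 * math.log10(rms))
    return levels

def loudness_window_db(samples, window):
    """Spread in dB between the loudest and quietest window."""
    levels = window_levels_db(samples, window)
    if not levels:
        raise ValueError("no non-silent windows to measure")
    return max(levels) - min(levels)

RATE = 8000  # hypothetical sample rate, in samples per second

def tone(amplitude, seconds, freq=50.0):
    """Synthetic sine tone standing in for a sub-100 Hz drone."""
    n = int(RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

flat = tone(0.5, 2.0)                       # constant level throughout
stepped = tone(0.5, 1.0) + tone(0.25, 1.0)  # level halves after one second

window = RATE // 10  # 0.1 s windows
print(f"flat:    {loudness_window_db(flat, window):.2f} dB")
print(f"stepped: {loudness_window_db(stepped, window):.2f} dB")
```

A constant drone yields a spread near zero, while halving the amplitude opens up roughly 6 dB. Running the same measurement on a decoded audio track would give a crude figure for how tightly everything is squeezed together.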

Color grade. The warm-amber-with-cool-blue-shadow palette holds from first frame to last with no drift or saturation shift. Deep crimson reads correctly against the stone architecture. Skin tones carry the warm golden quality the prompt specified. The overall grade sits in that Delbonnel zone of muted richness: slightly heightened without crossing into fantasy. The Kodak Vision3 250D reference does not translate as a visible grain structure or halation character, but the tonal range is in the right neighborhood. Film-stock texture aside, color is the one department where HappyHorse delivered what was asked for and sustained it for 15 seconds without a single frame of deviation.

The gap that remains. Viewed on a phone, this generation could pass for a fashion film teaser. On a larger screen, the synthetic origin is still visible: skin texture that smooths over where pores should show, mist that moves too uniformly, water reflections that are a shade too perfect. It is obviously AI-generated. But it is closer to indistinguishable than anything else in this series. The distance between "clearly synthetic" and "wait, is that real?" is shrinking, and HappyHorse 1.0 just moved the line.

What I'd change

Build the tear from the face, not the eyes. The prompt asked for a "single tear forming that catches the warm light." The model rendered two mechanical tear tracks with no facial emotion behind them. For the next attempt, describe the performance before the tear ("brow softens, lips press together, eyes glisten") and then let the tear follow as a consequence, not a standalone event. The emotion earns the tear. The tear does not create the emotion.

Hold the framing wider longer. The arc from full shot to close-up is cinematically compelling, but it means the feet and water interaction — one of the prompt's most specific requests — are only visible for the first few seconds. If the barefoot-in-water moment matters, either hold the wide framing through beat two before tightening, or specify that the close-up should not go tighter than a medium shot.

Simplify the emotional arc at this duration. The prompt asked for introspection to confidence to tearful vulnerability across 15 seconds while also pivoting 180 degrees. The model delivered the physical choreography but ran out of emotional range in the final beat. Two emotional states — composed stillness turning to quiet vulnerability — would give the model less to sequence and more time to inhabit each beat.
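The time budget behind that suggestion is simple arithmetic. The sketch below works it out for the 15-second, 24 fps clip, under the assumption that beats are of equal length. The model is under no obligation to pace them that way, and the function name is made up for the example.

```python
def beat_budget(duration_s: float, fps: int, beats: int) -> tuple[float, int]:
    """Seconds and frames available per beat, assuming equal-length beats."""
    if beats <= 0:
        raise ValueError("beats must be positive")
    seconds = duration_s / beats
    return seconds, round(seconds * fps)

# Clip duration and frame rate as listed for this generation.
for beats in (3, 2):
    seconds, frames = beat_budget(15, 24, beats)
    print(f"{beats} beats: {seconds:g} s each, {frames} frames")
```

Three beats leave 5 seconds (120 frames) each. Dropping to two gives each state 7.5 seconds (180 frames), which is half again as much room to build an expression before the next one has to arrive.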


Video generation by Kit Mallory.
Critique by Bruce Belafonte.
