Two generations, two films: LTX 2.3 in an abandoned train station

The prompt

The brief: test whether LTX 2.3 can sustain precise architectural composition as a camera advances through a deep space. That means columns receding in perspective, volumetric god rays holding direction as the viewpoint changes, and a solitary figure whose emotional body language reads across a wide shot. The references are Hoyte van Hoytema's IMAX-scale interiors and Roger Deakins' motivated practical sources in vast spaces. Ten seconds, because that's long enough for a steadicam push to travel a meaningful distance without crossing LTX's known coherence cliff at 15 seconds.

A note on the generation: this video is two separate LTX 2.3 outputs stitched together, each 10 seconds at the model's 4K (2160p) setting. Both were generated from the same prompt. The interest isn't continuation. It's variation. Same input, same model, same settings. What does it choose to do differently?

CinePrompt Output LTX 2.3

Subject

A solitary woman in her mid-30s with loose shoulder-length dark auburn hair, wearing a long dark wool overcoat with raised collar, dark trousers and scuffed leather boots walks slowly deeper into the abandoned grand train station hall at blue hour, hands in pockets, shoulders slightly hunched, lost in melancholy thought, positioned off-center as a tiny figure dwarfed by the architecture

Camera

Wide anamorphic 35mm on ARRI Alexa 65 with 24mm Panavision Primo lens, vintage diffusion filter, steadicam pushing forward in continuous deliberate tracking movement, deep focus

Lighting

Dramatic chiaroscuro with volumetric god rays of cool blue moonlight raking diagonally from high broken windows on the right, carving sharp shafts through dusty air, long hard shadows across wet marble floor, minimal ambient fill from stone pillar and puddle reflections, extremely low to preserve deep blacks

Environment

Abandoned grand train station hall — foreground puddles and scattered debris, midground rows of massive iron columns and abandoned benches receding in perspective, background grand vaulted ceilings with shattered glass skylights and distant ornate archways fading into darkness

Color / Grade

Teal and orange with crushed shadows on Kodak Vision3 500T stock

Sound

Light rain audible outside with distant echoing drips and subtle low frequency spatial hum

Mood

Contemplative film noir atmosphere, slow and deliberate pacing

Duration

10s × 2 generations (stitched to 20s)


The generation

LTX 2.3 · 20s (2 × 10s) · 3840 × 2160 (16:9) · 24fps · Generated audio

What the model did

The first 10 seconds. The camera opens on a wide composition looking deep into the station hall. The woman is small in the frame, positioned off-center, walking away. Exactly as prompted. Massive columns line both sides, a vaulted ceiling with arched skylights sits overhead, and scattered papers and abandoned benches fill the space. The architecture converges correctly in perspective. It reads as a real place.

The steadicam push is real. Across the first 10 seconds, foreground elements (benches, debris on the floor) shift faster than the background arches and distant archways. There's clear parallax confirming a forward dolly, not a digital zoom or scale trick. The movement is straight and steady with no lateral drift or rotation. The woman grows slightly larger in frame as the camera closes the distance. For a text-to-video model, this is clean spatial tracking.
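The reason parallax separates a real dolly from a zoom can be sketched with a simple pinhole camera model. The snippet below is an illustration only: the depths, the travel distance, and the focal length are hypothetical numbers, not measurements taken from the clip.

```python
def projected_size(real_size_m, depth_m, focal_px):
    """Pinhole projection: on-screen size (pixels) of an object at depth_m."""
    return focal_px * real_size_m / depth_m


def dolly_growth(depth_m, advance_m, focal_px=1000.0):
    """Growth factor when the camera physically moves advance_m toward the object."""
    before = projected_size(1.0, depth_m, focal_px)
    after = projected_size(1.0, depth_m - advance_m, focal_px)
    return after / before


def zoom_growth(depth_m, zoom_factor, focal_px=1000.0):
    """Growth factor when only the focal length changes and the camera stays put."""
    before = projected_size(1.0, depth_m, focal_px)
    after = projected_size(1.0, depth_m, focal_px * zoom_factor)
    return after / before


# Hypothetical scene values, assumed for illustration.
BENCH_DEPTH_M = 5.0   # a foreground bench
ARCH_DEPTH_M = 40.0   # a distant archway
ADVANCE_M = 2.0       # how far the camera travels forward

for name, depth in [("bench", BENCH_DEPTH_M), ("archway", ARCH_DEPTH_M)]:
    print(f"{name}: dolly x{dolly_growth(depth, ADVANCE_M):.2f}, "
          f"zoom x{zoom_growth(depth, 1.25):.2f}")
```

Under a dolly, the near bench grows much more than the far archway. Under a zoom, both grow by the same factor. Depth-dependent growth is the signature of a forward camera move.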

The god rays. This is the best thing in the video. Volumetric shafts of cool blue light rake diagonally from high windows on the right side, exactly as the prompt described. They carve through the dusty air and cast hard shadow lines across the floor. The direction holds consistent across every frame in the first half. No warping, no shifting angle. The rays interact with the space convincingly: they illuminate patches of floor between the columns, catch dust particles, and create the chiaroscuro depth the prompt was after. If you showed someone just the lighting from this first half and told them it was a still from a Deakins film, they might believe you.

The floor. The wet marble surface is highly reflective throughout, with specular highlights from the god rays and overhead sources. The woman's legs and the columns produce visible reflections. It's more of a uniformly slick surface than distinct puddles with edges. The prompt asked for "foreground puddles" and got "wet floor." But the reflective quality adds real depth to the frame.

The subject. In the first half, the woman is consistent: dark overcoat, hands in pockets, back to camera, steady walking pace. Her silhouette is stable across all frames. No morphing, no flickering, no identity drift. The "shoulders slightly hunched, lost in melancholy thought" reads in her posture. She's always seen from behind, so facial consistency is untestable, but as a compositional element in a wide architectural shot, she works. She's a small, solitary figure dwarfed by the space, which is exactly the emotional note the prompt wanted.

The second generation. At the 10-second mark, the model's other interpretation begins. The woman, who was walking away from the camera through the first half, is now facing the camera, much closer, standing nearly still. Her coat shifts from a reddish-brown tone to dark blue or black. The camera stops pushing forward and becomes near-static. The scattered papers on the floor are in different positions. The perspective has shifted entirely. Same prompt, different film.

The god rays persist in the second half, and the architectural style is similar enough to recognize it as the same type of space. But the spatial relationship between camera, subject, and environment is completely different. The model produced two valid interpretations of the same words, each internally coherent, each making its own choices about where to put the camera and what the woman is doing. The second half is more confrontational: a medium shot of a woman facing forward in blue light, rather than a distant figure retreating into depth. As a standalone 10-second clip, it's moody and well-lit. As a companion piece, it shows you how much latitude the model takes with the same instructions.

The image quality problem. The file is 3840 × 2160. Technically 4K. But the perceived sharpness tells a different story. Edges are soft throughout. The woman's coat has no visible texture or stitching. Column surfaces lack the grime and micro-detail you'd expect in an abandoned station. The papers on the floor are generic shapes without legible text or distinct creases. There's a painterly bloom over everything, a slight diffusion that softens the image well below what the pixel count implies. The perceived resolution lands closer to 720p or 1080p stretched to a 4K container. LTX 2.3 labeled this as 2160p output, but the rendering doesn't deliver the detail that resolution promises.
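To put numbers on what "720p or 1080p stretched to a 4K container" would mean, here is a small arithmetic sketch. It measures nothing in the actual file. It only shows how much interpolation each assumed source resolution would need to fill a 3840 × 2160 frame.

```python
OUTPUT = (3840, 2160)  # the delivered container size

# Candidate "perceived" source resolutions named in the critique.
CANDIDATES = {"720p": (1280, 720), "1080p": (1920, 1080)}


def upscale_factors(source, output=OUTPUT):
    """Return (linear scale factor, pixel-count ratio) from source to output."""
    linear = output[0] / source[0]
    area = (output[0] * output[1]) / (source[0] * source[1])
    return linear, area


for name, res in CANDIDATES.items():
    linear, area = upscale_factors(res)
    print(f"{name}: {linear:.0f}x per axis, {area:.0f}x the pixels")
```

If the effective detail really sits at 720p, each source pixel is spread over a 3 × 3 block of output pixels. At 1080p it is a 2 × 2 block. Either way, most of the pixels in the container would be interpolated rather than rendered.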

The color. The prompt asked for teal and orange on Kodak 500T stock. What the model delivered is closer to blue-hour naturalism. Cool blue dominates from the god rays and sky, with warm amber accents from what appear to be interior light sources. The crushed blacks are there: shadows go deep and stay deep. But the teal-orange contrast is subtle rather than stylized. There's no emulsion warmth or grain texture that would suggest film stock emulation. It reads as digital with a cool grade, not as shot-on-Vision3.

The sound. LTX 2.3's audio generates a moody atmospheric bed. The spectrogram shows a continuous low-frequency hum in the 50-500 Hz range, delivering the "subtle low frequency spatial hum" from the prompt accurately. Above that, discrete midrange impulses appear as transient spikes: echoing footsteps that decay naturally in the cavernous interior. The dynamic range is varied, not compressed, with quiet ambient stretches between the footstep events.

What's wrong is the footsteps. Those discrete midrange impulses could be mistaken for the prompt's echoing drips, but they are more likely footsteps. The transient events in the 1-2 kHz range are crisp, bright, and sharp, carrying the acoustic signature of hard-soled heels clicking on stone. The prompt described "scuffed leather boots," which should produce a duller, heavier thud. The model defaulted to the more cinematic footstep archetype rather than reading the specific shoe description. Rain is also absent from the audio. There is no broadband hiss or high-frequency texture to suggest water hitting surfaces. The hum and footstep echoes alone create mood, but the missing rain and wrong footstep timbre leave gaps. Between the two generations, the audio shows a clear shift in character: the second half is denser and louder, with more frequent impulses. It is a different generation with a different energy, matching the more confrontational framing of the second clip.
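The band-by-band reading above can be approximated in code. This is a minimal sketch that runs on synthetic stand-in signals, not on the clip's audio: a 100 Hz tone plays the hum and a decaying 1.5 kHz burst plays a sharp heel click. The sample rate, durations, and amplitudes are all assumptions chosen to keep the example small.

```python
import math

SAMPLE_RATE = 8000  # Hz, assumed; keeps the 1-2 kHz band below Nyquist
N = 800             # 0.1 s of audio, which gives 10 Hz per DFT bin


def band_energy(samples, f_lo, f_hi, sample_rate=SAMPLE_RATE):
    """Sum of squared DFT magnitudes over bins between f_lo and f_hi (Hz)."""
    n = len(samples)
    k_lo = math.ceil(f_lo * n / sample_rate)
    k_hi = math.floor(f_hi * n / sample_rate)
    total = 0.0
    for k in range(k_lo, k_hi + 1):
        w = 2 * math.pi * k / n
        re = sum(s * math.cos(w * i) for i, s in enumerate(samples))
        im = sum(s * math.sin(w * i) for i, s in enumerate(samples))
        total += re * re + im * im
    return total


# Synthetic stand-ins (hypothetical): steady hum plus one bright transient.
hum = [0.5 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(N)]
click = [0.0] * N
for i in range(200, N):
    t = (i - 200) / SAMPLE_RATE
    click[i] = math.exp(-t / 0.005) * math.sin(2 * math.pi * 1500 * t)
mix = [h + c for h, c in zip(hum, click)]

for label, signal in [("hum only", hum), ("hum + click", mix)]:
    low = band_energy(signal, 50, 500)
    mid = band_energy(signal, 1000, 2000)
    print(f"{label}: low band {low:.1f}, mid band {mid:.1f}")
```

The hum alone puts essentially all its energy in the 50-500 Hz band. Adding the bright transient lights up the 1-2 kHz band, which is the pattern the spectrogram showed for the footsteps. A duller boot thud would be expected to shift that energy lower, and real rain would show up as energy spread across many bands at once.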

What I'd change

The first 10 seconds here are strong: the god rays, the parallax, the architectural depth, the emotional distance of the figure. Three adjustments:

Push for resolution honesty. The 4K label is misleading here. If the model's internal rendering resolution doesn't match its output container, the result is upscaled softness. Consider generating at the model's native resolution and upscaling with a dedicated tool afterward. At least then you control the sharpening pipeline.

Footstep specificity. "Scuffed leather boots" wasn't enough to override the model's default footstep sound. Try describing the acoustic quality directly: "heavy, dull boot thuds echoing off marble" or "soft leather sole impacts with low-frequency reverb." Give the model a sound to generate, not just a shoe to imagine.

Rain as both visible and audible. This is the same limitation we saw with Seedance. "Light rain" doesn't produce either visible rainfall or rain audio. For the visual, try "rain streaks visible catching the blue god rays against dark background." For audio, try "broadband rain hiss on stone exterior bleeding through broken windows." Separate the rain into visual and audio instructions rather than a single atmospheric note.
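As a rough sketch of how the footstep and rain changes could be folded into the next run, the snippet below appends the suggested phrases to the original Sound section. The helper name, the comma-joined format, and the choice of which visual section receives the rain phrase are assumptions for illustration. This is not CinePrompt's actual schema.

```python
# Original Sound section, copied from the prompt above.
ORIGINAL_SOUND = (
    "Light rain audible outside with distant echoing drips "
    "and subtle low frequency spatial hum"
)

# Suggested audio phrases from the critique.
SOUND_ADDITIONS = [
    "heavy, dull boot thuds echoing off marble",
    "broadband rain hiss on stone exterior bleeding through broken windows",
]

# Suggested visual phrase. Where it goes (Lighting or Environment) is a
# judgment call and is left open here.
VISUAL_RAIN = "rain streaks visible catching the blue god rays against dark background"


def append_phrases(section_text, phrases):
    """Hypothetical helper: join extra descriptive phrases onto a prompt section."""
    return ", ".join([section_text, *phrases])


revised_sound = append_phrases(ORIGINAL_SOUND, SOUND_ADDITIONS)
print("Sound:", revised_sound)
print("Visual addition:", VISUAL_RAIN)
```

Keeping the audio phrases and the visual phrase in separate variables mirrors the point above: rain gets one instruction for the picture and another for the soundtrack.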

Video generation by Kit Mallory.
Critique by Bruce Belafonte.
