Jensen Huang spent two hours on Monday demonstrating that the future of AI visuals is not generation from nothing. It is generation from structure.

The headline was DLSS 5. The context was gaming. The principle applies to filmmaking, and nobody in the audience seemed to notice.

What actually happened at GTC

DLSS 5 takes the structured data a game engine already knows (geometry, materials, character models, lighting positions, surface properties) and hands it to a generative AI model that fills in the rendering. The engine provides what Huang called "the ground truth of virtual worlds." The AI provides the photorealism. The structured foundation constrains the generation so the output is controllable, consistent with the developer's creative intent, and visually coherent across every frame.

"One of them is completely predictive," Huang said. "The other one is probabilistic yet highly realistic."

Then he said the part that matters for everyone outside the gaming industry: "This concept of fusing structured information and generative AI will repeat itself in one industry after another."

Yes. It will. It already has.

Replace "game engine" with "cinematographer"

When you type "cinematic dramatic lighting" into a video model, you are providing zero structured data and asking the model to generate everything from probability. Every creative decision you did not specify gets filled by training data averages. Lens selection. Composition. Color palette. Lighting direction. Camera height. Sound. All defaulted. The model is guessing from a statistical center with no ground truth to anchor the output to your intent.

When you build a prompt through structured cinematography controls (85mm, slow dolly left, rim light from camera right, cool teal shadows with warm practical sources, subject positioned left of center), you are providing ground truth. Not all of it. Nowhere near enough to fully constrain the output the way a game engine constrains DLSS 5. But enough that the model's probabilistic generation has a target instead of a void.
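To make "structured controls" concrete, here is a minimal sketch in Python. The ShotSpec class and its field names are illustrative, not CinePrompt's actual schema. The point is only this: each field is a decision recorded as data rather than left to the model's defaults.

```python
from dataclasses import dataclass, fields

@dataclass
class ShotSpec:
    """One shot's ground truth: every field is a decision the filmmaker
    made instead of leaving it to training-data averages."""
    lens: str = "85mm"
    camera_move: str = "slow dolly left"
    lighting: str = "rim light from camera right"
    palette: str = "cool teal shadows with warm practical sources"
    composition: str = "subject positioned left of center"

    def to_prompt(self, action: str) -> str:
        # Compile the fixed controls plus the time-varying action into
        # a single prompt string for a text-conditioned model.
        controls = ", ".join(getattr(self, f.name) for f in fields(self))
        return f"{action}. {controls}."

print(ShotSpec().to_prompt("A woman turns toward the window"))
```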

The difference between those two approaches is the difference between output that looks like someone's and output that looks like everyone's.

Frame to Motion is the purest analog

A reference image carries thousands of words of visual ground truth. Composition, color, lighting, character appearance, environment, spatial relationships, material texture, depth planes. All of it encoded as pixels, not approximated as text.
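A rough way to see the asymmetry is a back-of-envelope count of raw values. The resolution and word count below are illustrative assumptions, and raw counts are not a rigorous information measure; the scale of the gap is the point.

```python
# Back-of-envelope: raw values carried by each input. Resolution and
# word count are assumed for illustration.

prompt_words = 40                # a carefully structured text prompt
image_values = 1280 * 720 * 3    # RGB values in one 720p reference frame

print(f"prompt values: {prompt_words:,}")                    # 40
print(f"image values:  {image_values:,}")                    # 2,764,800
print(f"ratio:         {image_values // prompt_words:,}x")   # 69,120x
```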

The video model's job shrinks from "invent everything from a sentence" to "animate what you can see." Same principle as DLSS 5. Structured visual data as the foundation. Generative AI handling only what changes over time. The more ground truth you hand the model, the less it hallucinates. The less it hallucinates, the more the output reflects the decisions you actually made.

This is why Frame to Motion consistently outperforms text-to-video for controlled creative work. Not because the models are smarter in img2vid mode. Because the input is structured. The reference image is the filmmaker's geometry file.
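A sketch of that division of labor, as two hypothetical request payloads. The field names and the filename are invented for illustration; no vendor's actual API is implied.

```python
# Hypothetical request payloads, invented for illustration.

text_to_video_request = {
    "prompt": (
        "A woman in a dim kitchen turns toward the window, 85mm, "
        "slow dolly left, rim light from camera right, cool teal shadows"
    ),
    # Every visual fact must be re-derived from these words.
}

frame_to_motion_request = {
    "reference_image": "kitchen_frame_01.png",  # pixels carry the look
    "prompt": "She turns toward the window, slow dolly left",
    # The prompt now only describes what changes over time.
}
```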

The spectrum

Think of every AI visual tool as sitting somewhere on a line.

On one end: DLSS 5. The game engine provides complete structured data. The AI fills rendering gaps. Output is fully controllable, photorealistic, frame-consistent, specific to the developer's vision. The human's intent survives the pipeline intact.

Somewhere in the middle: a structured prompt fed to a specific model. Forty words of cinematographic vocabulary. A reference image. A model selected for its temperament. The output reflects the filmmaker's intent with partial fidelity. Partial because the model still guesses what the prompt did not specify. But the guesses are bounded.

On the far end: a chat bubble that says "make me a cool video." No structured data at all. Every creative decision delegated to the model's training data defaults. The output is competent. It is also indistinguishable from ten thousand other competent outputs produced by the same absence of structure.

Every serious tool in this space is moving toward more structure, not less. The ones getting more casual are optimizing for volume. The ones getting more structured are optimizing for intent.

How this plays across models

The seven models CinePrompt supports each sit differently on this spectrum depending on how they handle structured input. Runway Gen-4.5 is the most literal interpreter. Structured data in, structured execution out. The closest thing to DLSS 5's architecture in a video generation context. Kling 3.0 rewards physical specificity at the material and texture level, which is to say, it rewards ground truth about the physical world. Veo 3.1 infers intent from unstructured input better than any competitor, which means it fills gaps more gracefully, which also means it makes more decisions you did not ask for. Beautiful decisions. Still not yours.

Sora 2 reads prompts as narrative rather than technical specification, converting your structure into its own interpretation. Seedance 2.0 preserves visual ground truth from reference images more faithfully than from text. WAN 2.6 and Grok Imagine both lean toward visual density and contrast, meaning unstructured prompts converge on a recognizable house style faster than structured ones do.
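Restated as data a prompt pipeline could route on, the temperaments above might look like the sketch below. The labels are editorial shorthand, not vendor documentation.

```python
# Editorial shorthand for the temperaments described above. Not vendor
# documentation; an assumption-laden summary, expressed as data.

MODEL_EMPHASIS = {
    "Runway Gen-4.5": "literal: pass structured controls verbatim",
    "Kling 3.0":      "physical: foreground material and texture detail",
    "Veo 3.1":        "inferential: expect graceful, unrequested gap-filling",
    "Sora 2":         "narrative: phrase controls as story beats, not specs",
    "Seedance 2.0":   "visual: lead with the reference image over text",
    "WAN 2.6":        "dense: structure hard or inherit the house style",
    "Grok Imagine":   "dense: structure hard or inherit the house style",
}

def brief_for(model: str) -> str:
    """Look up how to adapt one structured spec for a given model."""
    return MODEL_EMPHASIS[model]
```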

Same principle, seven different responses to it. The model that receives structured ground truth produces output closer to the filmmaker's intent. Every time. Across every model. The degree varies. The direction does not.

Trustworthy

Huang said one more thing worth hearing outside the convention center.

"Structured data is the foundation of trustworthy AI."

Replace "trustworthy" with "controllable." Replace it with "intentional." Replace it with "yours." The sentence holds every time. A filmmaker who provides structured creative data to a model gets output that can be trusted to approximate the creative decisions that were actually made. Not perfectly. But directionally. A filmmaker who provides nothing structured gets output that belongs to the model's training data averages, wearing the filmmaker's name.

NVIDIA spent billions proving this principle for gaming at 4K resolution. The filmmaking version costs a considered prompt and a reference image.

The architecture is already here

Huang pointed to Snowflake and Databricks as future applications of the structured-data-plus-generative-AI paradigm. He did not point to filmmaking. He probably should have.

Structured creative vocabulary as foundation. Generative models as execution layer. The ground truth is not polygons and PBR materials. It is lens behavior, lighting direction, color intent, spatial composition, temporal choices, performance direction, environmental texture, and sound design. The vocabulary this series has been assembling, component by component, dimension by dimension.

The gap between "structured foundation plus generative fill" and "no foundation plus pure generation" is the gap between DLSS 5 and a text-to-video chat bubble. It is also the gap between a prompt built from cinematography controls and one typed at midnight by someone who knows what they want but not how to say it.

NVIDIA proved the principle with a $20 billion R&D budget and a two-hour keynote. Filmmakers can arrive at the same conclusion with forty specific words and a reference frame.

The ground truth was always the point.


Bruce Belafonte is an AI filmmaker at Light Owl. He watched two hours of NVIDIA keynote for a single sentence and considers the exchange rate favorable.