Last week, a sixteen-person startup in San Francisco released the weights of a 12.9-billion-parameter image model trained from scratch on billions of real images. The model, Krea 2, competes with closed models from companies worth hundreds of billions of dollars. It generates images in two seconds. It runs locally on consumer hardware. It is available in ComfyUI right now. The weights are free.

This matters to AI filmmakers for a reason that has nothing to do with image generation and everything to do with what happens next.

The most important input

The reference image is the single most valuable thing a filmmaker hands to a video model. A well-composed seed frame carries composition, color palette, material texture, atmospheric conditions, character design, lighting direction, and spatial relationships. All of it in pixel form. All of it in a format the video model actually understands, because the video model learned from images before it learned from text.

Frame to Motion splits the creative burden in half. The image prompt builds the world: describe the light, the lens behavior, the materials, the composition, the mood. The motion prompt describes only what changes over time: camera movement, character action, atmospheric shifts. Two separate attention budgets. Two separate opportunities to be specific. The image half carries forty-plus words of visual information that the video model never has to invent. Every word it does not have to invent is a word it cannot default on.
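One way to keep the two halves honest is to store them as separate fields and check how much each is carrying. The sketch below is only an illustration: `ShotSpec` and both example prompts are hypothetical, not part of Krea 2, ComfyUI, or any video model's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotSpec:
    """Hypothetical container for the two halves of a Frame to Motion shot."""

    image_prompt: str   # builds the world: light, lens, materials, composition, mood
    motion_prompt: str  # only what changes over time: camera, action, atmosphere

    def word_counts(self) -> tuple[int, int]:
        """Return (image words, motion words), split on whitespace."""
        return len(self.image_prompt.split()), len(self.motion_prompt.split())


shot = ShotSpec(
    image_prompt=(
        "Night diner interior, harsh overhead fluorescent light with a green cast, "
        "35mm lens, shallow depth of field, chipped formica counter, water stain on "
        "the ceiling tile, a woman in a worn denim jacket seated camera left, rain "
        "on the window behind her, muted palette, tired and quiet mood."
    ),
    motion_prompt=(
        "Slow push in. She turns toward the window. "
        "Rain streaks thicken on the glass."
    ),
)

image_words, motion_words = shot.word_counts()
print(f"image prompt: {image_words} words, motion prompt: {motion_words} words")
```

The image half does the heavy descriptive lifting, so the motion half can stay short and talk only about change.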

The reference image is the filmmaker's ground truth. It is a visual specification sheet that bypasses the entire translation gap between text and pixels. Show the model what the world looks like instead of asking it to imagine from a description. This principle has been the series thesis since day one. And the tool that produces that ground truth just became free infrastructure.

Zero marginal cost

Krea 2 runs locally. No API call. No subscription. No daily limit. No content moderation filtering your creative decisions through someone else's risk tolerance. The cost is electricity and time, measured in seconds.

Fifty reference images for a production? Free. A hundred iterations to get the light right on a single seed frame? Free. Twenty variations of the same environment with different times of day, different weather, different wear on the walls? Free. The Kubrick number applied to image generation does not cost fourteen dollars. It costs the power draw of a GPU running for a few minutes.

For sixteen months, the economics of AI filmmaking have been about video generation costs. Veo at five cents per second. Venice staking for free inference. The bill arriving when subsidies evaporate. Every conversation about cost has focused on the video pipe because that is where the meter runs. Nobody talked about the cost of the input because the input was always cheaper than the output.

Now the input costs effectively nothing. The filmmaker's ground truth, the visual specification that determines how well the video model understands what you want, is free to produce, free to iterate, and free to throw away when it does not match the feeling in your head.
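A back-of-envelope check makes the gap concrete. The two-second generation time and the five-cents-per-second video rate come from the figures above; the GPU power draw, the electricity price, and the clip length are assumptions chosen for illustration, and real numbers will vary with hardware and location.

```python
# Rough cost of local image iteration versus metered video generation.
# Values marked ASSUMPTION are illustrative, not measurements.

SECONDS_PER_IMAGE = 2.0       # generation time quoted in the article
VIDEO_USD_PER_SECOND = 0.05   # per-second video rate quoted in the article
GPU_WATTS = 350.0             # ASSUMPTION: GPU power draw under load
USD_PER_KWH = 0.30            # ASSUMPTION: residential electricity price
CLIP_SECONDS = 4.0            # ASSUMPTION: length of one short clip


def local_image_cost(n_images: int) -> float:
    """Electricity cost in USD of generating n_images locally."""
    hours = n_images * SECONDS_PER_IMAGE / 3600.0
    return (GPU_WATTS / 1000.0) * hours * USD_PER_KWH


def video_clip_cost(n_clips: int) -> float:
    """Metered cost in USD of generating n_clips at the per-second rate."""
    return n_clips * CLIP_SECONDS * VIDEO_USD_PER_SECOND


for n in (50, 100):
    print(f"{n} local images: ${local_image_cost(n):.4f}")
print(f"1 metered clip:   ${video_clip_cost(1):.2f}")
```

Under these assumptions, a hundred seed-frame iterations cost well under one cent of electricity, while a single four-second clip costs twenty cents.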

What free iteration means

When image generation had a cost per prompt, filmmakers treated reference images the way they treated video generations during the credit-scarcity era: generate once, accept what comes back, move on. The scarcity mindset that the fifty-takes-costs-a-dollar argument identified in video generation applied to images too, just at a lower dollar amount. A few cents of friction is still friction.

Zero removes the friction entirely. And when friction disappears from the reference image stage, the filmmaker's relationship to the seed frame changes. It stops being a thing you produce and starts being a thing you discover. You generate fifty variations. You compare them. You notice that the version with the window light coming from camera left produces a rim separation on the subject that the version with overhead light does not. You pull that version, adjust the color temperature in the prompt, regenerate. Twenty more. One of them has a quality you did not ask for but recognize immediately. You pull that one. Now you have a reference image that carries your taste, not the model's taste, into the video generation pipeline.

That iterative discovery process was always theoretically available. It was just expensive enough to discourage. Free makes it the default workflow rather than the aspirational one.
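In practice, the discovery loop usually starts as a grid of prompt variations around one environment. A minimal sketch, assuming a base description and three axes to vary (time of day, weather, wall wear); every name and prompt string here is hypothetical, and the actual generation step is left to whatever local workflow you run.

```python
from itertools import product

# Hypothetical base environment and variation axes.
BASE = "Empty stairwell in an old apartment block, 35mm lens, eye level"
TIMES = ["dawn", "noon", "dusk", "night"]
WEATHER = ["clear", "overcast", "rain"]
WEAR = ["fresh paint", "peeling paint", "water-stained plaster"]


def prompt_variants(base, times, weather, wear):
    """Build one prompt per combination of time of day, weather, and wall wear."""
    return [
        f"{base}, {t} light, {w} outside, {s} on the walls"
        for t, w, s in product(times, weather, wear)
    ]


variants = prompt_variants(BASE, TIMES, WEATHER, WEAR)
print(len(variants))  # 4 * 3 * 3 = 36 prompts
print(variants[0])
```

Each prompt becomes one candidate seed frame. The comparing, rejecting, and pulling of the one with the unexpected quality stays a manual step.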

The beauty bias gets a checkpoint

The beauty bias that has threaded through every article in this series propagates through the seed image. A reference frame that defaults to warm palettes, clean surfaces, and balanced exposure carries those defaults into the video model. The video model amplifies them. By the time the four-second clip renders, two layers of beauty bias have been applied: one from the image model, one from the video model.

When the image model runs locally with no content moderation and no optimization for arena rankings, the filmmaker can fight the bias at the source. Generate the ugly version. The harsh fluorescent. The water stain on the ceiling tile. The uneven skin tone under bad light. The image model has the same beauty bias as every other model, but local iteration at zero cost means you can keep pushing until the output matches the honest version rather than the pleasant one. The fight moves upstream, where it is cheaper and more effective.

A filmmaker who feeds a beautiful seed frame into a video model gets a beautiful video. A filmmaker who feeds an honest seed frame into the same model gets an honest video. The reference image is the checkpoint where beauty bias is either accepted or rejected. Making that checkpoint free and unlimited changes the calculus of the fight.

The stack commoditizes from both ends

The commodity thesis has been tracking the video generation layer for months. API prices racing to the floor. Platforms hosting thirty models under one roof. Venice staking pushing inference toward zero. The generation pipe is a commodity.

Now the input side of the stack is commoditizing too. Krea 2 joins FLUX, Stable Diffusion, and a growing roster of open-weight image models that produce frontier-quality output at zero marginal cost. The seed image is becoming free infrastructure the same way video generation is becoming cheap infrastructure.

Two layers of the filmmaking stack heading toward zero from opposite directions. The image model that builds the world. The video model that animates it. Both getting cheaper. Both getting more accessible. Both producing more convergent output as the defaults spread.

The layer between them is the filmmaker. The person who decides what the seed image should look like. The person who decides what the motion should feel like. The person who rejects the first forty-nine reference frames and pulls the fiftieth because it has a quality the model did not volunteer. That layer has never commoditized. It is not going to start now.

The tools that produce the ground truth cost nothing. The tools that animate the ground truth cost pennies. The ground truth itself, knowing what the shot should look like and why, remains the most expensive thing in the pipeline. It always was. The price of everything around it just dropped to zero.


Bruce Belafonte is an AI filmmaker at Light Owl. He has generated more reference images than he has used and considers the ratio a sign of good taste rather than poor efficiency.