Netflix open-sourced an AI model this week called VOID. Video Object and Interaction Deletion. Free on Hugging Face, code on GitHub, paper on arXiv. Their first public model. And it does not generate anything.
It erases things.
Every inpainting tool on the market can paint over an object in a video. Mask the boom mic, fill the pixels, move on. Content-aware fill. The background flows in to cover the gap. This works fine when the object being removed was not doing anything. A coffee cup sitting on a desk. A street sign in the background. Static visual furniture that the scene does not depend on.
VOID handles a different problem entirely. Remove a person who was pushing a shopping cart, and the cart stops rolling. Remove a hand holding a glass, and the glass falls. Remove a ball mid-bounce, and the surface it was going to hit stays still. The model does not just fill pixels where the object used to be. It traces the causal chain of what that object was doing to the rest of the scene and rewrites the physics accordingly.
That is not inpainting. That is scene comprehension wearing a practical outfit.
The comprehension bet, again
Netflix acquired InterPositive earlier this year. Ben Affleck's post-production AI company that builds custom models from a production's own dailies. The argument then: Netflix paid for the ability to understand footage, not to generate it. InterPositive's models learn a production's visual logic because they watched the rushes. They relight shots. They reframe coverage. They recover angles the camera missed. Not by inventing. By knowing what was already there.
VOID is the same instinct pointed at a different surface. InterPositive comprehends a production's visual identity. VOID comprehends a scene's physical reality. Both work by understanding what is already in the frame before altering it.
Netflix keeps building tools that know what they are looking at. The pattern is not subtle.
Why removal is harder than generation
This sounds backwards. Generation creates entire worlds from nothing. Removal just takes something out. Surely creation is the harder task.
It is not. Because generation can fake understanding.
A video model that produces a ball bouncing off a surface does not need to understand momentum. It needs to produce pixels that look like momentum looks in its training data. The output is plausible. The process is pattern completion. The model has seen ten thousand bouncing balls and averaged their behavior into a convincing surface. Whether it understands force, mass, or contact angle is irrelevant. It looks right.
Removal cannot fake it. Remove the ball, and the surface still depresses? Physically wrong. Any viewer catches it instantly. The model has to know what the ball was doing to the scene. Not what it looked like. What it was doing. Contact and consequence. The invisible relationships between objects that a camera records but no training label describes.
Generation produces the right answer by recognizing the visual pattern. Removal produces the right answer only by modeling the mechanism. The first is impressive. The second is intelligent. They look similar on a spec sheet. They are structurally different operations.
The mask that carries relationships
VOID uses a four-value mask. Every pixel gets one of four assignments: the primary object being removed, the overlap zone between primary and affected regions, the interaction-affected region (things that will move or change because of the removal), and the background. That four-part structure is a semantic map of physical relationships. Not "what is here" but "what is connected to what, and how."
Current generation models have no equivalent input. You cannot tell Kling which objects in your scene are physically interacting with which other objects. You cannot tell Veo that the light on the wall depends on the lamp on the desk and should shift if the lamp moves. You describe the scene in text. The model interprets it as a flat collection of visual elements with no causal wiring between them.
VOID's mask is the kind of structured input that generation models do not yet accept. It carries relational information that text prompts cannot express. It is the ground truth principle from a different angle: provide the model with structured information about how the scene actually works, and the output respects reality instead of hallucinating around it.
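As a data structure, a four-value mask of this kind is just a per-pixel label map. The sketch below is an illustration, not VOID's actual format: the label names, integer codes, and the `pixels_to_rewrite` helper are all assumptions made for clarity.

```python
from enum import IntEnum

import numpy as np


class MaskValue(IntEnum):
    """Hypothetical labels mirroring the four-value mask described above."""
    BACKGROUND = 0  # untouched scene
    PRIMARY = 1     # the object being removed
    OVERLAP = 2     # overlap zone between primary and affected regions
    AFFECTED = 3    # regions that change because of the removal


def pixels_to_rewrite(mask: np.ndarray) -> np.ndarray:
    """Boolean map of every pixel the model must re-synthesize:
    the removed object itself, plus everything it was influencing."""
    return np.isin(mask, [MaskValue.PRIMARY, MaskValue.OVERLAP, MaskValue.AFFECTED])


# Toy 4x4 frame: a "ball" (primary) touching a "surface" (affected),
# with one contact pixel where the two regions overlap.
mask = np.full((4, 4), MaskValue.BACKGROUND, dtype=np.uint8)
mask[1, 1:3] = MaskValue.PRIMARY
mask[2, 1] = MaskValue.OVERLAP
mask[2, 2:4] = MaskValue.AFFECTED

print(int(pixels_to_rewrite(mask).sum()))  # 5 pixels need causal rewriting
```

The point of the structure is visible even in the toy: the mask does not just say where the ball is, it says which other pixels are causally downstream of it, which is exactly the relational information a text prompt cannot carry.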
What a studio learns from its own tools
Netflix acquired comprehension. Now it is open-sourcing comprehension. The economics are different. One cost an undisclosed sum. The other is free. But the signal is consistent. Netflix believes the durable value lives in understanding footage, not in producing it. Production, as this series has argued since article sixteen, is a commodity. Six models, declining prices, interchangeable APIs. Understanding what makes a scene work, what connects one element to another, what would change if you altered a single variable, that is not a commodity. That is expensive whether the tool is human or machine.
A filmmaker who understands why a scene works can direct any model to produce it. A filmmaker who does not will accept whatever the model volunteers. The same division holds at the model level. A model that understands a scene can modify it intelligently. A model that does not fills pixels and hopes nobody zooms in.
The frame, minus one
Forty-three articles about what to put in the frame. Lens, light, color, movement, sound, time, performance, environment, composition. Structured vocabulary for describing what should exist inside the rectangle.
This one is about what happens when you take something away.
The result is clarifying. Removal proves comprehension more cleanly than generation ever could. Generation lets the model hide behind pattern-matching. Every plausible output could be genuine understanding or could be a well-fitted statistical surface. There is no way to tell from the outside. Removal strips the disguise. Either the physics rewrite correctly or they do not. The ball falls or it floats. The cart stops or it keeps rolling into nothing. No beauty bias to smooth the contradiction. No art direction to distract from the error.
Netflix's first public model does not add a single pixel of new content to the world. It proves understanding by what it leaves behind when it takes something out. And that understanding, the ability to model what is actually happening in a scene rather than what it approximately looks like, is the piece that every generation model is still missing.
The gap between filmmaker and model has always been a comprehension gap. The filmmaker understands the scene. The model approximates it. One day those will converge. When they do, it will look less like a model that generates better footage and more like a model that understands what footage means.
VOID is a postcard from that future. Written in the language of subtraction.
Bruce Belafonte is an AI filmmaker at Light Owl. He watched a model erase a person from a video and spent the rest of the afternoon thinking about what happened to the shopping cart.