Nine articles in and I realize I have been writing about silent films.
Not on purpose. For most of the last year, that is what AI video was. Beautiful, occasionally stunning, completely mute. You worried about your lens language and your lighting dials and your color palettes and you dropped the output into a timeline and added audio later, same as you would with any other raw footage. Sound was a post problem.
It is not a post problem anymore.
Four of the five models this series covers now generate audio natively. Kling 3.0 calls it "Omni Native Audio." Veo 3 was arguably first to the party when it shipped synchronized sound last May. Sora 2 generates dialogue and ambience. Seedance 2.0 handles lip sync in eight languages. Runway Gen-4.5 is the holdout. Still silent. Still relying on you to handle sound externally.
Which means there is an entire prompt surface most people are treating like it does not exist.
The sound equivalent of "cinematic"
You know where this is going. The same thing that happened with visual keywords is happening with audio right now. People are generating clips with native sound and either ignoring the audio dimension entirely (accepting whatever the model defaults to) or adding a single instruction like "with realistic audio" and calling it done.
"With realistic audio" communicates nothing because every model's audio branch is already trying to be realistic. You are confirming the default, not directing it.
What you actually mean when you say "good audio"
Sound, like light, has components. A filmmaker's ear breaks any scene into at least five layers.
Dialogue is the voice. Who speaks, what they say, how they say it. Tone, pace, accent, volume. "She whispers" and "she shouts" produce meaningfully different outputs. Some models accept direct dialogue in quotes. Others infer speech from character descriptions and context.
Ambience is the room. Every location has a sonic fingerprint. A cathedral does not sound like a kitchen, and a kitchen does not sound the same empty as it does during a dinner party. City traffic at a distance. Wind through leaves. The buzz of a fluorescent tube. This is the layer that makes a visual feel like a place instead of a render.
Foley is the contact. Footsteps on gravel. Glass set down on marble. The rustle of a jacket sleeve. These are the sounds that happen when physical objects touch each other. In traditional production, a Foley artist performs them in a studio watching the cut frame by frame. In generated video, they happen (or do not happen) based on what the model infers from the visuals and your text.
Music is mood architecture. Background score, source music drifting from a radio in frame, a street musician down the block. Models can generate this but they tend to overcommit. You ask for "music" and you get a full string section. Sometimes you wanted a single piano at low volume.
Spatial quality is where the sound lives relative to the camera. Close sounds versus distant sounds. Reverberant versus dry. Echoing hallway versus carpeted bedroom. This is the subtlest layer and the one models handle least reliably, but it is worth requesting because when it works, the result sounds three-dimensional instead of pasted on top.
Five layers. Pin more of them down, get fewer defaults.
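If you think in code, the checklist is easy to make literal. Here is a minimal Python sketch of the five layers as a structure you fill in before generating. Nothing here is any model's API; SoundDesign and to_prompt are names I made up, and the only point is that an empty field is a default you are choosing not to direct.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SoundDesign:
    """The five layers. A field left as None is a layer
    the model will fill with its own default."""
    dialogue: Optional[str] = None  # who speaks, how: "her voice low and steady"
    ambience: Optional[str] = None  # the room: "muffled traffic from a closed window"
    foley: Optional[str] = None     # contact: "quiet footsteps on hardwood"
    music: Optional[str] = None     # instrumentation, not mood: "a single cello, played slowly"
    spatial: Optional[str] = None   # placement: "her voice close and dry, the traffic distant"

    def to_prompt(self) -> str:
        # Join only the layers you actually pinned down.
        cues = (getattr(self, f.name) for f in fields(self))
        return ", ".join(cue for cue in cues if cue)

shot = SoundDesign(
    dialogue="her voice low and steady",
    ambience="a clock ticking in another room, muffled traffic from a closed window",
    foley="quiet footsteps on hardwood",
)
print(shot.to_prompt())
# her voice low and steady, a clock ticking in another room,
# muffled traffic from a closed window, quiet footsteps on hardwood
```

Those are the same four cues as the example prompt later in this piece, just assembled mechanically. Music and spatial stay None, which means two layers are still the model's call.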
What works, model by model
Kling 3.0 generates audio and video in a single pass. Its Omni Native Audio is the most integrated approach available right now. Dialogue is solid. Ambient sound matches the visual environment convincingly. Voice Binding lets you attach voice profiles to specific characters and the model tracks who is speaking in multi-character scenes. Lip sync is accurate enough to not be distracting, which in February 2026 counts as high praise. The limitation: musical scoring defaults to generic. If you want specific instrumentation, describe it. "A single cello, played slowly" works. "Dramatic music" gets you the same string arrangement everyone else gets.
Veo 3.1, the current version of that audio pioneer, still has the most natural-sounding ambient generation. Environments sound lived-in. Rain sounds like rain, not like a rain sound effect pulled from a library. Dialogue is capable but the model has opinions about how characters should sound, the same way it has opinions about how scenes should be lit. It will add vocal texture you did not ask for if it decides the scene calls for it. Spatial audio is best in class. Genuine left-right separation when the framing supports it. Sounds shift position as the camera moves.
Sora 2 handles the audio-visual relationship the way it handles everything: narratively. It understands that a door closing in frame should produce a door sound, that footsteps should match walking speed, that a conversation implies voices. Dialogue generation exists but it reads instructions more loosely than Kling or Veo. Describe a conversation and Sora delivers the feel of it. Word-for-word dialogue is a different bet. Ambience is good. Foley is inconsistent. Musical scoring will either enhance the clip or fight it, and you will not know which until you hit generate.
Seedance 2.0 earns its reputation on lip sync. It covers eight languages and regional accents, and the alignment between mouth shapes and spoken words is genuinely tight. It responds well to explicit audio cues in the prompt: "Sound of aggressive sizzling" or "distant sirens growing louder" both get picked up reliably. The weakness is ambient subtlety. Room tone tends to either show up or not, without much gradient between silence and presence.
Runway Gen-4.5 generates no audio. This is a design choice more than a limitation. Runway's ecosystem assumes you are compositing, same as you would with traditionally shot footage. You bring your own sound design. For workflows that already route through a post-production audio pipeline, this is fine. For anyone who wants a finished clip from a single generation pass, it means Runway is the only major model that still hands you a silent film.
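If your clips move through a script rather than a chat box, that whole section compresses to one flag per model. A sketch, restating the paragraphs above as data; the identifier strings are my own shorthand, not official model IDs from anyone's API.

```python
# The section above, restated as data. True means the model
# returns a clip with synchronized sound; False means silence.
NATIVE_AUDIO = {
    "kling-3.0":      True,   # Omni Native Audio, single pass
    "veo-3.1":        True,   # best-in-class spatial and ambient sound
    "sora-2":         True,   # narrative sync, loose word-for-word dialogue
    "seedance-2.0":   True,   # strongest lip sync, eight languages
    "runway-gen-4.5": False,  # silent by design, bring your own sound
}

def needs_sound_pass(model: str) -> bool:
    """True when a clip from this model will come back mute and
    you should budget a post-production audio step, same as with
    traditionally shot footage."""
    return not NATIVE_AUDIO.get(model, False)
```

Unknown models fall through to True, on the theory that planning a sound pass you did not need is cheaper than shipping a silent clip you did not expect.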
How to actually prompt for it
The same principles from nine articles of visual prompting apply. Specificity over generality. Describe the result, not the technique. You would not say "apply Foley" any more than you would say "add cinematography." You describe what you hear.
A prompt that treats audio as a first-class citizen might include: "Quiet footsteps on hardwood, a clock ticking in another room, muffled traffic from a closed window, her voice low and steady."
That is four audio cues, one of them a vocal direction, all in one sentence. Each one gives the model a constraint. Each constraint is one fewer default the model fills in for you.
The word order insight from article eight applies here too. Audio instructions buried at the end of a 150-word visual prompt will get compressed or ignored entirely. If sound matters to the shot, front-load it or pair audio cues directly with the visual elements they belong to. "A woman walks across a marble lobby, her heels clicking with each step" carries more audio information than a visual paragraph followed by "add footstep sounds."
Combine audio and visual description into the same breath and models handle both better. Separate them into distinct sections and one gets priority while the other gets leftovers.
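Here is the contrast as string assembly, in case the pattern is easier to see mechanically. The beats are hypothetical; the point is that the paired version welds each sound to its visual, while the appended version builds exactly the paragraph-plus-afterthought structure that gets compressed.

```python
# Each beat pairs a visual element with the sound it owns.
beats = [
    ("A woman walks across a marble lobby", "her heels clicking with each step"),
    ("She stops at the front desk", "the clicking stops, a distant elevator chime"),
]

# Paired: audio lives in the same breath as its visual.
paired = ". ".join(f"{visual}, {audio}" for visual, audio in beats) + "."

# Appended: the habit to break. Visuals first, sound bolted on at the end,
# where a long prompt gets it compressed or dropped.
appended = (
    ". ".join(visual for visual, _ in beats)
    + ". Audio: " + "; ".join(audio for _, audio in beats) + "."
)

print(paired)
print(appended)
```

Both strings contain the same information. Only one of them tells the model which sound belongs to which moment.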
The silent filmmaker habit
Here is where most AI video creators are right now: still prompting like sound does not exist because for most of last year, it did not. The mental model has not caught up to the tooling. Clips are being generated with synchronized audio that nobody specifically asked for, and the default result sounds the way "cinematic 4K" looks. Pleasant. Generic. Anybody's.
CinePrompt has had a sound panel since day one. It was built for this moment. The bet was always that audio would become part of the generation, not a separate step. That bet is paying off faster than expected.
You spent nine articles learning to decompose visuals into specific, directional components. Sound is the same exercise with different dials. The models already listen. The question is whether you have anything specific to tell them.
Bruce Belafonte is an AI filmmaker at Light Owl. He recently realized that nine articles about visual prompting were, technically, a monograph on silent cinema.