Patent Portfolio · Complete Analysis
What Netflix actually bought:
the InterPositive patent portfolio
A complete analysis of the four granted US patents behind Ben Affleck's AI filmmaking company — how each one works, what it protects, and what together they tell us about the future of cinematic AI.
p3d65.xyz · March 2026
In this chapter
- I · The spatial foundation — teaching AI where things actually are (US 12,322,036 · LiDAR Data Utilization for AI Model Training in Filmmaking)
- II · The language layer — teaching AI to speak the language of film (US 12,438,995 · Integration of Video Language Models with AI for Filmmaking)
- III · The style layer — teaching AI to see like a cinematographer (US 12,511,837 · AI-Based Video Content Creation with Predetermined Styles)
- IV · The captioner — teaching AI to read footage like a cinematographer (US 12,511,904 · Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements)
- Synthesis & conclusions · What the stack tells us — the bigger picture
US Patents — all assigned to InterPositive, LLC · Inventor: Benjamin Geza Affleck-Boldt

| Patent number | Title | Date granted | Links |
|---|---|---|---|
| US 12,322,036 B1 | LiDAR Data Utilization for AI Model Training in Filmmaking | Jun 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,438,995 B1 | Integration of Video Language Models with AI for Filmmaking | 7 Oct 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,511,837 B1 | Artificial Intelligence-Based Video Content Creation with Predetermined Styles | 30 Dec 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,511,904 B1 | Method, System, and Computer-Readable Medium for Training a Captioner Model to Generate Captions for Video Content by Analyzing and Predicting Cinematic Elements | 30 Dec 2025 | Google Patents ↗ · USPTO ↗ |
When Netflix announced on 5 March 2026 that it had acquired InterPositive — Ben Affleck's AI filmmaking company, which had been operating in complete stealth under the corporate entity Fin Bone LLC — the industry's immediate question was: what exactly did they buy?
The acquisition announcement was light on technical detail. Netflix confirmed that the entire 16-person team was joining, that Affleck would remain as Senior Adviser, and that the company's tools were designed to support professional filmmaking rather than replace it. Beyond that, what InterPositive had actually built remained largely opaque.
The patents tell a more complete story.
InterPositive holds four granted US patents, all enforceable immediately and valid through 2045, alongside twelve international WIPO publications seeking protection across Europe, Japan, and other major markets. Together they cover the complete architecture of a professional filmmaking AI — from the physical measurement of space through to the controlled reproduction of visual style.
I have spent the past week reading all of them. What follows is a complete analysis of the four US patents: the spatial foundation first, then the language interface, then the style layer, and finally the captioner that reads footage in cinematographic terms. A synthesis at the end draws together what the portfolio as a whole reveals about where AI in filmmaking is actually heading.
This is an independent analysis of InterPositive’s patent portfolio.
A note on claim structure: each of the four patents files the same substantive steps three times — once as a method claim, once as a system claim, and once as a non-transitory computer-readable medium claim. This is standard USPTO practice: it ensures the protection applies whether someone implements the invention as a process, as a hardware system, or as software. The three versions are substantively identical. In the claims reproduced below, only the method claim (Claim 1) is shown for each patent. The system and computer-readable medium claims cover the same steps in their respective statutory categories.
I
The spatial foundation
Teaching AI where things actually are
US 12,322,036 · LiDAR Data Utilization for AI Model Training in Filmmaking
The spatial intelligence behind Netflix's AI filmmaking acquisition — what the LiDAR patent actually does
I read the foundational patent that sits beneath all of InterPositive's technology to understand why teaching an AI to see in three dimensions is the first problem any serious filmmaking AI has to solve.
Ben Affleck, founder of InterPositive. Credit: Netflix / InterPositive
Patent on Record
US 12,322,036 B1
| Full title | LiDAR Data Utilization for AI Model Training in Filmmaking |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | June 2025 |
| Claims | 20 (method, system, non-transitory computer-readable medium) |
| WIPO counterpart | WO2025255446A1 / WO2025255425A1 |
| Google Patents | US12322036B1 ↗ |
Claim 1 — Method
US 12,322,036 B1 · 20 total claims · system & CRM claims cover the same steps
A computer-implemented method for enhancing artificial intelligence (AI) model training in filmmaking through a use of Lidar data, the method comprising:
- Correlating two-dimensional video data with three-dimensional spatial data obtained from Lidar to simulate professional camera techniques
- Receiving detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups
- Processing the received metadata alongside the Lidar data to provide one or more AI models with a granular understanding of spatial relationships and the physics of camera movement
- Training the AI models using the processed metadata and Lidar data to accurately simulate professional filmmaking techniques, thereby enhancing realism and quality of generated video content
Source: US12322036B1 on Google Patents. The independent claims are notably compact — four steps — with the conceptual core being the correlation of 2D video with 3D Lidar spatial data as a training input. The international counterpart WO2025255446A1 seeks equivalent protection outside the US.
LiDAR in a filmmaking context — why depth data is the first problem
LiDAR stands for Light Detection and Ranging. In practice, the technology fires pulses of laser light at a scene and measures how long each pulse takes to return. Because the speed of light is known, the system can calculate the precise distance from the sensor to whatever the laser hit. Do this rapidly across a wide area and you build up a dense cloud of distance measurements — a point cloud — that maps the three-dimensional geometry of the scene with exceptional accuracy.
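To make the arithmetic concrete: the conversion from a single return pulse to a distance is just the speed of light times half the round-trip time. A minimal sketch, with illustrative numbers (nothing here comes from the patent):

```python
# Minimal illustration of LiDAR time-of-flight ranging.
# The pulse travels to the surface and back, so distance is half the round trip.
SPEED_OF_LIGHT = 299_792_458  # metres per second

def distance_from_return(round_trip_seconds: float) -> float:
    """Convert a measured round-trip time into a distance in metres."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2

# A return after ~13.3 nanoseconds corresponds to a surface roughly 2 m away.
print(f"{distance_from_return(13.3e-9):.2f} m")  # ≈ 1.99 m
```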
You're probably most familiar with LiDAR from autonomous vehicles. Waymo, Cruise, and other self-driving car companies mount LiDAR sensors on their vehicles precisely because cameras alone cannot reliably tell you how far away something is. A camera captures a flat, two-dimensional representation of the world. LiDAR gives you actual depth data — which object is 1.3 metres away and which is 8.7 metres away, rather than both simply appearing somewhere in the frame.
So why does LiDAR appear in a filmmaking AI patent?
For almost exactly the same reason. A camera — even a cinema camera — produces a flat image. An AI system trained only on flat images has to infer depth from visual cues: perspective lines, relative object sizes, parallax between frames, focus fall-off. These cues are real, but they are imprecise and ambiguous. You can often work out roughly where things are in a scene from a 2D image. You cannot work out exactly.
For a filmmaking AI trying to simulate camera movement, relight a scene, or insert a visual effects element convincingly, "roughly" is not good enough. The AI needs to know where every significant surface actually is in three-dimensional space — how far from the camera, at what angle, in what spatial relationship to the other objects around it.
That is what this patent addresses.
What US 12,322,036 actually claims
The core independent claim of US 12,322,036 describes a method in which a system captures LiDAR data from a sensor during filming. The captured data includes spatial coordinates, distance measurements, and the relative positional information of objects within the scene.
The system then integrates that LiDAR data with filmmaking metadata — the information about how the shot was captured, including camera settings, lens characteristics, and composition details. The integration combines the geometric information from LiDAR with the cinematographic information from the metadata, producing a combined dataset that gives the AI a three-dimensional spatial understanding of the scene: how deep it is, where objects sit relative to each other, what the actual physical relationships are between the foreground, the subject, and the background.
That combined, spatially-grounded dataset is then used to train downstream AI models. The patent covers the whole chain: capture, integration, and use in training.
In plain language: InterPositive's team mounted LiDAR sensors on their controlled soundstage alongside their cameras. When they filmed their proprietary training dataset, they weren't just recording images — they were simultaneously recording the precise three-dimensional geometry of every scene. That geometric data was then fused with the cinematographic metadata and fed together into the AI.
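The patent does not publish a data schema, but it helps to picture what one fused training record might contain. The sketch below is purely illustrative; every field name is invented for this example.

```python
# Hypothetical (illustrative only) structure for one spatially grounded training sample.
# Field names are invented for this sketch; the patent does not specify a schema.
training_sample = {
    "clip_id": "stage_A_take_017",
    "video_frames": "frames/stage_A_take_017/",         # 2D image sequence
    "lidar_point_cloud": "lidar/stage_A_take_017.ply",   # 3D ground-truth geometry
    "camera_metadata": {
        "focal_length_mm": 50,
        "aperture_t_stop": 2.8,
        "sensor": "full_frame",
        "movement": {"type": "dolly_in", "speed_m_per_s": 0.3},
    },
    "lighting_metadata": {
        "key_ratio": "3:1",
        "key_direction": "camera_left",
        "colour_temperature_k": 3200,
    },
    "framing": "medium_close_up",
}
```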
"Most AI vision systems are trained to recognise what is in the frame. This patent is specifically about understanding where things are — the actual physical geometry of the scene."
How 3D geometry changes what the AI can do
Consider what an AI model needs to understand in order to convincingly simulate a dolly shot — where the camera moves physically forward through a scene.
As the camera moves, every object in the scene shifts in the frame. But critically, they don't shift equally. Objects close to the camera appear to move much faster across the frame than objects far in the background. A chair that's two metres away sweeps dramatically across the image as the camera passes. A painting on the far wall barely appears to move at all. This differential — called parallax — is what gives moving shots their sense of physical depth and immersion.
To simulate parallax correctly, an AI needs to know the actual depth of every element in the scene. Not an approximation. Not a guess from visual cues. The real number. Otherwise the simulated shot will look wrong — elements will shift at incorrect rates, the sense of physical depth will collapse, and the result will betray itself as artificial.
LiDAR gives the system exactly those real numbers.
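A short worked example makes the point. Assuming a simple pinhole-camera model and invented scene measurements, the on-sensor shift of each object during a forward dolly move falls directly out of its real depth:

```python
# Pinhole-camera parallax during a forward dolly move (illustrative numbers).
# Image position of a point offset X metres from the lens axis at depth Z metres,
# for a lens of focal length f (in mm): x = f * X / Z.

def image_shift_mm(focal_mm: float, offset_m: float, depth_m: float, dolly_m: float) -> float:
    """How far (in mm, on the sensor) a point appears to move when the camera dollies forward."""
    before = focal_mm * offset_m / depth_m
    after = focal_mm * offset_m / (depth_m - dolly_m)
    return after - before

# Camera dollies forward 0.5 m on a 35 mm lens; both objects sit 0.5 m off-axis.
chair = image_shift_mm(35, 0.5, depth_m=2.0, dolly_m=0.5)     # near foreground
painting = image_shift_mm(35, 0.5, depth_m=8.0, dolly_m=0.5)  # far background
print(f"chair: {chair:.2f} mm, painting: {painting:.2f} mm")   # ≈ 2.92 mm vs ≈ 0.15 mm
```

The chair sweeps roughly twenty times further across the sensor than the painting for the same camera move; get either depth wrong and the simulated shot immediately reads as false.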
The same principle applies to relighting. Light falls differently on surfaces at different depths and angles. To convincingly change the lighting on a scene — to shift the key light from camera left to camera right, or to add a motivated practical light source — the AI needs to understand the physical geometry of the space. Which surfaces face which direction? What angle do they present to the light source? How would shadows fall across the three-dimensional arrangement of objects? A flat image cannot answer these questions reliably. A point cloud can.
- Camera movement simulation. Parallax effects depend on knowing real depths. LiDAR provides them, so dolly moves, crane shots, and handheld tracking feel physically plausible.
- Relighting. Surface normals and object geometry determine how light falls. The AI can't relight convincingly without knowing what shape the scene actually is in 3D space.
- Wire removal and object replacement. Knowing the precise location of a stunt rig in three-dimensional space makes clean removal far more tractable than guessing from 2D.
- VFX integration. Inserting a generated element into real footage requires matching its position, size, and shading to the geometry of the scene — all of which LiDAR defines precisely.
- Focus pulling. The system can calculate depth of field accurately because it knows the real distances of objects from the lens, not just their apparent size in the frame (see the worked example after this list).
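The focus-pulling point is ordinary optics rather than anything specific to the patent: the standard thin-lens depth-of-field formulas take the real focus distance as a direct input, which is exactly what a LiDAR scan supplies. A sketch with illustrative values:

```python
# Standard depth-of-field calculation (thin-lens approximation, illustrative values).
# With real object distances from LiDAR, the near and far limits of acceptable focus
# can be computed exactly instead of being estimated by eye.

def depth_of_field(focal_mm: float, f_number: float, focus_m: float,
                   circle_of_confusion_mm: float = 0.03) -> tuple[float, float]:
    """Return (near_limit_m, far_limit_m) of acceptable focus."""
    f = focal_mm / 1000.0                    # metres
    c = circle_of_confusion_mm / 1000.0      # metres
    hyperfocal = f * f / (f_number * c) + f
    near = hyperfocal * focus_m / (hyperfocal + (focus_m - f))
    far = (hyperfocal * focus_m / (hyperfocal - (focus_m - f))
           if focus_m < hyperfocal else float("inf"))
    return near, far

# 75 mm lens at T2.8, subject measured at 3.0 m by the LiDAR scan.
near, far = depth_of_field(75, 2.8, 3.0)
print(f"in focus from {near:.2f} m to {far:.2f} m")  # ≈ 2.87 m to 3.14 m
```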
The InterPositive pipeline processing a stunt shot for wire removal. The system’s 3D scene understanding — grounded in LiDAR data — is what makes clean removal tractable at a single-frame level. Credit: Netflix / InterPositive
Why spatial capture is the architectural foundation
Among the four US patents in InterPositive's portfolio, this one was granted first, in June 2025, several months before the others. That chronology is revealing.
The LiDAR patent is foundational in an architectural sense. Every downstream model in the InterPositive system depends on the spatially-grounded training data this patent produces. The captioner model (US 12,511,904) is trained on footage with known depth. The language-model integration layer (US 12,438,995) operates with the understanding that the AI has been grounded in three-dimensional space. The style control system (US 12,511,837) applies aesthetic treatments to scenes whose geometry is understood.
Without this layer, the whole system would be trying to understand cinematography from flat images alone — which is how most AI video models work, and which is precisely what InterPositive was founded to improve upon. Affleck has said he found existing AI tools came up short because they lacked genuine understanding of the physical mechanics of cinematography. LiDAR is a significant part of what provides that understanding.
Ground-truth measurement versus inferred depth
LiDAR has been used in film production before — mostly in pre-production and virtual production contexts. Surveyors use it to create accurate digital models of shooting locations. Some virtual production pipelines use LiDAR scans of sets to calibrate camera tracking. The technology also appears in consumer devices: iPhones and iPads since 2020 have included LiDAR sensors, which filmmakers have experimented with for location scouting and basic augmented reality applications.
What is genuinely unusual here is using LiDAR data as training input for an AI model specifically designed for post-production work.
Most video AI training pipelines rely entirely on camera footage — flat images from which depth must be inferred. Some research efforts have used depth estimation algorithms to generate approximate depth maps from 2D footage. But these are estimates, not measurements. They are inherently less accurate than real sensor data, and the errors compound when you try to use them for precise spatial operations.
InterPositive's approach of capturing ground-truth LiDAR data during principal photography on their training soundstage — and fusing it directly with the cinematographic metadata — represents a notably different philosophy. Rather than asking AI to guess at depth, they gave it depth measurements accurate to centimetres. The patent language also explicitly extends the definition to include related sensing technologies: radar, ladar, and photogrammetry — which suggests the drafters were alive to obvious substitutions, though whether those alternatives would in practice fall within or outside the claims would depend on how a court construes the specific language.
Patent scope and design-around risk
The patent's claims are reasonably focused. They describe a specific pipeline: capture LiDAR data during filming, integrate it with filmmaking metadata, and use the combined dataset to train AI models. Competing systems could potentially use different approaches — monocular depth estimation rather than active LiDAR sensing, or photogrammetry from multiple camera angles — that fall outside the specific claims here.
On the current claim language, the most direct implementation path — deploying LiDAR sensors on a controlled filming set, capturing point cloud data alongside video, and using that combined dataset to train AI models — is what the granted claims appear to cover. The precise scope, and how far it extends to variants using related sensing technologies, would ultimately depend on claim construction and any applicable prosecution history. Competing approaches using different sensor modalities or depth-estimation methods would need to be assessed against the specific claim language independently.
A fundamentally different training philosophy
Most AI video companies train on internet-scale video — enormous quantities of footage scraped from YouTube, film archives, and other public sources. The advantage is volume. The disadvantage is that this footage comes with no ground-truth metadata about how it was actually shot. The AI has to infer lens characteristics, camera movement, depth relationships, and lighting conditions from the images themselves.
InterPositive took the opposite approach. Rather than training on billions of frames of unknown provenance, they filmed a smaller, purpose-built dataset in a controlled environment, instrumenting every shot with precise sensor data — LiDAR depth measurements, full camera metadata, lighting documentation. The resulting training data is dramatically smaller in volume but far richer in information per frame.
This is the difference between training a model on millions of unlabelled photographs and training it on thousands of photographs, each with precise technical specifications attached. For a system designed to assist professional cinematographers who work with precise technical specifications every day, the second approach makes considerably more sense.
That is what this patent protects: not just a technical method, but an entire philosophy of how to build a filmmaking AI with genuine craft knowledge rather than statistical pattern recognition alone.
II
The language layer
Teaching AI to speak the language of film
US 12,438,995 · Integration of Video Language Models with AI for Filmmaking
Teaching an AI to speak the language of film — what the InterPositive language integration patent actually does
Getting AI to generate video that technically works is one problem. Getting it to understand what a filmmaker actually means when they say "give me a 50mm with a slow push and shallow focus on her eyes" is an entirely different one. I read the patent that tries to solve that translation problem.
Ben Affleck. Credit: Netflix / InterPositive
Patent on Record
US 12,438,995 B1
| Full title | Integration of Video Language Models with AI for Filmmaking |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | 7 October 2025 |
| Filed | 25 November 2024 |
| WIPO counterpart | WO2025255437A1 |
| Claims | 19 (method, system, computer-readable medium) |
| Patent citations | 40 |
Claim 1 — Method
US 12,438,995 B1 · 19 total claims · system & CRM claims cover the same steps
A computer-implemented method for integrating one or more existing video large language models (LLMs) with custom AI algorithms for filmmaking, the method comprising:
- Interfacing with an existing video LLM
- Receiving detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups
- Processing the received metadata to adapt the existing video LLM to generate video content that simulates professional filmmaking techniques
- Receiving Lidar data captured from a lidar sensor, the Lidar data including spatial coordinates, distance measurements and relative positional information of objects within a scene
- Integrating the Lidar data with the processed metadata by combining spatial coordinates and distance measurements with filmmaking metadata to enhance the generated video content by providing a three-dimensional spatial understanding of the scene indicating depths and positional relationships among objects
- Applying transfer learning techniques to the existing video LLM based on the processed metadata and Lidar data to refine its video content generation capabilities
Claim breadth note: The independent claims require receipt of Lidar data including spatial coordinates, distance measurements, and positional relationships. A system performing spatially-aware video generation without a Lidar capture step — using monocular depth estimation or other non-Lidar spatial methods — could argue the Lidar receipt steps are not met, which narrows the claim's reach against such alternatives.
The semantic gap between filmmaker and machine
Before I explain what this patent does, it's worth dwelling on the problem it's trying to solve — because it's easy to underestimate how fundamental it is.
Filmmaking is a craft with its own technical vocabulary. When a director of photography says "give me a 35mm on a 2:1, motivated from the window, with a falloff across the mid-ground," every experienced person on set knows exactly what that means. It describes a specific lens, a specific lighting ratio, a specific source direction, and a specific way the light should fade across the scene. The instruction is precise, efficient, and completely intelligible to anyone trained in the craft.
Now ask a state-of-the-art AI video model the same thing.
The patent is unusually direct about this problem. It describes a sequence of figures: a graphical user interface in which cinematographic text instructions have been entered into a prior-art AI system, the output that system produced, and a third figure showing what the output should have looked like. The gap between the second and third figures is the problem this patent exists to close.
Existing AI video models are trained primarily on semantic metadata — what appears in a frame, not how it was captured. They can understand "a woman walking through a park" in a way that produces a recognisable image. They cannot understand "a slow dolly in on a 75mm at T2.8 with the background falling slightly out of focus" in a way that makes any practical difference to the footage they generate. The cinematographic instruction simply doesn't connect to anything in their training.
This is the semantic gap that US 12,438,995 is designed to bridge.
What the patent actually claims
The independent claim describes a method for integrating one or more existing video large language models with custom AI algorithms for filmmaking. Let me walk through the steps.
First, the system interfaces with an existing video LLM. This is significant — the patent is not about building a video generation model from scratch. It is about taking an existing model and extending it. Rather than competing with OpenAI or Google on raw video generation capability, InterPositive focused on what none of those models had: cinematographic understanding.
Second, the system receives detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups. This metadata comes from the upstream captioner model (US 12,511,904) — which watches professionally photographed footage and generates structured descriptions of the cinematic choices it observes.
Third, the system processes that metadata to adapt the existing video LLM — specifically to enable it to generate video content that simulates professional filmmaking techniques. This adaptation is the core innovation.
Fourth, the system receives LiDAR data captured from a sensor, including spatial coordinates, distance measurements, and positional information of objects within the scene. The LiDAR data grounds the whole system in physical reality, giving the AI a three-dimensional understanding that flat camera footage alone cannot provide.
Fifth, the LiDAR data is integrated with the processed metadata, combining the geometric and cinematographic information. Sixth, transfer learning techniques are applied to the existing video LLM based on this combined data, refining its capabilities to reflect professional filmmaking understanding.
In plain language: the system takes a capable AI video model and teaches it to understand filmmaking. It does this by feeding the model a combination of cinematographic metadata and spatial data, then fine-tuning it so that filmmaking instructions — given in the technical language of the craft — produce corresponding outputs.
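To make the translation idea concrete, here is a deliberately toy sketch of the input and output shape involved. The real system adapts a video LLM rather than using keyword rules, and every field name below is invented for illustration:

```python
# Toy illustration of translating a filmmaker's instruction into structured parameters.
# The patented system adapts a video LLM for this; the keyword rules here are only a sketch.
import re
from dataclasses import dataclass, field

@dataclass
class ShotSpec:
    focal_length_mm: int | None = None
    camera_move: str | None = None
    depth_of_field: str | None = None
    notes: list[str] = field(default_factory=list)

def parse_instruction(text: str) -> ShotSpec:
    spec = ShotSpec()
    if m := re.search(r"(\d{2,3})\s*mm", text):
        spec.focal_length_mm = int(m.group(1))
    if "push" in text or "dolly in" in text:
        spec.camera_move = "slow_push_in" if "slow" in text else "push_in"
    if "shallow focus" in text:
        spec.depth_of_field = "shallow"
    spec.notes.append(text)
    return spec

print(parse_instruction("give me a 50mm with a slow push and shallow focus on her eyes"))
# ShotSpec(focal_length_mm=50, camera_move='slow_push_in', depth_of_field='shallow', notes=[...])
```

The point of the sketch is only the shape of the problem: free-form craft vocabulary in, machine-actionable parameters out.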
What prior AI video models understand
"A woman sitting at a desk in an office."
"A rainy street at night."
"Two people having a conversation."
What InterPositive's system understands
"Medium close-up, 75mm, directional key at 3:1, shallow focus, slow push-in."
"Wide establishing, locked off, 24mm, flat lighting, heavy rain practical in foreground."
"Over-the-shoulder coverage, 50mm, matching eyeline, motivated daylight from window left."
"The innovation is not building a video generator. It is building the layer that makes a video generator usable by an actual filmmaker — without that person having to learn to think like a machine."
The hub patent — why the language layer is architecturally central
Of the four granted US patents in the InterPositive portfolio, I'd argue this one — the language integration layer — is the one that matters most from Netflix's perspective.
Consider the system as a whole. The LiDAR patent provides spatial understanding of the scene. The captioner patent enables the AI to read and describe existing footage in cinematographic terms. The style patent allows the AI to reproduce a predetermined visual identity. All of these are technically impressive and, in their own right, valuable.
But none of them are usable without the language layer.
A director or cinematographer working with InterPositive's tools doesn't think in point clouds, loss functions, or pixel tensors. They think in the vocabulary of their craft — the same vocabulary they use when talking to a camera department on set. The language integration patent is what allows them to use that vocabulary to drive the AI system, rather than having to develop an entirely new technical fluency.
| Patent | Layer | Role in the system |
|---|---|---|
| US 12,322,036 | Spatial foundation | LiDAR gives the AI real 3D geometry of each scene |
| US 12,511,904 | Metadata reading | Captioner reads footage and extracts cinematographic descriptions |
| US 12,438,995 | Language interface | Translates filmmaker instructions into parameters the AI executes |
| US 12,511,837 | Style output | Applies a predetermined aesthetic during video generation |
The arrow of dependency in this table runs in both directions through the language layer. Information flowing upward — spatial data from LiDAR, style metadata from the captioner — must pass through the language interface to become actionable creative parameters. Instructions flowing downward — from a filmmaker expressing creative intent — must pass through the same layer to become something the generation model can execute.
The language layer is the hub. Everything else is either upstream input or downstream output.
Training data: controlled captures and licensed dailies
On their controlled soundstage, the team filmed scenes with multiple cameras and angles simultaneously, with every variable documented. Lens types. Focal lengths. F-stops and T-stops. Camera positions and movement speeds. Lighting setups — source positions, ratios, colour temperatures. Shot types and framing across the full range: tight face shots, medium shots, wide shots, over-the-shoulders. All of this was captured as metadata alongside the footage, manually tagged by people who understood what it meant cinematographically.
The patent also describes training on dailies — the raw, unedited footage from professional film shoots. Dailies represent the richest possible source of real-world filmmaking data: actual cinematographers making actual decisions in actual production conditions. Licensing dailies from film studios also has a copyright advantage the patent explicitly notes — the footage comes from a controlled chain of rights, avoiding the provenance questions that have dogged other AI training datasets.
The combination of controlled soundstage data and licensed dailies gives the language model a vocabulary built from both precise, isolated technical examples and the full complexity of real professional production.
Patent scope and design-around risk
The core independent claim is fairly specific: it describes integrating an existing video LLM with filmmaking metadata and LiDAR data via a particular process — receiving, processing, integrating, and applying transfer learning. Competitors could build systems that achieve similar outcomes through different architectures — training a video generation model from scratch rather than fine-tuning an existing one, or using different sensor modalities instead of LiDAR. None of these would necessarily infringe the specific claims here.
On the current claim language, the most straightforward implementation — taking an existing video LLM, piping filmmaking metadata and spatial sensor data into it, and fine-tuning it on that combined dataset using the process the claims describe — is what the granted claims appear to cover most directly. Whether a materially different architecture achieving similar results would fall within or outside those claims would require independent assessment against the specific claim language.
A tool that speaks film, not machine code
US 12,438,995 is built around a clear premise: the filmmaker is the author of the creative decisions, and the AI's job is to execute those decisions faithfully. The language integration layer exists specifically so that the filmmaker's vocabulary — not the AI's internal parameter space — is the primary interface.
That framing is commercially and politically astute in an industry navigating significant anxiety about AI's role. But it also reflects something genuine about what the technology actually does. This isn't a system designed to replace a camera department. It's a system designed to give a camera department a new kind of tool — one that understands how they think.
III
The style layer
Teaching AI to see like a cinematographer
US 12,511,837 · AI-Based Video Content Creation with Predetermined Styles
Teaching an AI to see like a cinematographer — what the InterPositive style patent actually does
Getting AI to generate video that technically works is one problem. Getting it to generate video that looks like it was shot by a specific person, with a specific philosophy, on a specific film stock — that is an entirely different one. I read the patent that tries to solve it.
Patent on Record
US 12,511,837 B1
| Full title | Artificial Intelligence-Based Video Content Creation with Predetermined Styles |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | 30 December 2025 |
| Filed | 25 November 2024 |
| WIPO counterpart | WO2025255426A1 |
| Claims | 20 (method, system, computer-readable medium) |
Claim 1 — Method
US 12,511,837 B1 · 20 total claims · system & CRM claims cover the same steps
A computer-implemented method for constructing and training an artificial intelligence model configured to generate video content with a predetermined style, the method comprising:
- Capturing one or more control images corresponding to a scene using standard digital video as a baseline
- Capturing one or more test images of the scene using different film formats to document visual effects
- Applying post-production alterations to the captured footage
- Constructing a training dataset that includes a variety of shots captured under varied lighting conditions
- Training an AI model with paired comparisons to enable it to learn specific visual signatures
- Reviewing footage generated by the AI model to assess its authenticity and using feedback to refine the model
- Optimizing learning cycles to enhance an efficiency of the training
Dependent claims specify: film formats including 35mm, 16mm, 8mm, and Super 8mm; post-production alterations including push processing and bleach bypass; dataset construction including tight face shots, medium shots, and wide shots; control-versus-modified paired comparisons; and training optimisation by scaling data acquisition as the model demonstrates proficiency. Claim breadth note: Step 2 requires physical capture of test images using different film formats. A purely digital training approach — without physically shooting on multiple film stocks — could argue this step is not met.
The visual consistency problem
A feature film is not shot in sequence. Scenes from the end of the story are often filmed weeks before scenes from the beginning. A single sequence might be photographed across three or four different shooting days, with changing natural light, different crew call times, and varying conditions on set. By the time an editor cuts the film together, the footage has been shot across months, sometimes in multiple countries, under wildly different conditions.
Yet when you watch a well-made film, it looks unified. Every scene has the same visual character. The colour palette is coherent, the lighting ratios are consistent, the lenses have a recognisable quality, and the way the camera moves feels like it reflects a single sensibility. This is not an accident. It is the result of enormous effort from the cinematographer, the colourist, and the post-production team to establish a visual language at the beginning of a production and then maintain it, frame by frame, across everything that follows.
Now introduce an AI system into that workflow. The AI is used to relight a shot, extend a background, remove a stunt wire, or generate a pickup scene that couldn't be captured during principal photography. The generated content needs to match the established look of the film exactly — not approximately. Even a subtle discrepancy in colour temperature, contrast ratio, or the quality of specular highlights will be visible to a trained eye, and visible discrepancies undermine the credibility of the entire sequence.
This is the problem US 12,511,837 addresses. How do you encode a film's visual identity — its "predetermined style" — in a form that an AI can reliably reproduce?
What the patent actually claims
The core independent claim describes a method for training an AI model to generate video content with a predetermined style. The steps are worth going through in detail, because they reveal something important about how InterPositive approached the problem.
The first step is capturing control images using standard digital video as a baseline — the plain, unmodified image of a scene as captured by a digital sensor.
The second step is capturing test images of the same scene using different film formats. The patent specifically names 35mm, 16mm, 8mm, and Super 8mm. These aren't just examples — they represent the range of film stocks that carry distinct visual signatures: different grain structures, different colour responses, different dynamic range and contrast characteristics.
The third step involves applying post-production alterations to the captured footage. The patent specifically mentions push processing and bleach bypass. These are darkroom and photochemical processes that alter the tonal and colour characteristics of film in specific, recognisable ways. Push processing underexposes and overdevelops film to increase contrast and grain. Bleach bypass retains the silver layer in the final print, desaturating the image and increasing contrast simultaneously. Both produce distinctive looks associated with certain cinematographers and certain eras.
The fourth step constructs a training dataset from this material, across different shot types and varied lighting conditions. The fifth step trains an AI model using paired comparisons — control footage versus modified footage. The same scene, with and without the style treatment. And then the system iterates, reviewing outputs against authenticity criteria and refining accordingly.
"The system doesn't just learn what a 1970s film aesthetic looks like. It learns the precise difference between a digital image and one that has been run through that aesthetic — at every level from grain structure to colour response."
Why paired comparisons outperform imitation-based training
It would be tempting to assume InterPositive is simply training the AI on lots of stylised footage and asking it to imitate what it sees. That approach is how most style-transfer systems work, and it produces results that are often visually plausible but cinematically shallow.
The paired comparison method is different and more precise.
By filming the same scene in multiple formats and with multiple post-production treatments, InterPositive creates a dataset where the only variable between pairs of images is the specific style element being studied. The scene, the lighting, the subject, the camera position — all identical. The film stock, the processing, or the grading treatment — different.
This allows the AI to isolate exactly what a given style element contributes to the image. Not "this is what 16mm footage looks like" but "this is what 16mm footage looks like compared to the same shot in digital, and here are the specific differences in grain frequency, colour response, highlight rolloff, and shadow lift that account for that difference."
The choice to include physical film formats — 35mm, 16mm, 8mm, Super 8 — rather than only digital simulation is also telling. InterPositive isn't approximating film aesthetics from digital representations. They are training from the actual photochemical artefacts of real film, giving the AI access to the genuine optical and chemical characteristics of each format.
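What paired-comparison supervision can look like in code is sketched below, assuming an image-to-image formulation in PyTorch. The patent discloses neither an architecture nor a loss function; the tiny network and random tensors stand in for real control/test frame pairs purely to show the training shape.

```python
# Paired-comparison style training, sketched with PyTorch.
# control = digital baseline frame, target = the same frame on a given film stock / process.
# The model learns the *difference* the style treatment makes, because everything else is identical.
import torch
import torch.nn as nn

style_model = nn.Sequential(          # stand-in for a real image-to-image network
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)
optimiser = torch.optim.Adam(style_model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# Placeholder batch: in practice these would be aligned control/test frame pairs.
control = torch.rand(4, 3, 128, 128)   # standard digital capture
target = torch.rand(4, 3, 128, 128)    # same scene captured on (e.g.) 16 mm with bleach bypass

for step in range(3):                   # illustrative loop only
    predicted = style_model(control)
    loss = loss_fn(predicted, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    print(f"step {step}: L1 loss {loss.item():.4f}")
```

Because the control and target frames differ only in the style treatment, whatever the model learns to add is, by construction, the style itself rather than incidental content.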
- Colour grading consistency. The AI learns the precise colour palette of an established production and can apply it to newly generated or modified footage, frame by frame.
- Lighting signature reproduction. Cinematographers develop distinctive lighting approaches. Encoded as style parameters, these can be applied to AI-generated shots so they match traditionally photographed footage.
- Film stock simulation. The grain structure, colour response, dynamic range, and contrast characteristics of different film stocks are learned from real examples, not digital approximations.
- Era-specific aesthetics. A period film set in the 1970s can adopt the visual language of 1970s cinematography — the optical and chemical characteristics of that era's actual equipment.
- VFX integration. Generated visual effects elements can be conditioned on the established style so they match surrounding live-action footage without extensive post-production correction.
- Post-production alteration techniques. Push processing, bleach bypass, cross-processing — the specific photochemical looks associated with these darkroom techniques become learnable parameters.
Style as the output valve of the stack
The captioner patent (US 12,511,904) is this patent's complement. The captioner is a reading system: it watches footage and extracts a description of its visual identity — which lens, which lighting ratio, which colour palette, which motion characteristics. This patent is a writing system: it takes a style description and enforces it during video generation.
Together, they form a closed loop. The captioner analyses a film's established footage and generates a style specification. That specification is passed to the generation system described in this patent. New footage — for pickups, VFX inserts, or AI-assisted reconstruction — is then generated with the style specification as a constraint, ensuring it matches the original material.
This loop is what makes the system practically useful in a professional production context. A post-production supervisor doesn't just want an AI that generates plausible footage. They want an AI that generates footage that will cut seamlessly with the material already in the edit.
What distinguishes this from prior style-transfer systems
Style transfer is not a new idea in AI. Neural Style Transfer dates to 2015. AI-assisted colour grading tools have been available to the post-production industry for several years. What differentiates this patent is a combination of three things that prior approaches rarely have simultaneously.
First, the training data is purpose-built and photochemically grounded — InterPositive filmed the training material themselves, using real film stocks and real darkroom processes, giving the AI access to actual photochemical characteristics rather than digital simulations.
Second, the paired comparison methodology isolates style variables rather than learning from uncontrolled examples, producing more precise and reliable control than associative style learning.
Third, style is integrated into the generation process rather than applied as a post-processing filter. This is fundamentally more reliable than generating footage and then applying a style transformation afterward, which always risks inconsistencies at edge cases the post-processing wasn't trained to handle.
Patent scope and design-around risk
The independent claims are relatively specific to the training methodology described — capturing control and test images, applying physical post-production alterations, training on paired comparisons, iterating based on authenticity feedback. A competitor who trained a style model on large quantities of stylised footage, without the paired comparison approach, would be doing something meaningfully different from what the patent claims.
The core idea — that you can train an AI to understand and reproduce visual style — is not novel and could not be protected by any single patent. The patent protects a particular implementation, not the idea itself. The strategic value — as a business matter rather than a strict legal one — lies in the specificity of the approach: grounded in real photochemistry, filmed under controlled conditions, and trained on actual photochemical processes rather than digital approximations. Whether and how quickly a competitor could replicate those conditions is a practical question separate from what the patent itself covers.
Augmentation by design, not by accident
US 12,511,837 is not about generating novel visual styles. It is specifically about reproducing predetermined ones — styles established by human cinematographers, through deliberate creative choices, on physically photographed material. The AI's job is not to invent; it is to maintain.
That framing puts the AI in a supporting role rather than a generative one. It is closer to a very capable colourist who can match any grade perfectly every time, or a VFX team that can generate inserts indistinguishable from the surrounding photography, than to a system that replaces the cinematographer.
The patents InterPositive has filed — and that Netflix now owns — consistently describe a system built around understanding and preserving the visual decisions of human filmmakers, rather than replacing them. That's a considerably more defensible position, commercially and politically, than the alternative.
Before and after: an actor shot against blue screen (left) transformed by the InterPositive system into a sunset-lit outdoor scene (right). Style conditioning — colour temperature, directional light, atmospheric haze — is applied during generation, not as a post-process filter. Credit: Netflix / InterPositive
IV
The captioner
Teaching AI to read footage like a cinematographer
US 12,511,904 · Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements
Teaching an AI to read cinematography — what the InterPositive captioner patent actually does
Getting AI to generate footage that looks cinematic is one problem. Getting AI to understand why footage looks cinematic — which lens was used, how the camera moved, how the frame was composed, where the subject sat in space — is an entirely different one. That is the problem US 12,511,904 is trying to solve.
The earlier sections of this chapter already imply its importance. The summary grid in the language-layer section gives the captioner a very specific role in the stack: metadata reading — a model trained to watch footage and extract cinematographic descriptions, the reading half of InterPositive's closed loop.
That framing turns out to be exactly right.
Patent on Record
US 12,511,904 B1
| Full title | Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Filed | 12 March 2025 |
| Granted | 30 December 2025 |
| Claims | 30 (method, system, non-transitory computer-readable medium) |
| International counterpart | None on the public record so far; the portfolio overview describes the captioner patent as US-only at this stage |
Claim 1 — Method
US 12,511,904 B1 · 30 total claims · system & CRM claims cover the same steps
A computer-implemented method for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements, the method comprising:
- Organizing a dataset comprising raw video clips and corresponding metadata, wherein each video clip represents a specific shot varying one cinematic parameter at a time selected from the group consisting of focal length, camera movement, and framing
- Extracting frames from the raw video clips at a consistent frame rate and storing the frames in a structured format
- Associating each frame with corresponding metadata detailing the cinematic elements present in the frame, wherein the metadata includes information on focal length, camera movement, object distance, and framing style
- Segmenting the raw video clips into shots and frames, wherein a shot comprises a continuous sequence captured without cuts, and frames are extracted at regular intervals from each shot
- Applying frame-level labels to each frame based on the corresponding metadata, wherein the frame-level labels include focal length used during the shot, camera movement details, object distance from the camera, and framing style
- Aggregating the frame-level labels to generate shot-level labels, wherein the aggregating includes calculating average focal length, determining predominant framing style, and smoothing camera movement data across the shot
- Training the captioner model using the frame-level labels, the frames, and the aggregated shot-level labels to recognize and predict the cinematic elements in unseen video content
- Iteratively refining the captioner model based on feedback from validation datasets to improve accuracy of cinematic element prediction
- Deploying the trained captioner model to process and label a large video database, wherein the model generates metadata for new video content based on learned cinematic elements
- Post-processing the generated labels to ensure consistency and accuracy, including performing outlier detection, confidence scoring, and manual quality control
This is the most procedurally detailed method claim in the portfolio — 10 steps versus 4 in US 12,322,036 and 6 in US 12,438,995. Dependent claims extend to: high-resolution lossless storage; JSON/CSV metadata formats; shot detection algorithms; LiDAR and laser locator data for frame-accurate spatial positioning; multi-task and active learning; and structured export into downstream database and editorial workflows. Claim breadth note: Step 1 requires that each clip varies one cinematic parameter at a time from the specific group of focal length, camera movement, and framing — tying the claim to the controlled-capture methodology.
Why a filmmaking AI needs a dedicated reading system
Most video AI systems are trained primarily to understand what is in the frame: a person, a street, a conversation, a landscape. InterPositive's patents, by contrast, are built around the idea that professional filmmaking depends at least as much on how the frame was made as on what appears inside it. The captioner patent is where that distinction becomes operational. Its background section is explicit that current AI video systems struggle with focal length, dolly/pan/tilt movement, shot type, parallax, occlusion, depth of field, and temporal consistency — precisely the technical variables filmmakers actually use.
That matters because the rest of the stack cannot function intelligently unless something upstream can first look at footage and describe it in cinematographic terms.
The language patent needs structured metadata to translate filmmaker instructions into executable parameters. The style patent needs a description of a film's existing visual identity before it can enforce that identity on new material. And the broader synthesis in this chapter makes the architecture clear: spatial data informs the language model, the language model drives generation, and the captioner feeds the style system.
In other words: without a model that can read cinematography, the rest of the InterPositive system would be working blind.
What the patent actually claims
The core independent claim is a workflow claim. It does not claim "an AI that understands film" in the abstract. It claims a fairly specific pipeline for teaching a captioner model to infer cinematic elements from video. The method begins by organising a dataset of raw clips and corresponding metadata, where each clip varies one cinematic parameter at a time — specifically focal length, camera movement, or framing. It then extracts frames at a consistent frame rate, associates each frame with metadata, segments the clips into shots and frames, applies frame-level labels, aggregates those into shot-level labels, trains the captioner model, refines it against validation sets, deploys it over a larger video database, and post-processes the resulting labels through outlier detection, confidence scoring, and manual quality control.
That sequence is worth slowing down for, because it tells you what InterPositive thought the real bottleneck was.
"Not generation. Not prompting. Not style transfer. The bottleneck was building a system that can take real filmed material and convert it into machine-readable cinematographic supervision."
The dependent claims sharpen that further. They specify high-resolution lossless storage, metadata formats such as JSON and CSV, shot detection algorithms, LiDAR and laser locator data for frame-accurate positioning, multi-task learning, active learning, and export into structured formats suitable for downstream databases and video-editing workflows. That last point is particularly revealing: this is not written like a speculative research concept. It is written like a production tool intended to plug into a larger post pipeline.
In plain language: InterPositive is trying to teach a model to watch professionally shot footage and say, in structured technical terms, what happened cinematographically.
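Because the dependent claims mention JSON and CSV export, the natural output of such a system is structured data rather than prose. A hypothetical shot-level caption might look like the record below; none of the field names come from the patent.

```python
# Hypothetical example of a machine-readable shot-level caption (field names invented).
shot_caption = {
    "shot_id": "reel03_shot_142",
    "framing": "medium_close_up",
    "focal_length_mm": 75,
    "camera_movement": {"type": "static", "residual_drift": "minimal"},
    "subject_distance_m": 2.4,
    "depth_of_field": "shallow",
    "confidence": {"framing": 0.93, "focal_length_mm": 0.81, "camera_movement": 0.97},
}
```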
What is the model actually learning?
At the claims level, the answer is already specific. The metadata attached to each frame includes focal length, camera movement, object distance, and framing style. The frame-level labels likewise include focal length used during the shot, movement details, object distance from the camera, and framing style; the shot-level aggregation then calculates average focal length, determines the predominant framing style, and smooths movement data across the shot.
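The frame-to-shot aggregation step translates almost directly into code. The sketch below assumes frame-level labels already exist; the label names and smoothing window are illustrative, not taken from the patent.

```python
# Illustrative frame-level -> shot-level aggregation (label names and window size invented).
from collections import Counter
from statistics import mean

frame_labels = [
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.1},
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.4},
    {"focal_length_mm": 52, "framing": "medium_close_up", "pan_deg_per_s": 9.0},  # noisy frame
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.2},
]

def aggregate_shot(frames, smooth_window=3):
    avg_focal = mean(f["focal_length_mm"] for f in frames)
    predominant_framing = Counter(f["framing"] for f in frames).most_common(1)[0][0]
    pans = [f["pan_deg_per_s"] for f in frames]
    smoothed = [mean(pans[max(0, i - smooth_window + 1): i + 1]) for i in range(len(pans))]
    return {"avg_focal_length_mm": avg_focal,
            "predominant_framing": predominant_framing,
            "smoothed_pan_deg_per_s": smoothed}

print(aggregate_shot(frame_labels))
```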
But the specification makes clear that the ambition is broader than those few claim words might suggest. It describes training the captioner on temporal sequences of frames rather than isolated stills so that it can learn dynamics such as parallax and motion blur over time. It treats focal length and camera-movement vectors as primary inputs rather than afterthoughts. And it emphasises that conventional captioning tends to focus on object recognition while the present system is aimed at technical cinematic prediction.
That distinction is the heart of the patent.
- A conventional image captioner looks at a frame and says: a man sitting in a chair under studio lighting.
- This system is trying to say: medium close framing, longer focal length, limited depth of field, static or minimally drifting camera, subject isolated against a darker background plane.
That is a profoundly different task. It is also why the summary grid in this chapter is accurate when it reduces the patent to one line: the captioner "reads footage and extracts cinematographic descriptions."
Methodological discipline: one variable at a time
One of the quiet but important ideas here is methodological discipline.
The patent's dataset is not just "lots of film clips." It is a set of raw clips where one cinematic parameter varies at a time. That echoes the broader InterPositive approach described across this chapter: controlled soundstage captures, full metadata, instrumented shots, and systematic parameter isolation. The point is to stop the model from learning vague associative correlations and instead teach it causal visual relationships — what changing focal length actually does to the image, what a certain movement trajectory looks like, how framing categories differ when everything else is held steady.
That is the same larger philosophy that runs through the LiDAR section: rather than train on internet-scale data of unknown provenance, InterPositive filmed a smaller, purpose-built dataset with dense technical annotation. Applied to the captioner, the implication is clear: the model is not being asked to reverse-engineer cinematography from chaos. It is being taught in a controlled pedagogical environment.
Which is, in a sense, how a cinematographer learns too.
Implementation: a production pipeline, not a research prototype
The specification is unusually detailed about the machine around the model. It describes modules for frame extraction, data normalisation, tensor conversion, dataset partitioning, preprocessing, model training, auto-labelling, post-processing, metadata generation, labelling and annotation, retraining, and deployment. The captioner AI computing environment is connected to a broader Filmmaker computing environment, and the model's outputs are meant to be stored, queried, refined, and exported for downstream use.
That matters for two reasons. First, it reinforces that this patent is not just about an abstract neural architecture — it is about workflow. InterPositive is patenting a pipeline in which cinematic metadata is created, normalised, validated, and then re-used elsewhere. Second, it explains why the patent likely mattered so much to Netflix. A model that can reliably label an entire video database in cinematographic terms is not merely a research artefact. It is a piece of infrastructure. It turns footage into indexed technical knowledge.
That is potentially useful for model training, retrieval, continuity matching, style analysis, shot extension, editorial support, and post-production automation.
The middle patent that makes the loop work
At first glance, the captioner may sound less glamorous than the LiDAR foundation or the style system. It doesn't directly generate images. It doesn't provide the natural-language interface. It doesn't apply the look.
But architecturally, it may be the most important middle patent in the stack.
The portfolio overview places it in Layer 4: Core Models, under the name Captioner (SamildAnach), alongside the provisional Filmmaker model and the system's feedback loops. It is the point where raw, measured, purpose-shot footage becomes reusable intelligence. The language patent explicitly says its filmmaking metadata comes from the upstream captioner model. The style analysis says the captioner is the reading system, while the style patent is the writing system; together they form the closed loop that makes the system practically useful in production.
That means the captioner is doing something strategically crucial: it turns footage into a vocabulary the rest of the system can act on.
- Without it, the language layer has less to translate.
- Without it, the style layer has less to preserve.
- Without it, the system becomes much closer to a conventional video model with some specialised training data.
- With it, the system starts to look like a self-improving filmmaking loop.
Training regime: controlled data and production dailies
The earlier sections of this chapter give a reasonable picture. On the controlled soundstage, InterPositive filmed scenes with multiple cameras and angles, documenting lens types, focal lengths, stop values, camera positions, movement speeds, lighting setups, shot types, and framing categories. They also trained on dailies — real production footage from professional shoots — giving the system both precisely isolated examples and the messier complexity of actual film practice.
The captioner sits exactly where those two data sources become useful. The controlled captures teach it clean distinctions: what changes when only focal length changes, or only framing changes. The dailies teach it how those cinematic variables manifest in the wild, under the real conditions of production. That combination is precisely the sort of training regime you would choose if you wanted a model to become fluent in the actual grammar of cinematography rather than merely in statistical image patterns.
And once trained, the captioner can be deployed over "a large video database," generating new metadata that can feed retraining, indexing, and downstream generation tasks. That scaling step is in the independent claim itself. So the captioner is not just reading the initial training set. It is meant to become the machine that reads everything else.
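The post-processing the claim describes, confidence scoring followed by manual quality control, could be as simple in outline as the following sketch; the threshold and record shape are assumptions, not claim requirements.

```python
# Sketch of deployment over a large library: keep high-confidence labels,
# route the rest to manual review.
CONFIDENCE_FLOOR = 0.85

def route_labels(predictions):
    accepted, needs_review = [], []
    for p in predictions:                 # p: {"clip", "label", "confidence"}
        (accepted if p["confidence"] >= CONFIDENCE_FLOOR else needs_review).append(p)
    return accepted, needs_review

preds = [
    {"clip": "lib_000123", "label": "slow push-in, 50mm", "confidence": 0.93},
    {"clip": "lib_000124", "label": "whip pan, 24mm", "confidence": 0.61},
]
auto_accepted, review_queue = route_labels(preds)
# Reviewed corrections can then feed the retraining module the specification describes.
```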
Patent scope and design-around risk
The claims are meaningful, but they are not infinitely broad. They are tied to a specific kind of training pipeline: controlled raw clips, metadata association, frame- and shot-level labels, aggregation logic, validation refinement, deployment over a larger database, and post-processing with confidence scoring and manual quality control. A competitor could build alternative systems that rely more heavily on self-supervised learning, less on single-parameter controlled captures, or a different labelling structure entirely.
So this is probably not the patent that gives anyone ownership over "understanding cinematography with AI" in the abstract. What it does appear to protect is the most direct and production-practical version of that idea: a dedicated captioner trained on purpose-built filmmaking clips and metadata, with LiDAR-assisted spatial supervision, that labels footage in technical cinematic terms and exports those labels into the rest of the workflow.
That is narrower than a fantasy monopoly, but still strategically valuable — especially because, as the broader portfolio argument runs, the strength of this IP is less any single primitive than the integration of all of them into a coordinated filmmaking system.
What the captioner reveals about InterPositive’s wager
The answer is the one the synthesis at the end of this chapter reaches from another angle: InterPositive is not best understood as a prompt-to-video company. What it built is a production-specific system assembled around measured space, controlled data collection, technical metadata, language translation, and visual consistency. In that architecture, the captioner is the part that makes filmed material computationally legible. It teaches the machine to read the work in the same terms practitioners use to make it.
That is also why the captioner patent fits so neatly with the portfolio's overall philosophy. The synthesis says the patents are oriented around grounding AI in the actual practice of professional filmmaking rather than replacing it. The captioner is perhaps the clearest example of that. It is not a fantasy engine for inventing cinema from nothing. It is a system for watching real cinematography and learning to describe it faithfully enough that the rest of the pipeline can preserve, extend, or reproduce it.
In that sense, US 12,511,904 may be the least flashy patent in the stack and one of the most revealing. It tells you that InterPositive's wager was never just that AI should generate moving images. It was that AI should first learn to read film language — technically, precisely, and in the terms cinematographers actually mean.
And once you see that, the rest of the portfolio makes much more sense.
Synthesis
What the stack tells us — the bigger picture
Read together, the four granted US patents in the InterPositive portfolio yield a coherent picture that differs markedly from the one most coverage of the acquisition painted.
InterPositive is not best understood as a generic consumer prompt-to-video product. The granted patents — which explicitly describe adapting existing video language models to generate content, and training models configured to generate video with predetermined styles — do cover generative capability. But the system's generative functions are production-specific: grounded in measured physical space, conditioned on a project's own material, and oriented toward controlled post-production enhancement rather than open-ended synthesis from text prompts. That is a materially different design philosophy from Sora, Runway, or Kling, even if the underlying generation mechanism shares some architecture.
The distinction that matters is not generative versus non-generative. It is purpose-built versus general-purpose. InterPositive's system trains on purpose-built, instrumentally captured data and is designed to operate on footage from a specific production. The major consumer generators train on internet-scale video and produce statistically plausible footage against any prompt. Cinematographic precision and general plausibility are not the same thing, and for a professional production environment, that gap is significant.
US 12,322,036 · The spatial foundation
LiDAR measurement gives the AI ground-truth knowledge of where every surface is in three-dimensional space. All downstream operations — relighting, motion simulation, VFX integration — depend on this geometric certainty.
US 12,438,995 · The language interface
Transfer learning on cinematographic metadata and spatial data enables the AI to respond to professional filmmaking vocabulary. This is the hub through which all instructions and all outputs flow.
US 12,511,904 · The captioner
A model trained to watch footage and extract cinematographic descriptions — focal length, lighting ratio, camera movement, framing. The reading half of the system's closed feedback loop, and the point at which filmed material becomes computationally legible to the rest of the stack.
US 12,511,837 · The style layer
Paired comparisons between real film formats and digital baselines teach the AI the precise optical and chemical characteristics of different visual styles, enabling reliable reproduction during generation.
What makes this portfolio strategically significant is that it covers a complete and coherent pipeline rather than a collection of isolated capabilities. The layers are architecturally interdependent: spatial data informs the language model, the language model drives generation, the captioner feeds the style system, and the style system's outputs are evaluated against the captioner's descriptions. As a business and technical matter, that integration makes the system harder to replicate piecemeal — though it is worth noting that the individual patents protect specific workflows rather than the architecture as a whole, and a competitor building a different but functionally similar system would need to be assessed against each set of claims independently.
There is also a consistency of philosophy across the four patents that is worth noting explicitly. Every one of them is oriented around professional filmmaking as a craft. The LiDAR patent protects the physical measurement of real scenes — not synthetic depth estimation. The language patent protects integration with an existing professional vocabulary — not a new interface paradigm. The style patent protects learning from real photochemical processes — not digital approximations. The captioner patent protects a model trained to describe footage in the terms cinematographers actually use.
In each case, InterPositive's approach is to ground the AI in the actual practice of professional filmmaking rather than to build AI that approximates or replaces it. Whether that constraint reflects strategic positioning, Ben Affleck's personal convictions about the craft, or simply what was technically achievable with a 16-person team on a controlled soundstage — the result is a portfolio whose consistent orientation is toward augmenting professional practice rather than substituting for it.
One structural caveat is worth keeping in view. The analysis here focuses on the four granted US patents, which are enforceable now and where the claim language is fixed. The twelve WIPO/PCT international publications — which seek protection across Europe, Japan, and other major markets — are still moving through national examination and may be narrowed, amended, or rejected in whole or in part by individual patent offices. The international scope of the portfolio is therefore best understood as a strategic ambition rather than a settled legal position.
Whether Netflix maintains that orientation as the technology develops inside a company that produces hundreds of hours of original content annually, and that has every financial incentive to reduce the cost of that production, is a different question entirely. The patents describe what InterPositive built. What Netflix does with it is the next chapter.