Patent Portfolio · Complete Analysis
What Netflix actually bought:
the InterPositive patent portfolio
A complete analysis of the four granted US patents behind Ben Affleck's AI filmmaking company — how each one works, what it protects, and what together they tell us about the future of cinematic AI.
p3d65.xyz · March 2026
In this chapter
- I · The spatial foundation — teaching AI where things actually are (US 12,322,036 · LiDAR Data Utilization for AI Model Training in Filmmaking)
- II · The language layer — teaching AI to speak the language of film (US 12,438,995 · Integration of Video Language Models with AI for Filmmaking)
- III · The style layer — teaching AI to see like a cinematographer (US 12,511,837 · AI-Based Video Content Creation with Predetermined Styles)
- IV · The captioner — teaching AI to read footage like a cinematographer (US 12,511,904 · Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements)
- Synthesis & conclusions · What the stack tells us — the bigger picture
US Patents — all assigned to InterPositive, LLC · Inventor: Benjamin Geza Affleck-Boldt

| Patent number | Title | Date granted | Links |
|---|---|---|---|
| US 12,322,036 B1 | LiDAR Data Utilization for AI Model Training in Filmmaking | Jun 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,438,995 B1 | Integration of Video Language Models with AI for Filmmaking | 7 Oct 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,511,837 B1 | Artificial Intelligence-Based Video Content Creation with Predetermined Styles | 30 Dec 2025 | Google Patents ↗ · USPTO ↗ |
| US 12,511,904 B1 | Method, System, and Computer-Readable Medium for Training a Captioner Model to Generate Captions for Video Content by Analyzing and Predicting Cinematic Elements | 30 Dec 2025 | Google Patents ↗ · USPTO ↗ |
When Netflix announced on 5 March 2026 that it had acquired InterPositive — Ben Affleck's AI filmmaking company, which had been operating in complete stealth under the corporate entity Fin Bone LLC — the industry's immediate question was: what exactly did they buy?
The acquisition announcement was light on technical detail. Netflix confirmed that the entire 16-person team was joining, that Affleck would remain as Senior Adviser, and that the company's tools were designed to support professional filmmaking rather than replace it. Beyond that, what InterPositive had actually built remained largely opaque.
The patents tell a more complete story.
InterPositive holds four granted US patents, all enforceable immediately and valid through 2045, alongside twelve international WIPO publications seeking protection across Europe, Japan, and other major markets. Together they cover the complete architecture of a professional filmmaking AI — from the physical measurement of space through to the controlled reproduction of visual style.
I have spent the past week reading all of them. What follows is a complete analysis of the four US patents: the spatial foundation first, then the language interface, then the style layer, and finally the captioner that reads footage in cinematographic terms. A synthesis at the end draws together what the portfolio as a whole reveals about where AI in filmmaking is actually heading.
This is an independent analysis of InterPositive’s patent portfolio.
A note on claim structure: each of the four patents files the same substantive steps three times — once as a method claim, once as a system claim, and once as a non-transitory computer-readable medium claim. This is standard USPTO practice: it ensures the protection applies whether someone implements the invention as a process, as a hardware system, or as software. The three versions are substantively identical. In the claims reproduced below, only the method claim (Claim 1) is shown for each patent. The system and computer-readable medium claims cover the same steps in their respective statutory categories.
I
The spatial foundation
Teaching AI where things actually are
US 12,322,036 · LiDAR Data Utilization for AI Model Training in Filmmaking
The spatial intelligence behind Netflix's AI filmmaking acquisition — what the LiDAR patent actually does
I read the foundational patent that sits beneath all of InterPositive's technology to understand why teaching an AI to see in three dimensions is the first problem any serious filmmaking AI has to solve.
Ben Affleck, founder of InterPositive. Credit: Netflix / InterPositive
Patent on Record
US 12,322,036 B1
| Full title | LiDAR Data Utilization for AI Model Training in Filmmaking |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | June 2025 |
| Claims | 20 (method, system, non-transitory computer-readable medium) |
| WIPO counterpart | WO2025255446A1 / WO2025255425A1 |
| Google Patents | US12322036B1 ↗ |
Claim 1 — Method
US 12,322,036 B1 · 20 total claims · system & CRM claims cover the same steps
A computer-implemented method for enhancing artificial intelligence (AI) model training in filmmaking through a use of Lidar data, the method comprising:
- Correlating two-dimensional video data with three-dimensional spatial data obtained from Lidar to simulate professional camera techniques
- Receiving detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups
- Processing the received metadata alongside the Lidar data to provide one or more AI models with a granular understanding of spatial relationships and the physics of camera movement
- Training the AI models using the processed metadata and Lidar data to accurately simulate professional filmmaking techniques, thereby enhancing realism and quality of generated video content
Source: US12322036B1 on Google Patents. The independent claims are notably compact — four steps — with the conceptual core being the correlation of 2D video with 3D Lidar spatial data as a training input. The international counterpart WO2025255446A1 seeks equivalent protection outside the US.
LiDAR in a filmmaking context — why depth data is the first problem
LiDAR stands for Light Detection and Ranging. In practice, the technology fires pulses of laser light at a scene and measures how long each pulse takes to return. Because the speed of light is known, the system can calculate the precise distance from the sensor to whatever the laser hit. Do this rapidly across a wide area and you build up a dense cloud of distance measurements — a point cloud — that maps the three-dimensional geometry of the scene with exceptional accuracy.
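To make the arithmetic concrete: the conversion from a single return pulse to a distance is just the speed of light times half the round-trip time. A minimal sketch, with illustrative numbers (nothing here comes from the patent):

```python
# Minimal illustration of LiDAR time-of-flight ranging.
# The pulse travels to the surface and back, so distance is half the round trip.
SPEED_OF_LIGHT = 299_792_458  # metres per second

def distance_from_return(round_trip_seconds: float) -> float:
    """Convert a measured round-trip time into a distance in metres."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2

# A return after ~13.3 nanoseconds corresponds to a surface roughly 2 m away.
print(f"{distance_from_return(13.3e-9):.2f} m")  # ≈ 1.99 m
```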
You're probably most familiar with LiDAR from autonomous vehicles. Waymo, Cruise, and other self-driving car companies mount LiDAR sensors on their vehicles precisely because cameras alone cannot reliably tell you how far away something is. A camera captures a flat, two-dimensional representation of the world. LiDAR gives you actual depth data — which object is 1.3 metres away and which is 8.7 metres away, rather than both simply appearing somewhere in the frame.
So why does LiDAR appear in a filmmaking AI patent?
For almost exactly the same reason. A camera — even a cinema camera — produces a flat image. An AI system trained only on flat images has to infer depth from visual cues: perspective lines, relative object sizes, parallax between frames, focus fall-off. These cues are real, but they are imprecise and ambiguous. You can often work out roughly where things are in a scene from a 2D image. You cannot work out exactly.
For a filmmaking AI trying to simulate camera movement, relight a scene, or insert a visual effects element convincingly, "roughly" is not good enough. The AI needs to know where every significant surface actually is in three-dimensional space — how far from the camera, at what angle, in what spatial relationship to the other objects around it.
That is what this patent addresses.
What US 12,322,036 actually claims
The core independent claim of US 12,322,036 describes a method in which a system captures LiDAR data from a sensor during filming. The captured data includes spatial coordinates, distance measurements, and the relative positional information of objects within the scene.
The system then integrates that LiDAR data with filmmaking metadata — the information about how the shot was captured, including camera settings, lens characteristics, and composition details. The integration combines the geometric information from LiDAR with the cinematographic information from the metadata, producing a combined dataset that gives the AI a three-dimensional spatial understanding of the scene: how deep it is, where objects sit relative to each other, what the actual physical relationships are between the foreground, the subject, and the background.
That combined, spatially-grounded dataset is then used to train downstream AI models. The patent covers the whole chain: capture, integration, and use in training.
In plain language: InterPositive's team mounted LiDAR sensors on their controlled soundstage alongside their cameras. When they filmed their proprietary training dataset, they weren't just recording images — they were simultaneously recording the precise three-dimensional geometry of every scene. That geometric data was then fused with the cinematographic metadata and fed together into the AI.
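The patent does not publish a data schema, but it helps to picture what one fused training record might contain. The sketch below is purely illustrative; every field name is invented for this example.

```python
# Hypothetical (illustrative only) structure for one spatially grounded training sample.
# Field names are invented for this sketch; the patent does not specify a schema.
training_sample = {
    "clip_id": "stage_A_take_017",
    "video_frames": "frames/stage_A_take_017/",         # 2D image sequence
    "lidar_point_cloud": "lidar/stage_A_take_017.ply",   # 3D ground-truth geometry
    "camera_metadata": {
        "focal_length_mm": 50,
        "aperture_t_stop": 2.8,
        "sensor": "full_frame",
        "movement": {"type": "dolly_in", "speed_m_per_s": 0.3},
    },
    "lighting_metadata": {
        "key_ratio": "3:1",
        "key_direction": "camera_left",
        "colour_temperature_k": 3200,
    },
    "framing": "medium_close_up",
}
```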
"Most AI vision systems are trained to recognise what is in the frame. This patent is specifically about understanding where things are — the actual physical geometry of the scene."
How 3D geometry changes what the AI can do
Consider what an AI model needs to understand in order to convincingly simulate a dolly shot — where the camera moves physically forward through a scene.
As the camera moves, every object in the scene shifts in the frame. But critically, they don't shift equally. Objects close to the camera appear to move much faster across the frame than objects far in the background. A chair that's two metres away sweeps dramatically across the image as the camera passes. A painting on the far wall barely appears to move at all. This differential — called parallax — is what gives moving shots their sense of physical depth and immersion.
To simulate parallax correctly, an AI needs to know the actual depth of every element in the scene. Not an approximation. Not a guess from visual cues. The real number. Otherwise the simulated shot will look wrong — elements will shift at incorrect rates, the sense of physical depth will collapse, and the result will betray itself as artificial.
LiDAR gives the system exactly those real numbers.
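A short worked example makes the point. Assuming a simple pinhole-camera model and invented scene measurements, the on-sensor shift of each object during a forward dolly move falls directly out of its real depth:

```python
# Pinhole-camera parallax during a forward dolly move (illustrative numbers).
# Image position of a point offset X metres from the lens axis at depth Z metres,
# for a lens of focal length f (in mm): x = f * X / Z.

def image_shift_mm(focal_mm: float, offset_m: float, depth_m: float, dolly_m: float) -> float:
    """How far (in mm, on the sensor) a point appears to move when the camera dollies forward."""
    before = focal_mm * offset_m / depth_m
    after = focal_mm * offset_m / (depth_m - dolly_m)
    return after - before

# Camera dollies forward 0.5 m on a 35 mm lens; both objects sit 0.5 m off-axis.
chair = image_shift_mm(35, 0.5, depth_m=2.0, dolly_m=0.5)     # near foreground
painting = image_shift_mm(35, 0.5, depth_m=8.0, dolly_m=0.5)  # far background
print(f"chair: {chair:.2f} mm, painting: {painting:.2f} mm")   # ≈ 2.92 mm vs ≈ 0.15 mm
```

The chair sweeps roughly twenty times further across the sensor than the painting for the same camera move; get either depth wrong and the simulated shot immediately reads as false.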
The same principle applies to relighting. Light falls differently on surfaces at different depths and angles. To convincingly change the lighting on a scene — to shift the key light from camera left to camera right, or to add a motivated practical light source — the AI needs to understand the physical geometry of the space. Which surfaces face which direction? What angle do they present to the light source? How would shadows fall across the three-dimensional arrangement of objects? A flat image cannot answer these questions reliably. A point cloud can.
- Camera movement simulation. Parallax effects depend on knowing real depths. LiDAR provides them, so dolly moves, crane shots, and handheld tracking feel physically plausible.
- Relighting. Surface normals and object geometry determine how light falls. The AI can't relight convincingly without knowing what shape the scene actually is in 3D space.
- Wire removal and object replacement. Knowing the precise location of a stunt rig in three-dimensional space makes clean removal far more tractable than guessing from 2D.
- VFX integration. Inserting a generated element into real footage requires matching its position, size, and shading to the geometry of the scene — all of which LiDAR defines precisely.
- Focus pulling. The system can calculate depth of field accurately because it knows the real distances of objects from the lens, not just their apparent size in the frame (see the worked example after this list).
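The focus-pulling point is ordinary optics rather than anything specific to the patent: the standard thin-lens depth-of-field formulas take the real focus distance as a direct input, which is exactly what a LiDAR scan supplies. A sketch with illustrative values:

```python
# Standard depth-of-field calculation (thin-lens approximation, illustrative values).
# With real object distances from LiDAR, the near and far limits of acceptable focus
# can be computed exactly instead of being estimated by eye.

def depth_of_field(focal_mm: float, f_number: float, focus_m: float,
                   circle_of_confusion_mm: float = 0.03) -> tuple[float, float]:
    """Return (near_limit_m, far_limit_m) of acceptable focus."""
    f = focal_mm / 1000.0                    # metres
    c = circle_of_confusion_mm / 1000.0      # metres
    hyperfocal = f * f / (f_number * c) + f
    near = hyperfocal * focus_m / (hyperfocal + (focus_m - f))
    far = (hyperfocal * focus_m / (hyperfocal - (focus_m - f))
           if focus_m < hyperfocal else float("inf"))
    return near, far

# 75 mm lens at T2.8, subject measured at 3.0 m by the LiDAR scan.
near, far = depth_of_field(75, 2.8, 3.0)
print(f"in focus from {near:.2f} m to {far:.2f} m")  # ≈ 2.87 m to 3.14 m
```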
The InterPositive pipeline processing a stunt shot for wire removal. The system’s 3D scene understanding — grounded in LiDAR data — is what makes clean removal tractable at a single-frame level. Credit: Netflix / InterPositive
Why spatial capture is the architectural foundation
Among the four US patents in InterPositive's portfolio, this one was granted first, in June 2025, several months before the others. That chronology is revealing.
The LiDAR patent is foundational in an architectural sense. Every downstream model in the InterPositive system depends on the spatially-grounded training data this patent produces. The captioner model (US 12,511,904) is trained on footage with known depth. The language-model integration layer (US 12,438,995) operates with the understanding that the AI has been grounded in three-dimensional space. The style control system (US 12,511,837) applies aesthetic treatments to scenes whose geometry is understood.
Without this layer, the whole system would be trying to understand cinematography from flat images alone — which is how most AI video models work, and which is precisely what InterPositive was founded to improve upon. Affleck has said he found existing AI tools came up short because they lacked genuine understanding of the physical mechanics of cinematography. LiDAR is a significant part of what provides that understanding.
Ground-truth measurement versus inferred depth
LiDAR has been used in film production before — mostly in pre-production and virtual production contexts. Surveyors use it to create accurate digital models of shooting locations. Some virtual production pipelines use LiDAR scans of sets to calibrate camera tracking. The technology also appears in consumer devices: iPhones and iPads since 2020 have included LiDAR sensors, which filmmakers have experimented with for location scouting and basic augmented reality applications.
What is genuinely unusual here is using LiDAR data as training input for an AI model specifically designed for post-production work.
Most video AI training pipelines rely entirely on camera footage — flat images from which depth must be inferred. Some research efforts have used depth estimation algorithms to generate approximate depth maps from 2D footage. But these are estimates, not measurements. They are inherently less accurate than real sensor data, and the errors compound when you try to use them for precise spatial operations.
InterPositive's approach of capturing ground-truth LiDAR data during principal photography on their training soundstage — and fusing it directly with the cinematographic metadata — represents a notably different philosophy. Rather than asking AI to guess at depth, they gave it depth measurements accurate to centimetres. The patent language also explicitly extends the definition to include related sensing technologies: radar, ladar, and photogrammetry — which suggests the drafters were alive to obvious substitutions, though whether those alternatives would in practice fall within or outside the claims would depend on how a court construes the specific language.
Patent scope and design-around risk
The patent's claims are reasonably focused. They describe a specific pipeline: capture LiDAR data during filming, integrate it with filmmaking metadata, and use the combined dataset to train AI models. Competing systems could potentially use different approaches — monocular depth estimation rather than active LiDAR sensing, or photogrammetry from multiple camera angles — that fall outside the specific claims here.
On the current claim language, the most direct implementation path — deploying LiDAR sensors on a controlled filming set, capturing point cloud data alongside video, and using that combined dataset to train AI models — is what the granted claims appear to cover. The precise scope, and how far it extends to variants using related sensing technologies, would ultimately depend on claim construction and any applicable prosecution history. Competing approaches using different sensor modalities or depth-estimation methods would need to be assessed against the specific claim language independently.
A fundamentally different training philosophy
Most AI video companies train on internet-scale video — enormous quantities of footage scraped from YouTube, film archives, and other public sources. The advantage is volume. The disadvantage is that this footage comes with no ground-truth metadata about how it was actually shot. The AI has to infer lens characteristics, camera movement, depth relationships, and lighting conditions from the images themselves.
InterPositive took the opposite approach. Rather than training on billions of frames of unknown provenance, they filmed a smaller, purpose-built dataset in a controlled environment, instrumenting every shot with precise sensor data — LiDAR depth measurements, full camera metadata, lighting documentation. The resulting training data is dramatically smaller in volume but far richer in information per frame.
This is the difference between training a model on millions of unlabelled photographs and training it on thousands of photographs, each with precise technical specifications attached. For a system designed to assist professional cinematographers who work with precise technical specifications every day, the second approach makes considerably more sense.
That is what this patent protects: not just a technical method, but an entire philosophy of how to build a filmmaking AI with genuine craft knowledge rather than statistical pattern recognition alone.
II
The language layer
Teaching AI to speak the language of film
US 12,438,995 · Integration of Video Language Models with AI for Filmmaking
Teaching an AI to speak the language of film — what the InterPositive language integration patent actually does
Getting AI to generate video that technically works is one problem. Getting it to understand what a filmmaker actually means when they say "give me a 50mm with a slow push and shallow focus on her eyes" is an entirely different one. I read the patent that tries to solve that translation problem.
Ben Affleck. Credit: Netflix / InterPositive
Patent on Record
US 12,438,995 B1
| Full title | Integration of Video Language Models with AI for Filmmaking |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | 7 October 2025 |
| Filed | 25 November 2024 |
| WIPO counterpart | WO2025255437A1 |
| Claims | 19 (method, system, computer-readable medium) |
| Patent citations | 40 |
Claim 1 — Method
US 12,438,995 B1 · 19 total claims · system & CRM claims cover the same steps
A computer-implemented method for integrating one or more existing video large language models (LLMs) with custom AI algorithms for filmmaking, the method comprising:
- Interfacing with an existing video LLM
- Receiving detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups
- Processing the received metadata to adapt the existing video LLM to generate video content that simulates professional filmmaking techniques
- Receiving Lidar data captured from a lidar sensor, the Lidar data including spatial coordinates, distance measurements and relative positional information of objects within a scene
- Integrating the Lidar data with the processed metadata by combining spatial coordinates and distance measurements with filmmaking metadata to enhance the generated video content by providing a three-dimensional spatial understanding of the scene indicating depths and positional relationships among objects
- Applying transfer learning techniques to the existing video LLM based on the processed metadata and Lidar data to refine its video content generation capabilities
Claim breadth note: The independent claims require receipt of Lidar data including spatial coordinates, distance measurements, and positional relationships. A system performing spatially-aware video generation without a Lidar capture step — using monocular depth estimation or other non-Lidar spatial methods — could argue the Lidar receipt steps are not met, which narrows the claim's reach against such alternatives.
The semantic gap between filmmaker and machine
Before I explain what this patent does, it's worth dwelling on the problem it's trying to solve — because it's easy to underestimate how fundamental it is.
Filmmaking is a craft with its own technical vocabulary. When a director of photography says "give me a 35mm on a 2:1, motivated from the window, with a falloff across the mid-ground," every experienced person on set knows exactly what that means. It describes a specific lens, a specific lighting ratio, a specific source direction, and a specific way the light should fade across the scene. The instruction is precise, efficient, and completely intelligible to anyone trained in the craft.
Now ask a state-of-the-art AI video model the same thing.
The patent is unusually direct about this problem. It describes a sequence of figures: a graphical user interface in which cinematographic text instructions have been entered into a prior-art AI system, the output that system produced, and a third figure showing what the output should have looked like. The gap between the second and third figures is the problem this patent exists to close.
Existing AI video models are trained primarily on semantic metadata — what appears in a frame, not how it was captured. They can understand "a woman walking through a park" in a way that produces a recognisable image. They cannot understand "a slow dolly in on a 75mm at T2.8 with the background falling slightly out of focus" in a way that makes any practical difference to the footage they generate. The cinematographic instruction simply doesn't connect to anything in their training.
This is the semantic gap that US 12,438,995 is designed to bridge.
What the patent actually claims
The independent claim describes a method for integrating one or more existing video large language models with custom AI algorithms for filmmaking. Let me walk through the steps.
First, the system interfaces with an existing video LLM. This is significant — the patent is not about building a video generation model from scratch. It is about taking an existing model and extending it. Rather than competing with OpenAI or Google on raw video generation capability, InterPositive focused on what none of those models had: cinematographic understanding.
Second, the system receives detailed metadata related to professional filmmaking techniques, including camera settings, shot composition, and lighting setups. This metadata comes from the upstream captioner model (US 12,511,904) — which watches professionally photographed footage and generates structured descriptions of the cinematic choices it observes.
Third, the system processes that metadata to adapt the existing video LLM — specifically to enable it to generate video content that simulates professional filmmaking techniques. This adaptation is the core innovation.
Fourth, the system receives LiDAR data captured from a sensor, including spatial coordinates, distance measurements, and positional information of objects within the scene. The LiDAR data grounds the whole system in physical reality, giving the AI a three-dimensional understanding that flat camera footage alone cannot provide.
Fifth, the LiDAR data is integrated with the processed metadata, combining the geometric and cinematographic information. Sixth, transfer learning techniques are applied to the existing video LLM based on this combined data, refining its capabilities to reflect professional filmmaking understanding.
In plain language: the system takes a capable AI video model and teaches it to understand filmmaking. It does this by feeding the model a combination of cinematographic metadata and spatial data, then fine-tuning it so that filmmaking instructions — given in the technical language of the craft — produce corresponding outputs.
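To make the translation idea concrete, here is a deliberately toy sketch of the input and output shape involved. The real system adapts a video LLM rather than using keyword rules, and every field name below is invented for illustration:

```python
# Toy illustration of translating a filmmaker's instruction into structured parameters.
# The patented system adapts a video LLM for this; the keyword rules here are only a sketch.
import re
from dataclasses import dataclass, field

@dataclass
class ShotSpec:
    focal_length_mm: int | None = None
    camera_move: str | None = None
    depth_of_field: str | None = None
    notes: list[str] = field(default_factory=list)

def parse_instruction(text: str) -> ShotSpec:
    spec = ShotSpec()
    if m := re.search(r"(\d{2,3})\s*mm", text):
        spec.focal_length_mm = int(m.group(1))
    if "push" in text or "dolly in" in text:
        spec.camera_move = "slow_push_in" if "slow" in text else "push_in"
    if "shallow focus" in text:
        spec.depth_of_field = "shallow"
    spec.notes.append(text)
    return spec

print(parse_instruction("give me a 50mm with a slow push and shallow focus on her eyes"))
# ShotSpec(focal_length_mm=50, camera_move='slow_push_in', depth_of_field='shallow', notes=[...])
```

The point of the sketch is only the shape of the problem: free-form craft vocabulary in, machine-actionable parameters out.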
What prior AI video models understand
"A woman sitting at a desk in an office."
"A rainy street at night."
"Two people having a conversation."
What InterPositive's system understands
"Medium close-up, 75mm, directional key at 3:1, shallow focus, slow push-in."
"Wide establishing, locked off, 24mm, flat lighting, heavy rain practical in foreground."
"Over-the-shoulder coverage, 50mm, matching eyeline, motivated daylight from window left."
"The innovation is not building a video generator. It is building the layer that makes a video generator usable by an actual filmmaker — without that person having to learn to think like a machine."
The hub patent — why the language layer is architecturally central
Of the four granted US patents in the InterPositive portfolio, I'd argue this one — the language integration layer — is the one that matters most from Netflix's perspective.
Consider the system as a whole. The LiDAR patent provides spatial understanding of the scene. The captioner patent enables the AI to read and describe existing footage in cinematographic terms. The style patent allows the AI to reproduce a predetermined visual identity. All of these are technically impressive and, in their own right, valuable.
But none of them are usable without the language layer.
A director or cinematographer working with InterPositive's tools doesn't think in point clouds, loss functions, or pixel tensors. They think in the vocabulary of their craft — the same vocabulary they use when talking to a camera department on set. The language integration patent is what allows them to use that vocabulary to drive the AI system, rather than having to develop an entirely new technical fluency.
| Patent | Layer | Role in the system |
|---|---|---|
| US 12,322,036 | Spatial foundation | LiDAR gives the AI real 3D geometry of each scene |
| US 12,511,904 | Metadata reading | Captioner reads footage and extracts cinematographic descriptions |
| US 12,438,995 | Language interface | Translates filmmaker instructions into parameters the AI executes |
| US 12,511,837 | Style output | Applies a predetermined aesthetic during video generation |
The arrow of dependency in this table runs in both directions through the language layer. Information flowing upward — spatial data from LiDAR, style metadata from the captioner — must pass through the language interface to become actionable creative parameters. Instructions flowing downward — from a filmmaker expressing creative intent — must pass through the same layer to become something the generation model can execute.
The language layer is the hub. Everything else is either upstream input or downstream output.
Training data: controlled captures and licensed dailies
On their controlled soundstage, the team filmed scenes with multiple cameras and angles simultaneously, with every variable documented. Lens types. Focal lengths. F-stops and T-stops. Camera positions and movement speeds. Lighting setups — source positions, ratios, colour temperatures. Shot types and framing across the full range: tight face shots, medium shots, wide shots, over-the-shoulders. All of this was captured as metadata alongside the footage, manually tagged by people who understood what it meant cinematographically.
The patent also describes training on dailies — the raw, unedited footage from professional film shoots. Dailies represent the richest possible source of real-world filmmaking data: actual cinematographers making actual decisions in actual production conditions. Licensing dailies from film studios also has a copyright advantage the patent explicitly notes — the footage comes from a controlled chain of rights, avoiding the provenance questions that have dogged other AI training datasets.
The combination of controlled soundstage data and licensed dailies gives the language model a vocabulary built from both precise, isolated technical examples and the full complexity of real professional production.
Patent scope and design-around risk
The core independent claim is fairly specific: it describes integrating an existing video LLM with filmmaking metadata and LiDAR data via a particular process — receiving, processing, integrating, and applying transfer learning. Competitors could build systems that achieve similar outcomes through different architectures — training a video generation model from scratch rather than fine-tuning an existing one, or using different sensor modalities instead of LiDAR. None of these would necessarily infringe the specific claims here.
On the current claim language, the most straightforward implementation — taking an existing video LLM, piping filmmaking metadata and spatial sensor data into it, and fine-tuning it on that combined dataset using the process the claims describe — is what the granted claims appear to cover most directly. Whether a materially different architecture achieving similar results would fall within or outside those claims would require independent assessment against the specific claim language.
A tool that speaks film, not machine code
US 12,438,995 is built around a clear premise: the filmmaker is the author of the creative decisions, and the AI's job is to execute those decisions faithfully. The language integration layer exists specifically so that the filmmaker's vocabulary — not the AI's internal parameter space — is the primary interface.
That framing is commercially and politically astute in an industry navigating significant anxiety about AI's role. But it also reflects something genuine about what the technology actually does. This isn't a system designed to replace a camera department. It's a system designed to give a camera department a new kind of tool — one that understands how they think.
III
The style layer
Teaching AI to see like a cinematographer
US 12,511,837 · AI-Based Video Content Creation with Predetermined Styles
Teaching an AI to see like a cinematographer — what the InterPositive style patent actually does
Getting AI to generate video that technically works is one problem. Getting it to generate video that looks like it was shot by a specific person, with a specific philosophy, on a specific film stock — that is an entirely different one. I read the patent that tries to solve it.
Patent on Record
US 12,511,837 B1
| Full title | Artificial Intelligence-Based Video Content Creation with Predetermined Styles |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Granted | 30 December 2025 |
| Filed | 25 November 2024 |
| WIPO counterpart | WO2025255426A1 |
| Claims | 20 (method, system, computer-readable medium) |
Claim 1 — Method
US 12,511,837 B1 · 20 total claims · system & CRM claims cover the same steps
A computer-implemented method for constructing and training an artificial intelligence model configured to generate video content with a predetermined style, the method comprising:
- Capturing one or more control images corresponding to a scene using standard digital video as a baseline
- Capturing one or more test images of the scene using different film formats to document visual effects
- Applying post-production alterations to the captured footage
- Constructing a training dataset that includes a variety of shots captured under varied lighting conditions
- Training an AI model with paired comparisons to enable it to learn specific visual signatures
- Reviewing footage generated by the AI model to assess its authenticity and using feedback to refine the model
- Optimizing learning cycles to enhance an efficiency of the training
Dependent claims specify: film formats including 35mm, 16mm, 8mm, and Super 8mm; post-production alterations including push processing and bleach bypass; dataset construction including tight face shots, medium shots, and wide shots; control-versus-modified paired comparisons; and training optimisation by scaling data acquisition as the model demonstrates proficiency. Claim breadth note: Step 2 requires physical capture of test images using different film formats. A purely digital training approach — without physically shooting on multiple film stocks — could argue this step is not met.
The visual consistency problem
A feature film is not shot in sequence. Scenes from the end of the story are often filmed weeks before scenes from the beginning. A single sequence might be photographed across three or four different shooting days, with changing natural light, different crew call times, and varying conditions on set. By the time an editor cuts the film together, the footage has been shot across months, sometimes in multiple countries, under wildly different conditions.
Yet when you watch a well-made film, it looks unified. Every scene has the same visual character. The colour palette is coherent, the lighting ratios are consistent, the lenses have a recognisable quality, and the way the camera moves feels like it reflects a single sensibility. This is not an accident. It is the result of enormous effort from the cinematographer, the colourist, and the post-production team to establish a visual language at the beginning of a production and then maintain it, frame by frame, across everything that follows.
Now introduce an AI system into that workflow. The AI is used to relight a shot, extend a background, remove a stunt wire, or generate a pickup scene that couldn't be captured during principal photography. The generated content needs to match the established look of the film exactly — not approximately. Even a subtle discrepancy in colour temperature, contrast ratio, or the quality of specular highlights will be visible to a trained eye, and visible discrepancies undermine the credibility of the entire sequence.
This is the problem US 12,511,837 addresses. How do you encode a film's visual identity — its "predetermined style" — in a form that an AI can reliably reproduce?
What the patent actually claims
The core independent claim describes a method for training an AI model to generate video content with a predetermined style. The steps are worth going through in detail, because they reveal something important about how InterPositive approached the problem.
The first step is capturing control images using standard digital video as a baseline — the plain, unmodified image of a scene as captured by a digital sensor.
The second step is capturing test images of the same scene using different film formats. The patent specifically names 35mm, 16mm, 8mm, and Super 8mm. These aren't just examples — they represent the range of film stocks that carry distinct visual signatures: different grain structures, different colour responses, different dynamic range and contrast characteristics.
The third step involves applying post-production alterations to the captured footage. The patent specifically mentions push processing and bleach bypass. These are darkroom and photochemical processes that alter the tonal and colour characteristics of film in specific, recognisable ways. Push processing underexposes and overdevelops film to increase contrast and grain. Bleach bypass retains the silver layer in the final print, desaturating the image and increasing contrast simultaneously. Both produce distinctive looks associated with certain cinematographers and certain eras.
The fourth step constructs a training dataset from this material, across different shot types and varied lighting conditions. The fifth step trains an AI model using paired comparisons — control footage versus modified footage. The same scene, with and without the style treatment. And then the system iterates, reviewing outputs against authenticity criteria and refining accordingly.
"The system doesn't just learn what a 1970s film aesthetic looks like. It learns the precise difference between a digital image and one that has been run through that aesthetic — at every level from grain structure to colour response."
Why paired comparisons outperform imitation-based training
It would be tempting to assume InterPositive is simply training the AI on lots of stylised footage and asking it to imitate what it sees. That approach is how most style-transfer systems work, and it produces results that are often visually plausible but cinematically shallow.
The paired comparison method is different and more precise.
By filming the same scene in multiple formats and with multiple post-production treatments, InterPositive creates a dataset where the only variable between pairs of images is the specific style element being studied. The scene, the lighting, the subject, the camera position — all identical. The film stock, the processing, or the grading treatment — different.
This allows the AI to isolate exactly what a given style element contributes to the image. Not "this is what 16mm footage looks like" but "this is what 16mm footage looks like compared to the same shot in digital, and here are the specific differences in grain frequency, colour response, highlight rolloff, and shadow lift that account for that difference."
The choice to include physical film formats — 35mm, 16mm, 8mm, Super 8 — rather than only digital simulation is also telling. InterPositive isn't approximating film aesthetics from digital representations. They are training from the actual photochemical artefacts of real film, giving the AI access to the genuine optical and chemical characteristics of each format.
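What paired-comparison supervision can look like in code is sketched below, assuming an image-to-image formulation in PyTorch. The patent discloses neither an architecture nor a loss function; the tiny network and random tensors stand in for real control/test frame pairs purely to show the training shape.

```python
# Paired-comparison style training, sketched with PyTorch.
# control = digital baseline frame, target = the same frame on a given film stock / process.
# The model learns the *difference* the style treatment makes, because everything else is identical.
import torch
import torch.nn as nn

style_model = nn.Sequential(          # stand-in for a real image-to-image network
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)
optimiser = torch.optim.Adam(style_model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# Placeholder batch: in practice these would be aligned control/test frame pairs.
control = torch.rand(4, 3, 128, 128)   # standard digital capture
target = torch.rand(4, 3, 128, 128)    # same scene captured on (e.g.) 16 mm with bleach bypass

for step in range(3):                   # illustrative loop only
    predicted = style_model(control)
    loss = loss_fn(predicted, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    print(f"step {step}: L1 loss {loss.item():.4f}")
```

Because the control and target frames differ only in the style treatment, whatever the model learns to add is, by construction, the style itself rather than incidental content.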
- Colour grading consistency. The AI learns the precise colour palette of an established production and can apply it to newly generated or modified footage, frame by frame.
- Lighting signature reproduction. Cinematographers develop distinctive lighting approaches. Encoded as style parameters, these can be applied to AI-generated shots so they match traditionally photographed footage.
- Film stock simulation. The grain structure, colour response, dynamic range, and contrast characteristics of different film stocks are learned from real examples, not digital approximations.
- Era-specific aesthetics. A period film set in the 1970s can adopt the visual language of 1970s cinematography — the optical and chemical characteristics of that era's actual equipment.
- VFX integration. Generated visual effects elements can be conditioned on the established style so they match surrounding live-action footage without extensive post-production correction.
- Post-production alteration techniques. Push processing, bleach bypass, cross-processing — the specific photochemical looks associated with these darkroom techniques become learnable parameters.
Style as the output valve of the stack
The captioner patent (US 12,511,904) is this patent's complement. The captioner is a reading system: it watches footage and extracts a description of its visual identity — which lens, which lighting ratio, which colour palette, which motion characteristics. This patent is a writing system: it takes a style description and enforces it during video generation.
Together, they form a closed loop. The captioner analyses a film's established footage and generates a style specification. That specification is passed to the generation system described in this patent. New footage — for pickups, VFX inserts, or AI-assisted reconstruction — is then generated with the style specification as a constraint, ensuring it matches the original material.
This loop is what makes the system practically useful in a professional production context. A post-production supervisor doesn't just want an AI that generates plausible footage. They want an AI that generates footage that will cut seamlessly with the material already in the edit.
What distinguishes this from prior style-transfer systems
Style transfer is not a new idea in AI. Neural Style Transfer dates to 2015. AI-assisted colour grading tools have been available to the post-production industry for several years. What differentiates this patent is a combination of three things that prior approaches rarely have simultaneously.
First, the training data is purpose-built and photochemically grounded — InterPositive filmed the training material themselves, using real film stocks and real darkroom processes, giving the AI access to actual photochemical characteristics rather than digital simulations.
Second, the paired comparison methodology isolates style variables rather than learning from uncontrolled examples, producing more precise and reliable control than associative style learning.
Third, style is integrated into the generation process rather than applied as a post-processing filter. This is fundamentally more reliable than generating footage and then applying a style transformation afterward, which always risks inconsistencies at edge cases the post-processing wasn't trained to handle.
Patent scope and design-around risk
The independent claims are relatively specific to the training methodology described — capturing control and test images, applying physical post-production alterations, training on paired comparisons, iterating based on authenticity feedback. A competitor who trained a style model on large quantities of stylised footage, without the paired comparison approach, would be doing something meaningfully different from what the patent claims.
The core idea — that you can train an AI to understand and reproduce visual style — is not novel and could not be protected by any single patent. The patent protects a particular implementation, not the idea itself. The strategic value — as a business matter rather than a strict legal one — lies in the specificity of the approach: grounded in real photochemistry, filmed under controlled conditions, and trained on actual photochemical processes rather than digital approximations. Whether and how quickly a competitor could replicate those conditions is a practical question separate from what the patent itself covers.
Augmentation by design, not by accident
US 12,511,837 is not about generating novel visual styles. It is specifically about reproducing predetermined ones — styles established by human cinematographers, through deliberate creative choices, on physically photographed material. The AI's job is not to invent; it is to maintain.
That framing puts the AI in a supporting role rather than a generative one. It is closer to a very capable colourist who can match any grade perfectly every time, or a VFX team that can generate inserts indistinguishable from the surrounding photography, than to a system that replaces the cinematographer.
The patents InterPositive has filed — and that Netflix now owns — consistently describe a system built around understanding and preserving the visual decisions of human filmmakers, rather than replacing them. That's a considerably more defensible position, commercially and politically, than the alternative.
Before and after: an actor shot against blue screen (left) transformed by the InterPositive system into a sunset-lit outdoor scene (right). Style conditioning — colour temperature, directional light, atmospheric haze — is applied during generation, not as a post-process filter. Credit: Netflix / InterPositive
IV
The captioner
Teaching AI to read footage like a cinematographer
US 12,511,904 · Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements
Teaching an AI to read cinematography — what the InterPositive captioner patent actually does
Getting AI to generate footage that looks cinematic is one problem. Getting AI to understand why footage looks cinematic — which lens was used, how the camera moved, how the frame was composed, where the subject sat in space — is an entirely different one. That is the problem US 12,511,904 is trying to solve.
The earlier sections of this chapter already imply its importance. The summary grid in the language-layer section gives the captioner a very specific role in the stack: metadata reading — a model trained to watch footage and extract cinematographic descriptions, the reading half of InterPositive's closed loop.
That framing turns out to be exactly right.
Patent on Record
US 12,511,904 B1
| Full title | Method, system, and computer-readable medium for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements |
| Inventor | Benjamin Geza Affleck-Boldt |
| Assignee | InterPositive, LLC |
| Filed | 12 March 2025 |
| Granted | 30 December 2025 |
| Claims | 30 (method, system, non-transitory computer-readable medium) |
| International counterpart | None on the public record so far; the portfolio overview describes the captioner patent as US-only at this stage |
Claim 1 — Method
US 12,511,904 B1 · 30 total claims · system & CRM claims cover the same steps
A computer-implemented method for training a captioner model to generate captions for video content by analyzing and predicting cinematic elements, the method comprising:
- Organizing a dataset comprising raw video clips and corresponding metadata, wherein each video clip represents a specific shot varying one cinematic parameter at a time selected from the group consisting of focal length, camera movement, and framing
- Extracting frames from the raw video clips at a consistent frame rate and storing the frames in a structured format
- Associating each frame with corresponding metadata detailing the cinematic elements present in the frame, wherein the metadata includes information on focal length, camera movement, object distance, and framing style
- Segmenting the raw video clips into shots and frames, wherein a shot comprises a continuous sequence captured without cuts, and frames are extracted at regular intervals from each shot
- Applying frame-level labels to each frame based on the corresponding metadata, wherein the frame-level labels include focal length used during the shot, camera movement details, object distance from the camera, and framing style
- Aggregating the frame-level labels to generate shot-level labels, wherein the aggregating includes calculating average focal length, determining predominant framing style, and smoothing camera movement data across the shot
- Training the captioner model using the frame-level labels, the frames, and the aggregated shot-level labels to recognize and predict the cinematic elements in unseen video content
- Iteratively refining the captioner model based on feedback from validation datasets to improve accuracy of cinematic element prediction
- Deploying the trained captioner model to process and label a large video database, wherein the model generates metadata for new video content based on learned cinematic elements
- Post-processing the generated labels to ensure consistency and accuracy, including performing outlier detection, confidence scoring, and manual quality control
This is the most procedurally detailed method claim in the portfolio — 10 steps versus 4 in US 12,322,036 and 6 in US 12,438,995. Dependent claims extend to: high-resolution lossless storage; JSON/CSV metadata formats; shot detection algorithms; LiDAR and laser locator data for frame-accurate spatial positioning; multi-task and active learning; and structured export into downstream database and editorial workflows. Claim breadth note: Step 1 requires that each clip varies one cinematic parameter at a time from the specific group of focal length, camera movement, and framing — tying the claim to the controlled-capture methodology.
Why a filmmaking AI needs a dedicated reading system
Most video AI systems are trained primarily to understand what is in the frame: a person, a street, a conversation, a landscape. InterPositive's patents, by contrast, are built around the idea that professional filmmaking depends at least as much on how the frame was made as on what appears inside it. The captioner patent is where that distinction becomes operational. Its background section is explicit that current AI video systems struggle with focal length, dolly/pan/tilt movement, shot type, parallax, occlusion, depth of field, and temporal consistency — precisely the technical variables filmmakers actually use.
That matters because the rest of the stack cannot function intelligently unless something upstream can first look at footage and describe it in cinematographic terms.
The language patent needs structured metadata to translate filmmaker instructions into executable parameters. The style patent needs a description of a film's existing visual identity before it can enforce that identity on new material. And the broader synthesis in this chapter makes the architecture clear: spatial data informs the language model, the language model drives generation, and the captioner feeds the style system.
In other words: without a model that can read cinematography, the rest of the InterPositive system would be working blind.
What the patent actually claims
The core independent claim is a workflow claim. It does not claim "an AI that understands film" in the abstract. It claims a fairly specific pipeline for teaching a captioner model to infer cinematic elements from video. The method begins by organising a dataset of raw clips and corresponding metadata, where each clip varies one cinematic parameter at a time — specifically focal length, camera movement, or framing. It then extracts frames at a consistent frame rate, associates each frame with metadata, segments the clips into shots and frames, applies frame-level labels, aggregates those into shot-level labels, trains the captioner model, refines it against validation sets, deploys it over a larger video database, and post-processes the resulting labels through outlier detection, confidence scoring, and manual quality control.
That sequence is worth slowing down for, because it tells you what InterPositive thought the real bottleneck was.
"Not generation. Not prompting. Not style transfer. The bottleneck was building a system that can take real filmed material and convert it into machine-readable cinematographic supervision."
The dependent claims sharpen that further. They specify high-resolution lossless storage, metadata formats such as JSON and CSV, shot detection algorithms, LiDAR and laser locator data for frame-accurate positioning, multi-task learning, active learning, and export into structured formats suitable for downstream databases and video-editing workflows. That last point is particularly revealing: this is not written like a speculative research concept. It is written like a production tool intended to plug into a larger post pipeline.
In plain language: InterPositive is trying to teach a model to watch professionally shot footage and say, in structured technical terms, what happened cinematographically.
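Because the dependent claims mention JSON and CSV export, the natural output of such a system is structured data rather than prose. A hypothetical shot-level caption might look like the record below; none of the field names come from the patent.

```python
# Hypothetical example of a machine-readable shot-level caption (field names invented).
shot_caption = {
    "shot_id": "reel03_shot_142",
    "framing": "medium_close_up",
    "focal_length_mm": 75,
    "camera_movement": {"type": "static", "residual_drift": "minimal"},
    "subject_distance_m": 2.4,
    "depth_of_field": "shallow",
    "confidence": {"framing": 0.93, "focal_length_mm": 0.81, "camera_movement": 0.97},
}
```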
What is the model actually learning?
At the claims level, the answer is already specific. The metadata attached to each frame includes focal length, camera movement, object distance, and framing style. The frame-level labels likewise include focal length used during the shot, movement details, object distance from the camera, and framing style; the shot-level aggregation then calculates average focal length, determines the predominant framing style, and smooths movement data across the shot.
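The frame-to-shot aggregation step translates almost directly into code. The sketch below assumes frame-level labels already exist; the label names and smoothing window are illustrative, not taken from the patent.

```python
# Illustrative frame-level -> shot-level aggregation (label names and window size invented).
from collections import Counter
from statistics import mean

frame_labels = [
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.1},
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.4},
    {"focal_length_mm": 52, "framing": "medium_close_up", "pan_deg_per_s": 9.0},  # noisy frame
    {"focal_length_mm": 50, "framing": "medium", "pan_deg_per_s": 2.2},
]

def aggregate_shot(frames, smooth_window=3):
    avg_focal = mean(f["focal_length_mm"] for f in frames)
    predominant_framing = Counter(f["framing"] for f in frames).most_common(1)[0][0]
    pans = [f["pan_deg_per_s"] for f in frames]
    smoothed = [mean(pans[max(0, i - smooth_window + 1): i + 1]) for i in range(len(pans))]
    return {"avg_focal_length_mm": avg_focal,
            "predominant_framing": predominant_framing,
            "smoothed_pan_deg_per_s": smoothed}

print(aggregate_shot(frame_labels))
```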
But the specification makes clear that the ambition is broader than those few claim words might suggest. It describes training the captioner on temporal sequences of frames rather than isolated stills so that it can learn dynamics such as parallax and motion blur over time. It treats focal length and camera-movement vectors as primary inputs rather than afterthoughts. And it emphasises that conventional captioning tends to focus on object recognition while the present system is aimed at technical cinematic prediction.
That distinction is the heart of the patent.
- A conventional image captioner looks at a frame and says: a man sitting in a chair under studio lighting.
- This system is trying to say: medium close framing, longer focal length, limited depth of field, static or minimally drifting camera, subject isolated against a darker background plane.
That is a profoundly different task. It is also why the summary grid in this chapter is accurate when it reduces the patent to one line: the captioner "reads footage and extracts cinematographic descriptions."
Methodological discipline: one variable at a time
One of the quiet but important ideas here is methodological discipline.
The patent's dataset is not just "lots of film clips." It is a set of raw clips where one cinematic parameter varies at a time. That echoes the broader InterPositive approach described across this chapter: controlled soundstage captures, full metadata, instrumented shots, and systematic parameter isolation. The point is to stop the model from learning vague associative correlations and instead teach it causal visual relationships — what changing focal length actually does to the image, what a certain movement trajectory looks like, how framing categories differ when everything else is held steady.
That is the same larger philosophy that runs through the LiDAR section: rather than train on internet-scale data of unknown provenance, InterPositive filmed a smaller, purpose-built dataset with dense technical annotation. Applied to the captioner, the implication is clear: the model is not being asked to reverse-engineer cinematography from chaos. It is being taught in a controlled pedagogical environment.
Which is, in a sense, how a cinematographer learns too.
Implementation: a production pipeline, not a research prototype
The specification is unusually detailed about the machine around the model. It describes modules for frame extraction, data normalisation, tensor conversion, dataset partitioning, preprocessing, model training, auto-labelling, post-processing, metadata generation, labelling and annotation, retraining, and deployment. The captioner AI computing environment is connected to a broader Filmmaker computing environment, and the model's outputs are meant to be stored, queried, refined, and exported for downstream use.
That matters for two reasons. First, it reinforces that this patent is not just about an abstract neural architecture — it is about workflow. InterPositive is patenting a pipeline in which cinematic metadata is created, normalised, validated, and then re-used elsewhere. Second, it explains why the patent likely mattered so much to Netflix. A model that can reliably label an entire video database in cinematographic terms is not merely a research artefact. It is a piece of infrastructure. It turns footage into indexed technical knowledge.
That is potentially useful for model training, retrieval, continuity matching, style analysis, shot extension, editorial support, and post-production automation.
The middle patent that makes the loop work
At first glance, the captioner may sound less glamorous than the LiDAR foundation or the style system. It doesn't directly generate images. It doesn't provide the natural-language interface. It doesn't apply the look.
But architecturally, it may be the most important middle patent in the stack.
The portfolio overview places it in Layer 4: Core Models, under the name Captioner (SamildAnach), alongside the provisional Filmmaker model and the system's feedback loops. It is the point where raw, measured, purpose-shot footage becomes reusable intelligence. The language patent explicitly says its filmmaking metadata comes from the upstream captioner model. The style analysis says the captioner is the reading system, while the style patent is the writing system; together they form the closed loop that makes the system practically useful in production.
That means the captioner is doing something strategically crucial: it turns footage into a vocabulary the rest of the system can act on.
- Without it, the language layer has less to translate.
- Without it, the style layer has less to preserve.
- Without it, the system becomes much closer to a conventional video model with some specialised training data.
- With it, the system starts to look like a self-improving filmmaking loop.
Training regime: controlled data and production dailies
The earlier sections of this chapter give a reasonable picture. On the controlled soundstage, InterPositive filmed scenes with multiple cameras and angles, documenting lens types, focal lengths, stop values, camera positions, movement speeds, lighting setups, shot types, and framing categories. They also trained on dailies — real production footage from professional shoots — giving the system both precisely isolated examples and the messier complexity of actual film practice.
The captioner sits exactly where those two data sources become useful. The controlled captures teach it clean distinctions: what changes when only focal length changes, or only framing changes. The dailies teach it how those cinematic variables manifest in the wild, under the real conditions of production. That combination is precisely the sort of training regime you would choose if you wanted a model to become fluent in the actual grammar of cinematography rather than merely in statistical image patterns.
And once trained, the captioner can be deployed over "a large video database," generating new metadata that can feed retraining, indexing, and downstream generation tasks. That scaling step is in the independent claim itself. So the captioner is not just reading the initial training set. It is meant to become the machine that reads everything else.
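The post-processing the claim describes, confidence scoring followed by manual quality control, could be as simple in outline as the following sketch; the threshold and record shape are assumptions, not claim requirements.

```python
# Sketch of deployment over a large library: keep high-confidence labels,
# route the rest to manual review.
CONFIDENCE_FLOOR = 0.85

def route_labels(predictions):
    accepted, needs_review = [], []
    for p in predictions:                 # p: {"clip", "label", "confidence"}
        (accepted if p["confidence"] >= CONFIDENCE_FLOOR else needs_review).append(p)
    return accepted, needs_review

preds = [
    {"clip": "lib_000123", "label": "slow push-in, 50mm", "confidence": 0.93},
    {"clip": "lib_000124", "label": "whip pan, 24mm", "confidence": 0.61},
]
auto_accepted, review_queue = route_labels(preds)
# Reviewed corrections can then feed the retraining module the specification describes.
```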
Patent scope and design-around risk
The claims are meaningful, but they are not infinitely broad. They are tied to a specific kind of training pipeline: controlled raw clips, metadata association, frame- and shot-level labels, aggregation logic, validation refinement, deployment over a larger database, and post-processing with confidence scoring and manual quality control. A competitor could build alternative systems that rely more heavily on self-supervised learning, less on single-parameter controlled captures, or a different labelling structure entirely.
So this is probably not the patent that gives anyone ownership over "understanding cinematography with AI" in the abstract. What it does appear to protect is the most direct and production-practical version of that idea: a dedicated captioner trained on purpose-built filmmaking clips and metadata, with LiDAR-assisted spatial supervision, that labels footage in technical cinematic terms and exports those labels into the rest of the workflow.
That is narrower than a fantasy monopoly, but still strategically valuable — especially because, as the broader portfolio argument runs, the strength of this IP is less any single primitive than the integration of all of them into a coordinated filmmaking system.
What the captioner reveals about InterPositive’s wager
The answer is the one the synthesis at the end of this chapter reaches from another angle: InterPositive is not best understood as a prompt-to-video company. What it built is a production-specific system assembled around measured space, controlled data collection, technical metadata, language translation, and visual consistency. In that architecture, the captioner is the part that makes filmed material computationally legible. It teaches the machine to read the work in the same terms practitioners use to make it.
That is also why the captioner patent fits so neatly with the portfolio's overall philosophy. The synthesis says the patents are oriented around grounding AI in the actual practice of professional filmmaking rather than replacing it. The captioner is perhaps the clearest example of that. It is not a fantasy engine for inventing cinema from nothing. It is a system for watching real cinematography and learning to describe it faithfully enough that the rest of the pipeline can preserve, extend, or reproduce it.
In that sense, US 12,511,904 may be the least flashy patent in the stack and one of the most revealing. It tells you that InterPositive's wager was never just that AI should generate moving images. It was that AI should first learn to read film language — technically, precisely, and in the terms cinematographers actually mean.
And once you see that, the rest of the portfolio makes much more sense.
Synthesis
What the stack tells us — the bigger picture
Read together, the four granted US patents in the InterPositive portfolio yield a coherent picture that differs markedly from the one most coverage of the acquisition painted.
InterPositive is not best understood as a generic consumer prompt-to-video product. The granted patents — which explicitly describe adapting existing video language models to generate content, and training models configured to generate video with predetermined styles — do cover generative capability. But the system's generative functions are production-specific: grounded in measured physical space, conditioned on a project's own material, and oriented toward controlled post-production enhancement rather than open-ended synthesis from text prompts. That is a materially different design philosophy from Sora, Runway, or Kling, even if the underlying generation mechanism shares some architecture.
The distinction that matters is not generative versus non-generative. It is purpose-built versus general-purpose. InterPositive's system trains on purpose-built, instrumentally captured data and is designed to operate on footage from a specific production. The major consumer generators train on internet-scale video and produce statistically plausible footage against any prompt. Cinematographic precision and general plausibility are not the same thing, and for a professional production environment, that gap is significant.
US 12,322,036 · The spatial foundation
LiDAR measurement gives the AI ground-truth knowledge of where every surface is in three-dimensional space. All downstream operations — relighting, motion simulation, VFX integration — depend on this geometric certainty.
US 12,438,995 · The language interface
Transfer learning on cinematographic metadata and spatial data enables the AI to respond to professional filmmaking vocabulary. This is the hub through which all instructions and all outputs flow.
US 12,511,904 · The captioner
A model trained to watch footage and extract cinematographic descriptions — focal length, lighting ratio, camera movement, framing. The reading half of the system's closed feedback loop, and the point at which filmed material becomes computationally legible to the rest of the stack.
US 12,511,837 · The style layer
Paired comparisons between real film formats and digital baselines teach the AI the precise optical and chemical characteristics of different visual styles, enabling reliable reproduction during generation.
What makes this portfolio strategically significant is that it covers a complete and coherent pipeline rather than a collection of isolated capabilities. The layers are architecturally interdependent: spatial data informs the language model, the language model drives generation, the captioner feeds the style system, and the style system's outputs are evaluated against the captioner's descriptions. As a business and technical matter, that integration makes the system harder to replicate piecemeal — though it is worth noting that the individual patents protect specific workflows rather than the architecture as a whole, and a competitor building a different but functionally similar system would need to be assessed against each set of claims independently.
There is also a consistency of philosophy across the four patents that is worth noting explicitly. Every one of them is oriented around professional filmmaking as a craft. The LiDAR patent protects the physical measurement of real scenes — not synthetic depth estimation. The language patent protects integration with an existing professional vocabulary — not a new interface paradigm. The style patent protects learning from real photochemical processes — not digital approximations. The captioner patent protects a model trained to describe footage in the terms cinematographers actually use.
In each case, InterPositive's approach is to ground the AI in the actual practice of professional filmmaking rather than to build AI that approximates or replaces it. Whether that constraint reflects strategic positioning, Ben Affleck's personal convictions about the craft, or simply what was technically achievable with a 16-person team on a controlled soundstage — the result is a portfolio whose consistent orientation is toward augmenting professional practice rather than substituting for it.
One structural caveat is worth keeping in view. The analysis here focuses on the four granted US patents, which are enforceable now and where the claim language is fixed. The twelve WIPO/PCT international publications — which seek protection across Europe, Japan, and other major markets — are still moving through national examination and may be narrowed, amended, or rejected in whole or in part by individual patent offices. The international scope of the portfolio is therefore best understood as a strategic ambition rather than a settled legal position.
Whether Netflix maintains that orientation as the technology develops inside a company that produces hundreds of hours of original content annually, and that has every financial incentive to reduce the cost of that production, is a different question entirely. The patents describe what InterPositive built. What Netflix does with it is the next chapter.