We propose factor matting, an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks. The goal of factor matting is to separate the contents of a video into independent components, each visualizing a counterfactual version of the scene where the contents of the other components have been removed. We show that factor matting maps well to a more general Bayesian framing of the matting problem that accounts for complex conditional interactions between layers. Based on this observation, we present a method for solving the factor matting problem that produces useful decompositions even for video with complex cross-layer interactions like splashes, shadows, and reflections.
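For context, classical matting explains each frame as a convex blend of a single foreground and background, and its Bayesian formulation assigns each layer an independent prior. The sketch below is in our own notation, not the paper's exact formulation; it contrasts that classical model with the factor matting view, where the input is explained by N components whose priors may be conditioned on one another:

% Classical matting: one foreground over one background
I_t = \alpha_t F_t + (1 - \alpha_t) B_t
% Bayesian matting: independent priors per layer
p(F, B, \alpha \mid I) \propto p(I \mid F, B, \alpha)\, p(F)\, p(B)\, p(\alpha)
% Factor matting (sketch): N components with conditional priors
p(\{C_i\} \mid I) \propto p(I \mid \{C_i\}) \prod_i p\big(C_i \mid \{C_j\}_{j \neq i}\big)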
Our method is trained per-video and requires neither pre-training on large external datasets nor knowledge of the 3D structure of the scene. We conduct extensive experiments and show that our method not only disentangles scenes with complex interactions but also outperforms top methods on existing tasks such as classical video matting and background subtraction. In addition, we demonstrate the benefits of our approach on a range of downstream tasks.
We reframe video matting in terms of counterfactual video synthesis for downstream re-composition tasks, where each counterfactual video answers a question of the form “what would this component look like if we froze time and separated it from the rest of the scene?” We developed a plug-in for Adobe After Effects for faster re-composition and used it to produce the results in the rightmost column.
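To make the re-composition step concrete, below is a minimal NumPy sketch of the standard back-to-front "over" composite of per-component RGBA layers; the function name and layer convention are ours, not part of any released code. Re-composition then amounts to editing, recoloring, or dropping entries of the layer list before this step.

import numpy as np

def over_composite(layers):
    # layers: back-to-front list of (H, W, 4) RGBA float arrays in [0, 1];
    # layers[0] is the backmost component (e.g. the background), later
    # entries sit on top. Returns an (H, W, 3) RGB frame.
    out = layers[0][..., :3]
    for layer in layers[1:]:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out  # standard "over" operator
    return out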
 
We compare with the most closely related previous work, Omnimatte. While Omnimatte has the merit of associating effects such as shadows and reflections with the correct layer, it has no explicit conditional priors for any layer and thus fails on complex scenes with foreground-background interactions. It also does not produce meaningful color factorizations.
 
While FactorMatte is designed for videos featuring complex cross-component interactions, we find that it also excels on scenes without such interactions. One example is classical video matting.
 
Another example is the task of background subtraction. We select clips from CDW-2014 that feature shadows and reflections that should be associated with foreground objects, as well as significant camera jitter and changes in zoom and exposure.
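Background-subtraction benchmarks like CDW-2014 score binary change masks rather than soft mattes, so evaluating a matting method there requires a binarization step. A minimal sketch, with a threshold value that is our assumption rather than a reported parameter:

import numpy as np

def alpha_to_change_mask(alpha, threshold=0.5):
    # alpha: (H, W) float foreground matte in [0, 1].
    # Returns a uint8 mask (255 = foreground) in the format expected by
    # change-detection metrics such as the CDW-2014 F-measure.
    return (alpha > threshold).astype(np.uint8) * 255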
 
The output of FactorMatte can also be combined with other methods for downstream applications such as object removal and color pop. We compare the results of Flow-edge Guided Video Completion using input masks produced by a variety of matting methods. Simple segmentation masks tend to leave correlated effects like shadows in the scene, while masks from Omnimatte lead to the removal of most interaction effects, including deformations of the background. The mask from FactorMatte contains the object and its shadow but not the cushion, leading to the most plausible removal result.
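As a sketch of that pipeline, the foreground matte (object plus associated effects such as its shadow) can be binarized and slightly dilated before inpainting. The threshold, dilation radius, and the complete_video placeholder below are our assumptions; they stand in for, and do not reproduce, the actual Flow-edge Guided Video Completion interface.

import numpy as np
import cv2

def removal_masks(alphas, threshold=0.1, dilate_px=5):
    # Turn per-frame foreground alphas into dilated binary masks so the
    # inpainting method also covers soft matte fringes around the object.
    kernel = np.ones((2 * dilate_px + 1, 2 * dilate_px + 1), np.uint8)
    return [cv2.dilate((a > threshold).astype(np.uint8) * 255, kernel)
            for a in alphas]

# frames, alphas: per-frame images and FactorMatte foreground mattes.
# complete_video is a placeholder for the video-completion backend:
# removed = complete_video(frames, removal_masks(alphas))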
 
Improving the quality of alpha mattes also enables us to shift the color or timing of components within a video more aggressively than previous methods allow. In the flashlight video below, we change the color of the flashlight beam by adjusting the foreground color layer to be more red outside the input foreground mask.
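A minimal sketch of that edit, under our own assumptions about how the layers are stored (the red gain and all names below are illustrative): scale the red channel of the foreground color layer wherever it falls outside the input object mask, then recomposite over the background.

import numpy as np

def redden_outside_mask(fg_rgb, fg_alpha, input_mask, bg_rgb, red_gain=1.5):
    # fg_rgb: (H, W, 3) foreground color layer in [0, 1].
    # fg_alpha: (H, W, 1) foreground alpha matte.
    # input_mask: (H, W, 1) binary mask of the object itself, so pixels
    # inside the matte but outside the mask form the beam/effect region.
    fg = fg_rgb.copy()
    effect = 1.0 - input_mask
    fg[..., 0:1] = np.clip(fg[..., 0:1] * (1.0 + (red_gain - 1.0) * effect), 0.0, 1.0)
    return fg_alpha * fg + (1.0 - fg_alpha) * bg_rgb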
 
@article{gu2023factormatte,
  title={FactorMatte: Redefining Video Matting for Re-Composition Tasks},
  author={Gu, Zeqi and Xian, Wenqi and Snavely, Noah and Davis, Abe},
  journal={ACM Transactions on Graphics (TOG)},
  volume={42},
  number={4},
  pages={1--14},
  year={2023},
  publisher={ACM New York, NY, USA}
}