GeCo: A Differentiable Geometric Consistency Metric for Video Generation

1Harvard University, 2Google DeepMind, 3Massachusetts Institute of Technology
Manuscript, code, and data will be released soon.

GeCo pipeline. For each frame pair in a sliding window, we compute residual motion and depth errors and fuse them into scale-invariant inconsistency maps. Aggregating these maps over the window localizes artifacts in the target frame, while separate motion and structure maps provide complementary diagnostics.
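As a rough illustration of the fuse-and-aggregate step in the caption above, the following Python sketch combines per-pair motion and depth error maps and averages the result over a window. The fusion rule (element-wise maximum), the averaging, and the names `fuse`, `aggregate`, and `window_inconsistency` are assumptions made for illustration; the page does not specify how GeCo fuses or aggregates the maps.

```python
def fuse(motion_err, structure_err):
    """Fuse two per-pixel error maps (lists of rows) into one map.

    Assumption: element-wise maximum. The actual GeCo fusion rule may differ.
    """
    return [
        [max(m, s) for m, s in zip(m_row, s_row)]
        for m_row, s_row in zip(motion_err, structure_err)
    ]


def aggregate(maps):
    """Average a list of equally sized maps pixel by pixel (assumed rule)."""
    n = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[y][x] for m in maps) / n for x in range(width)]
        for y in range(height)
    ]


def window_inconsistency(pair_errors):
    """pair_errors: one (motion_err, structure_err) tuple per frame pair
    in the sliding window. Returns the aggregated map for the target frame."""
    return aggregate([fuse(m, s) for m, s in pair_errors])
```

Under these assumptions, a pixel that is inconsistent in several frame pairs of the window keeps a high value after averaging, while an error that appears in only one pair is damped.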

Abstract

We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.

Detect the deformation.

GeCo measures the geometric consistency of generated videos. The Motion Map highlights inconsistency between the observed object motion and the motion induced by the camera, which indicates deformation. The Structure Map highlights depth reprojection errors, which reveal objects that suddenly appear or disappear and complement the Motion Map in occluded regions. Together, these maps detect a range of artifacts in generated videos, each highlighted in the corresponding error map:

  • Row 1: The bowl rotates and morphs into a cup.
  • Row 2: The house deforms during camera panning.
  • Row 3: The glass on the table gradually moves across the dining table.
  • Row 4: The train remains static while the background morphs towards it.
  • Row 5: The stove deforms and elongates during camera rotation.
  • Row 6: The whole living room shrinks as the camera rotates.
  • Row 7: The office scene is filled with sudden deformation during fast camera movement.
  • Row 8: The table and stairs subtly deform when the camera moves forward.
Columns: Input Video, Motion Map, Structure Map, Fused Map.
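The two diagnostics described above can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the GeCo implementation: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, a known relative pose `(R, t)` between the two frames, and a relative (scale-normalized) error. All function names are hypothetical.

```python
from math import hypot


def camera_induced_flow(u, v, depth, K, R, t):
    """Flow that a static 3D point at pixel (u, v) would show under the
    relative camera pose (R, t), plus its expected depth in the target view."""
    fx, fy, cx, cy = K
    # Back-project the pixel into the source camera frame.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Transform into the target camera frame: X' = R X + t.
    Xp = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Project into the target image.
    u2 = fx * Xp[0] / Xp[2] + cx
    v2 = fy * Xp[1] / Xp[2] + cy
    return (u2 - u, v2 - v), Xp[2]


def residual_motion(observed_flow, rigid_flow, eps=1e-6):
    """Observed flow minus camera-induced flow, normalized by the magnitude
    of the camera-induced flow (assumed normalization)."""
    du = observed_flow[0] - rigid_flow[0]
    dv = observed_flow[1] - rigid_flow[1]
    return hypot(du, dv) / (hypot(rigid_flow[0], rigid_flow[1]) + eps)


def depth_reprojection_error(expected_depth, observed_depth, eps=1e-6):
    """Relative gap between the depth predicted by reprojection and the
    depth observed in the target frame (assumed normalization)."""
    return abs(observed_depth - expected_depth) / (expected_depth + eps)
```

For a static point, the observed flow matches the camera-induced flow and the residual is close to zero. A deforming object yields a nonzero residual, and an object that appears or disappears shows up as a depth mismatch even where flow is unreliable.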

Fix the deformation.

The left column shows videos generated by CogVideoX-5B without guidance; the right column shows videos generated by CogVideoX-5B with training-free GeCo guidance. GeCo guidance reduces the deformation artifacts observed in the baseline:

  • Row 1: The eaves of the Japanese house deform severely during camera panning.
  • Row 2: The car's side skirt fractures into two disjoint segments.
  • Row 3: The block structure deforms and wobbles during camera movement.
  • Row 4: The sink drifts within the kitchen and elongates during camera rotation.
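To make the idea of a training-free guidance loss concrete, here is a toy Python sketch of one guided update: after a denoising step, the sample is nudged down the gradient of a differentiable loss, with no change to the model weights. Everything here is an assumption for illustration. The sample is a short list of floats, `toy_denoise` and `toy_loss` are stand-ins for the video model and the GeCo metric, and central finite differences replace the automatic differentiation a real pipeline would use.

```python
def numerical_grad(loss_fn, x, h=1e-5):
    """Central-difference gradient of loss_fn at x (stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grad


def guided_step(x, denoise_fn, loss_fn, scale):
    """One denoising step followed by a gradient step on the guidance loss."""
    x_next = denoise_fn(x)
    grad = numerical_grad(loss_fn, x_next)
    return [xi - scale * gi for xi, gi in zip(x_next, grad)]


def toy_denoise(x):
    """Hypothetical denoiser: shrinks every value slightly."""
    return [0.9 * v for v in x]


def toy_loss(x):
    """Hypothetical consistency loss: penalizes disagreement between
    the two entries of the sample."""
    return (x[0] - x[1]) ** 2
```

Calling `guided_step([1.0, 0.0], toy_denoise, toy_loss, 0.1)` yields a sample with a lower `toy_loss` than the unguided `toy_denoise([1.0, 0.0])`, which is the effect the guidance is meant to have at each step.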

Without GeCo

With GeCo

We further show that GeCo guidance leads to better reconstruction quality when point clouds are predicted from the generated videos. The point clouds are produced by VGGT.

Without GeCo

With GeCo

Scenes: house, car, block, kitchen.

Stop the globe that can't be stopped.

We observe a common failure mode in which models fail to keep a globe static while the camera orbits, a phenomenon we refer to as "the globe that can't be stopped." With GeCo guidance, the spurious object motion is suppressed and the globe remains static relative to the environment. This example shows the improvement on CogVideoX-5B.

Without GeCo

With GeCo

Freeze the dog that can't be frozen.

Similarly, models often struggle to generate static living subjects during camera orbits. For instance, when the prompt requires a dog to remain perfectly still, the baseline often hallucinates unintended animation or drift. With GeCo guidance, this spurious motion is suppressed and the subject remains nearly frozen. The results shown here are generated using CogVideoX-5B.

Without GeCo

With GeCo