Why do we research visual relationships?

Visual relationships connect isolated instances into a structural graph. They provide a level of scene understanding that is higher than the single instance and lower than the holistic scene. Visual relationships act as the bridge between perception and cognition.

What have we done in visual relationships?

From phrase detection to scene graph generation, we now have clearer data standards and tasks for representing relationships.

What is next for visual relationships?

After representing visual relationships, we should use the relationship information to build the bridge from perception to cognition. Beyond producing a correct scene graph, relationships should go further semantically and play an actual role in scene understanding.

Why are applications of visual relationships stuck?

More problems exist on the data side than on the method side. Before designing methods for how to learn, we should first figure out what to learn.

What should be learned in visual relationships for cognition?

Visually-relevant relationships! Visually-irrelevant relationships, such as spatial relationships and low-diversity relationships, degrade relation problems into detection or deductive reasoning. These visually-irrelevant relationships pull relationship inference back toward the perceptive side. To take advantage of relationship information for semantic understanding, only the visually-relevant relationships should be learned!


The Visually-relevant Relationships Dataset (VrR-VG) is a scene graph dataset constructed from Visual Genome. It contains 117 visually-relevant relationships selected by our method.

VrR-VG is constructed from Visual Genome. It has 58983 images and 23375 relation pairs. By excluding positional and statistically biased relationships, VrR-VG is more balanced and includes more valuable relation data. Rather than generating a scene graph dataset by relation-label frequency, VrR-VG gathers visually-relevant relationships such as "play with", "lay on", and "hang on". Relationships like "on", "wearing", and "has", which can be inferred from positional information or data bias, are excluded from our dataset. VrR-VG offers more balanced research material for relationships and more diverse relation data for semantic tasks.
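As a toy illustration of the exclusion above (not the actual VrR-VG selection procedure, which uses a learned visual-discriminability criterion), filtering relation triplets against a hand-picked set of positional or biased predicates might look like this; the field names are hypothetical:

```python
# Toy illustration only: drop triplets whose predicate is in a hand-picked
# positional/biased list. (The real VrR-VG selection is learned, not a
# fixed predicate list.)
POSITIONAL_OR_BIASED = {"on", "wearing", "has"}

def filter_relations(triplets):
    """Keep only triplets whose predicate is not obviously positional/biased."""
    return [t for t in triplets if t["predicate"] not in POSITIONAL_OR_BIASED]

rels = [
    {"subject": "man", "predicate": "play with", "object": "dog"},
    {"subject": "cup", "predicate": "on", "object": "table"},
    {"subject": "coat", "predicate": "hang on", "object": "hook"},
]
kept = filter_relations(rels)  # keeps "play with" and "hang on"
```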

Distribution comparison of datasets. The images in the left column are from VG150, and the images in the right column are from our VrR-VG. VrR-VG is more balanced and diverse.

Tag cloud comparison of the datasets.


For scene graph generation, you can take the last 5000 samples as the test set (the same setting as in neural-motifs). For feature representation, you can use all the samples for training.
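The split above can be sketched as follows; this is a minimal sketch assuming the dataset exposes an ordered list of sample identifiers (the helper name `split_vrr_vg` is hypothetical):

```python
# Hypothetical split helper following the neural-motifs convention:
# the last 5000 samples are held out as the test set.
def split_vrr_vg(sample_ids, num_test=5000):
    """Return (train_ids, test_ids) with the last `num_test` held out."""
    train_ids = sample_ids[:-num_test]
    test_ids = sample_ids[-num_test:]
    return train_ids, test_ids

ids = list(range(58983))  # VrR-VG has 58983 images
train, test = split_vrr_vg(ids)
```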

An implemented example of VD-Net is also provided. To select our complete set of visually-relevant relationships, multiple initial learning rates (from 1e-5 to 1e-2, with momentum 0.9 and weight decay 1e-4) were used, and we took 10 models for voting. Additionally, to avoid overfitting, data sampling should be completely random: in each batch, the relation triplets should come from different images, and the labels should be different from each other.
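The batch constraints above (random sampling, distinct source images, distinct labels within a batch) can be sketched with a simple greedy sampler; the function name and triplet fields are hypothetical, not part of the released VD-Net code:

```python
import random

# Hypothetical sampler enforcing the batch constraints: each batch draws
# relation triplets from distinct images with distinct predicate labels.
# `triplets` is assumed to be a list of dicts with "image_id" and "label" keys.
def sample_batch(triplets, batch_size, rng=random):
    batch, used_images, used_labels = [], set(), set()
    order = list(range(len(triplets)))
    rng.shuffle(order)  # completely random sampling order
    for i in order:
        t = triplets[i]
        # skip triplets that repeat an image or a label already in the batch
        if t["image_id"] in used_images or t["label"] in used_labels:
            continue
        batch.append(t)
        used_images.add(t["image_id"])
        used_labels.add(t["label"])
        if len(batch) == batch_size:
            break
    return batch
```

The greedy scan keeps rejection cheap; because the full shuffled order is scanned, a valid batch is found whenever one exists in the data.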

Download: VrR-VG (Google Drive), VD-Net (GitHub)


Yuanzhi Liang
Xi'an Jiaotong University
JD AI Research

Yalong Bai
JD AI Research

Wei Zhang
JD AI Research

Tao Mei
JD AI Research