Seq2Seq Scene Graph Generation Utilizing Vision-language Pretrained Model
Johns Hopkins University
Scene graph generation (SGG) aims to represent a visual scene with a hierarchical structure that contains objects, attributes, and relationships. Most existing SGG methods require two stages, first detecting objects and then predicting their pairwise relationships, which is complicated and computationally intensive. Motivated by recent progress in vision-and-language pretraining (VLP), we propose to formulate SGG as a sequence generation problem that is compatible with the unified VLP pipeline. In this work, we present a one-stage sequence-to-sequence (Seq2Seq) SGG model with a Transformer backbone that can be trained alongside other vision-and-language tasks. Our approach achieves strong performance in predicting both object labels and relationships. This study demonstrates the feasibility of formulating SGG as a Seq2Seq task and the potential of improving SGG with vision-and-language pretrained models.
scene graph generation, Seq2Seq, pretrained model, vision-language model
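To make the Seq2Seq formulation concrete, the sketch below shows one way a scene graph could be linearized into a flat token sequence suitable as a generation target. The triplet ordering and the `[SEP]`/`[EOS]` separator tokens are illustrative assumptions, not the paper's exact output scheme.

```python
# Hypothetical linearization of a scene graph into a token sequence,
# illustrating the Seq2Seq formulation of SGG. The token scheme here
# (subject-predicate-object order, separator tokens) is an assumption.

def linearize_scene_graph(triplets):
    """Flatten (subject, predicate, object) triplets into one target sequence."""
    tokens = []
    for subj, pred, obj in triplets:
        # Emit each relationship as three tokens followed by a separator.
        tokens.extend([subj, pred, obj, "[SEP]"])
    tokens.append("[EOS]")  # end-of-sequence marker for the decoder
    return tokens

graph = [("man", "riding", "horse"), ("horse", "on", "beach")]
print(linearize_scene_graph(graph))
# → ['man', 'riding', 'horse', '[SEP]', 'horse', 'on', 'beach', '[SEP]', '[EOS]']
```

A Transformer decoder trained on such sequences can then emit objects and relationships in a single autoregressive pass, avoiding the separate relationship-prediction stage of two-stage pipelines.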