S²Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR

Abstract

Scene graph generation (SGG) for surgical procedures is crucial to enhancing holistic cognitive intelligence in the operating room (OR). However, previous works have relied primarily on multi-stage learning, where the generated semantic scene graphs depend on intermediate pose-estimation and object-detection processes. This pipeline may compromise the flexibility of learning multimodal representations and consequently constrain overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR, which leverages multi-view 2D scenes and 3D point clouds complementarily for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point-cloud features. Moreover, based on the augmented features, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, enabling the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3-percentage-point increase in Precision and a 24.2M reduction in model parameters. We further compared our method against generic single-stage SGG methods using broader metrics for a comprehensive evaluation, consistently achieving better performance. Our source code is available at: https://github.com/PJLallen/S2Former-OR.
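The abstract outlines a three-part architecture: a View-Sync Transfusion (VST) scheme that fuses multi-view 2D features, a Geometry-Visual Cohesion (GVC) operation that injects the fused 2D semantics into 3D point-cloud features, and a relation-sensitive transformer decoder whose entity-pair queries predict relations directly. The PyTorch sketch below illustrates one plausible wiring of these components under simple cross-attention assumptions; all module internals, dimensions, query counts, and class counts are illustrative placeholders rather than the authors' implementation (the relational trait priors are omitted), so consult the linked repository for the actual model.

import torch
import torch.nn as nn

class ViewSyncTransfusion(nn.Module):
    # Cross-view self-attention over tokens pooled from all camera views
    # (an assumed realization of the VST scheme).
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens):          # (B, V*N, C) tokens from V views
        fused, _ = self.attn(view_tokens, view_tokens, view_tokens)
        return self.norm(view_tokens + fused)

class GeometryVisualCohesion(nn.Module):
    # Point features attend to the view-synchronized 2D features, producing
    # the augmented bi-modal representation (an assumed realization of GVC).
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, view_feats):   # (B, P, C), (B, V*N, C)
        aug, _ = self.attn(point_feats, view_feats, view_feats)
        return self.norm(point_feats + aug)

class RelationSensitiveDecoder(nn.Module):
    # Learnable entity-pair queries decoded against the augmented features;
    # each query directly emits a <subject, predicate, object> prediction,
    # so no intermediate detection or pose-estimation stage is needed.
    def __init__(self, dim=256, num_queries=100, num_entities=12, num_predicates=14):
        super().__init__()                   # class counts are placeholders
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.subject_head = nn.Linear(dim, num_entities)
        self.object_head = nn.Linear(dim, num_entities)
        self.predicate_head = nn.Linear(dim, num_predicates)

    def forward(self, memory):               # memory: (B, P, C)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)
        return self.subject_head(h), self.predicate_head(h), self.object_head(h)

# Toy end-to-end pass: 6 views of 196 tokens each, 1024 point features.
views = torch.randn(2, 6 * 196, 256)
points = torch.randn(2, 1024, 256)
memory = GeometryVisualCohesion()(points, ViewSyncTransfusion()(views))
sub, pred, obj = RelationSensitiveDecoder()(memory)  # one triplet per query

Collapsing detection and relation prediction into one query-driven decoder is what makes such a pipeline single-stage; in DETR-style designs the per-query triplet logits would typically be matched to ground-truth relations with Hungarian matching during training.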

Original language: English
Journal: IEEE Transactions on Medical Imaging
Publisher: Institute of Electrical and Electronics Engineers
Number of pages: 12
ISSN: 0278-0062
DOI: 10.1109/TMI.2024.3444279
PubMed ID: 39146166
Publication status: Accepted/In press - 2024
Research output: Contribution to journal › Journal article › Research › peer-review

Keywords

  • Scene graph generation
  • 3D surgical scene understanding
  • Transformer
  • Single-stage
  • Bi-modal

Access to Document

  • Full text: Accepted author manuscript, 3.26 MB


    Cite this


    Pei, J., Guo, D., Zhang, J., Lin, M., Jin, Y., & Heng, P. A. (Accepted/In press). S2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR. IEEE Transactions on Medical Imaging. https://doi.org/10.1109/TMI.2024.3444279

