Abstract

Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. However, aspects such as tool–action–target combinations and the identity of the operating hand remain underexplored. To address this, we propose Endoscapes-SG201, a new dataset including annotations for action triplets (tool–action–target) and hand identity. We also introduce SSG-Com, a graph-based method designed to represent these critical elements. Experiments on downstream tasks—Critical View of Safety (CVS) assessment and action triplet recognition—demonstrate the importance of integrating these scene graph components, significantly advancing holistic surgical scene understanding.

Main Contributions

Endoscapes-SG201

We built Endoscapes-SG201, a dataset for holistic scene graph research, by extending and refining the publicly available Endoscapes-Bbox201 dataset released by CAMMA. To annotate the additional labels, two clinical experts from Samsung Medical Center refined the bounding boxes in Endoscapes-Bbox201.

Step 1: We refined the bounding boxes from Endoscapes-Bbox201.
Step 2: We subdivided the 'Tool' class into 6 instrument classes.
Step 3: We annotated action labels (tool–structure interactions) and hand identity labels (which hand manipulates each tool).

Dataset Comparison

This table contrasts the datasets used in previous surgical scene graph studies with Endoscapes-SG201.

Endoscapes-SG201 is designed with holistic scene graph research in mind.
It incorporates:
- Diverse tools and anatomical structures as graph nodes.
- Diverse relationships as graph edges.
- Hand Identity labels as attributes of the tool nodes.
By unifying these elements, the dataset provides a more expressive and comprehensive foundation for modeling surgical scenes.
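
The structure described above (entities as nodes, relationships as edges, hand identity as a tool-node attribute) can be sketched as a small data structure. This is only an illustrative sketch, not the dataset's actual file format: the class names, field names, the anatomical structure names, and the `left_of` spatial relation are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    node_id: int
    category: str                # instrument or anatomical structure name (illustrative)
    is_tool: bool
    hand: Optional[str] = None   # hand identity, only allowed on tool nodes

    def __post_init__(self) -> None:
        # Hand identity is an attribute of tool nodes only.
        if self.hand is not None and not self.is_tool:
            raise ValueError("hand identity is only defined for tool nodes")


@dataclass
class Edge:
    source: int    # node_id of the first entity
    target: int    # node_id of the second entity
    relation: str  # relation label (illustrative)
    kind: str      # "spatial" or "action"


@dataclass
class SceneGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def action_triplets(self) -> List[Tuple[str, str, str, Optional[str]]]:
        """Return (tool, action, target, hand) for every action edge."""
        by_id = {n.node_id: n for n in self.nodes}
        return [
            (by_id[e.source].category, e.relation,
             by_id[e.target].category, by_id[e.source].hand)
            for e in self.edges
            if e.kind == "action"
        ]


# Hypothetical frame: the operator's left-hand grasper retracts the gallbladder.
graph = SceneGraph(
    nodes=[
        Node(0, "Grasper", is_tool=True, hand="Lt"),
        Node(1, "gallbladder", is_tool=False),
    ],
    edges=[
        Edge(0, 1, "left_of", kind="spatial"),   # hypothetical spatial relation
        Edge(0, 1, "Retract", kind="action"),
    ],
)
print(graph.action_triplets())
```

Keeping spatial and action relations as separate edge kinds between the same node pair lets one graph carry both types of relationship at once.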

Endoscapes-SG201 Details

This table presents the category-wise distribution of the additional labels introduced in Endoscapes-SG201.

Additional Annotations:

6 Surgical Instruments: Hook (HK), Grasper (GP), Clipper (CL), Bipolar (BP), Irrigator (IG), Scissors (SC)
6 Surgical Actions: Dissect (Dis.), Retract (Ret.), Grasp (Gr.), Clip (Cl.), Coagulate (Co.), Null
3 Hand Identities: Operator’s Right Hand (Rt), Operator’s Left Hand (Lt), Assistant’s Hand (Assi)
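
For readers who want to work with these label sets programmatically, one possible encoding is shown below. It is a sketch under assumptions: the enum and function names are hypothetical, the values are the abbreviations listed above, and the target name in the usage line is only an illustrative anatomical structure. The dataset's own label IDs and ordering may differ.

```python
from enum import Enum


class Instrument(Enum):
    HOOK = "HK"
    GRASPER = "GP"
    CLIPPER = "CL"
    BIPOLAR = "BP"
    IRRIGATOR = "IG"
    SCISSORS = "SC"


class Action(Enum):
    DISSECT = "Dis."
    RETRACT = "Ret."
    GRASP = "Gr."
    CLIP = "Cl."
    COAGULATE = "Co."
    NULL = "Null"


class Hand(Enum):
    OPERATOR_RIGHT = "Rt"
    OPERATOR_LEFT = "Lt"
    ASSISTANT = "Assi"


def describe(instrument: Instrument, action: Action, hand: Hand, target: str) -> str:
    """Format a tool-action-target triplet with its hand identity (hypothetical helper)."""
    return f"{instrument.value}({hand.value}) -{action.value}-> {target}"


# Illustrative usage; "cystic_duct" is an assumed target name.
print(describe(Instrument.HOOK, Action.DISSECT, Hand.OPERATOR_RIGHT, "cystic_duct"))
```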

SSG-Com

SSG-Com is designed to leverage the diverse labels of Endoscapes-SG201.

Graph Construction
- Nodes: Surgical instruments (with Hand identity), Anatomical structures
- Edges: Spatial relations, Surgical action relations

Multi-task Training (3 classifiers)
- Classifier 1: Spatial relation classification
- Classifier 2: Action relation classification
- Classifier 3: Hand identity classification

Total Loss:
\[ L_{\text{total}} = L_{\text{LG}} + \lambda_{\text{action}} L_{\text{action}} + \lambda_{\text{hand}} L_{\text{hand}} \]
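
The weighted sum in the total loss can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' training code: the function name is hypothetical, the default weights of 1.0 are placeholders rather than reported values, and L_LG is assumed here to be the base latent-graph loss that the action and hand terms are added to.

```python
def total_loss(
    loss_lg: float,
    loss_action: float,
    loss_hand: float,
    lambda_action: float = 1.0,  # placeholder weight, not a reported value
    lambda_hand: float = 1.0,    # placeholder weight, not a reported value
) -> float:
    """Combine the three loss terms as L_LG + lambda_action * L_action + lambda_hand * L_hand."""
    return loss_lg + lambda_action * loss_action + lambda_hand * loss_hand


# Illustrative numbers only: 1.0 + 0.5 * 2.0 + 0.25 * 4.0
example = total_loss(1.0, 2.0, 4.0, lambda_action=0.5, lambda_hand=0.25)
print(example)
```

The two weights control how strongly the action and hand identity objectives influence training relative to the base term. Setting either weight to zero removes that objective.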

Experimental Results

The latent graph of SSG-Com demonstrated its effectiveness across two downstream tasks.

- Action Triplet Recognition
- CVS Prediction

Quantitative Results

In Action Triplet Recognition (a):

Modeling action relations as graph edges between nodes improved performance from 18.0 mAP (LG-CVS) to 23.5 mAP.
Further incorporating Hand Identity increased performance to 24.2 mAP.
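
To put these numbers in perspective, the short calculation below derives the absolute and relative gains from the three mAP values reported above. The helper name is hypothetical, and the relative percentages are derived here for illustration; they are not figures reported by the authors.

```python
from typing import Tuple

LG_CVS_MAP = 18.0        # baseline (LG-CVS)
WITH_ACTION_MAP = 23.5   # action relations modeled as graph edges
WITH_HAND_MAP = 24.2     # action edges plus Hand Identity


def gain(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain in mAP points, relative gain in percent), rounded to 1 decimal."""
    diff = new - old
    return round(diff, 1), round(diff / old * 100, 1)


print(gain(WITH_ACTION_MAP, LG_CVS_MAP))     # action edges vs. baseline
print(gain(WITH_HAND_MAP, WITH_ACTION_MAP))  # adding Hand Identity
print(gain(WITH_HAND_MAP, LG_CVS_MAP))       # full model vs. baseline
```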

In CVS Prediction (b):

Training on Endoscapes-SG201 improved the performance of LG-CVS by 0.9 mAP, and SSG-Com achieved the highest score of 64.6 mAP.

Qualitative Results

With Endoscapes-SG201 and SSG-Com, we can construct a richer holistic surgical scene graph than existing approaches.

The authors thank Ms. Haeun Kim, M.F.A., for her professional assistance with the illustrations in this work.

✨MICCAI2025✨Towards Holistic Surgical Scene Graph