|dc.description.abstract||Robotic surgical systems are increasingly used for minimally-invasive surgeries. As such, there is opportunity for these systems to fundamentally change the way surgeries are performed by becoming intelligent assistants rather than simply acting as the extension of surgeons' arms. As a step towards intelligent assistance, this thesis looks at ways to represent different aspects of robot-assisted surgery (RAS).
We identify three main components: the robot, the surgeon actions, and the patient scene dynamics. Traditional learning algorithms in these domains are predominantly supervised methods. This has several drawbacks. First many of these domains are non-categorical, like how soft-tissue deforms. This makes labeling difficult. Second, surgeries vary greatly. Estimation of the robot state may be affected by how the robot is docked and cable tensions in the instruments. Estimation of the patient anatomy and its dynamics are often inaccurate, and in any case, may change throughout a surgery. To obtain the most accurate information, these aspects must be learned during the procedure. This limits the amount of labeling that could be done. On the surgeon side, different surgeons may perform the same procedure differently and the algorithm should provide personalized estimations for surgeons. All of these considerations motivated the use of self-supervised learning throughout this thesis.
We first build a representation of the robot system. In particular, we looked at learning the dynamics model of the robot. We evaluate the model by using it to estimate forces. Once we can estimate forces in free space, we extend the algorithm to take into account patient-specific interactions, namely with the trocar and the cannula seal. Accounting for surgery-specific interactions is possible because our method does not require additional sensors and can be trained in less than five minutes, including time for data collection.
Next, we use cross-modal training to understand surgeon actions by looking at the bottleneck layer when mapping video to kinematics. This should contain information about the latent space of surgeon-actions, while discarding some medium-specific information about either the video or the kinematics.
Lastly, to understand the patient scene, we start with modeling interactions between a robot instrument and a soft-tissue phantom. Models are often inaccurate due to imprecise material parameters and boundary conditions, particularly in clinical scenarios. Therefore, we add a depth camera to observe deformations to correct the results of simulations. We also introduce a network that learns to simulate soft-tissue deformation from physics simulators in order to speed up the estimation.
We demonstrate that self-supervised learning can be used for understanding each part of RAS. The representations it learns contain information about signals that are not directly measurable. The self-supervised nature of the methods presented in this thesis lends itself well to learning throughout a surgery. With such frameworks, we can overcome some of the main barriers to adopting learning methods in the operating room: the variety in surgery and the difficulty in labeling enough training data for each case.||