Using Paquo to directly interact with QuPath project files for usage in digital pathology machine learning

This is an updated version of the previously described workflow for loading and classifying annotations/detections created in QuPath for use in downstream machine learning workflows. The original post used QuPath's Groovy scripting language to export annotations/detections as GeoJSON from within QuPath, a Python script to classify them, and another Groovy script to reimport them. If you are not familiar with QuPath and/or its annotations, you should probably read the original post first: it provides context for both workflows and will help you appreciate the more elegant approach taken here. If you are already using the original approach, you should be able to modify it easily to follow this newer one.

Here we present an updated approach which makes use of paquo, a Python library for interacting with QuPath, to directly read, create, and modify annotations and/or detections (among many other things outside the scope of this post). This approach takes an input QuPath project, runs a Python-based Deep Learning (DL) algorithm to assign object labels, and produces a new QuPath project containing the results.

In addition, we make use of papermill, a library for parameterizing and executing Jupyter Notebooks. While potentially a bit contrived in this situation, this approach is quite handy and generalizes easily to many other use cases, so it is worth familiarizing yourself with.

Last but not least, we speed up the reading of image tiles by using tiffslide, a tifffile-based drop-in replacement for openslide-python. So, instead of importing openslide we use:

import tiffslide as openslide  # This way we don't need to change our code a lot
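
After this import, opening a slide and reading tiles works just as with openslide-python. A minimal sketch (the file name is a placeholder; to the best of our knowledge tiffslide provides an OpenSlide alias for drop-in compatibility):

slide = openslide.OpenSlide("IMAGENAME.svs")  # tiffslide's compatibility alias for TiffSlide
tile = slide.read_region((0, 0), 0, (512, 512))  # Read a 512x512 tile at level 0 as a PIL image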

As a reminder, the original workflow looked like this:

  • Use QuPath to detect cells
  • Export cells as GeoJSON
  • Import cells into python shapely objects
  • Apply DL to these objects to identify lymphocytes
  • Save GeoJSON from python
  • Import new labeled annotations back into QuPath

The updated workflow looks like this:

  • Use QuPath to detect cells
  • Use paquo to import cells into Python shapely objects
  • Apply DL to these objects to identify lymphocytes
  • Use paquo to import the newly labeled annotations back into QuPath

This yields the following benefits:

  1. No need to leave your preferred programming language. You could of course write a shell script that launches QuPath headlessly from the command line and runs the Groovy script from the original post to export the annotations/detections as GeoJSON, fully automating the approach with JSON files as an intermediate step. If you add a few lines of code to the Groovy script to also run QuPath’s cell detection plugin, there is actually no need to open QuPath at all and perform Steps 1, 2, and 6 of the original workflow listed above (i.e. “Use QuPath to detect cells”, “Export cells as GeoJSON”, and “Import new labeled annotations back into QuPath”) manually.

    In short, you could do something like this from a shell (script):

    ./QuPath.sh script -p [path/to/qupath/project]/project.qpproj [path/to/groovy/script]/detect_cells_and_export_as_geojson_script.groovy

    If you want to know more about running scripts from the command line, please refer to the QuPath documentation and Pete Bankhead’s Blog. Still, you would need to maintain a Groovy script, a shell script (to invoke the command described above), and the actual Python code you already wrote for your machine learning approach.

  2. If you don’t need JSON files of your annotations, there’s no need to store them.
    On the other hand, if you need to store your annotations anyway, or want to export them for use in third-party tools, having them available as JSON files might still prove useful; paquo can produce GeoJSON too, as sketched below.
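
    A minimal sketch of such an export, assuming the hierarchy’s to_geojson() method works as described in the paquo API docs (image is an ImageEntry, introduced later in this post):

    import json

    with open("annotations.geojson", "w") as f:
        json.dump(image.hierarchy.to_geojson(), f)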

On the other hand, while

“paquo’s goal is to provide a pythonic interface to important features of QuPath, and to make creating and working with QuPath projects intuitive for Python programmers.”

it comes with the following potential limitations, which should help you decide which approach is more apt:

  1. paquo will eventually change.
    This means it is wise to stay up to date with its development, as you might need to update your code sooner or later. In contrast, the JSON approach has fewer dependencies, and is “simple enough” that it is unlikely to need revisions in the future.

    As a side note, in case anyone runs into problems: QuPath’s implementation of the GeoJSON format changed minimally between QuPath 0.2.3 and 0.3.0-rc1, which might affect your ability to reuse GeoJSON-stored annotations between different versions of the software (refer to this Paquo issue and this QuPath issue for more details). However, paquo can handle both variants, no matter which QuPath version you use with it.

  2. Project files produced by different versions of QuPath are not guaranteed to be backward compatible, which may require running multiple QuPath versions in parallel.
    This limitation is avoided by storing annotations as JSON, which is version agnostic. However, if you need to work with your old QuPath projects, you can also opt to specify the QuPath version paquo uses, as sketched below.
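
    To our understanding, the QuPath installation paquo uses can be pinned via a .paquo.toml configuration file or a matching environment variable; treat the exact keys below as assumptions to verify against the paquo documentation:

    # .paquo.toml -- point paquo at a specific QuPath installation
    qupath_dir = "/opt/QuPath-0.2.3"  # Example path; adjust to your installation

    Alternatively, setting the PAQUO_QUPATH_DIR environment variable before starting Python should have the same effect.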

Use QuPath to detect cells

You should start with detecting cells from within QuPath. Unfortunately, paquo does not (yet) offer the possibility to do this from within your Python code. So, detect cells on an image and simply save the project. Please refer to the equivalent section (“Use QuPath to detect cells”) in the original blog entry.

Import cells into Python shapely objects

Now, instead of exporting the annotations/detections as GeoJSON, we use paquo to load the whole QuPath project, basically the same way we would load any file.

from paquo.projects import QuPathProject

qp = QuPathProject(PROJECT_PATH, mode='r')
print(f"Opened project '{qp.name}'")
print(f"Project has {len(qp.images)} image(s).")

print(qp.images)

A QuPathProject has the property ‘images’, which returns a sequence of ImageEntries. More details about the class can be found in the API documentation.

This is the output of the above code:

Opened project 'myproject'
Project has 1 image(s).
ImageEntries(['IMAGENAME.svs'])

We will use the same method to create the new QuPath project (an exact copy) that will finally receive our classified annotations. We can also already add the annotation classes that we want to assign to our annotations/detections. To do so, we specify lists of class names and class colors at the start of the script:

CLASSNAMES = ["Other", "Lymphocyte"]  # Choose the classes you want yourself
CLASSCOLORS = [-377282, -9408287]  # QuPath RGB colors are negative because of the way Java stores them

with QuPathProject(NEW_PROJECT_PATH, mode='a') as qpout:
    add_qupath_classes(CLASSNAMES, CLASSCOLORS, qpout)
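
Here, add_qupath_classes is a small helper (the full version lives in the Github code linked at the end). A minimal sketch of what it might look like, assuming paquo’s QuPathPathClass constructor and QuPathColor.from_java_rgba behave as documented:

from paquo.classes import QuPathPathClass
from paquo.colors import QuPathColor

def add_qupath_classes(names: list, colors: list, project) -> None:
    # Build one path class per (name, color) pair and assign them to the project
    project.path_classes = [
        QuPathPathClass(name, color=QuPathColor.from_java_rgba(color))
        for name, color in zip(names, colors)
    ]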

We will then use the first image and read all detections, storing them in a list called allshapes. We only want our machine learning model to classify the detections (objects created by QuPath), but we might still want to keep the annotations we made (human-drawn objects) and add them back into the new project.

def read_qupath_annotations(image):
    annotations = image.hierarchy.annotations
    ann = [annotation.roi for annotation in annotations] if annotations else list()
    return ann

def read_qupath_detections(image):
    detections = image.hierarchy.detections
    det = [detection.roi for detection in detections] if detections else list()
    return det


image = qp.images[0]
ann = read_qupath_annotations(image)  # We keep the annotations, but we don't classify them
det = read_qupath_detections(image)
allshapes = det  # We only want to classify the detections

At this point we have not only read all detections as shapely objects and stored them in a list, but we have also retrieved the image entry from our original project file, so we can add it to the new QuPath project. We could also copy the image’s image_type from the original entry, which is safe even if the image type was never set; alternatively, if we know the specific type, we can set it directly, as done below. Here we can also define whether we allow duplicate images. We extend the above code by adding the following lines at the end:

from paquo.images import QuPathImageType

wsi_fname = image.uri.split(":")[-1]  # Assumes a plain local file URI
entry = qpout.add_image(wsi_fname, image_type=QuPathImageType.BRIGHTFIELD_H_E,
                        allow_duplicates=True)

Apply DL to these objects

The overall workflow to get the classifications is the same as described in the previous blog post. We look at each possible “tile” in a whole slide image, and if enough objects are present to make the computation worthwhile (MINHITS), we load the tile and operate on it; otherwise we skip it. A minimal sketch of this loop follows.
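
Below is only a sketch of that loop, not the full implementation from the linked Github code. It assumes the slide is opened with tiffslide, that TILESIZE and MINHITS are hypothetical global variables, and that shapely 1.x is used, where STRtree.query() returns the original geometries (shapely 2.x returns indices instead):

from shapely.geometry import box
from shapely.strtree import STRtree

searchtree = STRtree(allshapes)  # Build the spatial index once
slide = openslide.OpenSlide(wsi_fname)  # tiffslide via the import alias from above

for y in range(0, slide.dimensions[1], TILESIZE):
    for x in range(0, slide.dimensions[0], TILESIZE):
        hits = searchtree.query(box(x, y, x + TILESIZE, y + TILESIZE))
        if len(hits) < MINHITS:
            continue  # Too few detections to make reading this tile worthwhile
        tile = slide.read_region((x, y), 0, (TILESIZE, TILESIZE))
        # ... cut out one patch per hit, batch the patches, and run the DL model
        # on them (see process_batch below) ...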

Import new labeled annotations back into QuPath

After we have finished predicting/assigning classes to our detections, we attach the class to each shapely object in allshapes as an additional property which we call class_id, like this:

import numpy as np
from tqdm import tqdm

def process_batch(arr_out, hits: list):
    classids = []
    for batch_arr in tqdm(divide_batch(arr_out, BATCHSIZE), leave=False):
        classids.append(np.random.choice([0, 1], batch_arr.shape[0]))  # Apply your DL model here

    classids = np.hstack(classids)

    for hit, classid in zip(hits, classids):
        hit.class_id = classid
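
divide_batch is a small helper from the linked Github code; a sketch of what it presumably does (yield successive slices of at most BATCHSIZE items along the first axis):

def divide_batch(arr, batch_size):
    # Yield consecutive chunks of at most batch_size rows
    for i in range(0, arr.shape[0], batch_size):
        yield arr[i:i + batch_size]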

The “hits” in the searchtree are references to the original objects (no duplicated memory, and directly modifiable), so we can change a “hit”’s class_id directly and it will be reflected in the allshapes list immediately.
Finally, we can loop through all elements of allshapes to add them and their associated classes as annotations to our newly created QuPath project. If we wish, we can also give our annotations a name, e.g. their geom_type (i.e., the name of the geometry’s type, such as ‘Polygon’) as in the example below, or some other useful information.
Use help(Polygon) in the Python console to retrieve a full list of properties of a shapely Polygon object. If you don’t specify annotation.name below, your annotations’ names in QuPath will be the same as their class names (the ones we specified in the beginning, “Other” or “Lymphocyte”).

def add_annotations(qpout, entry, ann: list, allshapes: list):
    for classified_shape in allshapes:
        annotation = entry.hierarchy.add_annotation(
            roi=classified_shape,
            path_class=(qpout.path_classes[classified_shape.class_id]
                        if hasattr(classified_shape, "class_id") else None))

        annotation.name = str(classified_shape.geom_type)  # Use a different property here if you like.
    if ann:  # Add the human-drawn annotations back to the new project as they were
        for annotation_shape in ann:
            entry.hierarchy.add_annotation(roi=annotation_shape)
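
Putting it together inside the with QuPathProject(...) as qpout: block from earlier (note that, to our understanding, the project is saved automatically when the context manager exits, so the explicit save is optional):

add_annotations(qpout, entry, ann, allshapes)
qpout.save()  # Optional here; the context manager also saves on exit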

This is how our project looks in the end, without specifying annotation.name explicitly.

Image shows the annotations as they appear in the QuPath GUI.

Here we have set the annotation.name to be the geom_type. (This example may not be very useful, but it serves to illustrate the principle.)

Image shows the annotations as they appear in the QuPath GUI.


This is it! With little effort, we have created a new QuPath project that now holds classified annotations.

Loop through multiple QuPath projects

With the above code we can work our way through a single QuPath project containing one or more images. To apply the script to multiple QuPath projects, we can make use of the papermill library mentioned in the beginning.

This way we can parameterize our script by supplying a list of QuPath projects ['project1', 'project2'] to process, which we might read from a file or retrieve by searching a given directory.

The most important step we need to take to use this approach is to designate a cell in our script with the tag parameters. If you are using Jupyter Notebook or JupyterLab, you can refer to the official documentation. If you use VS Code’s Jupyter extension, there is no way to add cell tags. You can, however, open your notebook file in a plain text editor, find the cell that holds your global variable PROJECT_NAME, and add the tag to its metadata field manually, like this:

Before:

{
 "cell_type": "code",
 "execution_count": 36,
 "metadata": {},
 "outputs": [],
 "source": ["PROJECT_NAME = \"project1\""]
},

After:

{
 "cell_type": "code",
 "execution_count": 36,
 "metadata": {"tags": ["parameters"]},
 "outputs": [],
 "source": ["PROJECT_NAME = \"project1\""]
},

Now the cell that holds our global variable PROJECT_NAME can receive the values we provide through a second Jupyter notebook. That second notebook starts like this:

import sys
import papermill as pm
from tqdm.notebook import tqdm

The module tqdm is optional, but it helps visualize the progress.

With the inspect_notebook function we can retrieve the inferred notebook parameters.

pm.inspect_notebook('./paquo_classify_qupath_objects.ipynb')

This outputs our parameter’s name and default value:

{'PROJECT_NAME': {'name': 'PROJECT_NAME',
  'inferred_type_name': 'None',
  'default': '"project1"',
  'help': ''}}

Here we specify the projects we would like to work on:

PROJECT_NAMES = ['project1', 'project2']  # A list of QuPath projects

for projectname in tqdm(PROJECT_NAMES):
    pm.execute_notebook(
        input_path='./paquo_classify_qupath_objects.ipynb',
        output_path='null',
        stdout_file=sys.stdout,
        parameters=dict(PROJECT_NAME=projectname)
    )

stdout_file=sys.stdout ensures that the output from the sub-notebook appears directly in the parent notebook, which helps quite a bit with progress monitoring. Once you run the script, it will set the global variable PROJECT_NAME in the paquo_classify_qupath_objects notebook and run it with each of the specified project names in a new environment.

You could also use papermill’s power to create separate Jupyter notebook files with the respective variable in the filename: instead of specifying output_path='null', we can set output_path=f'{projectname}.ipynb' (note the f-string using the loop variable), which creates a Jupyter notebook file for every project we provided.

The generated output files are essentially the same as the input file, but they now have an additional cell which holds the actual parameter value:

# Parameters
PROJECT_NAME = "project1"

Conclusion

With the presented approach it is possible to perform all the steps needed for classifying QuPath detections/annotations in one Python script. Additionally, if you usually have multiple QuPath projects (or one project per WSI to enable increased parallelization), you can also make use of the papermill approach to parameterize your Python code.

We tested our approach with a WSI containing more than 600,000 detections; the script ran for 3 minutes. In terms of speed, three things consume the majority of that time:

  1. Reading images. This is, as described before, by far the most expensive step in the given workflow, so the script tries to minimize the number of tiles that are actually loaded via the global variable MINHITS, which defines which tiles to skip because they do not contain enough cells (detections) to be interesting for us. By going from MINHITS = 1 to MINHITS = 100 and replacing the openslide module with tiffslide, we reduced the time spent reading tiles to around 25% of the total run time. Compared to using the openslide module, we get an overall time reduction of more than 40% in our specific use case.
  2. Reading a lot of detections. Reading 600,000 detections with the read_qupath_detections function took about 28% of the overall time.
  3. And last but not least, adding annotations to the new QuPath project took another 26% of the total time the script ran for.

You can find the full code on Github.

If you are using conda to manage your environments, you can use the file “paquo050_env.yml” on a 64-bit Linux system to create an environment with paquo 0.5.0 and all other packages necessary to run the code.

Special thanks to Andreas Poehlmann, one of the creators of Paquo, for helping review and providing feedback on this post!
