Exporting and re-importing annotations from QuPath for usage in machine learning

Update (Nov 2020): Code has now been placed on GitHub which enables the reading and writing of compressed GeoJSON files at all stages of the process described below. Compression reduces the file size by approximately 93% : )

QuPath is a digital pathology tool that has become especially popular because it is both easy to use and supports a large number of different whole slide image (WSI) file formats. QuPath is also able to perform a number of relevant analytical functions with a few mouse clicks. Of interest in this blog post: the pathologists we tend to work with are either already familiar with QuPath, or find it easier to learn versus other tools. As a result, QuPath has become a go-to tool for us for both the creation and review of annotations and outputs created by our algorithms.

Here we introduce a robust method using GeoJSON for exporting annotations (or cell objects) from QuPath, importing them into Python as shapely objects, operating upon them, and then re-importing a modified version of them back into QuPath for downstream usage or review. As an example use case we will be looking at computationally identifying lymphocytes in WSIs of melanoma metastases using a deep learning classifier.

The workflow we will use here has been tested on images containing >300,000 objects without any issues, and is able to classify these cells in ~2 minutes using a Thinkpad P51 laptop with an Nvidia Quadro M2200 GPU. We note that this workflow is applicable not only to cells, but to annotations or objects of any sort.

The workflow is:

QuPath to detect cells

In our test image, we can see that the cells appear especially “clean”, i.e., not overlapped, and there is good visual separation in colors between the cell nuclei and the surrounding cytoplasm and tissue.

As such, we would expect QuPath’s inbuilt cell detector to work very well, and thus employ it here. The process is well described elsewhere, including in a YouTube tutorial, so we will only briefly mention the steps here:

  1. Create a region of interest (ROI) using the rectangular selection tool
  2. Click Analyze -> Cell detection -> Cell detection
  3. Set “include cell nucleus” to false, set cell expansion to 0, and untick “make measurements”

While the latter steps are not strictly necessary, they make the process run much faster, and the resulting output file much smaller. After executing these steps we’re left with a ROI which has cells nicely identified with what appear to be reasonable boundaries:

Now we want to export this information as GeoJSON.

Export objects from QuPath into GeoJSON

In the latest version of QuPath (version 0.2.3), the team has opted to support what is called the GeoJSON format. One can read more about the official specification here: https://geojson.org/

GeoJSON is a fantastic option for storing annotations (or geometric objects in general) and we’ll go into it briefly here.

GeoJSON is a particular variant of JSON (JavaScript Object Notation), which itself is a lightweight data-interchange format (you can read more here: https://www.json.org/json-en.html).

Before actually exporting the data from QuPath, let’s first take a look at an example of JSON, in this case one of our GeoJSON objects:

{
    "type": "Feature",
    "id": "PathDetectionObject",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[17738.72, 42238], [17737.94, 42240.13], [17738.39, 42242.34], [17737.21, 42244.28], [17737.19, 42246.54], [17739.74, 42250.23], [17743.86, 42248.63], [17743.7, 42246.37], [17745.05, 42242.08], [17748.38, 42239.13], [17747.76, 42238.25], [17738.72, 42238]]]
    },
    "properties": {
        "isLocked": false,
        "measurements": [],
        "classification": {
            "name": "Other",
            "colorRGB": -377282
        }
    }
}

As we can see, JSON is just plain text. JSON describes a specific way of storing information where each key has a value associated with it. So here we have a “type” of “Feature” and its “id” is “PathDetectionObject”. Keys can also be nested: for example, this object also has a “geometry”, and inside its curly brackets the sub-keys “type” and “coordinates” belong to that geometry.

In this instance, we can see that there is a polygon, and these are the coordinates of that polygon. More specifically, in our case this polygon traces the boundary of a cell detected by QuPath in the first step.

At the bottom, there is a “properties” key which holds other values, such as the name and the color of the object as it appears in QuPath.

Additionally, we can see an empty array of measurements, which would normally contain things like nuclear area, nuclear perimeter, and texture had we enabled the “make measurements” option in QuPath.

This is an example of a single GeoJSON object. It turns out that if you take a number of these objects and place them next to each other, separated by commas and wrapped in square brackets, you have an array of these objects, and you can store that as a text string.

Keep in mind that the above is an example of JSON, which is basically just key-value pairing. The GeoJSON specification, on the other hand, describes exactly which keys are possible and their associated types: “geometry”, “Polygon”, and “coordinates” are all part of the GeoJSON specification.

Ultimately, anything that supports GeoJSON knows how to read and interpret the different properties in the specification. This means that if you export something as GeoJSON from one tool and import it into another tool, it will be able to recreate those polygons (which is exactly what we’ll do in python).
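Because a GeoJSON feature is plain JSON, any JSON parser can already read it. A minimal Python sketch, using the feature above abbreviated to a four-vertex ring so it fits here:

```python
import json

# One detection feature like the example above, abbreviated to a
# four-vertex ring (the full object is read in exactly the same way)
feature_text = """
{
    "type": "Feature",
    "id": "PathDetectionObject",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[17738.72, 42238], [17737.94, 42240.13],
                         [17739.74, 42250.23], [17738.72, 42238]]]
    },
    "properties": {
        "isLocked": false,
        "measurements": [],
        "classification": {"name": "Other", "colorRGB": -377282}
    }
}
"""

feature = json.loads(feature_text)  # a plain JSON parser is all we need

print(feature["geometry"]["type"])                      # Polygon
print(feature["properties"]["classification"]["name"])  # Other

# GeoJSON polygon rings are closed: the first and last vertex coincide
ring = feature["geometry"]["coordinates"][0]
print(ring[0] == ring[-1])                              # True
```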

Run export

Now, with a good understanding of the GeoJSON format, let’s export the objects. To do so, in QuPath, we select “Automate” and then “Show Script Editor”. In there we paste the contents of our export script from here.

What’s very interesting is that this export function is really only a few lines long.

Set annotationMeasurements = []

getDetectionObjects().each{
    it.getMeasurementList().getMeasurementNames().each{
        annotationMeasurements << it}}

annotationMeasurements.each{ removeMeasurements(qupath.lib.objects.PathCellObject, it);}

I typically clear out the measurements first, because they make the output file significantly larger.

boolean prettyPrint = false
// false results in smaller file sizes and thus faster loading times, at the cost of nice formatting

def gson = GsonTools.getInstance(prettyPrint)
def annotations = getDetectionObjects()

// you can check here but this will be HUGE and take a long time to parse 
//println gson.toJson(annotations) 

Next we get an instance of the GeoJSON exporter, and then get all of our detected objects, which is simply a list of objects.

Note here that prettyPrint is false. This simply means: don’t put in the newline characters and spaces between keys, so the output will all be on a single line. This makes the output file significantly smaller, but it can be set to “true” for easier human viewing.
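The same trade-off is easy to demonstrate with Python’s built-in json module, which offers the equivalent choice between compact and indented output (toy object, hypothetical values):

```python
import json

# A toy detection object standing in for one real QuPath feature
obj = {"type": "Feature",
       "geometry": {"type": "Polygon",
                    "coordinates": [[[0, 0], [0, 1], [1, 1], [0, 0]]]}}

compact = json.dumps(obj, separators=(",", ":"))  # like prettyPrint = false
pretty = json.dumps(obj, indent=4)                # like prettyPrint = true

# the compact form is markedly shorter; the gap grows with object count
print(len(compact), len(pretty))
```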

File file = new File('d:/1L1_nuclei.json')
file.withWriter('UTF-8') {
    gson.toJson(annotations, it)
}

Lastly, we create an output file and export our objects into GeoJSON format. That’s it. It takes at most a few seconds to run, and I’ve successfully done this with upwards of 300,000 cells: not a single problem, works perfectly fine.

Importing GeoJSON into Python

Now the question is: we have this GeoJSON output file, what can we start to do with it? In this case, we would like to import it into Python, apply a DL model I previously trained to classify cells as lymphocyte or not, and save that information back to GeoJSON for later re-importation into QuPath.

My DL model was trained to accept 32 x 32 patches, so essentially all I need to do is go to the center of each of my detected cells, extract a patch, and submit that to the model for classification.
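As a sketch of that patch geometry in plain Python (`patch_bounds` is a hypothetical helper, not part of the supplied code; real code would also clip the bounds to the tile borders):

```python
def patch_bounds(cx, cy, patchsize=32):
    """Return the (x0, y0, x1, y1) pixel bounds of a patchsize x patchsize
    window centered on (cx, cy). A real implementation would additionally
    clip these bounds to the image borders before slicing the tile array."""
    half = patchsize // 2
    x0 = int(round(cx)) - half
    y0 = int(round(cy)) - half
    return x0, y0, x0 + patchsize, y0 + patchsize

# a cell centered at (100.0, 200.0) yields a 32 x 32 window around it
print(patch_bounds(100.0, 200.0))  # (84, 184, 116, 216)
```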

To do this, we’ll use the code supplied here.

The overall workflow should be quite intuitive. We look at each possible “tile” in a whole slide image, and if there are enough objects present to make it worth computing (“minhits”), we load the tile and operate upon it; otherwise we skip it. How do we know which tiles have a sufficient number of objects? We leverage the shapely library, which supports a very rapid search tree for detecting the intersection of objects. More importantly, shapely “speaks” GeoJSON, which means we can load our objects in only a few lines of code!

Here we will look at a few core components of the code.

Firstly, we can load the GeoJSON into a Python dictionary in just a couple of lines of code:

import geojson

with open(json_fname) as f:
    allobjects = geojson.load(f)

Now all objects can be accessed like a list of dictionaries, where each dictionary has the “keys” mentioned above (e.g., geometry)

Next we would like to convert these objects to a list of shapely objects, which we can do with a single list comprehension:

from shapely.geometry import shape

allshapes = [shape(obj["nucleusGeometry"] if "nucleusGeometry" in obj.keys()
             else obj["geometry"]) for obj in allobjects]

This code uses the “nucleusGeometry” key if it is present (which it will be for cell detections), and falls back to the “geometry” key otherwise.

For our specific purposes, we would also like to have a list of the centroids of all these objects, to do the tile detection later on, which now leverages shapely’s “centroid” method:

allcenters = [s.centroid for s in allshapes]

For the last bit, we want to assign ID numbers to the items in the list so we can find them later:

for i in range(len(allshapes)):
    allcenters[i].id = i

Finally, we build a search tree over these centers:

from shapely.strtree import STRtree

searchtree = STRtree(allcenters)

Even with 300,000 objects, all of this should take no more than a few seconds.

The secret sauce: finding interesting tiles to operate on

Here is the secret sauce:

for y in tqdm(range(0, osh.level_dimensions[0][1], round(tilesize * scalefactor)),
              desc="outer", leave=False):
    for x in tqdm(range(0, osh.level_dimensions[0][0], round(tilesize * scalefactor)),
                  desc=f"inner {y}", leave=False):

        tilepoly = Polygon([[x, y], [x + tilesize * scalefactor, y],
                            [x + tilesize * scalefactor, y + tilesize * scalefactor],
                            [x, y + tilesize * scalefactor]])

        hits = searchtree.query(tilepoly)

        if len(hits) < minhits:
            continue

Here we go tile-wise over the entire whole slide image and create a shapely polygon the same size as each tile. Next we query our search tree for anything which intersects that polygon. If there are too few intersections (called hits) in the tile, in this case fewer than 100 cells, then we can skip it completely and immediately go on to the next tile.

As a result, for my large WSI I only ever load from disk the regions that actually contain the points I need.

Notably, over 60% of the computation times I’ve been experiencing are associated with simply loading the tile image data, so avoiding this loading process is a huge efficiency gain.
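To make the tile-skipping logic concrete, here is a simplified pure-Python stand-in, with made-up coordinates and a linear scan in place of shapely’s STRtree:

```python
# Hypothetical cell centers; a real run would use the shapely centroids
centers = [(5, 5), (8, 2), (9, 9),   # three cells land in tile (0, 0)
           (25, 25)]                 # a lone cell lands in tile (20, 20)

tilesize, minhits = 10, 2
width, height = 30, 30

kept = []
for y in range(0, height, tilesize):
    for x in range(0, width, tilesize):
        # count the centers inside this tile (the STRtree does this quickly)
        hits = [c for c in centers
                if x <= c[0] < x + tilesize and y <= c[1] < y + tilesize]
        if len(hits) < minhits:
            continue  # too few cells here: never load this tile from disk
        kept.append((x, y))

print(kept)  # [(0, 0)]: only the cell-rich tile survives the filter
```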

The rest of the code is fairly straightforward: we load the tile if we need to, and create an array for the output based on the size of the input. Then, for each of our hits, we get the x,y coordinates of its center, extract a 32 x 32 patch around it, and push that all to the GPU for classification. Note: in this case, instead of using the model, we simply assign a random class.

Assigning predicted class back to original objects

Once we have the predicted class, assigning it in the GeoJSON object is straightforward:

for id, classid in zip(id_out, classids):
    allobjects[int(id)]["properties"]["classification"] = {
        "name": classnames[classid], "colorRGB": colors[classid]}

This works simply because we know the precise format needed and supported by GeoJSON: we go to each object and assign it the predicted name and the predicted color.

Writing output file

with open(json_annotated_fname, 'w') as outfile:
    geojson.dump(allobjects, outfile)

Now, lastly, because the GeoJSON format is a standard, and we have GeoJSON objects, we can simply use the associated writer to drop all of our objects to file, again in only 2 lines!

If you look closely, you will see that this looks very similar to the original file, except for the fact that now we have this new property called classification, which has a name and RGB color.

Loading back into QuPath

We’re almost done. The last thing that we need to do is now import our annotations.

Now the question is: we have a GeoJSON file with all of our labeled annotations in it, how do we get them back in?

We can use the code here, which again requires us to select “Automate” and then “Show Script Editor” in QuPath.

def gson = GsonTools.getInstance(true)
def json = new File("1L1_nuclei_reg_anno.json").text

// Read the annotations
def type = new com.google.gson.reflect.TypeToken<List<qupath.lib.objects.PathObject>>() {}.getType()
def deserializedAnnotations = gson.fromJson(json, type)
addObjects(deserializedAnnotations)

What’s interesting, again, is that because GeoJSON is a known format, and both Python and QuPath speak GeoJSON, it only takes 5 lines to load this data back in. After running this script, we can now look at our image again, and can see that the cell outlines have changed colors to match their associated classes, and the names we specified now appear in the hierarchy:

Conclusion

That brings us to the end of this workflow. The idea here is that we can annotate things in QuPath and keep them in a nice format which can be exported and imported iteratively. Since our pathologist collaborators tend to be comfortable using QuPath, we can now easily generate data for their review and modification to advance our downstream analyses.

Additional notes

QuPath Types

In this post we discussed using “getDetectionObjects”, which accesses detection objects (in this case, cells). Notably, though, this does not export annotations made by the user, for example the rectangular ROI which delineated where we wanted our cells to be found. If we would like to export these annotation objects, we simply change “getDetectionObjects” to “getAnnotationObjects”. If we are only interested in exporting selected objects, we can change this to “getSelectedObjects”. The rest of the code remains the same!

Colors

The QuPath RGB colors are negative because of the way that Java stores them. There is a discussion about that here, along with the Java documentation here. At the bottom of the Python code there is a key mapping numbers to colors for the existing classes.
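As a quick illustration, the packed values can be decoded in plain Python (a sketch assuming Java’s standard packed-ARGB convention; this helper is not part of the supplied code):

```python
def unpack_colorRGB(packed):
    """Decode a Java-style signed packed ARGB int into an (R, G, B) tuple.
    Java keeps colors in a 32-bit signed int with alpha in the top byte,
    so fully opaque colors come out negative."""
    v = packed & 0xFFFFFFFF  # reinterpret as unsigned 32 bits
    return ((v >> 16) & 0xFF, (v >> 8) & 0xFF, v & 0xFF)

# the "Other" class color from the GeoJSON excerpt earlier in the post
print(unpack_colorRGB(-377282))  # (250, 62, 62)
```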

Masking

The python code supports masking, which isn’t discussed in this post. This functionality essentially “blacks out” the area which is not covered by the polygon. This can often improve DL performance by forcing the classifier to focus only on the particular object of interest and not the surrounding regions. Note though, that the classifier itself must be trained with this functionality in order for it to be usable at test time, otherwise the performance tends to heavily decrease.

Compression of GeoJSON

Using 7z compression, some tests show that GeoJSON files written with prettyPrint=false are compressible to 7.87% of their initial size (~92% reduction). Not surprising, given that the file is entirely plain text.
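A similar effect can be reproduced with Python’s built-in gzip module; the exact ratio depends on the data and the compressor, so don’t expect it to match the 7z figure exactly:

```python
import gzip
import json

# A repetitive toy feature list standing in for a real export
features = [{"type": "Feature",
             "geometry": {"type": "Polygon",
                          "coordinates": [[[i, 0], [i, 1], [i + 1, 1], [i, 0]]]},
             "properties": {"classification": {"name": "Other"}}}
            for i in range(1000)]

raw = json.dumps(features, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)

print(f"compressed to {len(packed) / len(raw):.1%} of the original size")
```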

All code available here!

6 thoughts on “Exporting and re-importing annotations from QuPath for usage in machine learning”

  1. Thank you Andrew for this very useful how-to!
    Scripting in QuPath remains a mystery to me 😉

    I do not completely understand your reasoning behind only classifying tiles with at least 100 cells. Wouldn’t this leave many tiles unclassified and result in a “spotty” lymphocyte classification once re-imported into QuPath?

    Cheers

    1. Thanks Mario! The tiles here are 10,000 x 10,000, so one would expect on the order of a thousand cells to be present in a “rich” area, much more than the 100 limit set here. If for some reason smaller tiles are used (e.g., 500 x 500), one would have to heavily lower this “minhit” limit, otherwise no tiles will be computed upon, resulting in the spotty appearance you mentioned

  2. Hello, I tried running QuPath with your labeled 100×100 pixel data from your lymphocyte detection page, just to check it against your manual segmentation, but it doesn’t seem to work. Is there some minimum image size requirement for the patches?

    Also, do you have a recommendation for the classification part? I don’t have Matlab so I started to “translate” the patch extraction part of that tutorial into Python, but it’s a few years old now, and you have since introduced fully-Python methods with U-net implementation, etc. I was wondering if I’m wasting my time getting the older model running. Could I even classify using traditional ML just with object size and shape data (or would that likely do a terrible job)? Thank you for sharing all this!

    1. Thank you for your questions:
      1) translating the old matlab code is a waste of time in my opinion, using newer deep learning architectures and frameworks (pytorch) is going to make your life a lot easier. in this case modifying the Digital pathology classification using Pytorch + Densenet blog post would be your best bet
      2) You absolutely could use traditional ML, the code itself is well designed for any type of process you would like to take. essentially modifying the contents of this for-loop with your own code (replacing up to line 147) will get you where you want to go
      3) Not sure I understand your first question, this code is designed to work with whole slide images, not smaller patches like those provided in that dataset

      1. Thank you for your reply! I will first try the traditional ML approach then move on to the DL tutorial your pointed to if it performs too poorly.

        As for the first question, my line of thought was to make a labeled training/validation dataset from the 100 patches you provide in your lymphocyte detection blog (http://andrewjanowczyk.com/wp-static/lymphocytes.zip) by individually feeding those images through QuPath (since it will be used for the WSI prediction), getting the outlines of both lymphocyte and non-lymphocytes, and extracting their attributes (area, roundness, color, etc.) to train the lymphocyte classifier. Is there a more straightforward way to build a training/validation set?

        1. for an ML approach, you likely don’t need as much data, and it may be easier to simply manually annotating your own lymphocytes on your slides to train a quick model. the youtube video linked to in this post shows a tutorial on how to go about doing that within qupath, and that may be sufficient to do what you need. keep in mind that simply transitioning between different scanners and different sites creating the images tends to lower overall performance since there is “nuance” that is different, so while you can import this lymphocyte data, i would ask if it looks similar to the data you have, because if not the model simply won’t generalize well.
