As we test a migration to new deep learning frameworks, one of the open questions was dataset interoperability. Essentially, we want to be able to create a dataset for training a deep learning framework from as many applications as possible (python, matlab, R, etc.), so that our students can use a language that is familiar to them, as well as leverage all of the existing in-house code we have for data manipulation.
Since the most popular DL frameworks (keras, tensorflow, pytorch) are all python based, using a library like PyTables would be ideal, since it allows for storing large matrices in a single file on disk (improving IT management… try looking in a directory containing 100k small files!), while also allowing for random access for reading rows, without having to load the entire dataset.
In this post, we look at how to write small image patches from Matlab into an HDF5 file (the backend of PyTables), then load and manipulate the corresponding file in python. We finish by using pytorch to build a DataLoader.
As mentioned, the backend of PyTables is HDF5, which has modest support in Matlab. Offline, I wrote images to a file using Matlab's hdf5 functions and read them into PyTables to verify that the two are compatible, which they are. Below is the final, successful version of the code.
First we need to create the different “tables” which will be available in our dataset, this is done using h5create:
- wsize = 65; % patch height/width
- db = 'DB_train_1.h5';
- h5create(db,'/data',[Inf wsize wsize 3], 'ChunkSize',[1 wsize wsize 3], 'Datatype','single', 'Deflate',6, 'Shuffle',true);
- h5create(db,'/label',[Inf 1], 'ChunkSize',[1 1], 'Datatype','single', 'Deflate',6, 'Shuffle',true);
- h5create(db,'/mean',[65 65 3]);
From a high level we can see that we're going to save 3 tables. The first is the data (e.g., image patches), the second is the label for those images (for supervised learning), and lastly, while we're storing the images, we keep a running mean image, which we save here as well. Later on this can be used to zero-center the images in real time during training/testing.
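The Matlab loop that accumulates the running mean isn't shown above, but the idea is just an incremental mean update as each patch arrives. A minimal Python sketch of that update (the function name and toy data are my own):

```python
import numpy as np

def update_running_mean(mean, count, patch):
    """Fold one new patch into a running mean image incrementally."""
    count += 1
    mean = mean + (patch - mean) / count
    return mean, count

# accumulate a mean image over a stream of random 65x65x3 "patches"
mean = np.zeros((65, 65, 3), dtype=np.float64)
count = 0
rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(10, 65, 65, 3)).astype(np.float64)
for p in patches:
    mean, count = update_running_mean(mean, count, p)

# the incremental mean matches the batch mean of all patches
assert np.allclose(mean, patches.mean(axis=0))
```

This avoids holding all patches in memory at once, which matters when the dataset is hundreds of thousands of patches.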
The data definitions here use an underlying EArray for storage, which allows for "infinite" expansion. Let's use data as an example and look at each of the parameters:
- h5create(db,'/data',[Inf wsize wsize 3], 'ChunkSize',[1 wsize wsize 3], 'Datatype','single', 'Deflate',6, 'Shuffle',true);
- db: we need to specify the name of the file
- ‘/data’: this is the node that the table will be stored under
- [Inf wsize wsize 3]: we need to specify the exact shape of the data which we intend to store in this array for it to be infinitely extensible. This requires specifying a single dimension as "Inf", which is the one that will be appended to. The other 3 dimensions specify our patch size, which in this case is 65 x 65 x 3
- ‘Datatype’,’single’: this specifies that we want the data stored as single precision, which in this case is suitable since our source images are uint8, i.e., in [0,255]
- ‘ChunkSize’,[1 wsize wsize 3]: from the documentation: "The shape of the data chunk to be read or written in a single HDF5 I/O operation. Filters are applied to those chunks of data." Essentially, a chunk specifies the minimum amount of data which needs to be read from or written to disk in a single operation (more on this later).
- ‘Deflate’,6: this specifies that we would like to use compression of level 6, on a scale of 0 (no compression) to 9 (most compression). Compression is a type of filter, which is applied per chunk. As a result, the larger your chunk size, the larger a "packet" of data will be compressed together. Note that this compression is 'deflate' (i.e., zlib), as that is one of the 2 types that matlab seems to support (the other being blosc); pytables can write many others, but matlab won't be able to read them.
- ‘Shuffle’,true: block-oriented compressors like GZIP or LZF work better when presented with runs of similar values. Enabling the shuffle filter rearranges the bytes within each chunk, which may improve the compression ratio. It is lossless and carries no significant speed penalty.
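For comparison, the same three arrays can be declared from the python side with PyTables directly; the following is a sketch where the file name is illustrative and the chunkshape/Filters arguments mirror the Matlab parameters above:

```python
import numpy as np
import tables

wsize = 65
# zlib level 6 with byte shuffling, matching 'Deflate',6 and 'Shuffle',true
filters = tables.Filters(complevel=6, complib='zlib', shuffle=True)

with tables.open_file("DB_train_demo.h5", mode='w') as db:
    # extensible axis is the 0-length one; grows as rows are appended
    db.create_earray(db.root, 'data',
                     atom=tables.Float32Atom(),
                     shape=(0, wsize, wsize, 3),
                     chunkshape=(1, wsize, wsize, 3),
                     filters=filters)
    db.create_earray(db.root, 'label',
                     atom=tables.Float32Atom(),
                     shape=(0, 1),
                     chunkshape=(1, 1),
                     filters=filters)
    # mean image is fixed-size, so a plain array suffices
    db.create_array(db.root, 'mean',
                    np.zeros((wsize, wsize, 3), dtype=np.float32))
```

This is useful if some of your data-producing code lives in python rather than Matlab; both routes produce HDF5 files readable from either side.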
Now that we have the hdf5 tables setup, we can write information to it. This will depend on how you organize your data, but a general approach is to fill up a matrix and then write it to disk:
- npatches = 100;
- next_spot = 1; % first empty row in the table
- img_writer = double(zeros(npatches, wsize, wsize, 3)); % fill with patches here
- h5write(db,'/data',uint8(img_writer),[next_spot 1 1 1],size(img_writer));
Note that we use "next_spot" to indicate that we want to start writing after the last row already present in the table.
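On the PyTables side, the bookkeeping of next_spot disappears entirely, since EArray.append always writes past the last row; a sketch (file name is illustrative):

```python
import numpy as np
import tables

wsize, npatches = 65, 100
with tables.open_file("DB_append_demo.h5", mode='w') as db:
    data = db.create_earray(db.root, 'data',
                            atom=tables.UInt8Atom(),
                            shape=(0, wsize, wsize, 3),
                            chunkshape=(1, wsize, wsize, 3),
                            filters=tables.Filters(complevel=6, shuffle=True))
    # append a whole block of patches; the extensible axis grows on its own
    img_writer = np.zeros((npatches, wsize, wsize, 3), dtype=np.uint8)
    data.append(img_writer)
```

Repeated calls to append simply keep extending the first dimension, which is the same behavior the Matlab code achieves manually with next_spot.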
Initially, I thought it would be more beneficial to write larger chunks to the disk, in particular the size of the # of patches from a particular image, under the assumption that they would be compressed better. Later on, reading more documentation [link 1, link 2], it became evident that to read a single patch out later, the entire chunk needs to be loaded and decompressed, a rather time-consuming process. As a result, I produced this table showing HDF5 creation time and storage size for about 178,250 patches:
| h5create options | File size (bytes) | Creation time (s) |
| --- | --- | --- |
| ‘ChunkSize’,[npatches wsize wsize 3], ‘Datatype’,’single’ | 32067912581 | 538.940382 |
| ‘ChunkSize’,[npatches wsize wsize 3], ‘Datatype’,’single’, ‘Deflate’,0, ‘Shuffle’,true | 32072807119 | 1332.031098 |
| ‘ChunkSize’,[npatches wsize wsize 3], ‘Datatype’,’single’, ‘Deflate’,6, ‘Shuffle’,true | 11712717535 | 3445.999905 |
| ‘ChunkSize’,[npatches wsize wsize 3], ‘Datatype’,’single’, ‘Deflate’,9, ‘Shuffle’,true | 11375891055 | 34677.11641 |
| ‘ChunkSize’,[1 wsize wsize 3], ‘Datatype’,’single’, ‘Deflate’,6, ‘Shuffle’,true | 8034604232 | 1330.615406 |
| ‘ChunkSize’,[1 wsize wsize 3], ‘Datatype’,’single’, ‘Deflate’,9, ‘Shuffle’,true | 7973965600 | 4632.042284 |
Essentially there is a time cost for compressing the data (CPU cycles take time, of course), but ultimately using a chunk size of a single patch and compression level 6 results in significant data size savings (8GB vs 32GB) at an acceptable time cost (1330s vs 538s). This is of course configurable; smaller compression levels will have faster write times. Most important is to verify that the data can be loaded quickly later on in python! Those times are not shown here, but when set up incorrectly (e.g., a large chunk size, which requires loading and decompressing the whole chunk to read one patch), the time consequence is very large.
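A rough sketch of how one might measure that read-time effect from python: write the same patches with the two chunk shapes and time random single-patch reads. The sizes here are toy values and the file names are my own:

```python
import time
import numpy as np
import tables

wsize, n = 65, 200
patches = np.random.rand(n, wsize, wsize, 3).astype(np.float32)

for fname, chunk in [("chunk_single.h5", (1, wsize, wsize, 3)),
                     ("chunk_whole.h5", (n, wsize, wsize, 3))]:
    with tables.open_file(fname, 'w') as db:
        arr = db.create_earray(db.root, 'data',
                               atom=tables.Float32Atom(),
                               shape=(0, wsize, wsize, 3),
                               chunkshape=chunk,
                               filters=tables.Filters(complevel=6, shuffle=True))
        arr.append(patches)

def time_random_reads(fname, trials=25):
    """Time reading single patches at random positions in the file."""
    idx = np.random.randint(0, n, trials)
    with tables.open_file(fname, 'r') as db:
        t0 = time.perf_counter()
        for i in idx:
            _ = db.root.data[i]  # each read decompresses that patch's chunk
        return time.perf_counter() - t0

print("1-patch chunks:   ", time_random_reads("chunk_single.h5"))
print("whole-array chunk:", time_random_reads("chunk_whole.h5"))
```

With the whole-array chunk, every single-patch read forces the library to decompress the full chunk, which is the pathology described above.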
Now that we have a dataset in hdf5, we need to load it into python, which is a simple 2 liner:
- import tables
- hdf5_file = tables.open_file("DB_train_1.h5", mode='r')
and then we can access our data, which will load only the rows needed (this is a good resource which goes deeper into this):
- hdf5_file.root.data.shape
- (3, 65, 65, 632500)
Note that the shape is reversed from what we originally put in: (npatches, height, width, channels) vs (channels, height, width, npatches). This is because Matlab stores arrays column-major while HDF5 and numpy are row-major, so the dimensions appear in the opposite order when read from python.
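Concretely, with that reversed layout, slicing a single patch on the last axis already yields CHW order, and a transpose recovers HWC when needed. A sketch using a toy array standing in for the real table:

```python
import numpy as np

# stand-in for hdf5_file.root.data, with the reversed (C, H, W, N) layout
data = np.zeros((3, 65, 65, 10), dtype=np.float32)

patch_chw = data[:, :, :, 0]              # (3, 65, 65): pytorch's CHW layout
patch_hwc = patch_chw.transpose(1, 2, 0)  # (65, 65, 3): HWC, e.g. for matplotlib
```

So no extra transpose is needed before handing patches to pytorch.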
Now if we want to use this as part of a data loader, we build a dataset class:
- import tables
-
- class Dataset(object):
-     def __init__(self, fname, transform=None):
-         self.fname = fname
-         self.transform = transform
-         self.data = None
-         self.label = None
-         with tables.open_file(fname, 'r') as db:  # get length, then close
-             self.nitems = db.root.data.shape[-1]
-     def __getitem__(self, index):
-         # opening should be done in __init__, but the open file handle
-         # doesn't survive the fork into DataLoader worker processes,
-         # so we lazily open in each worker on first access
-         if self.data is None:
-             db = tables.open_file(self.fname, 'r')
-             self.data = db.root.data
-             self.label = db.root.label
-         item = self.data[:, :, :, index]  # already CHW, per the note above
-         item_new = item
-         if self.transform is not None:
-             item_new = self.transform(item)
-         return item_new, self.label[:, index], item
-     def __len__(self):
-         return self.nitems
There are a few take-aways here. Since pytorch dataloaders spawn worker processes, the open file handle to the pytable is not transmitted in a way suitable for multiprocessing. As a result, we modify the design so that __getitem__ opens a local dataset handle if one is not already defined (on the first iteration in each worker); otherwise it uses the existing one. In the end, we simply extract the image from the table and return it in Channel x Height x Width (CHW) format, which is what pytorch uses internally.
Building a data loader for usage is then trivial:
- from torch.utils.data import DataLoader
-
- data_train = Dataset("DB_train_1.h5")
- data_train_loader = DataLoader(data_train, batch_size=batch_size,
-                                shuffle=True, num_workers=8, pin_memory=True)
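Putting the pieces together, here is an end-to-end toy run of the same pattern: build a tiny file with the reversed (C, H, W, N) layout, wrap it in the dataset class, and batch it with a DataLoader. All sizes and file names are toy stand-ins, and num_workers=0 keeps the demo single-process:

```python
import numpy as np
import tables
import torch
from torch.utils.data import DataLoader

wsize, nitems = 8, 16  # toy sizes

# build a tiny file with the (C, H, W, N) layout described above
with tables.open_file("toy_train.h5", 'w') as db:
    db.create_array(db.root, 'data',
                    np.random.rand(3, wsize, wsize, nitems).astype(np.float32))
    db.create_array(db.root, 'label', np.zeros((1, nitems), dtype=np.float32))

class Dataset(object):
    def __init__(self, fname, transform=None):
        self.fname = fname
        self.transform = transform
        self.data = None
        with tables.open_file(fname, 'r') as db:
            self.nitems = db.root.data.shape[-1]
    def __getitem__(self, index):
        if self.data is None:  # lazily open per worker process
            db = tables.open_file(self.fname, 'r')
            self.data, self.label = db.root.data, db.root.label
        return self.data[:, :, :, index], self.label[:, index]
    def __len__(self):
        return self.nitems

loader = DataLoader(Dataset("toy_train.h5"), batch_size=4,
                    shuffle=True, num_workers=0)
imgs, labels = next(iter(loader))  # imgs is a (4, 3, 8, 8) float tensor
```

The default collate function stacks the numpy patches into torch tensors, so each batch arrives in the N x C x H x W layout pytorch expects.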