Through a series of 4 blog posts, we’ll discuss and provide working examples of how one can use the open-source library Ray to (a) scale computing locally (single machine), (b) distribute computing remotely (multiple machines), and (c) serve deep learning models across a cluster (2 posts on this topic, basic/advanced). Please note that the blog posts in this series increase in difficulty!
This is the second-to-last blog post in the series (the first one here, the second one here), where we will go into greater detail about how we can use Ray Serve to set up a server waiting to respond to our requests for processing. These last two are the most complex blog posts in the series and require some understanding of how HTTP, REST, and web services work. You can find relevant pre-reading here.
Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, Tensorflow, and Keras, to Scikit-Learn models, to arbitrary Python business logic.
Let’s get serving!
Introduction to the idea of “Serving”
Suppose we have a large, pre-trained DL network for segmenting various tissue types on digital pathology images and would intermittently like to generate results from it. An example of this may be a pathologist reviewing an image in their local viewing software. Ideally, we would like to analyze that image in near real time and produce a segmentation result. This implies that we should have a server “waiting” to answer this segmentation request, especially because loading a large DL model onto the GPU takes the lion’s share of the time compared to applying it to a small image.
Effectively, in this scenario, we would like to implement model serving, the process by which DL models hosted on-premises or on cloud-based servers are made available through API endpoints. We can achieve model serving very easily using the Ray Serve library. In a Ray Serve configuration, the pre-trained DL model, such as the one described above, is saved on each worker node (or can be copied over during initialization…but remember, some of these models can be quite large) and loaded into a GPU’s memory before a user calls the model. The head node can then rapidly respond to requests from the same user or multiple users, allowing for scalable DL model deployment. The figure below illustrates the process of Ray Serve, and the full documentation for Ray Serve is available here.
To demonstrate Ray Serve, we will continue using our naive image rotation example from previous blog posts and show how everything can work in a simple local environment. Then we’ll build up in the next blog post to serving a DenseNet Pytorch model, for example from here.
Here we will be using Ray Serve, which is a library in the Ray Framework. There are many other very cool ones worth reading about, including
- Ray Data – the standard way to load and exchange data in Ray libraries and applications.
- Ray Train – scales model training for popular ML frameworks such as Torch, XGBoost, TensorFlow, and more.
- Ray Tune – a Python library for experiment execution and hyperparameter tuning at any scale.
You can install it like so, if you’re not using the docker image (which you really should be using):
pip install "ray[serve]"
To start our Ray docker container, we’ll use a very similar command as in the previous blog post, except this time we’ll also add port 8000 to be forwarded, like so:
docker run --shm-size=2g -t -i -v `pwd`:/data -p8000:8000 -p8888:8888 -p6379:6379 -p10001:10001 -p8265:8265 rayproject/ray
This is because Ray Serve by default will make our REST endpoints available on port 8000. Once that is up and running we’ll again want to install our dependencies:
pip install jupyter jupytext opencv-python-headless
and now our container is ready for us to use! Fire up jupyter and let’s take a look at our basic server! I do this by launching jupyter in the container like so, to enable remote access:
jupyter notebook --ip='*'
And then if things work well, we can connect to jupyter on our local machine using http://localhost:8888
Ray Serve – Basic Example
We’ll be going step by step through our basic example code available here.
To begin, we’ll create an ImageModel class and decorate it to indicate to Ray that we want this class to be “served”:

@serve.deployment(route_prefix="/image_rotate")
class ImageModel:
    def __init__(self):
        # do any initialization here, like loading a model!
        pass

Note the @serve.deployment decorator. The decorator tells Ray Serve to accept REST requests at the specified route_prefix, in this case “/image_rotate”. It further says that ImageModel will be the class responsible for handling those requests.
In Ray, object classes like ImageModel above are called Actors.
To continue, we also define an initialization function; if we had to do any setup work before handling a request (e.g., loading a model), we would do it here. In our most basic use case, we don’t need to set anything up, so we leave it blank.
Next we move on to the bulk of the code which is run each time our REST endpoint is called:
async def __call__(self, starlette_request):
    # -- get bytes from the request and convert it back to an image
    image_payload_bytes = await starlette_request.body()
    img = pickle.loads(image_payload_bytes)
    # -- process the image normally
    img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
    mean_val = img.mean()
    # -- return our results by converting them to a byte stream
    return pickle.dumps((img, mean_val))
When requests are received, the server responds via the __call__ method associated with the Actor. The __call__ method contains all the code to be executed when a request is made. Here, our example __call__ method is essentially broken into three parts: (a) receiving the input data, (b) processing it, and (c) returning our result. We’ll take a look at these in sequence.
The argument passed into the __call__ method is a Starlette Request object. From this object we will want to access the input image. This image has been passed as a byte stream in the body of the Starlette request and thus needs to be converted back to an image (in this case a numpy object) for downstream processing (typical for REST communications, and not specific to Ray). Additionally, if we look at the Ray Serve documentation, the output of the __call__ method must also be a JSON-serializable object or a Starlette Response object, which numpy objects are not.
An easy way of both serializing and deserializing to bytes, used here, is pickle (included in Python) or dill (a better library, but needs to be installed). Using the dumps and loads methods, we can take objects and convert them to and back from byte streams, respectively.
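For instance, here is a quick round trip (a sketch, using a small numpy array as a stand-in for our image):

```python
import pickle

import numpy as np

# a small stand-in "image", bundled with its mean like the (img, mean_val)
# pair our server returns
img = np.arange(12, dtype=np.uint8).reshape(3, 4)
payload = pickle.dumps((img, float(img.mean())))     # objects -> bytes
restored_img, restored_mean = pickle.loads(payload)  # bytes -> objects
```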
So, in short, we have received a byte stream, and converted it back to an image. The next part is the same code as we saw in previous posts, which rotates the image, and computes the mean. Then in the last line, we simply bundle these values into a list and return them. Again, we note that we need a byte stream, so we convert our list object to bytes using dumps.
And that’s it! Really not that bad if you think about it; essentially 2 or 3 additional lines to receive our image, and return it, while the remainder of the code remains the same!
Starting the service!
First, we need to create a Ray cluster, so that we have somewhere to perform the work. Amazingly, this is the same line of code as in previous posts:
Can you guess what happens if, similarly to the previous blog post, you were to point this at a head node of an established cluster? If you said, “it would distribute incoming work across the cluster”, you’re right! So again, scalability from local to highly distributed without any code changes!
Now we have 2 final lines of code and we’re all set. The first one:
Here we actually start the server which will be listening for our requests. We have only a single parameter, http_options, which is a dictionary of options for the HTTP server. In this dictionary, we define host – which is the interface for HTTP servers to listen on. This defaults to “127.0.0.1”, meaning we won’t be able to access it remotely, so to expose our REST endpoint publicly, you probably want to set this to “0.0.0.0”, which will accept requests to any of the server’s IP addresses.
And then our second line of code:
Which takes our ImageModel class, and actually deploys it at the associated REST endpoint we specified in the decorator above.
So, as a recap: we defined an Actor to listen on a specific REST endpoint, started a Ray cluster, and deployed that Actor. Now when we access it (we’ll show you how below), we expect Ray to invoke the __call__ method of our Actor to do, and then return, some work. Let’s see!
Using the Service
Nearly there! So how do we actually connect to this service, via python, send it our image, and receive the associated result back?
Amazingly, this is only a few lines of code.
First we load the image:
imagedata = cv2.cvtColor(cv2.imread("09-322-02-1.bmp"), cv2.COLOR_BGR2RGB)
Nothing special here, and we can see our image:
Now we want to send this image to our server, we use the python Requests library to do this:
Note that we’re connecting to our (local) server on port 8000, which is why we opened this port up when starting docker. We use a POST command, which allows us to send (large amounts of) additional data with our request (in this case, our image). Again, we convert it to a byte stream using pickle “dumps”, so that later we can pickle “loads” it.
And that’s it!
Once we run this cell, the image goes from our jupyter notebook instance over to the head node where it is delegated for processing by our Ray cluster. Note that this command blocks until the server responds (or times out).
After this is done, our result value is now stored in the resp variable, which is a requests response object.
resp_output = pickle.loads(resp.content)
To get our return data, we access its content property, and (again) deserialize it using pickle “loads”. We can confirm that this list contains both our rotated image and mean_value!
(returned_img, returned_mean_val) = resp_output
plt.imshow(returned_img)
print(returned_mean_val)
In total, setting up Ray Serve to handle requests required 5 steps:
- Initialize Ray and connect the Python script to the Ray cluster.
- Create an HTTP server for each actor (class) using the @serve.deployment decorator.
- For each actor, create a __call__ method to handle incoming requests.
- Start the model serving process via serve.start().
- With actor.deploy(), deploy the actor as an HTTP server onto the head node.
Afterward, we can communicate with our server using python Requests by serializing and deserializing our input/output.
In the next and last blog post, we’ll see how we can use this approach to save us significant amounts of time when deploying models!
The code for this example is available here
Special thanks to Thomas DeSilvio and Mohsen Hariri for their help in putting this series together!