Interoperating Deep Learning models with ONNX.jl

Flux [17] is a machine learning framework written in the numerical computing language Julia [4]. The framework makes writing layers as simple as writing mathematical formulae, and its automatic differentiation (AD) system, Zygote [11], computes the derivatives needed to train a model. Flux makes heavy use of Julia's language and compiler features to carry out code analysis and optimisations. For example, Julia's GPU compilation support [3] can be used to JIT-compile custom GPU kernels for model layers [19]. Flux also supports a number of hardware options, from CPUs and GPUs to TPUs via XLA.jl, which compiles Julia code to XLA: an advanced compiler for linear algebra that can greatly optimize speed and memory usage in large deep learning models. ONNX.jl is an Open Neural Network Exchange backend for the Flux.jl deep learning framework. ONNX.jl supports directly importing high quality ONNX standard models into Flux, saving time and reducing the need for additional computation resources. This paper introduces ONNX.jl and explains how it fits into the bigger picture: how the Julia language, specifically Flux.jl and ONNX.jl, can be used as a starting point for high quality transfer learning of large deep learning models.


Introduction
The Julia language was introduced to solve the two-language problem: in simple words, languages that are simple to write (high-level) are very slow, while those that are difficult to use (low-level) are much faster. This is because most high-level languages weren't designed to process large amounts of data, so engineers, researchers and developers have a hard time writing high performance software in them. At the moment, the common practice is to write the core of the software in a low-level language (C/C++/Fortran) and wrap it in a high-level language (Python). This results in optimized performance and ease of use. The Julia language aims to make the best of both worlds: it provides a high-level syntax but manages to perform as fast as C (sometimes even faster).

Flux.jl is a library for implementing machine learning models, written completely in the Julia programming language. At the heart of Flux.jl lies Zygote.jl: a source-to-source automatic differentiation (AD) library that makes full use of the Julia compiler to generate the backward pass during the training phase of a neural network, with complete support for control flow, recursion, closures and data structures. Implementing models in Flux.jl is as simple as writing regular Julia code: a model is written as the formulae that define it, and Zygote.jl computes the derivatives seamlessly. Flux.jl also supports other hardware options through external packages such as CuArrays.jl and CLArrays.jl. CuArrays is written completely in Julia, making the implementation of GPU kernels very simple, and running a model on a GPU is hassle-free: it is as simple as calling a few functions to transfer data to the GPU. Flux.jl also supports running models on Google's Tensor Processing Unit (TPU), which enables very fast linear algebra computation. Running Flux models on TPUs is possible through XLA.jl, which compiles Julia code to XLA.
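As a minimal illustration of this workflow (the layer sizes here are arbitrary), a dense layer and its gradients can be written as plain Julia code:

```julia
using Flux

# A dense layer is just its mathematical formula: y = relu.(W * x .+ b)
W = randn(Float32, 5, 10)
b = zeros(Float32, 5)
layer(x) = relu.(W * x .+ b)

loss(x) = sum(layer(x))
x = randn(Float32, 10)

# Zygote (re-exported by Flux as `gradient`) derives the backward pass
# directly from this code; no manual derivative definitions are required.
grads = gradient(() -> loss(x), Flux.params(W, b))
```

The same `gradient` call works unchanged for models containing control flow or recursion, since Zygote differentiates the Julia code itself.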
The FluxML ecosystem provides a number of supporting packages that add functionality (apart from the aforementioned Flux.jl, Zygote.jl and XLA.jl), some of them being:
- ONNX.jl: Open Neural Network eXchange backend for Flux.jl.
- Metalhead.jl [12]: Simple plug-and-play pretrained Flux.jl computer vision models.
- Torch.jl [7]: Exposes Torch tensor types in Julia.
- IRTools.jl [13]: Provides an IR format that is easy to manipulate.
- FluxJS.jl [18]: Runs Flux models in the browser, via tensorflow.js.
- model-zoo [15]: A collection of implementations of various deep learning models in Flux.

Open Neural Network eXchange (ONNX)
Open Neural Network Exchange (ONNX) [2] is an open ecosystem that empowers AI developers to choose the right tools as their project evolves. ONNX provides an open source format for AI models, both deep learning and traditional machine learning. ONNX defines the computation graph for a deep learning model along with the various operators used in the model. It provides one set of specifications to convert a model to the basic ONNX format, and another set of specifications to recover the model from this ONNX form. At a high level, ONNX is designed to allow framework interoperability. There are many excellent machine learning libraries in various languages: PyTorch [2], TensorFlow [1], MXNet [5], and Caffe [20] are just a few that have become very popular in recent years, but there are many others as well. Machine learning models can be converted to the serialized ONNX format, which can then be run on a number of devices. ONNX Runtime is an inference engine, written in C++, used to deploy ONNX format models into production. It works on diverse hardware and supports both deep learning and traditional machine learning models.

Where does ONNX come in?
ONNX is a format for representing deep learning models so that they can be run on numerous devices without worrying much about the implementation. This helps researchers, developers and engineers focus on the problem at hand without worrying much about the peripherals, such as which framework to use or whether a model trained in that framework can run on specialized hardware. ONNX is usable anywhere from small mobile devices to large server farms, across chipsets and vendors, and with extensive runtime and tooling support. ONNX reduces the friction of moving trained AI models among your favorite tools, frameworks and platforms. A simple example of how ONNX helps is the deployment of large deep learning models. Consider the simple case of deploying a deep learning model to an iOS application. The model may have been implemented in any framework: TensorFlow, PyTorch or MXNet, to name a few. However, iOS applications expect to use CoreML. Up until now, developers have been porting large models between frameworks by hand, which is a waste of time and energy better spent elsewhere, and often means retraining the entire model from scratch, which isn't efficient. This makes the entire process cumbersome and impractical. ONNX exists to solve this very problem: by connecting the common dots between frameworks, ONNX makes it possible to convert a model from framework A to framework B, saving time and removing the need to train the model again.

Since ONNX tries to inherit properties from diverse frameworks, serialized ONNX models can be large and complicated. While there are a number of complex generated data structures, three of them are essential to understanding how data is stored internally:
- ModelProto: A very high-level struct that holds all the information. ONNX models are read directly into this structure.
- GraphProto: This structure captures the entire computation graph of the model.
- NodeProto and TensorProto: Information regarding individual nodes in the graph (inputs, outputs and finer attributes) and the weights associated with the nodes.

ModelProto
The ModelProto structure holds all the information needed to load a model. Internally, it stores data such as the version information, model version, docstring, producer details and, most importantly, the computation graph. An ONNX model, once read using ProtoBuf.jl, is loaded into this ModelProto object before the graph details are extracted. Naturally, at the heart of this is the graph::GraphProto attribute that stores the computation graph of the model.
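The generated structure looks roughly like the following. This is a simplified sketch whose field names follow the ONNX protobuf schema; the actual definition is generated by ProtoBuf.jl and contains more fields:

```julia
# Simplified sketch of the ProtoBuf-generated struct; field names follow
# the ONNX protobuf schema, but the real definition has more fields.
mutable struct ModelProto
    ir_version::Int64
    producer_name::String
    producer_version::String
    model_version::Int64
    doc_string::String
    graph::GraphProto   # the computation graph of the model
end
```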

GraphProto
The GraphProto structure stores information about the nodes in the graph. This includes the node metadata, name, inputs, outputs and the pre-trained parameters held in the initializer attribute.
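A simplified sketch of GraphProto, again following the ONNX protobuf schema rather than the exact generated code:

```julia
# Simplified sketch; the real ProtoBuf-generated struct has more fields.
mutable struct GraphProto
    node::Vector{NodeProto}           # operators, in topological order
    name::String
    initializer::Vector{TensorProto}  # the pre-trained parameters
    input::Vector{ValueInfoProto}
    output::Vector{ValueInfoProto}
end
```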

TensorProto
This is the main structure that holds the raw model parameters.
For example, in convolutional layers, the weights associated with the kernel are available as dense vectors in the raw_data attribute. During graph traversal, these weights are extracted and reshaped according to the shape that is available as a node attribute.
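The essential fields and the reshaping step can be sketched as follows. Here `to_array` is a hypothetical helper, not the actual ONNX.jl function, and Float32 data is assumed:

```julia
# Simplified sketch of TensorProto, following the ONNX protobuf schema.
mutable struct TensorProto
    dims::Vector{Int64}      # shape of the tensor
    data_type::Int32         # element type tag (1 corresponds to Float32)
    raw_data::Vector{UInt8}  # flat byte buffer holding the weights
end

# Hypothetical helper: reinterpret the byte buffer as Float32 values and
# reshape according to the stored dimensions. The dims are reversed since
# ONNX stores tensors row-major while Julia arrays are column-major.
function to_array(t::TensorProto)
    data = reinterpret(Float32, t.raw_data)
    reshape(data, reverse(t.dims)...)
end
```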

This is most of the information needed to build the model. In the next section, we discuss how we use DataFlow.jl to traverse this graph and extract the model parameters and other relevant information.

Graph operations via DataFlow.jl
Once we have the entire model data present as a ModelProto object, the next step is to traverse the computation graph, capture all the operations performed in it, and simultaneously map them to the corresponding Flux operators.
DataFlow.jl [16] is a code intermediate representation format, representing Julia code as an expression graph. It provides functions for graph restructuring, even on cyclic graphs, and graphs can then be used to generate Julia expressions. It can be used efficiently to traverse our ModelProto.graph object. During this traversal, we want to map ONNX operators to Flux layers and functions. In DataFlow.jl, this is equivalent to creating a new vertex for the required operator and calling it with the appropriate Flux functions, which are inferred from the ONNX operator itself. As an example, consider the simple case of the Relu operator in ONNX. Relu is a commonly used activation function in neural networks that can be expressed as relu(x) = max(0, x). Once the graph has been traversed, the model weights are collected into a dictionary; a serialization package [14] can be used to store and load such structures, our dictionary containing the model weights being one of them.
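The mapping can be sketched as a table from ONNX operator names to Julia functions. The table and helper below are illustrative, not the actual ONNX.jl source:

```julia
# Illustrative operator table: ONNX op name => Julia/Flux function.
const ops = Dict{String,Symbol}(
    "Relu"   => :relu,
    "Add"    => :+,
    "MatMul" => :*,
)

# During traversal each NodeProto becomes a Julia call expression, which
# DataFlow.jl represents as a vertex in its expression graph.
to_call(op::String, args...) = Expr(:call, ops[op], args...)

to_call("Relu", :x)   # :(relu(x))
```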

Interface and Design
At a top level, ONNX.jl provides a minimal interface for the user; it is just a tool for loading ONNX format models. Once the model and weights files have been successfully generated, the model can be used as regular Flux code.

ONNX.load_model generates the required model and weights files. Internally, it carries out all the graph operations mentioned above. The resulting model can be treated as any other Flux model.
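A sketch of the top-level usage, with an illustrative file name; the exact function names may differ across versions of the package:

```julia
using Flux, ONNX

# Reads the serialized ONNX file and emits model.jl (the generated Flux
# code) together with weights.bson (the trained parameters).
ONNX.load_model("squeezenet.onnx")

weights = ONNX.load_weights("weights.bson")
model   = include("model.jl")   # an ordinary Flux model

# `model` can now be called on input data like any hand-written model.
```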
The significant advantage that ONNX.jl provides is that it treats a compiler problem as a graph problem. It generates the Flux code for the model, which makes it very easy and intuitive to use the same model for further applications, such as fine-tuning or replacing existing layers for some other use case. This is ideal for applications such as neural style transfer, where it is very common to use a slightly modified pre-trained network as a starting point. The generated code can also be helpful for finer debugging of the model. Overall, the entire process runs from the serialized ONNX file, through graph traversal, to the generation of the model and weights files. Additionally, ONNX.jl provides helper functions for inspecting a model before loading it: ONNX.layers reads an ONNX file and returns a list of all the layers in the model. With the growing interest in more complicated and deeper models, an ONNX model may contain layers that Flux itself doesn't support. To handle these, ONNX.jl leaves a hook for users to implement the additional functionality. A hook is a function that doesn't have an existing implementation: the user has to write an implementation for it themselves. However, any operator that has a corresponding implementation in Flux is completely recognized by ONNX.jl at the moment.

Usage Scenarios
The ONNX format and ONNX.jl can be used for transfer learning in Flux, where knowledge captured while training a model on one task is reused for another. The idea is that rather than randomly initializing the parameters of a neural network before training, it is better to start from an already trained model, since this leads to faster convergence. In transfer learning, we take a pretrained model and train it on another dataset, which might even have a different class distribution. Fine-tuning is an approach to transfer learning where we train on a subset of the training data with a smaller learning rate.
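A hypothetical fine-tuning sketch in Flux: here `pretrained` stands for a model imported via ONNX.jl, and the feature width of 2048 is an assumption for illustration:

```julia
using Flux

# Keep the pretrained feature extractor, and replace the final
# classifier with a fresh head for a new 10-class task.
body  = pretrained[1:end-1]      # Chain indexing returns a sub-Chain
head  = Dense(2048, 10)          # 2048 is an assumed feature width
model = Chain(body, head, softmax)

# Fine-tune with a small learning rate; restricting the trainable
# parameters to the new head freezes the pretrained layers.
opt = Descent(1e-4)
ps  = Flux.params(head)
```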
Transfer learning has shown tremendous results in image classification, object detection, simulations, and sentiment and other NLP-based classification in the recent past. It is also common in tasks such as neural style transfer, where we want to change the style of an image in accordance with the style of another image. Generative Adversarial Networks (GANs) have been shown to deliver high quality results when trained on top of a pre-trained model. StyleGAN [21], for example, can use a pre-trained model to train a custom model that delivers high quality super-resolution results.

Related Work
In recent times, several projects have come up that solve a similar issue. One of the most notable is TensorFlow's MLIR (Multi-Level Intermediate Representation) [24]. MLIR is an evolution of LLVM [23] that defines a common Intermediate Representation (IR) format which can be used to represent any dataflow graph. This common format unifies machine learning models written in TensorFlow or other frameworks. Other noteworthy approaches in this direction are PFA and NNVM. PFA [27], or Portable Format for Analytics, is a common language that aims at easing the transition from development to production. It can be expressed within the common JSON format and has functionality such as control structures, loops and user-defined functions. NNVM [25] is an end-to-end compiler for AI frameworks that aims at solving the challenges posed by the diversity of machine learning frameworks. It consists of two major components: NNVM (Neural Network Virtual Machine) and TVM (Tensor Virtual Machine) [6]. NNVM defines a common computation-graph intermediate representation format, and TVM implements the operators used in these computation graphs while optimizing them for the backend hardware.

Future Work
As ONNX.jl becomes the starting point for various researchers interested in using Julia for their research, it is important to note that it also has certain shortcomings. The most significant is that a model can't be completely loaded unless there is an equivalent implementation of every operator in Flux.jl. An example of this is grouped convolutions. These variants of convolutional layers were used in AlexNet [22] and showed impressive results. However, since Flux doesn't support them at the moment, users need to have an implementation ready if they choose to import an ONNX model with this particular layer into Flux using ONNX.jl. On the plus side, most of the commonly used layers are available in Flux and can be used readily. Another thing to note is that running ONNX.jl generated code on other hardware might need a little restructuring; the model should work directly on the CPU. Another challenge moving forward is that ONNX.jl needs constant updates to support Flux's latest API changes. The same applies the other way round: as ONNX operators are updated, the corresponding implementations in ONNX.jl have to be updated to support newer models using the updated specifications. Over the long run, we need to constantly keep an eye out for such changes and adapt ONNX.jl to them. Moreover, there are subtle differences in the way Flux implements operators compared to other frameworks. As an example, consider the simple AveragePool layer. Flux implements this by padding the input tensor appropriately and then performing the pooling operation; Keras/TensorFlow, however, pools first and then pads. This leads to subtle changes along the edges of the output tensor. Such differences stem from the way frameworks deal with these layers, and the only way to avoid them is to check for such discrepancies.
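The padding-order discrepancy can be seen in a toy 1-D example with a width-2 mean pool and one zero of padding on each border (pure Julia, purely illustrative):

```julia
# Width-2, stride-2 mean pooling on a vector.
meanpool2(v) = [(v[i] + v[i+1]) / 2 for i in 1:2:length(v)-1]
pad(v) = [0.0; v; 0.0]   # one zero on each border

x = [2.0, 4.0, 6.0, 8.0]

flux_style  = meanpool2(pad(x))   # pad, then pool: [1.0, 5.0, 4.0]
keras_style = pad(meanpool2(x))   # pool, then pad: [0.0, 3.0, 7.0, 0.0]
```

The two orders disagree exactly along the edges of the output, which is the kind of discrepancy one has to check for when comparing imported models against the original framework.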
In the recent past, DataFlow.jl has been superseded by another intermediate representation tool: IRTools.jl. It provides the ability to work with both lowered and typed Julia code and can be used together with metaprogramming tools such as Cassette.jl. There has also been some discussion about splitting ONNX.jl into two packages: one would handle code generation and the DataFlow-related functions, while the other would be solely responsible for the implementation of the ONNX operators. This would give greater control and ease when implementing layers or debugging a loaded model, and should make it straightforward to implement operators wherever they are missing. For the moment, all of this continues to be done by a single package.

Conclusion
Developing ONNX.jl has been a tremendous learning experience. From studying intermediate representation formats for deep learning models with millions of parameters to loading such models in just a couple of lines of code, ONNX.jl has made it very easy and straightforward to use a high quality trained model as a starting point for many projects. One such example is the DenseNet-121 model: a deep convolutional network with multiple convolutional, dense and pooling blocks. Naturally, implementing this in any framework is a challenging task. Thanks to ONNX, however, we can use an earlier implementation to import the model into another framework; importing this model (trained in Caffe2) via ONNX.jl into Flux takes three lines of code. Several other large computer vision models were loaded from the ONNX format while the package was under active development. Most of these have been added to FluxML/Metalhead.jl for direct plug-and-play use, including:
- SqueezeNet [10]
- DenseNet 121 [9]
- ResNet 50 [8]
- GoogleNet [28]
ONNX.jl serves as an entry point for people who want to use Flux for their research but also want quick results. It combines the power of the Julia language, the elegance of Flux and the availability of a vast number of pre-trained models. This enables researchers to spend their time focusing on the real issues, rather than on model portability.