**Intel Domain Leader: Shai Fine**

We plan to develop an open-source library for large-scale distributed training of deep networks that is:

1) optimized for IA (Xeon, Xeon-Phi),

2) based on an open-source cluster computing framework for data analytics (Spark, Hadoop).

The research project will:

(1) Employ advanced ML concepts such as distributed learning and improved deep learning architectures.

(2) Include new, advanced, and demanding deep-learning-based use cases.

__The Projects__

- Optimal Deep Learning and the Information Bottleneck Principle
- SimNets: A Generalization of Convolutional Networks
- Rigorous Algorithms for Distributed Deep Learning
- Mega-Class Efficient Deep Learning
- Outlier Robust Distributed Learning + Learning Deep Forward Models for Reinforcement Learning
- Unsupervised and Semi-supervised Ensemble Learning
- Distributed Deep Learning on Xeon-Phi
- Distributed Methods for Non-Convex and Deep Learning
- Scene Understanding: from Image to Text and from Image and a Question to an Answer
- Applications of Deep Learning to Medical Imaging
- Image Restoration using Deep Learning

__Optimal Deep Learning and the Information Bottleneck Principle__

**Academia Researcher(s):** Prof. Naftali Tishby, Hebrew University

**Research Project Summary:**

Deep Neural Networks (DNNs) and Deep Learning (DL) algorithms are attracting unprecedented attention, as they currently perform better than most other ML methods on various real-world applications, from speech and vision to NLP and computational biology. Yet there is little theoretical understanding of DNNs. In particular, we are crucially missing theoretically motivated design principles (architecture: number of layers, connectivity, desired features, etc.), useful bounds on information/sample and computational complexities, and provably efficient DL algorithms. Moreover, there is a complete lack of interpretability of DNN models: what do they learn? What do the layers/units represent? What characterizes good problem domains for DNNs? Why do convolutional NNs work so well, and what generalizes this principle? As all engineers know, there is nothing more practical than a good theory, and we are completely lacking one in this important domain.

The Information Bottleneck (IB) method was introduced long ago as an information-theoretic principle for extracting the relevant information in one random variable (X) with respect to another variable (Y), given their joint distribution, P(X,Y), or a sample from this distribution. The method has been applied successfully to various supervised and unsupervised ML problems, from text categorization to neuroscience and cognition, but its main appeal is its principled theoretical foundation. It provides a natural extension of the concept of minimal sufficient statistics and an algorithm for calculating them from empirical data. Optimal neural networks (or other supervised ML methods) should – in principle – act as an information bottleneck, namely, extract optimal minimal features (minimal sufficient statistics) of the input variables (X) that enable optimal prediction of the output variable (Y).
We have recently shown an exact correspondence between DNN and the IB problem, which explains the emergence of neural layers, their number and optimal architecture, as phase transitions in the Information Bottleneck optimal curve. In this work we would like to further exploit this correspondence and suggest new theoretical bounds and better deep learning algorithms.
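As a reference point, the standard IB formulation trades compression of X against preservation of information about Y; the optimal curve mentioned above is traced by varying the trade-off parameter β:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Here T is the compressed representation (the analogue of a hidden layer); larger β preserves more information about Y at the cost of less compression of X.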

__SimNets: A Generalization of Convolutional Networks__

**Academia Researcher(s):** Prof. Amnon Shashua, Hebrew University

**Participating Student(s):** Nadav Cohen

**Research Project Summary:**

We propose a deep layered architecture that generalizes classical convolutional neural networks (ConvNets). The architecture, called SimNets, is driven by two operators, one being a similarity function whose family contains the inner-product operator on which ConvNets are based, and the other being a new soft max-min-mean operator called MEX that realizes classical operators like ReLU and max pooling but has additional capabilities that take SimNets far beyond ConvNets. Two interesting properties emerge from the architecture: (i) the fundamental constructions that are analogous to the multilayer perceptron, LeNet, and Network In Network are all kernel machines of different types, and (ii) networks may be initialized through a natural unsupervised scheme that carries with it the potential for automatically learning architectural parameters. Experiments demonstrate the capability of SimNets to achieve accuracy comparable to ConvNets with networks that are almost an order of magnitude smaller.
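To make the MEX operator concrete, here is a minimal sketch of its soft max-min-mean behavior (omitting the learned offsets of the full operator; the parameter name `beta` is ours):

```python
import numpy as np

def mex(x, beta):
    """Soft max-min-mean: (1/beta) * log(mean(exp(beta * x))).
    beta -> +inf gives max, beta -> -inf gives min, beta -> 0 gives the mean."""
    x = np.asarray(x, dtype=float)
    if beta == 0:
        return x.mean()                      # the beta -> 0 limit
    z = beta * x
    m = z.max()                              # shift for numerical stability
    return (m + np.log(np.exp(z - m).mean())) / beta

print(mex([1.0, 2.0, 3.0], 200.0))   # ~3 (max-like)
print(mex([1.0, 2.0, 3.0], -200.0))  # ~1 (min-like)
print(mex([1.0, 2.0, 3.0], 1e-8))    # ~2 (mean-like)
```

A single parameter thus interpolates between pooling and averaging behaviors, which is what lets MEX subsume classical operators.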

The expected significance of this program is very high. The use of deep layered learning in visual understanding, speech recognition, and natural language processing is gaining momentum and achieving great success in beating public benchmarks against classical machine learning techniques like kernel machines and boosting. However, the empirical success of ConvNets is mainly fueled by the ever-growing scale of available computing power and training data, with algorithmic advancements making a secondary contribution. Second, deep learning is almost exclusively supervised, meaning that large labeled training sets are required, unlike human visual processes, which can to a large extent extract meaning from unsupervised data. Although there were attempts to use unsupervised learning to initialize ConvNets, it has since been observed that these schemes have little to no advantage over carefully selected random initializations that do not use data at all. One of the main advantages of SimNets is initialization using unsupervised learning; this has the potential of requiring much less supervised data for training the network than traditional ConvNets.

**Publications**

- Nadav Cohen, Or Sharir, Amnon Shashua; Deep SimNets, CVPR’16
- Nadav Cohen, Or Sharir, Amnon Shashua; On the Expressive Power of Deep Learning: A Tensor Analysis, COLT’16
- Nadav Cohen, Amnon Shashua; Convolutional Rectifier Networks as Generalized Tensor Decompositions, ICML’16
- Nadav Cohen, Amnon Shashua; Inductive Bias of Deep Convolutional Networks through Pooling Geometry, arXiv

__Rigorous Algorithms for Distributed Deep Learning__

**Academia Researcher(s):** Prof. Shai Shalev-Shwartz, Hebrew University

**Research Project Summary:**

The project deals with new distributed algorithms for training deep networks. Our approach is based on theoretical analysis of the pitfalls of existing methods, which leads to the design and analysis of alternative approaches. In particular, we underscore several problems of state-of-the-art approaches, such as high variance of the update step, low probability of updating on rare events, and the vanishing gradient problem. We propose new practical algorithms that explicitly tackle these problems and show their advantage in the context of distributed training.
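As context for the pitfalls listed above, the baseline such algorithms improve on is data-parallel training, in which each worker computes a gradient on its data shard and the updates are averaged. A minimal simulated sketch on a least-squares problem (not the project's proposed algorithms; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic least-squares problem split across K simulated workers
K, n, d = 4, 400, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)
shards = np.array_split(np.arange(n), K)

def worker_grad(w, idx):
    """Least-squares gradient computed on one worker's data shard."""
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

w = np.zeros(d)
lr = 0.1
for step in range(200):
    grads = [worker_grad(w, idx) for idx in shards]  # run in parallel in practice
    w -= lr * np.mean(grads, axis=0)                 # averaged update

print(np.linalg.norm(w - w_true))  # distance to the true weights: small
```

On a convex problem like this the averaged update converges cleanly; the issues the summary names (update variance, rare events, vanishing gradients) are what break this picture for deep networks.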

**Publications**

- Shai Shalev-Shwartz, Yonatan Wexler; Minimizing the Maximal Loss: How and Why?
- Alon Gonen, Shai Shalev-Shwartz; Faster SGD Using Sketched Conditioning
- Alon Gonen, Dan Rosenbaum, Yonina Eldar, Shai Shalev-Shwartz; The Sample Complexity of Subspace Learning with Partial Information
- Shai Shalev-Shwartz; SDCA without Duality, Regularization, and Individual Convexity
- Elad Hazan, Kfir Y. Levy, Shai Shalev-Shwartz; On Graduated Optimization for Stochastic Non-Convex Problems

__Mega-Class Efficient Deep Learning__

**Academia Researcher(s):** Prof. Koby Crammer, Technion

**Research Project Summary:**

Our goal is to develop new models, and algorithms that learn these models from annotated data. We aim to build systems that can tag an input with one (single-label) or more (multi-label) categories out of hundreds of thousands of possible classes. For example, a document may be about a presidential visit to a soccer match, and thus be annotated as both sports and politics. Typical real-world annotation problems, such as tagging topics in Wikipedia or annotating images on the web, may involve label sets of this size. We plan to find a mapping of all possible labels into a joint space.
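The label-mapping idea can be sketched as follows: once labels are embedded in a joint space, scoring an input against hundreds of thousands of classes reduces to inner products with the label vectors, and multi-label tagging is just taking the top-scoring labels. All names and dimensions below are illustrative, and the embeddings are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, L = 20, 8, 1000                 # input dim, joint-space dim, label count

E = rng.normal(size=(L, k))           # one vector per label (learned in practice)
W = rng.normal(size=(d, k))           # input-to-joint-space projection (learned)

def predict(x, top=3):
    """Rank all labels by inner-product similarity in the joint space."""
    z = x @ W                         # project the input into the joint space
    scores = E @ z                    # similarity against all L label vectors
    return np.argsort(scores)[::-1][:top]

labels = predict(rng.normal(size=d))
print(labels)                         # indices of the top-matching labels
```

The appeal at mega-class scale is that the label structure lives in the k-dimensional joint space rather than in L separate classifiers.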

We expect our research to advance several areas: core machine learning in classification, both single- and multi-label; document categorization and image annotation; and learning with missing information (semi-supervised learning, which a good learned mapping makes possible).

Successful research will benefit Intel and the industry, as it will make it possible to build systems that annotate the very large amounts of data often collected in real-world applications. Such huge-scale automatic annotation capabilities can be a first step towards business analytics systems that process big data.

**Koby Crammer – Publications**

- Tor Lattimore, Csaba Szepesvari, Koby Crammer; Learning Optimal Resource Allocation with Semi-Bandit Feedback, UAI 2014 (Best Paper Runner-up Award)

__Outlier Robust Distributed Learning + Learning Deep Forward Models for Reinforcement Learning__

**Academia Researcher(s):** Prof. Shie Mannor, Technion

**Participating Student(s):** Oran Reichman

**Research Project Summary:**

We consider distributed machine learning in the presence of outliers. Many of the learning algorithms (such as neural networks and support vector machines) used for classification, regression, and structure learning are extremely brittle to data points that are substantially different from the bulk of the data. Having even a few samples that are deliberately made to be “hard” can confuse such learning algorithms. In this research project we plan to develop a framework for parallelizing machine learning algorithms in the presence of outliers.
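The project's framework is not specified here, but a standard way to limit the influence of outliers when aggregating updates from distributed workers is a coordinate-wise trimmed mean; a minimal sketch (names and data are illustrative):

```python
import numpy as np

def trimmed_mean(grads, trim=1):
    """Coordinate-wise trimmed mean: drop the `trim` smallest and largest
    values in each coordinate before averaging, bounding outlier influence."""
    G = np.sort(np.asarray(grads), axis=0)      # sort workers per coordinate
    return G[trim:len(grads) - trim].mean(axis=0)

# five honest workers roughly agree; one outlier tries to derail the update
honest = [np.array([1.0, -2.0]) + 0.01 * np.random.default_rng(i).normal(size=2)
          for i in range(5)]
outlier = np.array([1e6, 1e6])
agg = trimmed_mean(honest + [outlier], trim=1)
print(agg)   # close to [1, -2]; the outlier is discarded
```

A plain mean of the same six updates would be dominated by the single adversarial point, which is exactly the brittleness the summary describes.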

**Shie Mannor – Publications**

- Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, Shie Mannor. “Ensemble Robustness of Deep Learning Algorithms” arXiv:1602.02389

__Unsupervised and Semi-supervised Ensemble Learning__

**Academia Researcher(s):** Prof. Boaz Nadler, Weizmann Institute

**Participating Student(s):**

Ariel Jaffe

Omer Dror

**Research Project Summary:**

In a variety of applications, (big-) data practitioners and end users have large unlabeled test sets, and little or even no labeled data. With the availability of free machine learning packages they can easily compute the predictions of many different classifiers on their data. How should these end users, typically not knowledgeable in machine learning, decide which classifier is best suited to their data?

The goal of this proposal is to develop novel unsupervised and semi-supervised ensemble learning methods suitable for such scenarios. Our methods would make it possible, in a principled manner, to estimate the accuracies of the different classifiers, point the practitioner to the classifier most suitable for his/her dataset, and construct more accurate ensemble learners. In contrast to classical supervised learning, the main novelty is that we propose to do these tasks with little or even no labeled data. The main idea is to build upon recent work by the PI and collaborators that proposed a simple spectral approach to tackle such problems. Specifically, we plan to generalize our prior work in three important directions: (i) add a model of instance difficulty; (ii) develop semi-supervised ensemble methods; (iii) develop methods that detect strongly correlated yet inaccurate classifiers, and consequently construct improved ensemble learners. These extensions should significantly advance the state of the art in this emerging field. Moreover, we anticipate that our algorithms will be implemented in machine learning packages and used by practitioners worldwide.
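A small simulation illustrates the spectral idea behind the prior work referenced above: for conditionally independent binary classifiers and balanced labels, the off-diagonal covariance of their predictions is approximately rank one, so a leading eigenvector recovers the (scaled) accuracies without seeing any labels. This is a sketch of the principle, not the published estimator; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3000, 7                          # unlabeled instances, classifiers
y = rng.choice([-1, 1], size=n)         # hidden truth, used only to simulate
acc = np.linspace(0.55, 0.9, m)         # simulated per-classifier accuracies
flips = rng.random((m, n)) < (1 - acc)[:, None]
Z = np.where(flips, -y, y).astype(float)  # m x n matrix of predicted labels

# For conditionally independent classifiers and balanced y, the off-diagonal
# covariance is rank one: cov(Z_i, Z_j) = (2*acc_i - 1)(2*acc_j - 1), i != j.
C = np.cov(Z)
np.fill_diagonal(C, 0.0)                # diagonal does not follow the model
evals, evecs = np.linalg.eigh(C)
v = evecs[:, -1] * np.sign(evecs[:, -1].sum())
est = np.sqrt(max(evals[-1], 0.0)) * v  # approximate estimate of 2*acc - 1
est_acc = (est + 1) / 2

print(np.round(est_acc, 2))             # tracks the true accuracies in `acc`
```

The estimated accuracies then rank the classifiers and can weight their votes, all without labeled data.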

**Boaz Nadler – Publications**

- Unsupervised Ensemble Learning with Dependent Classifiers, A. Jaffe, E. Fetaya, B. Nadler, T. Jiang, Y. Kluger, to appear in AISTATS, 2016

__Distributed Methods for Non-Convex and Deep Learning__

**Academia Researcher(s):**

__Prof. Ohad Shamir, Weizmann Institute
Prof. Natan (Nati) Srebro, TTI__

**Participating Student(s):**

Yossi Arjevani

Itai Safran

Behnam Neyshabur

Jialei Wang

Liat Peterfreund

**Research Project Summary:**

The main objective of this project is to understand and develop distributed methods for non-convex and deep learning problems. This lies at the intersection of two of the most significant trends in current machine learning research: deep learning systems, which have recently led to breakthrough performance improvements across several difficult AI tasks, and scalable methods that can be distributed and parallelized across many computing cores. This is a highly ambitious goal: we do not fully understand how to perform distributed learning even on problems much “nicer” than deep learning, and we understand even less about how to do deep learning in a principled manner. We plan to attack this by building on our previous work in distributed convex learning and in deep learning, going through a spectrum of intermediate learning settings that may be easier to study (some of them interesting and with applicative potential in themselves), eventually leading to methods and algorithms for distributed deep learning. Given the high importance of these topics in both academia and industry, this project has the potential for significant impact.

**Ohad Shamir – Publications**

- Ohad Shamir; Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity, submitted to ICML 2016
- Ohad Shamir; Convergence of Stochastic Gradient Descent for PCA, submitted to ICML 2016
- Yossi Arjevani, Ohad Shamir; Communication Complexity of Distributed Convex Learning and Optimization, NIPS 2015
- Itay Safran, Ohad Shamir; On the Quality of the Initial Basin in Overspecified Neural Networks, submitted to ICML 2016
- Ronen Eldan, Ohad Shamir; The Power of Depth for Feedforward Neural Networks, to be submitted to COLT 2016
- Ohad Shamir; A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate, ICML 2015

**Nati Srebro – Publications**

- Jialei Wang, Mladen Kolar, Nathan Srebro; Distributed Multi-Task Learning
- Martin Takác, Peter Richtárik, Nathan Srebro; Distributed Mini-Batch SDCA (submitted for publication)
- Behnam Neyshabur, Ryota Tomioka, Nathan Srebro; Norm-Based Capacity Control in Neural Networks, COLT (Conference on Learning Theory) 2015
- Behnam Neyshabur, Ruslan Salakhutdinov, Nathan Srebro; Path-SGD: Path-Normalized Optimization in Deep Neural Networks, Advances in Neural Information Processing Systems (NIPS 2015)

__Distributed Deep Learning on Xeon-Phi__

**Academia Researcher(s):** Prof. Mark Silberstein, Technion

**Participating Student(s):** Jonathan Ezroni

**Research Project Summary:**

We plan to develop a generic distributed CNN training framework on Intel Xeon-Phi accelerators. Our ultimate goal is to enable training very large CNNs on a multi-node cluster of Xeon-Phi accelerators interconnected via an InfiniBand network. This work has two primary thrusts:

(1) acceleration and parallelization of the highly popular Caffe CNN training framework on Xeon-Phi, and

(2) developing a native distributed system on Xeon-Phi processors, and optimizing the network and system stack to achieve high performance.

Accelerating Caffe on Xeon-Phi will push the boundaries of what is possible today with CNNs. If successful, it will significantly boost CNN training speed and will promote the use of the Xeon-Phi processor in an application domain that is currently experiencing unprecedented growth and levels of interest. Further, building an all-native distributed system on Xeon-Phi will help gain insights into developing a complete native application on massively parallel processors, which is a critical step toward successful adoption of future generations of Intel’s Knights Landing processor.

__Scene Understanding: from Image to Text and from Image and a Question to an Answer__

**Academia Researcher(s):** Prof. Lior Wolf, Tel Aviv University

**Participating Student(s):**

Ben Klein

Guy Lev

Sivan Keret

Yossi Biton

Michael Rotmann

**Research Project Summary:**

We address the scene understanding task: given an image, we generate a textual description of it. Since the textual description can be non-specific, we add the ability to ask the system questions about the image; the system then produces specific answers. This was long conceived as a huge leap forward in AI, until it recently came within reach. We plan to keep using deep learning tools such as Convolutional Neural Networks, Recurrent Neural Networks, and neural word embeddings, as well as computer vision tools such as the Fisher Vector and new variants of it, and statistical tools such as Canonical Correlation Analysis.
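Canonical Correlation Analysis, the last tool mentioned, finds linear projections of paired image and text features that are maximally correlated, giving a joint space for matching images to descriptions. A minimal sketch on synthetic paired features (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, di, dt, k = 500, 12, 9, 3        # pairs, image dim, text dim, shared dims

# paired image/text features driven by a common latent scene representation
latent = rng.normal(size=(n, k))
Ximg = latent @ rng.normal(size=(k, di)) + 0.1 * rng.normal(size=(n, di))
Xtxt = latent @ rng.normal(size=(k, dt)) + 0.1 * rng.normal(size=(n, dt))

def cca(X, Y, k, reg=1e-6):
    """Linear CCA: projection pairs maximizing correlation between the views."""
    n = len(X)
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)                    # per-view whiteners
    Ly = np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)                     # SVD of whitened cross-cov
    A = np.linalg.solve(Lx.T, U[:, :k])             # image-side projections
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])          # text-side projections
    return A, B, s[:k]

A, B, corrs = cca(Ximg, Xtxt, k)
print(np.round(corrs, 3))   # top canonical correlations, high for this data
```

Projecting `Ximg @ A` and `Xtxt @ B` places both modalities in one space where nearest-neighbor search performs the image-to-text matching.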

**Lior Wolf – Publications**

- G. Lev, G. Sadeh, B. Klein, L. Wolf. RNN Fisher Vectors for Action Recognition and Image Annotation. arXiv:1512.03958, 2015.
- D. Gadot, L. Wolf. PatchBatch: a Batch Augmented Loss for Optical Flow. arXiv:1512.01815, 2015. CVPR 2016

__Applications of Deep Learning to Medical Imaging__

**Academia Researcher(s):** Prof. Hayit Greenspan, Tel Aviv University

**Participating Student(s):**

Idit Diamant

Avi Ben-Cohen

Ofer Geva

**Research Project Summary:**

The project deals with applications of DL to the domain of medical imaging. Several tasks will be explored, including medical classification tasks, in which combining non-medical training with medical-image fine-tuning will be explored, and medical detection tasks, in which pathologies will be learned voxel-by-voxel using novel DL techniques. The medical domains include chest pathologies in X-ray imagery and liver lesions in CT imagery. The overall goal is big data that combines a large set of imagery with descriptive text.

**Hayit Greenspan – Publications**

- Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and Hayit Greenspan, “Chest Pathology Identification using Deep Feature Selection with Non-Medical Training,” to appear in Computer Methods in Biomechanics and Biomedical Engineering Vol. S
- H. Greenspan, B. Van Ginneken, R. Summers, “Guest Editorial Deep Learning in Medical Imaging: Overview and Future Promise of an Exciting New Technique”, IEEE Transactions on Medical Imaging, Volume: 35, Issue: 5, 1153-1159, 2016
- Alan Bekker, Hayit Greenspan and Jacob Goldberger, Multi-view deep learning architecture for classification of breast microcalcifications, ISBI, 2016

__Image Restoration using Deep Learning__

**Academia Researcher(s):**

__Prof. Michael Zibulevsky, Technion
Prof. Miki Elad, Technion__

**Research Project Summary:**

In our research we plan to apply deep neural networks to image restoration in Compressed Sensing and Computed Tomography. As a result, we expect to increase restoration quality for a given number of measurements, or, alternatively, to reduce the number of measurements required for a given image quality. In computed tomography, for example, this leads to a significant reduction in radiation dose, and hence in patient cancer risk.
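For context, the classical sparsity-based recovery that a learned network would complement or replace solves the standard compressed-sensing problem (with Φ the measurement operator, Ψ a sparsifying transform, and λ a regularization weight):

```latex
\hat{x} \;=\; \arg\min_{x} \; \tfrac{1}{2} \, \| \Phi x - y \|_2^2 \;+\; \lambda \, \| \Psi x \|_1
```

Improving the quality-versus-measurements trade-off of this recovery is what enables the reduced CT dose mentioned above.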

We also propose two new concepts in neural network architecture, double-sparse weights and super-neurons, which should increase the computational and energy efficiency and the robustness of a network and can be used in a variety of applications beyond image restoration.

**Michael Zibulevsky – Publications**

- Elad Richardson, Rom Herskovitz and Michael Zibulevsky, “SESOP for Neural Networks – Progress Report”, Feb 2016.
- J. Sulam, B. Ophir, M. Zibulevsky, M. Elad, “Trainlets: Dictionary Learning in High Dimensions,” IEEE Transactions on Signal Processing