Many interesting problems in machine learning are being revisited with new deep learning tools. For graph-based semi-supervised learning, a recent important development is graph convolutional networks (GCNs), which nicely integrate local vertex features and graph topology in the convolutional layers. Although the GCN model compares favorably with other state-of-the-art methods, why it works and what its limitations are remain unclear.
In our paper,
Li et al. (AAAI 2018), Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning
we develop deeper insights into the GCN model and address its fundamental limits. First, we show that the graph convolution of the GCN model is actually a special form of Laplacian smoothing, which is the key reason why GCNs work, but it also brings potential concerns of oversmoothing with many convolutional layers. Second, to overcome the limits of the GCN model with shallow architectures, we propose both co-training and self-training approaches to train GCNs. Our approaches significantly improve GCNs in learning with very few labels, and exempt them from requiring additional labels for validation. Extensive experiments on benchmarks have verified our theory and proposals.
Semi-supervised Classification Problem
A graph is represented by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of $n$ vertices and $\mathcal{E}$ is the set of edges. The graph structure is encoded in the adjacency matrix $A \in \mathbb{R}^{n \times n}$ with degree matrix $D = \mathrm{diag}(d_1, \dots, d_n)$, $d_i = \sum_j a_{ij}$, and each vertex carries a feature vector, collected row-wise in the feature matrix $X \in \mathbb{R}^{n \times c}$.
Suppose that the vertices are divided into two mutually exclusive subsets: a small labeled set $\mathcal{V}_l$ and a large unlabeled set $\mathcal{V}_u$. The goal of graph-based semi-supervised classification is to predict the labels of the vertices in $\mathcal{V}_u$ using the graph structure, the vertex features, and the few labels available on $\mathcal{V}_l$.
Graph Convolutional Networks
Graph Convolutional Networks were proposed by Kipf & Welling (ICLR 2017). The model consists of the following steps:
- Construct the convolutional filter from the adjacency matrix $A$ by 1) adding a self-loop to each vertex, which gives a new adjacency matrix $\tilde{A} = A + I$ and a new degree matrix $\tilde{D} = D + I$, and 2) symmetrically normalizing it to obtain the filter $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.
- Define the propagation rule for the graph convolutional layer as
  $$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right),$$
  where $H^{(l)}$ is the matrix of activations in the $l$-th layer with $H^{(0)} = X$, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $\sigma$ is the activation function, e.g., $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. The graph convolution is defined by multiplying the input of each layer with the renormalized adjacency matrix $\hat{A}$ from the left, i.e., $\hat{A} H^{(l)}$. The convolved vertex representations are then fed into a standard fully-connected layer.
- Stack two such layers and apply a softmax function on the output features to produce the prediction matrix
  $$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right),$$
  and train the model using the cross-entropy loss over the labeled instances (a minimal code sketch of this forward pass is given below).
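To make the propagation rule concrete, here is a minimal NumPy sketch of the two-layer forward pass. The function names, the toy graph, and the random weights `W0`, `W1` are placeholders for illustration, not the paper's actual setup or training procedure.

```python
import numpy as np

def renormalized_adjacency(A):
    """Build the GCN filter A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    d_tilde = A_tilde.sum(axis=1)                # degrees of the augmented graph
    D_inv_sqrt = np.diag(d_tilde ** -0.5)
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = renormalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0)            # first graph convolution + ReLU
    logits = A_hat @ H @ W1                      # second graph convolution
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # row-wise softmax

# Toy usage with hypothetical sizes: 4 vertices, 3 features, 8 hidden units, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
Z = gcn_forward(A, X, rng.normal(size=(3, 8)), rng.normal(size=(8, 2)))
```

In a real model the weights would of course be learned by minimizing the cross-entropy loss on the labeled vertices.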
The key to the model is the graph convolution, i.e., the left multiplication by the renormalized adjacency matrix $\hat{A}$.
To understand why GCNs work so well, we compare them with the simplest fully-connected networks (FCNs), whose layer-wise propagation rule is
$$H^{(l+1)} = \sigma\left(H^{(l)} W^{(l)}\right),$$
i.e., the only difference from GCNs is that FCNs do not use the graph structure.
| Classification accuracy | FCN | GCN |
|---|---|---|
| 1-layer | 53.08% | 70.79% |
| 2-layer | 55.92% | 79.83% |
We will show in the next section that this convolution is actually a special form of Laplacian smoothing. The convolution computes a new representation for each vertex that is smoother, i.e., closer to the representations of its neighbors, than the original one. Since vertices in the same cluster tend to be densely connected, the smoothing makes their representations similar, which makes the subsequent classification task much easier. As the table above shows, applying the smoothing only once already leads to a huge performance gain.
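As a quick illustration of this effect, the following sketch applies the filter $\hat{A}$ once to a hypothetical two-cluster toy graph (not data from the paper) and checks that vertices in the same cluster move closer together.

```python
import numpy as np

# Hypothetical toy graph: vertices {0,1,2} form one cluster, {3,4,5} another,
# with a single edge (2,3) bridging the two clusters.
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

A_tilde = A + np.eye(6)                        # add self-loops
D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # renormalized adjacency

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                    # random vertex features
X_smooth = A_hat @ X                           # one graph convolution (no weights)

def mean_intra_cluster_dist(F):
    pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
    return np.mean([np.linalg.norm(F[i] - F[j]) for i, j in pairs])

print(mean_intra_cluster_dist(X))         # distance before smoothing
print(mean_intra_cluster_dist(X_smooth))  # typically much smaller after smoothing
```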
Laplacian Smoothing
How can we smooth a curve that consists of a series of discrete points $x_1, x_2, \dots, x_n$? A natural idea is to move each point towards the average of its neighboring points on the curve.
*(Figure: smoothing a discrete curve.)*
We can now view the curve as a graph with an adjacency matrix $A$, where $a_{ij} = 1$ if points $i$ and $j$ are neighbors on the curve and $a_{ij} = 0$ otherwise. Laplacian smoothing updates each point as
$$\hat{x}_i = (1 - \gamma)\, x_i + \gamma \sum_j \frac{a_{ij}}{d_i}\, x_j .$$
Here $d_i = \sum_j a_{ij}$ is the degree of vertex $i$, and $0 < \gamma \le 1$ controls the strength of the smoothing. In matrix form this is
$$\hat{X} = X - \gamma D^{-1} L X ,$$
where $L = D - A$ is the graph Laplacian, which is where the name Laplacian smoothing comes from.
If we let $\gamma = 1$, i.e., replace each point by the weighted average of its neighbors, apply the smoothing to the self-loop-augmented graph $\tilde{A} = A + I$, and use the symmetrically normalized Laplacian $\tilde{D}^{-1/2} \tilde{L} \tilde{D}^{-1/2}$ in place of $D^{-1} L$, we obtain
$$\hat{X} = X - \tilde{D}^{-1/2} \tilde{L} \tilde{D}^{-1/2} X = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X = \hat{A} X ,$$
which is exactly the graph convolution in GCNs. Hence the graph convolution of the GCN model is a special form of Laplacian smoothing.
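As a sanity check, here is a small NumPy sketch, on a hypothetical random graph rather than anything from the paper, verifying numerically that the smoothing formula with $\gamma = 1$ and the symmetrically normalized Laplacian of the self-loop-augmented graph reproduces the GCN filter $\hat{A}X$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical random undirected graph with 8 vertices and 5-dimensional features.
n = 8
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                 # symmetric, no self-loops yet
X = rng.normal(size=(n, 5))

A_tilde = A + np.eye(n)                        # self-loop-augmented adjacency
D_tilde = np.diag(A_tilde.sum(axis=1))
L_tilde = D_tilde - A_tilde                    # Laplacian of the augmented graph
D_inv_sqrt = np.diag(np.diag(D_tilde) ** -0.5)

# Laplacian smoothing with gamma = 1 and the symmetrically normalized Laplacian.
smoothed = X - D_inv_sqrt @ L_tilde @ D_inv_sqrt @ X
# GCN graph convolution: A_hat X.
convolved = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X

print(np.allclose(smoothed, convolved))        # True: the two coincide
```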
Summary
Understanding deep neural networks is crucial for realizing their full potential in real applications. Our paper contributes to the understanding of the GCN model and its application to semi-supervised classification. Our analysis reveals not only the mechanisms but also the limitations of the GCN model: applying Laplacian smoothing many times mixes the features of vertices across the graph and makes them indistinguishable, which is the over-smoothing problem of GCNs with many layers. We also derive solutions from our analysis, the co-training and self-training approaches, to overcome these limits. See our paper for more details. If you have any questions, please email me.