Karan Shah
Advanced Scientific Computing
Georgia Institute of Technology
Spring 2020
Code at: https://github.com/karanprime/ASC/tree/master/Project
Neural networks (NNs) have been shown to be good surrogate models for the Schrodinger equation for relatively simple 2D systems [1]. These models can predict the ground-state energy of a single electron to within chemical accuracy. An exact equivalence has been derived between infinitely wide deep neural networks and Gaussian processes (GPs) [2], so certain kernels can be used to approximate multilayer NNs.
In this project, I explore the use of Gaussian processes to predict the ground-state energy of 2D one-electron systems, given an image of the electrostatic potential. I also try to reproduce the results of [1] and compare the performance of convolutional NNs and GPs with different kernels.
This section consists of a brief overview of Schrodinger's equation with the specific potential systems used, and of GPs with details about the different kernels.
Finding the ground state energy involves solving the following eigenvalue problem: $$\hat{H} \psi \equiv(\hat{T}+\hat{V}) \psi=\varepsilon \psi$$ where $\hat{H}$ is the Hamiltonian operator, the sum of the kinetic energy operator $\hat{T}$ and the potential energy operator $\hat{V}$. The wavefunction $\psi$ is the eigenvector and $\varepsilon$ is the eigenvalue. We are looking for the ground state energy $\varepsilon_0$, the smallest eigenvalue, whose corresponding eigenvector is the ground-state wavefunction $\psi_0$.
We focus on the time-independent Schrodinger equation (TISE). The kinetic energy operator is given by: $$\hat{T} = -\frac{\hbar^2}{2m_e}\nabla^2$$
Since we are studying 2D single-electron systems, the potential energy $V(x,y)$ is a function of the position of the electron and of the system under consideration.
The simple harmonic oscillator (SHO) potentials are given by: $$V(x, y)=\frac{1}{2}\left(k_{x}\left(x-c_{x}\right)^{2}+k_{y}\left(y-c_{y}\right)^{2}\right)$$ where $k_x$ and $k_y$ are force constants and $c_x$ and $c_y$ are translations from the origin.
The ground-state energy of the SHO can be determined analytically: $$\varepsilon_0=\frac{\hbar}{2 \sqrt{m}}\left(\sqrt{k_{x}}+\sqrt{k_{y}}\right)$$
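For reference, in atomic units ($\hbar = m_e = 1$) this label can be computed directly from the force constants. A minimal sketch (the function name is illustrative):
import numpy as np

def sho_ground_state_energy(kx, ky, hbar=1.0, m=1.0):
    # epsilon_0 = (hbar / (2 sqrt(m))) * (sqrt(kx) + sqrt(ky))
    return 0.5 * hbar / np.sqrt(m) * (np.sqrt(kx) + np.sqrt(ky))

# Example: kx = ky = 1 a.u. gives epsilon_0 = 1 hartree = 1000 mHa
print(sho_ground_state_energy(1.0, 1.0))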
The potentials for Infinite Well(IW)/particle in a box system are given by: $$ V(x, y)=\left\{\begin{array}{ll} 0 & \frac{1}{2}\left(2 c_{x}-L_{x}\right)<x \leq \frac{1}{2}\left(2 c_{x}+L_{x}\right) \text { and } \\ & \frac{1}{2}\left(2 c_{y}-L_{y}\right)<y \leq \frac{1}{2}\left(2 c_{y}+L_{y}\right) \\ \infty & \text { otherwise } \end{array}\right. $$ where $L_x$ and $L_y$ are the dimensions of the box.
The potentials for the double-well inverted Gaussian (DIG) system are given by: $$ \begin{aligned} V(x, y) &=-A_{1} \exp \left[-\left(\frac{x-c_{x_{1}}}{k_{x_{1}}}\right)^{2}-\left(\frac{y-c_{y_{1}}}{k_{y_{1}}}\right)^{2}\right] \\ &-A_{2} \exp \left[-\left(\frac{x-c_{x_{2}}}{k_{x_{2}}}\right)^{2}-\left(\frac{y-c_{y_{2}}}{k_{y_{2}}}\right)^{2}\right] \end{aligned} $$ It is the superposition of two inverted 2D Gaussians with amplitudes $A_1$ and $A_2$ respectively.
We also experimented with random potentials, created by Gaussian smoothing of random binary grids. No analytical solution exists for this system; details of the generation procedure are given in the data generation section.
%%html
<img src="plots/four_pots.png" style="height:300px" align="middle">
<p style="text-align:center;">Figure 1: Sample potentials for different systems[1]</p>
The joint probability distribution for the observed (training) target values $\mathbf{y}$ at training locations $X$ and the predicted (test) target values $\mathbf{f}_*$ at test locations $X_*$ is: $$ \left[\begin{array}{l} \mathbf{y} \\ \mathbf{f}_{*} \end{array}\right] \sim \mathcal{N}\left(\mathbf{0},\left[\begin{array}{cc} K(X, X)+\sigma_{n}^{2} I & K\left(X, X_{*}\right) \\ K\left(X_{*}, X\right) & K\left(X_{*}, X_{*}\right) \end{array}\right]\right) $$ where $K_{p,q} = k(x_p,x_q)$ is the covariance between two points $x_p$ and $x_q$. The choice of the kernel function $k$ is an important design decision. We generate predictions by sampling from the conditional distribution $$\begin{aligned} \mathbf{f}_{*} | X, \mathbf{y}, X_{*} & \sim \mathcal{N}\left(\overline{\mathbf{f}}_{*}, \operatorname{cov}\left(\mathbf{f}_{*}\right)\right), \text { where } \\ \overline{\mathbf{f}}_{*} & \triangleq \mathbb{E}\left[\mathbf{f}_{*} | X, \mathbf{y}, X_{*}\right]=K\left(X_{*}, X\right)\left[K(X, X)+\sigma_{n}^{2} I\right]^{-1} \mathbf{y} \\ \operatorname{cov}\left(\mathbf{f}_{*}\right) &=K\left(X_{*}, X_{*}\right)-K\left(X_{*}, X\right)\left[K(X, X)+\sigma_{n}^{2} I\right]^{-1} K\left(X, X_{*}\right) \end{aligned}$$
I implemented the standard GP regression algorithm of Rasmussen and Williams [4]:
import numpy as np
import scipy as sp
import scipy.linalg

def GP(x_train, y_train, kernel, noise, x_test):
    train_size = x_train.shape[0]
    test_size = x_test.shape[0]
    # kernel is an object of a kernel class (one of sq exp, erf or relu);
    # get_kernel returns the joint kernel over the concatenated train and test inputs
    K = kernel.get_kernel(x_train, x_test)
    K = K + np.eye(train_size + test_size) * noise  # add observation noise / jitter to the diagonal
    K11, K12, K22 = split_kernel(K, train_size, test_size)  # utility func to split the kernel into blocks
    # solved = K(X_*, X) [K(X, X) + sigma_n^2 I]^{-1}, via a positive-definite solve
    solved = sp.linalg.solve(K11, K12, assume_a='pos').T
    # Compute the posterior mean
    pred_mean = solved @ y_train
    # Compute the posterior covariance
    pred_cov = K22 - (solved @ K12)
    return pred_mean, pred_cov, K  # posterior mean, posterior covariance, joint kernel
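The split_kernel helper is not shown above; here is a minimal sketch of what it is assumed to do, given that get_kernel returns the joint kernel over the concatenated training and test inputs:
def split_kernel(K, train_size, test_size):
    # Partition the joint (train + test) kernel into its blocks:
    # K11 = K(X, X), K12 = K(X, X_*), K22 = K(X_*, X_*)
    # (test_size is implied by the remaining rows/columns)
    K11 = K[:train_size, :train_size]
    K12 = K[:train_size, train_size:]
    K22 = K[train_size:, train_size:]
    return K11, K12, K22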
The squared exponential covariance function is defined as: $$ K_{\mathrm{SE}}\left(x, x^{\prime}\right)=\exp \left(-\frac{|d|^{2}}{2 \ell^{2}}\right) $$ where $d = x - x'$ and $\ell$ is the length scale.
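As a concrete example, a minimal sketch of a squared exponential kernel class compatible with the GP routine above (the class name and length-scale value are illustrative):
import numpy as np
from scipy.spatial.distance import cdist

class SqExpKernel:
    def __init__(self, length_scale=1.0):
        self.length_scale = length_scale

    def get_kernel(self, X1, X2):
        # Joint kernel over the concatenated training and test inputs,
        # matching the convention expected by GP() above.
        X = np.concatenate((X1, X2), axis=0)
        d2 = cdist(X, X, metric='sqeuclidean')  # |d|^2, squared Euclidean distances
        return np.exp(-0.5 * d2 / self.length_scale ** 2)

# e.g. pred_mean, pred_cov, K = GP(x_train, y_train, SqExpKernel(10.0), 1e-6, x_test)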
For a single-hidden-layer neural network, the $i$th component of the output, $z^1_i$, is given by [5,6]: $$z_{i}^{1}(x)=b_{i}^{1}+\sum_{j=1}^{N_{1}} W_{i j}^{1} x_{j}^{1}(x), \quad x_{j}^{1}(x)=\phi\left(b_{j}^{0}+\sum_{k=1}^{d_{i n}} W_{j k}^{0} x_{k}\right)$$ where the matrices $W^0, W^1$ contain the weights and the vectors $b^0, b^1$ the biases of the neurons.
The weights and biases are i.i.d., with distributions $w_{i,j}\sim \mathcal{N} \left(0, \sigma_w^2\right)$ and $b_i\sim \mathcal{N} \left(0, \sigma_b^2\right)$ respectively. As the number of hidden neurons $N_1 \to \infty$, the output $z_{i}^{1} \sim \mathcal{GP}\left(0, K^{1}\right)$. Here, the kernel function is $K^{1}\left(x,x^{\prime}\right) = \sigma_{b}^{2}+\sigma_{w}^{2} C\left(x, x^{\prime}\right)$, where $C$ depends on the choice of activation function $\phi$ (defined in [5]). It is not always possible to obtain an analytical form for the kernel function.
An analytical form of $K^{1}$ exists for the error-function (erf) activation, a sigmoidal nonlinearity similar to tanh. It is given by [6]: $$ K_{\mathrm{erf}}\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\frac{2}{\pi} \sin ^{-1} \frac{2 \tilde{\boldsymbol{x}}^{T} \Sigma \tilde{\boldsymbol{x}}^{\prime}}{\sqrt{\left(1+2 \tilde{\boldsymbol{x}}^{T} \Sigma \tilde{\boldsymbol{x}}\right)\left(1+2 \tilde{\boldsymbol{x}}^{\prime T} \Sigma \tilde{\boldsymbol{x}}^{\prime}\right)}} $$ where $\tilde{\boldsymbol{x}}$ is the input augmented with a bias term and $\Sigma$ is the covariance of the input-to-hidden weights. This is my implementation:
# https://papers.nips.cc/paper/1197-computing-with-infinite-networks.pdf
# sigma2 is a hyperparameter, as described above
def get_kernel(self, X1, X2):
    n_train = X1.shape[0]
    n_pred = X2.shape[0]
    dim = X1.shape[1]
    # stack train and test inputs and prepend a column of ones (the augmented input x~)
    X = np.concatenate((X1, X2), axis=0)
    X = np.concatenate((np.ones((n_train + n_pred, 1)), X), axis=1)
    cov_mat = self.sigma2 * np.eye(dim + 1)   # Sigma = sigma2 * I
    K_raw = 2 * X @ (cov_mat @ X.T)           # 2 x~^T Sigma x~'
    # cov2varmean returns a matrix of the geometric means of the variances of a covariance matrix
    norm = cov2varmean(1 + K_raw)
    K = np.arcsin(K_raw / norm)
    K *= 2 / np.pi
    return K
Now, for multi-layer neural networks, the covariance kernel is defined recursively as [7]: $$K^{l}\left(x, x^{\prime}\right)=\sigma_{b}^{2}+\sigma_{w}^{2} F_{\phi}\left(K^{l-1}\left(x, x^{\prime}\right), K^{l-1}(x, x), K^{l-1}\left(x^{\prime}, x^{\prime}\right)\right)$$ for the $l$th layer, with the base case: $$ K^{0}\left(x, x^{\prime}\right)=\mathbb{E}\left[z_{j}^{0}(x) z_{j}^{0}\left(x^{\prime}\right)\right]=\sigma_{b}^{2}+\sigma_{w}^{2}\left(\frac{x \cdot x^{\prime}}{d_{\mathrm{in}}}\right) $$ Again, the function $F_{\phi}$ depends on the activation function $\phi$ and might not have a closed form.
For $\phi = \mathrm{ReLU}$, the recursion has a closed form; the following kernel is equivalent to an infinitely wide neural network of $l$ layers with ReLU activations: $$\begin{aligned} K^{l}\left(x, x^{\prime}\right) &=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{2 \pi} \sqrt{K^{l-1}(x, x) K^{l-1}\left(x^{\prime}, x^{\prime}\right)}\left(\sin \theta_{x, x^{\prime}}^{l-1}+\left(\pi-\theta_{x, x^{\prime}}^{l-1}\right) \cos \theta_{x, x^{\prime}}^{l-1}\right) \\ \theta_{x, x^{\prime}}^{l} &=\cos ^{-1}\left(\frac{K^{l}\left(x, x^{\prime}\right)}{\sqrt{K^{l}(x, x) K^{l}\left(x^{\prime}, x^{\prime}\right)}}\right) \end{aligned}$$ with $K^0$ defined above. This is my implementation:
# https://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.pdf
# layers, sigma_w2, sigma_b2 and noise_bias are hyperparameters, as described above
def get_K0(self, X):
    # base-case kernel K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
    d_in = X.shape[1]
    bias_kernel = np.eye(X.shape[0]) * self.noise_bias * d_in  # small diagonal term controlled by noise_bias
    K0 = (np.dot(X, X.T) - bias_kernel) / d_in
    return self.sigma_b2 + self.sigma_w2 * K0

def get_kernel(self, X1, X2):
    n_train = X1.shape[0]
    n_pred = X2.shape[0]
    X = np.concatenate((X1, X2), axis=0)
    K = self.get_K0(X)
    for layer in range(self.layers):
        Kcorr = cov2cor(K)        # correlation matrix from the covariance matrix
        theta = np.arccos(Kcorr)  # the angles theta^{l-1}_{x,x'}
        # cov2varmean gives sqrt(K^{l-1}(x,x) K^{l-1}(x',x'))
        K = cov2varmean(K) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K *= self.sigma_w2 / (2 * np.pi)
        K += self.sigma_b2
    return K
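Both kernel implementations above lean on two small helpers, cov2varmean and cov2cor, that are only described in comments. A minimal sketch of what they are assumed to compute:
import numpy as np

def cov2varmean(K):
    # Matrix of geometric means of the diagonal (variance) entries:
    # out[i, j] = sqrt(K[i, i] * K[j, j])
    d = np.sqrt(np.diag(K))
    return np.outer(d, d)

def cov2cor(K):
    # Correlation matrix corresponding to covariance matrix K,
    # clipped to [-1, 1] so that np.arccos stays well defined.
    corr = K / cov2varmean(K)
    return np.clip(corr, -1.0, 1.0)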
%%html
<img src="plots/GP_kernels.png" style="width:833.3px" align="middle">
<p style="text-align:center;">Figure 2: Different kernels for the potential-to-energy mapping. The order is [squared exponential, erf, ReLU]</p>
This section only states the main results; for a thorough treatment, see [2,4,5,6,7,8]. [2] also discusses numerical computation of the kernel functions.
I closely followed the data generation techniques outlined in [1] to try to reproduce the results. I wrote a 2D finite difference solver [9] to discretize $\hat{T}$ and solve the eigenvalue problem numerically. I used the same distributions as [1] to generate the parameters of the potentials wherever possible (Appendix 1 here and Appendix B in [1]). They use 256x256 images with a length of 20 a.u.; this was not feasible on my desktop, so I used grids of size 32x32 with a range of -20 a.u. to 20 a.u.
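My solver is not reproduced here, but the idea can be sketched with scipy's sparse eigensolver. A minimal sketch in atomic units ($\hbar = m_e = 1$); the extent of 40 a.u. corresponds to the -20 to 20 a.u. range above, and infinite-well potentials would need their infinite values replaced by a large finite number first:
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import eigsh

def ground_state_energy(V, extent=40.0):
    # Smallest eigenvalue of H = -0.5 * Laplacian + V on an N x N grid,
    # with V the potential sampled on the grid and extent the side length.
    N = V.shape[0]
    h = extent / (N - 1)  # grid spacing
    # 1D second-derivative stencil with Dirichlet boundaries
    D2 = sps.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N)) / h**2
    I = sps.identity(N)
    laplacian = sps.kron(D2, I) + sps.kron(I, D2)
    H = -0.5 * laplacian + sps.diags(V.ravel())
    # smallest algebraic eigenvalue = ground-state energy
    return eigsh(H, k=1, which='SA', return_eigenvectors=False)[0]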
To generate random potentials, I used the following method, quoted from [1]:
First, we generate a 16 × 16 binary grid of 1s and 0s, and upscale it to 256 × 256. We then generate a second 16 × 16 binary grid and upscale it to 128 × 128. We center the smaller grid within the larger grid and then subtract them element-wise. We then apply a Gaussian blur with standard deviation σ1 to the resulting image. The potential is now random, and smooth, but does not achieve a maximum at the boundary. To achieve this, we generate a mask that smoothly goes to zero at the boundary, and 1 in the interior. We wish the mask to be random, e.g. a randomly generated ‘blob’. To generate the blob, we generate k2 random coordinate pairs on a 200 × 200 grid, where k is an integer between 2 and 7, inclusive. We then throw away all points that lie inside the convex hull of these points, and smoothly interpolate the remaining points with cubic splines. We then form a binary mask by filling the inside of this closed blob with 1s, and the outside with 0s. Resizing the blob to a resolution of R × R, and applying a Gaussian blur with standard deviation σ2, we arrive at the final mask. Element-wise multiplication of the mask with the random-blurred image gives a random potential that approaches zero at the boundary. We randomize the “sharpness” of the potential by then exponentiating by either d = 0.1, 0.5, 1.0, or 2.0, chosen at random with equal probabilities (i.e. V := $V^d$). We then subtract the result from its maximum to invert the well.
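A partial sketch of the first few steps of this recipe (binary grids, upscaling, centering, subtraction, and blur) using scipy.ndimage; the resolutions follow the 256 × 256 recipe quoted above, the value of sigma1 is a placeholder, and the blob mask, exponentiation, and inversion steps are omitted:
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def random_potential_core(sigma1=8.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # 16x16 binary grid upscaled to 256x256 (nearest-neighbour upscaling)
    big = zoom(rng.integers(0, 2, size=(16, 16)).astype(float), 16, order=0)
    # second 16x16 binary grid upscaled to 128x128 and centered in a 256x256 frame
    small = zoom(rng.integers(0, 2, size=(16, 16)).astype(float), 8, order=0)
    centered = np.zeros((256, 256))
    centered[64:192, 64:192] = small
    # element-wise subtraction followed by a Gaussian blur of standard deviation sigma1
    return gaussian_filter(big - centered, sigma=sigma1)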
Here are some samples of the data I generated:
%%html
<img src="plots/training_samples_32x32.png" style="height:600px" align="middle">
<img src="plots/test_samples_32x32.png" style="height:233.3px" align="middle">
<p style="text-align:center;">Figure 3: Sample potentials for different systems, size 32x32. Note that the columns are in the order [IW, SHO, RAND, DIG]</p>
<img src="plots/energy_values_32.png" style="width:533.3px" align="middle">
<p style="text-align:center;">Figure 4: Energy distribution of different systems</p>
Fig. 4 shows the energy distribution of the various systems. I intended to create a uniform distribution across all systems, using the same parameters as [1]. It is clear from the banding in Fig. 4 that the systems have different energy distributions. This could be due to the smaller 32x32 image size I am using, or to other bugs. I also did not use DIG to train my models, because its energy distribution, clustered around 0 mHa, is not ideal. Instead, I used it as a holdout set to test the models on a system that was not in the training set.
I trained a relatively simple CNN that follows the structure of the CNN in [1].
%%html
<img src="plots/CNN_architecture.png" style="width:733.3px" align="middle">
<p style="text-align:center;">Figure 5: CNN architecture in [1]</p>
Because my data consists of 32x32 images, I used a CNN that is equivalent to the layers [input, conv4, conv5, conv6, conv7, fc1] (refer to Fig. 5). I used the AdaDelta optimizer with a learning rate of 0.001 and experimented with batch sizes and numbers of epochs.
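A minimal Keras sketch of a network of this shape (the framework choice, filter counts, kernel sizes, and loss here are my assumptions, not the exact configuration of Fig. 5):
from tensorflow.keras import layers, models, optimizers

def build_cnn(input_shape=(32, 32, 1)):
    # Four strided conv blocks standing in for conv4-conv7 of Fig. 5, then one dense output.
    model = models.Sequential([
        layers.Conv2D(64, 3, strides=2, padding='same', activation='relu',
                      input_shape=input_shape),
        layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
        layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
        layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
        layers.Flatten(),
        layers.Dense(1),  # single regression output: the ground-state energy
    ])
    model.compile(optimizer=optimizers.Adadelta(learning_rate=0.001), loss='mse')
    return model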
I implemented GPs with different kernels, as explained in the background section. Each 32x32 image is flattened into a length-1024 vector. The kernel parameters are given in Appendix 1.
I experimented with different dataset sizes. Each dataset was split into 80% training and 20% testing. The DIG system was held out.
The main error metric used is the median absolute error (MAE). The goal is to get the MAE to within chemical accuracy (1.6 mHa). I also calculated the RMSE as a proxy for the distance between the predicted and true distributions.
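For clarity, the two metrics as used here (a small sketch; note that MAE denotes the median, not the mean, absolute error):
import numpy as np

CHEMICAL_ACCURACY_MHA = 1.6  # target threshold in mHa

def median_abs_error(y_true, y_pred):
    return np.median(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))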
First, each model was trained and tested on each system separately, referred to by the system name. Then a generalized model was trained on three systems (IW, SHO, RAND), referred to as ALL.
For Gaussian processes, each separate model was trained on 4000 samples and tested on 1000 samples. The generalized model was trained on IW, SHO, and RAND samples, 12000 in total, and tested on 3000 samples.
The CNN was trained on 16000 samples and tested on 4000 samples. The generalized model was trained on IW, SHO, and RAND samples, 48000 in total, and tested on 12000 samples.
Finally, the generalized models were used to predict small datasets (500 samples) of DIG systems.
%%html
<img src="plots/Results.png" style="height:1000px" align="middle">
<p style="text-align:center;">Figure 6: Model performance on different systems. All units in mHa. Each quadrant corresponds to one model, and each plot shows the predicted vs. true energies for that system. MAE is the median absolute error. Note that the MAE values shown for the CNN are incorrect on the graph; please refer to Table 1.</p>
Model | IW | SHO | RAND | ALL | DIG |
---|---|---|---|---|---|
Sq Exp | 69.3116 | 154.149 | 333.653 | 27.0994 | 7.88218 |
erf | 1.49599 | 0.604144 | 0.562534 | 1.01109 | 14.4476 |
Relu 1l | 0.753448 | 0.119569 | 0.597165 | 0.664624 | 205.882 |
Relu 5l | 0.47914 | 0.0887508 | 0.575393 | 0.429729 | 122.52 |
CNN | 0.867032 | 0.435883 | 0.537426 | 0.6508 | 54.9341 |

Table 1: Median absolute error (mHa) for each model (rows) and test set (columns).
Relu 1l denotes a single-layer ReLU kernel and Relu 5l a 5-layer ReLU kernel.
Relu 5l performed the best across a majority of the systems. Adding more layers decreased the prediction error.
Chemical accuracy was attained by the erf kernel, both ReLU kernels, and the CNN.
%%html
<img src="plots/DIG_Results.png" style="height:1000px" align="middle">
<p style="text-align:center;">Figure 7: Model performance on the DIG system. All units in mHa.</p>
As expected, the models performed much worse when predicting the DIG system, since it was not part of the training datasets. The Sq Exp kernel performed relatively better than the NNGP kernels and the CNN.
Model | IW | SHO | RAND | ALL | DIG |
---|---|---|---|---|---|
Sq Exp | 20.231 | 21.6875 | 22.3705 | 151.493 | 106 |
erf | 6.17432 | 6.54888 | 6.27394 | 21.8295 | 14.0087 |
Relu 1l | 6.96452 | 7.31794 | 7.10844 | 29.1351 | 18.9333 |
Relu 5l | 11.9033 | 11.9716 | 11.6958 | 69.2532 | 46.9672 |
CNN | 465.491 | 468.205 | 468.594 | 2403.21 | 2403.21 |

Table 2: Time taken by each model on each dataset, in seconds.
I experimented with the batch size and number of epochs for the CNN. On datasets of size 16000 the CNN took about 465 s, and on the bigger 48000-point dataset it took about 40 minutes to reach chemical accuracy. The NNGP kernels were computed with datasets of size 5000 and 15000, and took significantly less time to reach chemical accuracy. The caveat is that neural networks take much longer to train, but inference on large amounts of data is then fast, whereas Gaussian processes scale poorly when used to predict larger and larger datasets.
%%html
<img src="plots/linear_time.png" style="height:400px" align="middle">
<p style="text-align:center;">Figure 8: Size of kernel vs time taken to invert it </p>
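The timing experiment in Fig. 8 can be sketched as follows (the matrix sizes and the random SPD "kernel" are illustrative; the solve is the same positive-definite solve used in GP() above):
import time
import numpy as np
import scipy.linalg

for n in [500, 1000, 2000, 4000]:
    A = np.random.rand(n, n)
    K = A @ A.T + n * np.eye(n)               # a well-conditioned SPD stand-in for a kernel
    b = np.random.rand(n)
    start = time.perf_counter()
    scipy.linalg.solve(K, b, assume_a='pos')  # positive-definite solve, as in GP()
    print(n, time.perf_counter() - start)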
I did hyperparameter tuning on a grid of parameters ($\sigma_b^2$ and $\sigma_w^2$) for ReLU kernels of different depths, using a small dataset (2000 samples) of 32x32 images.
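A minimal sketch of the grid search (the grid values and the ReluKernel constructor signature are assumptions based on the hyperparameter names in the code above; x_train, y_train, x_val, y_val stand for a split of the 2000-sample dataset):
import numpy as np

sigma_b2_grid = np.linspace(0.1, 2.0, 10)
sigma_w2_grid = np.linspace(0.5, 5.0, 10)
errors = np.zeros((len(sigma_b2_grid), len(sigma_w2_grid)))

for i, sb2 in enumerate(sigma_b2_grid):
    for j, sw2 in enumerate(sigma_w2_grid):
        kernel = ReluKernel(layers=5, sigma_w2=sw2, sigma_b2=sb2, noise_bias=1e-6)
        pred_mean, pred_cov, K = GP(x_train, y_train, kernel, 1e-6, x_val)
        errors[i, j] = np.median(np.abs(pred_mean - y_val))

# errors can then be rendered as a heatmap like the ones in Fig. 9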
%%html
<img src="plots/density_hyperparam.png" style="height:400px" align="middle">
<p style="text-align:center;">Figure 9: Hyperparameter tuning for ReLU kernels </p>
I observed that the best performance occurs on the edge of the structures in these heatmaps, as stated in [2]. Beyond a certain depth, the kernel becomes singular (51 and 101 layers in Fig. 9).
I reproduced [1] on a smaller scale. NNGP kernels can be viable alternatives for predicting the ground-state energy (and solving other eigenvalue problems) for the relatively simple systems explored here.
I am considering working on this project over the summer. Some extensions that I can think of are:
[1] Deep learning and the Schrodinger equation https://arxiv.org/abs/1702.01361
[2] Deep Neural Networks as Gaussian Processes https://arxiv.org/abs/1711.00165
[3] Uncertainty in deep learning, nice intro http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
[4] Gaussian Processes for Machine Learning http://www.gaussianprocess.org/gpml/
[5] Priors for Infinite Networks https://www.cs.toronto.edu/~radford/pin.abstract.html; his thesis, Bayesian Learning for Neural Networks, is pretty cool as well: https://www.cs.toronto.edu/~radford/bnn.book.html
[6] Computing with Infinite Networks https://papers.nips.cc/paper/1197-computing-with-infinite-networks.pdf
[7] Kernel Methods for Deep Learning https://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.pdf
[8] A Practical Bayesian Framework for Backpropagation Networks https://authors.library.caltech.edu/13793/1/MACnc92b.pdf
[9] Numerical Recipes, Third Ed pg 1029 http://numerical.recipes
[10] Deep Gaussian Processes with Convolutional Kernels, https://arxiv.org/abs/1806.01655v1
[11] GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration https://arxiv.org/abs/1809.11165
[12] Neural Tangents: Fast and Easy Infinite Neural Networks in Python https://arxiv.org/abs/1912.02803, https://github.com/google/neural-tangents