IceCube Neutrino Reconstruction Using PCA and RNNs

a squirrel with ice cubes, AI-generated using DALL·E

Introduction

IceCube is a neutrino observatory instrumenting a cubic kilometer of ice at the South Pole. One of the challenges in neutrino research is quickly and accurately determining the direction a neutrino came from. I participated in IceCube’s machine learning competition, hosted on Kaggle, which challenged competitors to accurately process a large number of neutrino events (about 1 million events in the hidden test set) within time constraints.

My goal was to use this competition as an opportunity to gain familiarity with principal component analysis (PCA) as well as sequential models such as recurrent neural networks (RNNs).

In the sections below, I’ll cover my work on exploratory data analysis, PCA, and RNNs, and share any relevant snippets of code along the way.

Exploratory Data Analysis

Let’s first take a look at IceCube’s sensor geometry. IceCube consists of 5160 sensors, and we are provided with the x, y, and z positions of each sensor.

Read in the sensor geometry CSV file, and create a 3D scatter plot using Plotly:

# Imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go

sensor_geometry = pd.read_csv('/Volumes/public/Python/icecube/sensor_geometry.csv')
sensor_geometry.head()

plotly.offline.init_notebook_mode()

trace = go.Scatter3d(
    x=sensor_geometry['x'],
    y=sensor_geometry['y'],
    z=sensor_geometry['z'],
    mode='markers',
    marker={'color': sensor_geometry['sensor_id'], 'colorscale': 'Viridis', 'size': 2, 'colorbar': dict(thickness=20)}
)

layout = go.Layout(margin={'l': 0, 'r': 0, 'b': 0, 't': 0})

plot_figure = go.Figure(data=[trace], layout=layout)

plotly.offline.iplot(plot_figure)

I ran this code in a Jupyter Notebook so that the interactive plot would appear alongside the code cell. This is what it looks like:

3D plot of IceCube sensor geometry

The darker colors correspond to lower sensor IDs and the lighter colors correspond to higher sensor IDs. You can see that the sensors are located along vertical “strings.”

The competition provides 131,953,924 neutrino events for training. Each neutrino event has a ground truth label (azimuth and zenith values), which specifies the direction that the neutrino came from. We have exact labels for this dataset because the events are simulated.

These labels are located in a train_meta.parquet file, which we’ll read in:

train_meta = pd.read_parquet('/Volumes/public/Python/icecube/train_meta.parquet')
train_meta.head()
batch_id	event_id	first_pulse_index	last_pulse_index	azimuth	    zenith
1	        24	        0	                60	                5.029555	2.087498
1	        41	        61	                111	                0.417742	1.549686
1	        59	        112	                147	                1.160466	2.401942
1	        67	        148	                289	                5.845952	0.759054
1	        72	        290	                351	                0.653719	0.939117

You can see that each row of this file corresponds to a separate neutrino event.

The data for each neutrino event are located in individual batch files (660 batch IDs in total), stored as parquet files. Let’s open up batch_7.parquet:

train_batch_7 = pd.read_parquet('/Volumes/public/Python/icecube/train/batch_7.parquet')
train_batch_7.head()
            sensor_id	time	charge	auxiliary
event_id				
19526675	288	        6004	0.575	True
19526675	3257	    6360	1.175	True
19526675	303	        6490	1.025	True
19526675	2354	    7517	0.625	True
19526675	4104	    7850	1.075	True

Reading in batch #7 takes up about 840 MB of memory. We can see that there are multiple rows per event, with each row recording a single pulse: the ID of the sensor that detected it, the relative time of the pulse, and its charge. auxiliary is a flag where False means that the pulse was fully digitized and is likely of higher quality.
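Since event_id serves as the DataFrame index, pulling out all of the pulses for a single event is a one-liner. Here's a quick sketch using one of the event IDs from the head() output above:

# Pull out all pulses for one event (event_id is the DataFrame index), sorted by detection time
event = train_batch_7.loc[19526675].sort_values('time')
print(event.shape)  # (number of pulses in this event, 4)
event.head()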

Let’s take a look at the distribution of pulse charge by plotting a histogram:

plt.hist(train_batch_7['charge'], bins=30, range=(0, 6))
plt.xlabel('Pulse Charge [photoelectrons]')
plt.ylabel('Count')
plt.show()
histogram of pulse charge

Most pulses have a charge of about 1 photoelectron.

Let’s write a function that plots any given event, showing the pulse-detecting sensors and a line depicting the neutrino’s direction. We’ll use Plotly again for 3D plotting. Here’s the code, along with comments:

def plot_event(event_id):

    # Get pulse-detecting sensors
    sample_event = train_batch_7.loc[event_id, 'sensor_id']
    sample_sensors = pd.merge(sensor_geometry, sample_event, on='sensor_id', how='inner')

    # Convert polar coordinates (azimuth, zenith) to cartesian coordinates (x, y, z)
    azimuth = train_meta[train_meta['event_id'] == event_id]['azimuth'].squeeze()
    zenith = train_meta[train_meta['event_id'] == event_id]['zenith'].squeeze()
    r = 500
    x_line = r * np.cos(azimuth) * np.sin(zenith)
    y_line = r * np.sin(azimuth) * np.sin(zenith)
    z_line = r * np.cos(zenith)

    # Set up plotly to display in notebook mode
    plotly.offline.init_notebook_mode()

    # Create traces for each scatterplot
    trace = go.Scatter3d(
        x=sensor_geometry['x'],
        y=sensor_geometry['y'],
        z=sensor_geometry['z'],
        mode='markers',
        marker={'size': 2, 'opacity': 0.8, 'color': 'gray'},
        showlegend=False
    )
    sample_trace = go.Scatter3d(
        x=sample_sensors['x'],
        y=sample_sensors['y'],
        z=sample_sensors['z'],
        mode='markers',
        marker={'color': train_batch_7.loc[event_id, 'time'], 'colorscale': 'Viridis', 'size': 5, 'colorbar': dict(thickness=20)}
    )
    label_trace = go.Scatter3d(
        x=[0, x_line],
        y=[0, y_line],
        z=[0, z_line],
        line=dict(color='orange', width=10),
        showlegend=False
    )

    # Define layout margins
    layout = go.Layout(margin={'l': 0, 'r': 0, 'b': 0, 't': 0})

    # Plot the scatterplots
    plot_figure = go.Figure(data=[trace, sample_trace, label_trace], layout=layout)
    plotly.offline.iplot(plot_figure)

Plot an example event:

plot_event(event_id=20936501)
IceCube neutrino event 3D plot

In the above plot, the gray dots represent sensors that did not detect any pulses in this event. Colored dots represent sensors that detected pulses: darker blue dots are earlier pulses, and lighter green dots are later ones. The orange line represents the neutrino’s direction. In this example, the orange line’s angle looks like a reasonable fit to the sensors that received pulses. This is a nicer-looking event; many events don’t show a clear track through the detector.

That’s it for exploratory data analysis. Next, we’ll look into how we can use principal component analysis to help reconstruct the neutrinos.

Principal Component Analysis

Principal component analysis (PCA) is a popular dimensionality reduction algorithm that works by identifying the lower-dimensional subspace (for example, a plane) that lies closest to the data and then projecting the data onto it. In our case, we won’t actually project the data; we only need the direction that preserves the maximum variance, which we’ll use as an estimate of the neutrino’s direction.

PCA finds the axis that accounts for the largest amount of variance in the dataset. It can also identify a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, then a third axis orthogonal to the first two, and so on, up to the number of dimensions in the dataset.
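To make this concrete, here's a minimal sketch (on synthetic points, not the IceCube data) showing that the first principal component recovers the dominant direction of a roughly linear point cloud; for simplicity, it fits (x, y, z) only:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic "track": points spread along a known direction, plus noise
true_direction = np.array([1.0, 2.0, -0.5])
true_direction /= np.linalg.norm(true_direction)
t = rng.uniform(-100, 100, size=(200, 1))
points = t * true_direction + rng.normal(scale=5, size=(200, 3))

# The first principal component is the axis of maximum variance
pca = PCA(n_components=1).fit(points)
estimate = pca.components_[0]

# The sign of the component is arbitrary, so compare up to a flip
print(np.abs(estimate @ true_direction))  # close to 1.0
print(pca.explained_variance_ratio_[0])   # close to 1.0 for a clean track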

Before we perform PCA, let’s process our data by reading in a batch of data, only keeping higher-quality pulses (auxiliary=False), and dropping any repeated sensor pulses (for simplicity). We’ll also merge the data with our sensor geometry information.

Here’s the code (takes about 4 seconds to run):

# Read in parquet data for a single batch, and reset the index
batch_7 = pd.read_parquet('/Volumes/public/Python/icecube/train/batch_7.parquet')
batch_7 = batch_7.reset_index()

# Only keep higher-quality data
batch_7 = batch_7[batch_7['auxiliary'] == False]

# Keep only the columns we need (we won't be using charge), and drop duplicate sensor pulses
batch_7 = batch_7[['event_id', 'sensor_id', 'time']]
batch_7 = batch_7.drop_duplicates(subset=['event_id', 'sensor_id'], keep='first')

# Read in sensor_geometry data, and merge with parquet data
sensor_geometry = pd.read_csv('/Volumes/public/Python/icecube/sensor_geometry.csv', low_memory=False)
batch_geo = pd.merge(batch_7, sensor_geometry, on='sensor_id', how='inner')
display(batch_geo.sample(5))
print(batch_geo.shape)
event_id	sensor_id	time	x	       y	      z
21788731	4848	    11586	41.60	   35.49	  -424.23
21980555	1290	    11135	-492.43	   -230.16	  -13.02
20913545	4093	    11275	-268.90	   354.24	  280.27
19707634	3729	    14347	-66.70	   276.92	  344.64
20113992	83	        9887	-132.80	   -501.45	  108.60

(4133016, 6)

The output shows one detected pulse per row, with the time, the sensor_id, and the sensor’s position (x, y, z).

Let’s perform PCA by going through one event at a time, using scikit-learn’s PCA to fit each event’s (x, y, z, time) data. We’ll fit both unscaled and scaled data to see whether scaling affects the results. We’ll also make sure that the fitted PCA vector points in the right direction: the sign of a principal component is arbitrary, so we flip it whenever its time component is positive, so that it points back toward where the neutrino came from (matching the convention of the ground truth labels).

This takes about 30 minutes to run:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define a DataFrame to hold the results
results = pd.DataFrame(columns=['event_id', 'pca_x', 'pca_y', 'pca_z', 'pca_var', 
                                            'pca_s_x', 'pca_s_y', 'pca_s_z', 'pca_s_var', 
])
results['event_id'] = results['event_id'].astype(int)

# Go through each unique event
for event_id in batch_geo['event_id'].unique():

    # Define the training data
    train = batch_geo[batch_geo['event_id'] == event_id][['x', 'y', 'z', 'time']].to_numpy()

    # Define the scaler, and scale the training data
    scaler = StandardScaler()
    scaled_train = scaler.fit_transform(train)

    # Fit PCA to both scaled and unscaled training data
    pca = PCA(n_components=1, svd_solver='full').fit(train)
    scaled_pca = PCA(n_components=1, svd_solver='full').fit(scaled_train)

    # If x and y are equal to 0, make x a small number so that the azimuth and zenith angles aren't undefined
    if (pca.components_[0][0] == 0) and (pca.components_[0][1] == 0):
        pca.components_[0][0] = 1e-06
    if (scaled_pca.components_[0][0] == 0) and (scaled_pca.components_[0][1] == 0):
        scaled_pca.components_[0][0] = 1e-06

    # Flip the direction of the PCA vectors if time is positive
    if pca.components_[0][3] > 0:
        pca.components_[0] = - pca.components_[0]
    if scaled_pca.components_[0][3] > 0:
        scaled_pca.components_[0] = - scaled_pca.components_[0]
    
    # Save all of this information to a DataFrame
    results.loc[len(results.index)] = [event_id, pca.components_[0][0], pca.components_[0][1], pca.components_[0][2], pca.explained_variance_ratio_[0],
                                                 scaled_pca.components_[0][0], scaled_pca.components_[0][1], scaled_pca.components_[0][2], scaled_pca.explained_variance_ratio_[0],
    ]

display(results.head(5))
print(results.shape)
	event_id	pca_x	    pca_y	        pca_z	    pca_var	    pca_s_x	    pca_s_y	    pca_s_z	    pca_s_var
0	19526675.0	-0.041279	-1.086727e-01	0.256477	0.991469	-0.465921	-0.502842	0.515349	0.906307
1	19539828.0	0.053141	3.337328e-02	0.180456	0.977748	0.512133	0.481357	0.536761	0.824943
2	19545811.0	-0.000000	3.330669e-16	0.185920	0.997839	-0.000001	0.000000	0.707107	0.984459
3	19548451.0	0.182637	-5.151474e-02	0.106419	0.941337	0.580503	-0.553344	0.120713	0.650493
4	19552579.0	0.187720	-7.183004e-02	0.189596	0.994454	0.501781	-0.501781	0.500839	0.978909

(200000, 9)

Looks great. Now we have a DataFrame with 200,000 rows (one per event), where each row gives the neutrino’s direction in (x, y, z) and the amount of variance captured. We have this information for both unscaled and scaled data (the columns containing _s in their labels). We also notice that the unscaled data yields a higher captured variance than the scaled data.

The next step is to convert our results from cartesian (xyz) to spherical (azimuth, zenith) coordinates, since the latter is the form that our ground truth labels use. We’ll also make sure that our angles are always positive.

Make the conversions:

# Convert from cartesian to spherical coordinates to get the direction of the event — designed to work on DataFrame columns
def cart_to_spherical(df_x, df_y, df_z):
    x2_and_y2 = (df_x ** 2) + (df_y ** 2)
    zenith = np.arccos(df_z / np.sqrt(x2_and_y2 + (df_z ** 2)))
    azimuth = np.sign(df_y) * np.arccos(df_x / np.sqrt(x2_and_y2))
    return azimuth, zenith

# Make conversions
results['pca_azi'], results['pca_zen'] = cart_to_spherical(results['pca_x'], results['pca_y'], results['pca_z'])
results['pca_s_azi'], results['pca_s_zen'] = cart_to_spherical(results['pca_s_x'], results['pca_s_y'], results['pca_s_z'])

# Convert from (-pi, +pi) to (0, 2pi) — this is for single values, not DataFrame columns
def angle_to_2pi(angle):
    angle_2pi = angle % (2 * np.pi)
    if angle_2pi < 0:
        angle_2pi += 2 * np.pi
    return angle_2pi

# Fix angles
results['pca_azi'] = results['pca_azi'].apply(angle_to_2pi)
results['pca_s_azi'] = results['pca_s_azi'].apply(angle_to_2pi)

display(results.head(5))
print(results.shape)
event_id	pca_x	    pca_y	        pca_z	    pca_var	    pca_s_x	    pca_s_y	    pca_s_z	    pca_s_var	pca_azi	    pca_zen	    pca_s_azi	pca_s_zen
19526675.0	-0.041279	-1.086727e-01	0.256477	0.991469	-0.465921	-0.502842	0.515349	0.906307	4.349375	0.425555	3.965084	0.926164
19539828.0	0.053141	3.337328e-02	0.180456	0.977748	0.512133	0.481357	0.536761	0.824943	0.560759	0.334660	0.754430	0.918582
19545811.0	-0.000000	3.330669e-16	0.185920	0.997839	-0.000001	0.000000	0.707107	0.984459	1.570796	0.000000	0.000000	0.000001
19548451.0	0.182637	-5.151474e-02	0.106419	0.941337	0.580503	-0.553344	0.120713	0.650493	6.008266	1.059700    5.521735	1.421400
19552579.0	0.187720	-7.183004e-02	0.189596	0.994454	0.501781	-0.501781	0.500839	0.978909	5.917729	0.814570	5.497787	0.956203

(200000, 13)

Now our DataFrame has four additional columns: azimuth and zenith, for both unscaled and scaled data.

To compare our results with the ground truth labels, let’s add the labels to our DataFrame:

# Get ground truth labels
train_meta = pd.read_parquet('/Volumes/public/Python/icecube/train_meta.parquet')
train_meta = train_meta[['event_id', 'azimuth', 'zenith']]

# Add ground truth labels to the DataFrame
results['event_id'] = results['event_id'].astype(int)
results = pd.merge(results, train_meta, on='event_id', how='left')

We’ll also define a function to calculate the angular error between two angles:

def angular_dist(az_true, zen_true, az_pred, zen_pred):
    
    if not (np.all(np.isfinite(az_true)) and
            np.all(np.isfinite(zen_true)) and
            np.all(np.isfinite(az_pred)) and
            np.all(np.isfinite(zen_pred))):
        raise ValueError("All arguments must be finite")
    
    # Pre-compute all sine and cosine values
    sa1 = np.sin(az_true)
    ca1 = np.cos(az_true)
    sz1 = np.sin(zen_true)
    cz1 = np.cos(zen_true)
    
    sa2 = np.sin(az_pred)
    ca2 = np.cos(az_pred)
    sz2 = np.sin(zen_pred)
    cz2 = np.cos(zen_pred)
    
    # Scalar product of the two cartesian vectors (x = sz*ca, y = sz*sa, z = cz)
    scalar_prod = sz1*sz2*(ca1*ca2 + sa1*sa2) + (cz1*cz2)
    
    # Scalar product of two unit vectors is always between -1 and 1; clipping guards against numerical instability
    scalar_prod =  np.clip(scalar_prod, -1, 1)
    
    # Convert back to an angle (in radian)
    return np.abs(np.arccos(scalar_prod))

Calculate the mean angular error between our results and ground truth labels, and plot them as histograms:

# Calculate the angular error
pca_error = angular_dist(results['azimuth'],
                         results['zenith'],
                         results['pca_azi'],
                         results['pca_zen'])
pca_s_error = angular_dist(results['azimuth'],
                           results['zenith'],
                           results['pca_s_azi'],
                           results['pca_s_zen'])
print(f'PCA: {np.average(pca_error):.2f} rad')
print(f'PCA-scaled: {np.average(pca_s_error):.2f} rad')
results['pca_error'] = pca_error

# Make histogram plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5))
ax1.hist(pca_error, bins=20, label='PCA Model', fc=(1, 0.5, 0.2, 0.5), range=(0, np.pi))
ax1.set_title('Angular Error')
ax1.set_xlabel('Ground Truth Difference [rad]')
ax1.set_ylabel('Count')
ax1.legend()
ax2.hist(pca_s_error, bins=20, label='Scaled PCA Model', fc=(0, 0.5, 1, 0.5), range=(0, np.pi))
ax2.set_title('Angular Error')
ax2.set_xlabel('Ground Truth Difference [rad]')
ax2.set_ylabel('Count')
ax2.legend()
plt.show()
PCA: 1.21 rad
PCA-scaled: 1.27 rad
histograms of mean angular error

PCA on unscaled data performs better than PCA on scaled data. We already saw earlier that unscaled data resulted in a larger amount of captured variance. Looking at the histograms, unscaled PCA has more events with the lowest errors (the leftmost histogram bin) compared to scaled PCA. Both PCA histograms show a distribution centered around π/2, which is what we would expect from randomly guessing the neutrino’s direction. It appears that the neutrino events in our dataset fall into two groups: (1) events where the correct angle is straightforward to determine, and (2) events so difficult that PCA can’t do better than random guessing.
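As a quick sanity check of the random-guessing baseline (an illustrative sketch, not part of the competition code): if the predicted direction is a uniformly random unit vector, the angular error relative to any fixed true direction has a mean of π/2 ≈ 1.57 rad, consistent with the bulk of the histograms above.

import numpy as np

rng = np.random.default_rng(0)

# Uniformly random unit vectors acting as "guesses"
guesses = rng.normal(size=(100_000, 3))
guesses /= np.linalg.norm(guesses, axis=1, keepdims=True)

# Angular error relative to a fixed true direction (the z-axis here)
errors = np.arccos(np.clip(guesses[:, 2], -1, 1))

print(f'mean angular error: {errors.mean():.3f} rad (pi/2 = {np.pi / 2:.3f} rad)')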

Now that we know that PCA can yield results of ~1.21 radians for the mean angular error, let’s see how well we can do using neural networks in the next section.

Recurrent Neural Networks

In this section, I’ll discuss my work on training recurrent neural networks (RNNs) and their variants, such as gated recurrent units (GRUs) and long short-term memory (LSTM) networks, to reconstruct neutrino events. RNNs are a class of neural networks that process sequential data, and they are commonly used for forecasting and natural language processing.

To use these models, I’ll treat the data as a time series, or sequence: each neutrino event is represented by the list of sensor IDs, in the order in which their pulses were detected. Since events vary in the number of triggered sensors, sequences have different lengths, so I’ll define a maximum sequence length: longer sequences are truncated, and shorter ones are padded with a reserved value (0).
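As a small illustration of this truncation-and-padding scheme (using a hypothetical maximum length of 5; the full pre-processing script below defaults to 300):

max_len = 5

def truncate_and_pad(sensor_ids, max_len):
    # Truncate a pulse sequence to max_len, then pad with the reserved 0 token
    sensor_ids = sensor_ids[:max_len]
    return sensor_ids + [0] * (max_len - len(sensor_ids))

print(truncate_and_pad([289, 3258, 304], max_len))              # [289, 3258, 304, 0, 0]
print(truncate_and_pad([12, 7, 99, 4101, 305, 1188], max_len))  # [12, 7, 99, 4101, 305]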

Here’s a Python script that pre-processes all of the IceCube data files, with optional command line arguments to specify the train-validation splitting fraction and maximum sequence length. It takes about 1 minute per parquet file or about 8 hours for the entire dataset:

# -----------------------------------------------------------------------------
# Imports

import argparse
import glob
import pandas as pd
import numpy as np
import csv


# -----------------------------------------------------------------------------
# Process parquet files

def process_data(input_file, split_name):

    # Read the input file, and reset the index
    batch = pd.read_parquet(input_file)
    batch = batch.reset_index()

    # Only keep higher-quality data
    batch = batch[batch['auxiliary'] == False]

    # Keep only the columns we need (we won't be using charge), and sort pulses by time within each event
    batch = batch[['event_id', 'sensor_id', 'time']]
    batch = batch.sort_values(by=['event_id', 'time'], axis=0, ascending=True)

    # Shift sensor_id by 1 (0-5159 --> 1-5160) because the 0 token is reserved for padding
    batch['sensor_id'] = batch['sensor_id'] + 1

    # Create a nested list of all events [ [event #1's sensors in order], [event #2's sensors in order], ...] and corresponding IDs
    g = batch.groupby('event_id')
    events = g['sensor_id'].apply(list).tolist()
    event_ids = list(g.groups.keys())

    # Truncate events longer than the maximum sequence length
    events = [e[:args.sequencelen] for e in events]

    # Pad events with 0, if shorter than the maximum sequence length
    events = [e + [0] * (args.sequencelen - len(e)) for e in events]

    # Get target values (labels)
    labels = meta.reindex(event_ids).values.tolist()

    # Combine the events and labels into a single list
    events_labels = [e + l for e, l in zip(events, labels)]

    # Save as CSV file
    with open(args.outputdir+split_name+'_'+input_file.split('/')[-1].split('.')[0][6:]+'.csv', 'w') as f:
        write = csv.writer(f)
        write.writerows(events_labels)


# -----------------------------------------------------------------------------

if __name__ == "__main__":

    # Define command line arguments
    parser = argparse.ArgumentParser(description='IceCube data preprocessing')
    parser.add_argument('-metafile', type=str, default='/Volumes/public/Python/icecube/train_meta.parquet', help='metadata filepath')
    parser.add_argument('-parquetdir', type=str, default='/Volumes/public/Python/icecube/train/', help='filepath for the directory containing parquet files')
    parser.add_argument('-trainfrac', type=float, default=0.9, help='fraction of the dataset to be used for the training set')
    parser.add_argument('-sequencelen', type=int, default=300, help='max length of each example in the dataset')
    parser.add_argument('-outputdir', type=str, default='/Volumes/public/Python/icecube/train_2023-02-27/', help='filepath for the output directory')

    # Parse arguments
    args = parser.parse_args()

    # Read the metadata file
    print(f'Reading metadata file...')
    meta = pd.read_parquet(args.metafile)
    meta = meta[['event_id', 'azimuth', 'zenith']]
    meta = meta.set_index('event_id')

    # Grab all of the parquet files, shuffle, and split into training/validation sets
    parquets = glob.glob(args.parquetdir + '*.parquet')
    print(f'Found {len(parquets)} parquet files')
    parquets = np.random.permutation(parquets)
    train_parquets = parquets[:int(len(parquets) * args.trainfrac)]
    valid_parquets = parquets[int(len(parquets) * args.trainfrac):]
    print(f'Split into {len(train_parquets)} training parquets and {len(valid_parquets)} validation parquets')

    # For each parquet file, pre-process the dataset and save to file
    for parquet in train_parquets:
        print(f'Processing train {parquet}')
        process_data(parquet, split_name='train')
    for parquet in valid_parquets:
        print(f'Processing valid {parquet}')
        process_data(parquet, split_name='valid')

After running this script, we’ll have one pre-processed CSV file per input parquet file. We’ll use these CSV files as inputs to train our neural networks.
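As a quick sanity check of the output (the file name below assumes batch 7 landed in the training split, given the naming scheme in the script), each row of a pre-processed CSV should contain 300 sensor IDs followed by the azimuth and zenith labels, i.e. 302 columns:

import pandas as pd

# Hypothetical output file name -- the actual name depends on which split batch 7 ended up in
sample = pd.read_csv('/Volumes/public/Python/icecube/train_2023-02-27/train_7.csv', header=None)
print(sample.shape)  # expected: (number of events in the batch, 302)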

I have publicly uploaded my pre-processed CSV datasets to Kaggle: IceCube Training Data and IceCube Validation Data. If you prefer using SQL, I have also uploaded them as a SQLite database: IceCube SQL Data.

Just prior to training, data are read from the CSV files into memory, truncated (if desired), and converted to PyTorch tensors. A Dataset class is defined to read in a single row from the tensor at a time, and a DataLoader is defined to load a batch of rows.

For example, this is how we define the validation set:

class IceCubeDataset(Dataset):
    def __init__(self, data, targ):
        self.data = data
        self.targ = targ

    def __len__(self):
        return len(self.targ)

    def __getitem__(self, idx):
        return self.data[idx], self.targ[idx]

# valid_filepath, args, and pinning_mem are defined elsewhere in the training script
valid_df = pd.read_csv(valid_filepath, names=range(302))
valid_data = torch.tensor(valid_df.iloc[:, :args.trunc].values, dtype=torch.long)
valid_targ = torch.tensor(valid_df.iloc[:, -2:].values, dtype=torch.float32)
valid_dataset = IceCubeDataset(valid_data, valid_targ)
valid_dl = DataLoader(valid_dataset, batch_size=args.bs, shuffle=False, num_workers=args.workers, pin_memory=pinning_mem)

I ended up assigning only a single CSV file as my validation set, since more were unnecessary. All remaining 659 CSV files were assigned to the training set. In each epoch, I would iterate through a single CSV file once, and then move on to the next CSV file in the following epoch. In total, this meant that the model could train for 659 epochs without ever seeing the same neutrino event more than once, which essentially eliminates the risk of overfitting to the training events.
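Here's a rough sketch of that schedule; train_files, build_dataloader, train_one_epoch, and evaluate are hypothetical stand-ins for pieces of my actual training script:

# One pre-processed CSV file per epoch
for epoch, csv_path in enumerate(train_files):
    train_dl = build_dataloader(csv_path, batch_size=args.bs, shuffle=True)
    train_loss = train_one_epoch(model, train_dl, optimizer, scheduler, loss_fn)
    valid_loss, valid_angular_error = evaluate(model, valid_dl, loss_fn)
    print(f'epoch {epoch}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}, '
          f'valid angular error {valid_angular_error:.3f} rad')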

Regarding the type of sequential model, I experimented with vanilla RNNs, GRUs, and LSTMs. The latter two outperformed the vanilla RNN by a similar margin, so I proceeded with the GRU for all subsequent experiments.

Here’s the model class for the GRU network:

class GRUModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_layers, fc1_dim, output_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.fc1_dim = fc1_dim
        self.output_size = output_size

        self.embedding = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.emb_dim, padding_idx=0)

        self.gru = nn.GRU(input_size=self.emb_dim, hidden_size=self.hidden_dim, num_layers=self.n_layers, 
                          bias=True, batch_first=True, bidirectional=False)
        for name, param in self.gru.named_parameters():
            if 'bias' in name:
                nn.init.constant_(param, 0.0)
            elif 'weight_ih' in name:
                nn.init.xavier_uniform_(param, nn.init.calculate_gain('tanh'))
            elif 'weight_hh' in name:
                nn.init.orthogonal_(param, nn.init.calculate_gain('tanh'))

        self.fc1 = nn.Linear(self.hidden_dim, self.fc1_dim)
        for name, param in self.fc1.named_parameters():
            if 'bias' in name:
                nn.init.constant_(param, 0.0)
            elif 'weight' in name:
                nn.init.xavier_uniform_(param, gain=nn.init.calculate_gain('relu'))

        self.relu = nn.ReLU()

        self.fc2 = nn.Linear(self.fc1_dim, self.output_size)
        for name, param in self.fc2.named_parameters():
            if 'bias' in name:
                nn.init.constant_(param, 0.0)
            elif 'weight' in name:
                nn.init.xavier_uniform_(param, gain=1)

    def forward(self, x):
        seq_lengths = torch.count_nonzero(x, dim=1)
        gru_output, _ = self.gru(self.embedding(x))
        output_last = torch.stack([gru_output[i, l - 1, :] for i, l in enumerate(seq_lengths)], dim=0)
        final_output = self.fc2(self.relu(self.fc1(output_last)))

        return final_output

The first layer is an embedding layer; each sequence element (which is an IceCube sensor ID) is converted to a vector of numbers via a lookup table. This embedding vector is a quantitative representation of each sensor ID in the neural network. The second layer is the GRU (see torch.nn.GRU). The third layer is a fully connected layer with a ReLU activation, and the fourth layer is the output layer.
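For example, instantiating the model and running a dummy batch through it might look like this (a sketch, not my exact training script; the vocabulary size is 5161 to cover sensor IDs 1-5160 plus the padding token 0, and the output size is 2 for azimuth and zenith):

import torch

model = GRUModel(vocab_size=5161, emb_dim=10, hidden_dim=50, n_layers=1, fc1_dim=30, output_size=2)

# Dummy batch: 4 padded sequences of length 300, each with 25 real (nonzero) sensor IDs
dummy_batch = torch.zeros(4, 300, dtype=torch.long)
dummy_batch[:, :25] = torch.randint(1, 5161, (4, 25))

print(model(dummy_batch).shape)  # torch.Size([4, 2])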

Training was done using the PyTorch framework, experiment tracking was performed using Neptune.ai, and the hardware was either my local machine (CPU) or a Lambda Labs instance (GPU).

Here are some of the hyperparameters corresponding to my best models (a minimal setup sketch follows the list):

  • L1Loss for the loss function
  • AdamW for the optimizer
  • Cosine annealing for the scheduler
  • Embedding dimension of 10
  • GRU with 50 hidden units and 1 layer
  • Fully connected layer of 30 units
  • Batch size of 1024
  • Learning rate of 0.006

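A minimal sketch of how these pieces might be wired together in PyTorch (model is the GRUModel instance from above; the T_max value here is illustrative):

import torch
import torch.nn as nn

loss_fn = nn.L1Loss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.006)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=659)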
These plots show the loss vs. epoch (top) and the mean angular error vs. epoch (bottom), for both training and validation sets:

training curves: loss vs. epoch
training curves: mean angular error vs. epoch

The GRU model converges to about 1.20 radians for the mean angular error, which is only slightly better than the results from PCA.

Overall, I would say that the biggest gains in model improvement were obtained by (1) upgrading from a vanilla RNN to a GRU or LSTM, and (2) training on the entire dataset. The remaining hyperparameters only had minor effects.