
Classification of RSNA Radiology Images

Squirrel looking at a radiology image, AI-generated using DALL·E

Introduction

This blog post covers my work on image classification of a mammography dataset provided by the Radiological Society of North America (RSNA) using deep learning with the PyTorch framework, based on a recent Kaggle competition. The goal was to create a model that could automatically identify breast cancer in screening mammograms, which could help radiologists work more efficiently and improve the quality of patient care.

In the following sections, I’ll describe my data pre-processing and loading steps, training workflow, and inference implementation, sharing relevant code snippets along the way.

Data Pre-Processing

The mammography dataset is about 315 GB in size, with 54,706 images belonging to 11,913 patients. Each patient typically had multiple images due to different views and laterality. Only about 2% of the images were cancerous, and such class imbalance is not unexpected in the medical domain. Each image is stored in DICOM format, which is the standard for medical imaging data.

Before pre-processing the data, I split the dataset into 8 parts so that I could process them separately. I used the training CSV file to determine which images to split, and saved the splits to file:

import pickle

import numpy as np
import pandas as pd

# Load the training data
df_csv = pd.read_csv('/Volumes/public/Python/rsna/train.csv', low_memory=False)

# Edit the DataFrame to include the filename
df_csv['patient_id'] = df_csv['patient_id'].astype('string')
df_csv['image_id'] = df_csv['image_id'].astype('string')
df_csv.insert(loc=0, column='fname', value=df_csv['patient_id']+'/'+df_csv['image_id']+'.dcm')
display(df_csv.sample(5))
print(f'There are {len(df_csv)} images.')

# Make splits
train_list = df_csv['fname'].tolist()
splitted_parts = np.array_split(train_list, 8)
for part in splitted_parts:
    print(f'Split: {len(part)} images.')

# Save splits to file
with open('/Volumes/public/Python/rsna/train_2048x1024_splits.pkl', 'wb') as f:
    pickle.dump(splitted_parts, f)

# Print the name of the first split's first image
print(splitted_parts[0][0])
There are 53223 images.
Split: 6653 images.
Split: 6653 images.
Split: 6653 images.
Split: 6653 images.
Split: 6653 images.
Split: 6653 images.
Split: 6653 images.
Split: 6652 images.
10006/462822612.dcm
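
Since each split is processed in its own session, the saved pickle can simply be reloaded at the start of each run; here's a minimal sketch of that round trip:

import pickle

# Reload the splits saved above (one processing session per split)
with open('/Volumes/public/Python/rsna/train_2048x1024_splits.pkl', 'rb') as f:
    splitted_parts = pickle.load(f)

print(f'Loaded {len(splitted_parts)} splits; the first split contains {len(splitted_parts[0])} images.')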

To pre-process the dataset, I performed the following steps for every image:

  • Reading in the image array using dicomsdl (GitHub)
  • Performing region-of-interest cropping to remove the substantial amount of empty space around the breast tissue
  • Padding either the vertical or horizontal axis to arrive at my desired aspect ratio of 2:1 (which suitably represents the aspect ratio of breast tissue in the dataset)
  • Resizing to my desired resolution of 2048x1024; I initially started with small resolutions and worked my way up to this as my final resolution

Here are the functions to perform these actions:

import cv2
import dicomsdl
import numpy as np

# Region-Of-Interest cropping
def crop_coords(img):
    # Otsu's thresholding after Gaussian filtering
    blur = cv2.GaussianBlur(img, (5, 5), 0)
    _, breast_mask = cv2.threshold(blur,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
    
    cnts, _ = cv2.findContours(breast_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(cnts, key = cv2.contourArea)
    x, y, w, h = cv2.boundingRect(cnt)

    return (x, y, w, h)

# Image resizing for desired width or height
def image_resize(image, width=None, height=None, inter=cv2.INTER_LINEAR):
    dim = None
    (h, w) = image.shape[:2]

    if width is None and height is None:
        return image
    if width is None:
        r = height / float(h)
        dim = (int(w * r), height)
    else:
        r = width / float(w)
        dim = (width, int(h * r))
    
    resized = cv2.resize(image, dim, interpolation=inter)

    return resized

# Process a path of DICOM files
def process_dicom(path):
    im = dicomsdl.open(str(path))
    pixels = im.pixelData()

    if im.PhotometricInterpretation == "MONOCHROME1":
        pixels = np.amax(pixels) - pixels
    else:
        pixels = pixels - np.min(pixels)

    if np.max(pixels) != 0:
        pixels = pixels / np.max(pixels)
        pixels = (pixels * 255).astype(np.uint8)

    # Region-Of-Interest cropping
    (x, y, w, h) = crop_coords(pixels)
    img_cropped = pixels[y:y+h, x:x+w]

    # Pad the axis that is needed to arrive at my desired aspect ratio, then resize to the right size
    current_ar = h / w
    if current_ar < 2.0:
        # Pad vertically
        img_resized = cv2.copyMakeBorder(img_cropped, 2 * w - h, 0, 0, 0, cv2.BORDER_CONSTANT)
    else:
        # Pad horizontally
        img_resized = cv2.copyMakeBorder(img_cropped, 0, 0, int(h / 2.0 - w), 0, cv2.BORDER_CONSTANT)
    img_resized = cv2.resize(img_resized, (1024, 2048), interpolation=cv2.INTER_LINEAR)

    return img_resized

Run those functions on one of the eight data splits (takes about 5 hours per split):

import os

# Define the save folder
SAVE_FOLDER = '/Volumes/public/Python/rsna/train_2048x1024/'
os.makedirs(SAVE_FOLDER, exist_ok=True)

# Process each DICOM file and save it as a PNG
for image_filename in splitted_parts[0]:
    image_path = '/Volumes/public/Python/rsna/train_images/'+image_filename
    processed_img = process_dicom(image_path)
    cv2.imwrite(f"{SAVE_FOLDER}{image_filename.replace('.dcm', '.png').replace('/', '_')}", processed_img, [cv2.IMWRITE_PNG_COMPRESSION, 7])

Note that we save each pre-processed image as PNG, which is the format that will be read into memory during training. And that’s it for pre-processing work!
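
As a quick sanity check (optional, and assuming the same SAVE_FOLDER as above), the processed folder can be inspected like this:

from pathlib import Path
import cv2

# Count the processed PNG files and check the shape of one of them
png_files = sorted(Path(SAVE_FOLDER).glob('*.png'))
print(f'Found {len(png_files)} processed images.')

sample = cv2.imread(str(png_files[0]), cv2.IMREAD_GRAYSCALE)
print(f'Sample image shape: {sample.shape}')  # expected: (2048, 1024)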

In case you’re interested, my final processed dataset is publicly available and hosted on Kaggle: RSNA Mammography Dataset 2048x1024.

To view the pre-processed PNG images, here’s some code that picks random images from the training CSV file and plots them in a nice grid:

import matplotlib.pyplot as plt

# Read in the training CSV file
df_csv = pd.read_csv('/Volumes/public/Python/rsna/train.csv', low_memory=False)

# Modify some columns to 'string' type, and create a new column 'fname' containing the filename
df_csv['patient_id'] = df_csv['patient_id'].astype('string')
df_csv['image_id'] = df_csv['image_id'].astype('string')
df_csv.insert(loc=0, column='fname', value=df_csv['patient_id']+'_'+df_csv['image_id']+'.png')

def plot_random_images(img_dir, cancer_flag):
    # Pick 9 random images from the DataFrame
    df_sample = df_csv[df_csv['cancer'] == cancer_flag].sample(9)

    # Plot images
    fig = plt.figure(figsize=(11, 15))
    rows, cols = 3, 3
    for i in range(1, rows * cols + 1):
        fig.add_subplot(rows, cols, i)
        img = plt.imread(img_dir+df_sample.iloc[i-1]['fname'])
        plt.imshow(img, cmap='gray')
        plt.title(df_sample.iloc[i-1]['fname']+'\n'+'cancer: '+str(df_sample.iloc[i-1]['cancer']))
    plt.show()

plot_random_images('/Volumes/public/Python/rsna/train_2048x1024/', cancer_flag=1)

You can specify whether you want to view cancerous or non-cancerous images by setting cancer_flag to 1 or 0, respectively.

Data Loading

The dataset is split into training (75%) and validation (25%) sets using scikit-learn’s StratifiedGroupKFold, which maintains the proportion of cancerous images in each split (i.e., stratification) while keeping all of a given patient’s images in the same split, so no patient appears in both.

Once stratified, we can obtain training and validation indices for each fold:

from sklearn.model_selection import StratifiedGroupKFold

sgkf = StratifiedGroupKFold(n_splits=4, shuffle=True)

for fold, (train_ind, valid_ind) in enumerate(sgkf.split(df, df['cancer'], df['patient_id'])):
    ...

I used four folds so that each one corresponds to a 75%/25% train/validation split. Since I was not planning on ensembling k-fold models and didn’t need more robust validation, I trained on only a single fold to speed up the training process.
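
Since only one fold is needed, the simplest approach is to take the first pair of indices produced by the generator above; a minimal sketch:

# Take just the first fold's train/validation indices and build the corresponding DataFrames
train_ind, valid_ind = next(sgkf.split(df, df['cancer'], df['patient_id']))

df_train = df.iloc[train_ind]
df_valid = df.iloc[valid_ind]
print(f'Train: {len(df_train)} images, validation: {len(df_valid)} images.')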

When loading the training dataset, I performed upsampling to increase the chances of having cancerous image(s) in each batch. I experimented with upsampling factors ranging from 3x to 35x; my top models upsampled ~10-15x.

Here’s the function to perform upsampling using PyTorch’s WeightedRandomSampler:

from torch.utils.data import WeightedRandomSampler

def get_sampler(df, idx, up_factor):
    """Returns a sampler based on WeightedRandomSampler. The goal is to oversample the minority class
    to account for the class imbalance.
    """
    targets = df['cancer'][idx].values
    weight_dict = {0: 1.0, 1: up_factor}
    weights = [weight_dict[x] for x in targets]
    sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)

    return sampler

train_sampler = get_sampler(df, train_ind, up_factor=10)

I typically used a batch size of 26, as constrained by my GPU’s memory.

This is the custom PyTorch Dataset class used to read in images and perform transformations (e.g., augmentations, normalization):

import os

from torch.utils.data import Dataset
from torchvision.io import read_image, ImageReadMode


class MammoDataset(Dataset):
    """Define a Dataset class to read in the images and corresponding labels.
    """
    def __init__(self, df, img_dir, transform=None):
        self.img_labels = df
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Column 0 holds the filename ('fname'); column 7 holds the 'cancer' label
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path, mode=ImageReadMode.RGB)
        label = self.img_labels.iloc[idx, 7]
        if self.transform:
            image = self.transform(image)
        return image, label.astype(float), self.img_labels.iloc[idx, 0]

I experimented with image augmentations that were available in torchvision, notably RandomResizedCrop, RandomHorizontalFlip, TrivialAugmentWide, and RandAugment. Their effect was unclear and did not seem to improve training.

Here is an example of a PyTorch image transform that includes both augmentation and normalization:

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
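
Tying these pieces together, here’s a minimal sketch of how the dataset, transform, and weighted sampler feed into PyTorch DataLoaders (df here is the DataFrame used for the fold split above; the num_workers and pin_memory settings are assumptions, and in practice the validation transform would omit the random flip):

from torch.utils.data import DataLoader

IMG_DIR = '/Volumes/public/Python/rsna/train_2048x1024/'

# Build datasets for the fold obtained earlier
train_dataset = MammoDataset(df.iloc[train_ind], IMG_DIR, transform=transform)
valid_dataset = MammoDataset(df.iloc[valid_ind], IMG_DIR, transform=transform)

# The weighted sampler already handles oversampling, so no shuffle on the training loader
train_dataloader = DataLoader(train_dataset, batch_size=26, sampler=train_sampler, num_workers=4, pin_memory=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=26, shuffle=False, num_workers=4, pin_memory=True)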

Model

My strategy was to use transfer learning from pre-trained computer vision models by fine-tuning the weights on my dataset. I investigated model architectures such as EfficientNet_v2_small, EfficientNet_v2_medium, EfficientNet_v2_large, EfficientNet_B0, EfficientNet_B2, EfficientNet_B4, ConvNeXt_v1_tiny, and ConvNeXt_v1_small.

I ended up using ConvNeXt_v1_tiny exclusively due to its performance and small size, using the default weights provided in torchvision.

This function instantiates a ConvNeXt_v1_tiny model:

import torch.nn as nn
import torchvision


def create_convnext_tiny(out_features, device):
    """Creates a ConvNeXt Tiny model.
    """
    weights = torchvision.models.ConvNeXt_Tiny_Weights.DEFAULT
    model = torchvision.models.convnext_tiny(weights=weights).to(device)

    model.classifier = nn.Sequential(
        torchvision.models.convnext.LayerNorm2d((768,), eps=1e-06, elementwise_affine=True),
        nn.Flatten(start_dim=1, end_dim=-1),
        nn.Linear(in_features=768, out_features=out_features, bias=True)
    ).to(device)
    
    model.name = 'ConvNeXt Tiny'
    print(f'Created {model.name} model.')

    return model, weights

ConvNeXt (GitHub, arXiv) was released in 2022 by researchers at Facebook AI Research and UC Berkeley. It’s a convolutional neural network designed to be simple and efficient, yet it competes favorably with vision transformers on multiple computer vision benchmarks.

I also investigated freezing some of the layers’ weights. Starting from a fully frozen model and progressively unfreezing layers from the output of the network towards the input, model performance typically increases, along with training time per epoch and memory usage. My best models had roughly 60-80 of the entries in model.named_parameters() frozen.
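
To illustrate the idea (a minimal sketch rather than my exact recipe; freeze_layers is a hypothetical helper, out_features=1 and num_frozen=70 are illustrative picks, and device is assumed to be defined):

def freeze_layers(model, num_frozen):
    """Freeze the first `num_frozen` entries of model.named_parameters(),
    leaving the later (output-side) parameters trainable.
    """
    for i, (name, param) in enumerate(model.named_parameters()):
        param.requires_grad = i >= num_frozen
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f'{trainable:,} trainable parameters remain after freezing {num_frozen} entries.')

model, weights = create_convnext_tiny(out_features=1, device=device)
freeze_layers(model, num_frozen=70)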

Training

All of my training was performed using the PyTorch framework. Here are some of the details:

  • Epochs: I typically trained for 17 epochs, which was sufficient for the model to converge. One epoch means iterating through the entire dataset once.
  • Early Stopping: This was implemented to remove duds; every once in a while, I would have a model that wouldn’t train for some reason, and it was fairly obvious with a probabilistic F1 score of <0.05 after 2 epochs of training.
  • Learning Rates/Schedulers: My top models used learning rates of ~3e-5 to ~8e-5 and cosine annealing schedulers such as CosineAnnealingLR and CosineAnnealingWarmRestarts.
  • Optimizer/Loss: I used the AdamW optimizer and PyTorch’s BCEWithLogitsLoss loss function. My best models included a positive weight of ~15-25 in the loss function (see the sketch right after this list).
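
Here’s a minimal sketch of how that optimizer, scheduler, and loss could be set up (the learning rate, T_max, and pos_weight values are illustrative picks from the ranges above, and device is assumed to be defined):

import torch
import torch.nn as nn

# Optimize only the parameters that were left unfrozen
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=5e-5)

# Cosine annealing learning rate schedule (T_max is illustrative)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=17)

# Weight the positive (cancer) class more heavily to counter the class imbalance
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([20.0], device=device))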

I was able to speed up training by 2x by implementing PyTorch’s automatic mixed precision (documentation):

# Gradient scaler for mixed-precision training
scaler = torch.cuda.amp.GradScaler()

...

# Run the forward pass and loss computation in float16 where numerically safe
with torch.autocast(device_type='cuda', dtype=torch.float16):
    train_pred_logits = model(X)
    loss = loss_fn(train_pred_logits, y)

train_loss += loss.item()

optimizer.zero_grad()

# Scale the loss to avoid float16 gradient underflow, then step the optimizer and update the scale factor
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

scheduler.step()

When evaluating the model at every epoch with the validation set, I calculated the evaluation metric (i.e., probabilistic F1 score) and the best binarization threshold to maximize that score. Here are the pertinent functions:

def pf1_score(y_true, y_pred, beta):
    """Calculates the probabilistic F1 score (pF1). It accepts probabilities instead of binary classifications.
    """
    # Initialize counts with a small epsilon to avoid division by zero
    y_true_count, ctp, cfp = 1e-6, 1e-6, 1e-6

    for idx in range(len(y_true)):
        prediction = min(max(y_pred[idx], 0), 1)
        if (y_true[idx]):
            y_true_count += 1
            ctp += prediction
        else:
            cfp += prediction

    beta_squared = beta * beta
    c_precision = ctp / (ctp + cfp)
    c_recall = ctp / y_true_count
    if (c_precision > 0 and c_recall > 0):
        result = (1 + beta_squared) * (c_precision * c_recall) / (beta_squared * c_precision + c_recall)
        if isinstance(result, (list, tuple, np.ndarray)):
            result = result[0]
        return result
    else:
        return 0

def pf1_threshold(y_true, y_pred):
    """Finds the best threshold that produces the max pF1 score, by evaluating various threshold levels.
    """
    thresholds = np.arange(0, 1, 0.05)
    list_of_scores = [pf1_score(y_true, np.where(y_pred > t, 1.0, 0.0), beta=1.0) for t in thresholds]
    best_score = max(list_of_scores)
    ind = list_of_scores.index(best_score)
    best_threshold = thresholds[ind]
    
    return best_score, best_threshold, list_of_scores
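
At the end of each validation epoch, these functions are applied to the collected labels and sigmoid probabilities; a minimal usage sketch (val_labels and val_probs are assumed to be 1-D arrays gathered during evaluation):

# val_labels: ground-truth 0/1 labels; val_probs: sigmoid outputs over the validation set
pf1 = pf1_score(val_labels, val_probs, beta=1.0)
best_score, best_threshold, _ = pf1_threshold(val_labels, val_probs)

print(f'Probabilistic F1: {pf1:.4f}')
print(f'Best binarized pF1: {best_score:.4f} at threshold {best_threshold:.2f}')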

Training was performed on cloud GPUs from Kaggle Notebooks (Tesla P100 with 13GB RAM) as well as Lambda Labs (A100 with 40GB RAM). I prepared convenient bash scripts that automatically install required libraries, download pre-processed data, and unzip files every time I start a new GPU instance.

Once I’m done using a cloud GPU, I gather up all of the files that I want to keep into one convenient zip file for easy downloading:

from pathlib import Path
from zipfile import ZipFile

exp_files = Path('./').glob('exp*')
exp_files_list = [x for x in exp_files if x.is_file()]
zipName = 'exp_outputs.zip'
with ZipFile(zipName, 'w') as zipObj:
    for exp_file in exp_files_list:
        # Compare file names so the output zip doesn't get added to itself on re-runs
        if exp_file.name != zipName:
            zipObj.write(exp_file)

To track my experiments, I used Neptune.ai to organize my runs by automatically logging hyperparameters, losses, metrics, and GPU usage.
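
For reference, here’s roughly what the Neptune logging looks like (a sketch based on the current Neptune Python client; older client versions use .log() instead of .append(), and the project name and values are placeholders):

import neptune

# Start a run and log hyperparameters
run = neptune.init_run(project='my-workspace/rsna-mammography', api_token='...')
run['parameters'] = {'lr': 5e-5, 'batch_size': 26, 'up_factor': 10, 'epochs': 17}

# Inside the training loop, append per-epoch metrics
run['train/loss'].append(train_loss)
run['valid/pf1'].append(best_score)

run.stop()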

Inference

During inference, the model makes predictions on test data it has not seen before. The test set must be processed exactly the same way that the training set was processed, which includes both pre-processing steps (e.g., region-of-interest cropping, resizing, and conversion to PNG) and on-memory steps (e.g., data augmentations and normalization).

Submissions for this competition were made through Kaggle Notebooks, which meant that everything had to complete within 9 hours of runtime and internet access had to be disabled. The 9-hour runtime limit was challenging, as the majority of that time was spent reading in the DICOM test images, pre-processing them, and saving them as PNG files, leaving limited time for model evaluation.

I got a ~30% time speedup by moving from 1x P100 to 2x T4, which are the two GPU options that Kaggle offers. To take advantage of two GPUs, I used data parallelism, which is easily implemented in PyTorch by wrapping the model’s forward pass in data_parallel:

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from torch.nn.parallel.data_parallel import data_parallel

...

test_pred = data_parallel(model, X)

I also performed test-time augmentation (TTA): the test images are augmented, each augmented version is run through the model, and the logits from all versions are averaged.

Here’s the code for a TTA horizontal flip:

test_pred = data_parallel(model, X)
test_pred += data_parallel(model, torch.flip(X, dims=[3]))
test_pred /= 2.0

Lastly, I performed thresholding on the results based on the best threshold found during the training process:

model_results['cancer'] = (model_results['cancer'] > threshold).astype('float32')

Given the tight time constraints during inference, I only submitted single models and did not submit any ensembles to further improve my score.

Concluding Thoughts

My probabilistic F1 score on the final leaderboard was 0.38 (top ~30%). The gold medal range was 0.55-0.49, the silver medal range was 0.49-0.43, and the bronze medal range was 0.43-0.41.

Reading some of the top solutions posted by other competitors in the discussion threads, here are some things that I could have done differently:

  • Pre-processing: I could have spent more time experimenting with different ways to pre-process my dataset. For example, the 4th-place solution references a paper concluding that PIL produces the best resizing results, so those competitors chose PIL over cv2 for quality despite its slowness. Additionally, the top solutions performed windowing, which I had skipped.
  • Augmentations: I didn’t spend too much time on augmentations, and I mostly experimented with light augmentation such as horizontal flipping. The top solutions used a wide range of augmentation techniques, including vertical/horizontal flipping, shifting, scaling, rotating, brightness/contrast, cutout, and mixup (e.g., see 3rd-place solution). A popular library for doing such transformations is Kornia.
  • Resolution: The top solutions used lower resolutions than my 2048x1024 images; for example, the 6th-place solution and 9th-place solution both used 1536x768. Some competitors mentioned that trying higher resolutions such as 2048 pixels did not improve their performance. Had I known that the extra resolution was unimportant, I could have trained more efficiently and ran more experiments with a lower resolution.
  • Freezing Layers: It’s possible that the model would have improved further if more layers were unfrozen. According to the 9th-place solution, freezing layers was not a useful strategy. I had hesitated to unfreeze more of my model’s parameters, as doing so would have made training slower and required smaller batch sizes. Instead, I should have followed a procedure similar to this, where the first step is to train only the top layers with a relatively large learning rate (0.01), then unfreeze some layers and fit the model with a smaller learning rate, eventually unfreezing all layers.

I would say that my main challenges were:

  • DICOM conversion times when pre-processing the dataset, especially for the hidden test set given the notebook time limits
  • Class imbalance between cancerous and non-cancerous images, and getting the model to learn to differentiate them
  • At 315 GB, the dataset was the largest I’ve had to deal with so far
  • The probabilistic F1 score was an unstable metric and very sensitive to thresholding; in the future, I might want to avoid competitions with such metrics

This was my first machine learning competition, and overall it was a wonderful learning opportunity due to the unique challenges presented by the competition. I look forward to my next one.