CUDA-Demux

GPU-accelerated BCL to FASTQ demultiplexing

CUDA-Demux converts Illumina BCL/CBCL run folders directly into per-sample FASTQ files on the GPU. CBCL ingestion runs in parallel across cycles with OpenMP, base calling and barcode matching execute as CUDA kernels, and matched reads stream straight into per-(sample, lane) gzipped FASTQ writers — no per-cluster std::string allocations, no batched-then-merged buffers.

Features

GPU Acceleration

Base decoding and barcode matching run as CUDA kernels; barcodes are stored 2-bits-per-base in __constant__ memory and compared with a popcount-Hamming kernel.
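
A minimal sketch of that kernel, assuming barcodes of up to 32 bases packed 2 bits per base into one 64-bit word (MAX_SAMPLES, hamming2bit, and match_barcodes are illustrative names, not the project's actual API):

#include <climits>

#define MAX_SAMPLES 384                                   // illustrative upper bound

__constant__ unsigned long long c_barcodes[MAX_SAMPLES];  // 2-bit-packed sample barcodes

// XOR leaves a non-zero 2-bit group wherever bases differ; collapse each group
// to a single bit, then popcount to get the number of mismatching bases.
__device__ inline int hamming2bit(unsigned long long a, unsigned long long b) {
    unsigned long long x = a ^ b;
    x = (x | (x >> 1)) & 0x5555555555555555ULL;
    return __popcll(x);
}

__global__ void match_barcodes(const unsigned long long* reads, int n_reads, int n_samples,
                               int* best_sample, int* best_dist, int* second_dist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_reads) return;
    unsigned long long r = reads[i];
    int best = INT_MAX, second = INT_MAX, best_s = -1;
    for (int s = 0; s < n_samples; ++s) {                 // barcodes sit in __constant__ memory
        int d = hamming2bit(r, c_barcodes[s]);
        if (d < best)        { second = best; best = d; best_s = s; }
        else if (d < second) { second = d; }
    }
    best_sample[i] = best_s;
    best_dist[i]   = best;
    second_dist[i] = second;
}

The per-read best and runner-up distances are exactly what the Hamming-gap rule below consumes.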

Streaming Output

Structure-of-arrays (SoA) per-cycle BCL buffers feed GPU batches; matched reads are written directly to per-sample gzip streams as each batch completes, with no host-side accumulation of all reads.

NovaSeqX Scale

Validated on a full 2-lane NovaSeqX run (~2.06 B clusters, ~105 GB CBCL input) producing ~144 GB of gzipped FASTQ.

Parallel CBCL Ingest

OpenMP-parallel decompression across cycles; on a 48-core host, the per-tile zlib stage is no longer serial.
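
A rough illustration of that stage, assuming one compressed blob and a known decompressed size per cycle (the names are placeholders, and zlib's uncompress() stands in for the real per-tile CBCL inflate loop):

#include <cstdint>
#include <cstdio>
#include <vector>
#include <zlib.h>

// decoded must already hold one entry per cycle; each thread only touches
// its own cycle, so the loop parallelises without locking.
void inflate_cycles(const std::vector<std::vector<uint8_t>>& compressed,
                    std::vector<std::vector<uint8_t>>& decoded,
                    const std::vector<size_t>& decoded_sizes) {
    #pragma omp parallel for schedule(dynamic)
    for (long c = 0; c < static_cast<long>(compressed.size()); ++c) {
        uLongf out_len = decoded_sizes[c];
        decoded[c].resize(out_len);
        int rc = uncompress(decoded[c].data(), &out_len,
                            compressed[c].data(), compressed[c].size());
        if (rc != Z_OK) {
            #pragma omp critical
            std::fprintf(stderr, "cycle %ld: zlib error %d\n", c, rc);
        }
    }
}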

Hamming-Gap Matching

Reads are assigned to a sample only when the best barcode hit is ≤ 1 mismatch and at least one mismatch better than the runner-up — ambiguous reads go to undetermined.
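
The rule itself is small; a sketch with illustrative names, driven by the per-read best and second-best distances from the matching kernel:

// Accept the best hit only if it is within 1 mismatch and strictly better than the
// runner-up by at least 1 mismatch; anything else is ambiguous -> undetermined.
inline int assign_sample(int best_sample, int best_dist, int second_dist) {
    const int kMaxMismatches = 1;
    const int kMinGap        = 1;
    if (best_dist <= kMaxMismatches && second_dist - best_dist >= kMinGap)
        return best_sample;
    return -1;   // undetermined
}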

Multi-Lane & Paired-End

Filter files (s_<lane>_*.filter), i5 reverse-complement detection, and per-lane FASTQ output (L001, L002, …) are handled automatically.

Performance

End-to-end run on a NovaSeqX dataset (2 lanes, 172 cycles, 1.18 B + 1.18 B raw clusters):

  • 2,057,911,943 post-QC clusters demultiplexed across both lanes
  • 1,696,296,371 reads matched to 18 samples (82.4 %) with the 1-mismatch + Hamming-gap rule
  • 144 GB of gzipped paired-end FASTQ produced (R1 + R2, per sample, per lane)
  • CBCL ingest parallelised over 172 cycles using all available CPU threads
  • GPU decode + match runs in ~8 M-cluster batches sized automatically against free device memory (see the sketch below)
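
The auto-tuning can be approximated as follows; pick_batch_size, bytes_per_cluster, and mem_fraction are illustrative names (mem_fraction corresponds to the --gpu-mem-fraction flag described under Usage):

#include <cuda_runtime.h>
#include <cstddef>

size_t pick_batch_size(double mem_fraction, size_t bytes_per_cluster, size_t max_clusters) {
    size_t free_bytes = 0, total_bytes = 0;               // total is unused here
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return 1 << 20;                                   // conservative fallback if the query fails
    size_t budget = static_cast<size_t>(free_bytes * mem_fraction);
    size_t batch  = budget / bytes_per_cluster;           // clusters that fit in the budget
    return batch < max_clusters ? batch : max_clusters;
}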

Tested on an NVIDIA RTX A4500 (20 GB, compute capability 8.6).

Downloads

Latest stable release: v1.1.0 — streaming SoA pipeline, GPU-resident demux, OpenMP CBCL ingest, multi-lane filter fix. Validated end-to-end on a 2-lane NovaSeqX run (~2.06 B clusters).

Installation

Prerequisites

An NVIDIA GPU with a CUDA-capable driver is required. The Docker route below bundles the full CUDA toolchain; building from source additionally needs the CUDA toolkit plus the packages listed under Build from source (CMake, Ninja, g++, tinyxml2, OpenMP, zlib).

Docker (recommended)

The repository ships a Dockerfile that pins a compatible nvidia/cuda:13.0.1-devel-ubuntu24.04 toolchain, installs all build dependencies, runs the unit tests, and produces a ready-to-run image:

# Build the image (sets CUDA arch to 8.6 by default; override with --build-arg CUDA_ARCH=80)
git clone https://github.com/mmorri/cuda-demux.git
cd cuda-demux
docker build -t cuda-demux:dev .

# Run on a host run folder; mount input read-only and an output dir
docker run --rm --gpus all \
  -v /path/to/RunFolder:/work/run:ro \
  -v /path/to/output:/work/out \
  cuda-demux:dev \
  --input /work/run \
  --samplesheet /work/run/SampleSheet.csv \
  --output /work/out \
  --gzip

The image's ENTRYPOINT is the cuda-demux binary, so any of the CLI flags below can be appended to docker run.

Package installation (v1.1.0)

Debian / Ubuntu

wget https://github.com/mmorri/cuda-demux/releases/download/v1.1.0/cuda-demux_1.1.0_amd64.deb
sudo apt install ./cuda-demux_1.1.0_amd64.deb
# Or:
sudo dpkg -i cuda-demux_1.1.0_amd64.deb
sudo apt-get install -f

Fedora / RHEL / CentOS

wget https://github.com/mmorri/cuda-demux/releases/download/v1.1.0/cuda-demux-1.1.0-1.x86_64.rpm
sudo dnf install ./cuda-demux-1.1.0-1.x86_64.rpm
# Older systems:
sudo yum install ./cuda-demux-1.1.0-1.x86_64.rpm

Build from source

# Debian/Ubuntu — install build dependencies
sudo apt install -y \
    cmake ninja-build \
    g++-14 \
    libtinyxml2-dev libomp-dev zlib1g-dev

git clone https://github.com/mmorri/cuda-demux.git
cd cuda-demux

cmake -S . -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=ON \
    -DCMAKE_CUDA_ARCHITECTURES=86         # 80 for A100, 89 for RTX 40xx

cmake --build build -j
ctest --test-dir build --output-on-failure

# Resulting binary:
./build/cuda-demux --input ... --samplesheet ... --output ...

Usage

Basic command

cuda-demux \
    --input /path/to/RunFolder \
    --samplesheet /path/to/SampleSheet.csv \
    --output /path/to/fastq_output \
    --gzip

--input must point at the run folder root (the directory containing RunInfo.xml, RunParameters.xml, and Data/Intensities/BaseCalls/). The sample sheet is an Illumina v2 SampleSheet ([BCLConvert_Data]) or a legacy v1 sheet ([Data]).

Command-line options

Option Description
--input <dir> Illumina run folder (required).
--samplesheet <csv> Sample sheet path (required).
--output <dir> Destination directory for FASTQ files (required; created if absent).
--gzip Write .fastq.gz instead of .fastq.
--batch-size <N> Override the auto-tuned GPU batch size (clusters per GPU pass).
--gpu-mem-fraction <F> Fraction of free GPU memory the auto-tuner is allowed to use (0.05 – 0.95, default 0.40).
--device <id> CUDA device index (default 0).
--no-adaptive-probe Disable shrink-on-OOM batch-size probing.

Barcode mismatch tolerance is fixed at ≤ 1 with the Hamming-gap rule (best must beat second-best by ≥ 1 mismatch). CPU thread count for the CBCL ingest stage is controlled by OMP_NUM_THREADS (defaults to all cores).

Example

cuda-demux \
    --input /data/runs/250712_NovaSeq_RunA \
    --samplesheet /data/runs/250712_NovaSeq_RunA/SampleSheet.csv \
    --output /scratch/RunA_fastq \
    --gzip \
    --device 0 \
    --gpu-mem-fraction 0.5

Environment variables

All CLI tunables are also available as environment variables (each CLI flag simply sets the corresponding variable before the pipeline starts). A few additional knobs exist only as environment variables:

Variable Purpose
CUDA_DEMUX_BATCH_SIZE Equivalent to --batch-size.
CUDA_DEMUX_MEM_FRACTION Equivalent to --gpu-mem-fraction.
CUDA_DEMUX_DEVICE Equivalent to --device.
CUDA_DEMUX_NO_ADAPTIVE Set to non-zero to disable adaptive probing.
CUDA_DEMUX_VERBOSE Set to non-zero for the per-tile / per-cycle diagnostic log.
CUDA_DEMUX_I5_RC Force i5 reverse-complement on (1) or off (0); overrides the platform heuristic from RunParameters.xml.
CUDA_DEMUX_TRY_BOTH_I5 Probe both i5 orientations on the first ~65 k clusters of each lane and keep whichever matches more reads (sketched after this table).
OMP_NUM_THREADS Number of CPU threads for the parallel CBCL ingest stage.
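
As a rough illustration of the CUDA_DEMUX_TRY_BOTH_I5 probe (revcomp and count_matches are placeholders; count_matches stands in for running the normal matcher over the ~65 k probe clusters of a lane):

#include <algorithm>
#include <string>
#include <vector>

static std::string revcomp(const std::string& s) {
    std::string out(s.rbegin(), s.rend());                 // reverse
    for (char& c : out)                                    // then complement
        switch (c) { case 'A': c = 'T'; break; case 'C': c = 'G'; break;
                     case 'G': c = 'C'; break; case 'T': c = 'A'; break; }
    return out;
}

// Returns true when the reverse-complemented i5 list matches more probe reads
// than the list as given in the sample sheet.
bool use_i5_revcomp(const std::vector<std::string>& i5,
                    size_t (*count_matches)(const std::vector<std::string>&)) {
    std::vector<std::string> rc(i5.size());
    std::transform(i5.begin(), i5.end(), rc.begin(), revcomp);
    return count_matches(rc) > count_matches(i5);
}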

Output layout

Files in the output directory are named:

<Sample_ID>_L<lane>_R1_001.fastq[.gz]
<Sample_ID>_L<lane>_R2_001.fastq[.gz]   # paired-end runs only
undetermined_L<lane>_R1_001.fastq[.gz]
undetermined_L<lane>_R2_001.fastq[.gz]

One pair of files per sample per lane. Reads that did not match any sample within the mismatch / Hamming-gap budget are written to undetermined_*. Records are emitted in input cluster order; the writer keeps each gzip stream open across all GPU batches and checks every gzwrite/fwrite return value, so a disk-full or quota error fails loudly rather than silently truncating.
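
A sketch of that write path (GzStream is an illustrative wrapper, not the project's actual class): the stream for each (sample, lane) is opened once, reused across every GPU batch, and every gzwrite return value is checked.

#include <stdexcept>
#include <string>
#include <zlib.h>

struct GzStream {
    gzFile f = nullptr;
    explicit GzStream(const std::string& path) {
        f = gzopen(path.c_str(), "wb");
        if (!f) throw std::runtime_error("cannot open " + path);
    }
    GzStream(const GzStream&) = delete;                   // one owner per stream
    void write(const char* data, size_t len) {
        // gzwrite returns the number of uncompressed bytes consumed, or 0 on error,
        // so a short count means the write failed (disk full, quota, I/O error).
        if (gzwrite(f, data, static_cast<unsigned>(len)) != static_cast<int>(len))
            throw std::runtime_error("gzwrite failed; refusing to truncate output");
    }
    ~GzStream() { if (f) gzclose(f); }
};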

Contributing

Contributions are welcome — issues, feature requests, and pull requests on the GitHub repository.