CUDA-Demux converts Illumina BCL/CBCL run folders directly into per-sample FASTQ files on the GPU. CBCL ingestion runs in parallel across cycles with OpenMP, base calling and barcode matching execute as CUDA kernels, and matched reads stream straight into per-(sample, lane) gzipped FASTQ writers — no per-cluster std::string allocations, no batched-then-merged buffers.
Features
GPU Acceleration
Base decoding and barcode matching run as CUDA kernels; barcodes are stored 2-bits-per-base in __constant__ memory and compared with a popcount-Hamming kernel.
Streaming Output
SoA per-cycle BCL buffers feed GPU batches; matched reads are written directly to per-sample gzip streams as each batch completes — no host-side accumulation of all reads.
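The structure-of-arrays idea can be sketched as follows (illustrative types, not the project's actual code): one contiguous byte buffer per sequencing cycle, indexed by cluster, so a GPU batch is a plain pointer-plus-offset copy rather than millions of per-cluster strings:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch of a per-cycle SoA layout: calls[cycle][cluster] holds one packed
// byte (2-bit base call plus binned quality), so uploading a batch for a
// cycle is a single contiguous memcpy from calls[cycle].data() + offset.
struct CycleBuffers {
    size_t n_clusters = 0;
    std::vector<std::vector<uint8_t>> calls;

    void resize(size_t cycles, size_t clusters) {
        n_clusters = clusters;
        calls.assign(cycles, std::vector<uint8_t>(clusters));
    }
};
```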
NovaSeqX Scale
Validated on a full 2-lane NovaSeqX run (~2.06 B clusters, ~105 GB CBCL input) producing ~144 GB of gzipped FASTQ.
Parallel CBCL Ingest
OpenMP-parallel decompression across cycles; on a 48-core host, the per-tile zlib stage is no longer serial.
Hamming-Gap Matching
Reads are assigned to a sample only when the best barcode hit is ≤ 1 mismatch and at least one mismatch better than the runner-up; ambiguous reads go to undetermined.
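The assignment rule reduces to a single pass over per-sample mismatch counts; a host-side sketch (illustrative code, not the project's kernel) looks like this, returning -1 for undetermined:

```cpp
#include <vector>
#include <climits>

// Sketch of the Hamming-gap rule: accept the best barcode only if it has
// <= 1 mismatch AND beats the runner-up by at least 1 mismatch.
int assign_sample(const std::vector<int>& mismatches) {
    int best = INT_MAX, second = INT_MAX, best_idx = -1;
    for (int i = 0; i < static_cast<int>(mismatches.size()); ++i) {
        int m = mismatches[i];
        if (m < best)        { second = best; best = m; best_idx = i; }
        else if (m < second) { second = m; }
    }
    if (best <= 1 && second - best >= 1) return best_idx;
    return -1;  // ambiguous or too many mismatches -> undetermined
}
```

For example, mismatch counts {1, 1, 5} are rejected even though a 1-mismatch hit exists, because the tie makes the assignment ambiguous.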
Multi-Lane & Paired-End
Filter files (s_<lane>_*.filter), i5 reverse-complement detection, and per-lane FASTQ output (L001, L002, …) are handled automatically.
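For the i5 case, instruments that read index 2 on the opposite strand require the sample-sheet i5 to be reverse-complemented before matching; a minimal sketch of the transform (illustrative helper, not the project's detection logic, which picks the orientation that matches the data better):

```cpp
#include <string>

// Reverse-complement a barcode string; anything outside ACGT maps to N.
std::string revcomp(const std::string& s) {
    std::string out(s.rbegin(), s.rend());  // reverse first
    for (char& c : out) {
        switch (c) {                        // then complement each base
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default:  c = 'N'; break;
        }
    }
    return out;
}
```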
Performance
End-to-end run on a NovaSeqX dataset (2 lanes, 172 cycles, 1.18 B + 1.18 B raw clusters):
- 2,057,911,943 post-QC clusters demultiplexed across both lanes
- 1,696,296,371 reads matched to 18 samples (82.4 %) with the 1-mismatch + Hamming-gap rule
- 144 GB of gzipped paired-end FASTQ produced (R1 + R2, per sample, per lane)
- CBCL ingest parallelised over 172 cycles using all available CPU threads
- GPU decode + match runs in ~8 M-cluster batches sized automatically against free device memory
Tested on an NVIDIA RTX A4500 (20 GB, compute capability 8.6).
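The automatic batch sizing can be sketched roughly as follows (host-side C++; bytes_per_cluster is an illustrative placeholder, and the real pipeline would query free device memory via the CUDA runtime, e.g. cudaMemGetInfo):

```cpp
#include <cstddef>
#include <algorithm>

// Sketch: cap the batch so decode buffers fit in a user-selected fraction
// of free device memory, with a hard ceiling of ~8M clusters per batch.
size_t batch_clusters(size_t free_bytes, double mem_fraction,
                      size_t bytes_per_cluster) {
    size_t budget = static_cast<size_t>(free_bytes * mem_fraction);
    size_t fit = budget / bytes_per_cluster;          // clusters that fit
    return std::min<size_t>(fit, 8'000'000);          // ceiling from above
}
```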
Downloads
Latest stable release: v1.1.0 — streaming SoA pipeline, GPU-resident demux, OpenMP CBCL ingest, multi-lane filter fix. Validated end-to-end on a 2-lane NovaSeqX run (~2.06 B clusters).
Installation
Prerequisites
- NVIDIA GPU with compute capability 8.0, 8.6, or 8.9 (A100 / RTX 30-series / RTX 40-series; adjust -DCMAKE_CUDA_ARCHITECTURES for other targets)
- CUDA toolkit 12.x or 13.x and a matching NVIDIA driver
- g++ 13 or 14 (CUDA 13 + g++-15 is not yet supported by nvcc)
- CMake ≥ 3.16 and Ninja (or Make)
- zlib, OpenMP runtime, TinyXML2
Docker (recommended)
The repository ships a Dockerfile that pins a compatible nvidia/cuda:13.0.1-devel-ubuntu24.04 toolchain, installs all build dependencies, runs the unit tests, and produces a ready-to-run image:
# Build the image (sets CUDA arch to 8.6 by default; override with --build-arg CUDA_ARCH=80)
git clone https://github.com/mmorri/cuda-demux.git
cd cuda-demux
docker build -t cuda-demux:dev .
# Run on a host run folder; mount input read-only and an output dir
docker run --rm --gpus all \
-v /path/to/RunFolder:/work/run:ro \
-v /path/to/output:/work/out \
cuda-demux:dev \
--input /work/run \
--samplesheet /work/run/SampleSheet.csv \
--output /work/out \
--gzip
The image's ENTRYPOINT is the cuda-demux binary, so any of the CLI flags below can be appended to docker run.
Package installation (v1.1.0)
Debian / Ubuntu
wget https://github.com/mmorri/cuda-demux/releases/download/v1.1.0/cuda-demux_1.1.0_amd64.deb
sudo apt install ./cuda-demux_1.1.0_amd64.deb
# Or:
sudo dpkg -i cuda-demux_1.1.0_amd64.deb
sudo apt-get install -f
Fedora / RHEL / CentOS
wget https://github.com/mmorri/cuda-demux/releases/download/v1.1.0/cuda-demux-1.1.0-1.x86_64.rpm
sudo dnf install ./cuda-demux-1.1.0-1.x86_64.rpm
# Older systems:
sudo yum install ./cuda-demux-1.1.0-1.x86_64.rpm
Build from source
# Debian/Ubuntu — install build dependencies
sudo apt install -y \
cmake ninja-build \
g++-14 \
libtinyxml2-dev libomp-dev zlib1g-dev
git clone https://github.com/mmorri/cuda-demux.git
cd cuda-demux
cmake -S . -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTING=ON \
-DCMAKE_CUDA_ARCHITECTURES=86 # 80 for A100, 89 for RTX 40xx
cmake --build build -j
ctest --test-dir build --output-on-failure
# Resulting binary:
./build/cuda-demux --input ... --samplesheet ... --output ...
Usage
Basic command
cuda-demux \
--input /path/to/RunFolder \
--samplesheet /path/to/SampleSheet.csv \
--output /path/to/fastq_output \
--gzip
--input must point at the run folder root (the directory containing RunInfo.xml, RunParameters.xml, and Data/Intensities/BaseCalls/). The sample sheet is an Illumina v2 SampleSheet ([BCLConvert_Data]) or a legacy v1 sheet ([Data]).
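For reference, a minimal v2-style data section might look like this (sample IDs and index sequences below are made up; column names follow the Illumina v2 convention):

```
[BCLConvert_Data]
Sample_ID,Index,Index2
SampleA,ACGTACGT,TTGCAACC
SampleB,TGCATGCA,GGTTCCAA
```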
Command-line options
Barcode mismatch tolerance is fixed at ≤ 1 with the Hamming-gap rule (best must beat second-best by ≥ 1 mismatch). CPU thread count for the CBCL ingest stage is controlled by OMP_NUM_THREADS (defaults to all cores).
Example
cuda-demux \
--input /data/runs/250712_NovaSeq_RunA \
--samplesheet /data/runs/250712_NovaSeq_RunA/SampleSheet.csv \
--output /scratch/RunA_fastq \
--gzip \
--device 0 \
--gpu-mem-fraction 0.5
Environment variables
All CLI tunables are also reachable as env vars (the CLI
flags simply set them before main runs). A few
additional knobs exist only as env vars:
Output layout
Files in the output directory are named:
<Sample_ID>_L<lane>_R1_001.fastq[.gz]
<Sample_ID>_L<lane>_R2_001.fastq[.gz] # paired-end runs only
undetermined_L<lane>_R1_001.fastq[.gz]
undetermined_L<lane>_R2_001.fastq[.gz]
One pair of files per sample per lane. Reads that did not match any sample within the mismatch / Hamming-gap budget are written to undetermined_*. Records are emitted in input cluster order; the writer keeps each gzip stream open across all GPU batches and checks every gzwrite/fwrite return value, so a disk-full or quota error fails loudly rather than silently truncating.
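The fail-loudly write policy amounts to checking every short write; a sketch using plain fwrite (the gzip path would check gzwrite's return the same way; this is illustrative, not the project's actual writer):

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Raise immediately on any short write (disk full, quota exceeded, ...)
// instead of silently truncating the FASTQ stream.
void checked_write(std::FILE* f, const char* buf, size_t len) {
    if (std::fwrite(buf, 1, len, f) != len) {
        throw std::runtime_error(std::string("short write: ") +
                                 std::strerror(errno));
    }
}
```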
Contributing
Contributions are welcome: open issues, feature requests, and pull requests on the GitHub repository.