matchms.similarity.BlinkCosine module

class matchms.similarity.BlinkCosine.BlinkCosine(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

Bases: BaseSimilarity

BLINK-style approximate cosine similarity for mass spectra with fast .pair() and .matrix(). The score is based on the BLINK method proposed by Harwood et al. (2023, https://www.nature.com/articles/s41598-023-40496-9).

Integer binning with bin_width (Da); tolerance window is ± floor(tolerance/bin_width) bins.
Per-spectrum L2 normalization (after optional mz/intensity weighting).
Blur only one side (queries in .matrix(), smaller spectrum in .pair()).
.pair() returns (score, approximate number of matches); .matrix() returns only scores.
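
The four points above can be illustrated with a dependency-free sketch. It is not the matchms implementation: the helper names bin_and_normalize and blink_like_pair are hypothetical, the bin index is assumed to be floor(mz / bin_width), and the match count is assumed to be the number of overlapping (bin, blurred bin) pairs.

```python
import math
from collections import defaultdict


def bin_and_normalize(mz, intensities, bin_width, mz_power=0.0, intensity_power=1.0):
    """Sum weighted intensities per integer bin, then L2-normalize."""
    binned = defaultdict(float)
    for m, i in zip(mz, intensities):
        # Assumption: a peak's bin index is floor(mz / bin_width).
        binned[math.floor(m / bin_width)] += (m ** mz_power) * (i ** intensity_power)
    norm = math.sqrt(sum(v * v for v in binned.values()))
    return {b: v / norm for b, v in binned.items()} if norm > 0 else {}


def blink_like_pair(ref, qry, tolerance=0.01, bin_width=0.001, clip_to_one=True):
    """Approximate cosine for two spectra given as (mz_list, intensity_list)."""
    radius = math.floor(tolerance / bin_width)  # effective radius R in bins
    a = bin_and_normalize(*ref, bin_width)
    b = bin_and_normalize(*qry, bin_width)
    # Blur only one side: spread each bin of the smaller spectrum over +/- R bins.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    score, matches = 0.0, 0
    for bin_idx, value in small.items():
        for offset in range(-radius, radius + 1):
            other = large.get(bin_idx + offset)
            if other is not None:
                score += value * other
                matches += 1
    if clip_to_one:
        score = min(1.0, max(0.0, score))
    return score, matches
```

Because one side is blurred, a single peak can overlap several neighbouring bins of the other spectrum, which is why the score may exceed 1 before clipping and why the match count is only approximate.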

Parameters:

tolerance – True m/z tolerance (Da). Peaks within +/- tolerance are considered matches. Default 0.01.
bin_width – Discretization width (Da). Default 0.001 (1 mDa). Effective radius R=floor(tolerance/bin_width).
mz_power – Power for mz weighting (intensity *= mz**mz_power). Default 0.0.
intensity_power – Power for intensity weighting before normalization. Default 1.0 (set 0.5 for sqrt scaling).
clip_to_one – Clip score to [0,1]. Default True.
use_numba (bool) – Use numba-accelerated pairwise kernel when available. Default True.
prefilter (bool) – Apply BLINK-like pre-filtering (remove peaks below 1% of the base peak, fragments above the precursor m/z, and zero-intensity peaks). Default True.
min_relative_intensity (float) – Relative base-peak threshold for prefilter. Default 0.01 (1%).
crop_above_precursor (bool) – Drop fragments > precursor m/z if available in metadata. Default True.
remove_zero_intensities (bool) – Remove peaks with intensity <= 0. Default True.
top_k (Optional[int]) – Keep only top-K most intense fragments after other filters (per spectrum). Default None.
batch_size (int) – Number of query spectra per batch in .matrix(); batching applies to the matrix path only. Default 1024.
sparse_score_min (float) – When array_type=‘sparse’, drop scores < sparse_score_min. Default 0.0.
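
The pre-filtering parameters can be read as a small peak-cleaning pipeline. The following is a minimal sketch under stated assumptions: prefilter_peaks is a hypothetical helper, not part of matchms, and the order in which the filters run is assumed rather than taken from the library.

```python
def prefilter_peaks(mz, intensities, precursor_mz=None,
                    min_relative_intensity=0.01, crop_above_precursor=True,
                    remove_zero_intensities=True, top_k=None):
    """Return the (mz, intensity) pairs that survive BLINK-like pre-filtering."""
    peaks = list(zip(mz, intensities))
    if remove_zero_intensities:
        # Remove peaks with intensity <= 0.
        peaks = [(m, i) for m, i in peaks if i > 0]
    if crop_above_precursor and precursor_mz is not None:
        # Drop fragments above the precursor m/z when it is known.
        peaks = [(m, i) for m, i in peaks if m <= precursor_mz]
    if peaks and min_relative_intensity > 0:
        # Drop peaks below the given fraction of the base (most intense) peak.
        base = max(i for _, i in peaks)
        peaks = [(m, i) for m, i in peaks if i >= min_relative_intensity * base]
    if top_k is not None:
        # Keep the top-K most intense peaks, then restore m/z order.
        peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_k]
        peaks.sort(key=lambda p: p[0])
    return peaks
```

For example, with a precursor at 200.0 the peaks (50.0, 0.0), (150.0, 0.5) and (250.0, 80.0) would all be removed next to a base peak (100.0, 100.0): the first has zero intensity, the second is below 1% of the base peak, and the third lies above the precursor.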

__init__(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False)[source]

All-vs-all BLINK-style cosine scores.

Implementation:

Build a global dense bin axis in integer bins from min to max across refs+queries (rows ~ (max_bin - min_bin + 1)), which keeps matrices sparse.
Build a CSR intensity matrix for refs (rows=bins, cols=ref spectra) after per-spectrum L2 normalization.
For queries, build per-batch blurred CSR by expanding each nonzero to its ±R neighbors.
Multiply: scores_batch = (I_ref.T @ I_qry_blur), accumulate into the final output.
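
The batched product can be mimicked without any sparse-matrix library. The sketch below is a dependency-free stand-in for the CSR multiplication, not the library code: blurred_score_matrix is a hypothetical name, and each spectrum is assumed to be given already binned and L2-normalized as a {bin_index: intensity} dict. An inverted index over reference bins plays the role of I_ref, and expanding each query bin to its ±R neighbors plays the role of I_qry_blur.

```python
def blurred_score_matrix(ref_bins, query_bins, radius, batch_size=1024):
    """Return an (n_ref, n_query) nested list of blurred dot products."""
    # Inverted index over references: bin -> [(ref_index, value), ...]
    index = {}
    for r, spectrum in enumerate(ref_bins):
        for b, v in spectrum.items():
            index.setdefault(b, []).append((r, v))

    scores = [[0.0] * len(query_bins) for _ in ref_bins]
    # Process the queries in batches, as .matrix() is described to do.
    for start in range(0, len(query_bins), batch_size):
        stop = min(start + batch_size, len(query_bins))
        for q in range(start, stop):
            for b, v in query_bins[q].items():
                # Blur: each nonzero query bin also hits its +/- radius neighbors.
                for offset in range(-radius, radius + 1):
                    for r, ref_v in index.get(b + offset, ()):
                        scores[r][q] += ref_v * v
    return scores
```

Batching does not change the result; it only bounds how many blurred query columns exist at once, which is presumably why the query side is the one that is blurred and batched.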

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy”, which returns a numpy array; “sparse” returns a COO sparse array.

Returns:

If array_type == ‘numpy’: dense (n_ref, n_query).
If array_type == ‘sparse’: COO sparse (n_ref, n_query), dropping scores < sparse_score_min.

Return type:

numpy.ndarray or scipy.sparse.coo_array
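
To make the difference between the two return types concrete, the following sketch converts a dense score table into COO-style triplets. It is illustrative only: dense_to_coo is a hypothetical helper, and it assumes that exact zeros are never stored in addition to dropping scores below sparse_score_min.

```python
def dense_to_coo(scores, sparse_score_min=0.0):
    """Convert a nested list of scores into (rows, cols, values) triplets."""
    rows, cols, values = [], [], []
    for r, row in enumerate(scores):
        for c, score in enumerate(row):
            # Assumption: zeros are never stored; small scores are dropped too.
            if score != 0.0 and score >= sparse_score_min:
                rows.append(r)
                cols.append(c)
                values.append(score)
    return rows, cols, values
```

Raising sparse_score_min trades completeness for memory: only pairs scoring at or above the threshold are kept in the output.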

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]

Calculate BLINK-style cosine between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
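
The naive for-loop fallback mentioned above might look roughly like this. It is a sketch, not the matchms source: naive_sparse_scores and its pair_fn argument are hypothetical, with pair_fn standing in for a similarity's .pair() method, and the is_symmetric and progress_bar options are left out.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair_fn):
    """Score only the (reference, query) pairs selected by idx_row / idx_col."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    # One pair_fn call per requested (row, column) index pair.
    return [pair_fn(references[r], queries[c]) for r, c in zip(idx_row, idx_col)]
```

The result has one entry per index pair, in the order given, so it can serve as the data vector of a sparse array with idx_row and idx_col as coordinates.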

to_dict() → dict: Return a dictionary representation of a similarity function.