matchms.similarity.FlashSimilarity module

class matchms.similarity.FlashSimilarity.FlashSimilarity(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]

Bases: BaseSimilarity

Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.
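The index-and-stream idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the helper names `build_peak_index` and `score_matrix` are hypothetical, spectra are plain lists of (mz, intensity) tuples, and the per-match contribution is a toy intensity product rather than the entropy score. It also lets one reference peak match several peaks of the same query, which a real scorer would have to resolve.

```python
from bisect import bisect_left, bisect_right


def build_peak_index(queries):
    """Flatten all query peaks into one list sorted by m/z (hypothetical helper).

    `queries` is a list of peak lists; each peak is an (mz, intensity) tuple.
    """
    entries = sorted(
        (mz, q_idx, intensity)
        for q_idx, peaks in enumerate(queries)
        for mz, intensity in peaks
    )
    mzs = [entry[0] for entry in entries]
    return mzs, entries


def score_matrix(references, queries, tolerance=0.02):
    """Accumulate a toy score for every (reference, query) pair via the index."""
    mzs, entries = build_peak_index(queries)
    scores = [[0.0] * len(queries) for _ in references]
    for r_idx, ref_peaks in enumerate(references):
        for mz, intensity in ref_peaks:
            # Binary search for all indexed peaks within +/- tolerance (Da).
            lo = bisect_left(mzs, mz - tolerance)
            hi = bisect_right(mzs, mz + tolerance)
            for _, q_idx, q_intensity in entries[lo:hi]:
                # Toy contribution: intensity product, NOT the entropy score.
                scores[r_idx][q_idx] += intensity * q_intensity
    return scores
```

The index is built once and every reference peak costs only a binary search plus the matched entries, which is why a matrix-style call can amortize work that a pairwise call repeats.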

Key options:

matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: optional precursor removal (drops the precursor peak and all peaks above precursor_mz - 1.6), 1% noise removal,
entropy weighting, normalization to ∑I’ = 0.5, and optional merging of nearby peaks (merge_within).
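The cleanup steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `clean_peaks` is a hypothetical helper, peaks are (mz, intensity) tuples, and the entropy-weighting and peak-merging steps are deliberately left out.

```python
def clean_peaks(peaks, precursor_mz=None, precursor_window=1.6,
                noise_cutoff=0.01, normalize_to_half=True):
    """Illustrative cleanup of a list of (mz, intensity) tuples (hypothetical helper)."""
    # Optional precursor removal: keep only peaks at or below precursor_mz - window.
    if precursor_mz is not None:
        limit = precursor_mz - precursor_window
        peaks = [(mz, i) for mz, i in peaks if mz <= limit]
    # Drop non-positive intensities before computing the noise threshold.
    peaks = [(mz, i) for mz, i in peaks if i > 0]
    # Noise removal: drop peaks below a fraction of the most intense peak.
    if peaks and noise_cutoff > 0:
        threshold = noise_cutoff * max(i for _, i in peaks)
        peaks = [(mz, i) for mz, i in peaks if i >= threshold]
    # Assumption: when normalization is disabled, intensities are left untouched.
    total = sum(i for _, i in peaks)
    if normalize_to_half and total > 0:
        peaks = [(mz, 0.5 * i / total) for mz, i in peaks]
    return peaks
```

For example, `clean_peaks([(100.0, 50.0), (150.0, 0.2), (200.0, 100.0), (299.5, 30.0)], precursor_mz=300.0)` keeps only the peaks at 100.0 and 200.0, rescaled so that their intensities sum to 0.5.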

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

score_type – Score type: ‘spectral_entropy’ (default) or ‘cosine’.
matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’). Choose ‘hybrid’ in combination with score_type=‘cosine’ to compute the modified cosine score.
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn, 2023).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which accommodates even the highest-resolution MS/MS data (well beyond current instrument capabilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.
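To make the difference between a Da and a ppm tolerance concrete, here is a small sketch. The helper `within_tolerance` is hypothetical, and the way the "symmetric" ppm window is computed (scaled by the mean of the two m/z values) is an assumption made for illustration; the library may define it differently.

```python
def within_tolerance(mz_a, mz_b, tolerance=0.02, use_ppm=False):
    """Return True if two m/z values match within the tolerance (hypothetical helper)."""
    if use_ppm:
        # Assumption: symmetric ppm window scaled by the mean of both m/z values,
        # so swapping the two arguments gives the same answer.
        allowed = tolerance * 1e-6 * 0.5 * (mz_a + mz_b)
    else:
        # Absolute window in Da, independent of the m/z range.
        allowed = tolerance
    return abs(mz_a - mz_b) <= allowed
```

At m/z 500, a tolerance of 10 ppm corresponds to roughly 0.005 Da, so two peaks 0.015 Da apart match under the default 0.02 Da setting but not under 10 ppm. The same kind of check could plausibly be applied to precursor m/z values when identity_precursor_tolerance is set.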

__init__(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, n_jobs: int = -1) → ndarray[source]

Calculate matrix of Flash entropy similarity scores.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy”, which returns a numpy array. “sparse” returns a SparseStacked COO-style array.
is_symmetric – If True, the matrix will be symmetric (i.e., references and queries must have the same length). For this class, the setting has no effect on runtime.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
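The n_jobs convention described above ("all available CPUs minus one" for -1) can be sketched as follows. `resolve_n_jobs` is a hypothetical helper, not part of the library, and its handling of positive values (used as given, with a floor of one worker) is an assumption.

```python
import os


def resolve_n_jobs(n_jobs=-1, cpu_count=None):
    """Translate an n_jobs setting into a worker count (hypothetical helper)."""
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cpu_count = os.cpu_count() or 1
    if n_jobs == -1:
        # All available CPUs minus one, but never fewer than one worker.
        return max(1, cpu_count - 1)
    # Assumption: any other value is taken as given, with a floor of one worker.
    return max(1, n_jobs)
```

On an 8-core machine this yields 7 workers for the default setting, leaving one core free for the rest of the system.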

pair(reference: Spectrum, query: Spectrum) → ndarray[source]

Compute Flash similarity for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.

Careful: this is not the intended fast path; prefer .matrix() instead.
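For intuition about what a single pairwise score looks like, the following sketch computes an unweighted entropy similarity for two peak lists. It is a simplified illustration, not the library's code: the function names are hypothetical, peaks are assumed to be (mz, intensity) tuples sorted by m/z with strictly positive intensities, matching is a greedy one-to-one sweep with a Da tolerance, and the entropy-weighting and noise-removal steps are omitted. With both spectra normalized to an intensity sum of 0.5, each matched pair (a, b) contributes (a+b)·log2(a+b) − a·log2(a) − b·log2(b).

```python
import math


def normalize_to_half(peaks):
    """Rescale intensities so they sum to 0.5 (hypothetical helper)."""
    total = sum(intensity for _, intensity in peaks)
    return [(mz, 0.5 * intensity / total) for mz, intensity in peaks]


def entropy_similarity(peaks_a, peaks_b, tolerance=0.02):
    """Unweighted entropy similarity of two sorted (mz, intensity) peak lists."""
    a = normalize_to_half(peaks_a)
    b = normalize_to_half(peaks_b)
    score, i, j = 0.0, 0, 0
    # Two-pointer sweep over both sorted lists; greedy one-to-one matching.
    while i < len(a) and j < len(b):
        mz_a, int_a = a[i]
        mz_b, int_b = b[j]
        if abs(mz_a - mz_b) <= tolerance:
            merged = int_a + int_b
            score += (merged * math.log2(merged)
                      - int_a * math.log2(int_a)
                      - int_b * math.log2(int_b))
            i += 1
            j += 1
        elif mz_a < mz_b:
            i += 1
        else:
            j += 1
    return score
```

Under these assumptions, identical spectra score 1 and spectra without any matching peaks score 0.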

score_datatype: alias of float32

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (e.g. for an all-vs-all comparison). Because score[i,j] = score[j,i], the calculation can then be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
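The naive for-loop fallback mentioned above might look roughly like this. The sketch is an assumption about the general shape only: `naive_sparse_scores` and its `pair_score` callback are hypothetical names, and the actual return type of sparse_array is not specified here, so a plain list of scores is returned.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair_score):
    """Score only the requested (row, col) index pairs with a plain for-loop.

    `pair_score` is any callable taking (reference, query) and returning a score.
    """
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    scores = []
    for row, col in zip(idx_row, idx_col):
        # idx_row indexes into references, idx_col into queries.
        scores.append(pair_score(references[row], queries[col]))
    return scores
```

Each entry of the result corresponds to the pair (idx_row[k], idx_col[k]), so the three sequences together describe a sparse matrix in coordinate form.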

to_dict() → dict: Return a dictionary representation of a similarity function.