matchms.similarity.FlashSimilarity module

class matchms.similarity.FlashSimilarity.CosineFlash(*args, dtype: dtype = <class 'numpy.float64'>, **kwargs)[source]

Bases: _BaseFlashSimilarity

Flash Cosine similarity following the original Flash Entropy (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it. This corresponds to the “CosineGreedy” scoring logic but with the same fast Flash path as Flash Entropy.

Key options:

matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: remove precursor & > (precursor_mz - 1.6), 1% noise removal,
entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’).
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
intensity_power – The power to raise intensity to in the cosine function. The default is 1 (no weighting).
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn(2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is False.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64 which properly accounts for highest resolution MS/MS data (even far beyond current MS/MS possibilties!). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.

__init__(*args, dtype: dtype = <class 'numpy.float64'>, **kwargs)[source]

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Flash Cosine scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → ndarray[source]

Calculate the similarity for one pair of spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

Returns:

Similarity result for one pair. The returned value should be compatible with self.score_datatype.

Return type:

score

Examples

Scalar score:: return np.asarray(score, dtype=self.score_datatype)
Structured score:: return np.asarray((score, matches), dtype=self.score_datatype)

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.FlashSimilarity.FlashEntropy(*args, normalize_to_half: bool = True, **kwargs)[source]

Bases: _BaseFlashSimilarity

Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.

Key options:

matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: remove precursor & > (precursor_mz - 1.6), 1% noise removal,
entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’).
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn(2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64 which properly accounts for highest resolution MS/MS data (even far beyond current MS/MS possibilties!). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.

__init__(*args, normalize_to_half: bool = True, **kwargs)[source]

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Flash Entropy scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → ndarray[source]

Compute Flash Entropy for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.

Careful: This is not the fast intended use; better .matrix() instead.

score_datatype: alias of float32

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.