matchms.similarity.BinnedEmbeddingSimilarity module

class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

Bases: BaseEmbeddingSimilarity

Compare spectra by cosine/euclidean similarity of binned intensities.

Spectra are converted to fixed-length vectors by summing intensities in equally spaced m/z bins. Each vector is normalized to its maximum bin intensity when that maximum is positive. Empty spectra, spectra without peaks in the configured m/z range, and spectra with only zero intensities produce a zero vector instead of NaNs.

Parameters:

similarity – Similarity measure used for comparing embeddings. Supported values are "cosine" and "euclidean".
max_mz – Maximum m/z value to include. Peaks with m/z outside [0, max_mz] are ignored.
bin_width – Width of each m/z bin.
intensity_power – Power applied to peak intensities before binning.
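
The binning described above can be sketched in plain Python. This is a minimal illustration and not the matchms implementation: the helper name bin_spectrum, the bin count ceil(max_mz / bin_width), and the placement of a peak exactly at max_mz in the last bin are all assumptions made for the example.

```python
import math


def bin_spectrum(mz_values, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Hypothetical helper: turn one peak list into a fixed-length vector."""
    # Assumption: the number of bins is ceil(max_mz / bin_width).
    n_bins = math.ceil(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in zip(mz_values, intensities):
        if mz < 0 or mz > max_mz:
            continue  # peaks outside [0, max_mz] are ignored
        # Assumption: a peak exactly at max_mz is counted in the last bin.
        index = min(int(mz // bin_width), n_bins - 1)
        # The power is applied to each peak before its intensity is summed.
        vector[index] += intensity ** intensity_power
    peak = max(vector, default=0.0)
    if peak > 0:
        # Normalize to the maximum bin intensity only when it is positive.
        vector = [value / peak for value in vector]
    return vector


# Two peaks share the bin starting at m/z 100; the peak at m/z 2000 is dropped.
embedding = bin_spectrum([100.2, 100.7, 250.0, 2000.0], [10.0, 30.0, 20.0, 99.0])
# An empty spectrum stays a zero vector, so no division by zero occurs.
empty = bin_spectrum([], [])
```

Skipping the normalization when the maximum is not positive is what keeps empty or all-zero spectra from producing NaNs.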

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the input spectra.

Parameters:

reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:

ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray[source]

Convert spectra into binned embeddings.

Parameters: spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
Returns: Array of shape (n_spectra, n_bins) containing the binned embeddings.
Return type: np.ndarray

compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) → ndarray

Compute a raw NumPy similarity matrix from precomputed embeddings.

This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
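
As a rough picture of what a raw similarity matrix over precomputed embeddings looks like, the sketch below computes pairwise cosine similarity with nested lists instead of NumPy arrays. The function name cosine_matrix is hypothetical, and scoring zero vectors as 0.0 is an assumption; the library's own handling of zero vectors and of the "euclidean" option is not specified here.

```python
import math


def cosine_matrix(embeddings_1, embeddings_2=None):
    """Hypothetical helper: pairwise cosine similarity between two sets of vectors."""
    if embeddings_2 is None:
        embeddings_2 = embeddings_1  # compare the first set against itself

    def norm(vector):
        return math.sqrt(sum(value * value for value in vector))

    norms_2 = [norm(vector) for vector in embeddings_2]
    matrix = []
    for row in embeddings_1:
        norm_1 = norm(row)
        scores = []
        for col, norm_2 in zip(embeddings_2, norms_2):
            if norm_1 == 0 or norm_2 == 0:
                # Assumption: a zero vector scores 0.0 rather than NaN.
                scores.append(0.0)
            else:
                dot = sum(a * b for a, b in zip(row, col))
                scores.append(dot / (norm_1 * norm_2))
        matrix.append(scores)
    return matrix


# Three embeddings, the last one a zero vector.
scores = cosine_matrix([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
```

The result has one row per entry of embeddings_1 and one column per entry of embeddings_2.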

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → tuple[ndarray, ndarray]

Get approximate nearest neighbors for input spectra.

Parameters:

query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.
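
For small data sets, an exact brute-force search is a useful baseline for checking approximate results. The sketch below is not how get_anns works internally; the name exact_neighbors, the cosine scoring, and the descending sort order are assumptions. It only illustrates the general shape of the result: one row of neighbor indices and one row of scores per query.

```python
import math


def cosine(u, v):
    """Cosine similarity; zero vectors score 0.0 (assumption)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def exact_neighbors(query_embeddings, reference_embeddings, k):
    """Hypothetical helper: exact top-k neighbors by cosine similarity."""
    if k > len(reference_embeddings):
        raise ValueError("k exceeds the number of reference embeddings")
    all_indices, all_scores = [], []
    for query in query_embeddings:
        scores = [cosine(query, reference) for reference in reference_embeddings]
        # Sort reference positions by descending similarity and keep the first k.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        all_indices.append(order)
        all_scores.append([scores[i] for i in order])
    return all_indices, all_scores


references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = exact_neighbors([[1.0, 0.1]], references, k=2)
```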

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:

spectra – List of spectra to compute embeddings for.
npy_path – Path of the numpy file used to load or save embeddings. If the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
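
The load-or-compute behaviour can be sketched as follows. To keep the example runnable with the standard library only, it stores nested lists as JSON rather than a .npy file, so it is a stand-in and not the library's code. The names get_cached_embeddings and compute are hypothetical, and raising ValueError when the cache file is missing and no items are given is an assumption.

```python
import json
import tempfile
from pathlib import Path


def get_cached_embeddings(compute, items=None, cache_path=None):
    """Hypothetical helper: load embeddings from cache_path or compute and save them."""
    if items is None and cache_path is None:
        raise ValueError("Provide items, a cache path, or both.")
    if cache_path is not None and Path(cache_path).exists():
        # Cache hit: the items are not needed at all.
        return json.loads(Path(cache_path).read_text())
    if items is None:
        # Assumption: a missing cache file without items is an error.
        raise ValueError("Cache file not found and no items to compute from.")
    embeddings = [compute(item) for item in items]
    if cache_path is not None:
        Path(cache_path).write_text(json.dumps(embeddings))
    return embeddings


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    # First call computes the embeddings and writes the cache file.
    first = get_cached_embeddings(lambda x: [x, x * 2.0], items=[1.0, 2.0], cache_path=path)
    # Second call finds the file and ignores the compute function.
    second = get_cached_embeddings(lambda x: [0.0], cache_path=path)
```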

get_index_anns() → tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns: Neighbor indices and similarity scores.
Return type: Tuple[np.ndarray, np.ndarray]
Raises: ValueError – If an unsupported index backend is used.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters: path (Union[str, Path]) – Path to load the index from.
Returns: The loaded ANN index.
Return type: Any
Raises: ValueError – If the similarity metric of the loaded index does not match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters: npy_path (Union[str, Path]) – Path to the numpy file.
Returns: Embeddings array.
Return type: np.ndarray
Raises: ValueError – If the loaded array is not 2D.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compute similarity matrix between spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.

Returns:

Similarity matrix.

Return type:

Scores

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

property n_bins: int: Number of bins used for each embedding vector.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:

spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters: path (Union[str, Path]) – Path to save the index to.
Raises: ValueError – If no index exists to save.

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

static store_embeddings(npy_path: str | Path, embeddings: ndarray) → None

Store embeddings in a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to save the embeddings to.
embeddings (np.ndarray) – Embeddings array to store.

to_dict() → dict: Return a dictionary representation of the similarity function.