matchms.similarity.BinnedEmbeddingSimilarity module

class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)

Bases: BaseEmbeddingSimilarity

Compare spectra by the cosine or Euclidean similarity of their binned intensities.

Spectra are converted to fixed-length vectors by summing intensities in equally spaced m/z bins. Each vector is normalized to its maximum bin intensity when that maximum is positive. Empty spectra, spectra without peaks in the configured m/z range, and spectra with only zero intensities produce a zero vector instead of NaNs.

Parameters:
  • similarity – Similarity measure used for comparing embeddings. Supported values are "cosine" and "euclidean".

  • max_mz – Maximum m/z value to include. Peaks with m/z outside [0, max_mz] are ignored.

  • bin_width – Width of each m/z bin.

  • intensity_power – Power applied to peak intensities before binning.
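The binning and normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the helper name `bin_spectrum`, the bin count `ceil(max_mz / bin_width)`, and the clamping of a peak at exactly `max_mz` into the last bin are all assumptions made for the example.

```python
import math


def bin_spectrum(mz, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Hypothetical helper: sum powered intensities into equal-width m/z bins."""
    n_bins = math.ceil(max_mz / bin_width)  # assumed bin count
    vector = [0.0] * n_bins
    for m, i in zip(mz, intensities):
        if not 0.0 <= m <= max_mz:
            continue  # peaks outside [0, max_mz] are ignored
        # Assumed edge handling: a peak at exactly max_mz goes into the last bin.
        idx = min(int(m // bin_width), n_bins - 1)
        vector[idx] += i ** intensity_power
    peak = max(vector)
    if peak > 0:
        # Normalize to the maximum bin intensity only when it is positive.
        vector = [v / peak for v in vector]
    return vector  # remains all zeros if nothing was binned


# Two peaks share bin 100; the peak at m/z 2000 lies outside the range.
vec = bin_spectrum([100.2, 100.7, 250.0, 2000.0], [10.0, 30.0, 20.0, 99.0])
empty = bin_spectrum([], [])
```

Here bin 100 holds the largest summed intensity and is scaled to 1.0, bin 250 ends up at 0.5, and the empty spectrum yields a zero vector rather than NaNs.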

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the input spectra.

Parameters:
  • reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.

  • embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.

  • k (int, optional) – Number of nearest neighbors to use for the ANN index.

  • index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.

  • **index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:
  • ImportError – If pynndescent is not installed.

  • ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray

Convert spectra into binned embeddings.

Parameters:

spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.

Returns:

Array of shape (n_spectra, n_bins) containing the binned embeddings.

Return type:

np.ndarray

compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) → ndarray

Compute a raw NumPy similarity matrix from precomputed embeddings.

This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
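As an illustration of what a raw similarity matrix over precomputed embeddings looks like, the sketch below computes pairwise cosine similarity with NumPy. It is not the library's code: the function name `cosine_similarity_matrix` is hypothetical, and returning 0 for pairs that involve a zero vector is an assumption chosen to avoid NaNs.

```python
import numpy as np


def cosine_similarity_matrix(embeddings_1, embeddings_2=None):
    """Hypothetical helper: pairwise cosine similarity between embedding rows."""
    if embeddings_2 is None:
        embeddings_2 = embeddings_1  # compare the first set against itself
    norms_1 = np.linalg.norm(embeddings_1, axis=1, keepdims=True)
    norms_2 = np.linalg.norm(embeddings_2, axis=1, keepdims=True)
    # Assumed convention: divide zero rows by 1 so they stay zero and score 0.
    unit_1 = embeddings_1 / np.where(norms_1 > 0, norms_1, 1.0)
    unit_2 = embeddings_2 / np.where(norms_2 > 0, norms_2, 1.0)
    return unit_1 @ unit_2.T


# Two orthogonal embeddings and one all-zero embedding.
emb = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
sim = cosine_similarity_matrix(emb)
```

The result has shape (n_1, n_2); in this example the diagonal is 1 for the two non-zero rows and every entry involving the zero row is 0.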

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → tuple[ndarray, ndarray]

Get approximate nearest neighbors for input spectra.

Parameters:
  • query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.

  • k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.
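To make the shape of the returned pair concrete, the sketch below performs an exact (brute-force) top-k cosine search over embeddings. It stands in for an ANN index query and does not use pynndescent; the function name `brute_force_neighbors`, the (n_queries, k) output shapes, and the k check against the reference size are assumptions made for the example.

```python
import numpy as np


def brute_force_neighbors(query, reference, k):
    """Hypothetical helper: exact top-k cosine neighbors for each query row."""
    if k > reference.shape[0]:
        raise ValueError("k exceeds the number of reference embeddings")
    # Assumes every row is non-zero, so the norms are safe to divide by.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    scores = q @ r.T
    # Sort each row by descending similarity and keep the first k columns.
    order = np.argsort(-scores, axis=1)[:, :k]
    return order, np.take_along_axis(scores, order, axis=1)


reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[1.0, 0.1]])
indices, scores = brute_force_neighbors(query, reference, k=2)
```

For this query the closest reference row is index 0, followed by index 2. An approximate index trades some of this exactness for speed on large reference sets.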

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:
  • spectra – List of spectra to compute embeddings for.

  • npy_path – Path to load embeddings from or save them to. If a file exists at this path, embeddings are loaded from it; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
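The load-or-compute behavior described for npy_path follows a common caching pattern, sketched below with `np.save` and `np.load`. The helper `load_or_compute` and the stand-in `fake_compute` are hypothetical names used only for this illustration and are not part of matchms.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(compute, npy_path=None):
    """Hypothetical helper: reuse cached embeddings if the file exists."""
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    embeddings = compute()
    if npy_path is not None:
        np.save(npy_path, embeddings)  # cache for the next call
    return embeddings


calls = []


def fake_compute():
    """Stand-in for an expensive embedding computation."""
    calls.append(1)
    return np.eye(2)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(fake_compute, path)   # computes and saves
    second = load_or_compute(fake_compute, path)  # loads from disk
```

The second call reads the cached file, so the stand-in computation runs only once.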

get_index_anns() → tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If an unsupported index backend is used.

property is_structured_score: bool

Return True if this similarity uses a structured score dtype.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:

path (Union[str, Path]) – Path to load the index from.

Returns:

The loaded ANN index.

Return type:

Any

Raises:

ValueError – If the similarity metric of the loaded index doesn’t match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to the numpy file.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If the loaded array is not 2D.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compute similarity matrix between spectra_1 and spectra_2.

Parameters:
  • spectra_1 – First collection of spectra.

  • spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.

  • score_fields – Requested score fields. Embedding similarities expose only ("score",).

  • progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.

Returns:

Similarity matrix.

Return type:

Scores

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

property n_bins: int

Number of bins used for each embedding vector.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:
  • spectrum_1 (SpectrumType) – Reference spectrum.

  • spectrum_2 (SpectrumType) – Query spectrum.

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:

path (Union[str, Path]) – Path to save the index to.

Raises:

ValueError – If no index exists to save.

score_datatype

alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True)

Sparse score computation is not available for this similarity.

static store_embeddings(npy_path: str | Path, embeddings: ndarray) → None

Store embeddings in a numpy file.

Parameters:
  • npy_path (Union[str, Path]) – Path to save the embeddings to.

  • embeddings (np.ndarray) – Embeddings array to store.

to_dict() → dict

Return a dictionary representation of the similarity function.