matchms.similarity.BinnedEmbeddingSimilarity module
- class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)
Bases: BaseEmbeddingSimilarity
Compare spectra by cosine or Euclidean similarity of binned intensities.
Spectra are converted to fixed-length vectors by summing intensities in equally spaced m/z bins. Each vector is normalized to its maximum bin intensity when that maximum is positive. Empty spectra, spectra without peaks in the configured m/z range, and spectra with only zero intensities produce a zero vector instead of NaNs.
- Parameters:
similarity – Similarity measure used for comparing embeddings. Supported values are "cosine" and "euclidean".
max_mz – Maximum m/z value to include. Values outside [0, max_mz] are ignored.
bin_width – Width of each m/z bin.
intensity_power – Power applied to peak intensities before binning.
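The binning described above can be sketched in plain Python. This is an illustration only, not the matchms implementation: the function name bin_spectrum, the bin count of ceil(max_mz / bin_width), and the placement of a peak at exactly max_mz into the last bin are assumptions made for the example.

```python
import math


def bin_spectrum(mz_values, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Sum peak intensities into equally spaced m/z bins, then max-normalize."""
    # Assumption: the number of bins is ceil(max_mz / bin_width).
    n_bins = math.ceil(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in zip(mz_values, intensities):
        if mz < 0 or mz > max_mz:
            continue  # peaks outside [0, max_mz] are ignored
        # Assumption: a peak exactly at max_mz is counted in the last bin.
        index = min(int(mz // bin_width), n_bins - 1)
        vector[index] += intensity ** intensity_power
    peak = max(vector, default=0.0)
    if peak > 0:
        # Normalize to the maximum bin intensity only when it is positive,
        # so empty or all-zero spectra stay zero vectors instead of NaNs.
        vector = [value / peak for value in vector]
    return vector


embedding = bin_spectrum([100.2, 100.7, 250.0], [10.0, 30.0, 20.0], max_mz=300, bin_width=1)
print(embedding[100], embedding[250])
```

The two peaks near m/z 100 share a bin and are summed before normalization, which is why the peak at m/z 250 ends up at half the maximum.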
- __init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) Any
Build an ANN index for the input spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- compute_embeddings(spectra: Iterable[Spectrum]) ndarray
Convert spectra into binned embeddings.
- Parameters:
spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
- Returns:
Array of shape (n_spectra, n_bins) containing the binned embeddings.
- Return type:
np.ndarray
- compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) ndarray
Compute a raw NumPy similarity matrix from precomputed embeddings.
This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
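As a rough illustration of what a raw similarity matrix over embeddings looks like, here is a pure-Python cosine sketch. It is not the library code: the name cosine_similarity_matrix is hypothetical, scoring zero vectors as 0.0 is an assumption, and the "euclidean" option is left out because this page does not say how distances are turned into scores.

```python
import math


def cosine_similarity_matrix(embeddings_1, embeddings_2=None):
    """Return nested lists of cosine similarities between two sets of vectors."""
    if embeddings_2 is None:
        embeddings_2 = embeddings_1  # compare the first set against itself

    def norm(vector):
        return math.sqrt(sum(value * value for value in vector))

    norms_2 = [norm(vector) for vector in embeddings_2]
    matrix = []
    for row in embeddings_1:
        norm_row = norm(row)
        scores = []
        for column, norm_column in zip(embeddings_2, norms_2):
            if norm_row == 0 or norm_column == 0:
                # Assumption: a zero vector has similarity 0 to everything.
                scores.append(0.0)
            else:
                dot = sum(a * b for a, b in zip(row, column))
                scores.append(dot / (norm_row * norm_column))
        matrix.append(scores)
    return matrix


similarities = cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(similarities)
```

The result has shape (len(embeddings_1), len(embeddings_2)), matching the row/column layout one would expect from a pairwise comparison.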
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) tuple[ndarray, ndarray]
Get approximate nearest neighbors for input spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index has been built or k is larger than the k used to build the index.
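To make the returned pair of arrays concrete, the sketch below does an exact brute-force top-k selection over precomputed similarity rows. This swaps in exhaustive search for the approximate pynndescent index, so it only illustrates the shape of the output. The name top_k_neighbors and the descending-similarity ordering are assumptions.

```python
def top_k_neighbors(similarity_rows, k):
    """For each row of similarity scores, return the k best column indices and scores."""
    indices, scores = [], []
    for row in similarity_rows:
        if k > len(row):
            # Mirrors the documented idea that k cannot exceed what is available.
            raise ValueError("k exceeds the number of reference embeddings")
        # Assumption: neighbors are ordered from most to least similar.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        indices.append(order)
        scores.append([row[j] for j in order])
    return indices, scores


# Two query rows scored against three references (made-up similarity values).
neighbor_indices, neighbor_scores = top_k_neighbors([[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]], k=2)
print(neighbor_indices)
print(neighbor_scores)
```

Both outputs have one row per query and k columns, which corresponds to the "neighbor indices and similarity scores" pair described above.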
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) ndarray
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path used to load or save embeddings. If provided and the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
- get_index_anns() tuple[ndarray, ndarray]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If an unsupported index backend is used.
- load_ann_index(path: str | Path) Any
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If the similarity metric of the loaded index does not match the current metric.
- static load_embeddings(npy_path: str | Path) ndarray
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If the loaded array is not 2D.
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores
Compute similarity matrix between spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.
- Returns:
Similarity matrix.
- Return type:
Scores
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) float
Compute similarity between a pair of spectra.
- Parameters:
spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.
- save_ann_index(path: str | Path) None
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of float64
- sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True)
Sparse score computation is not available for this similarity.