matchms.Fingerprints module

class matchms.Fingerprints.Fingerprints(fingerprint_generator, *, ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **config_kwargs)[source]

Bases: object

Compute and store an InChIKey-to-fingerprint mapping for a collection of spectra.

This class is a container for molecular fingerprints keyed by InChIKey. Fingerprints are computed for unique compounds only and stored either as a dense NumPy array or as a SciPy CSR sparse matrix.

Compared to the older implementation, this refactor is designed for larger-scale use cases and delegates fingerprint computation to chemap.

Example

import numpy as np
from rdkit.Chem import rdFingerprintGenerator
from matchms import Fingerprints, Spectrum

spectrum_1 = Spectrum(
    mz=np.array([100, 150, 200.]),
    intensities=np.array([0.7, 0.2, 0.1]),
    metadata={
        "inchikey": "OTMSDBZUPAUEDD-UHFFFAOYSA-N",
        "smiles": "CC",
        "precursor_mz": 150.0,
    },
)
spectrum_2 = Spectrum(
    mz=np.array([100, 150, 200.]),
    intensities=np.array([0.7, 0.2, 0.1]),
    metadata={
        "inchikey": "UGFAIRIUMAVXCW-UHFFFAOYSA-N",
        "smiles": "[C-]#[O+]",
        "precursor_mz": 150.0,
    },
)

spectra = [spectrum_1, spectrum_2]

generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=256)

fpgen = Fingerprints(
    fingerprint_generator=generator,
    count=False,
    folded=True,
    return_csr=False,
)
fpgen.compute_fingerprints(spectra)

print(fpgen.fingerprint_count)
print(type(fpgen.get_fingerprint_by_inchikey("OTMSDBZUPAUEDD-UHFFFAOYSA-N")))

Should output

2
<class 'numpy.ndarray'>

fingerprints

The computed fingerprints as either a NumPy array or SciPy CSR matrix.

inchikeys

Ordered list of unique InChIKeys corresponding to fingerprint rows.

fingerprint_count

Number of unique fingerprints currently stored.

config

Dictionary with configuration used for fingerprint computation.

to_dataframe

pandas DataFrame containing the fingerprints, indexed by InChIKey.

__init__(fingerprint_generator, *, ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **config_kwargs)[source]
Parameters:
  • fingerprint_generator – A chemap-compatible fingerprint generator, for example an RDKit fingerprint generator or a scikit-fingerprints object.

  • ignore_stereochemistry – If True, only the first 14 characters of the InChIKey (the connectivity block) are used as the compound key.

  • count – If True, count fingerprints are computed instead of binary fingerprints.

  • folded – Whether fingerprints should be folded.

  • return_csr – If True, fingerprints are stored as a SciPy CSR matrix. Otherwise they are stored as a dense NumPy array.

  • invalid_policy – Policy passed to chemap for invalid molecular inputs.

  • **config_kwargs – Additional keyword arguments passed into FingerprintConfig.
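
To illustrate what ignore_stereochemistry changes, here is a minimal sketch in plain Python. The helper compound_key is hypothetical and not part of matchms; it only mirrors the rule stated above (keep the first 14 characters). The second InChIKey uses a made-up second block purely for illustration.

```python
def compound_key(inchikey: str, ignore_stereochemistry: bool = False) -> str:
    """Hypothetical helper: return the key used to group spectra by compound."""
    # The first 14 characters of an InChIKey hash the molecular connectivity.
    return inchikey[:14] if ignore_stereochemistry else inchikey


inchikeys = [
    "OTMSDBZUPAUEDD-UHFFFAOYSA-N",  # taken from the class example
    "OTMSDBZUPAUEDD-ABCDEFGHSA-N",  # made-up second block, illustration only
    "UGFAIRIUMAVXCW-UHFFFAOYSA-N",  # taken from the class example
]

full_keys = {compound_key(k) for k in inchikeys}
short_keys = {compound_key(k, ignore_stereochemistry=True) for k in inchikeys}

print(len(full_keys), len(short_keys))  # 3 2
```

With full keys the three entries count as three compounds; with truncated keys the first two collapse into one, so only two fingerprints would need to be computed.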

compute_fingerprint(spectrum: Spectrum)[source]

Compute one fingerprint for a given spectrum.

This does not add the fingerprint to the internal storage. It only computes and returns the fingerprint.

Parameters:

spectrum – A spectrum for which a fingerprint is to be calculated.

Returns:

Fingerprint row, or None if the fingerprint could not be computed.

Return type:

Optional[np.ndarray | scipy.sparse.csr_matrix]

compute_fingerprints(spectra: list[Spectrum])[source]

Compute fingerprints for a list of spectra.

Fingerprints are computed only for unique compounds, keyed by InChIKey. Existing stored fingerprints are replaced.

Parameters:

spectra – List of spectra.
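
The "unique compounds only" behaviour can be pictured with a short plain-Python sketch. It uses dictionaries in place of Spectrum metadata, and the helper unique_inchikeys is hypothetical. Keeping first-seen order and skipping entries without an InChIKey are assumptions made for this illustration, not confirmed details of matchms.

```python
def unique_inchikeys(records):
    """Hypothetical helper: InChIKeys in first-seen order, one per compound."""
    keys = (record.get("inchikey") for record in records)
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(key for key in keys if key))


records = [
    {"inchikey": "OTMSDBZUPAUEDD-UHFFFAOYSA-N", "smiles": "CC"},
    {"inchikey": "UGFAIRIUMAVXCW-UHFFFAOYSA-N", "smiles": "[C-]#[O+]"},
    {"inchikey": "OTMSDBZUPAUEDD-UHFFFAOYSA-N", "smiles": "CC"},  # duplicate compound
    {"smiles": "C"},  # no InChIKey, skipped in this sketch
]

unique_keys = unique_inchikeys(records)
print(unique_keys)
# ['OTMSDBZUPAUEDD-UHFFFAOYSA-N', 'UGFAIRIUMAVXCW-UHFFFAOYSA-N']
```

Four input records yield two keys, so in this picture only two fingerprint rows would be stored.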

property config: dict

Return configuration used for fingerprint computation.

property fingerprint_count: int

Return the number of stored fingerprints.

property fingerprints: ndarray | csr_matrix | None

Return the stored fingerprint matrix.

get_fingerprint_by_inchikey(inchikey: str)[source]

Get fingerprint by InChIKey.

Parameters:

inchikey – InChIKey of a compound.

Returns:

The corresponding fingerprint row, or None if not present.

Return type:

Optional[np.ndarray | scipy.sparse.csr_matrix]
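
A lookup of this kind is commonly backed by a key-to-row index. The sketch below shows that idea in plain Python; the class RowLookup and its attributes are hypothetical and say nothing about how matchms stores its rows internally.

```python
class RowLookup:
    """Hypothetical container mapping InChIKeys to fingerprint rows."""

    def __init__(self, inchikeys, rows):
        if len(inchikeys) != len(rows):
            raise ValueError("Need exactly one row per InChIKey.")
        self._rows = rows
        # Map each key to its row position for constant-time lookup.
        self._index = {key: i for i, key in enumerate(inchikeys)}

    def get(self, inchikey):
        """Return the row for the given InChIKey, or None if it is unknown."""
        i = self._index.get(inchikey)
        return None if i is None else self._rows[i]


lookup = RowLookup(
    ["OTMSDBZUPAUEDD-UHFFFAOYSA-N", "UGFAIRIUMAVXCW-UHFFFAOYSA-N"],
    [[0, 1, 0, 1], [1, 0, 0, 0]],  # toy rows, not real fingerprints
)

print(lookup.get("UGFAIRIUMAVXCW-UHFFFAOYSA-N"))  # [1, 0, 0, 0]
print(lookup.get("NOT-A-STORED-KEY"))             # None
```

Returning None for a missing key, rather than raising, matches the return type documented above and lets callers filter out compounds without a stored fingerprint.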

get_fingerprint_by_spectrum(spectrum: Spectrum)[source]

Get fingerprint by spectrum.

Parameters:

spectrum – Spectrum with an InChIKey.

Returns:

The corresponding fingerprint row, or None if not present.

Return type:

Optional[np.ndarray | scipy.sparse.csr_matrix]

property inchikeys: list[str]

Return ordered list of stored InChIKeys.

property is_sparse: bool

Return True if fingerprints are stored as a SciPy CSR sparse matrix.
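
For readers unfamiliar with the CSR layout, the following plain-Python sketch converts a small dense matrix into the three arrays that a scipy.sparse.csr_matrix exposes (data, indices, indptr). The function dense_to_csr is a hypothetical teaching aid; in practice SciPy performs this conversion, and the toy matrix is not a real fingerprint.

```python
def dense_to_csr(rows):
    """Hypothetical helper: encode a dense matrix as CSR (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # non-zero value
                indices.append(col)  # its column position
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr


dense = [
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 2, 0, 0, 0, 0],
]
data, indices, indptr = dense_to_csr(dense)

print(data)     # [1, 1, 2]
print(indices)  # [1, 4, 3]
print(indptr)   # [0, 2, 3]
```

Only the three non-zero entries are stored instead of all sixteen cells, which is why a sparse layout tends to pay off for long fingerprints with few set positions.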

property to_dataframe: DataFrame

Return fingerprints as a pandas DataFrame indexed by InChIKey.