Usage

Using the library for output involves two steps: first create a hierarchy of Python objects that together describe the system, then dump that hierarchy to an mmCIF file.

For a complete worked example, see the simple docking example.

The top level of the hierarchy in IHM is the ihm.System. All other objects are referenced from a System object.

Datasets

Any data used anywhere in the modeling (including in validation) can be referenced with an ihm.dataset.Dataset. For example, electron microscopy data is referenced with ihm.dataset.EMDensityDataset and small angle scattering data with ihm.dataset.SASDataset.

A dataset uses an ihm.location.Location object to describe where it is stored. Typically this is an ihm.location.DatabaseLocation for something that’s deposited in an experiment-specific database such as PDB, EMDB, PRIDE, or EMPIAR, or ihm.location.InputFileLocation for something that’s stored as a simple file, either on the local disk or at a location described with a DOI such as Zenodo or a publication’s supplementary information. See the locations example for more examples.

System architecture

The architecture of the system is described with a number of classes:

  • ihm.Entity describes each unique sequence.

  • ihm.AsymUnit describes each asymmetric unit (chain) in the system. For example, a homodimer would consist of two asymmetric units, both pointing to the same entity, while a heterodimer contains two entities. It is also possible for an entity to exist with no asymmetric units pointing to it - this typically corresponds to something seen in an experiment (such as a cross-linking study) which was not modeled. Note that the IHM extension currently contains no support for symmetry, so two chains that are symmetrically related should each be represented as an “asymmetric” unit.

  • ihm.Assembly groups asymmetric units and/or entities, or parts of them. Assemblies are used to describe which parts of the system correspond to each input source of data, or which parts were modeled.

  • ihm.representation.Representation describes how each part of the system was represented in the modeling, for example as atoms or as coarse-grained spheres.

Restraints and sampling

Restraints, which score or otherwise fit the computational model against the input data, can be created as ihm.restraint.Restraint objects. These generally take as input a Dataset pointing to the input data, and an Assembly describing which part of the model the data corresponds to. For example, there are restraints for 3D EM and small angle scattering.

ihm.protocol.Protocol objects describe how models were generated from the input data. A protocol can consist of multiple steps, such as molecular dynamics or Monte Carlo, followed by one or more analyses, such as clustering, filtering, rescoring, or validation, described by ihm.analysis.Analysis objects. These objects generally take an Assembly to indicate what part of the system was considered and a group of datasets to show which data guided the modeling or analysis.

Model coordinates

ihm.model.Model objects give the actual coordinates of the final generated models. These point to the Assembly of what was modeled, the Protocol describing how the modeling was done, and the Representation showing how the model was represented.

Models can be grouped together for any purpose using the ihm.model.ModelGroup class. If a given group describes an ensemble of models, the ihm.model.Ensemble class allows for additional information on the ensemble to be provided, such as localization densities of parts of the system and precision. Due to size, generally only representative models of an ensemble are deposited in mmCIF, but the Ensemble class allows the full ensemble to be referred to, for example in a more compact binary format (e.g. DCD) deposited at a given DOI. Groups of models can also be shown as corresponding to different states of the system using the ihm.model.State class.

Metadata

Metadata can also be added to the system, such as

  • ihm.Citation: publication(s) that describe this modeling or the methods used in it.

  • ihm.Software: software packages used to process the experimental data, generate intermediate inputs, do the modeling itself, and/or process the output.

  • ihm.Grant: funding support for the modeling.

  • ihm.reference.UniProtSequence: information on a sequence used in modeling, as recorded in UniProt.

Residue numbering

The library keeps track of several numbering schemes to reflect the reality of the data used in modeling:

  • Internal numbering. Residues are always numbered sequentially starting at 1 in an Entity. All references to residues or residue ranges in the library use this numbering. For polymers, this internal numbering matches the seq_id used in the mmCIF dictionary, while for branched entities, this matches num in the dictionary. For other types of entities (non-polymers, waters), seq_id is not used in mmCIF, but the residues are still numbered sequentially from 1 in this library.

  • Author-provided numbering. If a different numbering scheme is used by the authors, for example to correspond to the numbering of the original sequence that is modeled, this can be given as an author-provided numbering for one or more asymmetric units. See the auth_seq_id_map and orig_auth_seq_id_map parameters to AsymUnit. (The mapping between author-provided and internal numbering is given in tables such as pdbx_poly_seq_scheme in the mmCIF file.) Two maps are available because PDB allows for two distinct author-provided schemes: the “original” author-provided numbering orig_auth_seq_id_map is entirely unrestricted but is only used internally, while auth_seq_id_map must follow certain PDB rules (and generally matches the residue numbers used in legacy PDB files). In most cases, only auth_seq_id_map is used.

  • Starting model numbering. If the initial state of the modeling is given by one or more PDB files, the numbering of residues in those files may not line up with the internal numbering. In this case an offset from starting model numbering to internal numbering can be provided - see the offset parameter to StartingModel.
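The arithmetic behind these numbering schemes can be illustrated with a few lines of plain Python. This is a standalone sketch, not the library's implementation: the two helper functions are hypothetical, and they assume that an integer map is a constant offset added to the internal number, that any other map is a lookup table keyed by internal number, and that a starting model offset is added to the starting model number to give the internal number.

```python
def internal_to_author(seq_id, auth_seq_id_map=0):
    """Convert an internal (1-based) residue number to an author number.

    Hypothetical helper: an int is treated as a constant offset,
    anything else as a lookup table indexed by internal number.
    """
    if isinstance(auth_seq_id_map, int):
        return seq_id + auth_seq_id_map
    return auth_seq_id_map[seq_id]


def starting_model_to_internal(model_resnum, offset=0):
    """Convert a starting model residue number to an internal number."""
    return model_resnum + offset


# A 5-residue construct that corresponds to residues 101-105 of the
# full-length protein: internal 1..5 maps to author 101..105.
author_numbers = [internal_to_author(i, 100) for i in range(1, 6)]

# An irregular author numbering (for example, with a gap) needs a table.
irregular = {1: 5, 2: 6, 3: 10}
gap_number = internal_to_author(3, irregular)

# A starting model file numbered from 101 lines up with internal
# numbering when an offset of -100 is applied.
first_internal = starting_model_to_internal(101, offset=-100)
```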

Output

Once the hierarchy of classes is complete, it can be freely inspected or modified. All the classes are simple lightweight Python objects, generally with the relevant data available as member variables. For example, modeling packages such as IMP will typically generate an IHM hierarchy from their own internal data models, but in many cases some information relevant to IHM (such as the associated publication) cannot be determined automatically and can be filled in by adding more objects to the hierarchy.

The complete hierarchy can be written out to an mmCIF or BinaryCIF file using the ihm.dumper.write() function.

Input

Hierarchies of IHM classes can also be read from mmCIF or BinaryCIF files. This is done using the ihm.reader.read() function, which returns a list of ihm.System objects.