The ihm.metadata Python module

Classes to extract metadata from various input files.

Often input files contain metadata that would be useful to include in the mmCIF file, but the metadata is stored in a different way for each domain-specific file type. For example, MRC files used for electron microscopy maps may contain an EMDB identifier, which the mmCIF file can point to in preference to the local file.

This module provides classes for each file type to extract suitable metadata where available.

class ihm.metadata.Parser

Base class for all metadata parsers.

parse_file(filename)

Extract metadata from the given file.

Parameters:

filename (str) – the file to extract metadata from.

Returns:

a dict with extracted metadata (generally including a Dataset).
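Because every parser shares the parse_file() interface, a caller can pick one by file type and treat the result uniformly. The following is a minimal sketch of that idea, not part of the module: the helper names (parser_name_for, extract_metadata) and the extension-to-parser mapping are assumptions made for illustration. The ihm package is imported lazily, so it is only needed when extract_metadata() is actually called.

```python
import os

# Assumed mapping from file extension to parser class name. The extensions
# are illustrative guesses; the module itself does not define this table.
PARSER_BY_EXTENSION = {
    ".mrc": "MRCParser",
    ".map": "MRCParser",
    ".pdb": "PDBParser",
    ".ent": "PDBParser",
    ".cif": "CIFParser",
}


def parser_name_for(filename):
    """Return the name of a suitable parser class, or None if unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return PARSER_BY_EXTENSION.get(ext)


def extract_metadata(filename):
    """Parse `filename` with the matching ihm.metadata parser class."""
    import ihm.metadata  # lazy import; requires the ihm package

    name = parser_name_for(filename)
    if name is None:
        raise ValueError("no metadata parser known for %r" % filename)
    parser = getattr(ihm.metadata, name)()
    # Every parser returns a dict, generally including a 'dataset' key.
    return parser.parse_file(filename)
```

With the ihm package installed, extract_metadata("model.pdb") would return the dict described above for PDBParser.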

class ihm.metadata.MRCParser

Extract metadata from an EM density map (MRC file).

parse_file(filename)

Extract metadata. See Parser.parse_file() for details.

Returns:

a dict with key 'dataset' pointing to the density map: an EMDB entry if the file contains EMDB headers, otherwise the file itself.

If the file turns out to be an EMDB entry, this will also query the EMDB web API (if available) to extract version information and details for the dataset.

class ihm.metadata.PDBParser

Extract metadata (e.g. PDB ID, comparative modeling templates) from a PDB file. This handles PDB headers added by the PDB database itself, comparative modeling packages such as MODELLER and Phyre2, and also some custom headers that can be used to indicate that a file has been locally modified in some way.

See also CIFParser for coordinate files in mmCIF format.

parse_file(filename)

Extract metadata. See Parser.parse_file() for details.

Parameters:

filename (str) – the file to extract metadata from.

Returns:

a dict with the following keys: 'dataset', pointing to the PDB dataset; 'templates', pointing to a dict that maps each asym (chain) ID in the PDB file to the list of comparative model templates used to model that chain (as ihm.startmodel.Template objects); 'entity_source', pointing to a dict that maps asym IDs to ihm.source.Source objects; 'software', pointing to a list of software used to generate the file (as ihm.Software objects); 'script', pointing to the script used to generate the file, if any (as an ihm.location.WorkflowFileLocation object); and 'metadata', a list of PDB metadata records.
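To make the shape of the returned dict concrete, here is a small sketch that walks it and produces a summary. The function name summarize_pdb_metadata is hypothetical, and the example dict is stand-in data with the documented keys rather than real parser output.

```python
def summarize_pdb_metadata(result):
    """Return human-readable lines describing a PDBParser-style result dict."""
    summary = ["dataset: %s" % type(result["dataset"]).__name__]
    # 'templates' maps each asym (chain) ID to a list of templates.
    for asym_id, templates in sorted(result.get("templates", {}).items()):
        summary.append("chain %s: %d template(s)" % (asym_id, len(templates)))
    summary.append("software entries: %d" % len(result.get("software", [])))
    summary.append("script: %s" % ("yes" if result.get("script") else "no"))
    return summary


# Stand-in data shaped like the documented return value. A real call would
# be ihm.metadata.PDBParser().parse_file("model.pdb").
example = {
    "dataset": object(),
    "templates": {"A": ["t1", "t2"], "B": []},
    "software": ["modeller"],
    "script": None,
}
print("\n".join(summarize_pdb_metadata(example)))
```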

This parser looks at PDB headers. Standard PDB database headers are recognized, plus some added by common comparative modeling packages such as MODELLER and Phyre2, as well as some custom headers that can be used to denote that a PDB file is a locally-modified version of some other resource. Additional details will be extracted from other PDB headers if available, such as TITLE records.

If the first line of the file starts with HEADER and it also contains a PDB ID, then the file is assumed to live in the PDB database. For example, the following will be interpreted as PDB entry 2HBJ:

HEADER    HYDROLASE, GENE REGULATION              14-JUN-06   2HBJ
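As an illustration of this rule (not the parser's actual code), the sketch below pulls the ID out of a HEADER record. It assumes the standard PDB column layout, in which the ID code occupies columns 63-66; the function name pdb_id_from_header is hypothetical.

```python
def pdb_id_from_header(first_line):
    """Return the PDB ID from a HEADER record, or None if there is none.

    Assumes the standard PDB layout, where the ID code occupies
    columns 63-66 (1-based) of the HEADER record.
    """
    if not first_line.startswith("HEADER"):
        return None
    pdb_id = first_line[62:66].strip()
    return pdb_id if len(pdb_id) == 4 else None


# Build the record with explicit field widths so the columns are exact:
# classification in columns 11-50, date in 51-59, ID code in 63-66.
line = "HEADER    %-40s%-9s   %s" % (
    "HYDROLASE, GENE REGULATION", "14-JUN-06", "2HBJ")
print(pdb_id_from_header(line))
```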

If the first line starts with EXPDTA    DERIVED FROM, then the file is assumed to derive from a given PDB entry, or from a comparative or integrative model available at a given DOI. TITLE records are expected to describe the nature of the transformation:

EXPDTA    DERIVED FROM PDB:1YKH
EXPDTA    DERIVED FROM COMPARATIVE MODEL, DOI:10.1093/nar/gkt704
EXPDTA    DERIVED FROM INTEGRATIVE MODEL, DOI:10.1016/j.str.2017.01.006
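The three forms above can be told apart with a single regular expression. This is a hedged sketch rather than the parser's own logic: the pattern, the function name parse_derived_from and the returned tuple format are all assumptions chosen for illustration.

```python
import re

# Matches the three documented forms of the custom header.
DERIVED_RE = re.compile(
    r"EXPDTA\s+DERIVED FROM\s+"
    r"(?:PDB:(?P<pdb>\S+)"
    r"|(?P<kind>COMPARATIVE|INTEGRATIVE) MODEL,\s*DOI:(?P<doi>\S+))")


def parse_derived_from(first_line):
    """Return (source_kind, identifier), or None if the line does not match."""
    m = DERIVED_RE.match(first_line)
    if m is None:
        return None
    if m.group("pdb"):
        return ("pdb", m.group("pdb"))
    return (m.group("kind").lower() + " model", m.group("doi"))
```

For the first example line, parse_derived_from returns a tuple naming a PDB source with identifier 1YKH; for the other two it returns the model kind and the DOI.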

A file whose first line starts with REMARK  99  Chain ID : is assumed to be a model generated by Phyre2. Template information can be added using MODELLER-style headers, as below, if desired.

A file whose first line starts with EXPDTA    THEORETICAL MODEL, MODELLER is assumed to be a model generated by MODELLER. Headers generated by modern versions of MODELLER are parsed to extract information about the comparative modeling script, plus the templates used and their alignment. Templates named 1abcX or 1abcX_N are assumed to be structures deposited in PDB (in this case, chain X in structure 1ABC). A custom TEMPLATE PATH header can be used to point to templates that are not deposited in the PDB database. For example, the model below is assumed to be constructed using templates from PDB codes 3JRO and 3F3F, plus another template in my_custom_pdb_file.pdb, and the given alignment:

EXPDTA    THEORETICAL MODEL, MODELLER 9.18 2017/02/10 22:21:34
REMARK   6 ALIGNMENT: modeller_model.ali
REMARK   6 SCRIPT: model-default.py
REMARK   6 TEMPLATE PATH custom1 ../inputs/my_custom_pdb_file.pdb
REMARK   6 TEMPLATE: 3jroC 33:C - 424:C MODELS 33:A - 424:A AT 100.0%
REMARK   6 TEMPLATE: 3f3fG 482:G - 551:G MODELS 429:A - 488:A AT 10.0%
REMARK   6 TEMPLATE: custom1 9:A - 352:A MODELS 80:A - 414:A AT 32.0%
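To show how these headers carry template information, the sketch below reads the TEMPLATE and TEMPLATE PATH records from the example. It is an illustration under stated assumptions, not the parser's implementation: the regular expressions, the function name parse_templates, the dict keys, and the reading of the _N suffix as digits are all hypothetical.

```python
import re

TEMPLATE_RE = re.compile(
    r"REMARK\s+6\s+TEMPLATE:\s+(?P<name>\S+)\s+"
    r"(?P<t_start>-?\d+):(?P<t_chain>\S*)\s+-\s+(?P<t_end>-?\d+):\S*\s+"
    r"MODELS\s+"
    r"(?P<m_start>-?\d+):(?P<m_chain>\S*)\s+-\s+(?P<m_end>-?\d+):\S*\s+"
    r"AT\s+(?P<identity>[\d.]+)%")
PATH_RE = re.compile(r"REMARK\s+6\s+TEMPLATE PATH\s+(\S+)\s+(\S+)")
# Names such as 1abcX or 1abcX_N: a 4-character PDB code plus a chain ID.
# Treating the suffix N as digits is an assumption.
PDB_NAME_RE = re.compile(r"(\d[A-Za-z0-9]{3})([A-Za-z0-9])(_\d+)?$")


def parse_templates(lines):
    """Return one dict per TEMPLATE record found in `lines`."""
    # First pass: collect custom template paths, keyed by template name.
    paths = {}
    for line in lines:
        m = PATH_RE.match(line)
        if m:
            paths[m.group(1)] = m.group(2)
    # Second pass: read each TEMPLATE record and resolve where it lives.
    templates = []
    for line in lines:
        m = TEMPLATE_RE.match(line)
        if not m:
            continue
        name = m.group("name")
        info = {
            "name": name,
            "template_chain": m.group("t_chain"),
            "template_range": (int(m.group("t_start")), int(m.group("t_end"))),
            "model_chain": m.group("m_chain"),
            "model_range": (int(m.group("m_start")), int(m.group("m_end"))),
            "identity": float(m.group("identity")),
        }
        pdb = PDB_NAME_RE.match(name)
        if name in paths:
            info["path"] = paths[name]
        elif pdb:
            info["pdb_id"] = pdb.group(1).upper()
        templates.append(info)
    return templates


HEADERS = [
    "EXPDTA    THEORETICAL MODEL, MODELLER 9.18 2017/02/10 22:21:34",
    "REMARK   6 ALIGNMENT: modeller_model.ali",
    "REMARK   6 SCRIPT: model-default.py",
    "REMARK   6 TEMPLATE PATH custom1 ../inputs/my_custom_pdb_file.pdb",
    "REMARK   6 TEMPLATE: 3jroC 33:C - 424:C MODELS 33:A - 424:A AT 100.0%",
    "REMARK   6 TEMPLATE: 3f3fG 482:G - 551:G MODELS 429:A - 488:A AT 10.0%",
    "REMARK   6 TEMPLATE: custom1 9:A - 352:A MODELS 80:A - 414:A AT 32.0%",
]
for template in parse_templates(HEADERS):
    print(template)
```

Collecting the TEMPLATE PATH records in a separate first pass means the result does not depend on whether a path record appears before or after the TEMPLATE record that uses it.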

A file whose first line starts with TITLE     SWISS-MODEL SERVER is assumed to be a model generated by SWISS-MODEL, and information about the template(s) is extracted from REMARK    3 records.
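Taken together, the first-line rules described above amount to a small classifier. The sketch below is one way to express them, assuming the standard PDB column layout for the HEADER ID code (columns 63-66). The function name classify_first_line and the returned labels are hypothetical, and collapsing runs of whitespace before comparing prefixes is a simplification that the real parser may not make.

```python
# Prefixes with whitespace collapsed to single spaces, paired with a label.
FIRST_LINE_PREFIXES = [
    ("EXPDTA DERIVED FROM", "derived"),
    ("REMARK 99 Chain ID :", "phyre2"),
    ("EXPDTA THEORETICAL MODEL, MODELLER", "modeller"),
    ("TITLE SWISS-MODEL SERVER", "swiss-model"),
]


def classify_first_line(first_line):
    """Guess the origin of a PDB file from its first line."""
    # A HEADER record with a non-blank ID code in columns 63-66 is taken
    # to be an entry in the PDB database.
    if first_line.startswith("HEADER") and first_line[62:66].strip():
        return "pdb"
    squashed = " ".join(first_line.split())
    for prefix, origin in FIRST_LINE_PREFIXES:
        if squashed.startswith(prefix):
            return origin
    return "unknown"
```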

class ihm.metadata.CIFParser

Extract metadata from an mmCIF file. Currently, this does not handle information from comparative modeling packages such as MODELLER (see PDBParser).

See also PDBParser for coordinate files in legacy PDB format.

parse_file(filename)

Extract metadata. See Parser.parse_file() for details.

Parameters:

filename (str) – the file to extract metadata from.

Returns:

a dict with the following keys: 'dataset', pointing to the coordinate file, as an entry in the PDB or Model Archive databases if the file contains appropriate headers, otherwise to the file itself; 'templates', pointing to a dict that maps each asym (chain) ID in the file to the list of comparative model templates used to model that chain (as ihm.startmodel.Template objects); 'software', pointing to a list of software used to generate the file (as ihm.Software objects); and 'script', pointing to the script used to generate the file, if any (as an ihm.location.WorkflowFileLocation object).