Chemical table file

Family of chemical file formats

Chemical table file (CT file) is a family of text-based chemical file formats that describe molecules and chemical reactions. One format, for example, lists each atom in a molecule, the x-y-z coordinates of that atom, and the bonds among the atoms.

The formats were created by MDL Information Systems (MDL), which was acquired by Symyx Technologies, later merged with Accelrys Corp., and is now called BIOVIA, a subsidiary of Dassault Systèmes of the Dassault Group.[1]

The CT file is an open format: BIOVIA publishes its specification,[2] although users must register to download it.[3]

Molfile

ctab
Filename extension: .mol
Internet media type: chemical/x-mdl-molfile
Type of format: chemical file format

An MDL Molfile is a file format for holding information about the atoms, bonds, connectivity and coordinates of a molecule.

The molfile consists of some header information, the Connection Table (CT) containing atom info, then bond connections and types, followed by sections for more complex information.

The molfile is sufficiently common that most, if not all, cheminformatics software systems/applications are able to read the format, though not always to the same degree. It is also supported by some computational software such as Mathematica.

V2000

The current de facto standard version is molfile V2000, although, more recently, the V3000 format has been circulating widely enough to present a potential compatibility issue for those applications that are not yet V3000-capable.

The contents of a Molfile of L-Alanine

Header block (3 lines)

Title line (can be blank, but the line must exist):
L-Alanine
Program / file timestamp line, giving the name of the source program, a sequence number (often a timestamp), and a 2D or 3D specifier (often meaningless; examining the Z coordinates is more reliable):
  ABCDEFGH09071717442D
Comment line (can be blank, but the line must exist):
Exported

Connection table

Counts line:
6 5 0 0 1 0 3 V2000
Atom block (1 line for each atom): x, y, z (in angstroms), element, etc.
-0.6622  0.5342 0.0000 C 0 0 2 0 0 0
 0.6622 -0.3000 0.0000 C 0 0 0 0 0 0
-0.7207  2.0817 0.0000 C 1 0 0 0 0 0
-1.8622 -0.3695 0.0000 N 0 3 0 0 0 0
 0.6220 -1.8037 0.0000 O 0 0 0 0 0 0
 1.9464  0.4244 0.0000 O 0 5 0
Bond block (1 line for each bond): 1st atom, 2nd atom, type, etc.
1 2 1 0 0 0 0
1 3 1 0 1 0 0
1 4 1 0 0 0 0
2 5 2 0 0 0 0
2 6 1 0 0 0 0
Properties block:
M  CHG 2 4 1 6 -1
M  ISO 1 3 13

END

END line (note: some programs do not accept a blank line before M END):
M  END
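As an illustration of how the blocks in the L-Alanine example fit together, the following Python sketch reads the header, counts line, atom block, bond block and properties block. It is a minimal sketch, not a reference reader: the names `Atom`, `Bond` and `parse_molfile_v2000` are hypothetical, and the fields are split on whitespace, which is an assumption that suits the example as printed here but not every real file, since production molfiles use fixed-width columns in which values can run together.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int   # 1-based index of the first atom
    second: int  # 1-based index of the second atom
    order: int   # bond type code (1 = single, 2 = double, ...)


def parse_molfile_v2000(text):
    """Return (title, atoms, bonds, property_lines) for V2000 molfile text.

    Assumption: fields are separated by whitespace.
    """
    lines = text.splitlines()
    title = lines[0]                 # header line 1; lines 2 and 3 are skipped
    counts = lines[3].split()        # counts line follows the 3 header lines
    if counts[-1] != "V2000":
        raise ValueError("not a V2000 molfile")
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), element))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append(Bond(first, second, order))

    # Everything between the bond block and "M  END" is the properties block.
    properties = []
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.split() == ["M", "END"]:
            break
        properties.append(line)
    return title, atoms, bonds, properties


# The L-Alanine example, copied line by line from the listing above.
ALANINE = "\n".join([
    "L-Alanine",
    "  ABCDEFGH09071717442D",
    "Exported",
    "6 5 0 0 1 0 3 V2000",
    "-0.6622  0.5342 0.0000 C 0 0 2 0 0 0",
    " 0.6622 -0.3000 0.0000 C 0 0 0 0 0 0",
    "-0.7207  2.0817 0.0000 C 1 0 0 0 0 0",
    "-1.8622 -0.3695 0.0000 N 0 3 0 0 0 0",
    " 0.6220 -1.8037 0.0000 O 0 0 0 0 0 0",
    " 1.9464  0.4244 0.0000 O 0 5 0",
    "1 2 1 0 0 0 0",
    "1 3 1 0 1 0 0",
    "1 4 1 0 0 0 0",
    "2 5 2 0 0 0 0",
    "2 6 1 0 0 0 0",
    "M  CHG 2 4 1 6 -1",
    "M  ISO 1 3 13",
    "M  END",
])

title, atoms, bonds, properties = parse_molfile_v2000(ALANINE)
print(title, len(atoms), len(bonds), properties)
```

For real-world data, an established cheminformatics toolkit is usually a safer choice than a hand-written reader, because it handles the fixed-width layout and the many optional fields.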

Counts line block specification

Value: 6, 5, 0, 0, 0, 1, V2000
Description: number of atoms; number of bonds; number of atom lists; chiral flag (1 = chiral, 0 = not chiral); number of stext entries; number of lines of additional properties; mol version
Type: [Generic], [Generic], [Query], [Generic], [ISIS/Desktop], [Generic]

Bond block specification

The Bond Block is made up of bond lines, one line per bond, with the following format:

111 222 ttt sss xxx rrr ccc

where the fields have the following meanings and values:

  • 111: first atom number
  • 222: second atom number
  • ttt: bond type. 1 = Single, 2 = Double, 3 = Triple, 4 = Aromatic, 5 = Single or Double, 6 = Single or Aromatic, 7 = Double or Aromatic, 8 = Any
  • sss: bond stereo. For single bonds: 0 = not stereo, 1 = up, 4 = either, 6 = down. For double bonds: 0 = use x-, y-, z-coords from atom block to determine cis or trans, 3 = cis or trans (either) double bond
  • xxx: not used
  • rrr: bond topology. 0 = Either, 1 = Ring, 2 = Chain
  • ccc: reacting center status. 0 = unmarked, 1 = a center, -1 = not a center. Additional values: 2 = no change, 4 = bond made/broken, 8 = bond order changes, 12 = 4 + 8 (both made/broken and changes); 5 = (4 + 1), 9 = (8 + 1), and 13 = (12 + 1) are also possible
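The bond line fields can be decoded with simple lookup tables. The sketch below is illustrative only: `decode_bond_line` and `decode_reacting_center` are hypothetical names, the line is split on whitespace (an assumption, as real files use fixed-width columns), and the positive reacting center codes are treated as bit flags, which is an interpretation suggested by the sums listed above (12 = 4 + 8, 13 = 12 + 1) rather than something the text states outright.

```python
BOND_TYPES = {
    1: "Single", 2: "Double", 3: "Triple", 4: "Aromatic",
    5: "Single or Double", 6: "Single or Aromatic",
    7: "Double or Aromatic", 8: "Any",
}
SINGLE_STEREO = {0: "not stereo", 1: "up", 4: "either", 6: "down"}
DOUBLE_STEREO = {0: "use coordinates", 3: "cis or trans (either)"}
TOPOLOGY = {0: "Either", 1: "Ring", 2: "Chain"}


def decode_reacting_center(code):
    """Expand a reacting center status code into a list of labels."""
    if code == 0:
        return ["unmarked"]
    if code == -1:
        return ["not a center"]
    labels = []
    # Assumption: positive codes combine as bit flags (e.g. 13 = 8 + 4 + 1).
    if code & 1:
        labels.append("a center")
    if code & 2:
        labels.append("no change")
    if code & 4:
        labels.append("bond made/broken")
    if code & 8:
        labels.append("bond order changes")
    return labels


def decode_bond_line(line):
    """Decode one bond line: 111 222 ttt sss xxx rrr ccc."""
    values = [int(v) for v in line.split()]
    values += [0] * (7 - len(values))      # missing trailing fields read as 0
    first, second, btype, stereo, _unused, topology, center = values[:7]
    if btype == 1:
        stereo_label = SINGLE_STEREO.get(stereo, str(stereo))
    elif btype == 2:
        stereo_label = DOUBLE_STEREO.get(stereo, str(stereo))
    else:
        stereo_label = str(stereo)         # no stereo table for other types
    return {
        "atoms": (first, second),
        "type": BOND_TYPES.get(btype, str(btype)),
        "stereo": stereo_label,
        "topology": TOPOLOGY.get(topology, str(topology)),
        "reacting_center": decode_reacting_center(center),
    }


double_bond = decode_bond_line("2 5 2 0 0 0 0")   # C=O bond of the example
print(double_bond)
print(decode_reacting_center(13))
```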

V3000

The extended (V3000) molfile consists of a regular molfile “no structure” followed by a single molfile appendix that contains the body of the connection table (Ctab). The following figure shows both an alanine structure and the extended molfile corresponding to it.

Note that the “no structure” is flagged with the “V3000” instead of the “V2000” version stamp. There are two other changes to the header in addition to the version:

  • The number of appendix lines is always written as 999, regardless of how many there actually are. (All current readers will disregard the count and stop at M END.)
  • The “dimensional code” is maintained more explicitly. Thus “3D” really means 3D, although “2D” will be interpreted as 3D if any non-zero Z-coordinates are found.
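A reader can use the version stamp to decide which parser to apply, and can apply the Z-coordinate rule described above to settle the dimensionality. The following is a rough sketch under two assumptions: the version stamp is the last whitespace-separated token of the fourth line, and the helper names `molfile_version` and `effective_dimension` are hypothetical.

```python
def molfile_version(text):
    """Return "V2000" or "V3000" from the fourth line of a molfile."""
    lines = text.splitlines()
    if len(lines) < 4:
        raise ValueError("a molfile needs three header lines and a counts line")
    tokens = lines[3].split()
    version = tokens[-1] if tokens else ""
    if version not in ("V2000", "V3000"):
        raise ValueError("no version stamp found: %r" % lines[3])
    return version


def effective_dimension(declared, z_coords):
    """Apply the rule above: "2D" is treated as 3D if any Z is non-zero."""
    if declared == "3D" or any(z != 0.0 for z in z_coords):
        return "3D"
    return "2D"


# Header of the extended molfile example: title, timestamp, comment, then
# the V2000-compatibility line carrying the V3000 stamp.
HEADER = "\n".join([
    "L-Alanine",
    "GSMACCS-II07189510252D 1 0.00366 0.00000 0",
    "Figure 1, J. Chem. Inf. Comput. Sci., Vol 32, No. 3., 1992",
    "0 0 0 0 0 999 V3000",
])

version = molfile_version(HEADER)
print(version, effective_dimension("2D", [0.0, 0.0, 1.5]))
```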

Unlike the V2000 molfile, the V3000 Rgroup molfile has the same header format as a non-Rgroup molfile. V3000 format can represent the following things that are out of reach for V2000:[4]

  • number of atoms or bonds exceeding 999
  • reactions with Rgroups
  • enhanced stereochemistry
Header block

Description:
L-Alanine
Header with timestamp:
GSMACCS-II07189510252D 1 0.00366 0.00000 0
Comment line:
Figure 1, J. Chem. Inf. Comput. Sci., Vol 32, No. 3., 1992
V2000-compatibility line:
0 0 0 0 0 999 V3000

Connection table

M V30 BEGIN CTAB
Counts line:
M V30 COUNTS 6 5 0 0 1
Atom block:
M V30 BEGIN ATOM
M V30 1 C -0.6622 0.5342 0 0 CFG=2
M V30 2 C 0.6622 -0.3 0 0
M V30 3 C -0.7207 2.0817 0 0 MASS=13
M V30 4 N -1.8622 -0.3695 0 0 CHG=1
M V30 5 O 0.622 -1.8037 0 0
M V30 6 O 1.9464 0.4244 0 0 CHG=-1
M V30 END ATOM
Bond block:
M V30 BEGIN BOND
M V30 1 1 1 2
M V30 2 1 1 3 CFG=1
M V30 3 1 1 4
M V30 4 2 2 5
M V30 5 1 2 6
M V30 END BOND
M V30 END CTAB
M END
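The BEGIN/END structure of the extended connection table lends itself to a small state machine. The sketch below is a simplified illustration with a hypothetical function name, `parse_v3000_ctab`. It assumes, reading from the example above, that an atom line holds an index, element, x, y, z and one further integer followed by optional KEY=VALUE pairs, and that a bond line holds an index, bond type and two atom indices followed by optional KEY=VALUE pairs. Line continuations and other blocks are not handled.

```python
def parse_v3000_ctab(text):
    """Collect atoms and bonds from the "M V30" lines of an extended molfile."""
    atoms, bonds, section = [], [], None
    for line in text.splitlines():
        tokens = line.split()              # tolerates "M V30" and "M  V30"
        if tokens[:2] != ["M", "V30"] or len(tokens) < 3:
            continue
        body = tokens[2:]
        if body[0] == "BEGIN" and len(body) > 1:
            section = body[1]              # CTAB, ATOM or BOND
        elif body[0] == "END":
            section = None
        elif section == "ATOM":
            atoms.append({
                "index": int(body[0]),
                "element": body[1],
                "xyz": tuple(float(v) for v in body[2:5]),
                # body[5] is skipped; remaining tokens are KEY=VALUE pairs.
                "props": dict(p.split("=", 1) for p in body[6:]),
            })
        elif section == "BOND":
            bonds.append({
                "index": int(body[0]),
                "type": int(body[1]),
                "atoms": (int(body[2]), int(body[3])),
                "props": dict(p.split("=", 1) for p in body[4:]),
            })
    return atoms, bonds


CTAB = "\n".join([
    "M  V30 BEGIN CTAB",
    "M  V30 COUNTS 6 5 0 0 1",
    "M  V30 BEGIN ATOM",
    "M  V30 1 C -0.6622 0.5342 0 0 CFG=2",
    "M  V30 2 C 0.6622 -0.3 0 0",
    "M  V30 3 C -0.7207 2.0817 0 0 MASS=13",
    "M  V30 4 N -1.8622 -0.3695 0 0 CHG=1",
    "M  V30 5 O 0.622 -1.8037 0 0",
    "M  V30 6 O 1.9464 0.4244 0 0 CHG=-1",
    "M  V30 END ATOM",
    "M  V30 BEGIN BOND",
    "M  V30 1 1 1 2",
    "M  V30 2 1 1 3 CFG=1",
    "M  V30 3 1 1 4",
    "M  V30 4 2 2 5",
    "M  V30 5 1 2 6",
    "M  V30 END BOND",
    "M  V30 END CTAB",
    "M  END",
])

v30_atoms, v30_bonds = parse_v3000_ctab(CTAB)
print(len(v30_atoms), len(v30_bonds), v30_atoms[3]["props"])
```

Keeping the KEY=VALUE pairs as an untyped dictionary is a deliberate simplification: it preserves properties such as CHG, MASS and CFG without the sketch needing to know every keyword.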

Counts line

A counts line is required, and must be first. It specifies the number of atoms, bonds, 3D objects, and Sgroups. It also specifies whether or not the CHIRAL flag is set. Optionally, the counts line can specify molregno. This is only used when the regno exceeds 999999 (the limit of the format in the molfile header line). The format of the counts line is:

M V30 COUNTS na nb nsg n3d chiral [REGNO=regno]

For example:

M V30 COUNTS 6 5 0 0 1

where the fields are:

  • na: number of atoms
  • nb: number of bonds
  • nsg: number of Sgroups
  • n3d: number of 3D constraints
  • chiral: 1 if the molecule is chiral
  • regno: molecule or model regno
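As a small illustration, the counts line format can be parsed as follows. This is a sketch only: the function name `parse_v30_counts` and the dictionary keys are hypothetical, and the line is assumed to be split on whitespace with the optional REGNO item, if present, after the five counts.

```python
def parse_v30_counts(line):
    """Parse "M V30 COUNTS na nb nsg n3d chiral [REGNO=regno]"."""
    tokens = line.split()
    if tokens[:3] != ["M", "V30", "COUNTS"] or len(tokens) < 8:
        raise ValueError("not a V3000 counts line: %r" % line)
    na, nb, nsg, n3d, chiral = (int(t) for t in tokens[3:8])
    regno = None
    for token in tokens[8:]:
        if token.startswith("REGNO="):
            regno = int(token[len("REGNO="):])
    return {
        "atoms": na,
        "bonds": nb,
        "sgroups": nsg,
        "constraints_3d": n3d,
        "chiral": chiral == 1,
        "regno": regno,        # None when the optional item is absent
    }


counts = parse_v30_counts("M  V30 COUNTS 6 5 0 0 1")
# Made-up regno above 999999, for illustration only.
counts_with_regno = parse_v30_counts("M  V30 COUNTS 6 5 0 0 1 REGNO=1234567")
print(counts, counts_with_regno["regno"])
```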

SDF

ctab
Filename extension: .sd, .sdf
Internet media type: chemical/x-mdl-sdfile
Type of format: chemical file format

SDF (structure-data format, also known as "SD file") was developed by MDL for representing structural information. An SDF file consists of several records, each terminated by a line consisting of four dollar signs ($$$$). Each record starts like a normal molfile, but includes associated data items after the M END line.

SDF can be built on top of either a V2000 or V3000 molfile. As a result there are two separate formats, V2000 SDF and V3000 SDF.

Associated data items are denoted as follows:

>  <Unique_ID>
XCA3464366
 
>  <ClogP>
5.825

>  <Vendor>
Sigma

>  <Molecular Weight>
499.611

Multiple-line data items are also supported. The MDL SDF-format specification requires that a hard-carriage-return character be inserted if a single line of any text field exceeds 200 characters. This requirement is frequently violated in practice, as many SMILES and InChI strings exceed that length.
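The record structure described above (a molfile, then data items, then a `$$$$` line) can be split with a short script. The sketch below is illustrative: the names `iter_sdf_records` and `split_record` are hypothetical, the field name is taken from between the angle brackets of each header line, and a blank line is assumed to end a data item. The two-record sample uses empty placeholder molfiles together with the data items shown above.

```python
def iter_sdf_records(text):
    """Yield each record of an SD file as a list of lines."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield record                        # tolerate a missing final $$$$


def split_record(record):
    """Split one record into (molfile_lines, data_items)."""
    for end, line in enumerate(record):
        if line.split() == ["M", "END"]:
            break
    else:
        raise ValueError("record has no M END line")
    data, name = {}, None
    for line in record[end + 1:]:
        if line.startswith(">"):
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1:stop] if 0 <= start < stop else None
            if name is not None:
                data[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None                 # assumed: blank line ends the item
            else:
                data[name].append(line)     # multi-line values are kept
    return record[:end + 1], {k: "\n".join(v) for k, v in data.items()}


EMPTY_MOL = ["", "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"]
SDF_TEXT = "\n".join(
    EMPTY_MOL
    + [">  <Unique_ID>", "XCA3464366", "", ">  <ClogP>", "5.825", "", "$$$$"]
    + EMPTY_MOL
    + [">  <Vendor>", "Sigma", "", ">  <Molecular Weight>", "499.611", "", "$$$$"]
) + "\n"

records = [split_record(r) for r in iter_sdf_records(SDF_TEXT)]
for molfile_lines, items in records:
    print(len(molfile_lines), items)
```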

Other formats of the family

There are other, less commonly used formats of the family.

RXNFile (.rxn)
Has a REACTANT block, a PRODUCT block, and (optionally) an AGENT block. Chemicals mentioned within can be either in an embedded molfile or in a registry. Multiple molfiles can be embedded. Has V2000 and V3000 variants based on the version of the embedded molfile.[4][5]
RDFile (.rdf)
A combination of an RXNfile with SDF-style associated data. Each record can contain chemical structures, reactions, textual and tabular data.[5]
RG File (.rgf)
An extension to molfile from Chemaxon, using a reaction as the root structure. Can be used in Marvin's RXNFile dialect.[4]
Alternatively described as: "for representing the Markush structures (deprecated, Molfile V3000 can represent Markush structures)"

There are also alternative encodings derived from the formats:

Compressed versions
Chemaxon provides a compressor for these formats with the extension names .csmol, .cssdf, .csrxn, .csrdf.
It is more common to use gzip on these text files, yielding .mol.gz, .sdf.gz, .rxn.gz, .rdf.gz.[6]
XD File (.xdf)
A defunct XML version of the above formats from MDL and Accelrys.
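Because the gzip-compressed variants are ordinary text files once decompressed, they can be read with the Python standard library alone. The following minimal sketch uses a hypothetical helper, `open_ctfile`, that chooses the opener from the file name; the `.gz` suffix check and the UTF-8 encoding are assumptions. It writes a small placeholder file to a temporary directory and reads it back.

```python
import gzip
import os
import tempfile


def open_ctfile(path):
    """Open a CT file as text, decompressing it if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Placeholder content: an empty structure followed by the record delimiter.
sample = "demo\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.sdf.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    with open_ctfile(path) as handle:
        round_trip = handle.read()

print(round_trip == sample)
```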
