Data set dsgdb9nsd
==================

Thermochemical properties for 133885 small organic molecules at the DFT/B3LYP level of theory.

Please cite this publication if you use this data set:
* Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld:
  Quantum chemistry structures and properties of 134 kilo molecules
  Scientific Data (2014)

Related publications:
* Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld:
  Learning the error: Augmenting legacy quantum chemistry with machine learning. 
  submitted (2014)

* Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Mueller, O. Anatole von 
  Lilienfeld: Fast and Accurate Modeling of Molecular Atomization Energies with 
  Machine Learning, Physical Review Letters, 108(5): 058301, 2012. 
  DOI: 10.1103/PhysRevLett.108.058301

This data set is publicly available at 
* http://dx.doi.org/10.6084/m9.figshare.XXXX

Files
-----

dsgdb9nsd.xyz.tar.bz2    - 133885 molecules with properties in XYZ-like format
dsC7O2H10nsd.xyz.tar.bz2 - 6095 isomers of C7O2H10 with properties in XYZ-like format
validation.txt           - 100 randomly drawn molecules from the 133885 set with enthalpies of formation
uncharacterized.txt      - 3054 molecules from the 133885 set that failed a consistency check
atomref.txt              - Atomic reference data
readme.txt               - Documentation
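
As a minimal sketch, the molecule files can be read straight from one of the
archives above without unpacking it to disk. The helper name iter_xyz_texts and
the assumption that the archive sits in the working directory are illustrative
and not part of the data set.

```python
import tarfile
from pathlib import Path


def iter_xyz_texts(archive_path):
    """Yield (member_name, text) for every .xyz member of a tar.bz2 archive."""
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            # Skip directories and anything that is not a molecule file.
            if not member.isfile() or not member.name.endswith(".xyz"):
                continue
            handle = archive.extractfile(member)
            yield member.name, handle.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed location: the archive was downloaded into the current directory.
    path = Path("dsgdb9nsd.xyz.tar.bz2")
    if path.exists():
        count = sum(1 for _ in iter_xyz_texts(path))
        print(f"{count} molecule files found")
```

Reading members lazily keeps memory use low, since only one molecule file is
held in memory at a time.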

Molecules
---------

For a subset of the GDB-9 database [1] consisting of 133885 neutral organic 
molecules composed of the elements H, C, N, O, and F, molecular geometries were 
relaxed and properties calculated at the DFT/B3LYP/6-31G(2df,p) level of theory. 

For a subset of 6095 isomers of C7O2H10, energetics were calculated 
at the G4MP2 [2] level of theory.

For a validation set of 100 randomly drawn molecules from the 133885 molecules set,
enthalpies of formation were additionally calculated at the 
DFT/B3LYP/6-31G(2df,p), G4MP2, G4 and CBS-QB3 levels of theory.

3054 molecules from the 133885 GDB9 molecules failed a consistency check in which the Corina-generated
Cartesian coordinates and the B3LYP/6-31G(2df,p) equilibrium geometry led to different SMILES strings.

Format
------

Each molecule is stored in its own file, ending in ".xyz".
The format is an ad hoc extension of the XYZ format [3].

Line       Content
----       -------
1          Number of atoms na
2          Properties 1-17 (see below)
3,...,na+2 Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3       Frequencies (3na-5 for linear or 3na-6 for nonlinear molecules)
na+4       SMILES from GDB9 and for relaxed geometry
na+5       InChI for GDB9 and for relaxed geometry
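
The line layout above can be turned into a small parser. The following is a
sketch only: the names Molecule, parse_xyz, to_float and SAMPLE are
hypothetical, the sample record uses invented values that are not taken from
the data set, and the handling of "*^" exponents is a defensive assumption that
this readme does not document.

```python
from dataclasses import dataclass


def to_float(token):
    # Assumption: exponents may be written as "1.5*^-6" instead of "1.5e-6".
    return float(token.replace("*^", "e"))


@dataclass
class Molecule:
    n_atoms: int
    properties: list   # raw tokens of line 2 (17 entries)
    elements: list
    coords: list       # (x, y, z) tuples in Angstrom
    charges: list      # Mulliken partial charges in e
    frequencies: list
    smiles: list       # [from GDB9, for relaxed geometry]
    inchi: list        # [for GDB9, for relaxed geometry]


def parse_xyz(text):
    """Parse one molecule file following the line layout of this readme."""
    lines = text.strip().splitlines()
    na = int(lines[0])
    elements, coords, charges = [], [], []
    for line in lines[2:na + 2]:
        element, x, y, z, charge = line.split()
        elements.append(element)
        coords.append((to_float(x), to_float(y), to_float(z)))
        charges.append(to_float(charge))
    return Molecule(
        n_atoms=na,
        properties=lines[1].split(),
        elements=elements,
        coords=coords,
        charges=charges,
        frequencies=[to_float(t) for t in lines[na + 2].split()],
        smiles=lines[na + 3].split(),
        inchi=lines[na + 4].split(),
    )


# Invented three-atom record, for illustration only.
SAMPLE = """3
gdb9 1 1.0 2.0 3.0 0.5 6.0 -0.3 0.1 0.4 19.0 0.02 -76.40 -76.40 -76.40 -76.42 6.0
O 0.0 0.0 0.0 -0.6
H 0.0 0.0 0.96 0.3
H 0.93 0.0 -0.24 0.3
1600.0 3800.0 3900.0
O O
InChI=1S/H2O/h1H2 InChI=1S/H2O/h1H2
"""

if __name__ == "__main__":
    mol = parse_xyz(SAMPLE)
    print(mol.elements, len(mol.frequencies))
```

A cheap sanity check after parsing is that len(mol.frequencies) equals either
3*na-5 or 3*na-6.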

The properties stored in the second line of each file:

I.  Property  Unit         Description
--  --------  -----------  --------------
 1  tag       -            "gdb9"; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO energies (lumo - homo)
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K

I. = Property index (properties are given in this order)
For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.
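
The property order above maps naturally onto a named record. The sketch below
assumes whitespace-separated tokens on the second line. The names
parse_properties and in_ev are hypothetical, the example line uses invented
values, and the Hartree-to-eV factor is the rounded CODATA value.

```python
HARTREE_TO_EV = 27.211386  # rounded CODATA conversion factor

# Property names in the order given by the table in this readme.
PROPERTY_NAMES = [
    "tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]
# Properties whose unit is Hartree according to the table.
HARTREE_PROPERTIES = ("homo", "lumo", "gap", "zpve", "U0", "U", "H", "G")


def parse_properties(line):
    """Turn the second line of a molecule file into a dict keyed by name."""
    tokens = line.split()
    if len(tokens) != len(PROPERTY_NAMES):
        raise ValueError(f"expected 17 properties, got {len(tokens)}")
    record = {"tag": tokens[0], "index": int(tokens[1])}
    for name, token in zip(PROPERTY_NAMES[2:], tokens[2:]):
        record[name] = float(token)
    return record


def in_ev(record):
    """Return the Hartree-valued properties converted to electronvolts."""
    return {name: record[name] * HARTREE_TO_EV for name in HARTREE_PROPERTIES}


if __name__ == "__main__":
    # Invented values, for illustration only.
    line = ("gdb9 1 1.0 2.0 3.0 0.5 6.0 -0.3 0.1 0.4 19.0 0.02 "
            "-76.40 -76.40 -76.40 -76.42 6.0")
    record = parse_properties(line)
    # The gap should equal lumo - homo up to rounding.
    print(abs(record["gap"] - (record["lumo"] - record["homo"])) < 1e-6)
    print(in_ev(record)["gap"])
```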

Notes
-----

Out of the 133885 molecules, geometries of the 11 molecules with indices 
21725, 87037, 59827, 117523, 128113, 129053, 129152, 129158, 130535, 6620, 59818 
were difficult to converge.
Low-threshold convergence was possible for 21725, 59827, 128113, 129053, 129152, 130535.
Molecules 6620 and 59818 converged to very low-lying saddle points, with lowest frequency < 10i cm^-1.
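
For downstream filtering, the indices listed in these notes can be kept in a
small lookup. This is a sketch: the function name convergence_note and the
label strings are hypothetical, and only the index lists come from the notes.

```python
# Indices copied from the notes in this readme.
DIFFICULT = {21725, 87037, 59827, 117523, 128113, 129053,
             129152, 129158, 130535, 6620, 59818}
LOW_THRESHOLD = {21725, 59827, 128113, 129053, 129152, 130535}
SADDLE_POINT = {6620, 59818}


def convergence_note(index):
    """Return a short label for flagged molecules, or None if not flagged."""
    if index in SADDLE_POINT:
        return "saddle point"
    if index in LOW_THRESHOLD:
        return "low threshold"
    if index in DIFFICULT:
        return "difficult"
    return None


if __name__ == "__main__":
    for idx in (6620, 21725, 87037, 1):
        print(idx, convergence_note(idx))
```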

References
----------

[1] Lorenz C. Blum, Jean-Louis Reymond: 970 Million Druglike Small Molecules
    for Virtual Screening in the Chemical Universe Database GDB-13, Journal of
    the American Chemical Society 131(25): 8732-8733, 2009. DOI: 10.1021/ja902302h
[2] Larry A. Curtiss, Paul C. Redfern, Krishnan Raghavachari: Gaussian-4 theory 
    using reduced order perturbation theory, Journal of Chemical Physics 127(12):
    124105, 2007. DOI: 10.1063/1.2770701
[3] The XYZ format, originally developed for the XMol program by the Minnesota
    Supercomputer Center, is a widespread plain text format for exchange of molecules
    (atomic coordinates and annotation). There is no formal specification. See, e.g.,
    http://openbabel.org/wiki/XYZ or http://wiki.jmol.org/index.php/File_formats/Formats/XYZ