Data set dsgdb9nsd
==================
Thermochemical properties for 133885 small organic molecules at the DFT/B3LYP level of theory.

Please cite this publication if you use this data set:

* Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld:
  Quantum chemistry structures and properties of 134 kilo molecules,
  Scientific Data (2014).

Related publications:

* Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld:
  Learning the error: Augmenting legacy quantum chemistry with machine learning,
  submitted (2014).
* Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Mueller, O. Anatole von
  Lilienfeld: Fast and Accurate Modeling of Molecular Atomization Energies with
  Machine Learning, Physical Review Letters 108(5): 058301, 2012.
  DOI: 10.1103/PhysRevLett.108.058301

This data set is publicly available at

* http://dx.doi.org/10.6084/m9.figshare.XXXX

Files
-----
dsgdb9nsd.xyz.tar.bz2     - 133885 molecules with properties in XYZ-like format
dsC7O2H10nsd.xyz.tar.bz2  - 6095 isomers of C7O2H10 with properties in XYZ-like format
validation.txt            - 100 randomly drawn molecules from the 133885 set with enthalpies of formation
uncharacterized.txt       - 3054 molecules from the 133885 set that failed a consistency check
atomref.txt               - Atomic reference data
readme.txt                - Documentation

Molecules
---------
For a subset of the GDB-9 database [1] consisting of 133885 neutral organic
molecules composed of the elements H, C, N, O, and F, molecular geometries were
relaxed and properties calculated at the DFT/B3LYP/6-31G(2df,p) level of theory.

For a subset of 6095 isomers of C7O2H10, energetics were calculated
at the G4MP2 [2] level of theory.

For a validation set of 100 molecules randomly drawn from the 133885-molecule set,
enthalpies of formation were additionally calculated at the
DFT/B3LYP/6-31G(2df,p), G4MP2, G4, and CBS-QB3 levels of theory.

3054 of the 133885 GDB-9 molecules failed a consistency check: the Corina-generated
Cartesian coordinates and the B3LYP/6-31G(2df,p) equilibrium geometry led to
different SMILES strings.

Format
------
Each molecule is stored in its own file, ending in ".xyz".
The format is an ad hoc extension of the XYZ format [3].

Line          Content
----          -------
1             Number of atoms na
2             Properties 1-17 (see below)
3,...,na+2    Element type, coordinates (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom
na+3          Frequencies (3na-5 for linear or 3na-6 for nonlinear molecules)
na+4          SMILES from GDB9 and for relaxed geometry
na+5          InChI for GDB9 and for relaxed geometry
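
As an illustration of the line layout above, the following Python sketch reads
one molecule file into a small record. It is not part of the data set
distribution. The names Molecule, parse_xyz, to_float, and SAMPLE are
hypothetical, and the SAMPLE text contains made-up values (index 0 never occurs
in the data set). The sketch assumes whitespace-separated fields and, as a
precaution, also accepts exponents written as "*^" instead of "e".

```python
from dataclasses import dataclass
from typing import List, Tuple


def to_float(token: str) -> float:
    # Assumption: a file may write exponents as "1.5*^-6" instead of "1.5e-6".
    # The replacement leaves ordinary floats unchanged.
    return float(token.replace("*^", "e"))


@dataclass
class Molecule:
    n_atoms: int
    properties: List[str]                             # raw tokens of line 2
    atoms: List[Tuple[str, float, float, float, float]]  # element, x, y, z, charge
    frequencies: List[float]
    smiles: List[str]                                 # GDB9, relaxed geometry
    inchi: List[str]                                  # GDB9, relaxed geometry


def parse_xyz(text: str) -> Molecule:
    lines = text.strip().splitlines()
    na = int(lines[0])                                # line 1
    properties = lines[1].split()                     # line 2
    atoms = []
    for row in lines[2:2 + na]:                       # lines 3,...,na+2
        element, x, y, z, charge = row.split()
        atoms.append((element, to_float(x), to_float(y), to_float(z), to_float(charge)))
    frequencies = [to_float(t) for t in lines[2 + na].split()]  # line na+3
    smiles = lines[3 + na].split()                    # line na+4
    inchi = lines[4 + na].split()                     # line na+5
    return Molecule(na, properties, atoms, frequencies, smiles, inchi)


# Made-up example in the documented layout (values are NOT from the data set).
SAMPLE = """3
gdb9 0 1.0 2.0 3.0 1.85 6.3 -0.29 0.07 0.36 19.0 0.021 -76.40 -76.40 -76.40 -76.42 6.0
O 0.000 0.000 0.119 -0.6
H 0.000 0.763 -0.477 0.3
H 0.000 -0.763 -0.477 0.3
1600.0 3700.0 3800.0
O O
InChI=1S/H2O/h1H2 InChI=1S/H2O/h1H2
"""

mol = parse_xyz(SAMPLE)
# The number of frequencies should be 3na-5 or 3na-6.
assert len(mol.frequencies) in (3 * mol.n_atoms - 5, 3 * mol.n_atoms - 6)
```

Splitting on arbitrary whitespace keeps the sketch independent of whether a
file separates fields with tabs or spaces.
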

The properties stored in the second line of each file:

I.  Property  Unit         Description
--  --------  -----------  --------------
 1  tag       -            "gdb9"; string constant to ease extraction via grep
 2  index     -            Consecutive, 1-based integer identifier of molecule
 3  A         GHz          Rotational constant A
 4  B         GHz          Rotational constant B
 5  C         GHz          Rotational constant C
 6  mu        Debye        Dipole moment
 7  alpha     Bohr^3       Isotropic polarizability
 8  homo      Hartree      Energy of highest occupied molecular orbital (HOMO)
 9  lumo      Hartree      Energy of lowest unoccupied molecular orbital (LUMO)
10  gap       Hartree      Gap, difference between LUMO and HOMO
11  r2        Bohr^2       Electronic spatial extent
12  zpve      Hartree      Zero point vibrational energy
13  U0        Hartree      Internal energy at 0 K
14  U         Hartree      Internal energy at 298.15 K
15  H         Hartree      Enthalpy at 298.15 K
16  G         Hartree      Free energy at 298.15 K
17  Cv        cal/(mol K)  Heat capacity at 298.15 K

I. = Property index (properties are given in this order)

For the 6095 isomers, properties 12-16 were calculated at the G4MP2 level of theory.
All other calculations were done at the DFT/B3LYP/6-31G(2df,p) level of theory.
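
The property order listed above can be turned into named fields with a short
Python sketch such as the one below. The names PROPERTY_NAMES,
parse_properties, and HARTREE_TO_EV are hypothetical and not part of the data
set. The example line contains made-up values, and the tolerance used in the
gap check is an arbitrary choice for illustration.

```python
from typing import Dict, Union

# Property names in the order documented for line 2 of each file.
PROPERTY_NAMES = [
    "tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]

HARTREE_TO_EV = 27.211386  # approximate conversion factor, 1 Hartree in eV


def parse_properties(line: str) -> Dict[str, Union[str, int, float]]:
    tokens = line.split()
    if len(tokens) != len(PROPERTY_NAMES):
        raise ValueError(f"expected {len(PROPERTY_NAMES)} fields, got {len(tokens)}")
    props: Dict[str, Union[str, int, float]] = {
        "tag": tokens[0],
        "index": int(tokens[1]),
    }
    for name, token in zip(PROPERTY_NAMES[2:], tokens[2:]):
        # Assumption: exponents may appear as "*^" instead of "e".
        props[name] = float(token.replace("*^", "e"))
    return props


# Made-up property line (values are NOT from the data set).
line = "gdb9 0 1.0 2.0 3.0 1.85 6.3 -0.29 0.07 0.36 19.0 0.021 -76.40 -76.40 -76.40 -76.42 6.0"
props = parse_properties(line)

# Property 10 is documented as the difference between LUMO and HOMO.
gap_is_consistent = abs((props["lumo"] - props["homo"]) - props["gap"]) < 1e-4

# Convert the gap from Hartree to electronvolts.
gap_ev = props["gap"] * HARTREE_TO_EV
```

Keeping the names in a single list makes the mapping easy to check against the
table of properties.
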

Notes
-----
Out of the 133885 molecules, geometries of the 11 molecules with indices
21725, 87037, 59827, 117523, 128113, 129053, 129152, 129158, 130535, 6620, 59818
were difficult to converge.
Low-threshold convergence was possible for 21725, 59827, 128113, 129053, 129152, 130535.
Molecules 6620 and 59818 converged to very low-lying saddle points, with lowest frequency < 10i cm^-1.

References
----------
[1] Lorenz C. Blum, Jean-Louis Reymond: 970 Million Druglike Small Molecules
    for Virtual Screening in the Chemical Universe Database GDB-13, Journal of
    the American Chemical Society 131(25): 8732-8733, 2009. DOI: 10.1021/ja902302h
[2] Larry A. Curtiss, Paul C. Redfern, Krishnan Raghavachari: Gaussian-4 theory
    using reduced order perturbation theory, Journal of Chemical Physics 127(12):
    124105, 2007. DOI: 10.1063/1.2770701
[3] The XYZ format, originally developed for the XMol program by the Minnesota
    Supercomputer Center, is a widespread plain text format for exchange of molecules
    (atomic coordinates and annotation). There is no formal specification. See, e.g.,
    http://openbabel.org/wiki/XYZ or http://wiki.jmol.org/index.php/File_formats/Formats/XYZ