Chemical Graph Generators
- This is a PLOS Topic Page draft
About the Authors
Mehmet Aziz Yirik
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic in cheminformatics. Chemical graph generators are used in areas such as virtual library generation in drug design, molecular design with specified properties (inverse QSAR/QSPR), organic synthesis design, retrosynthesis, and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology.
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. These generators are the core of CASE systems. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The first structure generators were graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry.
CONGEN dealt well with overlaps in substructures (Figure 1). The overlaps among substructures rather than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection.
After DENDRAL, another mathematical method was reported: MASS, a tool for the mathematical synthesis and analysis of molecular structures. Like CONGEN, the algorithm worked as an adjacency matrix generator. Many mathematical generators are descendants of the efficient branch-and-bound methods from Faradjev and from Read's orderly generation method. Although these reports date from the 1970s, they are still the fundamental references for structure generators. In the orderly generation method, specific order-check functions are performed on graph representatives, such as vectors. For example, MOLGEN performs a descending order check while filling the rows of adjacency matrices. This descending order check is based on an input valence distribution. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures, which provide known bonds in the molecule. One of the earliest attempts was made by Abe in 1975 using a pattern recognition-based structure generator. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. Abe and his collaborators also published the first paper on CHEMICS, a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. CHEMICS generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer by layer starting with these primary components. These component sets are represented as vectors and are used as building blocks in the process.
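The bond-by-bond extension at the heart of structure assembly can be illustrated with a minimal Python sketch. It is not the algorithm of any generator named here. It assumes standard valences (the `VALENCE` table), treats hydrogens as implicit, limits bond orders to 3, and removes duplicates with a brute-force canonical form that is only practical for a handful of heavy atoms. The function name `assemble` is hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed standard valences


def assemble(heavy_atoms, n_hydrogens):
    """Enumerate connected heavy-atom skeletons level by level, one bond unit per level."""
    n = len(heavy_atoms)
    pairs = list(combinations(range(n), 2))
    index = {pair: k for k, pair in enumerate(pairs)}
    # Relabelings that keep every element in place are used to detect duplicates.
    perms = [p for p in permutations(range(n))
             if all(heavy_atoms[p[i]] == heavy_atoms[i] for i in range(n))]

    def canonical(state):
        # Smallest bond-order tuple over all element-preserving relabelings.
        return min(tuple(state[index[tuple(sorted((p[i], p[j])))]] for i, j in pairs)
                   for p in perms)

    def free_valences(state):
        free = [VALENCE[a] for a in heavy_atoms]
        for (i, j), order in zip(pairs, state):
            free[i] -= order
            free[j] -= order
        return free

    def connected(state):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for (i, j), order in zip(pairs, state):
                if order and u in (i, j):
                    v = j if u == i else i
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == n

    # Each bond unit uses two valences; the rest must be filled by hydrogens.
    n_bonds, odd = divmod(sum(VALENCE[a] for a in heavy_atoms) - n_hydrogens, 2)
    if odd or n_bonds < 0:
        return []
    level = {canonical((0,) * len(pairs))}  # root of the generation tree: no bonds
    for _ in range(n_bonds):
        next_level = set()
        for state in level:
            free = free_valences(state)
            for k, (i, j) in enumerate(pairs):
                if free[i] and free[j] and state[k] < 3:
                    child = state[:k] + (state[k] + 1,) + state[k + 1:]
                    next_level.add(canonical(child))  # duplicates collapse here
        level = next_level
    return sorted(state for state in level if connected(state))


isomers = assemble(["C", "C", "O"], 6)  # heavy-atom skeletons for C2H6O
```

Each returned tuple lists the bond orders of the atom pairs in the order produced by `combinations`. Every intermediate level is held in memory, which is the storage cost that assembly methods pay.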
Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a tree, the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The second version of ASSEMBLE was released in 2000. Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps.
The efficiency and exhaustivity of generators also depend on the data structures. Unlike previous methods, AEGIS was a list-processing generator. Compared to adjacency matrices, list data require less memory. As no spectral data were interpreted in this system, the user needed to provide substructures as inputs. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules. It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform can generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system, and SENECA was developed to eliminate the shortcomings of LUCY. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The structure is then transformed into another structure, and the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. LSD (Logic for Structure Determination) is an important contribution from French scientists. The tool uses spectral data such as HMBC and COSY data to generate all possible structures. LSD is an open-source structure generator released under the General Public License (GPL). StrucEluc, a well-known commercial CASE system, also includes an NMR-based generator. This tool is from ACD Labs and, notably, from one of the developers of MASS, Elyashberg. COCON is another NMR-based structure generator, relying on theoretical data sets for structure generation. All NMR types except J-HMBC and J-COSY can be used as inputs.
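The stochastic search loop described above for SENECA (generate a structure, score it with an energy, transform it, repeat) follows the general simulated annealing pattern, sketched below in Python. This is a generic illustration, not SENECA's implementation. The names `anneal`, `neighbour` and `energy`, the cooling schedule and the toy integer problem are all assumptions; in a real CASE system the state would be a molecular structure and the energy would measure disagreement with the spectral data.

```python
import math
import random


def anneal(start, neighbour, energy, steps=2000, t0=10.0, cooling=0.995, seed=0):
    """Generic simulated annealing: returns the best state seen and its energy."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)      # transform the current state
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Metropolis criterion: always accept improvements, sometimes accept worse states.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        temperature *= cooling                   # geometric cooling schedule
    return best, e_best


# Toy stand-in for a structure search: find the integer with the lowest "energy".
best, e_best = anneal(
    start=0,
    neighbour=lambda x, rng: x + rng.choice((-1, 1)),
    energy=lambda x: (x - 7) ** 2,
)
```

Accepting some uphill moves at high temperature is what lets the search escape local optima, at the price of giving up the exhaustiveness of a deterministic generator.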
In 1994, Chinese scientists reported an integer partitioning-based structure generator. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.
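Integer partitioning itself is easy to sketch. The Python generator below lists every way of writing a number as an unordered sum, for example the ways a given number of carbon atoms could be distributed over fragments. Mapping partitions onto chemical fragments is an assumption made for illustration, not the ESESOC procedure.

```python
def partitions(n, largest=None):
    """Yield the partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, largest), 0, -1):
        # Later parts may not exceed the current one, so each partition appears once.
        for rest in partitions(n - part, part):
            yield (part,) + rest


carbon_splits = list(partitions(4))
# → [(4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
```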
A series of stochastic generators was reported by Faulon. His software, SIGNATURE, was integrated into this stochastic generator for canonical labelling and duplicate checks. As in many other generators, the tree approach is the skeleton of Faulon's structure generators. However, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is connected with other atoms. Similar to the former generator, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method (Figure 4). This tree extension terminates when all the branches reach saturated structures.
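The idea of extending only one atom per equivalence class can be sketched as follows. The class key used here (element plus the sorted list of first-neighbour elements and bond orders) is an assumption standing in for "interaction type". It is a coarse approximation of true symmetry classes and not Faulon's definition.

```python
from collections import defaultdict


def equivalence_classes(elements, bonds):
    """Group atom indices by element and first-neighbour environment."""
    neighbours = defaultdict(list)
    for (a, b), order in bonds.items():
        neighbours[a].append((elements[b], order))
        neighbours[b].append((elements[a], order))
    classes = defaultdict(list)
    for atom, element in enumerate(elements):
        key = (element, tuple(sorted(neighbours[atom])))
        classes[key].append(atom)
    return list(classes.values())


# Carbon skeleton of propane: C0-C1-C2
classes = equivalence_classes(["C", "C", "C"], {(0, 1): 1, (1, 2): 1})
representatives = [members[0] for members in classes]  # one atom per class to extend
```

For the propane skeleton the two terminal carbons fall into one class, so only two of the three atoms need to be considered for extension.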
OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added. Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator). MOLGEN outperforms PMG using only 1 core; however, PMG outperforms MOLGEN by increasing the number of cores to 10.
The constructive search algorithm is a branch-and-bound method, like Faradjev's algorithm, and offers an additional solution to memory problems. Branch-and-bound methods are matrix generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. In these algorithms, the canonicity criteria and isomorphism checks are based on automorphism groups from mathematics. MASS, SMOG and Bangov’s algorithm are good examples in the literature. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS.
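A minimal Python sketch of valence-driven matrix generation is given below. It fills the upper triangle of a symmetric matrix cell by cell and abandons a branch as soon as a completed row cannot meet its valence. It uses adjacency matrices, omits the canonicity test that real generators apply to reject isomorphic duplicates, and the name `fill_matrices` and the `max_order` limit are assumptions.

```python
def fill_matrices(valences, max_order=3):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(valences)
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)  # valence still to be used by each atom
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]

    def fill(k):
        if k == len(cells):
            if not any(remaining):
                yield [row[:] for row in matrix]
            return
        i, j = cells[k]
        for order in range(min(max_order, remaining[i], remaining[j]), -1, -1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            # Bound: once row i is complete it must have used its whole valence.
            if not (j == n - 1 and remaining[i]):
                yield from fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)


matrices = list(fill_matrices([1, 2, 1]))
# → [[[0, 1, 0], [1, 0, 1], [0, 1, 0]]]
```

Only the matrix under construction is kept in memory, which illustrates why this family of methods has low memory usage compared with methods that store intermediate structures.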
Unlike previous methods, MOLGEN is the only maintained efficient generic structure generator, developed as a closed-source platform by a group of mathematicians as an application of computational group theory. MOLGEN is an orderly generation method. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates.
Recent studies in the field are from Funatsu's research group. In this assembly method, building blocks, such as ring systems and atom fragments, are used in the structure generation. Every intermediate structure is extended by adding building blocks in all possible ways. To reduce the number of duplicates, McKay's canonical path augmentation method is used. To overcome the combinatorial explosion in the generation, the applicability domain and ring systems are detected based on inverse QSPR/QSAR analysis. The applicability domain, or target area, is described based on given biological and pharmaceutical activity information from QSPR/QSAR. In that study, monotonically changed descriptors (MCDs) are used to describe applicability domains. For every extension of an intermediate structure, the MCDs are updated. The use of MCDs reduces the search space in the generation process. In QSPR/QSAR-based structure generation, the synthesizability of the generated structures is not guaranteed. Using retrosynthesis paths in the generation makes the generation process more efficient. For example, a well-known tool called RetroPath is used for molecular structure enumeration and virtual screening based on given reaction rules. Its core algorithm is a breadth-first method that generates structures by applying reaction rules to each source compound. Structure generation and enumeration are performed based on the canonical augmentation method proposed by McKay. RetroPath 2.0 provides a variety of workflows such as isomer transformation, enumeration, QSAR and metabolomics.
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA, an exhaustive and recursive bond-removal method. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
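The bond-removal idea can be sketched in Python under strong simplifying assumptions: the hyper structure is the complete graph over the atoms, all bonds are single, and the only constraint is a target degree per atom. This is not COCOA's algorithm, and the name `reduce_hyperstructure` is hypothetical.

```python
from itertools import combinations


def reduce_hyperstructure(target_degree):
    """Delete bonds from the complete graph until every atom has its target degree."""
    n = len(target_degree)
    all_bonds = list(combinations(range(n), 2))
    results = set()

    def degrees(bonds):
        deg = [0] * n
        for a, b in bonds:
            deg[a] += 1
            deg[b] += 1
        return deg

    def connected(bonds):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for a, b in bonds:
                if u in (a, b):
                    v = b if u == a else a
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == n

    def remove_bonds(bonds, start):
        deg = degrees(bonds)
        if deg == list(target_degree):
            if connected(bonds):
                results.add(bonds)
            return
        # Bonds are deleted in index order so each bond subset is visited once.
        for k in range(start, len(all_bonds)):
            a, b = all_bonds[k]
            # A bond may only go if both of its atoms still have too many bonds.
            if deg[a] > target_degree[a] and deg[b] > target_degree[b]:
                remove_bonds(bonds - {(a, b)}, k + 1)

    remove_bonds(frozenset(all_bonds), 0)
    return results


chains = reduce_hyperstructure([1, 2, 2, 1])  # two end atoms, two middle atoms
```

The number of bonds in the starting graph grows quadratically with the number of atoms, and the number of deletion sequences grows much faster, which reflects the disadvantage noted above.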
Bohanec’s structure generator, GEN, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids the disadvantages in the generation step. Structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections that create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively (Figure 2). The bond order corresponds to the edge multiplicity, and as a result, chemical graphs are vertex- and edge-labelled graphs. A vertex- and edge-labelled graph $G = (V, E)$ is described as a chemical graph, where $V$ is the set of vertices, i.e., atoms, and $E$ is the set of edges, which represent the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, that is, the maximum number of bonds a chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence. A graph is connected if there is at least one path between each pair of vertices. Although chemical mixtures are one of the main interests of many chemists, many structure generators output only connected chemical graphs because of the computational explosion. Thus, the connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules. A molecule is saturated if all its atoms are saturated.
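These definitions translate directly into code. The Python sketch below stores a chemical graph as a list of element symbols plus a dictionary mapping atom pairs to bond orders, then checks saturation and connectivity. The data layout, the `VALENCE` table and the function names are assumptions made for illustration.

```python
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}  # assumed standard valences


def degrees(n_atoms, bonds):
    """Degree of each atom, counting a bond of order k as k connections."""
    deg = [0] * n_atoms
    for (a, b), order in bonds.items():
        deg[a] += order
        deg[b] += order
    return deg


def is_saturated(elements, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[e] for d, e in zip(degrees(len(elements), bonds), elements))


def is_connected(n_atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    adjacency = {v: set() for v in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for v in adjacency[stack.pop()] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n_atoms


# Ethene, C2H4: a C=C double bond and four C-H single bonds.
elements = ["C", "C", "H", "H", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1}
ok = is_saturated(elements, bonds) and is_connected(len(elements), bonds)
```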
Symmetry Groups for Molecular Graphs
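A group $G$ is a set of elements together with an associative binary operation defined on $G$ such that the following hold:
- There is an element $e$ in $G$ satisfying $g \cdot e = g$ for all elements $g$ of $G$.
- For each element $g$ of $G$, there is an element $g^{-1}$ such that $g \cdot g^{-1}$ is equal to the identity element $e$.
The order of a group is the number of elements in the group. Let us assume $X$ is a set of integers. Under the function composition operation, $\mathrm{Sym}(X)$ is the symmetric group, the set of all permutations over $X$. If the size of $X$ is $n$, then the order of $\mathrm{Sym}(X)$ is $n!$. Set systems consist of a finite set $X$ and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the automorphisms of the graph. An automorphism permutes the vertices of a graph; in other words, it maps a graph onto itself. This action is edge-vertex preserving. If $(u, v)$ is an edge of the graph $G = (V, E)$ (here $G$ denotes a graph rather than a group) and $\pi$ is a permutation of $V$, then
$$\pi(u, v) = (\pi(u), \pi(v)).$$
A permutation $\pi$ of $V$ is an automorphism of the graph $G$ if $\pi(u, v)$ is an edge of $G$ whenever $(u, v)$ is an edge of $G$.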
The automorphism group of a graph $G$, denoted $\mathrm{Aut}(G)$, is the set of all automorphisms of $G$. In molecular graphs, canonical labelling and molecular symmetry detection (Figure 3) are implementations of automorphism groups. Although there are well-known canonical labelling methods in the field, such as InChI and ALATIS, NAUTY is a commonly used and efficient software package for automorphism group calculations and canonical labelling. For example, OMG makes use of NAUTY.
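For very small graphs, the automorphism group can be computed directly from the definition by testing every permutation of the vertices. The brute-force Python sketch below does this for element-labelled graphs. It is meant only to make the definition concrete and is unrelated to the far more efficient search used by NAUTY; the function name `automorphisms` is an assumption.

```python
from itertools import permutations


def automorphisms(labels, edges):
    """Return all vertex permutations that preserve labels and map edges onto edges."""
    n = len(labels)
    edge_set = {frozenset(edge) for edge in edges}
    found = []
    for p in permutations(range(n)):          # all n! elements of Sym(V)
        if any(labels[p[v]] != labels[v] for v in range(n)):
            continue                          # would map an atom onto another element
        if all(frozenset((p[u], p[v])) in edge_set for u, v in edges):
            found.append(p)
    return found


# Carbon skeletons: propane (a path) and cyclopropane (a triangle).
propane = automorphisms(["C", "C", "C"], [(0, 1), (1, 2)])
cyclopropane = automorphisms(["C", "C", "C"], [(0, 1), (1, 2), (0, 2)])
# len(propane) → 2, len(cyclopropane) → 6
```

Because a permutation is a bijection, mapping every edge onto an edge is enough to guarantee that the edge set is mapped onto itself. The number of permutations tested is $n!$, which is why practical tools avoid this exhaustive approach.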
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first or depth-first search methods and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest from many scientists in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of fast open-source structure generators.
The available software packages and their links are listed below.
References
- G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.
- V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.
- I. Faradjev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.
- Colbourn CJ, Read RC. ‘Orderly algorithms for generating restricted classes of graphs’, Journal of Graph Theory, 3(2):187-95, 1979.
- Grüner T, Laue R, Meringer M, Bayreuth U., ‘Algorithms for group actions: Homomorphism principle and orderly generation applied to graphs’, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 28:113-22, 1997.
- H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.
- S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978.
- C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.
- Carhart RE, DH S, NAB G, ‘GENOA: A computer program for structure elucidation utilizing overlapping and alternative substructures’, 1981.
- H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.
- C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.
- C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.
- Faulon JL. Stochastic generator of chemical structure. 2. Using simulated annealing to search the space of constitutional isomers. Journal of Chemical Information and Computer Sciences. 1996 Jul 24;36(4):731-40.
- J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.
- K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.
- Junker J, ‘Theoretical NMR correlations based structure discussion’, Journal of cheminformatics, 3(1):27, 2011.
- C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.
- J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.
- J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.
- Faulon JL, Churchwell CJ, and Visco DP. ‘The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences’, J. Chem. Inf. Comput. Sci., 43(3):721-34, 2003.
- J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.
- B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.
- M.A. Yirik, ‘The Benchmark for Structure Generators’, Blogger, 2020.
- M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.
- I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.
- A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.
- Miyao T, Kaneko H, Funatsu K. ‘Ring system-based chemical graph generation for de novo molecular design’, Journal of computer-aided molecular design, 30(5):425-46, 2016.
- Miyao T, Kaneko H, Funatsu K., ‘Ring‐System‐Based Exhaustive Structure Generation for Inverse‐QSPR/QSAR’, Molecular informatics, 33(11‐12):764-78, 2014.
- Miyao T, Arakawa M, Funatsu K., ‘Exhaustive structure generation for inverse‐QSPR/QSAR’, Molecular informatics, 29(1‐2):111-25, 2010.
- Delépine B, Duigou T, Carbonell P, Faulon JL, ‘RetroPath2.0: A retrosynthesis workflow for metabolic engineers’, Metabolic engineering, 45:158-70, 2018.
- Koch M, Duigou T, Carbonell P, Faulon JL, ‘Molecular structures enumeration and virtual screening in the chemical space with RetroPath2.0’, Journal of cheminformatics, 9(1):64, 2017.
- Kadurin A, Nikolenko S, Khrabrov K, Aliper A, Zhavoronkov A, ‘druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico’, Molecular pharmaceutics, 14(9):3098-104, 2017.
- Blaschke T, Olivecrona M, Engkvist O, Bajorath J, Chen H, ‘Application of generative autoencoder in de novo molecular design’, Molecular informatics, 37(1-2):1700123, 2018.
- S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.
- A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.
- Massiot G, Nuzillard JM, ‘Computer‐assisted elucidation of structures of natural products’, Phytochemical analysis, 3(4):153-9, 1992.
- D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.
- Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D, ‘InChI, the IUPAC international chemical identifier’, Journal of cheminformatics, 7(1):23, 2015.
- Dashti H, Westler WM, Markley JL, Eghbalnia HR, ‘Unique identifiers for small molecules enable rigorous labeling of their atoms’, Scientific data, 4:170073, 2017.