QuChemPedIA : Quantum Chemistry encycloPed and Intelligence Artificielle.
Invitation code : 3VwMu3-eTCg32
Molecular chemistry is lagging behind in terms of open science. Although modelling by quantum mechanics applied to chemistry has become almost mandatory in any major publication, raw computational data is most of the time kept in the labs or destroyed. Furthermore, the software used in this area tends to lack effective quality control, and computational details are usually incomplete in the articles, so the information often cannot be reused or reproduced. The first objective of this project is to build a large collaborative open platform that will compute and store quantum molecular chemistry results. Original output files will be available for reuse in new chemical studies for different applications. Machine learning, and more generally artificial intelligence, applied to chemistry data promises to revolutionize this area in the near future, but these methods require a lot of data, which this project will be able to provide.
Today, it is impossible for a human to take into account the results, even limited to the most important data, for millions of known molecules. The second objective of this project is to radically change the approach by developing artificial intelligence and optimization methods to explore the highly combinatorial molecular space efficiently. Generative models aim to provide an artificial assistant which, on the one hand, has learned to predict the characteristics of a molecule and estimate its cost of synthesis and, on the other hand, is able to browse the molecular space effectively. Generative models would open many perspectives by greatly facilitating the screening of new molecules with many potential applications (energy, medicine, materials, etc.). The bottleneck for our AIs is the computing power needed to verify the properties of the generated molecules.
By supporting this project, you will help chemical researchers around the world by building a unique collection of results. You will also help our AIs to propose many more new targets for the different applications we are addressing than we could on our own.
Thank you for your help!
Thomas Cauchy (chemist)
Benoit Da Mota (computer scientist)
Molecules are coming!
The new batches of molecules are coming! With them come the new credit system (200 credits per WU) and quorum validation. The expected runtime is 2-3 hours on a recent personal computer.
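To illustrate how fixed credits and quorum validation can fit together, here is a minimal sketch in Python. It is not the project's validator and does not use the BOINC server API: the function name, the quorum size of 2, the comparison on a single reported energy and the tolerance are all assumptions made for the example. Only the 200 credits per WU comes from the announcement.

```python
FIXED_CREDIT = 200   # credits per validated work unit, as announced
QUORUM = 2           # assumed quorum size; the announcement does not state it
TOLERANCE = 1e-6     # assumed tolerance when comparing reported energies


def validate_work_unit(results, quorum=QUORUM, tol=TOLERANCE):
    """Hypothetical validator: `results` maps host id -> reported energy.

    The largest group of hosts whose values agree within `tol` is taken as
    the consensus. If it reaches the quorum, each member receives the fixed
    credit; every other host receives nothing.
    """
    hosts = sorted(results)
    best_group = []
    for ref in hosts:
        # Hosts whose value agrees with this reference host.
        group = [h for h in hosts if abs(results[h] - results[ref]) <= tol]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) < quorum:
        # No consensus yet: nobody is credited for this work unit.
        return {h: 0 for h in hosts}
    return {h: (FIXED_CREDIT if h in best_group else 0) for h in hosts}
```

With fixed credits, what a host reports about its own runtime or benchmark no longer affects what it earns, which is why this scheme is harder to fool than a runtime-based one.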
You can also see a new beta application (NWChem long) that will be used for the bigger calculations we talked about. The inputs are ready since the poll, but I still have to perform some tests. Stay tuned!
3 Feb 2020, 16:10:53 UTC · Discuss
Credits and Gridcoin
Yesterday, I had to suspend an account for two weeks and remove its credits while investigating apparent credit cheating. I'm quite annoyed that instead of doing science, I have to deal with this kind of behavior. We're small and we're short on time, and it doesn't help scientific research...
EDIT: after investigation and fruitful exchanges, the problem has been identified, and I'm sorry to have been a bit rough with this user.
The current credit system is too easy to fool, so I'm going to move to something simpler, more robust and more generous on average: fixed credits. For short tasks (such as od9), I'm going to award 200 credits. This change requires draining the task queue. Once it is empty, I will submit new tasks. These new tasks will be the opportunity to deploy the new code with checkpoints, system signals and affinity management for large systems (>32 cores). Some errors are to be expected, as I can't test everything.
The last point concerns the requests for Gridcoin. I've been asked by the developers and by some of you. I am not against this possibility, but three points currently prevent the project from being whitelisted. First, I can't guarantee to always have tasks waiting to be calculated. Secondly, the incentive to cheat will increase, and I find that increasing the quorum is a waste of resources. Thirdly, I'm struggling to keep the server up. The upcoming arrival of larger molecules should settle the first point. For the second point, we are thinking about validating tasks by analyzing the result. For the third point, I have already made many optimizations, and at the moment it's much better.
Benoit Da Mota
30 Jan 2020, 9:03:31 UTC · Discuss
New Linux app and new WU
I have written a new version of the application for Linux (0.12), which is deployed in beta. Checkpoints have been added, but the display of the task progress is not correct. Don't worry, the computation does resume from where it stopped. Moreover, I've added ad hoc management of system signals, to interrupt and resume tasks correctly. WARNING: this code is in beta and has a very high chance of failing. Please only use it if you want to monitor what is going on and help with debugging. If the code does not cause problems, it will quickly become the new reference code (i.e. not in beta).
For Mac and Windows users, I am currently looking for workarounds for problems with VirtualBox.
I'll soon be putting out short tasks for small molecules in the od9 series. Stay tuned!
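For readers curious about how checkpoints and signal handling can work together, here is a minimal sketch in Python. It is not the code of the Linux application: the class, the JSON checkpoint file and the step-based work loop are hypothetical and only illustrate the general technique of saving state after each step and stopping cleanly when a signal arrives.

```python
import json
import os
import signal


class CheckpointedTask:
    """Hypothetical step-based task that can be interrupted and resumed."""

    def __init__(self, checkpoint_path, total_steps):
        self.checkpoint_path = checkpoint_path
        self.total_steps = total_steps
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag; the main loop exits at the next
        # step boundary, so the checkpoint on disk is always consistent.
        self.stop_requested = True

    def install_signal_handlers(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def load_state(self):
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return json.load(f)
        return {"step": 0, "result": 0}

    def save_state(self, state):
        # Write to a temporary file, then rename it over the old checkpoint,
        # so an interruption during the write cannot corrupt the checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.checkpoint_path)

    def run(self, work):
        """Call work(step, result) for each remaining step; return the state."""
        state = self.load_state()
        while state["step"] < self.total_steps and not self.stop_requested:
            state["result"] = work(state["step"], state["result"])
            state["step"] += 1
            self.save_state(state)
        return state
```

A second `CheckpointedTask` created with the same checkpoint path continues from the last saved step rather than from zero. A progress display that only counts the steps done since the restart would start again at 0% even though no work is lost.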
28 Jan 2020, 11:08:23 UTC · Discuss
Updates and poll
Dear QuChemPedIA crunchers!
The first generation of our newly generated small molecules is almost finished. Thanks again.
We have two proposals for the new phase of calculations:
1. Pause (for maybe a month or so) in order to parse and process the recent calculations, learn from their successes and failures, and then generate new small molecules, probably with slightly more than 9 atoms.
2. Take some of the newly generated compounds, add them to a core (BTX) used in the chemistry lab here in Angers (see the abstract of this article https://pubs.rsc.org/en/content/articlelanding/2019/nj/c9nj05804d/unauth#!divAbstract) to demonstrate how we can use our newly generated molecules inside a real system, to show how a fragment can modify the core properties and to serve as a screening example. These calculations are very interesting and can lead to very nice applications (drugs and materials).
Beware that the second choice means that the molecules will have more than 9 heavy atoms, probably more than 30, and so calculations could take days. The good news is that the next workunits will implement checkpointing. BOINC will not be able to display the real level of progress and will think that the calculation starts again from the beginning. But we've run some tests, and the calculations restart from the very last step. The expected calculation times will always be very approximate and unreliable, so we will deliberately choose a slightly high value.
If you choose the first option, we will calculate the BTX ones with our private resources, and we will post a news item once we have processed the results and generated new small molecules.
Thank you for giving your choices and opinions under this post.
Thomas and Benoit
14 Jan 2020, 14:24:40 UTC · Discuss
Our article titled "Dataset’s chemical diversity limits the generalizability of machine learning predictions" has been accepted and published! It is an Open Access article:
If you have any questions, feel free to contact us on the forum of the project (under this message).
Here is a message from Thomas Cauchy about our research:
I am the chemist of this project. The publication mentioned by Benoit Da Mota was written when we launched the BOINC project. But I can extract some sentences from this article to show what we have in mind:
"Abstract: The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 "heavy" atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset."
The QM9 dataset has around 130k small molecules, while our PC9 has 119k (but was extracted from another type of calculation). The problem is that the full results of QM9 are not openly available. Its authors extracted some results from the costly quantum mechanics calculations and discarded the logs. We are not satisfied with PC9, which was a simple demonstration that more diversity is needed.
For the moment, the BOINC project is aiming at recalculating the interesting molecules of QM9 and PC9, this time at the same level of calculation. All the results will be available in the QuChemPedIA document base, https://quchempedia.univ-angers.fr, once this platform is a little more robust (beginning of 2020), along with our quality control tool, as my colleague wrote.
We are not fully happy with NWChem yet. For the same BOINC project, Benoit Da Mota and I are using Gaussian (proprietary), which is much more efficient. But NWChem is open source...
We have calculated roughly 130k out of 200k thanks to your help!
For December, we hope to propose that the community calculate new molecules that may not even exist and may not be stable, in order to help machine learning tools generalize better. Those new molecules will be generated by a machine learning procedure, which is too long to explain here right now.
If you have any question...
©2020 Benoit DA MOTA - LERIA, University of Angers, France