Scientific publication

Message boards : News : Scientific publication
damotbe
Project administrator
Project scientist

Joined: 23 Jul 19
Posts: 68
Credit: 1,237,766
RAC: 44,063
Message 285 - Posted: 13 Nov 2019, 19:33:58 UTC

Hello everybody!

Our article, titled "Dataset’s chemical diversity limits the generalizability of machine learning predictions", has been accepted and published! It is an Open Access article:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0391-2?fbclid=IwAR0HrALNqT0HRaCUtBMeBcchJxISsiypO2TUJF9zV5EEGK395ODe941Y3_0

If you have any questions, feel free to ask us on the project forum (in a reply to this message).

Cheers!
Benoit

Here is a message from Thomas Cauchy about our research:
Hello,

I am the chemist of this project. The publication mentioned by Benoit Da Mota was written when we launched the BOINC project, but a few sentences from the article show what we have in mind:

"Abstract: The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 "heavy" atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset."

The QM9 dataset has around 130k small molecules, whereas our PC9 has 119k (extracted, however, from a different type of calculation). The problem is that the full results of QM9 are not openly available: its authors extracted some results from the costly quantum mechanics calculations and discarded the logs. We are not satisfied with PC9, which was a simple demonstration that more diversity is needed.
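
As a rough illustration of why diversity matters, here is a toy sketch of the cross-dataset test mentioned in the abstract. It is not the code, data, descriptors, or hyperparameters from the article: the one-dimensional descriptor, the `toy_energy` function, and the "narrow" and "diverse" grids are all assumptions made up for this example. It fits a small kernel ridge regression on each set and evaluates it on the other one.

```python
import math


def toy_energy(x):
    """Hypothetical 'energy' of a molecule with a 1-D descriptor x (illustration only)."""
    return math.sin(2.0 * x) + 0.5 * x


def linspace(start, stop, count):
    """Evenly spaced values from start to stop, both included."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two scalar descriptors."""
    return math.exp(-gamma * (a - b) ** 2)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_krr(xs, ys, gamma=1.0, ridge=1e-3):
    """Kernel ridge regression: dual coefficients alpha = (K + ridge * I)^-1 y."""
    n = len(xs)
    gram = [[rbf_kernel(xs[i], xs[j], gamma) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    return solve_linear_system(gram, ys)


def predict_krr(train_xs, alpha, x, gamma=1.0):
    """Prediction is a kernel-weighted sum over the training points."""
    return sum(a * rbf_kernel(t, x, gamma) for a, t in zip(alpha, train_xs))


def mean_abs_error(train_xs, alpha, test_xs, gamma=1.0):
    """Mean absolute error of the fitted model against toy_energy on test_xs."""
    errors = [abs(predict_krr(train_xs, alpha, x, gamma) - toy_energy(x))
              for x in test_xs]
    return sum(errors) / len(errors)


# "Narrow" set: descriptors confined to a small region of the toy chemical space.
narrow_train = linspace(-1.0, 1.0, 30)
narrow_test = linspace(-0.95, 0.95, 20)
# "Diverse" set: same number of points spread over a wider region.
diverse_train = linspace(-3.0, 3.0, 30)
diverse_test = linspace(-2.9, 2.9, 20)

alpha_narrow = fit_krr(narrow_train, [toy_energy(x) for x in narrow_train])
alpha_diverse = fit_krr(diverse_train, [toy_energy(x) for x in diverse_train])

# Cross-dataset evaluation: train on one set, test on the other.
mae_narrow_to_diverse = mean_abs_error(narrow_train, alpha_narrow, diverse_test)
mae_diverse_to_narrow = mean_abs_error(diverse_train, alpha_diverse, narrow_test)

print("trained on narrow, tested on diverse:", mae_narrow_to_diverse)
print("trained on diverse, tested on narrow:", mae_diverse_to_narrow)
```

In this toy setting, the model trained on the narrow set has to extrapolate outside the region it has seen, so its error on the diverse set should be much larger than the error in the other direction. Real molecular descriptors are high-dimensional, so this only conveys the general idea.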

For the moment, the BOINC project aims to recalculate the interesting molecules of QM9 and PC9, this time at the same level of calculation for both. All the results will be available in the quchempedia document base, https://quchempedia.univ-angers.fr, once the platform is a little more robust (early 2020), together with our quality control tool, as described by my colleague.
We are not fully happy with NWChem yet. For the same BOINC project, Benoit Da Mota and I are also using Gaussian (proprietary), which is much more efficient. But NWChem is open source...
We have calculated roughly 130k out of 200k so far, thanks to your help!
In December, we hope to ask the community to calculate new molecules, which may not even exist or be stable, in order to help machine learning tools generalize better. These new molecules will be generated by a machine learning procedure, which is too long to explain here right now.

If you have any questions...
Kind regards,
Thomas
ProDigit

Joined: 16 Nov 19
Posts: 2
Credit: 6,598
RAC: 182
Message 289 - Posted: 16 Nov 2019, 9:18:36 UTC - in response to Message 285.  

Hi Thomas,
What are the hardware requirements to accomplish a task?
I had to abort a task: its deadline was 14 days, but my PC estimated that it would take 26 days to finish.
Can I run this better on a multicore CPU, or is a single-core CPU of at least 5 GHz needed to beat the deadline?
Nick Name

Joined: 19 Oct 19
Posts: 2
Credit: 9,079
RAC: 26
Message 293 - Posted: 17 Nov 2019, 4:38:35 UTC - in response to Message 285.  

Is the paper based on results from this project, or is this project based on what's in the paper? Put another way, which came first, the paper or this project?
Team USA forum | Team USA page
Dataman
Joined: 7 Oct 19
Posts: 9
Credit: 336,907
RAC: 1,809
Message 294 - Posted: 17 Nov 2019, 15:50:39 UTC - in response to Message 285.  

Excellent paper! I actually understood most of it, which is more than I can say about most research papers I read. ;)

Please keep us informed as you progress forward.

Cheers

tcauchy

Joined: 4 Aug 19
Posts: 3
Credit: 0
RAC: 0
Message 295 - Posted: 18 Nov 2019, 20:47:26 UTC - in response to Message 293.  

This BOINC project started during the research presented in this article. As written in the Science forum, we are currently aiming for a better data set for machine learning in chemistry: a data set of quantum mechanical calculations on small molecules. I will try to write a short presentation soon to explain what you are calculating and why it sometimes takes awfully long.
mmonnin

Joined: 8 Oct 19
Posts: 6
Credit: 320,446
RAC: 17,385
Message 297 - Posted: 19 Nov 2019, 11:47:34 UTC - in response to Message 289.  

ETAs will correct themselves once you have completed some tasks. Tasks typically finish before the ETA.
Nick Name

Joined: 19 Oct 19
Posts: 2
Credit: 9,079
RAC: 26
Message 299 - Posted: 20 Nov 2019, 9:54:32 UTC - in response to Message 295.  

Thanks, I was just trying to understand what role BOINC had in the article.
ProDigit

Joined: 16 Nov 19
Posts: 2
Credit: 6,598
RAC: 182
Message 306 - Posted: 30 Nov 2019, 10:34:06 UTC - in response to Message 297.  

Thanks!
The 26-day task finished in ~10-32 hours on a 2 GHz CPU.
You will be hearing more from me, but for the moment my Xeon is offline.

©2019 Benoit DA MOTA - LERIA, University of Angers, France