Posts by tcauchy

1) Message boards : Science : Series of videos in Youtube dedicated to the QuChemPedia project (Message 1415)
Posted 7 May 2021 by tcauchy
Hello Crunchers!

I have started a series of videos presenting the QuChemPedia project. The first episode, devoted to the origins of the project, is available here. In English:

In future episodes, I will cover the organisation of the project, our goals, our results (including the BOINC ones),
and more generally discuss how science is (and should be) made! ;D

I hope you will find them interesting.
Share them if you like them!



French version: I have started a series of videos presenting the QuChemPedia project. The first episode, devoted to the origins of the project, is available here. In French:

In upcoming episodes, I will cover the organisation of the project, our goals, our results (including the BOINC ones), and more generally how research works and how it should work. ;D

I hope you will find them interesting.
Feel free to share them if you like them.

2) Message boards : Number crunching : Do you need more computational power? (Message 1228)
Posted 8 Dec 2020 by tcauchy

Thanks for your kind words. We are currently finishing an article describing the diversity generated with BOINC in the first phase of this project.
I am also swamped with a lot of (remote) teaching. I have also been thinking about some YouTube videos that could present our project in a pedagogical way (always the lecturer ;D).
That is a project for January...
I will ask the crunchers for advice then.

3) Message boards : Number crunching : Pending Validation (Message 502)
Posted 6 Feb 2020 by tcauchy
The validation process is based on a comparison of results (energy, conformation and so on).
The difficulty is that, since we generate new and unknown molecules, the optimized geometry can vary depending on the machine that ran the calculation.
We need to find the right parameters to tell apart "similar" results, which correspond to the same chemical object without being identical, from false results (cheated ones that can mimic a real result)...
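
As an illustration only, a comparison of this kind could be sketched as follows. The `Result` class, the `same_object` function, both tolerance values and the sample numbers are hypothetical, not the project's actual validator. The sketch also assumes both results list the atoms in the same order and in the same orientation, which a real validator could not take for granted.

```python
import math
from dataclasses import dataclass


@dataclass
class Result:
    energy: float   # total energy (hartree, assumed unit)
    coords: list    # one (x, y, z) tuple per atom (angstrom, assumed unit)


def rmsd(a, b):
    """Root-mean-square deviation between two coordinate lists (no alignment)."""
    if len(a) != len(b):
        raise ValueError("results have a different number of atoms")
    squared = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(squared / len(a))


def same_object(r1, r2, energy_tol=1e-4, rmsd_tol=0.05):
    """True if two results agree within the (hypothetical) tolerances."""
    close_energy = abs(r1.energy - r2.energy) <= energy_tol
    close_geometry = rmsd(r1.coords, r2.coords) <= rmsd_tol
    return close_energy and close_geometry


# Invented values: two hosts that differ by numerical noise, and one outlier.
host_a = Result(-76.40000, [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)])
host_b = Result(-76.40002, [(0.001, 0.0, 0.0), (0.961, 0.0, 0.0), (-0.239, 0.93, 0.0)])
suspect = Result(-76.30000, host_a.coords)

print(same_object(host_a, host_b))
print(same_object(host_a, suspect))
```

Tolerances that are too tight reject honest results from different machines, while tolerances that are too loose let a fabricated result through; that trade-off is the parameter search described above.
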
4) Message boards : News : Updates and poll (Message 462)
Posted 19 Jan 2020 by tcauchy
Thanks for all your support!

We have faced several technical issues this week.
1. Checkpointing does not work because the stop signal is not received by NWChem. Right now the BOINC task stops but the calculation keeps running :(
That is why the longer calculations are not in the scheduler.
2. The VM is disappointing, although some of you spent a lot of time making it work! We clearly recommend Linux, or a Linux VM on Windows.
3. Our infrastructure was optimized for storage, not for a heavy load. At the moment we are not able to parse the recent results while BOINC is running and writing to the drives. We hope to get a new grant for September. Until then, we will try to make the best of our situation.
4. We have repeated our small-molecule generation for diversity [current batch] and found a possible flaw. This week we will compare the results and maybe launch new calculations before the big ones.

I will give you some news soon.
We launched this BOINC project because we believe in community work, data sharing and open science. Thanks to your support, this adventure is already a huge success, for machine learning in chemistry and for much more: with a small programming and administration workforce, we could produce a very large, collaborative, open and curated molecular database :)

5) Message boards : Number crunching : Very little CPU usage (Message 412)
Posted 10 Jan 2020 by tcauchy
Dear Aurum,

We are indeed understaffed. I am the theoretical chemist and Benoit is the computer scientist. We are both lecturers with a heavy teaching load.
We also depend on internships.

We do not want to spread ourselves too thin. Clearly the Linux app works great; the Windows VM is not perfect, but some users manage to make it work.
Around 30k-50k running tasks is perfect. And no, our infrastructure cannot handle 500k WU.

However, what Benoit meant is that right now we are calculating small molecules with at most 9 atoms of C, N, O and F. That is a strong limitation in terms of chemical diversity.
Our scientific goal is to be at least representative of organic chemistry. That means including more elements such as B, S, Cl... and slightly increasing the molecule size to see the impact on machine learning predictions. Calculation times could therefore get longer, and a habit of automatically aborting long tasks could be problematic.
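
To make the current limits concrete, here is a small sketch that checks whether a molecular formula fits them. The function name and the formula-string input are assumptions made for the example (the project does not necessarily filter molecules this way); hydrogen is allowed but not counted towards the 9 atoms.

```python
import re

ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F"}
MAX_HEAVY_ATOMS = 9  # atoms other than hydrogen


def in_current_scope(formula):
    """Return True if a formula such as 'C7H8O2' fits the current limits."""
    heavy = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in ALLOWED_ELEMENTS:
            return False
        n = int(count) if count else 1
        if element != "H":
            heavy += n
    return heavy <= MAX_HEAVY_ATOMS


print(in_current_scope("C7H8O2"))  # 9 heavy atoms, allowed elements only
print(in_current_scope("C6H5Cl"))  # contains Cl, not in the current set
print(in_current_scope("C10H8"))   # 10 heavy atoms
```

Adding B, S or Cl would amount to extending `ALLOWED_ELEMENTS`, and larger molecules to raising `MAX_HEAVY_ATOMS`.
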

6) Message boards : Science : What are you calculating? Some explanation (Message 375)
Posted 24 Dec 2019 by tcauchy
Hi mmonnin,

For OD9, we are using a lower level of theory than for the first batch [B3LYP with 3-21G instead of B3LYP with 6-31G(2df,p)].
Therefore, to be able to compare the results, we need to recalculate the previous molecules (dsgdb9...). We know that some will crash, but with different computational parameters the crashes could affect different molecules. This has never been documented at such a large scale!
That is why you are seeing "old" calculations reappearing.

As for the numbers, we have generated a dataset of 211k totally new molecules (with another 200k just in case). A first rough estimate shows that about 30% of the calculations on those seem to fail, and another 30% change the molecule radically (probably the longest calculations). That means that about 40% of the newly generated molecules will be kept.

After the holidays, we will try our first machine learning predictions on those and maybe generate new ones to reach 211k.
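
For those who like to see the arithmetic, here is a back-of-the-envelope sketch. It assumes that the rough 30% / 30% / 40% split also holds for later batches, and that "reaching 211k" means 211k kept molecules; both are assumptions, not confirmed figures.

```python
generated = 211_000   # newly generated molecules
spare = 200_000       # the extra batch kept "just in case"
failed_pct = 30       # calculations that seem to fail
changed_pct = 30      # calculations that change the molecule radically
kept_pct = 100 - failed_pct - changed_pct

kept = generated * kept_pct // 100
shortfall = generated - kept
kept_from_spare = spare * kept_pct // 100
# Candidates to generate so that the kept fraction covers the shortfall.
candidates_needed = shortfall * 100 // kept_pct

print(f"kept: {kept}, shortfall: {shortfall}")
print(f"kept from spare batch: {kept_from_spare}")
print(f"candidates needed to close the gap: {candidates_needed}")
```

Under these assumptions the spare batch alone would not be enough to reach 211k kept molecules, which is why generating further molecules may be necessary.
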

Merry Christmas to all of you.
7) Message boards : Cafe : QChem@home (Message 374)
Posted 24 Dec 2019 by tcauchy
Dear Aurum,
Thanks for your suggestion.
However, in computational chemistry the name QChem is already taken by a software package (one that we do not use).
To avoid confusion we added the "u". And since the encyclopedia objective, the open database, is really important to us, we really want to keep the "pedia".
8) Message boards : Number crunching : Native Multi-Threading (Message 350)
Posted 16 Dec 2019 by tcauchy
Dear Tomas,

NWChem, which is used here to calculate the molecules, benefits from multithreading only for bigger molecules.
We tested 1, 2 and 4 threads and there was no gain at all!

Thomas, the chemist of the project
9) Message boards : Science : What are you calculating? Some explanation (Message 313)
Posted 2 Dec 2019 by tcauchy
Hi all! Thanks again for contributing to this scientific project.
I will try to briefly present the calculations that you are running on your machines.
Since we have finished the first re-calculation of the original data set, we are entering a new phase.

Each calculation deals with ONE molecule, using quantum mechanics. A molecule is a set of atoms bonded together. For now, we limit ourselves to small molecules containing only H, C, N, O and F. In quantum mechanics, we use an approximation of the Schrödinger equation. Therefore, calculations can only be compared if they rely on the same approximations (usually referred to as the level of theory).

The first calculation is the optimization of the atomic positions: a geometry optimization that finds the most stable 3D arrangement of the atoms. If the starting atomic positions are far from the stable ones, this step can take a very long time. Hence the sometimes unpredictable, very long calculation times! (See but it is quite mathematical.)
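
A toy sketch can show why the starting point matters so much. It optimizes a single bond length on a Morse potential with made-up parameters, using plain steepest descent; quantum chemistry programs use far more elaborate optimizers and a real energy function, so this only illustrates the principle.

```python
import math


def morse_gradient(r, depth=1.0, width=1.0, r_eq=1.0):
    """dE/dr for the Morse potential E = depth * (1 - exp(-width*(r - r_eq)))**2."""
    decay = math.exp(-width * (r - r_eq))
    return 2.0 * depth * width * (1.0 - decay) * decay


def optimize(r_start, step=0.1, tol=1e-6, max_steps=100_000):
    """Steepest descent on the bond length; returns (final r, steps used)."""
    r = r_start
    for n in range(max_steps):
        gradient = morse_gradient(r)
        if abs(gradient) < tol:
            return r, n
        r -= step * gradient
    raise RuntimeError("optimization did not converge")


r_near, steps_near = optimize(1.2)  # start close to the stable bond length
r_far, steps_far = optimize(6.0)    # start far away, where the curve is flat

print(f"near start: r = {r_near:.6f} after {steps_near} steps")
print(f"far start:  r = {r_far:.6f} after {steps_far} steps")
```

Both runs end at the same bond length, but the distant start needs many more steps because the energy surface is almost flat there and each step moves the atoms only slightly.
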

Then the second step is to calculate the derivatives of the energy with respect to the positions of the atoms. This tells us what the forces between the atoms are. This calculation also gives us the infrared absorption frequencies.
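
For the curious, the link between derivatives and frequencies can be sketched on a single bond. A vibrational frequency depends on the curvature (second derivative) of the energy at the stable geometry; for one bond the harmonic angular frequency is sqrt(k / mu), with k the curvature and mu the reduced mass. The Morse parameters and the unit reduced mass below are made up, and the finite-difference formula is a textbook approximation, not what NWChem does internally.

```python
import math


def morse_energy(r, depth=1.0, width=1.0, r_eq=1.0):
    """Morse potential in model units (made-up parameters)."""
    return depth * (1.0 - math.exp(-width * (r - r_eq))) ** 2


def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2


# Curvature at the stable bond length; analytically 2 * depth * width**2.
force_constant = second_derivative(morse_energy, 1.0)

reduced_mass = 1.0  # model units, assumption for the example
angular_frequency = math.sqrt(force_constant / reduced_mass)

print(f"force constant:    {force_constant:.6f}")
print(f"angular frequency: {angular_frequency:.6f}")
```

A molecule with many atoms has one such curvature for every pair of coordinates, which is part of why this step is much more expensive than a single energy evaluation.
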

Finally, the last step is the calculation of the electronic excitations, to simulate the UV-visible spectra. It gives valuable information for photovoltaic applications, for example.

In the private BOINC project we have a proprietary program that is more than 10 times faster. But for the public part we use NWChem, an open solution. In NWChem, with the current level of theory, the optimization step takes around 5h on average, and the frequency and electronic excitation steps around 25h each. Since we want to generate a lot of new, unknown molecules, calculations could take much more time than before! So we are searching for a lower level of theory that could help us discriminate between good and bad candidates. In the private part, we will only calculate the good ones. We will need you as a super filter. We could generate several thousand new molecules per day, probably with a lot of errors!
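
To give an idea of the scale, here is a rough estimate based on the average times above. The figure of 1000 new molecules per day is an illustrative assumption (the lower end of "several thousand"), and the sketch assumes one calculation per CPU core with no failures or repeats.

```python
import math

optimization_hours = 5   # average, current level of theory
frequency_hours = 25
excitation_hours = 25
hours_per_molecule = optimization_hours + frequency_hours + excitation_hours

molecules_per_day = 1000  # assumption for the example
cpu_hours_per_day = hours_per_molecule * molecules_per_day

# Cores that must run around the clock just to keep up with the generator.
cores_needed = math.ceil(cpu_hours_per_day / 24)

print(f"{hours_per_molecule} h per molecule")
print(f"{cpu_hours_per_day} CPU hours per day of generated molecules")
print(f"about {cores_needed} cores running continuously")
```

This is why a cheaper level of theory is attractive as a first filter: every bad candidate discarded early saves most of those hours.
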

By the way, we would like to give you the opportunity to see drawings of the molecules that you have calculated. But we probably won't have time until next year. ;D
If some of you are proficient in Python, it could be useful later for some small tasks... ^^


French version (translated): Hi everyone! Thanks again for contributing to this scientific project.
Since we are entering the second phase of this project, I will try to briefly explain what you are calculating.

This quantum chemistry project looks at ONE single molecule at a time, that is, a set of atoms chemically bonded to one another. For a start, we limit ourselves to small molecules containing H, C, N, O and F. In quantum mechanics, we use approximations of the Schrödinger equation. It is very important to compare only calculations done at the same level of approximation (often called the level of theory).

The first step is the geometry optimization of the atomic positions. From a given starting point, we search for the positions that minimize the total energy. So if our starting positions are far from the equilibrium state, this step can take a long time, which is very hard to predict.

Next, we differentiate the energy with respect to the atomic positions to obtain the forces between the atoms. This gives us access to the absorbed infrared frequencies.

Finally, the last step is the calculation of the electronic excited states. This step is very interesting because it tells us about the UV-visible absorption of the molecule, which is essential for applications such as organic photovoltaics.

In the private BOINC project, we use a fairly efficient proprietary program. But for the public part we chose an open code, NWChem. We now want to generate new molecules, and the calculations may take even longer. Already, the optimization step takes 5h on average, and the two other steps are more on the order of 25h each. So we are currently looking for a compromise by lowering the level of theory, in order to quickly discard the molecules that are not realistic at all. Since we can easily produce 1000 molecules a day, with many errors, expect run times to go up and down like a yo-yo :D

We would like to let you see drawings of the molecules you calculate, but I do not know whether we will have time before the end of the year.
If some of you are proficient in Python, send us a message.
10) Message boards : News : Scientific publication (Message 295)
Posted 18 Nov 2019 by tcauchy
This BOINC project grew out of the research presented in this article. As written in the Science forum, we are now aiming for a better data set for machine learning in chemistry:
a data set of quantum mechanical calculations on small molecules. I will try to write (soon) a short presentation explaining what you are calculating and why it sometimes takes awfully long.
11) Message boards : Science : Preliminary results? (Message 252)
Posted 4 Nov 2019 by tcauchy

I am the chemist of this project. The publication mentioned by damotbe was written when we launched the BOINC project. But I can quote a few sentences from this article to show what we have in mind:

"Abstract: The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 "heavy" atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset."

The QM9 dataset has around 130k small molecules, while our PC9 has 119k (but was extracted from another type of calculation). The problem is that the full results of QM9 are not openly available: the authors extracted some results from the costly quantum mechanics calculations and discarded the logs. We are not satisfied with PC9 either, which was a simple demonstration that more diversity is needed.

For the moment, the BOINC project aims at recalculating the interesting molecules of QM9 and PC9, this time at the same level of calculation. All the results will be available in the QuChemPedia document base once this platform is a little more robust (beginning of 2020), along with our quality control tool, as written by my colleague.
We are not fully happy with NWChem yet. Within the same BOINC project, damotbe and I are using Gaussian (proprietary), which is much more efficient. But NWChem is open source...
We have calculated roughly 130k out of 200k thanks to your help!
In December we hope to invite the community to calculate new molecules, which may not even exist or be stable, in order to help machine learning tools generalize better. Those new molecules will be generated by a machine learning procedure, which is too long to explain here right now.

If you have any questions...

©2023 Benoit DA MOTA - LERIA, University of Angers, France