
【Key Words】Machine learning, Artificial Intelligence, Bioinformatics
Our lab aims to develop novel algorithms for discovering new knowledge from large and heterogeneous data. As a center of data-centric science in Japan, we collaborate with top researchers in a range of disciplines, including life sciences, chemistry, pharmacology, materials science and environmental science. Students are expected to develop data analysis skills that are indispensable in current scientific practice.
More than three transcription factors often work together to enable cells to respond to various signals. Detecting combinatorial regulation by multiple transcription factors, however, is not only computationally nontrivial but also statistically very hard: the multiple testing correction becomes so severe that few combinations can reach significance. Because the number of tests grows exponentially with the number of factors considered, conventional analyses are forced to set a strict limit on the maximum arity, that is, the number of factors in a combination. To overcome this, we developed a novel statistical test called LAMP (limitless-arity multiple testing procedure) [1]. LAMP counts the exact number of testable combinations and calibrates the Bonferroni factor to the smallest possible value. It lists significant combinations without any limit on arity, while keeping the family-wise error rate under the threshold. In the human breast cancer transcriptome, LAMP discovered statistically significant combinations of as many as eight binding motifs.
Fig. 1 Combinatorial effect discovery by LAMP
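The calibration idea can be illustrated with a small brute-force sketch. It is not the published LAMP implementation: it enumerates every motif combination directly, which is feasible only for a handful of motifs, and it assumes a one-sided Fisher exact test. The function names (`fisher_p`, `lamp_sketch`) and the toy data are hypothetical. A combination counts as testable at a given threshold if the smallest p-value it could possibly attain, given how many genes carry it, is below that threshold; the correction factor k is the smallest integer for which at most k combinations are testable at alpha / k.

```python
from itertools import combinations
from math import comb


def fisher_p(support, hits, n_pos, n_total):
    """One-sided Fisher exact p-value P(X >= hits) for a combination found in
    `support` genes, `hits` of which belong to the positive class."""
    upper = min(support, n_pos)
    tail = sum(comb(n_pos, i) * comb(n_total - n_pos, support - i)
               for i in range(hits, upper + 1))
    return tail / comb(n_total, support)


def lamp_sketch(motif_genes, positives, n_total, alpha=0.05):
    """Brute-force illustration of calibrating the correction factor.
    Returns (k, [(combination, p_value), ...])."""
    n_pos = len(positives)
    motifs = sorted(motif_genes)
    tests = []  # (combination, support, hits)
    for arity in range(1, len(motifs) + 1):
        for combo in combinations(motifs, arity):
            genes = set.intersection(*(motif_genes[m] for m in combo))
            tests.append((combo, len(genes), len(genes & positives)))

    def min_p(support):
        # Best case: as many carriers as possible are positive genes.
        return fisher_p(support, min(support, n_pos), n_pos, n_total)

    # Smallest k such that at most k combinations are testable at alpha / k.
    k = 1
    while sum(min_p(s) <= alpha / k for _, s, _ in tests) > k:
        k += 1

    significant = []
    for combo, support, hits in tests:
        p = fisher_p(support, hits, n_pos, n_total)
        if p <= alpha / k:
            significant.append((combo, p))
    return k, significant


# Toy data (hypothetical): 20 genes, genes 0-4 are up-regulated.
positives = set(range(5))
motif_genes = {
    "A": set(range(13)),                      # too frequent to be testable
    "B": set(range(5)) | set(range(13, 20)),  # too frequent to be testable
    "C": {15, 16},                            # too rare to be testable
}
k, significant = lamp_sketch(motif_genes, positives, n_total=20)
print(k, [combo for combo, _ in significant])  # → 1 [('A', 'B')]
```

In this toy example a plain Bonferroni correction would divide alpha by 7, the number of possible combinations, whereas only the pair (A, B) is testable, so the calibrated factor is 1. Neither motif is significant on its own, but the pair is.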
The design of new molecules and materials is of scientific and industrial importance. We apply machine learning and artificial intelligence methods to accelerate this design process. To this end, we are developing new methods based on Bayesian optimization and Monte Carlo tree search. Recently, our lab developed COMBO [3], a Python package that automatically selects promising candidates for simulations and experiments.
Fig. 2 Automatic design of molecules and materials
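As a rough illustration of how Bayesian optimization selects the next candidate to evaluate, the following sketch uses a small Gaussian process with an upper-confidence-bound rule over a finite candidate list. It is a generic textbook-style example written in plain Python and does not show COMBO's API or its internal algorithm. All names (`bayes_opt`, `fit_gp`, `predict`, `toy_property`) and the settings (kernel length scale, noise level, `kappa`) are assumptions made for illustration.

```python
import math


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two descriptor tuples."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * length ** 2))


def cholesky(A):
    """Lower-triangular L with L L^T = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub_t(L, y):
    """Solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(X, y, noise=1e-4):
    """Factorize the kernel matrix and precompute K^{-1} y."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    return L, backward_sub_t(L, forward_sub(L, y))


def predict(X, L, weights, x_new):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    k_star = [rbf(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, weights))
    v = forward_sub(L, k_star)
    return mean, max(1.0 - sum(t * t for t in v), 0.0)


def bayes_opt(candidates, objective, n_init=2, n_iter=6, kappa=2.0):
    """Evaluate n_init evenly spaced candidates, then repeatedly evaluate the
    unevaluated candidate with the highest mean + kappa * std."""
    step = max(1, len(candidates) // n_init)
    observed = {c: objective(c) for c in candidates[::step][:n_init]}
    for _ in range(n_iter):
        pool = [c for c in candidates if c not in observed]
        if not pool:
            break
        X = list(observed)
        L, weights = fit_gp(X, [observed[c] for c in X])

        def ucb(c):
            mean, var = predict(X, L, weights, c)
            return mean + kappa * math.sqrt(var)

        chosen = max(pool, key=ucb)
        observed[chosen] = objective(chosen)  # the "experiment" or simulation
    return max(observed, key=observed.get), observed


def toy_property(c):
    """Hypothetical stand-in for an expensive simulation or experiment."""
    return -(c[0] - 2.0) ** 2


candidates = [(x / 10,) for x in range(31)]  # one-dimensional descriptors
best, observed = bayes_opt(candidates, toy_property)
print(best, len(observed))
```

Each loop iteration spends one evaluation on the candidate that balances a high predicted value against high uncertainty, which is why such methods can reduce the number of simulations or experiments needed compared with exhaustive screening.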
We are also committed to developing fundamental theories and algorithms for machine learning and data mining. This work requires expertise in statistical theory, discrete algorithms and optimization. For example, we have developed the gBoost algorithm, which accurately predicts properties of graph-structured data such as chemical compounds [2]. We also focus on fast methods for discovering similar pairs in large datasets. These methods are expected to contribute to multi-omics data analysis, including genomic, epigenomic and metabolite data.
Fig. 3 Subgraph features discovered from chemical data by gBoost