DATAI Seminars. Course 2022-2023
The problem addressed is the fuel consumption of jet-engine aircraft during the take-off, climb and cruise flight phases. Globalization has driven a continuous increase in air traffic demand, which in turn has increased fuel consumption and the associated pollutant gases emitted to the atmosphere.
Previous studies have presented tools and frameworks that help quantify an aircraft's fuel consumption and, hence, its pollutant gas emissions, showing the magnitude of the problem and the urgent need to address it. In general, such tools and frameworks rely on aircraft performance models that solve the equations of motion for each flight phase using energy-balance, numerical or statistical methods. Although very accurate, these methods do not provide closed-form expressions that relate the aircraft's fuel consumption to its aerodynamic, engine and design parameters.
Our contribution is a mathematical model that provides closed-form formulae for quantifying the fuel consumption (and, hence, the pollutant gases emitted) and several of the aircraft's state variables (fuel flow rate, velocity, thrust, lift, drag, weight, rate of climb, etc.) during the take-off, climb and cruise flight phases. The advantage is that such closed-form formulae enable further optimization and sensitivity analyses.
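To illustrate what a closed-form expression of this kind looks like, the sketch below uses the classical Breguet range equation for jet cruise. It is not the model proposed in the talk. It assumes constant speed, constant lift-to-drag ratio and constant thrust-specific fuel consumption, and all function and parameter names are hypothetical.

```python
import math


def cruise_range(initial_weight_n, final_weight_n, speed_m_s, tsfc_per_s, lift_to_drag):
    """Breguet range (m) for a jet aircraft in cruise.

    Assumes constant speed, lift-to-drag ratio and thrust-specific fuel
    consumption (tsfc_per_s: N of fuel per N of thrust per second).
    """
    return (speed_m_s / tsfc_per_s) * lift_to_drag * math.log(
        initial_weight_n / final_weight_n
    )


def cruise_fuel_burn(initial_weight_n, range_m, speed_m_s, tsfc_per_s, lift_to_drag):
    """Fuel weight (N) burned over range_m, obtained by inverting the Breguet equation."""
    exponent = range_m * tsfc_per_s / (speed_m_s * lift_to_drag)
    return initial_weight_n * (1.0 - math.exp(-exponent))


# Hypothetical, illustrative inputs (not taken from the talk).
fuel_n = cruise_fuel_burn(
    initial_weight_n=700_000.0,
    range_m=3_000_000.0,
    speed_m_s=230.0,
    tsfc_per_s=1.7e-4,
    lift_to_drag=17.0,
)
```

Because the fuel burn is an explicit function of speed, lift-to-drag ratio and specific fuel consumption, it can be differentiated or swept over a parameter directly, which is the kind of sensitivity analysis a closed form makes cheap.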
Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor's mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss's improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.
Temporal sequences of satellite images are a valuable and abundant resource for analyzing a region of interest. Meanwhile, deep learning techniques are currently the state of the art in automatic image classification. This is why the application of this class of models to satellite imagery is attracting more and more attention from the academic community and industry. However, labeled data, which are generally necessary for training deep learning models, are scarce and expensive to obtain for satellite imagery. In this context, we investigate a fully unsupervised procedure in which, given a sequence of images, a semantic embedding is learned and a partition of the terrain is created according to its semantic properties and its evolution over time. This approach offers a novel global perspective of the terrain, where large areas sharing similar semantics and temporal evolution are connected to form clearly defined patterns. The results also show the close relationship between the distribution of clusters in geographic space and their distribution in the embedded space. The semantic analysis is completed by obtaining the time series representing each cluster, the series representing the boundaries and a graph explaining the connections between the different clusters. The methodology is illustrated by performing a semantic analysis on a sequence of satellite images of the Navarra region (Spain).
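The idea of partitioning the terrain by temporal behaviour and summarizing each cluster with a representative time series can be sketched as follows. This is only a stand-in: it runs plain k-means on raw per-pixel time series rather than on a learned semantic embedding, and the toy data, the fixed initial centroids and all names are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(series_list):
    """Point-wise mean of several equal-length series."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]


def kmeans(series, initial_centroids, iterations=10):
    """Plain k-means; returns one label per series and the final centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: each series goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(s, centroids[k]))
            for s in series
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [s for s, lab in zip(series, labels) if lab == k]
            if members:
                centroids[k] = mean_series(members)
    return labels, centroids


# Hypothetical toy data: one value per pixel per acquisition date.
pixel_series = [
    [0.8, 0.7, 0.2, 0.1],  # value drops over time
    [0.9, 0.8, 0.3, 0.2],
    [0.1, 0.1, 0.1, 0.1],  # value stays low
    [0.2, 0.1, 0.2, 0.1],
]
labels, centroids = kmeans(
    pixel_series, initial_centroids=[pixel_series[0], pixel_series[2]]
)
```

Here each centroid plays the role of the time series representing a cluster, and mapping `labels` back to pixel coordinates would give the partition of the terrain.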
Supervised classification is a fundamental part of machine learning whose applications to real problems are of great interest. Both in the literature and in machine learning software libraries, we can find multiple proposals for supervised classification paradigms (decision trees, neural networks, Bayesian networks, etc.) as well as various methods for fitting these models. Having a methodology to fairly evaluate and compare the results of these models is essential to draw the right conclusions. However, the correct validation of the results obtained is often neglected.
This talk presents a review of the methodology for the honest assessment of classifiers, providing useful information for choosing the best alternative in validation processes when solving supervised classification problems. Since the fundamental aspects of honest model validation are spread over a long list of literature references, this talk aims to condense them and provide enough information to guide the use of the different alternatives.
The content is structured in three main blocks. After an introduction to the problem of supervised classification and the importance of honest model validation, the first block is devoted to scores as measures of classifier quality, their main characteristics and their use. The second block presents the problem of estimating the value of scores using finite sets of data, the most commonly used estimation methods and their properties in terms of bias and variance, as well as possible variations and improvements. Finally, the last block briefly presents hypothesis tests as a tool to compare classifiers in different situations, presenting the possible alternatives depending on the conditions of the problem to be solved.
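As an illustration of estimating a score from a finite data set, the sketch below computes a k-fold cross-validated accuracy. It is a minimal example and not necessarily one of the estimators covered in the talk. The nearest-mean classifier, the toy data and all names are assumptions, and in practice the data would be shuffled (or stratified) before being split into folds.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_nearest_mean(xs, ys):
    """Toy classifier: store the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest_mean(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def cross_validated_accuracy(xs, ys, k):
    """Mean accuracy over k folds; every case is tested exactly once
    by a model that never saw it during training."""
    fold_scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_nearest_mean(train_x, train_y)
        correct = sum(predict_nearest_mean(model, xs[i]) == ys[i] for i in test_idx)
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / k, fold_scores


# Hypothetical one-feature data set with two classes.
xs = [1.0, 9.0, 2.0, 8.0, 1.5, 8.5, 2.5, 7.5]
ys = ["a", "b", "a", "b", "a", "b", "a", "b"]
mean_accuracy, fold_scores = cross_validated_accuracy(xs, ys, k=4)
```

The spread of `fold_scores` gives a rough view of the variance of the estimate, while the choice of `k` trades bias against variance and computational cost.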
The most popular ways of learning models from data have been "supervised classification" and "unsupervised classification, or clustering". While the former requires the entire sample to be annotated in order to learn a predictive model, the latter works on an unlabeled sample with the goal of discovering the algebraic structures of the data.
Outside this "comfort zone", so-called "weakly-supervised classification" has emerged strongly during the last decade: not all of the sample is labeled, there may be extra information about the annotation at prediction time, and the "case-label" relations may go beyond the classical "one sample, one label".
The seminar will review the main "weakly-supervised" scenarios that have emerged in the scientific literature, focusing on the distinctive characteristics of sample labeling in each of them. A "taxonomy" will be provided to differentiate and characterize them. Each scenario will be illustrated with applications and reference works.
Combining machine learning and computational chemistry for predictive insights into chemical systems
08/09/2022. Valentín Vassilev Galindo. University of Luxembourg