DATAI Seminars. Course 2022-2023
AI and sustainability
05/31/23 / Amparo Alonso Betanzos
Full Professor of Computer Science and Artificial Intelligence.
Assistant to the President for Artificial Intelligence.
School Department of Computer Science. Universidade da Coruña. 15071 A Coruña (Spain)
Department of Psychology. NTNU, Norway
The intersection of AI and sustainability is a rapidly expanding area of research and innovation. As we continue to face critical environmental challenges, there is a growing need for advanced technical solutions that can help us build a more sustainable future. In this talk, we will offer a double perspective: AI for sustainability and sustainable AI. Regarding the first, we will present an application that uses agent-based modeling to study how acceptable certain local innovations (in this case, the use of superblocks) are to citizens. Regarding the second, we will show several ideas that can help us reduce the environmental impact of AI algorithms.
Despite being at the core of a wide range of technologies, deep learning models are still far from being fully reliable and trustworthy. One of their main limitations is their intriguing vulnerability to adversarial examples: inputs imperceptibly yet maliciously manipulated in order to change the predictions of the model. The aim of this talk is to introduce novel adversarial attack paradigms capable not only of producing misclassifications, but also of achieving more ambitious objectives, thereby broadening the horizon of adversarial examples. In particular, two recent contributions will be presented. First, a novel "multiple-instance" attack paradigm (i.e., one that can only be carried out by considering multiple inputs) will be introduced, which is capable not only of fooling the model for any incoming input, but also of controlling the frequency with which each class is predicted by the model after multiple attacks. This paradigm exposes new threats in a wide range of scenarios and use cases, such as producing adversarial label drifts or carrying out several attacks in a less detectable way. Second, the possibilities and limits of adversarial attacks on explainable machine learning models will be discussed, exploring whether (and how) adversarial examples can be generated in scenarios in which the inputs, the outputs and the explanations are assessed by humans. The talk will conclude by illustrating several potential attacks in such scenarios.
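As a point of reference for the definition of an adversarial example given above, the following is a minimal sketch of the classical fast gradient sign method applied to a toy logistic model. It is not one of the novel paradigms presented in the talk; the weights, input and perturbation budget `eps` are hypothetical values chosen only for illustration.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model with weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, eps):
    """Shift each feature of x by +/- eps in the direction that increases
    the cross-entropy loss for the true label y (0 or 1)."""
    p = predict_proba(w, b, x)
    # For a logistic model, the gradient of the loss w.r.t. the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions, not taken from the talk).
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.25)
print("clean input:      p(class 1) =", predict_proba(w, b, x))
print("adversarial input: p(class 1) =", predict_proba(w, b, x_adv))
```

In this toy setting the perturbation is bounded by `eps` in every feature, yet it is enough to move the input across the decision boundary.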
Wood inspection technology is essential in many areas of contemporary industry. Among other factors, the number of rings in a stave is directly related to wood quality. Sawn wood shows many natural variations in appearance that a human inspector can easily compensate for when determining the type of each stave or board. However, for automatic wood inspection systems, these variations are a major source of complication. Several approaches to the automatic detection of tree-ring boundaries exist; however, they use basic image processing techniques. As a result, their accuracy is limited, and their application is restricted mainly to wood where the tree-ring boundaries are clearly defined. There are also some works based on deep learning segmentation techniques but, again, the wood processed has easily detectable ring boundaries. The aim of this paper is to deal with the problem of wood classification when there is noticeable heterogeneity in the texture of the samples. To the best of the authors' knowledge, this is the first approach to grain classification in such heterogeneous images. The solution uses a hybrid approach combining classic computer vision for preprocessing with deep learning-based algorithms in order to classify wood into three quality categories. Image cropping was used to augment the original dataset, to avoid the intra-class problem that appears in single staves, and to improve the performance and accuracy of the final voting system.
The problem addressed is the fuel consumption of jet-engine aircraft during the takeoff, climb and cruise flight phases. Globalization has brought a continuous increase in air traffic demand which, in turn, has increased fuel consumption and the associated pollutant gases emitted into the atmosphere.
Previous studies have presented tools and frameworks that help quantify an aircraft's fuel consumption and, hence, its pollutant gas emissions, showing the magnitude of the problem and the urgent need to address it. In general, such tools and frameworks rely on aircraft performance models that solve the equations of motion for each flight phase by employing energy balance, numerical or statistical methods. Although very accurate, these methods do not provide closed-form expressions that relate the aircraft's fuel consumption to its aerodynamic, engine and design parameters.
Our contribution in this work is a mathematical model that provides closed-form formulae for quantifying the fuel consumption (and, hence, the pollutant gases emitted) and several of the aircraft's state variables (fuel flow rate, velocity, thrust, lift, drag, weight, rate of climb, etc.) during the takeoff, climb and cruise flight phases, with the advantage that such closed-form formulae enable further optimization and sensitivity analyses.
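To illustrate what a closed-form fuel expression looks like, the sketch below uses the classical Breguet range equation for steady jet cruise. This is a textbook result and not the model proposed in the talk; it assumes constant speed, constant lift-to-drag ratio and constant thrust-specific fuel consumption, and the numerical values are hypothetical.

```python
import math

def cruise_fuel_breguet(initial_weight, distance, speed, tsfc, lift_to_drag):
    """Fuel weight (N) burned over `distance` in steady, level cruise.

    Derived from dW/dt = -tsfc * thrust with thrust = drag = W / (L/D):
        W(x) = W0 * exp(-tsfc * x / (speed * L/D))

    initial_weight : aircraft weight at the start of cruise (N)
    distance       : cruise distance (m)
    speed          : true airspeed (m/s), assumed constant
    tsfc           : fuel weight burned per unit thrust per second (1/s)
    lift_to_drag   : L/D ratio, assumed constant
    """
    exponent = distance * tsfc / (speed * lift_to_drag)
    return initial_weight * (1.0 - math.exp(-exponent))

# Hypothetical values, chosen only to exercise the formula.
fuel = cruise_fuel_breguet(
    initial_weight=700_000.0,  # N
    distance=3_000_000.0,      # m
    speed=230.0,               # m/s
    tsfc=1.7e-4,               # 1/s
    lift_to_drag=17.0,
)
print(f"Cruise fuel burned: {fuel:.0f} N")
```

Because the expression is explicit in the aerodynamic (L/D) and engine (TSFC) parameters, its sensitivity to each of them can be obtained by direct differentiation, which is the kind of analysis that closed-form formulae make possible.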
Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor's mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss' improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.
Temporal sequences of satellite images constitute a very valuable and abundant resource for the analysis of a region of interest. Meanwhile, deep learning techniques are currently the state of the art in automatic image classification. This is why the application of this class of models to satellite imagery is attracting more and more attention from the academic community and industry. However, labeled data, which are generally necessary for training deep learning models, are very scarce and expensive to obtain for satellite imagery. In this context, we investigate a fully unsupervised procedure in which, given a sequence of images, a semantic embedding is learned and a partition of the terrain is created according to its semantic properties and its evolution over time. This approach offers a novel global perspective of the terrain, where large areas sharing similar semantics and temporal evolution are connected to form clearly defined patterns. The results also show the close relationship between the distribution of clusters in the geographic space and their distribution in the embedded space. The semantic analysis is completed by obtaining the time series representing each cluster, the series representing the boundaries and a graph explaining the connection between the different clusters. The methodology is illustrated by performing a semantic analysis using a sequence of satellite images of the Navarra region (Spain).
Supervised classification is a fundamental part of machine learning whose applications to real problems are of great interest. Both in the literature and in machine learning software libraries, we can find multiple proposals for supervised classification paradigms (decision trees, neural networks, Bayesian networks, etc.) as well as various methods for fitting these models. Having a methodology to evaluate and fairly compare the results of these models is essential to reach the right conclusions. However, on many occasions we tend to neglect the correct validation of the results obtained.
This talk presents a review of the methodology for the honest assessment of classifiers, providing useful information for choosing the best alternative in validation processes when solving supervised classification problems. Since the fundamental aspects of honest model validation are spread over a long list of literature references, this talk may be useful to condense them and to provide sufficient guidance on the use of the different alternatives.
The content is structured in three main blocks. After an introduction to the problem of supervised classification and the importance of honest model validation, the first block is devoted to scores as measures of classifier quality, their main characteristics and their use. The second block presents the problem of estimating the value of scores from finite sets of data, the most commonly used estimation methods and their properties in terms of bias and variance, as well as possible variations and improvements. Finally, the last block briefly presents hypothesis tests as a tool for comparing classifiers in different situations, presenting the possible alternatives depending on the conditions of the problem to be solved.
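As an illustration of estimating a score from a finite sample, the following is a minimal sketch of k-fold cross-validation, one commonly used estimation method. The abstract does not say which estimators the talk covers, so this choice, the toy nearest-mean classifier and the synthetic data are all assumptions made for the example.

```python
import random
import statistics

def k_fold_accuracy(xs, ys, fit, k=5, seed=0):
    """Return the accuracy measured on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # The model is fitted only on data that is not used to score it.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return scores

def fit_nearest_mean(xs, ys):
    """Toy 1-D classifier: predict the class whose training mean is closest."""
    means = {c: statistics.mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

# Synthetic two-class data (hypothetical): unit-variance Gaussians centred at 0 and 2.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

scores = k_fold_accuracy(xs, ys, fit_nearest_mean, k=5)
mean_acc = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"estimated accuracy: {mean_acc:.3f} (fold-to-fold std {spread:.3f})")
```

The mean over folds is the point estimate of the score, while the spread across folds gives a rough indication of its variability, the two properties (bias and variance) by which such estimators are usually compared.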
The most popular ways of learning models from data have been "supervised classification" and "unsupervised classification, or clustering". While the former requires the entire sample to be annotated in order to learn a predictive model, the latter works on an unlabeled sample with the goal of discovering the algebraic structures of the data.
Outside this "comfort zone", the so-called "weakly-supervised classification" has emerged strongly during the last decade: not all the sample is labeled, there may be extra information about the annotation at prediction time, and the "case-label" relations may go beyond the classical "one sample -- one label".
The seminar will serve to review the main "weakly-supervised" scenarios that have emerged in the scientific literature, highlighting the genuine characteristics of sample labeling in each of them. A "taxonomy" will be offered to differentiate and characterize them. Each scenario will be illustrated with applications and reference works.
Combining machine learning and computational chemistry for predictive insights into chemical systems
08/09/2022. Valentín Vassilev Galindo. University of Luxembourg