DATAI Seminars. Course 2023-2024
Biosignals, such as electrocardiograms (ECGs), electroencephalograms (EEGs) and electromyograms (EMGs), contain complex patterns that traditional signal processing methods are not always able to interpret accurately. In contrast, deep learning techniques, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are now widely used for feature extraction and classification from such data.
This talk reviews the current state of the art of deep learning applications in biosignal analysis, focusing mainly on EEG signals. The suitability of these techniques for different types of biosignals is discussed, and challenges such as data scarcity and model interpretability are addressed.
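As an illustration of the kind of architecture discussed above, the following is a minimal sketch of a 1D CNN classifier for fixed-length EEG windows. The input dimensions, channel counts and class count are assumptions for illustration only; this is not the specific model presented in the talk.

```python
import torch
import torch.nn as nn

class EEGConvNet(nn.Module):
    """Minimal 1D CNN for classifying fixed-length EEG windows (illustrative sketch)."""
    def __init__(self, n_channels=32, n_samples=512, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3),  # temporal convolution across all channels
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # After two pooling stages of factor 4, the temporal length is n_samples // 16
        self.classifier = nn.Linear(32 * (n_samples // 16), n_classes)

    def forward(self, x):  # x: (batch, n_channels, n_samples)
        z = self.features(x)
        return self.classifier(z.flatten(1))

# Example: a batch of 8 EEG windows, 32 channels, 512 samples each
logits = EEGConvNet()(torch.randn(8, 32, 512))
```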
On fingerprinting and functional reconfiguration of functional connectomes
06/13/2024 / Joaquín Goñi
Functional connectomes (FCs) contain pairwise estimations of functional couplings based on the activity of pairs of brain regions derived from fMRI BOLD signals. FCs are commonly represented as correlation matrices, which are symmetric positive definite (SPD) matrices lying on or inside the SPD manifold. Since the geometry of the SPD manifold is non-Euclidean, the inter-related entries of FCs undermine the use of Euclidean-based distances and their stability when used as features in machine learning algorithms. By projecting FCs into a tangent space, we can obtain tangent functional connectomes (tangent-FCs), whose entries are not inter-related and thus allow the use of Euclidean-based methods. Tangent-FCs have shown higher predictive power for behavior and cognition, but no studies have evaluated the effect of such projections with respect to fingerprinting.
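For reference, a minimal sketch of the tangent projection just described, with the identity used as a trivial reference matrix and a placeholder regularization weight; the specific reference choices and regularization schemes studied in the talk are evaluated below.

```python
import numpy as np

def _sqrtm_inv(m):
    """Inverse square root of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def _logm_spd(m):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

def tangent_project(fc, c_ref, reg=1.0):
    """Project an SPD functional connectome onto the tangent space at c_ref.

    reg is a weight added to the main diagonal before projection; the talk
    evaluates several regularization choices, this value is illustrative.
    """
    fc = fc + reg * np.eye(fc.shape[0])          # main-diagonal regularization
    w = _sqrtm_inv(c_ref)
    return _logm_spd(w @ fc @ w)                 # log(Cref^-1/2 FC Cref^-1/2)

# Toy example with a random correlation matrix standing in for an FC
rng = np.random.default_rng(0)
fc = np.corrcoef(rng.standard_normal((50, 200)))
tangent_fc = tangent_project(fc, c_ref=np.eye(50))
```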
In this work, we hypothesize that tangent-FCs have a stronger fingerprint than "regular" (i.e., non-tangent-projected) FCs. Fingerprinting was measured by identification rates (ID rates) using the standard test-retest approach as well as by incorporating monozygotic and dizygotic twins. We assessed: (i) choice of the reference matrix Cref: tangent projections require a reference point on the SPD manifold, so we explored the effect of choosing different reference matrices; (ii) main-diagonal regularization: we explored the effect of weighted main-diagonal regularization; (iii) different fMRI conditions: we included resting state and seven fMRI tasks; (iv) parcellation granularities from 100 to 900 cortical brain regions (plus subcortical); (v) different distance metrics: correlation and Euclidean distances were used to compare regular FCs as well as tangent-FCs; and (vi) fMRI scan length, in resting state and when comparing task-based versus (matching scan length) resting-state fingerprints.
Our results showed that identification rates are systematically higher when using tangent-FCs. Specifically, we found: (i) Riemannian and log-Euclidean reference matrices systematically led to higher ID rates for all configurations assessed. (ii) In tangent-FCs, main-diagonal regularization prior to tangent space projection was critical for the ID rate when using Euclidean distance, whereas it barely affected ID rates when using correlation distance. (iii) ID rates depended on the fMRI condition and scan length. (iv) Parcellation granularity was key for ID rates in FCs, as well as in tangent-FCs with fixed regularization, whereas optimal regularization of tangent-FCs mostly removed this effect. (v) Correlation distance on tangent-FCs outperformed any other combination of distance and FC representation across the "fingerprint gradient" (here sampled by assessing test-retest, monozygotic twins and dizygotic twins). (vi) ID rates tended to be higher in task scans than in resting-state scans when accounting for fMRI scan length.
I will also introduce our ongoing work on task-to-rest and rest-to-task functional reconfiguration based on tangent space projections of functional connectivity.
Data scarcity remains a major obstacle in biomedical machine learning, where data acquisition is frequently costly or difficult. Synthetic data generation offers a compelling solution for training more robust and generalizable machine learning models. While the generation of synthetic data for cancer diagnosis has been investigated in the literature, it has primarily focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. However, inspired by the success of text-to-image synthesis models in generating natural images based on a text prompt, a compelling question arises: can this framework be extended to gene expression data and digital pathology, given the established relationship between the two in existing research? This presentation introduces two novel RNA-to-Image generation models. We demonstrate that they effectively synthesize image tiles preserving gene expression patterns for both healthy and multi-cancer tissues. These models offer significant potential for biomedical research, including augmenting datasets to improve model performance, generating privacy-preserving synthetic samples, and enabling the investigation of causal links between gene expression modifications and tissue morphology changes.
This talk will go over the basics of the PageRank problem, studied initially by the founders of Google, who created their search engine by applying it to the internet graph with hyperlinks defining edges. Then, I will explain our new results on the problem for undirected graphs, whose main application is finding local clusters in networks, a task that appears in many branches of science. We now have algorithms that find local clusters fast, in time that depends not on the whole graph but on the local cluster itself, which is significantly smaller. This is joint work with Elias Wirth and Sebastian Pokutta.
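As background, a minimal power-iteration sketch of the basic PageRank computation on a small adjacency-list graph; this illustrates the classical problem only, not the new local-clustering algorithms mentioned above.

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=1000):
    """Basic PageRank by power iteration.

    adj   : dict mapping each node to the list of nodes it links to
            (every node appears as a key).
    alpha : damping factor (probability of following a hyperlink).
    """
    nodes = sorted(adj)
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    r = np.full(n, 1.0 / n)                      # uniform starting rank
    for _ in range(max_iter):
        new = np.full(n, (1.0 - alpha) / n)      # teleportation mass
        for u, outs in adj.items():
            if outs:                             # spread rank along out-links
                share = alpha * r[index[u]] / len(outs)
                for v in outs:
                    new[index[v]] += share
            else:                                # dangling node: spread uniformly
                new += alpha * r[index[u]] / n
        if np.abs(new - r).sum() < tol:
            r = new
            break
        r = new
    return dict(zip(nodes, r))

# Tiny example graph with hyperlinks as directed edges
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}))
```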
In essence, the goal of crime analysis is to understand the criminal reality in order to provide anticipatory capacity to the public safety and criminal justice system and to the police and judicial agencies involved in them. We will very briefly explore the state of the art and some advances in spatio-temporal crime forecasting methods. The disciplinary field we will explore includes tools from data science and from Geographic Information Science and Systems.
The empirical risk minimization (ERM) approach for supervised learning chooses prediction rules that fit training samples and are "simple" (generalize). This approach has been the workhorse of machine learning methods and has enabled a myriad of applications. However, ERM methods strongly rely on the specific training samples available and cannot easily address scenarios affected by distribution shifts and corrupted samples. Robust risk minimization (RRM) is an alternative approach that does not aim to fit training examples and instead chooses prediction rules minimizing the maximum expected loss (risk). This talk presents a learning framework based on the generalized maximum entropy principle that leads to minimax risk classifiers (MRCs). The proposed MRCs can efficiently minimize worst-case expected 0-1 loss and provide tight performance guarantees. In particular, MRCs are strongly universally consistent using feature mappings given by characteristic kernels. MRC learning is based on expectation estimates and does not strongly rely on specific training samples. Therefore, the methods presented can provide techniques that are robust to practical situations that defy conventional assumptions, e.g., training samples that follow a different distribution or are corrupted by noise.
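To make the contrast between the two approaches concrete, a schematic form of the objectives is given below. The notation is generic (feature mapping Φ, expectation estimates τ, confidence widths λ) and follows the usual robust-learning setup with an uncertainty set of distributions whose feature expectations are close to estimates from the training data; it is not reproduced from the talk.

```latex
% ERM: fit the training samples
\hat{h}_{\mathrm{ERM}} \in \arg\min_{h \in \mathcal{H}} \;
  \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(h, (x_i, y_i)\bigr)

% RRM / MRC: minimize the worst-case expected loss over an uncertainty set
\hat{h}_{\mathrm{MRC}} \in \arg\min_{h} \; \max_{p \in \mathcal{U}} \;
  \mathbb{E}_{(x,y) \sim p}\bigl[\ell\bigl(h, (x, y)\bigr)\bigr],
\qquad
\mathcal{U} = \bigl\{\, p : \bigl|\mathbb{E}_{p}[\Phi(x, y)] - \boldsymbol{\tau}\bigr|
  \le \boldsymbol{\lambda} \,\bigr\}
```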
In today's talk, I will delve into the realms of digital forensics, shedding light on the crucial aspects of my work. Together, we will explore the challenges and triumphs encountered in the pursuit of recovering invaluable data. It is not just a profession for me; it is a passion that has driven me to establish a pioneering data recovery lab right here in Navarra. I will share insights into real-world cases that highlight the intricacies of digital forensics. These cases will not only captivate your interest but also provide a glimpse into the critical role that data recovery plays in our technologically driven world.
In today's AI-driven landscape, understanding the decisions made by artificial intelligence systems is crucial. Explainable AI (XAI) emerges as a pivotal solution, shedding light on the opaque nature of some AI models. The significance of XAI spans various industries, and its application is particularly transformative in the pharmaceutical sector. As we delve into a practical case study, we'll witness how XAI has revolutionized drug research and development, offering efficiency gains, cost reductions, and a deeper understanding of critical decisions in this vital field.
Knowledge Graphs are machine-readable representations of information via predicate triples, typically defined by an underlying ontology schema. The recent rise of the Open Science paradigm and advances in Natural Language Processing models have led to the creation of Information Extraction pipelines that can generate large-scale scholarly Knowledge Graphs from scientific publications and patents, enabling advanced 'semantic' services such as fine-grained document classification, retrieval, question answering, and innovation tracking. However, tracking the complex research-industry dynamics of a target technological domain also requires incorporating alternative text sources, such as news and micro-blogging posts, from which conventional NLP methods and models typically struggle to accurately extract information with high recall. In this talk, we present an enhanced information extraction pipeline tailored to the generation of a knowledge graph comprising open-domain entities from micro-blogging posts. It leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over non-contextual word embeddings. We provide a use case that demonstrates the extraction of semantic triples within the domain of Digital Transformation from X/Twitter.
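As a toy illustration of the kind of pipeline described (not the actual system), the sketch below extracts naive subject-verb-object triples with spaCy dependency parsing and groups the extracted relation phrases by hierarchical clustering over static word vectors. The model name, dependency labels considered and distance threshold are assumptions.

```python
import spacy
from scipy.cluster.hierarchy import fcluster, linkage

# A spaCy model with static word vectors (e.g. en_core_web_md) is assumed installed.
nlp = spacy.load("en_core_web_md")

def extract_triples(text):
    """Extract naive (subject, relation, object) triples via dependency parsing."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "pobj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.text, token.lemma_, o.text))
    return triples

def cluster_relations(relations, threshold=1.0):
    """Group relation phrases by average-linkage clustering of their word vectors."""
    vectors = [nlp(r).vector for r in relations]
    labels = fcluster(linkage(vectors, method="average", metric="cosine"),
                      t=threshold, criterion="distance")
    return dict(zip(relations, labels))

triples = extract_triples("Acme acquires a robotics startup. BigCo buys a chip maker.")
print(triples)
print(cluster_relations(sorted({rel for _, rel, _ in triples})))
```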
Resource-Constrained Project Scheduling Problem: A bi-objective approach with time-dependent resource costs
10/25/2023 / Laura Anton Sanchez
This talk provides new insights into bi-criteria resource-constrained project scheduling problems. We define a realistic problem in which the objectives to combine are the makespan and the total cost of resource usage. Time-dependent costs are assumed for the resources, i.e., they depend on when a resource is used. An optimization model is presented, followed by the development of an algorithm aimed at finding the set of Pareto-optimal solutions. The intractability of the optimization models underlying the problem also justifies the development of a metaheuristic for approximating the same front. We design a bi-objective evolutionary algorithm that incorporates problem-specific knowledge and is based on the Non-dominated Sorting Genetic Algorithm (NSGA-II). The results demonstrate the efficiency of the proposed metaheuristic. In more recent work, six additional multi-objective evolutionary algorithms have been implemented to solve this problem, and an exhaustive comparison of their performance with the NSGA-II-based algorithm has been carried out. A computational and statistically supported study is conducted, using instances built from those available in the literature and applying a set of performance measures to the solution sets obtained by each methodology.
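Schematically, the bi-objective problem can be written as below, using generic RCPSP notation (activity finish times F_j, durations d_j, per-period resource usage u_kt, time-dependent unit costs c_kt, capacities R_k); this is a placeholder formulation for orientation, not the exact model of the paper.

```latex
\min \; \Bigl( \, C_{\max}, \;\; \sum_{k \in \mathcal{R}} \sum_{t=1}^{T} c_{kt}\, u_{kt} \Bigr)
\quad \text{s.t.} \quad
C_{\max} = \max_{j \in \mathcal{A}} F_j, \qquad
F_j \ge F_i + d_j \;\; \forall (i,j) \in \mathcal{P}, \qquad
u_{kt} \le R_k \;\; \forall k, t,
```

where \mathcal{A} is the set of activities, \mathcal{P} the precedence relation, and \mathcal{R} the set of renewable resources; the first objective is the makespan and the second the total time-dependent cost of resource usage.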