Publications

Topics:
  1. B. Gahtan, R. J. Sahala, A. M. Bronstein, R. Cohen, Exploring QUIC dynamics: a large-scale dataset for encrypted traffic analysis, arXiv:2410.03728, 2024 details

    Exploring QUIC dynamics: a large-scale dataset for encrypted traffic analysis

    B. Gahtan, R. J. Sahala, A. M. Bronstein, R. Cohen
    arXiv:2410.03728, 2024

    QUIC, a new and increasingly used transport protocol, addresses and resolves the limitations of TCP by offering improved security, performance, and features such as stream multiplexing and connection migration. These features, however, also present challenges for network operators who need to monitor and analyze web traffic. In this paper, we introduce VisQUIC, a labeled dataset comprising over 100,000 QUIC traces from more than 44,000 websites (URLs), collected over a four-month period. These traces provide the foundation for generating more than seven million images, with configurable parameters of window length, pixel resolution, normalization, and labels. These images enable an observer looking at the interactions between a client and a server to analyze and gain insights about QUIC encrypted connections. To illustrate the dataset’s potential, we offer a use-case example of an observer estimating the number of HTTP/3 responses/requests pairs in a given QUIC, which can reveal server behavior, client–server interactions, and the load imposed by an observed connection. We formulate the problem as a discrete regression problem, train a machine learning (ML) model for it, and then evaluate it using the proposed dataset on an example use case.

    B. Gahtan, S. Funk, E. Kodesh, I. Ketko, T. Kuflik, A. M. Bronstein, Automatic identification and visualization of group training activities using wearable data, arXiv:2410.05452, 2024 details

    Automatic identification and visualization of group training activities using wearable data

    B. Gahtan, S. Funk, E. Kodesh, I. Ketko, T. Kuflik, A. M. Bronstein
    arXiv:2410.05452, 2024

    Human Activity Recognition (HAR) identifies daily activities from time-series data collected by wearable devices like smartwatches. Recent advancements in Internet of Things (IoT), cloud computing, and low-cost sensors have broadened HAR applications across fields like healthcare, biometrics, sports, and personal fitness. However, challenges remain in efficiently processing the vast amounts of data generated by these devices and developing models that can accurately recognize a wide range of activities from continuous recordings, without relying on predefined activity training sessions. This paper presents a comprehensive framework for imputing, analyzing, and identifying activities from wearable data, specifically targeting group training scenarios without explicit activity sessions. Our approach is based on data collected from 135 soldiers wearing Garmin 55 smartwatches over six months. The framework integrates multiple data streams, handles missing data through cross-domain statistical methods, and identifies activities with high accuracy using machine learning (ML). Additionally, we utilized statistical analysis techniques to evaluate the performance of each individual within the group, providing valuable insights into their respective positions in the group in an easy-to-understand visualization. These visualizations facilitate easy understanding of performance metrics, enhancing group interactions and informing individualized training programs. We evaluate our framework through traditional train-test splits and out-of-sample scenarios, focusing on the model’s generalization capabilities. Additionally, we address sleep data imputation without relying on ML, improving recovery analysis. Our findings demonstrate the potential of wearable data for accurately identifying group activities, paving the way for intelligent, data-driven training solutions.

    B. Gahtan, R. J. Shahla, R. Cohen, A. M. Bronstein, Estimating the number of HTTP/3 responses in QUIC using deep learning, arXiv:2410.06140, 2024 details

    Estimating the number of HTTP/3 responses in QUIC using deep learning

    B. Gahtan, R. J. Shahla, R. Cohen, A. M. Bronstein
    arXiv:2410.06140, 2024

    QUIC, a new and increasingly used transport protocol, enhances TCP by providing better security, performance, and features like stream multiplexing. These features, however, also impose challenges for network middle-boxes that need to monitor and analyze web traffic. This paper proposes a novel solution for estimating the number of HTTP/3 responses in a given QUIC connection by an observer. This estimation reveals server behavior, client-server interactions, and data transmission efficiency, which is crucial for various applications such as designing a load balancing solution and detecting HTTP/3 flood attacks. The proposed scheme transforms QUIC connection traces into a sequence of images and trains machine learning (ML) models to predict the number of responses. Then, by aggregating images of a QUIC connection, an observer can estimate the total number of responses. As the problem is formulated as a discrete regression problem, we introduce a dedicated loss function. The proposed scheme is evaluated on a dataset of over seven million images, generated from 100,000 traces collected from over 44,000 websites over a four-month period, from various vantage points. The scheme achieves up to 97% cumulative accuracy in both known and unknown web server settings and 92% accuracy in estimating the total number of responses in unseen QUIC traces.

    H. Abraham, B. Gahtan, A. Kobovich, O. Leitersdorf, A. M. Bronstein, E. Yaakobi, Beyond the alphabet: deep signal embedding for enhanced DNA clustering, arXiv:2410.06188, 2024 details

    Beyond the alphabet: deep signal embedding for enhanced DNA clustering

    H. Abraham, B. Gahtan, A. Kobovich, O. Leitersdorf, A. M. Bronstein, E. Yaakobi
    arXiv:2410.06188, 2024

    The emerging field of DNA storage employs strands of DNA bases (A/T/C/G) as a storage medium for digital information to enable massive density and durability. The DNA storage pipeline includes: (1) encoding the raw data into sequences of DNA bases; (2) synthesizing the sequences as DNA strands that are stored over time as an unordered set; (3) sequencing the DNA strands to generate DNA reads; and (4) deducing the original data. The DNA synthesis and sequencing stages each generate several independent error-prone duplicates of each strand which are then utilized in the final stage to reconstruct the best estimate for the original strand. Specifically, the reads are first clustered into groups likely originating from the same strand (based on their similarity to each other), and then each group approximates the strand that led to the reads of that group. This work improves the DNA clustering stage by embedding it as part of the DNA sequencing. Traditional DNA storage solutions begin after the DNA sequencing process generates discrete DNA reads (A/T/C/G), yet we identify that there is untapped potential in using the raw signals generated by the Nanopore DNA sequencing machine before they are discretized into bases, a process known as basecalling, which is done using a deep neural network. We propose a deep neural network that clusters these signals directly, demonstrating superior accuracy, and reduced computation times compared to current approaches that cluster after basecalling.

    B. Gahtan, R. Cohen, A. M. Bronstein, E. Shapira, Data-driven cellular network selector for vehicle teleoperations, arXiv:2410.19791, 2024 details

    Data-driven cellular network selector for vehicle teleoperations

    B. Gahtan, R. Cohen, A. M. Bronstein, E. Shapira
    arXiv:2410.19791, 2024

    Remote control of robotic systems, also known as teleoperation, is crucial for the development of autonomous vehicle (AV) technology. It allows a remote operator to view live video from AVs and, in some cases, to make real-time decisions. The effectiveness of video-based teleoperation systems is heavily influenced by the quality of the cellular network and, in particular, its packet loss rate and latency. To optimize these parameters, an AV can be connected to multiple cellular networks and determine in real time over which cellular network each video packet will be transmitted. We present an algorithm, called Active Network Selector (ANS), which uses a time series machine learning approach for solving this problem. We compare ANS to a baseline non-learning algorithm, which is used today in commercial systems, and show that ANS performs much better, with respect to both packet loss and packet latency.

    T. Shor, C. Baskin, A. M. Bronstein, Leveraging latents for efficient thermography classification and segmentation, Proc. Medical Imaging with Deep Learning (MIDL), 2024 details

    Leveraging latents for efficient thermography classification and segmentation

    T. Shor, C. Baskin, A. M. Bronstein
    Proc. Medical Imaging with Deep Learning (MIDL), 2024

    Breast cancer is a prominent health concern worldwide, currently being the secondmost common and second-deadliest type of cancer in women. While current breast cancer diagnosis mainly relies on mammography imaging, in recent years the use of thermography for breast cancer imaging has been garnering growing popularity. Thermographic imaging relies on infrared cameras to capture body-emitted heat distributions. While these heat signatures have proven useful for computer-vision systems for accurate breast cancer segmentation and classification, prior work often relies on handcrafted feature engineering or complex architectures, potentially limiting the comparability and applicability of these methods. In this work, we present a novel algorithm for both breast cancer classification and segmentation. Rather than focusing efforts on manual feature and architecture engineering, our algorithm focuses on leveraging an informative, learned feature space, thus making our solution simpler to use and extend to other frameworks and downstream tasks, as well as more applicable to data-scarce settings. Our classification produces SOTA results, while we are the first work to produce segmentation regions studied in this paper.

    A. A. Rosenberg, S. Vedula, A. M. Bronstein, A. Marx, Seeing Double: Molecular dynamics simulations reveal the stability of certain alternate protein conformations in crystal structures, bioRxiv 2024.08.31.610605, 2024 details

    Seeing Double: Molecular dynamics simulations reveal the stability of certain alternate protein conformations in crystal structures

    A. A. Rosenberg, S. Vedula, A. M. Bronstein, A. Marx
    bioRxiv 2024.08.31.610605, 2024

    Proteins jiggle around, adopting ensembles of interchanging conformations. Here we show through a large-scale analysis of the Protein Data Bank and using molecular dynamics simulations, that segments of protein chains can also commonly adopt dual, transiently stable conformations which is not explained by direct interactions. Our analysis highlights how alternate conformations can be maintained as non-interchanging, separated states intrinsic to the protein chain, namely through steric barriers or the adoption of transient secondary structure elements. We further demonstrate that despite the commonality of the phenomenon, current structural ensemble prediction methods fail to capture these bimodal distributions of conformations.

    S. Vedula, V. Maiorca, L. Basile, F. Locatello, A. M. Bronstein, Scalable unsupervised alignment of general metric and non-metric structures, arXiv preprint arXiv:2406.13507, 2024 (also in Proc. ICML Workshop on AI4Science) details

    Scalable unsupervised alignment of general metric and non-metric structures

    S. Vedula, V. Maiorca, L. Basile, F. Locatello, A. M. Bronstein
    arXiv preprint arXiv:2406.13507, 2024 (also in Proc. ICML Workshop on AI4Science)

    Aligning data from different domains is a fundamental problem in machine learning with broad applications across very different areas, most notably aligning experimental readouts in single-cell multiomics. Mathematically, this problem can be formulated as the minimization of disagreement of pair-wise quantities such as distances and is related to the Gromov-Hausdorff and Gromov-Wasserstein distances. Computationally, it is a quadratic assignment problem (QAP) that is known to be NP-hard. Prior works attempted to solve the QAP directly with entropic or low-rank regularization on the permutation, which is computationally tractable only for modestly-sized inputs, and encode only limited inductive bias related to the domains being aligned. We consider the alignment of metric structures formulated as a discrete Gromov-Wasserstein problem and instead of solving the QAP directly, we propose to learn a related well-scalable linear assignment problem (LAP) whose solution is also a minimizer of the QAP. We also show a flexible extension of the proposed framework to general non-metric dissimilarities through differentiable ranks. We extensively evaluate our approach on synthetic and real datasets from single-cell multiomics and neural latent spaces, achieving state-of-the-art performance while being conceptually and computationally simple.

    A. A. Rosenberg, A. Marx, A. M. Bronstein, A dataset of alternately located segments in protein crystal structures, Scientific Data, 11 (783), 2024 details

    A dataset of alternately located segments in protein crystal structures

    A. A. Rosenberg, A. Marx, A. M. Bronstein
    Scientific Data, 11 (783), 2024

    Protein Data Bank (PDB) files list the relative spatial location of atoms in a protein structure as the final output of the process of fitting and refining to experimentally determined electron density measurements. Where experimental evidence exists for multiple conformations, atoms are modelled in alternate locations. Programs reading PDB files commonly ignore these alternate conformations by default leaving users oblivious to the presence of alternate conformations in the structures they analyze. This has led to underappreciation of their prevalence, under characterisation of their features and limited the accessibility to this high-resolution data representing structural ensembles. We have trawled PDB files to extract structural features of residues with alternately located atoms. The output includes the distance between alternate conformations and identifies the location of these segments within the protein chain and in proximity of all other atoms within a defined radius. This dataset should be of use in efforts to predict multiple structures from a single sequence and support studies investigating protein flexibility and the association with protein function.

    D. Freedman, E. Rozenberg, A. M. Bronstein, A theoretical framework for an efficient normalizing flow-based solution to the Schrödinger equation, arXiv preprint arXiv:2406.00047, 2024 details

    A theoretical framework for an efficient normalizing flow-based solution to the Schrödinger equation

    D. Freedman, E. Rozenberg, A. M. Bronstein
    arXiv preprint arXiv:2406.00047, 2024

    A central problem in quantum mechanics involves solving the Electronic Schrödinger Equation for a molecule or material. The Variational Monte Carlo approach to this problem approximates a particular variational objective via sampling, and then optimizes this approximated objective over a chosen parameterized family of wavefunctions, known as the ansatz. Recently neural networks have been used as the ansatz, with accompanying success. However, sampling from such wavefunctions has required the use of a Markov Chain Monte Carlo approach, which is inherently inefficient. In this work, we propose a solution to this problem via an ansatz which is cheap to sample from, yet satisfies the requisite quantum mechanical properties. We prove that a normalizing flow using the following two essential ingredients satisfies our requirements: (a) a base distribution which is constructed from Determinantal Point Processes; (b) flow layers which are equivariant to a particular subgroup of the permutation group. We then show how to construct both continuous and discrete normalizing flows which satisfy the requisite equivariance. We further demonstrate the manner in which the non-smooth nature (“cusps”) of the wavefunction may be captured, and how the framework may be generalized to provide induction across multiple molecules. The resulting theoretical framework entails an efficient approach to solving the Electronic Schrödinger Equation.

    Y. Elul, E. Rozenberg, A. Boyarski, Y. Yaniv, A. Schuster, A. M. Bronstein , Data-driven modeling of interrelated dynamical systems, Nature Communications Physics (7), 144, 2024 details

    Data-driven modeling of interrelated dynamical systems

    Y. Elul, E. Rozenberg, A. Boyarski, Y. Yaniv, A. Schuster, A. M. Bronstein
    Nature Communications Physics (7), 144, 2024

    Non-linear dynamical systems describe numerous real-world phenomena, ranging from the weather, to financial markets and disease progression. Individual systems may share substantial common information, for example patients’ anatomy. Lately, deep-learning has emerged as a leading method for data-driven modeling of non-linear dynamical systems. Yet, despite recent breakthroughs, prior works largely ignored the existence of shared information between different systems. However, such cases are quite common, for example, in medicine: we may wish to have a patient-specific model for some disease, but the data collected from a single patient is usually too small to train a deep-learning model. Hence, we must properly utilize data gathered from other patients. Here, we explicitly consider such cases by jointly modeling multiple systems. We show that the current single-system models consistently fail when trying to learn simultaneously from multiple systems. We suggest a framework for jointly approximating the Koopman operators of multiple systems, while intrinsically exploiting common information. We demonstrate how we can adapt to a new system using order-of-magnitude less new data and show the superiority of our model over competing methods, in terms of both forecasting ability and statistical fidelity, across chaotic, cardiac, and climate systems.

    O. Wengrowicz, A. M. Bronstein, O. Cohen, Unsupervised physics-informed deep learning-based reconstruction for time-resolved imaging by multiplexed ptychography, Optics Express 32(6), pp. 8791-8803, 2024 details

    Unsupervised physics-informed deep learning-based reconstruction for time-resolved imaging by multiplexed ptychography

    O. Wengrowicz, A. M. Bronstein, O. Cohen
    Optics Express 32(6), pp. 8791-8803, 2024

    We explore numerically an unsupervised, physics-informed, deep learning-based reconstruction technique for time-resolved imaging by multiplexed ptychography. In our method, the untrained deep learning model replaces the iterative algorithm’s update step, yielding superior reconstructions of multiple dynamic object frames compared to conventional methodologies. More precisely, we demonstrate improvements in image quality and resolution, while reducing sensitivity to the number of recorded frames, the mutual orthogonality of different probe modes, overlap between neighboring probe beams and the cutoff frequency of the ptychographic microscope – properties that are generally of paramount importance for ptychographic reconstruction algorithms.

    G. Serussi, T. Shor, T. Hirshberg, C. Baskin, A. M. Bronstein, Active propulsion noise shaping for multi-rotor aircraft localization, Proc. Int'l Conf. on Intelligent Robots and Systems (IROS), 2024 details

    Active propulsion noise shaping for multi-rotor aircraft localization

    G. Serussi, T. Shor, T. Hirshberg, C. Baskin, A. M. Bronstein
    Proc. Int'l Conf. on Intelligent Robots and Systems (IROS), 2024

    Multi-rotor aerial autonomous vehicles (MAVs) primarily rely on vision for navigation purposes. However, visual localization and odometry techniques suffer from poor performance in low or direct sunlight, a limited field of view, and vulnerability to occlusions. Acoustic sensing can serve as a complementary or even alternative modality for vision in many situations, and it also has the added benefits of lower system cost and energy footprint, which is especially important for micro aircraft. This paper proposes actively controlling and shaping the aircraft propulsion noise generated by the rotors to benefit localization tasks, rather than considering it a harmful nuisance. We present a neural network architecture for selfnoise-based localization in a known environment. We show that training it simultaneously with learning time-varying rotor phase modulation achieves accurate and robust localization. The proposed methods are evaluated using a computationally affordable simulation of MAV rotor noise in 2D acoustic environments that is fitted to real recordings of rotor pressure fields.

    M. Pegoraro, S. Vedula, A. A. Rosenberg, I. Tallini, E. Rodolà, A. M. Bronstein, Vector quantile regression on manifolds, Proc. AIStats, 2024 details

    Vector quantile regression on manifolds

    M. Pegoraro, S. Vedula, A. A. Rosenberg, I. Tallini, E. Rodolà, A. M. Bronstein
    Proc. AIStats, 2024

    Quantile regression (QR) is a statistical tool for distribution-free estimation of conditional quantiles of a target variable given explanatory features. QR is limited by the assumption that the target distribution is univariate and defined on an Euclidean domain. Although the notion of quantiles was recently extended to multi-variate distributions, QR for multi-variate distributions on manifolds remains underexplored, even though many important applications inherently involve data distributed on, e.g., spheres (climate and geological phenomena), and tori (dihedral angles in proteins). By leveraging optimal transport theory and c-concave functions, we meaningfully define conditional vector quantile functions of high-dimensional variables on manifolds (M-CVQFs). Our approach allows for quantile estimation, regression, and computation of conditional confidence sets and likelihoods. We demonstrate the approach’s efficacy and provide insights regarding the meaning of non-Euclidean quantiles through synthetic and real data experiments.

    Y. Chen, H. Ye, S. Vedula, A. M. Bronstein, R. Dreslinski, T. Mudge, N. Talati, Demystifying graph sparsification algorithms in graph properties preservation, Proc.Int'l Conf. on Very Large Databases (VLDB), 2024 details

    Demystifying graph sparsification algorithms in graph properties preservation

    Y. Chen, H. Ye, S. Vedula, A. M. Bronstein, R. Dreslinski, T. Mudge, N. Talati
    Proc.Int'l Conf. on Very Large Databases (VLDB), 2024

    Graph sparsification is a technique that approximates a given graph by a sparse graph with a subset of vertices and/or edges. The goal of an effective sparsification algorithm is to maintain specific graph properties relevant to the downstream task while minimizing the graph’s size. Graph algorithms often suffer from long execution time due to the irregularity and the large real-world graph size. Graph sparsification can be applied to greatly reduce the run time of graph algorithms by substituting the full graph with a much smaller sparsified graph, without significantly degrading the output quality. However, the interaction between numerous sparsifiers and graph properties is not widely explored, and the potential of graph sparsification is not fully understood.
    In this work, we cover 16 widely-used graph metrics, 12 representative graph sparsification algorithms, and 14 real-world input graphs spanning various categories, exhibiting diverse characteristics, sizes, and densities. We developed a framework to extensively assess the performance of these sparsification algorithms against graph metrics, and provide insights to the results. Our study shows that there is no one sparsifier that performs the best in preserving all graph properties, e.g. sparsifiers that preserve distance-related graph properties (eccentricity) struggle to perform well on Graph Neural Networks (GNN). This paper presents a comprehensive experimental study evaluating the performance of sparsification algorithms in preserving essential graph metrics. The insights inform future research in incorporating matching graph sparsification to graph algorithms to maximize benefits while minimizing quality degradation. Furthermore, we provide a framework to facilitate the future evaluation of evolving sparsification algorithms, graph metrics, and ever-growing graph data.