2 What is Hourglass?

2.1 tl;dr

What is it?

Software tool to facilitate exploration of multi-parametric data (especially bioimage quantification).

For who?

Individuals who need data plotted hands-free.

How?

Provide user preferences to filter and subset large datasets and compare groups in an automated manner.

2.2 Roadmap

Schematic depicting Hourglass pipeline. Top: Collecting data. Middle: Hourglass. Bottom: Output visualizations in structured folder hierarchy.

Figure 2.1: Schematic depicting Hourglass pipeline. Top: Collecting data. Middle: Hourglass. Bottom: Output visualizations in structured folder hierarchy.

2.3 Is this tool for me?

What is it optimal for?

  • Multiparametric data analysis
  • Systematic data interrogation
    • Explores multiple groups
    • Wide range of plotting functions
    • Method of quality control

What can it do, but is not optimal for?

  • Omics visualization, i.e. any numeric dataset with annotations

What can it not do?

  • Integration of multiple independent datasets, e.g. can’t compare proteome vs transcriptome
  • Image analysis, i.e. read and interpret raw image files

2.4 Terminology

  • Annotation = any discrete label or description for features/parameters and patients/samples. See Input Data for more information.

    • Patient or sample annotations are another layer of data that we can add to an already multi-parametric data.
Patients have different labels or 'annotations'.

Figure 2.2: Patients have different labels or ‘annotations’.

  • Feature = stains, genes, markers, etc (e.g. Gene1, ProteinX, IL6)
  • Parameters = readouts, measurements of features (e.g. Numbers of Positive Pixels, Percent Positive Pixels, Stained Area mm^2,Total Area mm^2)
Columns in stain data can also have labels or 'annotations'.

Figure 2.3: Columns in stain data can also have labels or ‘annotations’.

  • Multiparametric = data with more than two dimensions, results from multiple parameters being measured from features. For example, from methods such as Immunohistochemistry (IHC), immunomasscytometry (IMC), etc.

If we look at the feature IL6, using bioimage analysis tools, we can measure many readouts from one feature, such as: IL6 –> Number of positive pixels AND Area AND Positive Pixel percent

  • vs. Uniparametric = data from methods such as microarray, transcriptomics, proteomics, etc.

If we look at the same feature IL6, using traditional omics, we get a single readout from one feature, such as: IL6 –> protein abundance OR gene expression

2.5 By-Patient vs. By-Sample

In addition to enabling a highly systematic visualization of multiple research questions within the integrated dataset, Hourglass is designed to enable this structured analysis across multiple levels. Specifically, Hourglass is designed to leverage the fact that bioimaging datasets are typically recorded from multiple samples per patient. For example, TMAs comprise multiple samples per patient while IMC and spatial omic methods usually record multiple regions of interest (ROIs) per patient.

Hourglass can compare annotations at the patient or sample level.

Figure 2.4: Hourglass can compare annotations at the patient or sample level.

Hourglass accesses all levels of this unique data structure: ROIs/samples per patient may be averaged for by-patient analysis, run in parallel as individual sample analysis and, furthermore, be aggregated as patient-matched samples. Hourglass thereby accelerates exploration and direct comparison of regional versus global versus patient-paired features for each patient, providing a comprehensive overview over inter- and intratumoral stain differences.

The multiple levels are (see below for description):

Patients have different labels or 'annotations'.

Figure 2.5: Patients have different labels or ‘annotations’.

  1. Global/ByPatient: Typically, TMA-derived datasets comprising multiple samples per patient are averaged. Because there are multiple samples per patient, common practice is to collapse to a patient level by averaging samples across patients commonly done for this type of dataset on a by-patient basis. Collapsing our dataset resulted in patient x feature-parameter matrix dimensions of 165 x 631, with a total of 104,115 data points. For example, we could now compare IL6 hi vs IL6 low patients.

  2. Regional/BySample: This analysis is performed to figure out when intratumoral heterogeneity can be increasingly appreciated as a key factor in X. We saw a specific opportunity to dissolve those regions and look at heterogeneity across samples, so this is a specific angle that we implemented in Hourglass. For example, IL6 hi vs IL6 low regions across all samples.

  3. Patient-Paired: To analyze intrapatient heterogeneity, Hourglass can also create slopegraphs to resolve regional differences in a patient-paired manner. For example, for comparison of IL6 hi vs IL6 low regions within the same tumors. Hourglass averages the values and plots a graph in which the means of each group are paired. For example, tumor subtypes may vary across samples or ROIs in the same patient. If four samples from a patient are tumor subtype A and three are B, the means of these 2 groups are connected directly by a line and this comparison can be statistically assessed for all patients.

These different analysis levels can be utilized alone or in combination, i.e. the user can easily focus on standard per-patient analysis only or perform the analysis under identical settings at the global, regional and patient-paired levels to systematically identify communalities and differences. For example, if regional differences are driving global trends then we know that either or both heterogeneity and sampling bias is relevant in this research question.

2.6 Comparisons and Feature Sets

The Hourglass Workflow (left to right).

Figure 2.6: The Hourglass Workflow (left to right).

Hourglass enables versions of comparisons
Multiple versions of an analysis can be submitted in parallel to the same Hourglass run, enabling direct comparison of the results. In the case of a main comparison of IL6 high vs IL6 low patients, the user can submit additional versions of this analysis to directly compare if IL6-related differences could be sex-specific or subtype-specific.

When interacting with their data, users can thereby easily focus on different parameters and biological differences while having fixed analysis settings (same feature sets, correlation method, two-tailed p-value test). This facilitates systematic recognition and investigation of side/sub-correlations or confounding factors that may have otherwise been overlooked. Consequently, Hourglass provides automated large-scale integrated analysis with and without inclusion of different subgroups of patients or samples. This allows users to gain a quick and comprehensive overview of how different subgroups might be affecting their results.

Hourglass enables targeted analysis and grouping of feature sets
Often-times a dataset may contain features that are not of interest to a specific question or, related to this, sets of features that are biologically related (e.g. T cell stains CD3, CD8, FoxP3).

Thus, extensive analysis customization options were implemented in Hourglass. This optional analysis step allows for related parameters and features to be grouped together in composite plots, which is referred to as ‘feature sets’. For example, users may define T cell and B cell marker sets (set 1 and 2). Users can then furthermore create super-sets from sets, such as “immune markers” from B cell and T cell sets, to be displayed in custom composite plots as depicted in the heatmap. Any combination of features and parameters can be used to create a set, which enables easy but powerful customization of the numerous and partially related features within complex datasets.

Hourglass was designed to take in a general data format with user options and instantaneously generate a wealth of information.

The power lies in the manipulation of both axes of a numeric sample by feature matrix:

  1. the labeling and filtering of rows (patients/samples), and
  2. the subsetting of columns (features/parameters)

Any label that is assigned to patients can be used as a main comparison. For example, in an Hourglass-independent manner, annotations for patients can be derived from transcriptomic data such as Moffitt subtypes or genomic calls such as KRAS wild-type vs KRAS mutated. On the other axis, biologically relevant features and corresponding parameters can be grouped to make feature sets such as T cell markers and immune markers. Consequently, by compounding multiple sets with multiple sample or patient comparisons, the potential for data exploration is exponentiated.

New annotations can be added prior to running Hourglass. For example, transcriptional profiles of pancreatic cancer can be used to determine Moffitt molecular subtype.

Figure 2.7: New annotations can be added prior to running Hourglass. For example, transcriptional profiles of pancreatic cancer can be used to determine Moffitt molecular subtype.

2.7 Strengths

There are multiple benefits of using Hourglass as a framework for multiparametric dataset interrogation.

Accessibility

  • Accessible for non-expert users, fully reproducible and open-source.
  • Makes publication-ready plots via robust and automated pipeline reduces manual work.
  • Leverages metrics of heterogeneity from standard tissue microarray designs.

Query and Exploration

  • Customizable and endless combinations of selections of what labels/annotations to compare. Thus, multiple layers of information from TMAs are extracted but can only be understood or interpreted via a smart way of visualizing the data to make sense of it. Hourglass is the tool to subset, integrate, and pick specific parameters to appreciate all of the information from a dataset.

  • Different levels of analysis (global or by patient vs regional or by sample) can be run. Common practice is to average these values and collapse by patient, however a wealth of information is contained in the sample data and intra-patient differences.

  • Resolves regional vs global differences via paired analysis and heterogeneity barplots. Multiple layers of information from TMAs are extracted but can only be understood or interpreted via a smart way of visualizing the data to make sense of it.

Reproducibility

  • As the nature of such analyses is iterative, and users often return to refine or add to an analysis after extended periods of time, Hourglass supports a high level of reproducibility.
  • Each analysis produces a report (Excel file) sufficient to easily reproduce that analysis and also produce different additional versions of that analysis with identical settings.

Data cleaning and wrangling

  • Subset input table with all parameters to be compared
  • Filtering - Incorporates exclusion/inclusion criteria, within subgroups. For example, a researcher may want to exclude certain patient subgroups or focus on a subset of a 40 stain panel or pre-treated patients may be excluded to observe the effects of basic biology.
  • Imputation for dropouts. Often, datasets are patchy containing NA values.
  • Supports removal of outliers below the first quartile and above the third.

2.8 Limitations

Hourglass facilitates efficient knowledge extraction from complex datasets. However, it does have certain weaknesses.

Minimum number of stains for it to be useful

  • It is not useful when you have 5 samples and one stain/ typical use case when you have the two. You can still benefit from subsetting.
  • Feature sets have to contain 3 or more features in order to be visualized.

Time commitment

  • While Hourglass automatically creates publication-ready plots, it is still semi-automated, the “semi” meaning that you as a user are required to do some work. Firstly, Hourglass requires user tailors its input parameters to guide the analysis. Next, hundreds to thousands of plots will be created. While you may be swimming in a pool of beautiful plots, it may become tedious to look through all the files. However, the organized file output structure seeks to mitigate this through order of folder names: By Patient/By Sample > Imputed or not > Comparison > Feature Set