TILJ5104 STAT1: Applied Compositional Data Analysis: Multivariate and Functional Approaches (JSS31) (2 cr)
Description
Compositional data can be characterized as multivariate observations carrying relative information, typically expressed in units such as proportions, percentages, mg/kg, ppm or mg/l, and they occur in a wide range of applications in the natural and social sciences. Compositions are thus primarily data in which the relevant information is contained in the (log-)ratios between components; this led to the development of the logratio methodology, which is nowadays commonly considered the preferred choice for their statistical processing. The first aim of the course is to introduce the basic concepts of compositional data analysis, including the geometrical properties of compositions (they are characterized by the so-called Aitchison geometry) and interpretable logratio coordinate representations, which make it possible to apply popular multivariate methods to compositional data sets.
Secondly, an important case of compositions is distributional data, which usually result from the aggregation of large streams of data and can be expressed in terms of the probability mass function of one or more random variables (factors). For two factors this leads to the so-called compositional tables and, in general, to multifactorial compositional data. They can be decomposed orthogonally into independent and interactive parts, and an interpretable coordinate representation is built for each part.
Finally, the functional counterpart of compositional data (distributional data expressed in the form of probability density functions) has recently gained increasing attention in applications. The course will provide an introduction to the analysis of these data using a Functional Data Analysis (FDA) approach grounded in the perspective of Bayes spaces. Bayes spaces are spaces whose points are densities; they generalize the Aitchison geometry for multivariate compositional data to the FDA setting.
The theoretical parts of the course will be accompanied by examples with real-world data using the statistical software R.
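As a small illustration of the logratio idea described above, the following sketch computes centered logratio (clr) coefficients and the Aitchison distance for a made-up three-part composition. It is not part of the course materials, which use R; it is written in Python using only the standard library, and the function names and sample values are assumptions chosen purely for illustration.

```python
import math


def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered logratio coefficients: log of each part minus the mean log.

    Assumes all parts are strictly positive (zeros need separate treatment).
    """
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr coefficients of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Made-up sample: the same composition in absolute units and as proportions.
sample_units = [200.0, 300.0, 500.0]
sample_prop = closure(sample_units)

# Only ratios between parts matter, so both representations give the same
# clr coefficients and their Aitchison distance is zero up to rounding.
print(clr(sample_units))
print(clr(sample_prop))
print(aitchison_distance(sample_units, sample_prop))
```

The clr coefficients always sum to zero, which is one reason why other coordinate representations are often preferred when standard multivariate methods are applied.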
Preliminary program:
Day 1: [8 hours]
First block on compositional data analysis (concepts, geometry, logratio coordinate representations, exploratory data analysis; 6 x 45 minutes)
Seminar (exercises in R, discussion of participants' own compositional data problems)
Day 2: [8 hours]
Second block on compositional data analysis (popular multivariate methods: principal component analysis, regression, classification; high-dimensional compositions; irregularities such as zeros and missing values; 4 x 45 minutes)
Third block on compositional data analysis (compositional tables, their decomposition and coordinate representation; 2 x 45 minutes)
Seminar (exercises in R, discussion of participants' own compositional data problems)
Day 3: [4 hours]
Block on functional data analysis of density functions (concepts of Bayes spaces, exploratory data analysis, principal component analysis; 2 x 45 minutes)
Seminar (discussion of participants' own compositional data problems and of the learning outcomes)
Learning outcomes
Students will become familiar with the logratio methodology of compositional data analysis, so that by the end of the course they are able to use it for the statistical processing of their own data. For this purpose, lecture notes and R scripts from the course will be provided.
Description of prerequisites
Although the course will be taught at an applied level, the minimum requirement is one-semester undergraduate courses in statistics and mathematics. Familiarity with multivariate data analysis and the statistical software R will be beneficial; for those who do not work with R, instructions for using an alternative ("clickable") software will be provided. Functional data analysis of densities will be taught from scratch, so no prerequisites are needed for that part.
Completion methods
Method 1
Participation in teaching (2 cr)
lectures, exercises and discussions