What is Symbolic Data Analysis?
Symbolic data analysis (SDA) is a new field of research at the intersection of Statistics, Data Mining and Computer Science. It consists of extracting information from complex and massive data in order to understand, analyze and make decisions about the system that generates them.
SDA can be considered an extension of standard data analysis techniques (information reduction, clustering, forecasting) to symbolic data. In such data, individuals are described by symbolic variables such as lists of categorical or quantitative values (with or without associated weights), intervals, histograms and frequency distributions. In contrast to the classical approach, where each variable takes only a single number or category, these variables have great potential to characterize real-life situations (e.g. time-varying patterns, class descriptions, uncertain or inaccurate data) and to summarize massive datasets efficiently.
An elementary example of a symbolic data table is shown in Table 1, where each unit represents a class at a given university. Units are described by symbolic variables. 'Ages' is a multi-valued variable representing the list of students' ages in each group. 'Foreign languages' is a modal multi-valued variable that represents the languages spoken (outside class) in each group and the proportion of students speaking them (the proportions are not required to add up to 1). Modal variables allow proportions, frequencies, probabilities or weights to be attached to each specific value. 'Heights' is an interval-valued variable that shows the range of heights (a quantitative feature) in each group. Finally, 'Weights' is a histogram variable that shows the frequency distribution of the weights in each group. A histogram variable is a particular case of an interval-valued modal variable with non-overlapping intervals and weights adding up to 1.
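The four kinds of symbolic variable described above can be sketched as small data structures. The following is a minimal illustration, not an established SDA library API: the class names (`Interval`, `Histogram`, `ClassroomDescription`) and the helper `make_histogram` are hypothetical, and the values are those of classroom 2 in Table 1. The histogram check encodes the two conditions stated above (non-overlapping intervals, weights adding up to 1).

```python
import math
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Interval:
    """Interval-valued variable: a range with low <= high."""
    low: float
    high: float

    def __post_init__(self) -> None:
        if self.low > self.high:
            raise ValueError("interval bounds must satisfy low <= high")


@dataclass(frozen=True)
class Histogram:
    """Histogram variable: non-overlapping bins with weights adding up to 1."""
    bins: Tuple[Tuple[Interval, float], ...]

    def __post_init__(self) -> None:
        # Sort by lower bound, then compare each bin with its right neighbour.
        ordered = sorted(self.bins, key=lambda b: b[0].low)
        for (left, _), (right, _) in zip(ordered, ordered[1:]):
            if left.high > right.low:
                raise ValueError("histogram bins must not overlap")
        total = sum(weight for _, weight in self.bins)
        if not math.isclose(total, 1.0):
            raise ValueError("histogram weights must add up to 1")


@dataclass(frozen=True)
class ClassroomDescription:
    """One unit (row) of a symbolic data table such as Table 1."""
    ages: FrozenSet[int]                 # multi-valued variable
    foreign_languages: Dict[str, float]  # modal multi-valued; need not sum to 1
    heights: Interval                    # interval-valued variable (cm)
    weights: Histogram                   # histogram variable (kg)


def make_histogram(rows):
    """Build a Histogram from (low, high, weight) triples (hypothetical helper)."""
    return Histogram(tuple((Interval(lo, hi), w) for lo, hi, w in rows))


# Classroom 2 of Table 1.
classroom_2 = ClassroomDescription(
    ages=frozenset({22, 23, 30}),
    foreign_languages={"German": 0.8, "Chinese": 0.2},
    heights=Interval(160, 195),
    weights=make_histogram([
        (60, 70, 0.10), (70, 80, 0.20), (80, 90, 0.20),
        (90, 100, 0.25), (100, 110, 0.25),
    ]),
)
```

Validating at construction time is one possible design choice: a malformed interval or histogram is rejected as soon as the unit is built, rather than surfacing later inside an analysis method.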
As Ward, Peng and Wang (2004) point out, datasets increasingly suffer from the problem of scale, in terms of either the number of variables or the number of records. It is often desirable to reduce the size of the data while preserving their essential features as much as possible. This reduction can be performed by manually pruning the dataset on the basis of domain knowledge, by sampling, by dimensionality reduction methods such as principal component analysis and multidimensional scaling, or by aggregation/summarization methods such as clustering or partitioning. Symbolic data analysis is a new alternative for addressing this problem. It offers a comprehensive approach in two steps: first the dataset is summarized by means of symbolic variables, which yields a smaller and more manageable dataset that preserves the main information; then the summarized dataset is analyzed by means of symbolic methods.
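The summarization step can be illustrated with a small sketch that aggregates individual records into one symbolic description per group. Everything here is an assumption made for illustration: the function names `summarize` and `histogram`, the field names, the bin edges and the four student records are hypothetical and do not come from the cited works. Bins are right-closed, (low, high], to match the notation of Table 1.

```python
from collections import defaultdict


def histogram(values, edges):
    """Relative frequencies of values over right-closed bins (edges[i], edges[i+1]]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] < v <= edges[i + 1]:
                counts[i] += 1
                break
        else:
            raise ValueError(f"value {v} falls outside the bin edges")
    n = len(values)
    return [((edges[i], edges[i + 1]), counts[i] / n) for i in range(len(counts))]


def summarize(records, weight_edges):
    """Aggregate classical records into one symbolic description per classroom."""
    groups = defaultdict(list)
    for r in records:
        groups[r["classroom"]].append(r)

    table = {}
    for key, rows in groups.items():
        heights = [r["height_cm"] for r in rows]
        weights = [r["weight_kg"] for r in rows]
        table[key] = {
            "ages": sorted({r["age"] for r in rows}),       # multi-valued
            "heights": (min(heights), max(heights)),        # interval-valued
            "weights": histogram(weights, weight_edges),    # histogram
        }
    return table


# Hypothetical student-level records (classical data: one value per variable).
records = [
    {"classroom": "A", "age": 20, "height_cm": 162, "weight_kg": 55},
    {"classroom": "A", "age": 21, "height_cm": 170, "weight_kg": 62},
    {"classroom": "A", "age": 21, "height_cm": 175, "weight_kg": 68},
    {"classroom": "A", "age": 25, "height_cm": 180, "weight_kg": 75},
]

table = summarize(records, weight_edges=[50, 60, 70, 80])
```

Here four classical records collapse into a single symbolic unit; with many records per group the reduction in the number of rows is correspondingly larger, at the cost of keeping only the range and distribution of each variable rather than the individual values.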
Bock and Diday (2000) present an excellent review of this field. They describe statistical methods for symbolic data, such as descriptive statistics, principal component analysis, clustering and discrimination techniques, and illustrate the approach with examples drawn mainly from official statistics. Billard and Diday (2003), however, stress the enormous need for methodologies to deal with symbolic data.
Classroom | Foreign Languages | Ages (years) | Heights (cm) | Weights (kg)
1 | {Spanish, 0.5; French, 0.4} | {20, 21, 25} | [162,80] | {(40,50],.1; (50,60],.15; (60,70],.25; (70,80],.2; (80,90],.3}
2 | {German, 0.8; Chinese, 0.2} | {22, 23, 30} | [160,195] | {(60,70],.1; (70,80],.2; (80,90],.2; (90,100],.25; (100,110],.25}
3 | {French, 0.7; German, 0.4} | {18, 19} | [165,205] | {(40,50],.13; (50,60],.4; (60,70],.07; (70,80],.2; (80,90],.2}
4 | {German, 0.8; Chinese, 0.1} | {18, 19, 21} | [175,205] | {(50,60],.45; (60,70],.35; (80,90],.1; (100,110],.1; (100,110],.1}
... | ... | ... | ... | ...
Table 1. Elementary symbolic data table
References
- Billard, L., and Diday, E. (2003), 'From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis', Journal of the American Statistical Association, 98, 991-999.
- Bock, H.-H., and Diday, E. (eds.) (2000), 'Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information From Complex Data', 1st ed., Berlin: Springer-Verlag.
- Ward, M., Peng, W., and Wang, X. (2004), 'Hierarchical Visual Data Mining for Large-Scale Data', Computational Statistics, 19, 147-158.