Titre du document / Document title
Segmentation of multivariate mixed data via lossy data coding and compression
Auteur(s) / Author(s)
YI MA (1) ; DERKSEN Harm (2) ; WEI HONG (3) ; WRIGHT John (1)
Affiliation(s) du ou des auteurs / Author(s) Affiliation(s)
(1) Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Coordinated Science Laboratory, 1308 West Main Street, Urbana, IL 61801-2307, United States
(2) Department of Mathematics, University of Michigan, 530 Church Street, Ann Arbor, MI 48109-1043, United States
(3) DSP Solutions Research and Development Center, Texas Instruments, PO Box 660199, MS/8649, Dallas, TX 75266-0199, United States
Résumé / Abstract
In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate-distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm that depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data.
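The idea summarized in the abstract (choose the grouping that minimizes total coding length at a fixed distortion) can be illustrated with a minimal sketch. This record does not give the coding length function or the search procedure, so everything below is an assumption: the zero-mean Gaussian coding length L(V) = (m + n)/2 · log2 det(I + n/(eps² m) · V Vᵀ), the extra bits for group membership, and the greedy pairwise merging are one plausible reading, and the names `coding_length`, `total_length`, and `segment` are hypothetical.

```python
import math

def log2_det_spd(a):
    """log2(det(a)) of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(s)
                total += 2.0 * math.log2(l[i][j])
            else:
                l[i][j] = s / l[j][j]
    return total

def coding_length(points, eps):
    """Assumed bits needed to code `points` (zero-mean model) up to distortion eps**2."""
    m, n = len(points), len(points[0])
    scale = n / (eps ** 2 * m)
    # Build I + scale * sum_k x_k x_k^T, which is always positive definite.
    a = [[(1.0 if i == j else 0.0) + scale * sum(p[i] * p[j] for p in points)
          for j in range(n)] for i in range(n)]
    return 0.5 * (m + n) * log2_det_spd(a)

def total_length(groups, eps):
    """Sum of per-group coding lengths plus the bits that label group membership."""
    m = sum(len(g) for g in groups)
    return sum(coding_length(g, eps) - len(g) * math.log2(len(g) / m)
               for g in groups)

def segment(points, eps):
    """Greedy merging (an assumption): start from singletons, repeatedly apply
    the merge that shortens the total code the most, stop when none does."""
    groups = [[p] for p in points]
    while len(groups) > 1:
        current = total_length(groups, eps)
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                merged.append(groups[i] + groups[j])
                length = total_length(merged, eps)
                if length < current and (best is None or length < best[0]):
                    best = (length, merged)
        if best is None:
            break
        groups = best[1]
    return groups

# Toy data: six points on each coordinate axis of the plane.
x_axis = [(1, 0), (-1, 0), (2, 0), (-2, 0), (3, 0), (-3, 0)]
y_axis = [(0, 1), (0, -1), (0, 2), (0, -2), (0, 3), (0, -3)]
groups = segment(x_axis + y_axis, eps=0.01)
print(len(groups), "groups")
```

The only tuning knob in this sketch is `eps`, which mirrors the abstract's statement that the method depends on a single parameter, the allowable distortion. On the toy data a small `eps` keeps the two axes apart, while a very large `eps` makes every point cheap to code and collapses everything into one group.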
Revue / Journal Title
IEEE transactions on pattern analysis and machine intelligence
ISSN 0162-8828
CODEN ITPIDJ
Source / Source
2007, vol. 29, no. 9, pp. 1546-1562 [17 page(s) (article)] (36 ref.)
Langue / Language
English
Editeur / Publisher
IEEE Computer Society, Los Alamitos, CA, United States (1979) (Journal)
Mots-clés anglais / English Keywords
Cluster analysis ;
Phase transitions ;
Modeling ;
System identification ;
Parameter estimation ;
Asymptotic approximation ;
Deterministic approach ;
Degenerate system ;
Gaussian process ;
DNA chip ;
Outlier ;
Rate distortion theory ;
Mixed distribution ;
Image processing ;
Classification ;
Bioinformatics ;
Optimal solution ;
Data compression ;
Image segmentation ;
Pattern analysis ;
Artificial intelligence ;
Mots-clés français / French Keywords
Analyse amas ;
Transition phase ;
Modélisation ;
Identification système ;
Estimation paramètre ;
Approximation asymptotique ;
Approche déterministe ;
Système dégénéré ;
Processus Gauss ;
Puce à DNA ;
Observation aberrante ;
Théorie vitesse distorsion ;
Mélange loi probabilité ;
Traitement image ;
Classification ;
Bioinformatique ;
Solution optimale ;
Compression donnée ;
Segmentation image ;
Analyse forme ;
Intelligence artificielle ;
Mots-clés espagnols / Spanish Keywords
Análisis cluster ;
Transición fase ;
Modelización ;
Identificación sistema ;
Estimación parámetro ;
Aproximación asintótica ;
Enfoque determinista ;
Sistema degenerado ;
Proceso Gauss ;
Pulga de DNA ;
Observación aberrante ;
Mezcla ley probabilidad ;
Procesamiento imagen ;
Clasificación ;
Bioinformática ;
Solución óptima ;
Compresión dato ;
Análisis forma ;
Inteligencia artificial ;
Mots-clés d'auteur / Author Keywords
Multivariate mixed data ;
data segmentation ;
data clustering ;
rate distortion ;
lossy coding ;
lossy compression ;
image segmentation ;
microarray data clustering ;
Localisation / Location
INIST-CNRS, INIST shelf mark: 222 T, 35400014669395.0050
Refdoc record no. (ud4): 18972985