NetCube: A Scalable Tool for Fast Data Mining and Compression

D. Margaritis, C. Faloutsos, and S. Thrun.

We propose a novel method of computing and storing DataCubes. Our idea is to use Bayesian networks, which can generate approximate counts for any query that combines attribute values and "don't cares." A Bayesian network represents the underlying joint probability distribution of the data from which it was generated. By means of such a network, the proposed method, NetCube, exploits correlations among attributes. Our preprocessing algorithm scales linearly with the size of the database and is straightforward to parallelize. Moreover, we give an algorithm for estimating the counts of arbitrary queries that is fast: its running time is constant in the database size. Experimental results show that NetCubes are fast to generate and use.
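As an illustration of the idea of answering count queries from a Bayesian network, here is a minimal Python sketch. It is not the algorithm from the paper: the network structure, the probability tables, the database size `N_RECORDS`, and the function names are all hypothetical, and the "don't care" attributes are summed out by naive enumeration rather than by an efficient inference procedure. It only shows that an approximate count can be obtained as the database size times the query probability, without scanning the records.

```python
from itertools import product

# Hypothetical network over three binary attributes: A -> B and A -> C.
# Each table maps a tuple of parent values to P(attribute = 1 | parents).
PARENTS = {"A": (), "B": ("A",), "C": ("A",)}
CPT = {
    "A": {(): 0.5},
    "B": {(0,): 0.25, (1,): 0.75},
    "C": {(0,): 0.5, (1,): 0.5},
}
N_RECORDS = 1000  # assumed number of records in the database


def joint(assignment):
    """Probability of one full assignment, as a product of the local tables."""
    p = 1.0
    for attr, parents in PARENTS.items():
        p_one = CPT[attr][tuple(assignment[q] for q in parents)]
        p *= p_one if assignment[attr] == 1 else 1.0 - p_one
    return p


def estimate_count(query):
    """Approximate count for a query; attributes missing from it are don't cares."""
    free = [a for a in PARENTS if a not in query]
    prob = 0.0
    # Sum the joint probability over every value of the don't-care attributes.
    for values in product((0, 1), repeat=len(free)):
        assignment = dict(query)
        assignment.update(zip(free, values))
        prob += joint(assignment)
    return N_RECORDS * prob


if __name__ == "__main__":
    print(estimate_count({"B": 1}))          # A and C are don't cares
    print(estimate_count({"A": 1, "B": 1}))  # only C is a don't care
```

The cost of `estimate_count` in this sketch depends only on the network, not on `N_RECORDS`, which mirrors the abstract's claim that query time is constant in the database size. The enumeration is exponential in the number of don't-care attributes, so a practical implementation would need a proper inference method.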

The full paper is available in PDF and gzipped PostScript.

@INPROCEEDINGS{Margaritis01a,
  AUTHOR	= {Margaritis, D. and Faloutsos, C. and Thrun, S.},
  TITLE		= {NetCube: A Scalable Tool for Fast Data Mining and Compression},
  YEAR		= {2001},
  BOOKTITLE	= {Proceedings of the 2001 International Conference on Very Large Data Bases},
  ADDRESS	= {Rome, Italy},
  NOTE          = {}
}