2004 IPDPS Conference
18th International Parallel and Distributed Processing Symposium
April 26–April 30, Santa Fe, New Mexico
TUTORIAL TITLE: An Introduction to Distributed Data Mining
Hillol Kargupta, PhD
Computer Science and Electrical Engineering Department
University of Maryland Baltimore County
e-mail: hillol AT cs DOT umbc DOT edu
President, AGNIK, LLC.
S Rolling Road,
e-mail: Hillol AT agnik DOT com
Advances in computing and communication over wired and wireless networks have resulted in many pervasive distributed computing environments. The Internet, intranets, local area networks, ad hoc wireless networks, and sensor networks are some examples. These environments often come with different distributed sources of data and computation. Mining in such environments naturally calls for proper utilization of these distributed resources. Moreover, in many privacy-sensitive applications, different (possibly multi-party) data sets collected at different sites must be processed in a distributed fashion, without collecting everything at a single central site. However, most off-the-shelf data mining systems are designed to work as monolithic centralized applications. They normally download the relevant data to a centralized location and then perform the data mining operations there. This centralized approach does not work well in many of the emerging distributed, ubiquitous, possibly privacy-sensitive data mining applications.
The field of Distributed Data Mining (DDM) offers an alternative. It pays careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion. This tutorial will offer an introduction to the emerging field of Distributed Data Mining. The attendees will be exposed to the following aspects of this field:
1) An overview of the emerging DDM applications
2) An overview of the existing DDM algorithms
3) A more detailed discussion of some important DDM algorithms
4) An overview of the systems research issues in DDM
5) A detailed case study of an existing DDM system and a hands-on demonstration
6) Future directions
7) Pointers to more advanced material
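As a small illustration of processing data where it resides rather than downloading it to one site, the sketch below computes a global mean and variance from per-site summaries. It is a minimal example under stated assumptions, not an algorithm taken from the tutorial: the names `LocalSummary`, `summarize_site`, and `combine` are hypothetical, and each site is assumed to share only three numbers (count, sum, and sum of squares) instead of its raw records.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalSummary:
    """Sufficient statistics computed at one site (hypothetical type)."""
    count: int
    total: float
    total_sq: float


def summarize_site(values: Iterable[float]) -> LocalSummary:
    """Runs at a data site: reduce the local records to three numbers."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return LocalSummary(count, total, total_sq)


def combine(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Runs at a coordinator: merge the summaries into a global mean
    and population variance without ever seeing the raw records."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


if __name__ == "__main__":
    # Three sites holding disjoint partitions of the same attribute.
    sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    mean, variance = combine([summarize_site(s) for s in sites])
    print(mean, variance)
```

The communication cost here is three numbers per site regardless of how many records the site holds. Two caveats apply: the sum-of-squares formula can lose precision when the variance is small relative to the mean, and sharing exact aggregates does not by itself guarantee privacy.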
1. Distributed data mining (DDM) in a ubiquitous environment: An overview
a) Motivation (5mins)
b) Some of the Emerging Applications: (10mins)
i) Large-scale distributed grid-based applications
ii) Wireless applications
iii) Privacy-preserving applications
c) Challenges: (5mins)
i) Algorithmic issues
ii) Systems issues
iii) Communication issues
iv) Security issues
2. Algorithms and architectures: (1hr 30mins)
a) Distributed data mining algorithms:
i) Computing statistical aggregates in a distributed manner
ii) Distributed principal component analysis
iii) Distributed clustering
iv) Distributed Bayesian algorithms
v) Distributed classifier/predictive-model learning: Decision tree learning, multi-variate regression
b) Architectures: (20mins)
i) Distributed and monolithic architectures
ii) Multi-agent-based architectures
3. Communication languages for DDM applications (5mins)
4. Human-computer interaction issues (10mins)
5. Applications: Case study of a distributed vehicle fleet mining system (15mins)
6. Conclusions (5mins)
Since the field is very new and hardly any commercial or academic systems are available, the presenter may have to use some of the systems developed by his own research group to demonstrate different aspects of this technology.