2004 IPDPS Conference
18th International Parallel and Distributed Processing Symposium

April 26–April 30, Santa Fe, New Mexico
Eldorado Hotel


TUTORIAL TITLE: An Introduction to Distributed Data Mining



Hillol Kargupta, PhD

Associate Professor

Computer Science and Electrical Engineering Department

University of Maryland Baltimore County

1000 Hilltop Circle, Baltimore, Maryland 21250, USA

e-mail: hillol AT cs DOT umbc DOT edu


President, AGNIK, LLC.

1450 S Rolling Road, Baltimore, MD 21227

e-mail: Hillol AT agnik DOT com





Advances in computing and communication over wired and wireless networks have produced many pervasive distributed computing environments, such as the Internet, intranets, local area networks, ad hoc wireless networks, and sensor networks. These environments often come with different distributed sources of data and computation, and mining in them naturally calls for proper use of these distributed resources. Moreover, in many privacy-sensitive applications, different, possibly multi-party, data sets collected at different sites must be processed in a distributed fashion without gathering everything at a single central site. However, most off-the-shelf data mining systems are designed as monolithic centralized applications: they normally download the relevant data to a central location and then perform the data mining operations there. This centralized approach does not work well in many of the emerging distributed, ubiquitous, and possibly privacy-sensitive data mining applications.


The field of Distributed Data Mining (DDM) offers an alternative. It pays careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion. This tutorial offers an introduction to this emerging field. Attendees will be exposed to the following aspects:


1)       An overview of the emerging DDM applications

2)       An overview of the existing DDM algorithms

3)       More detailed discussion of some important DDM algorithms

4)       An overview of the systems research issues in DDM

5)       Detailed case study of an existing DDM system and hands-on demonstration

6)       Future directions

7)       Pointers to more advanced material and resources




1.       Distributed data mining (DDM) in a ubiquitous environment: An overview

a)   Motivation  (5mins)
b)   Some of the Emerging Applications: (10mins)
      i)  Large-scale distributed grid-based applications
      ii) Wireless applications
      iii) Privacy-preserving applications
c)   Challenges: (5mins)
      i)   Algorithmic issues
      ii)  Systems issues
      iii) Communication issues
      iv) Security issues.

2.       Algorithms and architectures: (1hr 30mins)

a)         Distributed data mining algorithms:

i)                     Computing statistical aggregates in a distributed manner

ii)                   Distributed principal component analysis

iii)                  Distributed clustering

iv)                  Distributed Bayesian algorithms

v)                    Distributed classifier/predictive-model learning: Decision tree learning, multi-variate regression

b)         Architectures: (20mins)

i)                     Distributed and monolithic architectures.

ii)                   Multi-agent-based architectures.

3.       Communication languages for DDM applications. (5mins)

4.       Human-computer interaction issues. (10mins)

5.       Applications: Case study of a distributed vehicle fleet mining system. (15mins)

6.       Conclusions (5mins)


Because the field is very new and hardly any commercial or academic systems are available, the presenter may use some of the systems developed by his research group to demonstrate different aspects of this technology.
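
As a rough illustration of the outline item on computing statistical aggregates in a distributed manner, the sketch below shows one common pattern: each site reduces its local records to a few sufficient statistics, and only those numbers travel to a coordinator. It is a minimal sketch and not taken from the tutorial material; the names LocalSummary, summarize, and combine are hypothetical, and the sites are simulated as in-memory lists rather than networked nodes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalSummary:
    """Sufficient statistics for one site (hypothetical structure)."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> LocalSummary:
    """Runs at a data site: reduce local records to three numbers."""
    count, total, total_sq = 0, 0.0, 0.0
    for v in values:
        count += 1
        total += v
        total_sq += v * v
    return LocalSummary(count, total, total_sq)


def combine(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Runs at the coordinator: merge summaries into the global mean
    and population variance without seeing any raw record."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    mean = sum(s.total for s in summaries) / n
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, max(variance, 0.0)


# Three simulated sites; only the summaries would cross the network.
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
global_mean, global_var = combine([summarize(s) for s in sites])
```

Each site sends three numbers regardless of how many records it holds, which is the communication saving the centralized approach lacks. The sum-of-squares formula is used here for brevity; it can lose precision on data with a large mean and small spread, so a real system would likely merge per-site means and squared deviations instead.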