o Pranam Kolari o UMBC Baltimore MD o kolari1@umbc.edu o http://www.cs.umbc.edu/~kolari1/ o

Web Mining Research - Pointers

In what follows, I have listed information useful for researchers who are starting out in this field. My survey is also slightly biased towards Web Usage Mining.

If you find this information useful, also consider reading (and citing) my work for the IEEE Computing in Science and Engineering Magazine at:

You can also check out my UMBC homepage.

NGDM - *Web Mining – Accomplishments & Future Directions* - Srivastava
    - A general survey talking about web mining taxonomy and domain.

NGDM - *Using Mining for and on the Semantic Web*  -  Gerd Stumme
    -  Discusses how Semantic Web/Ontologies can improve Web Usage Mining and the use of Mining techniques for the Semantic Web. In the Semantic Web links carry meaning and this can be used effectively for web structure mining. More detailed explanation follows later under Dr. Berendt.

Low Complexity Fuzzy Relational Clustering Algorithms for Web Mining - Joshi

Researchers active in this field are listed below along with links to their highly cited papers. Please note that papers that I could briefly go through and found interesting have been marked using *"Publication"*.

For quick reading, these are the existing surveys on Web Mining that I came across.
1. Web usage Mining - Srivastava (many publications)
2. Web content Mining - Data mining for hypertext: A tutorial survey (2000) - Soumen Chakrabarti
3. Web structure Mining - Web Structure Mining Exploiting the Graph Structure of World-Wide Web (2002) - Fürnkranz

Jiawei Han - UIUC

*Data Mining for Web Intelligence* a journal article published in 2002.
- A pretty comprehensive article on Web Mining.

Su-Jeong Ko, a research scientist, has also worked on Web Mining. Most of her publications seem to be on some form of user preference mining.

Web stream mining seems to be something new that Han's group has been working on lately.

Vipin Kumar - UMN

*Mining Association Patterns in Web Usage Data* (2002)  - Pang-Ning Tan, Vipin Kumar
Compares current data mining techniques for non-Web data and suggests why they are not sufficient. Comes up with refinements to association rules and effective ways of removing "Web Robot" data from logs. Finding negative associations is also discussed in detail for classification of user groups.
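
To make the association rule idea concrete, here is a minimal sketch of the standard support and confidence measures computed over sets of pages visited in user sessions. It does not reproduce the refinements proposed in the paper above; the session data and the function names `support` and `confidence` are made up for illustration.

```python
from typing import List, Set


def support(sessions: List[Set[str]], pages: Set[str]) -> float:
    """Fraction of sessions that contain every page in `pages`."""
    if not sessions:
        return 0.0
    hits = sum(1 for visited in sessions if pages <= visited)
    return hits / len(sessions)


def confidence(sessions: List[Set[str]],
               antecedent: Set[str],
               consequent: Set[str]) -> float:
    """Estimate of P(consequent | antecedent) over the given sessions."""
    base = support(sessions, antecedent)
    if base == 0.0:
        return 0.0
    return support(sessions, antecedent | consequent) / base


# Hypothetical sessions, each reduced to the set of pages visited.
sessions = [
    {"/home", "/courses", "/apply"},
    {"/home", "/courses"},
    {"/home", "/news"},
    {"/courses", "/apply"},
]

# Rule: visitors who view /courses also view /apply.
rule_support = support(sessions, {"/courses", "/apply"})
rule_confidence = confidence(sessions, {"/courses"}, {"/apply"})
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst; robot traffic would have to be filtered out of `sessions` beforehand.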

Document Categorization and Query Generation on the World Wide Web Using WebACE (1999)   (37 citations) - Daniel Boley
New document clustering and querying algorithms are put forward, with particular stress on the scalability of the approach.

Jaideep Srivastava - UMN 

Automatic Personalization Based on Web Usage Mining (1999)  (28 citations)
Talks about the process of web usage mining, from data collection through to finding usage patterns.

*Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data*(2000) (45 citations)
Looks similar to the one listed above. A very good introductory paper on Web usage mining.

Bamshad Mobasher, Robert Cooley, Jaideep Srivastava

http://www.cs.umn.edu/research/websift/  ( Group )

An extensive web mining survey has been published by this group, primarily led by Mobasher when he was a student here.

Robert Cooley - UMN

Web Mining: Information and Pattern Discovery on the World Wide Web (1997)    (63 citations) - R. Cooley, B. Mobasher, J. Srivastava
Again, talks about Web usage mining in depth.

Bamshad Mobasher - DePaul

Web Mining: Pattern Discovery from World Wide Web Transactions (1996)   (32 citations) - Bamshad Mobasher, Namit Jain, Eui-Hong (Sam) Han, Jaideep Srivastava

He is very active in this field and is currently teaching a course on web data mining at DePaul. He has a number of publications listed on his publications page.

Cyrus Shahabi - USC

1. Cyrus Shahabi, Yi-Shin Chen, *Web Information Personalization: Challenges and Approaches*, The 3rd International Workshop on Databases in Networked Information Systems (DNIS 2003), Aizu-Wakamatsu, Japan, September, 2003
    - A pretty good paper that covers the currently hot "Recommender" systems.

2. Cyrus Shahabi and Farnoush Banaei-Kashani, *A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking*, In WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Springer-Verlag New York, 2002, ISBN 3-5404-3969-2
    - A book chapter

3. Yi-Shin Chen and Cyrus Shahabi, *Improving User Profiles for E-Commerce by Genetic Algorithms*, In E-Commerce and Intelligent Methods Studies in Fuzziness and Soft Computing, Kluwer Academic Publishers, 2002, ISBN 3-7908-1499-7
 - Something worth mentioning, I have also come across the use of some kind of genetic algorithms in other publications.

4. Cyrus Shahabi, Leila Kaghazian, Soham Mehta, Amol Ghoting, Gautam Shanbhag, and Margaret L. McLaughlin, Understanding of User Behavior in Immersive Environments, In Touch in Virtual Environments: Haptics and the Design of Interactive Systems, Margaret L. McLaughlin, Joao Hespanha, and Gaurav Sukhatme, Editors (all of University of Southern California), Prentice Hall, ISBN 0-13-065097-8

An older paper authored by him.

5. Cyrus Shahabi, Amir Zarkesh, Jafar Adibi, Vishal Shah, Knowledge Discovery from Users Web-Page Navigation, In Proceedings of the IEEE RIDE97 Workshop, April 1997.

Ronny Kohavi - Now at Amazon

Ten Supplementary Analyses to Improve E-commerce Web Sites, WEBKDD 2003
Lessons and Challenges from Mining Retail E-Commerce Data, by Ron Kohavi, et al., Journal of Machine Learning, 2003.

Raymond Kosala - Belgium

*Web Mining - A Survey* is a highly cited paper written by him.

Sarah Spiekermann - Berlin

Privacy and web-mining. Works in collaboration with Bettina Berendt, listed below.

Myra Spiliopoulou - Berlin

A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis (2003)
 - Looks at how to reconstruct user sessions from Web logs and how to demarcate session boundaries clearly.
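
As a rough illustration of session reconstruction, here is a minimal sketch of one common heuristic: start a new session whenever a visitor has been inactive for longer than a fixed timeout. It is not taken from the paper above, which evaluates such heuristics; the 30-minute default, the `(visitor, timestamp, url)` record layout, and the name `sessionize` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed record layout: (visitor id, timestamp in seconds, requested URL).
Request = Tuple[str, float, str]


def sessionize(requests: List[Request], timeout: float = 1800.0) -> List[List[str]]:
    """Split each visitor's requests into sessions at gaps longer than `timeout`."""
    by_visitor: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for visitor, timestamp, url in requests:
        by_visitor[visitor].append((timestamp, url))

    sessions: List[List[str]] = []
    for visitor in sorted(by_visitor):
        current: List[str] = []
        last_seen = None
        for timestamp, url in sorted(by_visitor[visitor]):
            # A long pause closes the running session and opens a new one.
            if last_seen is not None and timestamp - last_seen > timeout:
                sessions.append(current)
                current = []
            current.append(url)
            last_seen = timestamp
        if current:
            sessions.append(current)
    return sessions


# Hypothetical log entries for two visitors.
log = [
    ("a", 0.0, "/home"),
    ("a", 60.0, "/courses"),
    ("a", 4000.0, "/home"),
    ("b", 10.0, "/news"),
]
result = sessionize(log)
```

Real logs rarely carry a reliable visitor id, so in practice the id is itself a heuristic (for example IP address plus user agent), which is one reason the boundaries need evaluation.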

*Improving the Effectiveness of a Web Site with Web Usage Mining* (1999)  (11 citations)
 - Compares the navigational patterns of customers and non-customers to identify improvements in web page design and navigation. This paper also describes their miner, WUM.

WUM: A Web Utilization Miner (1998)  (17 citations)
 - Describes the miner (WUM) developed by this group. It is also supposed to have been the first Web Usage Miner.

Web Usage Mining for Web Site Evaluation, by Myra Spiliopoulou, Communications of ACM, August 2000.

Dr. Bettina Berendt - Berlin

She has worked on Semantic Web mining, and a paper authored by her is among the first 3 papers I listed:
NGDM - *Using Mining for and on the Semantic Web*  -  Gerd Stumme
    -  Discusses how Semantic Web/Ontologies can improve Web Usage Mining and the use of Mining techniques for the Semantic Web. In the Semantic Web links carry meaning and this can be used effectively for web structure mining.
In the usage mining area, additional applications are discussed in detail. Requested URLs can be mapped to a site concept hierarchy (a taxonomy of the site) to derive additional relationships. They argue that this approach helps in better understanding what the user needs. A simple example: a user searches for particular information on a web page, say www.umbc.edu/search.html?search="graduate courses". There should be some way to map this request to an ontology concept such as "graduate courses". Just knowing that the user browsed "search.html" does not capture what the user wanted. So the key is that what the user requested has to be captured in an ontology, and not just what the website served.
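
The search example can be sketched in a few lines: pull the query term out of the requested URL and look it up in a concept table. This is only a toy illustration of the idea, not the authors' method; the `CONCEPTS` table, its taxonomy paths, and the function name `concept_for_request` are invented here, and a real system would use a proper ontology instead of exact string matches.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical mapping from search terms to nodes in a site taxonomy.
CONCEPTS = {
    "graduate courses": "Academics/Graduate/Courses",
    "admissions": "Admissions",
}


def concept_for_request(url: str, param: str = "search") -> Optional[str]:
    """Return the taxonomy node for the search term in `url`, if one is known."""
    query = urlparse(url).query
    for term in parse_qs(query).get(param, []):
        # Normalise case and drop surrounding quotes before the lookup.
        key = term.strip().strip('"').lower()
        if key in CONCEPTS:
            return CONCEPTS[key]
    return None


plain = concept_for_request("http://www.umbc.edu/search.html?search=graduate+courses")
quoted = concept_for_request("http://www.umbc.edu/search.html?search=%22Graduate%20Courses%22")
no_query = concept_for_request("http://www.umbc.edu/search.html")
```

Logging the returned concept next to the raw URL records what the user asked for, rather than only the fact that "search.html" was served.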

Other issues are also discussed, along with the use of mining techniques for the Semantic Web. A very good idea here is that navigation patterns can be used to add semantics to a website: usage patterns show which semantics and link structure the users of the site expect, and this can guide the semantic annotation of the site.

I could not get hold of a recent paper:
Berendt, B., Günther, O., & Spiekermann, S. (in press). Privacy in E-Commerce: Stated preferences vs. actual behavior. To appear in Communications of the ACM.
But the bottom line is that people are looking at privacy issues seriously.

Bing Liu - UIC

Lan Yi, Bing Liu. "Eliminating Noisy Information in Web Pages for Data Mining." To appear in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, DC, USA, August 24 - 27, 2003.
- Talks about pruning websites for better mining. ( HTML pages )

Bing Liu, Chee Wee Chin, Hwee Tou Ng. "Mining Topic-Specific Concepts and Definitions on the Web." To appear in Proceedings of the twelfth international World Wide Web conference (WWW-2003), 20-24 May 2003, Budapest, HUNGARY.
- Paper talks about searching for all information related to a particular topic using web mining methods, to bring about an improvement on search engines.

Bing Liu, Kaidi Zhao, and Lan Yi. "*Visualizing Web site comparisons*." Proceedings of the Eleventh International World Wide Web Conference (WWW-2002). Honolulu, Hawaii, USA 7-11 May 2002.
- Proposes comparing two websites to figure out their structure and other attributes. Mostly useful for figuring out why your competitor is doing better business than you are.

Brij Masand - Data Miners

Ming-Syan Chen - National Taiwan Univ
Some of his papers, mainly on web information hierarchy mining:

1. H.-Y. Kao, S.-H. Lin, J.-M. Ho and M.-S. Chen, ``Mining Web Information Structures and Contents based on Entropy Analysis,'' IEEE Trans. on Knowledge and Data Engineering, Vol. 16, No. 1, January 2004
 - Suggests web structure mining techniques to better organize a website's link structure.

2. H.-Y. Kao, J.-M. Ho, and M.-S. Chen, ``Information Clustering on DOM with Multi-Granularity Centroid Converging for Web Information Hierarchy Mining,'' Proc. of the IEEE 2003 Intern'l Conf. on Web Intelligence (WI-2003), October 13-17, 2003
3. H.-Y. Kao, S.-H. Lin, J.-M. Ho and M.-S. Chen, ``Exploiting Hyperlink Analysis to Mine Informative Structures of News Web Sites,'' Proc. of the ACM 11th Intern'l Conf. on Information and Knowledge Management (CIKM-02), November 4-9, 2002, pp. 574-581
4. C.-H. Yun and M.-S. Chen, ``Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment,'' Proc. of the 24th annual Intern'l Computer Software and Application Conference (COMPSAC-2000), pp. 99-104, October 25-27, 2000.
5. I.-Y. Lin, X.-M. Huang and M.-S. Chen, ``Capturing User Access Patterns in the Web for Data Mining,'' Proc. of the 11th IEEE International Conference Tools with Artificial Intelligence, pp. 345-348, November 9-11, 1999

Dr. Yannis Manolopoulos

1. Manolopoulos, Y., Morzy, M., Morzy, T., Nanopoulos, A., Wojciechowski, M., Zakrzewicz, M.: "Indexing Techniques for Web Access Logs", chapter in the book "Web Information Systems," IDEA Group Inc., to appear, 2003
2. Nanopoulos A., Katsaros D. and Manolopoulos Y.: "A Data Mining Algorithm for Generalized Web Prefetching", IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, Sep./Oct. 2003
3. Nanopoulos A., Zakrzewicz M., Morzy T., Manolopoulos Y.: "Indexing Web Access-Logs for Pattern Queries", Proceedings 4th International Workshop on Web Information and Data Management (WIDM'02), pp.398-404, McLean, VA, 2002.
4. Nanopoulos A., Katsaros D. and Manolopoulos Y.: "Exploiting Web Log Mining for Web Cache Enhancement", Lecture Notes on Artificial Intelligence (LNAI), Springer-Verlag, vol. 2356, pp. 68-87, 2002.
5. Nanopoulos A., Katsaros D. and Manolopoulos Y.: "Effective Prediction of Web-user Accesses: a Data Mining Approach", Proceedings Conference on Mining Log Data Across All Customer Touchpoints (WebKDD), San Francisco, 2001

Information about publications gathered from CiteSeer
I have filtered out the irrelevant ones, and listed papers under their authors where I have covered those authors separately.

1. Web Structure Mining Exploiting the Graph Structure of World-Wide Web. - Fürnkranz
      A very recent paper on web structure mining. It discusses the classification of web mining and then focuses on Web Structure mining. It in turn cites other papers in this area:
     a) Data mining for hypertext: A tutorial survey (2000) - Soumen Chakrabarti
     b) F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, March 2002.

2. Towards Semantic Web Mining (2002) - Bettina Berendt, Andreas Hotho, Gerd Stumme
3. Complex Queries over Web Repositories  - Sriram Raghavan and Hector Garcia-Molina
4. Mining Topic Specific Concepts and Definitions on the Web  - Bing Liu, et al.
5. Hierarchical Document Clustering Using Frequent Itemsets - Benjamin C.M. Fung, Ke Wang, Martin Ester
6. Preprocessing and Mining Web Log Data for Web Personalization - M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini
7. Web Site Mining : A new way to spot Competitors, Customers And Suppliers in the World Wide Web (2002)  - Martin Ester, Hans-Peter Kriegel, Matthias Schubert
8. Web Mining for Web Image Retrieval  - Zheng Chen, Liu Wenyin, Feng Zhang, Mingjing Li, Hongjiang Zhang
9. User Intention Modeling in Web Applications Using Data Mining (2002)  - Zheng Chen, Fan Lin, Huan Liu, Wei-Ying Ma, Liu Wenyin
10. Geniminer: Web Mining With A Genetic-Based Algorithm F. Picarougne
11. i-Miner: A Web Usage Mining Framework Using Hierarchical Intelligent Systems (2003)  - Ajith Abraham
12. A Guide to Economists on Data Collection from the Web  - Michael X. Zhang
13. Clustering the Users of Large Web Sites into Communities  - Georgios Paliouras
14. Trawling the web for emerging cyber-communities (1999)  (64 citations) - Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins
15. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs (1998)  (42 citations) - Osmar R. Zaïane, Man Xin, Jiawei Han
16. The World Wide Web: quagmire or gold mine? (1996) (10 citations) - Oren Etzioni
17. Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions (2002)  (3 citations) - Sankar K. Pal, Varun Talwar, Pabitra Mitra
18. Inferring Web Communities from Link topology (1998)  (66 citations) - David Gibson, Jon Kleinberg, Prabhakar Raghavan
19. Parasite: Mining Structural Information on the Web (1999) (66 citations) - Ellen Spertus
20.  Extracting Patterns and Relations from the World Wide Web, by Sergey Brin, Stanford University.

Papers published in WebKDD workshops are a good resource. The most recent is WebKDD 2003.

Proceedings available here: http://www.acm.org/sigkdd/proceedings/webkdd03/

26 September, 2004 12:53