Human-in-the-Loop Entity Mining from Noisy Web Data

Professor Eduard Dragut, Temple University

1-2 pm, Tuesday, 6 April 2021
online via WebEx


Recognizing entities that follow or closely resemble a regular expression (regex) pattern is an important task in information extraction. Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks, such as identifying mentions of dates in a document, become very challenging. It is reasonable to claim that no regex can identify such entities in web documents with perfect precision and recall. Rather than abandoning regex as a go-to approach for entity detection, we present methods that combine the expressive power of regexes, the ability of deep learning to learn from large data, and the human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating regexes, or collecting existing ones, for a particular type of entity. Those regexes are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those regex-generated weak labels. Finally, a human expert is asked to label a set of documents, and the neural network is fine-tuned on those documents.
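As a rough illustration of the weak-labeling step only, the sketch below tags whitespace-separated tokens that fall inside a regex match. The two date patterns, the BIO tag names, and the function name weak_label are assumptions made for this example and are not taken from the talk; the neural network training and fine-tuning stages are not shown.

```python
import re

# Hypothetical date patterns, for illustration only (not from the talk).
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),            # e.g. 04/06/2021
    re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),  # e.g. 6 April 2021
]


def weak_label(text, patterns=DATE_PATTERNS):
    """Return (token, tag) pairs, where tokens that start inside a regex
    match receive a B-DATE or I-DATE tag and all other tokens receive O."""
    # Character spans of every match from every pattern.
    spans = [m.span() for p in patterns for m in p.finditer(text)]
    labeled = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"
        for start, end in spans:
            if start <= tok.start() < end:
                # First token of a match is B-, later tokens are I-.
                tag = "B-DATE" if tok.start() == start else "I-DATE"
                break
        labeled.append((tok.group(), tag))
    return labeled


if __name__ == "__main__":
    for token, tag in weak_label("The talk is on 6 April 2021 online"):
        print(token, tag)
```

Labels produced this way inherit every gap and false match of the regexes, which is why the abstract treats them as weak supervision to be corrected by a smaller set of expert-labeled documents.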

While human effort is critical to building an entity recognition model, surprisingly little is known about how best to invest that effort given a limited time budget. Should a human’s effort be spent on writing a regex that recognizes an entity or on manually labeling entity mentions in a document corpus? When a user is allowed to choose between regex construction and manual labeling, we discover that (1) if the time budget is low, spending all of the time on regex construction is often advantageous, (2) if the time budget is high, spending all of the time on manual labeling seems to be superior, and (3) between those two extremes, writing regexes followed by manual labeling is typically the best approach. I will also give an overview of ongoing and future projects.


Eduard Dragut is an Associate Professor in the Computer and Information Sciences Department at Temple University. He received his Ph.D. degree in Computer Science from the University of Illinois at Chicago. He was previously a Postdoctoral Research Associate at Purdue University, Discovery Park, Cyber Center. His main area of research is Web data management, e.g., retrieval, extraction, representation, cleaning, analysis, and integration. He is actively pursuing projects in Data Cleaning, Social Media Mining (e.g., user behavior and fake news), the Future of Work, and Cyber-Infrastructure for Scientific Research. He is co-author of a book on Deep Web data integration, Deep Web Query Interface Understanding and Integration.