UMBC Computer Science and Electrical Engineering
Ph.D. Dissertation Proposal

Paper form digitization for information systems strengthening and socio-economic development in developing countries

Huguens Jean

3:00pm Tuesday, 5 March 2013, ITE346, UMBC

In developing countries, people are now more likely to have access to a mobile phone than clean water, making cellular-based technology the only viable medium for collecting, aggregating, and communicating local data so that it can be turned into useful information. While mobile phones have found broad application in reporting health, financial, and environmental data, many data collection methods still suffer from delays, inefficiency, and difficulty maintaining quality. In environments with insufficient IT support and infrastructure, and among populations with limited education and experience with technology, paper forms rather than electronic methods remain the predominant means of data collection. To meet the digitization needs of paper-driven data collection practices, this thesis proposes the development and study of a software platform that automatically converts unknown paper forms into structured digital data and uses human intelligence when necessary to improve its performance.

We begin by identifying a high-level system architecture for dealing with infrastructure constraints and human resource limitations. We then break the architecture into its integral pieces and organize them into three distinct, interacting functional stages: data collection, data conversion, and crowdsourcing. In the collection phase, we focus on visually detecting structurally identical form instances and transmitting the images of their raw input data to a remote server. For this phase, we present a novel framework for identifying specific form types by generating a multipart template for unknown forms and decomposing the form identification problem into three distinct tasks: similar image retrieval, learning, and duplicate matching. The conversion phase uses a mixture of Optical Character Recognition (OCR) and human annotation techniques to convert images into digital information and group structurally identical forms into their respective database tables. In the crowdsourcing phase, we investigate how to use low-end smartphones for collecting training information to improve OCR-related tasks and verify the accuracy of converted input values. We place special emphasis on identifying natural forms of interaction that lower the technical and knowledge threshold for local residents. Furthermore, because crowdsourcing can also provide income to the mobile workers of its micro-tasking platform, we concurrently explore how systems that facilitate collaboration between humans and machines to improve the quality of intelligent information systems can be used as a vehicle for delivering socioeconomic opportunities to developing countries.
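
The decomposition of form identification into retrieval, duplicate matching, and learning can be illustrated with a minimal sketch. It is not the proposal's implementation: the layout signature (a tuple of cell values), the TemplateStore class, the agreement-based similarity measure, and the 0.8 threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateStore:
    """Hypothetical store of known form layouts, keyed by form type id."""
    threshold: float = 0.8  # assumed cut-off for "structurally identical"
    templates: dict = field(default_factory=dict)

    @staticmethod
    def similarity(a, b):
        """Fraction of layout cells on which two signatures agree."""
        if not a or len(a) != len(b):
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def retrieve(self, signature):
        """Similar image retrieval: return (best_id, best_score)."""
        best_id, best_score = None, 0.0
        for form_id, template in self.templates.items():
            score = self.similarity(signature, template)
            if score > best_score:
                best_id, best_score = form_id, score
        return best_id, best_score

    def identify(self, signature):
        """Return a form type id, registering a new type when none matches."""
        best_id, best_score = self.retrieve(signature)
        # Duplicate matching: accept the retrieved template only if it
        # agrees with the new form on enough layout cells.
        if best_id is not None and best_score >= self.threshold:
            return best_id
        # Learning: store the unknown layout as a new template.
        new_id = len(self.templates)
        self.templates[new_id] = tuple(signature)
        return new_id
```

In this sketch, a form whose signature nearly matches a stored template is assigned that template's id, while a form with an unfamiliar layout is registered as a new type. A real system would derive signatures from form images rather than receive them directly.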

Committee: Dr. Timothy Oates (Chair), Dr. Janet Rutledge, Dr. Fow-Sen Choa, Dr. Jesus Caban