Combining Web Classification and Web Information Extraction
Library Cataloging

Towards Combining Web Classification and Web Information Extraction: a Case Study by Ping Luo, Fen Lin, Yuhong Xiong, Yong Zhao, and Zhongzhi Shi appears as HP Tech report HPL-2009-86.
Web content analysis often has two sequential and separate steps: Web Classification to identify the target Web pages and Web Information Extraction to extract the metadata contained in the target Web pages. This decoupled strategy is highly ineffective since the errors in Web classification will be propagated to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose to combine them by using a model of Conditional Random Fields (CRFs). This model can be used to simultaneously recognize the target Web pages and extract the corresponding metadata. Systematic experiments in our project OfCourse for online course search show that this model significantly improves the F1 value for both of the two steps. We believe that our model can be easily generalized to many Web applications.
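The error-propagation argument can be made concrete with a little arithmetic. The sketch below is not from the paper. It assumes, for illustration only, that the two stages fail independently and that extraction recall is measured on correctly classified pages; the function names and the numbers are hypothetical.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pipeline_recall(classifier_recall: float, extractor_recall: float) -> float:
    """End-to-end recall of a decoupled pipeline.

    Assumption: a metadata field can only be extracted from a page the
    classifier kept, and the two stages err independently.
    """
    return classifier_recall * extractor_recall


# Hypothetical stage-level figures, not results from the paper.
clf_recall = 0.90
ext_recall = 0.90

end_to_end = pipeline_recall(clf_recall, ext_recall)
print(f"end-to-end recall: {end_to_end:.2f}")
```

Under these assumptions, two stages that each look good in isolation lose noticeably more in sequence, which is the motivation the abstract gives for modeling both steps jointly.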






