An overview of an architecture for a document harvesting system. The paper describes such systems well and provides a useful vocabulary for discussing them. My favorite section is the one on 'possible problems with web pages' - from where I sit, there is no 'possible' about it; I have encountered every one of these problems with my own harvester.