<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Documents on Documents</title>
	<atom:link href="http://docsondocs.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://docsondocs.wordpress.com</link>
	<description></description>
	<lastBuildDate>Sat, 13 Sep 2008 13:22:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='docsondocs.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Documents on Documents</title>
		<link>http://docsondocs.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://docsondocs.wordpress.com/osd.xml" title="Documents on Documents" />
	<atom:link rel='hub' href='http://docsondocs.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Ha et al.:  Recursive X-Y cut using bounding boxes of connected components</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/ha-et-al-recursive-x-y-cut-using-bounding-boxes-of-connected-components/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/ha-et-al-recursive-x-y-cut-using-bounding-boxes-of-connected-components/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 22:30:30 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[1995]]></category>
		<category><![CDATA[bitmap]]></category>
		<category><![CDATA[icdar]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[washington]]></category>
		<category><![CDATA[x-y cut]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=94</guid>
		<description><![CDATA[By: Ha, J., Haralick, R.M., Phillips, I.T. Notes: The X-Y cut paper! Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=602059 Abstract: A top-down page segmentation technique known as the recursive X-Y cut decomposes a document image recursively into a set of rectangular blocks. This paper proposes that the recursive X-Y cut be implemented using bounding boxes of connected components of black [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By:</strong> Ha, J., Haralick, R.M., Phillips, I.T.</p>
<p><strong>Notes:</strong> The X-Y cut paper!</p>
<p><strong>Available: </strong>http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=602059</p>
<p><strong>Abstract: </strong> A top-down page segmentation technique known as the recursive X-Y cut decomposes a document image recursively into a set of rectangular blocks. This paper proposes that the recursive X-Y cut be implemented using bounding boxes of connected components of black pixels instead of using image pixels. The advantage is that great improvement can be achieved in computation. In fact, once bounding boxes of connected components are obtained, the recursive X-Y cut is completed within an order of a second on Sparc-10 workstations for letter-sized document images scanned at 300 dpi resolution.</p>
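<p>The idea in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: connected-component extraction is assumed to have been done already, boxes are hypothetical <code>(x0, y0, x1, y1)</code> tuples, and the names <code>xy_cut</code>, <code>split_on_gaps</code> and the <code>min_gap</code> threshold are made up for this example. Working on box intervals rather than on pixels is what keeps the projection step cheap.</p>
<pre>
```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[int, int, int, int]


def split_on_gaps(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes into runs separated by empty gaps of at least min_gap on one axis.

    axis=0 projects onto x (gaps give vertical cuts),
    axis=1 projects onto y (gaps give horizontal cuts).
    """
    lo, hi = axis, axis + 2  # indices of the start/end coordinate on this axis
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # wide enough white gap: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: int = 10) -> List[List[Box]]:
    """Return the leaf blocks of a recursive X-Y cut over bounding boxes."""
    if not boxes:
        return []
    for axis in (1, 0):  # assumption: try horizontal cuts first, then vertical
        groups = split_on_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            leaves = []
            for group in groups:
                leaves.extend(xy_cut(group, min_gap))
            return leaves
    return [boxes]  # no cut possible on either axis: this is a leaf block
```
</pre>
<p>For example, given two stacked boxes on the left and one tall box on the right, <code>xy_cut</code> first finds the vertical gutter between the columns and then the horizontal gap inside the left column, which gives three leaf blocks.</p>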
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/ha-et-al-recursive-x-y-cut-using-bounding-boxes-of-connected-components/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Table papers (to be expanded later&#8230;)</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/table-papers-to-be-expanded-later/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/table-papers-to-be-expanded-later/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 22:20:17 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=92</guid>
		<description><![CDATA[Nagy: Why Table Ground Truthing is Hard Kieninger: Table Recognition based on Robust Block Segmentation Embley, Hurst, Lopresti, Nagy (IJDAR): Table-processing paradigms: a research survey]]></description>
			<content:encoded><![CDATA[<p>Nagy: Why Table Ground Truthing is Hard</p>
<p>Kieninger: Table Recognition based on Robust Block Segmentation</p>
<p>Embley, Hurst, Lopresti, Nagy (IJDAR): Table-processing paradigms: a research survey</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/table-papers-to-be-expanded-later/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Rus and Summers: Geometric Algorithms and Experiments for Automated Document Structuring</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/rus-and-summers-geometric-algorithms-and-experiments-for-automated-document-structuring/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/rus-and-summers-geometric-algorithms-and-experiments-for-automated-document-structuring/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 22:11:16 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[1997]]></category>
		<category><![CDATA[ascii]]></category>
		<category><![CDATA[cornell]]></category>
		<category><![CDATA[indentation alphabet]]></category>
		<category><![CDATA[mathematical and computer modelling]]></category>
		<category><![CDATA[ps2ascii]]></category>
		<category><![CDATA[rule-based]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[table recognition]]></category>
		<category><![CDATA[wdg]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=89</guid>
		<description><![CDATA[By: Rus, D., Summers, K. Notes: an old paper, which deals with everything under the sun&#8230; what interests us is: (1) mentions the notion of a &#8216;zoomed-out view&#8217;; they use an OCR for scanned documents; postscript documents are used too (using a parser on top of ps2ascii) and a tree structure for the layout. The [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By:</strong> Rus, D., Summers, K.</p>
<p><strong>Notes: </strong>an old paper, which deals with everything under the sun&#8230; what interests us is: it mentions the notion of a &#8216;zoomed-out view&#8217;; they use OCR for scanned documents; PostScript documents are used too (using a parser on top of ps2ascii), with a tree structure for the layout.  The segmentation algorithm is based on, among other things, indentation alphabets and grammar-based (rule-based?) logical manipulations to generate a hierarchy (sounds too inflexible to me&#8230;).  The table algorithm uses WDG and seems to be limited to monospaced ASCII text.  Also includes some kind of DU for line drawings&#8230; ;-)</p>
<p><strong>Journal:</strong> Mathematical and Computer Modelling, 1997</p>
<p><strong>Available:</strong> http://www.dbai.tuwien.ac.at/education/wie/SS06/papers/94-Rus_Summers&#8211;Geometric_Algorithms_and_Experiments_for_Automated_Document_Structuring.ps.gz<br />
(ps also available at citeseer)</p>
<p><strong>Abstract: </strong>We present and analyse algorithms for the automated segmentation and classification of layout structures in electronic documents.  The key idea is to use the patterns in the distribution of white space in a document to recognize and interpret its components.  The segmentation algorithms classify these divisions as base-text, tables, indented lists, polygonal drawings and graphs.  We present experimental data and discuss an information access application.  Our methodology allows the automatic markup of documents (for instance in the SGML format) and the creation of multi-level indices and browsing tools for electronic libraries.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/rus-and-summers-geometric-algorithms-and-experiments-for-automated-document-structuring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Altamura et al.: Transforming paper documents into XML format with WISDOM++</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/altamura-et-al-transforming-paper-documents-into-xml-format-with-wisdom/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/altamura-et-al-transforming-paper-documents-into-xml-format-with-wisdom/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 21:37:49 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[2001]]></category>
		<category><![CDATA[bari]]></category>
		<category><![CDATA[bitmap]]></category>
		<category><![CDATA[definitions]]></category>
		<category><![CDATA[document analysis]]></category>
		<category><![CDATA[document classification]]></category>
		<category><![CDATA[document understanding]]></category>
		<category><![CDATA[document image analysis]]></category>
		<category><![CDATA[ijdar]]></category>
		<category><![CDATA[induction of decision trees]]></category>
		<category><![CDATA[layout analysis]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[text recognition]]></category>
		<category><![CDATA[text transformation]]></category>
		<category><![CDATA[wisdom++]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=84</guid>
		<description><![CDATA[By: Altamura, O., Esposito, F., Malerba, D. Notes: source is a scan; uses the traditional bitmap morphological etc. techniques; decision-tree (learning) based classification of blocks followed by knowledge-based detection of the layout structure (layout analysis) &#8212; declarative knowledge in prolog &#8212; followed by document classification and document understanding &#8212; a good paper to cite just [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Altamura, O., Esposito, F., Malerba, D.</p>
<p><strong>Notes: </strong>source is a scan; uses the traditional bitmap morphological etc. techniques; decision-tree (learning) based classification of blocks followed by knowledge-based detection of the layout structure (layout analysis) &#8212; declarative knowledge in prolog &#8212; followed by document classification and document understanding &#8212; a good paper to cite just for definitions!  finally, all this gets transformed into html/xml &#8212; but not as we know it (lixto)</p>
<p><strong>Available: </strong>http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=38267FF42197FE109A013A6EE76EB09D?doi=10.1.1.33.4193&amp;rep=rep1&amp;type=pdf</p>
<p><strong>Journal: </strong>IJDAR (2001)</p>
<p><strong>Abstract: </strong>The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: <em>document analysis, document classification, document understanding, text recognition</em> with an OCR, and <em>transformation</em> into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/altamura-et-al-transforming-paper-documents-into-xml-format-with-wisdom/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Lovegrove and Brailsford: Document analysis of PDF files: methods, results and implications</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/lovegrove-and-brailsford-document-analysis-of-pdf-files-methods-results-and-implications/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/lovegrove-and-brailsford-document-analysis-of-pdf-files-methods-results-and-implications/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 21:25:26 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[1995]]></category>
		<category><![CDATA[acrobat sdk]]></category>
		<category><![CDATA[blackboard architecture]]></category>
		<category><![CDATA[document analysis]]></category>
		<category><![CDATA[electronic publishing]]></category>
		<category><![CDATA[knowledge base]]></category>
		<category><![CDATA[line-finding]]></category>
		<category><![CDATA[logical labelling]]></category>
		<category><![CDATA[logical relationships]]></category>
		<category><![CDATA[nottingham]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[segmentation]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=79</guid>
		<description><![CDATA[By: Lovegrove, W.S., Brailsford, D.F. Notes: Probably the first paper on DA of PDF.  Make use of the Adobe SDK.  Segmentation is via simple concepts based on the PDF objects (after pre-processing (line finding?) by the SDK).  Logical labelling: Mention blackboard architectures (would be an idea to look at a paper on these); for logical [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Lovegrove, W.S., Brailsford, D.F.</p>
<p><strong>Notes: </strong>Probably the first paper on DA of PDF.  Makes use of the Adobe Acrobat SDK.  Segmentation is via simple concepts based on the PDF objects (after pre-processing (line finding?) by the SDK).  Logical labelling: mentions blackboard architectures (would be an idea to look at a paper on these) and uses knowledge-base techniques.  For &#8216;document understanding&#8217; (i.e. logical relationships; not sure if reading order too) it uses rule-based heuristics. Now a long time ago!</p>
<p><strong>Available: </strong>http://www.cs.nott.ac.uk/~dfb/Publications/Download/1995/will.pdf</p>
<p><strong>Journal: </strong>Electronic Publishing, Vol. 8 (2 &amp; 3), pp. 207-220 (June &amp; Sept. 1995)</p>
<p><strong>Summary: </strong>A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the ‘semantic gap’ between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/lovegrove-and-brailsford-document-analysis-of-pdf-files-methods-results-and-implications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Hadjar et al.: Xed: a new tool for eXtracting hidden structures from Electronic Documents</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/hadjar-et-al-xed-a-new-tool-for-extracting-hidden-structures-from-electronic-documents/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/hadjar-et-al-xed-a-new-tool-for-extracting-hidden-structures-from-electronic-documents/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 21:18:11 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[2004]]></category>
		<category><![CDATA[bitmap]]></category>
		<category><![CDATA[dial]]></category>
		<category><![CDATA[diuf]]></category>
		<category><![CDATA[fribourg]]></category>
		<category><![CDATA[ingold]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[reading order]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[xed]]></category>
		<category><![CDATA[xmillum]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=72</guid>
		<description><![CDATA[By: Hadjar, K., Rigamonti, M., Lalanne, D., Ingold, R. Notes: PDF!!! But they seem to fall back to a rendered version to perform segmentation, etc. This paper describes how the (single, full!) reading order of the front page of the International Herald Tribune is found. XMillum is mentioned Available: http://diuf.unifr.ch/people/lalanned/Articles/XedDIAL04.pdf Workshop: Document Image Analysis for [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Hadjar, K., Rigamonti, M., Lalanne, D., Ingold, R.</p>
<p><strong>Notes: </strong>PDF!!! But they seem to fall back to a rendered version to perform segmentation, etc.  This paper describes how the (single, full!) reading order of the front page of the International Herald Tribune is found.  XMillum is mentioned.</p>
<p><strong>Available: </strong>http://diuf.unifr.ch/people/lalanned/Articles/XedDIAL04.pdf</p>
<p><strong>Workshop: </strong>Document Image Analysis for Libraries, DIAL04</p>
<p><strong>Abstract: </strong>PDF became a very common format for exchanging printable documents. Further, it can be easily generated from the major documents formats, which make a huge number of PDF documents available over the net. However its use is limited to displaying and printing, which considerably reduces the search and retrieval capabilities. For this reason, additional tools have recently appeared that allow to extract the textual content. However their practical use is limited in the sense that the text&#8217;s reading order is not necessary preserved, especially when handling multicolumn documents, or in presence of complex layout. Our thesis is that those tools do not consider the hidden layout and logical structures of documents, which could greatly improve their results. We propose a novel approach to overcome the document content extraction, by merging a) low-level extraction methods applied on PDF files with b) layout analysis performed on a synthetically generated TIFF image. The paper describes the various steps necessary to achieve this task. Finally, we present a first experiment on the restitution of the newspapers&#8217; reading order which shows encouraging results.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/hadjar-et-al-xed-a-new-tool-for-extracting-hidden-structures-from-electronic-documents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Summers: Automatic Discovery of Logical Document Structure (PhD thesis)</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/summers-automatic-discovery-of-logical-document-structure-phd-thesis/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/summers-automatic-discovery-of-logical-document-structure-phd-thesis/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 21:05:34 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[1998]]></category>
		<category><![CDATA[cornell]]></category>
		<category><![CDATA[summers]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=70</guid>
		<description><![CDATA[Available: http://ecommons.library.cornell.edu/bitstream/1813/7352/1/98-1698.pdf deals with: everything you could possibly want to know about logical structure! (in more detail than i would need it, for example)  1998 vintage]]></description>
			<content:encoded><![CDATA[<p><strong>Available: </strong>http://ecommons.library.cornell.edu/bitstream/1813/7352/1/98-1698.pdf</p>
<p>Deals with: everything you could possibly want to know about logical structure! (In more detail than I would need, for example.) 1998 vintage.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/summers-automatic-discovery-of-logical-document-structure-phd-thesis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Chao et al.: PDF Document Layout Study with Page Elements and Bounding Boxes</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/chao-et-al-pdf-document-layout-study-with-page-elements-and-bounding-boxes/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/chao-et-al-pdf-document-layout-study-with-page-elements-and-bounding-boxes/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 20:56:30 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[2001]]></category>
		<category><![CDATA[acrobat sdk]]></category>
		<category><![CDATA[bounding boxes]]></category>
		<category><![CDATA[Distiller]]></category>
		<category><![CDATA[dlia]]></category>
		<category><![CDATA[hewlett-packard]]></category>
		<category><![CDATA[PDEElement]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[PDFWriter]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[text run]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=68</guid>
		<description><![CDATA[By: Chao, H., Beretta, G., Sang, H. Notes: unlike Chao&#8217;s previous paper, this paper doesn&#8217;t deal with document understanding per se, but rather examines the way PDFs are created from a variety of applications, e.g. Microsoft Word. A similar study should also be included in my thesis! Here, the Adobe SDK is used (which, I [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Chao, H., Beretta, G., Sang, H.</p>
<p><strong>Notes: </strong>unlike Chao&#8217;s previous paper, this paper doesn&#8217;t deal with document understanding per se, but rather examines the way PDFs are created from a variety of applications, e.g. Microsoft Word.  A similar study should also be included in my thesis!  Here, the Adobe SDK is used (which, I believe, performs some pre-processing &#8212; they talk of <em>text runs</em>).  I need to do the same with PDFBox&#8230;</p>
<p><strong>Available: </strong>http://www.science.uva.nl/events/dlia2001/program/s12_DL03.pdf</p>
<p><strong>Workshop: </strong><span>Workshop on Document Layout Interpretation and its Applications (DLIA01)</span></p>
<p><strong>Abstract:</strong> The Portable Document Format (PDF) has been<br />
mostly used for posting the final form of documents. The<br />
aim of our project is to analyze the layout, to modify the<br />
layout or to re-use elements of PDF documents for<br />
different media. Using PDFEdit in Adobe SDK, we built<br />
tool to study the layout of documents and tool to select<br />
page elements to compose a new page. We demonstrate<br />
problems we encountered and propose possible solutions.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/chao-et-al-pdf-document-layout-study-with-page-elements-and-bounding-boxes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Chao and Fan: Layout and Content Extraction for PDF Documents</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/chao-and-fan-layout-and-content-extraction-for-pdf-documents/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/chao-and-fan-layout-and-content-extraction-for-pdf-documents/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 20:42:11 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[2004]]></category>
		<category><![CDATA[bottom-up]]></category>
		<category><![CDATA[das]]></category>
		<category><![CDATA[hewlett-packard]]></category>
		<category><![CDATA[line-finding]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[score 4]]></category>
		<category><![CDATA[segmentation]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=66</guid>
		<description><![CDATA[By: Chao, H. and Fan, J. Notes: PDF!!! Uses a somewhat similar method for bottom-up &#8216;clustering&#8217; to form segments (using e.g. line spacing and thresholds). Interesting heuristics which merge lines to form paragraphs and &#8220;flows&#8221;, without the requirement that they be rectangular. Available: http://www.springerlink.com/index/B928PLAETK53AX91.pdf direct download: http://www.springerlink.com/content/b928plaetk53ax91/fulltext.pdf (restricted); also on Google Books Conference: DAS 2004 [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Chao, H. and Fan, J.</p>
<p><strong>Notes: </strong>PDF!!! Uses a somewhat similar method for bottom-up &#8216;clustering&#8217; to form segments (using e.g. line spacing and thresholds).  Interesting heuristics which merge lines to form paragraphs and &#8220;flows&#8221;, without the requirement that they be rectangular.</p>
<p><strong>Available: </strong>http://www.springerlink.com/index/B928PLAETK53AX91.pdf<br />
direct download: http://www.springerlink.com/content/b928plaetk53ax91/fulltext.pdf (restricted); also on Google Books</p>
<p><strong>Conference: </strong>DAS 2004</p>
<p><strong>Abstract: </strong>Portable document format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and do not have basic high-level document logical structural information, which makes the reuse or modification of the documents difficult. We developed techniques that identified logical components on a PDF document page. The outlines, style attributes and the contents of the logical components were extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and the content of a PDF document page.</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/chao-and-fan-layout-and-content-extraction-for-pdf-documents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
		<item>
		<title>Schuermann et al.: Document Analysis &#8212; From Pixels to Contents</title>
		<link>http://docsondocs.wordpress.com/2008/09/11/schuermann-et-al-document-analysis-from-pixels-to-contents/</link>
		<comments>http://docsondocs.wordpress.com/2008/09/11/schuermann-et-al-document-analysis-from-pixels-to-contents/#comments</comments>
		<pubDate>Thu, 11 Sep 2008 20:13:42 +0000</pubDate>
		<dc:creator>tamirhassan</dc:creator>
				<category><![CDATA[papers]]></category>
		<category><![CDATA[1992]]></category>
		<category><![CDATA[binarization]]></category>
		<category><![CDATA[bitmap]]></category>
		<category><![CDATA[bottom-up]]></category>
		<category><![CDATA[connected component methods]]></category>
		<category><![CDATA[document analysis]]></category>
		<category><![CDATA[hypotheses]]></category>
		<category><![CDATA[line-finding]]></category>
		<category><![CDATA[morphological operations]]></category>
		<category><![CDATA[multi-level]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[Proceedings of the IEEE]]></category>
		<category><![CDATA[score 2]]></category>
		<category><![CDATA[skew detection]]></category>
		<category><![CDATA[top-down]]></category>
		<category><![CDATA[word/space detection]]></category>

		<guid isPermaLink="false">http://docsondocs.wordpress.com/?p=55</guid>
		<description><![CDATA[By: Schuermann, J., Bartneck, N., Bayer, T., Franke, J., Mandler, E., Oberlaender, M. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=156473 (restricted) Notes: The majority of this paper deals with bitmap images and describes an OCR system with some basic logical structure detection. Although the paper claims only to deal with document analysis, some basic document understanding is performed, too. Topics [...]]]></description>
			<content:encoded><![CDATA[<p><strong>By: </strong>Schuermann, J., Bartneck, N., Bayer, T., Franke, J., Mandler, E., Oberlaender, M.</p>
<p><strong>Available: </strong>http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=156473 (restricted)</p>
<p><strong>Notes: </strong>The majority of this paper deals with bitmap images and describes an OCR system with some basic logical structure detection.  Although the paper claims only to deal with document analysis, some basic document understanding is performed, too.  Topics of interest include: mention of top-down and bottom-up approaches (for segmentation) and the importance that <em>the document be analysed on several levels with the possibility to accept or reject hypotheses at a later stage</em>; binarization, morphological operations, connected component methods, skew detection, line-finding (based on a pixel image), word/space detection and lots of stuff that doesn&#8217;t really concern us.</p>
<p><strong>Abstract: </strong> The authors present a conceptual framework for solving the task of document analysis, which, in essence, consists in the conversion of the document&#8217;s pixel representation into an equivalent knowledge network representation holding the document&#8217;s content and layout. Starting on the pixel level, the formation of elementary geometric objects on which layout analysis as well as the definition of character objects is based is described. Character recognition accomplishes the mapping from geometric object to character meaning in ASCII representation. On the next level of abstraction words are formed and verified by contextual processing. Modeled knowledge about complete documents and about how their constituents are related to the application forms the highest level of abstraction. The various problems arising at each stage are discussed. The dependencies between the different levels are exemplified and technical solutions put forward</p>
<p><strong>Journal: </strong><a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5"><strong>Proceedings of the IEEE</strong></a><br />
Publication Date: Jul 1992<br />
Volume: 80, <a href="http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=4050&amp;isYear=1992">Issue: 7</a><br />
On page(s): 1101-1119</p>
]]></content:encoded>
			<wfw:commentRss>http://docsondocs.wordpress.com/2008/09/11/schuermann-et-al-document-analysis-from-pixels-to-contents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/ddb4e9ffde1e700ffe463d3e1c9a35c3?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">tamirhassan</media:title>
		</media:content>
	</item>
	</channel>
</rss>
