org.apache.lucene.ant
public class HtmlDocument extends java.lang.Object
The HtmlDocument class creates a Lucene Document from an HTML document.
It does this using the JTidy package. It can take input from a File or an InputStream.
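As a rough illustration of the class described above, the following sketch parses a small HTML string through the `HtmlDocument(java.io.InputStream)` constructor and then reads `getTitle()` and `getBody()`. Several things here are assumptions rather than part of this page: the class is looked up reflectively only so that the sketch compiles and runs without the jar that provides `org.apache.lucene.ant` (and JTidy) on the classpath, and the names `HtmlDocumentTitleSketch` and `titleAndBody` as well as the sample HTML are hypothetical. The exact strings returned by `getTitle()` and `getBody()` are not specified here, so no output is shown.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class HtmlDocumentTitleSketch {

    /** Fully qualified name, taken from the package and class documented on this page. */
    static final String HTML_DOCUMENT_CLASS = "org.apache.lucene.ant.HtmlDocument";

    /**
     * Parses the stream with HtmlDocument when the class can be loaded and
     * returns "title | body"; otherwise returns a short notice.
     */
    static String titleAndBody(InputStream is) {
        try {
            Class<?> cls = Class.forName(HTML_DOCUMENT_CLASS);
            // With the library on the classpath this is simply: new HtmlDocument(is)
            Object htmlDoc = cls.getConstructor(InputStream.class).newInstance(is);
            // Direct equivalents: htmlDoc.getTitle() and htmlDoc.getBody()
            Object title = cls.getMethod("getTitle").invoke(htmlDoc);
            Object body = cls.getMethod("getBody").invoke(htmlDoc);
            return title + " | " + body;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            // The sketch still runs when Lucene or JTidy is missing.
            return "HtmlDocument is not on the classpath";
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("HtmlDocument call failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>World</p></body></html>";
        InputStream is = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(titleAndBody(is));
    }
}
```

In a project that already depends on the library, the reflective lookup would normally be replaced by the direct constructor and getter calls noted in the comments.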
Constructor and Description |
---|
HtmlDocument(java.io.File file) Constructs an HtmlDocument from a File. |
HtmlDocument(java.io.File file, java.lang.String tidyConfigFile) Constructs an HtmlDocument from a File. |
HtmlDocument(java.io.InputStream is) Constructs an HtmlDocument from an InputStream. |
Modifier and Type | Method and Description |
---|---|
static org.apache.lucene.document.Document | Document(java.io.File file) Creates a Lucene Document from a File. |
static org.apache.lucene.document.Document | Document(java.io.File file, java.lang.String tidyConfigFile) Creates a Lucene Document from a File. |
java.lang.String | getBody() Gets the bodyText attribute of the HtmlDocument object. |
static org.apache.lucene.document.Document | getDocument(java.io.InputStream is) Creates a Lucene Document from an InputStream. |
java.lang.String | getTitle() Gets the title attribute of the HtmlDocument object. |
static void | main(java.lang.String[] args) Runs HtmlDocument on the files specified on the command line. |
public HtmlDocument(java.io.File file) throws java.io.IOException
Constructs an HtmlDocument from a File.
file - the File containing the HTML to parse
java.io.IOException - if an I/O exception occurs
public HtmlDocument(java.io.InputStream is)
Constructs an HtmlDocument from an InputStream.
is - the InputStream containing the HTML
public HtmlDocument(java.io.File file, java.lang.String tidyConfigFile) throws java.io.IOException
Constructs an HtmlDocument from a File.
file - the File containing the HTML to parse
tidyConfigFile - the String containing the full path to the Tidy config file
java.io.IOException - if an I/O exception occurs
public static org.apache.lucene.document.Document Document(java.io.File file, java.lang.String tidyConfigFile) throws java.io.IOException
Creates a Lucene Document from a File.
file - the File containing the HTML to parse
tidyConfigFile - the full path to the Tidy config file
java.io.IOException - if an I/O exception occurs
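Since `tidyConfigFile` is described as the full path to the Tidy config file, the following sketch shows one way such a path might be produced: it writes a few option lines to a file and returns the absolute path as a `String`. The option names (`quiet`, `show-warnings`, `tidy-mark`) are standard HTML Tidy options and are only assumed to be accepted by the JTidy version in use. The file name `tidy.properties`, the class `TidyConfigPathSketch`, and the helper `writeTidyConfig` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class TidyConfigPathSketch {

    /** Example option lines in Tidy's "name: value" format (assumed to suit JTidy). */
    static final List<String> OPTIONS = Arrays.asList(
            "quiet: yes",
            "show-warnings: no",
            "tidy-mark: no");

    /** Writes the options into a new temp directory and returns the file's full path. */
    static String writeTidyConfig() {
        try {
            Path dir = Files.createTempDirectory("tidy-config");
            Path config = dir.resolve("tidy.properties"); // hypothetical file name
            Files.write(config, OPTIONS, StandardCharsets.UTF_8);
            return config.toAbsolutePath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String tidyConfigFile = writeTidyConfig();
        System.out.println(tidyConfigFile);
        // With the library on the classpath, the path could then be passed on, e.g.:
        //   Document doc = HtmlDocument.Document(new File("page.html"), tidyConfigFile);
        // where "page.html" is a placeholder for a real HTML file.
    }
}
```

Passing an absolute path avoids depending on the working directory of the process that performs the parsing.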
public static org.apache.lucene.document.Document getDocument(java.io.InputStream is)
Creates a Lucene Document from an InputStream.
is - the InputStream containing the HTML
public static org.apache.lucene.document.Document Document(java.io.File file) throws java.io.IOException
Creates a Lucene Document from a File.
file - the File containing the HTML to parse
java.io.IOException - if an I/O exception occurs
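The static `Document(java.io.File)` factory can be exercised in a similar spirit. The sketch below writes a made-up HTML page to a temporary file and hands it to the factory, printing whatever the returned Lucene `Document` reports via `toString()`. As assumptions: the lookup is reflective only so the sketch runs without the library on the classpath, the names `HtmlFileToDocumentSketch`, `writeSamplePage`, and `describe` are hypothetical, and which fields the resulting `Document` contains is not stated on this page.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class HtmlFileToDocumentSketch {

    /** Writes a small, made-up HTML page to a temporary file. */
    static File writeSamplePage() {
        try {
            File page = File.createTempFile("sample", ".html");
            page.deleteOnExit();
            String html = "<html><head><title>Sample</title></head>"
                    + "<body><p>Hello</p></body></html>";
            Files.write(page.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return page;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Calls the static HtmlDocument.Document(File) factory if the class can be loaded. */
    static String describe(File file) {
        try {
            Class<?> cls = Class.forName("org.apache.lucene.ant.HtmlDocument");
            // Direct equivalent: Document doc = HtmlDocument.Document(file);
            Object doc = cls.getMethod("Document", File.class).invoke(null, file);
            return String.valueOf(doc);
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return "HtmlDocument is not on the classpath";
        } catch (ReflectiveOperationException e) {
            // Includes the declared java.io.IOException, which reflection wraps.
            throw new IllegalStateException("HtmlDocument.Document(File) failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(writeSamplePage()));
    }
}
```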
public static void main(java.lang.String[] args) throws java.lang.Exception
Runs HtmlDocument on the files specified on the command line.
args - Command line arguments
java.lang.Exception - Description of Exception
public java.lang.String getTitle()
Gets the title attribute of the HtmlDocument object.
public java.lang.String getBody()
Gets the bodyText attribute of the HtmlDocument object.
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.