PdfExtractor
Inheritance: java.lang.Object, com.aspose.pdf.facades.IVentureLicenseTarget, com.aspose.pdf.facades.Facade
public final class PdfExtractor extends Facade
Class for extracting images and text from PDF document.
Constructors
Constructor | Description |
---|---|
PdfExtractor() | Initializes a new PdfExtractor object. |
PdfExtractor(IDocument document) | Initializes a new PdfExtractor object based on the document. |
Methods
Method | Description |
---|---|
getStartPage() | Gets start page in the page range where extracting operation will be performed. |
setStartPage(int value) | Sets start page in the page range where extracting operation will be performed. |
getEndPage() | Gets end page in the page range where extracting operation will be performed. |
setEndPage(int value) | Sets end page in the page range where extracting operation will be performed. |
getExtractTextMode() | Gets the mode used for text extraction. |
setExtractTextMode(int value) | Sets the mode used for text extraction. |
getTextSearchOptions() | Gets text search options. |
setTextSearchOptions(TextSearchOptions value) | Sets text search options. |
getExtractImageMode() | Gets the mode for the image extraction process. |
setExtractImageMode(int value) | Sets the mode for the image extraction process. |
isBidi() | Returns true when the text contains Hebrew or Arabic symbols. |
extractText() | Extracts text from a Pdf document. |
extractText(Charset encoding) | Extracts text from a Pdf document using specified encoding. |
extractTextInternal(TextEncodingInternal encoding) | For Internal usage only |
getText(String outputFile) | Saves text to file. |
getText(OutputStream outputStream) | Saves text to stream. |
bindPdf(String inputFile) | Bind input PDF file. |
bindPdf(InputStream inputStream) | Binds PDF document from stream. |
extractImage() | Extract images from PDF file. |
hasNextImage() | Checks if more images are accessible in PDF document. |
getNextImage(String outputFile) | Retrieves the next image from the PDF document. |
getNextImage(String outputFile, ImageType format) | Retrieves the next image from the PDF document with the given image format. |
getNextImage(OutputStream outputStream, ImageType format) | Retrieves the next image from the PDF file and stores it in a stream with the given image format. |
getNextImage(OutputStream outputStream) | Retrieves the next image from the PDF file and stores it in a stream. |
getAttachNames() | Returns list of attachments in PDF file. |
extractAttachment() | Extracts attachments from a Pdf document. |
extractAttachment(String attachmentFileName) | Extracts an attachment from the PDF file by attachment name. |
getAttachment(String outputPath) | Stores attachment into file. |
hasNextPageText() | Indicates whether more page text is available. |
getNextPageText(String outputFile) | Saves one page’s text to file. |
getNextPageText(OutputStream outputStream) | Saves one page’s text to stream. |
getText(OutputStream outputStream, boolean filterNotAscii) | Saves text to stream. |
getAttachment() | Saves all attachments to streams. |
getAttachmentInfo() | Gets the list of attachments. |
getResolution() | Gets resolution for extracted images. |
setResolution(int value) | Set resolution for extracted images. |
getPassword() | Gets input file’s password. |
setPassword(String value) | Sets input file’s password. |
extractMarkedContentAsImages(Page page, String path) | Gets all the Marked Content containers as separate images. |
PdfExtractor()
public PdfExtractor()
Initializes a new PdfExtractor object.
PdfExtractor(IDocument document)
public PdfExtractor(IDocument document)
Initializes a new PdfExtractor object based on the document.
Parameters:
Parameter | Type | Description |
---|---|---|
document | IDocument | Pdf document. |
getStartPage()
public int getStartPage()
Gets start page in the page range where extracting operation will be performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(5);
ext.extractText();
Returns: int - start page in the page range.
setStartPage(int value)
public void setStartPage(int value)
Sets start page in the page range where extracting operation will be performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(5);
ext.extractText();
Parameters:
Parameter | Type | Description |
---|---|---|
value | int | start page in the page range. |
getEndPage()
public int getEndPage()
Gets end page in the page range where extracting operation will be performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(3);
ext.extractText();
Returns: int - end page.
setEndPage(int value)
public void setEndPage(int value)
Sets end page in the page range where extracting operation will be performed.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
ext.setStartPage(2);
ext.setEndPage(3);
ext.extractText();
Parameters:
Parameter | Type | Description |
---|---|---|
value | int | end page. |
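The start and end page setters together define the range that extraction covers. A minimal sketch of a guard to run before calling them, assuming pages are numbered from 1, both bounds are inclusive, and the caller already knows the page count. `PageRange` is a hypothetical helper, not part of the library.

```java
/** Hypothetical helper: validates a page range before it is passed to setStartPage/setEndPage. */
final class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Assumption: pages are numbered from 1 and both bounds are inclusive.
     * Throws IllegalArgumentException when the range does not fit the document.
     */
    static PageRange of(int start, int end, int pageCount) {
        if (start < 1 || end < start || end > pageCount) {
            throw new IllegalArgumentException(
                "Invalid page range " + start + "-" + end + " for " + pageCount + " pages");
        }
        return new PageRange(start, end);
    }

    /** Number of pages covered by the inclusive range. */
    int size() {
        return end - start + 1;
    }
}
```

A validated range could then be applied with `ext.setStartPage(range.start)` and `ext.setEndPage(range.end)`.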
getExtractTextMode()
public int getExtractTextMode()
Gets the mode used for text extraction.
The example demonstrates setExtractTextMode usage in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
// 1 = raw ordering mode
extractor.setExtractTextMode(1);
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
Returns: int - the text extraction mode.
setExtractTextMode(int value)
public void setExtractTextMode(int value)
Sets the mode used for text extraction.
The example demonstrates setExtractTextMode usage in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
// 1 = raw ordering mode
extractor.setExtractTextMode(1);
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
Value: 0 is pure text mode and 1 is raw ordering mode. Default is 0.
Parameters:
Parameter | Type | Description |
---|---|---|
value | int | the text extraction mode. |
getTextSearchOptions()
public TextSearchOptions getTextSearchOptions()
Gets text search options.
Returns: TextSearchOptions - text search options.
setTextSearchOptions(TextSearchOptions value)
public void setTextSearchOptions(TextSearchOptions value)
Sets text search options.
Parameters:
Parameter | Type | Description |
---|---|---|
value | TextSearchOptions | text search options. |
getExtractImageMode()
public int getExtractImageMode()
Gets the mode for the image extraction process.
The default value is ExtractImageMode.DefinedInResources, which extracts all images defined in resources. To extract only the images that are actually shown, use ExtractImageMode.ActuallyUsed.
Returns: int - ExtractImageMode value
setExtractImageMode(int value)
public void setExtractImageMode(int value)
Sets the mode for the image extraction process.
The default value is ExtractImageMode.DefinedInResources, which extracts all images defined in resources. To extract only the images that are actually shown, use ExtractImageMode.ActuallyUsed.
Parameters:
Parameter | Type | Description |
---|---|---|
value | int | ExtractImageMode value |
isBidi()
public boolean isBidi()
Returns true when the text contains Hebrew or Arabic symbols. This case needs special handling because string functions change their behaviour and process the text from right to left (except numbers and other non-text characters).
Returns: boolean - true if the text contains Hebrew or Arabic symbols
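To make the right-to-left condition concrete, here is a minimal sketch of how such characters can be detected in a plain Java string using only the JDK. It is not the library's implementation; `RtlCheck` and `containsRtl` are hypothetical names used for illustration.

```java
/** Hypothetical helper showing one way to detect right-to-left characters with the JDK. */
final class RtlCheck {
    /** Returns true if the text contains at least one Hebrew- or Arabic-style right-to-left character. */
    static boolean containsRtl(String text) {
        return text.codePoints().anyMatch(cp -> {
            byte direction = Character.getDirectionality(cp);
            // Hebrew letters report RIGHT_TO_LEFT, Arabic letters report RIGHT_TO_LEFT_ARABIC.
            return direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        });
    }
}
```

Digits and punctuation have other directionality classes, so a string made only of numbers is not reported as right-to-left by this check.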
extractText()
public void extractText()
Extracts text from a Pdf document.
The first example demonstrates how to extract all the text from a PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
extractor.extractText();
extractor.getText("D:\\Text\\text.txt");
The second example demonstrates how to extract each page’s text into its own txt file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText();
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
extractor.getNextPageText(prefix + pageCount + suffix);
pageCount++;
}
extractText(Charset encoding)
public void extractText(Charset encoding)
Extracts text from a Pdf document using specified encoding.
First example demonstrates how to extract all the text from PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("D:\\Text\\text.pdf");
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
extractor.getText("D:\\Text\\text.txt");
The second example demonstrates how to extract each page’s text into its own txt file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText(java.nio.charset.Charset.forName("UTF-8"));
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
extractor.getNextPageText(prefix + pageCount + suffix);
pageCount++;
}
Parameters:
Parameter | Type | Description |
---|---|---|
encoding | java.nio.charset.Charset | The encoding of the extracted text. |
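The choice of charset determines how many bytes each character occupies in the saved text. The sketch below uses only the JDK to show the difference between two common charsets; `CharsetDemo` is a hypothetical name, and how the library applies the charset internally is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same text occupies a different number of bytes under different charsets. */
final class CharsetDemo {
    /** Number of bytes needed to store the text in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String sample = "A\u00E9"; // "A" followed by "e" with an acute accent
        System.out.println(encodedLength(sample, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(sample, StandardCharsets.UTF_16LE)); // 4
        // Charset.forName resolves a charset by its name.
        System.out.println(Charset.forName("UTF-8").equals(StandardCharsets.UTF_8)); // true
    }
}
```

Whichever charset is passed to extractText should also be used when the resulting file or stream is read back.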
extractTextInternal(TextEncodingInternal encoding)
public void extractTextInternal(TextEncodingInternal encoding)
For Internal usage only
Parameters:
Parameter | Type | Description |
---|---|---|
encoding | TextEncodingInternal | The encoding of the extracted text. |
getText(String outputFile)
public void getText(String outputFile)
Saves text to file. See also: extractText()
Parameters:
Parameter | Type | Description |
---|---|---|
outputFile | java.lang.String | The file path and name to save the text. |
getText(OutputStream outputStream)
public void getText(OutputStream outputStream)
Saves text to stream. See also: extractText()
Parameters:
Parameter | Type | Description |
---|---|---|
outputStream | java.io.OutputStream | The stream to save the text. |
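Writing to a stream makes it possible to keep the extracted text in memory instead of a file. A minimal sketch, assuming the bytes written to the stream use the charset the caller passes in. `TextCapture` and `StreamWriter` are hypothetical names, and the interface is only a stand-in for a method such as getText(OutputStream).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical helper: collects whatever a writer sends to an OutputStream into a String. */
final class TextCapture {
    /** Stand-in for any method that writes to a stream, such as a getText(OutputStream) method reference. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    static String capture(StreamWriter writer, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writer.writeTo(buffer);
        } catch (IOException e) {
            // An in-memory buffer should not fail, but the writer itself might.
            throw new UncheckedIOException(e);
        }
        // Assumption: the bytes were produced with the same charset used here for decoding.
        return new String(buffer.toByteArray(), charset);
    }
}
```

With a bound extractor this could be called as `TextCapture.capture(extractor::getText, charset)` after extractText has run.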
bindPdf(String inputFile)
public void bindPdf(String inputFile)
Binds the input PDF file.
PdfExtractor ext = new PdfExtractor();
ext.bindPdf("sample.pdf");
Parameters:
Parameter | Type | Description |
---|---|---|
inputFile | java.lang.String | PDF file to bind |
bindPdf(InputStream inputStream)
public void bindPdf(InputStream inputStream)
Binds PDF document from stream.
PdfExtractor ext = new PdfExtractor();
InputStream stream = new FileInputStream("sample.pdf");
ext.bindPdf(stream);
Parameters:
Parameter | Type | Description |
---|---|---|
inputStream | java.io.InputStream | Stream containing PDF document data |
extractImage()
public void extractImage()
Extracts images from the PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage())
{
    // Save each image under a distinct file name
    extractor.getNextImage("image-" + i + ".jpg");
    i++;
}
hasNextImage()
public boolean hasNextImage()
Checks if more images are accessible in the PDF document. Note: extractImage() must be called before using this method.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage())
{
    // Save each image under a distinct file name
    extractor.getNextImage("image-" + i + ".jpg");
    i++;
}
Returns: boolean - True if more images are accessible
getNextImage(String outputFile)
public boolean getNextImage(String outputFile)
Retrieves the next image from the PDF document. Note: extractImage() must be called before using this method.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractImage();
int i = 1;
while (extractor.hasNextImage())
{
    // Save each image under a distinct file name
    extractor.getNextImage("image-" + i + ".jpg");
    i++;
}
Parameters:
Parameter | Type | Description |
---|---|---|
outputFile | java.lang.String | File where image will be stored |
Returns: boolean - True if the image is successfully extracted
getNextImage(String outputFile, ImageType format)
public boolean getNextImage(String outputFile, ImageType format)
Retrieves the next image from the PDF document with the given image format. Note: extractImage() must be called before using this method.
Parameters:
Parameter | Type | Description |
---|---|---|
outputFile | java.lang.String | File where image will be stored |
format | ImageType | ImageType element |
Returns: boolean - True if the image is successfully extracted
getNextImage(OutputStream outputStream, ImageType format)
public boolean getNextImage(OutputStream outputStream, ImageType format)
Retrieves the next image from the PDF file and stores it in a stream with the given image format.
Parameters:
Parameter | Type | Description |
---|---|---|
outputStream | java.io.OutputStream | Stream where image data will be saved |
format | ImageType | The format of the image. |
Returns: boolean - True in case the image is successfully extracted.
getNextImage(OutputStream outputStream)
public boolean getNextImage(OutputStream outputStream)
Retrieves the next image from the PDF file and stores it in a stream.
Parameters:
Parameter | Type | Description |
---|---|---|
outputStream | java.io.OutputStream | Stream where image data will be saved |
Returns: boolean - True in case the image is successfully extracted.
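The stream overloads allow images to be collected in memory rather than written to disk. A minimal sketch of that loop, written against a stand-in interface that mirrors the documented hasNextImage() and getNextImage(OutputStream) signatures. `ImageCollector` and `ImageSource` are hypothetical names, and the sketch assumes each getNextImage call advances to the next image whether or not it succeeds.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drains an image iterator into in-memory byte arrays. */
final class ImageCollector {
    /** Stand-in mirroring the two documented PdfExtractor methods used here. */
    interface ImageSource {
        boolean hasNextImage();
        boolean getNextImage(OutputStream outputStream);
    }

    static List<byte[]> collect(ImageSource source) {
        List<byte[]> images = new ArrayList<>();
        while (source.hasNextImage()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // Keep the bytes only when the source reports a successful extraction.
            if (source.getNextImage(buffer)) {
                images.add(buffer.toByteArray());
            }
        }
        return images;
    }
}
```

To use it with a real extractor, wrap the extractor in a small adapter that implements `ImageSource` by delegating both calls.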
getAttachNames()
public List<String> getAttachNames()
Returns the list of attachments in the PDF file. Note: extractAttachment() must be called before using this method.
The example demonstrates how to extract attachment names from a PDF file.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf("sample.pdf");
extractor.extractAttachment();
List<String> attachments = extractor.getAttachNames();
for (String name : attachments)
    System.out.println(name);
Returns: java.util.List<java.lang.String> - List of attachments
extractAttachment()
public void extractAttachment()
Extracts attachments from a Pdf document.
extractAttachment(String attachmentFileName)
public void extractAttachment(String attachmentFileName)
Extracts an attachment from the PDF file by attachment name.
Parameters:
Parameter | Type | Description |
---|---|---|
attachmentFileName | java.lang.String | Name of attachment to extract |
getAttachment(String outputPath)
public void getAttachment(String outputPath)
Stores attachment into file.
Parameters:
Parameter | Type | Description |
---|---|---|
outputPath | java.lang.String | Directory path where attachment(s) will be stored. Null or empty string means attachment(s) will be placed in the application directory. |
hasNextPageText()
public boolean hasNextPageText()
Indicates whether more page text is available.
The example demonstrates hasNextPageText usage in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
extractor.getNextPageText(prefix + pageCount + suffix);
pageCount++;
}
Returns: boolean - true if more page text is available; otherwise false.
getNextPageText(String outputFile)
public void getNextPageText(String outputFile)
Saves one page’s text to file.
The example demonstrates getNextPageText method usage in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
extractor.getNextPageText(prefix + pageCount + suffix);
pageCount++;
}
Parameters:
Parameter | Type | Description |
---|---|---|
outputFile | java.lang.String | The file path and name to save the text. |
getNextPageText(OutputStream outputStream)
public void getNextPageText(OutputStream outputStream)
Saves one page’s text to stream.
The example demonstrates getNextPageText method usage in a text extraction scenario.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(TestPath + "Aspose.Pdf.Kit.Pdf");
extractor.extractText(java.nio.charset.StandardCharsets.UTF_16LE);
String prefix = TestPath + "Aspose.Pdf.Kit";
String suffix = ".txt";
int pageCount = 1;
while (extractor.hasNextPageText())
{
FileOutputStream fs = new FileOutputStream(prefix + pageCount + suffix);
extractor.getNextPageText(fs);
fs.close();
pageCount++;
}
Parameters:
Parameter | Type | Description |
---|---|---|
outputStream | java.io.OutputStream | The stream to save the text. |
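The same page-by-page loop can gather each page's text in memory instead of writing files. A minimal sketch against a stand-in interface that mirrors the documented hasNextPageText() and getNextPageText(OutputStream) signatures. `PageTextCollector` and `PageTextSource` are hypothetical names, and the sketch assumes the page bytes are encoded with the charset supplied by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads page text one page at a time into a list of strings. */
final class PageTextCollector {
    /** Stand-in mirroring the two documented PdfExtractor methods used here. */
    interface PageTextSource {
        boolean hasNextPageText();
        void getNextPageText(OutputStream outputStream);
    }

    static List<String> collect(PageTextSource source, Charset charset) {
        List<String> pages = new ArrayList<>();
        while (source.hasNextPageText()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.getNextPageText(buffer);
            // Assumption: the bytes were written using the given charset.
            pages.add(new String(buffer.toByteArray(), charset));
        }
        return pages;
    }
}
```

The list index plus one then corresponds to the position of the page within the extracted range.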
getText(OutputStream outputStream, boolean filterNotAscii)
public void getText(OutputStream outputStream, boolean filterNotAscii)
Saves text to stream. See also: extractText()
Parameters:
Parameter | Type | Description |
---|---|---|
outputStream | java.io.OutputStream | The stream to save the text. |
filterNotAscii | boolean | If this parameter is true, all non-ASCII symbols will be removed |
getAttachment()
public ByteArrayOutputStream[] getAttachment()
Saves all attachments to streams.
PdfExtractor extractor = new PdfExtractor();
extractor.bindPdf(path + "Attach.pdf");
extractor.extractAttachment();
List<String> names = extractor.getAttachNames();
ByteArrayOutputStream[] tempStreams = extractor.getAttachment();
for (int i = 0; i < tempStreams.length; i++)
{
    // Write each attachment to a file named after the attachment
    String name = names.get(i);
    OutputStream fs = new FileOutputStream(path + name);
    fs.write(tempStreams[i].toByteArray());
    fs.close();
}
Returns: java.io.ByteArrayOutputStream[] - The stream array of the attachment file in the pdf document.
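The example above matches names to streams by index. A minimal sketch of the same pairing as a reusable step, assuming the list from getAttachNames() and the array from getAttachment() are in the same order. `AttachmentPairing` is a hypothetical helper, not part of the library.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: pairs attachment names with their contents. */
final class AttachmentPairing {
    /** Assumption: names.get(i) describes streams[i]. */
    static Map<String, byte[]> pair(List<String> names, ByteArrayOutputStream[] streams) {
        if (names.size() != streams.length) {
            throw new IllegalArgumentException("Expected one name per stream");
        }
        // LinkedHashMap keeps the attachments in their original order.
        Map<String, byte[]> byName = new LinkedHashMap<>();
        for (int i = 0; i < streams.length; i++) {
            byName.put(names.get(i), streams[i].toByteArray());
        }
        return byName;
    }
}
```

If two attachments share a name, the later one replaces the earlier one in the map, so a list of pairs may be a better fit for documents where that can happen.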
getAttachmentInfo()
public List<FileSpecification> getAttachmentInfo()
Gets the list of attachments.
Returns: java.util.List<com.aspose.pdf.FileSpecification> - the list of attachments
getResolution()
public int getResolution()
Gets the resolution for extracted images. The default value is 150. A higher resolution produces clearer images, but increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.
Returns: int - the resolution for extracted images
setResolution(int value)
public void setResolution(int value)
Sets the resolution for extracted images. The default value is 150. A higher resolution produces clearer images, but increases the time and memory needed to extract them. A resolution of 150 or 300 is usually enough to get a clear image.
Parameters:
Parameter | Type | Description |
---|---|---|
value | int | the resolution for extracted images |
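A rough calculation illustrates why a higher resolution costs more time and memory. The sketch assumes the resolution is expressed in dots per inch and that an image is held as an uncompressed bitmap at 4 bytes per pixel. Both are assumptions made for illustration, and `ResolutionMath` is a hypothetical name.

```java
/** Hypothetical helper: estimates bitmap size for a given physical size and resolution. */
final class ResolutionMath {
    /** Pixels along one side, assuming the resolution is in dots per inch. */
    static long pixels(double inches, int dpi) {
        return Math.round(inches * dpi);
    }

    /** Bytes for an uncompressed bitmap, assuming 4 bytes per pixel. */
    static long rawBytes(double widthInches, double heightInches, int dpi) {
        return pixels(widthInches, dpi) * pixels(heightInches, dpi) * 4L;
    }
}
```

Under these assumptions an 8.5 by 11 inch image is 1275 by 1650 pixels at 150 and 2550 by 3300 pixels at 300, so doubling the resolution quadruples the pixel count and the raw memory needed.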
getPassword()
public String getPassword()
Gets input file’s password.
Returns: java.lang.String - the input file’s password
setPassword(String value)
public void setPassword(String value)
Sets input file’s password.
Parameters:
Parameter | Type | Description |
---|---|---|
value | java.lang.String | the input file’s password |
extractMarkedContentAsImages(Page page, String path)
public void extractMarkedContentAsImages(Page page, String path)
Gets all the Marked Content containers as separate images.
Each Marked Content container is saved as a PNG image named MCID_.png
Parameters:
Parameter | Type | Description |
---|---|---|
page | Page | Page for process. |
path | java.lang.String | The path where images will be saved. |