Extract Highlighted Text In PDF File
To extract highlighted text from a PDF file, you can use the Aspose.PDF for .NET API. It lets you retrieve the text covered by the highlight annotations in a document.
Step 1: Load the PDF document
The first step in extracting highlighted text from a PDF file is to load the document. Create a new instance of the Document class and pass the path to the PDF file to its constructor.
// The path to the documents directory.
string dataDir = "YOUR DOCUMENT DIRECTORY";
// Path.Combine (from System.IO) adds the directory separator if dataDir lacks one.
Document doc = new Document(Path.Combine(dataDir, "ExtractHighlightedText.pdf"));
Step 2: Loop through all annotations
The next step is to loop through the annotations. Annotations belong to pages, so the foreach loop below iterates over the Annotations collection of the first page (page indexes in Aspose.PDF start at 1):
foreach (Annotation annotation in doc.Pages[1].Annotations)
{
    // Code goes here
}
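The loop above visits only the first page. To cover the whole document, one option is to nest it inside a loop over doc.Pages. The following is a minimal sketch rather than a definitive implementation; it assumes the PDF sits in the working directory and simply prints the type of each annotation it finds.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;

// Assumption: the PDF is in the current working directory.
Document doc = new Document("ExtractHighlightedText.pdf");

// Visit every page, then every annotation on that page.
foreach (Page page in doc.Pages)
{
    foreach (Annotation annotation in page.Annotations)
    {
        // Page.Number is the 1-based index of the page.
        Console.WriteLine($"Page {page.Number}: {annotation.AnnotationType}");
    }
}
```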
Step 3: Filter text markup annotations
Inside the foreach loop, skip every annotation that is not a text markup annotation. You can do this by checking whether the annotation is an instance of the TextMarkupAnnotation class.
if (annotation is TextMarkupAnnotation)
{
    // Code goes here
}
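Note that text markup annotations also include underlines and strikethroughs, so the check above accepts those as well. If only highlights are wanted, a narrower check against HighlightAnnotation (a subclass of TextMarkupAnnotation in Aspose.Pdf.Annotations) is one way to do it. A sketch under the assumption that the file is in the working directory:

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// Assumption: the PDF is in the current working directory.
Document doc = new Document("ExtractHighlightedText.pdf");

foreach (Annotation annotation in doc.Pages[1].Annotations)
{
    // Pattern matching tests the type and casts in a single step,
    // so underline and strikeout annotations are skipped here.
    if (annotation is HighlightAnnotation highlight)
    {
        foreach (TextFragment tf in highlight.GetMarkedTextFragments())
        {
            Console.WriteLine(tf.Text);
        }
    }
}
```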
Step 4: Retrieve highlighted text fragments
Once you have a text markup annotation, you can retrieve the text fragments it covers. You can do this by calling the GetMarkedTextFragments() method on the TextMarkupAnnotation object.
TextMarkupAnnotation highlightedAnnotation = annotation as TextMarkupAnnotation;
TextFragmentCollection collection = highlightedAnnotation.GetMarkedTextFragments();
Step 5: Display the highlighted text
Finally, you can display the highlighted text to the user. You can do this by looping through each TextFragment object in the TextFragmentCollection and reading its Text property.
foreach (TextFragment tf in collection)
{
    Console.WriteLine(tf.Text);
}
Example source code for Extract Highlighted Text using Aspose.PDF for .NET
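using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// The path to the documents directory.
string dataDir = "YOUR DOCUMENT DIRECTORY";
Document doc = new Document(Path.Combine(dataDir, "ExtractHighlightedText.pdf"));

// Loop through the annotations on the first page (page indexes start at 1).
foreach (Annotation annotation in doc.Pages[1].Annotations)
{
    // Pattern matching combines the type check (Step 3) and the cast (Step 4).
    if (annotation is TextMarkupAnnotation highlightedAnnotation)
    {
        // Retrieve the text fragments covered by the annotation.
        TextFragmentCollection collection = highlightedAnnotation.GetMarkedTextFragments();
        foreach (TextFragment tf in collection)
        {
            Console.WriteLine(tf.Text);
        }
    }
}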
Conclusion
In this tutorial, we explored how to extract highlighted text from a PDF document using Aspose.PDF for .NET. By following the step-by-step guide and using the provided C# source code, developers can easily extract and manage highlighted text in their PDF documents.
FAQs for extracting highlighted text in a PDF file
Q: What are text markup annotations in a PDF document?
A: Text markup annotations are annotations that highlight or mark specific text in a PDF document. Examples of text markup annotations include highlights, underlines, and strikethroughs.
Q: Can I extract text from other types of annotations using Aspose.PDF for .NET?
A: Yes. Besides text markup annotations, you can read text from other annotation types, such as free text annotations, for example through the Contents property that annotations expose.
Q: Does Aspose.PDF for .NET support extracting text from password-protected PDF files?
A: Yes, Aspose.PDF for .NET supports extracting text from password-protected PDF files. You need to provide the correct password when loading the PDF document, for example through the Document constructor that accepts a password as its second argument.
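As an illustration, the password can be passed as the second constructor argument. The file name and password below are hypothetical placeholders, and the sketch only prints the page count to confirm that the document opened.

```csharp
using System;
using Aspose.Pdf;

// Hypothetical file name and password, for illustration only.
Document doc = new Document("ProtectedFile.pdf", "your-password");

// If the password is accepted, the pages can be read as usual.
Console.WriteLine(doc.Pages.Count);
```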
Q: Can I filter highlighted text based on other criteria, such as color or author?
A: Yes, you can filter highlighted text based on other criteria, such as color, author, or creation date. These values are exposed as annotation properties (for example Color, Title, and CreationDate), which you can test inside the loop before reading the text.
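For instance, filtering by author could look like the sketch below. It assumes the author name is stored in the annotation's Title property, which is where PDF markup annotations conventionally keep it; the file location and the author name "Reviewer" are hypothetical.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// Assumption: the PDF is in the current working directory.
Document doc = new Document("ExtractHighlightedText.pdf");

// Hypothetical author name to filter on.
string author = "Reviewer";

foreach (Annotation annotation in doc.Pages[1].Annotations)
{
    // Keep text markup annotations whose Title (author) matches.
    if (annotation is TextMarkupAnnotation markup && markup.Title == author)
    {
        foreach (TextFragment tf in markup.GetMarkedTextFragments())
        {
            Console.WriteLine(tf.Text);
        }
    }
}
```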
Q: Is it possible to save the extracted highlighted text to a separate file?
A: Yes, you can save the extracted highlighted text to a separate file or store it in a data structure for further processing or analysis.
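One possible approach is to collect the fragments in a list and write them out with the standard .NET file API. This is a sketch, not the only way to do it; the input location and the output file name HighlightedText.txt are assumptions.

```csharp
using System.Collections.Generic;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

// Assumption: the PDF is in the current working directory.
Document doc = new Document("ExtractHighlightedText.pdf");

// Collect the text of every marked fragment on the first page.
var lines = new List<string>();
foreach (Annotation annotation in doc.Pages[1].Annotations)
{
    if (annotation is TextMarkupAnnotation markup)
    {
        foreach (TextFragment tf in markup.GetMarkedTextFragments())
        {
            lines.Add(tf.Text);
        }
    }
}

// Write one fragment per line; the output file name is hypothetical.
File.WriteAllLines("HighlightedText.txt", lines);
```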