auto_research.survey.paper_reader module
- class Paper(paper_path, model='gpt-4o-mini')[source]
Bases: object
A class for reading and extracting information from research articles in PDF format.
This class provides functionality to read PDF files using different libraries (PyPDF2 and PyMuPDF), extract specific sections, and analyze the content of research papers.
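- Parameters:
- paper_path (str) – Path to the PDF file of the paper.
- model (str) – Name of the model whose tokenizer is used for token counting. Defaults to 'gpt-4o-mini'.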
Example
>>> paper = Paper("example.pdf", model="gpt-4")
>>> paper.read_pymupdf()
>>> paper.calculate_token_length()
>>> print(paper.paper_length)
1234
Notes
The class supports both PyPDF2 and PyMuPDF (fitz) for PDF processing, allowing flexibility in PDF parsing approaches.
- __init__(paper_path, model='gpt-4o-mini')[source]
Initialize the Paper instance with the given PDF path and model.
- read_pypdf2()[source]
Read PDF content using PyPDF2 library.
This method extracts text from each page of the PDF using PyPDF2 and stores it in the whole_paper list.
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pypdf2()
- Return type:
None
- read_pymupdf()[source]
Read PDF content using PyMuPDF library.
This method extracts text from each page of the PDF using PyMuPDF (fitz) and stores it in the whole_paper list.
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
- Return type:
None
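The two reading methods above can be pictured with a minimal sketch. It is not the library's implementation: the helper names `collect_page_text`, `read_with_pypdf2`, and `read_with_pymupdf` are hypothetical, and the sketch assumes PyPDF2 2.x or later (`PdfReader`, `extract_text`) and a PyMuPDF version that provides `Page.get_text`.

```python
from typing import Callable, Iterable, List


def collect_page_text(pages: Iterable, extract: Callable[[object], str]) -> List[str]:
    """Return one string per page, using "" when a page yields no text."""
    return [extract(page) or "" for page in pages]


def read_with_pypdf2(path: str) -> List[str]:
    """Hypothetical helper: per-page text via PyPDF2."""
    from PyPDF2 import PdfReader  # third-party, imported lazily

    reader = PdfReader(path)
    return collect_page_text(reader.pages, lambda page: page.extract_text())


def read_with_pymupdf(path: str) -> List[str]:
    """Hypothetical helper: per-page text via PyMuPDF (fitz)."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as document:
        return collect_page_text(document, lambda page: page.get_text())
```

Both helpers return a list with one entry per page, which matches the shape the documentation describes for whole_paper. The two libraries can produce different text for the same PDF, so it may be worth trying both on a difficult file.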
- first_n_pages(n)[source]
Return the concatenated text of the first n pages.
- Parameters:
n (int) – Number of pages to include.
- Returns:
Concatenated text of the first n pages.
- Return type:
str
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> first_three = paper.first_n_pages(3)
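As a rough illustration of what first_n_pages describes, the sketch below works on a plain list of per-page strings. The helper name `first_n_pages_sketch` and the choice to join pages with no separator are assumptions; the actual method may join pages differently.

```python
from typing import List


def first_n_pages_sketch(whole_paper: List[str], n: int) -> str:
    """Join the text of the first n pages (hypothetical helper)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Slicing is safe when n exceeds the page count: it returns every page.
    return "".join(whole_paper[:n])


pages = ["Title and abstract. ", "Introduction. ", "Methods. ", "Results. "]
first_three = first_n_pages_sketch(pages, 3)
```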
- get_whole_paper(print_mode=False)[source]
Print the entire paper content with page markers or return it as a formatted string.
This method either prints the content of each page with clear beginning and ending markers for better visualization or returns the entire content as a single string in the same format.
- Parameters:
print_mode (bool) – If True, print the content. If False, return the content as a formatted string.
- Returns:
If print_mode is False, returns the formatted string. Otherwise, returns None.
- Return type:
Optional[str]
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> paper.get_whole_paper(print_mode=True)  # Prints the content
>>> full_text = paper.get_whole_paper(
...     print_mode=False
... )  # Returns the content as a string
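The print-or-return behaviour of get_whole_paper can be sketched as follows. The helper name `format_pages` and the exact marker text are assumptions made for illustration; the documentation does not state what the real page markers look like.

```python
from typing import List, Optional


def format_pages(whole_paper: List[str], print_mode: bool = False) -> Optional[str]:
    """Wrap each page in begin/end markers (marker wording is an assumption)."""
    blocks = []
    for number, text in enumerate(whole_paper, start=1):
        blocks.append(
            f"===== Page {number} begins =====\n{text}\n===== Page {number} ends ====="
        )
    formatted = "\n".join(blocks)
    if print_mode:
        # Printing mode mirrors the documented behaviour of returning None.
        print(formatted)
        return None
    return formatted
```

Returning `Optional[str]` keeps a single entry point for both uses, at the cost that callers must remember the result is None when print_mode is True.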
- static extract_up_to_first_match_exclude_list(a, b_list)[source]
Extract content up to the first occurrence of any marker in the exclude list.
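- Parameters:
- a (List[str]) – List of strings to scan, for example one string per page.
- b_list (List[str]) – Substrings that mark where extraction stops.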
- Returns:
List of strings from input up to but not including the first occurrence of any substring in b_list.
- Return type:
List[str]
Example
>>> text_list = ["Page 1", "Page 2", "References", "Page 3"]
>>> result = Paper.extract_up_to_first_match_exclude_list(text_list, ["references"])
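One plausible reading of this method is sketched below. Two points are assumptions, not documented behaviour: matching is case-insensitive (suggested by the example, where "references" is meant to match "References"), and the whole element containing the marker is dropped. The function name `extract_up_to_first_match` is hypothetical.

```python
from typing import List


def extract_up_to_first_match(pages: List[str], markers: List[str]) -> List[str]:
    """Keep elements until one contains any marker (assumed case-insensitive)."""
    lowered = [marker.lower() for marker in markers]
    kept: List[str] = []
    for page in pages:
        if any(marker in page.lower() for marker in lowered):
            break  # stop at the first element that contains a marker
        kept.append(page)
    return kept


text_list = ["Page 1", "Page 2", "References", "Page 3"]
result = extract_up_to_first_match(text_list, ["references"])
```

Under these assumptions, `result` holds only the two elements before "References". If no marker is found, the input is returned unchanged as a new list.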
- extract_ending_pages(page_number=3)[source]
Extract the specified number of ending pages before the references section.
- Parameters:
page_number (int) – Number of pages to extract from the end. Defaults to 3.
- Returns:
Concatenated text of the specified number of ending pages.
- Return type:
str
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> ending_pages = paper.extract_ending_pages(2)
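The idea of taking the last few pages before the references can be sketched in two steps: cut the page list at the first page that contains a references marker, then join the tail of what remains. The helper name `ending_pages_sketch`, the default marker tuple, and the case-insensitive match are all assumptions; the documentation does not list the markers the real method uses.

```python
from typing import List, Sequence


def ending_pages_sketch(
    whole_paper: List[str],
    page_number: int = 3,
    markers: Sequence[str] = ("references",),  # assumed marker list
) -> str:
    """Join the last page_number pages that precede the references."""
    body: List[str] = []
    for page in whole_paper:
        if any(marker in page.lower() for marker in markers):
            break  # everything from this page on is treated as references
        body.append(page)
    if page_number <= 0:
        return ""  # guard: body[-0:] would otherwise return every page
    return "".join(body[-page_number:])


pages = ["Intro ", "Methods ", "Results ", "Conclusion ", "References [1] ..."]
ending = ending_pages_sketch(pages, 2)
```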
- calculate_token_length()[source]
Calculate the total number of tokens in the paper using the specified model.
This method uses the tiktoken library to encode and count tokens according to the specified model’s tokenization scheme.
Example
>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> paper.calculate_token_length()
- Return type:
None
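Token counting with tiktoken can be sketched as below. This is an illustration under assumptions, not the library's code: the function name `count_tokens`, the optional `encode` parameter, and the fallback to the `cl100k_base` encoding for unrecognised model names are all hypothetical choices. The sketch returns the count, whereas the documented method returns None and, judging by the class example, exposes the result as paper_length.

```python
from typing import Callable, List, Optional


def count_tokens(
    pages: List[str],
    model: str = "gpt-4o-mini",
    encode: Optional[Callable[[str], list]] = None,
) -> int:
    """Sum token counts over all pages (hypothetical helper)."""
    if encode is None:
        import tiktoken  # third-party, imported lazily

        try:
            encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            # Assumed fallback for model names tiktoken does not recognise.
            encoding = tiktoken.get_encoding("cl100k_base")
        encode = encoding.encode
    return sum(len(encode(page)) for page in pages)


# A whitespace splitter stands in for a real tokenizer here, so the
# example runs without tiktoken installed.
approximate = count_tokens(["a b c", "d e"], encode=str.split)
```

Passing a custom `encode` callable makes the counting logic easy to exercise offline; with the default, tiktoken may need to download the encoding data on first use.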