auto_research.survey.paper_reader module

class Paper(paper_path, model='gpt-4o-mini')[source]

Bases: object

A class for reading and extracting information from research articles in PDF format.

This class provides functionality to read PDF files using different libraries (PyPDF2 and PyMuPDF), extract specific sections, and analyze the content of research papers.

Parameters:

paper_path (str) – Path to the PDF file containing the research paper.
model (str) – Name of the GPT model to use for token counting. Defaults to ‘gpt-4o-mini’.

paper_path

Path to the PDF file.

Type: str

whole_paper

List containing the text content of each page.

Type: list[str]

paper_length

Total number of tokens in the paper based on the specified model.

Type: int

model

Name of the GPT model used for token counting.

Type: str

extracted_information

Dictionary containing extracted sections of the paper.

Type: dict[str, str]

Example

>>> paper = Paper("example.pdf", model="gpt-4")
>>> paper.read_pymupdf()
>>> paper.calculate_token_length()
>>> print(paper.paper_length)
1234

Notes

The class supports both PyPDF2 and PyMuPDF (fitz) for PDF processing, allowing flexibility in PDF parsing approaches.

__init__(paper_path, model='gpt-4o-mini')[source]

Initialize the Paper instance with the given PDF path and model.

Parameters:

paper_path (str)
model (str)

Return type:

None

read_pypdf2()[source]

Read PDF content using PyPDF2 library.

This method extracts text from each page of the PDF using PyPDF2 and stores it in the whole_paper list.

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pypdf2()

Return type: None

read_pymupdf()[source]

Read PDF content using PyMuPDF library.

This method extracts text from each page of the PDF using PyMuPDF (fitz) and stores it in the whole_paper list.

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()

Return type: None

first_n_pages(n)[source]

Return the concatenated text of the first n pages.

Parameters: n (int) – Number of pages to include.
Returns: Concatenated text of the first n pages.
Return type: str

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> first_three = paper.first_n_pages(3)
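
The following is a minimal sketch of the idea, not the library's implementation. It assumes whole_paper is a plain list of per-page strings and that pages are joined without a separator (the separator actually used is not stated here). The helper name first_n_pages_sketch is hypothetical.

```python
def first_n_pages_sketch(pages: list[str], n: int) -> str:
    """Concatenate the text of the first n pages (hypothetical helper)."""
    # Slicing never raises, so an n larger than the page count returns every page.
    return "".join(pages[:n])


pages = ["Title and abstract. ", "Introduction. ", "Methods. ", "Results. "]
first_three = first_n_pages_sketch(pages, 3)
```

Slicing keeps the sketch safe for short documents: asking for more pages than exist simply returns the whole text.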

get_whole_paper(print_mode=False)[source]

Print the entire paper content with page markers or return it as a formatted string.

This method either prints the content of each page with clear beginning and ending markers for better visualization or returns the entire content as a single string in the same format.

Parameters:

print_mode (bool) – If True, print the content. If False, return the content as a formatted string.

Returns:

If print_mode is False, returns the formatted string. Otherwise, returns None.

Return type:

Optional[str]

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> paper.get_whole_paper(print_mode=True)  # Prints the content
>>> full_text = paper.get_whole_paper(
...     print_mode=False
... )  # Returns the content as a string
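
To illustrate the print-or-return behavior, here is a rough sketch under stated assumptions. The marker text ("===== Page N begins =====") is invented for illustration, since the exact markers are not documented here, and format_pages is a hypothetical stand-alone function rather than the method itself.

```python
from typing import Optional


def format_pages(pages: list[str], print_mode: bool = False) -> Optional[str]:
    """Wrap each page in begin/end markers (the marker text is an assumption)."""
    blocks = []
    for number, text in enumerate(pages, start=1):
        blocks.append(
            f"===== Page {number} begins =====\n{text}\n===== Page {number} ends ====="
        )
    formatted = "\n".join(blocks)
    if print_mode:
        # Mirror the documented contract: print and return None.
        print(formatted)
        return None
    return formatted


full_text = format_pages(["First page text", "Second page text"])
```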

static extract_up_to_first_match_exclude_list(a, b_list)[source]

Extract content up to the first occurrence of any marker in the exclude list.

Parameters:

a (list[str]) – List of strings to concatenate and search within.
b_list (list[str]) – List of substrings to search for.

Returns:

List of strings from input up to but not including the first occurrence of any substring in b_list.

Return type:

list[str]

Example

>>> text_list = ["Page 1", "Page 2", "References", "Page 3"]
>>> result = Paper.extract_up_to_first_match_exclude_list(text_list, ["references"])
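
One plausible reading of this behavior is sketched below; it is not taken from the library source. Two assumptions are made and should be checked against the actual method: matching is case-insensitive, and the string containing the marker is truncated just before it. The function name up_to_first_match is hypothetical.

```python
def up_to_first_match(pages: list[str], markers: list[str]) -> list[str]:
    """Keep text up to, but not including, the first marker (hypothetical helper)."""
    kept = []
    for text in pages:
        lowered = text.lower()
        # Case-insensitive search is an assumption, not documented behavior.
        positions = [lowered.find(marker.lower()) for marker in markers]
        hits = [pos for pos in positions if pos != -1]
        if hits:
            head = text[: min(hits)]  # text before the earliest marker on this page
            if head:
                kept.append(head)
            break  # everything after the first match is discarded
        kept.append(text)
    return kept


text_list = ["Page 1", "Page 2", "Conclusion. References [1] ...", "Page 3"]
result = up_to_first_match(text_list, ["references"])
```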

extract_ending_pages(page_number=3)[source]

Extract the specified number of ending pages before the references section.

Parameters: page_number (int) – Number of pages to extract from the end. Defaults to 3.
Returns: Concatenated text of the specified number of ending pages.
Return type: str

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> ending_pages = paper.extract_ending_pages(2)
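
As a hedged sketch of the tail-selection step only: the code below assumes the pages have already been cut off before the references section and are joined without a separator. Both ending_pages_sketch and body_pages are hypothetical names.

```python
def ending_pages_sketch(body_pages: list[str], page_number: int = 3) -> str:
    """Concatenate the last page_number pages of the pre-references text."""
    # Guard against page_number <= 0, because body_pages[-0:] would return every page.
    if page_number <= 0:
        return ""
    return "".join(body_pages[-page_number:])


body_pages = ["Intro. ", "Methods. ", "Results. ", "Conclusion. "]
ending_pages = ending_pages_sketch(body_pages, 2)
```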

calculate_token_length()[source]

Calculate the total number of tokens in the paper using the specified model.

This method uses the tiktoken library to encode and count tokens according to the specified model’s tokenization scheme.

Example

>>> paper = Paper("example.pdf")
>>> paper.read_pymupdf()
>>> paper.calculate_token_length()

Return type: None
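
The counting step can be sketched as follows, assuming tokens are counted per page and summed (the method may instead encode the concatenated text, which can give a slightly different total). The count_tokens helper is hypothetical, and a whitespace splitter stands in for the real encoder so the sketch runs without extra dependencies.

```python
from typing import Callable


def count_tokens(pages: list[str], encode: Callable[[str], list]) -> int:
    """Sum the token counts of all pages using the supplied encoder."""
    return sum(len(encode(text)) for text in pages)


# Stand-in encoder: whitespace splitting, NOT a real model tokenizer.
# With tiktoken installed, one could instead pass
# tiktoken.encoding_for_model("gpt-4o-mini").encode as the encoder.
paper_length = count_tokens(["one two three", "four five"], str.split)
```

Passing the encoder as an argument keeps the counting logic independent of any particular tokenizer, which also makes it easy to test offline.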