Extract text content from common document formats. The main dispatcher
read_content() detects the format from the file extension and
delegates to the appropriate helper. Format-specific functions are also
exported for direct use.
Usage
read_content(path, name = basename(path))
read_content_docx(path)
read_content_pptx(path)
read_content_xlsx(path, sheet = NULL)
read_content_pdf(path)
read_content_text(path)
Arguments
- path
Character vector of file paths to read.
- name
Character vector of original file names, used for extension detection and output naming. Defaults to
basename(path). This is important when reading Shiny upload temp files whose paths lack meaningful extensions.
- sheet
Character vector of sheet names to read, or
NULL (the default) to read all sheets.
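The role of name can be illustrated with base R's tools::file_ext(). This is a minimal sketch, not the package's implementation: the temp path and file name below are made up, and how read_content() inspects the extension internally is an assumption.

```r
# Hypothetical upload: the temp path has no extension, the original name does.
upload_path <- "/tmp/RtmpAbC123/upload_0"
upload_name <- "quarterly-report.docx"

# Extension detection on the temp path finds nothing useful ...
ext_from_path <- tolower(tools::file_ext(upload_path))

# ... while the original file name carries the real format.
ext_from_name <- tolower(tools::file_ext(upload_name))
```

In a Shiny server function, the two values would typically come from the datapath and name columns of a fileInput() value, so a call might look like read_content(input$upload$datapath, name = input$upload$name), where upload is a hypothetical input id.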
Details
read_content_docx() requires the officer package. Paragraphs
are extracted in document order. Tables are formatted as plain text.
read_content_pptx() requires the officer package. Text is
extracted from each slide with slide number headers.
read_content_xlsx() requires the readxl package. Each sheet is
formatted as a plain-text table with a sheet name header.
read_content_pdf() requires the pdftools package. Text is
extracted per page with page number headers.
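Because these helpers depend on packages that may not be installed, it can be useful to check availability before reading. The following is a sketch using base R's requireNamespace(); the needed vector is an illustrative mapping built from the requirements stated above, and how the helpers themselves report a missing package is not covered here.

```r
# Packages named in the details above, keyed by file extension.
needed <- c(docx = "officer", pptx = "officer",
            xlsx = "readxl", pdf = "pdftools")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Extensions whose helper cannot run until its package is installed.
missing_for <- names(installed)[!installed]
```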
read_content_text() reads any file as plain text using
readLines(). This is the fallback for unrecognized file extensions.
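The dispatch-with-fallback behavior described above can be sketched in base R. pick_reader() is a hypothetical stand-in, not the package's own code, and the case-insensitive matching it performs is an assumption about how extensions are compared.

```r
# Hypothetical stand-in for the dispatch step; not the package's own code.
pick_reader <- function(name) {
  ext <- tolower(tools::file_ext(name))
  switch(ext,
    docx = "read_content_docx",
    pptx = "read_content_pptx",
    xlsx = "read_content_xlsx",
    pdf  = "read_content_pdf",
    "read_content_text"  # default branch: anything unrecognized
  )
}

pick_reader("slides.PPTX")  # recognized extension
pick_reader("notes.md")     # unrecognized extension, falls back to text
pick_reader("Makefile")     # no extension at all, falls back to text

# The fallback itself amounts to reading lines from the file.
tmp <- tempfile(fileext = ".log")
writeLines(c("first line", "second line"), tmp)
fallback_lines <- readLines(tmp)
```

Keeping the text reader as the default branch means a file with an unknown or missing extension is still read rather than rejected, which matches the fallback role described for read_content_text().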
