What are the pdfminer.six and pdfplumber modules in Python

This article explains what the pdfminer.six and pdfplumber modules for Python are and demonstrates basic usage of each, as a practical reference.

pdfminer.six

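PDFMiner has a relatively high barrier to entry: you need some understanding of the PDF document structure model. It is best suited to building custom tools for complex content processing.

PDFMiner is rarely used directly in day-to-day work, so only basic document content operations are shown here: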

import pathlib
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBox, LTFigure, LTImage
from pdfminer.converter import PDFPageAggregator

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')

with open(f_path, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
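        for x in layout:
            # Text box: print its text
            if isinstance(x, LTTextBox):
                print(x.get_text().strip())
            # Image object (pdfminer usually nests LTImage inside an LTFigure,
            # so this top-level check may never match)
            if isinstance(x, LTImage):
                print('Found an image')
            # Figure object
            if isinstance(x, LTFigure):
                print('Found a figure object')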
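
In pdfminer.six, LTImage objects generally sit inside an LTFigure container rather than at the top level of the page layout, so a flat loop over the layout can miss them. The following is a minimal sketch of a recursive walk over the layout tree. The helper name iter_layout_objects is hypothetical (it is not part of pdfminer.six), and the sketch only assumes that container objects are iterable while leaf objects are not.

```python
from typing import Any, Iterable, Iterator, Type


def iter_layout_objects(container: Iterable[Any], wanted: Type) -> Iterator[Any]:
    """Yield every object of type `wanted` in a layout tree, depth first.

    Hypothetical helper: it relies only on containers being iterable,
    which is assumed to hold for pdfminer layout containers such as LTFigure.
    """
    for obj in container:
        if isinstance(obj, wanted):
            yield obj
        # Descend into anything iterable; skip strings to avoid endless recursion.
        if hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes)):
            yield from iter_layout_objects(obj, wanted)


# Possible use inside the page loop above (requires pdfminer.six):
#     for image in iter_layout_objects(layout, LTImage):
#         print(image.name)
```

Because the walk is generic, the same helper could collect nested LTTextBox or LTFigure objects by passing a different type.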

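Although pdfminer has a steep learning curve, complex cases often still come down to it. Among open-source modules, its PDF support is probably the most complete.

pdfplumber, covered next, is built on pdfminer.six and lowers the barrier to entry.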

pdfplumber

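Compared with pdfminer.six, pdfplumber offers a more convenient interface for extracting PDF content.

Common day-to-day tasks include:

Extract PDF text and save it to a txt file
Extract tables from a PDF to Excel
Extract images from a PDF
Extract charts from a PDF

Extract PDF text and save it to a txt file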

import pathlib
import pdfplumber

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_out.txt')

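# Write UTF-8 explicitly so the output does not depend on the platform default encoding
with pdfplumber.open(f_path) as pdf, open(out_path, 'a', encoding='utf-8') as txt:
    for page in pdf.pages:
        # extract_text() may return None or an empty string for pages without text
        textdata = page.extract_text() or ''
        txt.write(textdata)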
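
Writing each page's text back to back makes page boundaries hard to spot in the output, and pages without any text can yield an empty result. A small sketch of one way to handle both, assuming per-page text arrives as strings or None. The helper name join_page_texts and the blank-line separator are illustrative choices, not part of pdfplumber.

```python
from typing import Iterable, Optional


def join_page_texts(texts: Iterable[Optional[str]], separator: str = "\n\n") -> str:
    """Join per-page text with a separator, skipping pages that have no text."""
    return separator.join(text for text in texts if text)


# Possible use with pdfplumber (requires the library and a real PDF):
#     with pdfplumber.open(f_path) as pdf:
#         combined = join_page_texts(page.extract_text() for page in pdf.pages)
```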

Extract tables from a PDF to Excel

import pathlib
import pdfplumber
from openpyxl import Workbook

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_excel.xlsx')

wb = Workbook()
sheet = wb.active
with pdfplumber.open(f_path) as pdf:
    for i in range(19, 22):
        page = pdf.pages[i]
        # extract_table() returns None when no table is found on the page
        table = page.extract_table()
        if table is None:
            continue
        for row in table:
            sheet.append(row)
wb.save(out_path)

The code above uses openpyxl to create the Excel file; openpyxl was covered in a separate earlier article.
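
Rows returned by extract_table() are lists of cell values, and a cell can be None or contain line breaks where text wrapped inside the PDF cell. A minimal sketch of normalizing rows before appending them to a sheet follows. The helper name clean_table is hypothetical, and collapsing whitespace to a single space is an assumption that may not suit every document (for Chinese text you may prefer to join wrapped lines without a space).

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def clean_table(table: Optional[Sequence[Row]]) -> List[List[str]]:
    """Replace missing cells with '' and collapse whitespace inside each cell."""
    if table is None:
        return []
    cleaned = []
    for row in table:
        # str.split() with no arguments splits on any whitespace run, including newlines.
        cleaned.append([" ".join(cell.split()) if cell else "" for cell in row])
    return cleaned


# Possible use in the loop above:
#     for row in clean_table(page.extract_table()):
#         sheet.append(row)
```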

Extract images from a PDF

import pathlib
import pdfplumber
from PIL import Image

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-疫情影响下的中国社区趋势研究-艾瑞.pdf')
out_path = path.joinpath('002pdf_images.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[10]
    # Render the whole page and save it as a PNG for reference
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Dump the stream data of each image embedded in the page
    for i, img in enumerate(page.images):
        data = img['stream'].get_data()
        # The bytes are written as-is; whether they form a valid image file
        # depends on how the image is encoded inside the PDF
        img_path = path.joinpath(f'002pdf_images_{i}.png')
        with open(img_path, 'wb') as fimg_out:
            fimg_out.write(data)

The code above relies on PIL (Pillow) for image handling.
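
The bytes of an embedded image stream are not necessarily PNG data, so a fixed .png extension can produce files that image viewers refuse to open. One hedged option is to inspect the leading bytes and pick the extension from the file signature. The helper name guess_image_extension is hypothetical, and the sketch recognizes only the JPEG and PNG signatures; anything else falls back to .bin.

```python
def guess_image_extension(data: bytes) -> str:
    """Guess a file extension from the leading signature bytes of image data."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"  # JPEG files start with the SOI marker FF D8 FF
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"  # fixed 8-byte PNG signature
    return ".bin"  # unknown or raw pixel data: keep the bytes, do not mislabel them


# Possible use when writing each embedded image:
#     ext = guess_image_extension(data)
#     img_path = path.joinpath(f'002pdf_images_{i}{ext}')
```

Data that falls into the .bin case usually needs the image's width, height, and color space to be rebuilt into a viewable file, which is outside the scope of this sketch.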

Extract charts from a PDF

Charts differ from images: they are data-generated graphics such as histograms and pie charts.

import pathlib
import pdfplumber
from PIL import Image

path = list(pathlib.Path.cwd().parents)[1].joinpath('data/automate/002pdf')
f_path = path.joinpath('2020-新冠肺炎疫情对中国连锁餐饮行业的影响调研报告-中国连锁经营协会.pdf')
out_path = path.joinpath('002pdf_figures.png')
with pdfplumber.open(f_path) as pdf:
    page = pdf.pages[7]
    # Render the whole page and save it as a PNG
    im = page.to_image()
    im.save(out_path, format='PNG')
    # Crop each figure's bounding box and save it as its own PNG
    figures = page.figures
    for i, fig in enumerate(figures):
        crop = page.crop((fig['x0'], fig['top'], fig['x1'], fig['bottom']))
        img_crop = crop.to_image()
        fig_path = path.joinpath(f'002pdf_figures_{i}.png')
        img_crop.save(fig_path, format='PNG')
    # Outline words, images, and figures on the page image (drawn after the
    # save above, so these boxes appear only in the notebook preview)
    im.draw_rects(page.extract_words(), stroke='yellow')
    im.draw_rects(page.images, stroke='blue')
    im.draw_rects(page.figures)
im # show in notebook
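
A figure's bounding box can extend slightly past the page edge, and cropping with such a box may fail in some pdfplumber versions. A minimal sketch of clamping a box to the page bounds before cropping follows. The helper name clamp_bbox is hypothetical, and boxes are assumed to be (x0, top, x1, bottom) tuples, the same order passed to page.crop() above.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def clamp_bbox(bbox: BBox, bounds: BBox) -> BBox:
    """Shrink `bbox` so that it lies within `bounds`."""
    x0, top, x1, bottom = bbox
    bx0, btop, bx1, bbottom = bounds
    return (max(x0, bx0), max(top, btop), min(x1, bx1), min(bottom, bbottom))


# Possible use in the figure loop above, assuming page.bbox gives the page bounds:
#     box = clamp_bbox((fig['x0'], fig['top'], fig['x1'], fig['bottom']), page.bbox)
#     crop = page.crop(box)
```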

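Note also that the PDF specification was originally developed by Adobe and has been published as the ISO 32000 standard since 2008.

Day to day there is no need to consult the specification, but in more complex scenarios, especially ones a module does not support directly, reading it is the only option. The document is public; search for the keyword pdf_reference_1-7.pdf.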

That concludes this introduction to the pdfminer.six and pdfplumber modules in Python; hopefully it serves as a useful reference.

Page title: What are the pdfminer.six and pdfplumber modules in Python
URL: http://www.cdkjz.cn/article/pjdhie.html
