Ingesting PDF, MS Office, and Other Documents into Elasticsearch with the FSCrawler Crawler

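Main features of FSCrawler:
1. Crawls a directory on the local file system (or a mounted remote file system), extracts text, optionally runs OCR (recognition accuracy is not high), and indexes the results into Elasticsearch. It also watches for added and deleted files and updates the Elasticsearch index accordingly. (This is the setup configured in this article.)
2. Can crawl remote files over SSH/FTP.
3. Can upload files to Elasticsearch through a REST interface.
Note: FSCrawler uses Tika, which can detect and extract metadata and text from over a thousand different file types (such as DOC, PPT, XLS, and PDF).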
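
To illustrate the REST upload in item 3, here is a minimal Python sketch that posts one file as multipart/form-data using only the standard library. It rests on assumptions that are not in the original article: that FSCrawler was started with its REST service enabled, that the service listens on the default address `http://127.0.0.1:8080/fscrawler`, and that the upload endpoint is `_document` (older releases used a different endpoint name, so check the documentation for your version). The helper names `build_multipart` and `upload` are hypothetical.

```python
import mimetypes
import uuid
from pathlib import Path
from urllib import request

# Assumed default REST address; the endpoint name differs between FSCrawler versions.
FSCRAWLER_URL = "http://127.0.0.1:8080/fscrawler/_document"


def build_multipart(filename: str, data: bytes, field: str = "file"):
    """Return (body, content_type) for a single-file multipart/form-data request."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload(path: str, url: str = FSCRAWLER_URL) -> str:
    """POST one file to the FSCrawler REST service and return the raw response text."""
    file_path = Path(path)
    body, content_type = build_multipart(file_path.name, file_path.read_bytes())
    req = request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": content_type},
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example call (hypothetical file name; needs a running REST service):
# print(upload(r"D:\ContractFiles\example.pdf"))
```

The body is built by hand so the sketch has no third-party dependencies; with the `requests` package installed, the same upload is a one-line `requests.post(url, files={"file": fh})`.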

Main FSCrawler configuration:

# C:\Users\Administrator\.fscrawler\contract\_settings.yaml
---
name: "contract"
fs:
  url: "D:\\ContractFiles"
  update_rate: "10m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng chi_sim"
    enabled: false
    # The settings below only take effect when OCR is enabled
    data_path: "C:\\Tesseract-OCR\\tessdata"
    pdf_strategy: "ocr_and_text"
    #pdf_strategy: "auto"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "https://192.168.1.101:9200"
  - url: "https://192.168.1.102:9200"
  - url: "https://192.168.1.103:9200"
  username: "your Elasticsearch username"
  password: "password for that Elasticsearch user"
  bulk_size: 100
  flush_interval: "10s"
  byte_size: "50mb"
  ssl_verification: false
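
Once the crawler has run with a configuration like this, one way to check that documents arrived is to search the index for a word that should appear in a file. The following Python sketch assumes, beyond what the article states, that the index uses FSCrawler's default name (the job name, here `contract`) and that the extracted text and file details live in the `content`, `file.filename`, and `path.real` fields. The helper names `build_search_request` and `search` are hypothetical, and certificate checks are disabled only to mirror `ssl_verification: false` above.

```python
import base64
import json
import ssl
from urllib import request


def build_search_request(node, index, text, username, password):
    """Build a _search request that matches `text` against the extracted content."""
    query = {
        "query": {"match": {"content": text}},
        "_source": ["file.filename", "path.real"],
        "size": 5,
    }
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return request.Request(
        f"{node}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def search(req):
    """Send the request, skipping certificate checks like ssl_verification: false."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode is relaxed
    context.verify_mode = ssl.CERT_NONE
    with request.urlopen(req, context=context) as resp:
        return json.loads(resp.read())

# Example call (placeholder credentials; needs a reachable cluster):
# req = build_search_request("https://192.168.1.101:9200", "contract",
#                            "invoice", "your-user", "your-password")
# for hit in search(req)["hits"]["hits"]:
#     print(hit["_source"])
```

Skipping certificate verification is acceptable for a quick check on a trusted network; for anything beyond that, supplying the cluster's CA certificate is the safer choice.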

The configuration is very flexible. For details, see the official documentation at https://fscrawler.readthedocs.io/en/latest/ or leave a comment to discuss.
