- 浏览: 140952 次
- 性别:
- 来自: 上海
文章分类
最新评论
-
zi_wu_xian:
docx格式的word文件虽然是zip格式的,也可以看到xml ...
用Java操作Office 2007 -
MyDreamNotDream:
看代码看到这里很不容易呢。
Java中HashMap的实现原理 -
四书五经:
to 楼上的 SonofGod :这个时候这样去获取:如果(值 ...
Java中HashMap的实现原理 -
SonofGod:
请问 楼主 在疑问3中。多个key的hash值一样的话,存储时 ...
Java中HashMap的实现原理 -
SonofGod:
请问 楼主 在疑问2中。多个可以的hash得到一样的hash值 ...
Java中HashMap的实现原理
OFFICE文档使用POI控件,PDF可以使用PDFBOX0.7.3控件,完全支持中文,用XPDF也行,不过感觉PDFBOX比较好,而且作者也在更新。水平有限,万望各位指正
WORD:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is) throws DocCenterException {
String bodyText = null;
try {
WordExtractor ex = new WordExtractor(is);//is是WORD文件的InputStream
bodyText = ex.getText();
if(!bodyText.equals("")){
index.AddIndex(url, title, bodyText);
}
}catch (DocCenterException e) {
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
}catch(Exception e){
e.printStackTrace();
}
}
return null;
}
Excel:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFCell;
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is) throws DocCenterException {
StringBuffer content = new StringBuffer();
try{ HSSFWorkbook workbook = new HSSFWorkbook(is);//创建对Excel工作簿文件的引用
for (int numSheets = 0; numSheets < workbook.getNumberOfSheets(); numSheets++) {
if (null != workbook.getSheetAt(numSheets)) {
HSSFSheet aSheet = workbook.getSheetAt(numSheets);//获得一个sheet
for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {
if (null != aSheet.getRow(rowNumOfSheet)) {
HSSFRow aRow = aSheet.getRow(rowNumOfSheet); //获得一个行
for (short cellNumOfRow = 0; cellNumOfRow <= aRow.getLastCellNum(); cellNumOfRow++) {
if (null != aRow.getCell(cellNumOfRow)) {
HSSFCell aCell = aRow.getCell(cellNumOfRow);//获得列值
content.append(aCell.getStringCellValue());
}
}
}
}
}
}
if(!content.equals("")){
index.AddIndex(url, title, content.toString());
}
}catch (DocCenterException e) {
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
}catch(Exception e) {
System.out.println("已运行xlRead() : " + e );
}
return null;
}
PowerPoint:
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.usermodel.SlideShow;
public Document getDocument(Index index, String url, String title, InputStream is)
throws DocCenterException {
StringBuffer content = new StringBuffer("");
try{ SlideShow ss = new SlideShow(new HSLFSlideShow(is));//is 为文件的InputStream,建立SlideShow
Slide[] slides = ss.getSlides();//获得每一张幻灯片
for(int i=0;i<slides.length;i++){
TextRun[] t = slides[i].getTextRuns();//为了取得幻灯片的文字内容,建立TextRun
for(int j=0;j<t.length;j++){
content.append(t[j].getText());//这里会将文字内容加到content中去
}
content.append(slides[i].getTitle());
}
index.AddIndex(url, title, content.toString());
}catch(Exception ex){
System.out.println(ex.toString());
}
return null;
}
PDF:
import java.io.InputStream;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is)throws DocCenterException {
COSDocument cosDoc = null;
try {
cosDoc = parseDocument(is);
} catch (IOException e) {
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
if (cosDoc.isEncrypted()) {
if (cosDoc != null)
closeCOSDocument(cosDoc);
throw new DocCenterException("该PDF文档是加密文档,无法处理");
}
String docText = null;
try {
PDFTextStripper stripper = new PDFTextStripper();
docText = stripper.getText(new PDDocument(cosDoc));
} catch (IOException e) {
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
PDDocument pdDoc = null;
try { pdDoc = new PDDocument(cosDoc);
PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
if(docInfo.getTitle()!=null && !docInfo.getTitle().equals("")){
title = docInfo.getTitle();
}
} catch (Exception e) {
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
System.err.println("无法取得该PDF文档的元数据" + e.getMessage());
} finally {
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
}
return null;
}
private static COSDocument parseDocument(InputStream is) throws IOException {
PDFParser parser = new PDFParser(is);
parser.parse();
return parser.getDocument();
}
private void closeCOSDocument(COSDocument cosDoc) {
if (cosDoc != null) {
try {
cosDoc.close();
} catch (IOException e) {
}
}
}
private void closePDDocument(PDDocument pdDoc) {
if (pdDoc != null) {
try {
pdDoc.close();
} catch (IOException e) {
}
}
}
WORD:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is) throws DocCenterException {
String bodyText = null;
try {
WordExtractor ex = new WordExtractor(is);//is是WORD文件的InputStream
bodyText = ex.getText();
if(!bodyText.equals("")){
index.AddIndex(url, title, bodyText);
}
}catch (DocCenterException e) {
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
}catch(Exception e){
e.printStackTrace();
}
}
return null;
}
Excel:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFCell;
import java.io.File;
import java.io.InputStream;
import java.io.FileInputStream;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is) throws DocCenterException {
StringBuffer content = new StringBuffer();
try{ HSSFWorkbook workbook = new HSSFWorkbook(is);//创建对Excel工作簿文件的引用
for (int numSheets = 0; numSheets < workbook.getNumberOfSheets(); numSheets++) {
if (null != workbook.getSheetAt(numSheets)) {
HSSFSheet aSheet = workbook.getSheetAt(numSheets);//获得一个sheet
for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) {
if (null != aSheet.getRow(rowNumOfSheet)) {
HSSFRow aRow = aSheet.getRow(rowNumOfSheet); //获得一个行
for (short cellNumOfRow = 0; cellNumOfRow <= aRow.getLastCellNum(); cellNumOfRow++) {
if (null != aRow.getCell(cellNumOfRow)) {
HSSFCell aCell = aRow.getCell(cellNumOfRow);//获得列值
content.append(aCell.getStringCellValue());
}
}
}
}
}
}
if(!content.equals("")){
index.AddIndex(url, title, content.toString());
}
}catch (DocCenterException e) {
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
}catch(Exception e) {
System.out.println("已运行xlRead() : " + e );
}
return null;
}
PowerPoint:
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.poi.hslf.HSLFSlideShow;
import org.apache.poi.hslf.model.TextRun;
import org.apache.poi.hslf.model.Slide;
import org.apache.poi.hslf.usermodel.SlideShow;
public Document getDocument(Index index, String url, String title, InputStream is)
throws DocCenterException {
StringBuffer content = new StringBuffer("");
try{ SlideShow ss = new SlideShow(new HSLFSlideShow(is));//is 为文件的InputStream,建立SlideShow
Slide[] slides = ss.getSlides();//获得每一张幻灯片
for(int i=0;i<slides.length;i++){
TextRun[] t = slides[i].getTextRuns();//为了取得幻灯片的文字内容,建立TextRun
for(int j=0;j<t.length;j++){
content.append(t[j].getText());//这里会将文字内容加到content中去
}
content.append(slides[i].getTitle());
}
index.AddIndex(url, title, content.toString());
}catch(Exception ex){
System.out.println(ex.toString());
}
return null;
}
PDF:
import java.io.InputStream;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;
import com.search.code.Index;
public Document getDocument(Index index, String url, String title, InputStream is)throws DocCenterException {
COSDocument cosDoc = null;
try {
cosDoc = parseDocument(is);
} catch (IOException e) {
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
if (cosDoc.isEncrypted()) {
if (cosDoc != null)
closeCOSDocument(cosDoc);
throw new DocCenterException("该PDF文档是加密文档,无法处理");
}
String docText = null;
try {
PDFTextStripper stripper = new PDFTextStripper();
docText = stripper.getText(new PDDocument(cosDoc));
} catch (IOException e) {
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
PDDocument pdDoc = null;
try { pdDoc = new PDDocument(cosDoc);
PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
if(docInfo.getTitle()!=null && !docInfo.getTitle().equals("")){
title = docInfo.getTitle();
}
} catch (Exception e) {
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
System.err.println("无法取得该PDF文档的元数据" + e.getMessage());
} finally {
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
}
return null;
}
private static COSDocument parseDocument(InputStream is) throws IOException {
PDFParser parser = new PDFParser(is);
parser.parse();
return parser.getDocument();
}
private void closeCOSDocument(COSDocument cosDoc) {
if (cosDoc != null) {
try {
cosDoc.close();
} catch (IOException e) {
}
}
}
private void closePDDocument(PDDocument pdDoc) {
if (pdDoc != null) {
try {
pdDoc.close();
} catch (IOException e) {
}
}
}
发表评论
-
微信收货地址共享接口-终极解决
2015-06-25 13:10 8304最近要接入微信的收货地址共享接口,总是不成功,折腾了好 ... -
Java中HashMap的实现原理
2011-04-28 14:30 2728昨天有人来公司面试,因为面试的地方和我坐的地方比较近,所以也听 ... -
java注解(annotation)简介
2010-06-13 10:10 1313[Java 5.0] Annotation – @Deprec ... -
quartz和spring-quartz
2010-06-13 10:03 873quartz和spring-quartz -
Java 线程实例讲解综述
2010-06-13 09:57 981Java 线程实例讲解综述 编写具有多线程能力的程序经常会用 ... -
Java Double 精度问题总结
2010-06-13 09:56 5248使用Java,double 进行运算时,经常出现精度丢失的问题 ... -
eXtremeComponents的eXtremeTable分页特性
2010-05-14 17:27 3281下面是我使用的例子: <ec:table ite ... -
java---final 关键字 和 static 用法
2010-03-17 13:58 840final 关键字 和 static 用法 一、final ... -
java版的escape和unescape方法
2010-03-17 09:21 2501其中unescape方法可以用来解开javascript的es ... -
StatSVN的使用说明
2010-03-04 10:27 999一、 checkout 希望统计的版本或者分支到某个目录(不管 ... -
Velocity语法
2010-03-01 18:01 8361. 变量 (1)变量的 ... -
用KeyTool生成安全证书
2010-02-22 17:14 1094详细请见:Tomcat的帮助文档,:https://local ... -
Spring 注解学习手札
2010-02-10 10:02 778http://snowolf.iteye.com/blog/5 ... -
JDK、JRE、JVM的关系
2010-01-25 11:23 852JDK就是Java Development Kit.简单的说J ... -
类装载器学习
2010-01-22 12:54 773Java的类装载器(Class Loader)和命名空间(Na ... -
Tomcat发布项目方法
2010-01-22 10:46 2566第一种方法:在tomcat中的conf目录中,在server. ... -
理解Java ClassLoader机制
2010-01-21 16:20 848当JVM(Java虚拟机)启动时,会形成由三个类加载器组成的初 ... -
cookie和session的工作机制
2010-01-19 15:19 762转载自:http://hi.baidu.com/jmtbai/ ... -
如何设置Tomcat的JVM虚拟机内存大小
2010-01-18 14:25 925Tomcat本身不能直接在计算机上运行,需要依赖于硬件基础之上 ... -
浅谈设置JVM内存分配的几个妙招
2010-01-18 14:24 1566安装Java开发软件时,默 ...
相关推荐
JAVA读取WORD_EXCEL_POWERPOINT_PDF文件的方法(poi)
整理了用java如何读取word文档,pdf文档的几种方法,含有程序
POI 读取 WORD EXCEL POWERPOINT 2003 2007 java 读取 WORD EXCEL POWERPOINT 2003 2007
JAVA读取PDF、WORD、EXCEL等文件的方法
Java读取Word中的表格(Excel),并导出文件为Excel
可以完整的读取word文件Excel文件PDF文件Txt文件,以文本的形式读出来,简单易懂
java读取word,excel,pdf等文本
java读取word文档.pdf
JAVA读取WORD,EXCEL,POWERPOINT,PDF文件的方法
Java读取Excel内容 v Java读取Excel内容 Java读取Excel内容
里面包含一个word转pdf的jar,和一个读取pdf的jar。可以实现Java读取Word文档的页数。
JAVA读取 excelJAVA读取 excel
通过Java读取word表格中的内容,将内容存到数据库中,将Word中的图片存到硬盘中
java 读取PDF文件中的内容 java 读取PDF文件中的内容
用Java读取Word文档
如何利用java来编写读取excel的方法,代码
java读取pdf文件作者、标题等属性
java 读取Excel文件中的内容 java 读取Excel文件中的内容