[Lucene extensions] An Introduction to Apache Lucene for Full-Text Search
Published: 2019-06-28


In this tutorial I would like to talk a bit about Apache Lucene. Lucene is an open-source project that provides Java-based indexing and search technology. Using its API, it is easy to implement full-text search. I will deal with the Java version, but bear in mind that there is also a .NET port available under the name Lucene.NET, as well as several helpful sub-projects.

I recently read an article about this project, but it presented no actual code. Thus, I decided to provide some sample code to help you get started with Lucene. The application we will build will allow you to index your own source code files and search them for specific keywords.

First things first, let's download the latest stable version from one of the Apache download mirrors. The version I will use is 3.0.1, so I downloaded the lucene-3.0.1.tar.gz bundle (note that the .tar.gz versions are significantly smaller than the corresponding .zip ones). Extract the tarball and locate the lucene-core-3.0.1.jar file, which will be used later. Also, make sure the Lucene API JavaDoc page is open in your browser (the docs are also included in the tarball for offline usage). Next, set up a new Eclipse project, let's say under the name “LuceneIntroProject”, and make sure the aforementioned JAR is included in the project's classpath.
Before we begin running search queries, we need to build an index, against which the queries will be executed. This will be done with the help of a class named IndexWriter, which is the class that creates and maintains an index. The IndexWriter receives Documents as input, where documents are the unit of indexing and search. Each Document is actually a set of Fields, and each field has a name and a textual value. To create an IndexWriter, an Analyzer is required. This class is abstract and the concrete implementation that we will use is SimpleAnalyzer.
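To make the Document, Field and index vocabulary a bit more concrete, here is a minimal sketch of an inverted index in plain Java. It is only a toy model written for illustration and says nothing about how Lucene actually stores its data; the ToyIndex class and its methods are made-up names, not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the document/field/index idea. NOT Lucene's implementation.
class ToyIndex {
    // term -> ids of the documents whose "contents" field contains it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored "filename" field
    private final List<String> filenames = new ArrayList<>();

    // Adds one document made of two fields and returns its id.
    int addDocument(String filename, String contents) {
        int docId = filenames.size();
        filenames.add(filename);
        // Crude analysis step: lowercase, then split on anything that is not a-z.
        for (String term : contents.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the stored filenames of all documents that contain the term.
    List<String> search(String term) {
        Set<Integer> docIds = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        List<String> hits = new ArrayList<>();
        for (int docId : docIds) {
            hits.add(filenames.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("A.java", "import org.apache.lucene.index.IndexWriter;");
        index.addDocument("B.java", "public class B {}");
        System.out.println(index.search("lucene")); // prints [A.java]
    }
}
```

The point of the sketch is the shape of the data: the searchable text is broken into terms once, at indexing time, so a query only has to look up a term instead of rescanning every file.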
Enough talking already, let's create a class named “SimpleFileIndexer” and make sure a main method is included. Here is the source code for this class:

package com.javacodegeeks.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class SimpleFileIndexer {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:/index/");
        File dataDir = new File("C:/programs/eclipse/workspace/");
        String suffix = "java";

        SimpleFileIndexer indexer = new SimpleFileIndexer();
        int numIndex = indexer.index(indexDir, dataDir, suffix);
        System.out.println("Total files indexed " + numIndex);
    }

    private int index(File indexDir, File dataDir, String suffix) throws Exception {
        // The "true" argument creates a new index, replacing any existing one.
        IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(indexDir),
                new SimpleAnalyzer(),
                true,
                IndexWriter.MaxFieldLength.LIMITED);
        indexWriter.setUseCompoundFile(false);

        indexDirectory(indexWriter, dataDir, suffix);

        int numIndexed = indexWriter.maxDoc();
        indexWriter.optimize();
        indexWriter.close();

        return numIndexed;
    }

    // Recursively walks dataDir and indexes every matching file.
    private void indexDirectory(IndexWriter indexWriter, File dataDir,
            String suffix) throws IOException {
        File[] files = dataDir.listFiles();
        if (files == null) {
            // listFiles() returns null if the directory cannot be read.
            return;
        }
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    }

    private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
            String suffix) throws IOException {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        System.out.println("Indexing file " + f.getCanonicalPath());

        Document doc = new Document();
        // "contents" is tokenized from the reader and indexed, but not stored.
        doc.add(new Field("contents", new FileReader(f)));
        // "filename" is stored so it can be read back from the search results.
        doc.add(new Field("filename", f.getCanonicalPath(),
                Field.Store.YES, Field.Index.ANALYZED));

        indexWriter.addDocument(doc);
    }
}

  

Let's talk a bit about this class. We provide the location of the index, i.e. where the index data will be saved on the disk (“C:/index/”). Then we provide the data directory, i.e. the directory which will be recursively scanned for input files. I have chosen my whole Eclipse workspace for this (“C:/programs/eclipse/workspace/”). Since we wish to index only Java source code files, I also added a suffix variable. You can obviously adjust those values to your search needs. The “index” method takes the previous parameters and uses a new instance of IndexWriter to perform the directory indexing. The “indexDirectory” method uses simple recursion to scan all the directories for files whose name ends with the given suffix. For each file that matches the criteria, a new Document is created in the “indexFileWithIndexWriter” method and the appropriate fields are populated. If you run the class as a Java application via Eclipse, the input directory will be indexed and Lucene's index files will be written to the index directory.

[Image: contents of the index directory after indexing]
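The recursive scan with a suffix filter can also be written with the java.nio.file API. The following is a minimal sketch rather than a replacement for the indexer; SuffixScanSketch and findFiles are hypothetical names, and the temporary directory in main exists only for the demo.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SuffixScanSketch {
    // Walks root recursively and returns the readable regular files
    // whose name ends with the given suffix, in sorted order.
    static List<Path> findFiles(Path root, String suffix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    .filter(p -> p.getFileName().toString().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway tree so the example runs anywhere.
        Path root = Files.createTempDirectory("scan-demo");
        Files.createDirectories(root.resolve("pkg"));
        Files.write(root.resolve("pkg").resolve("A.java"), "class A {}".getBytes());
        Files.write(root.resolve("notes.txt"), "not source".getBytes());

        for (Path p : findFiles(root, ".java")) {
            System.out.println(p);
        }
    }
}
```

Note that the sketch passes “.java” with the leading dot. The indexer above uses the suffix “java” with a plain endsWith check, so it would also accept a file whose name merely ends in those four letters.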

OK, we are done with the indexing, let's move on to the searching part of the equation. For this, an IndexSearcher class is needed, which is a class that implements the main search methods. For each search, a new Query object is needed (SQL anyone?) and this can be obtained from a QueryParser instance. Note that the QueryParser has to be created using the same type of Analyzer that the index was created with, in our case a SimpleAnalyzer. A Version is also used as a constructor argument; it is a class that is “Used by certain classes to match version compatibility across releases of Lucene”, according to the JavaDocs. The existence of something like that kind of confuses me, but whatever, let's use the appropriate version for our application (Version.LUCENE_30). When the search is performed by the IndexSearcher, a TopDocs object is returned as a result of the execution. This class just represents search hits and allows us to retrieve ScoreDoc objects. Using the ScoreDocs we find the Documents that match our search criteria, and from those Documents we retrieve the wanted information. Let's see all of these in action. Create a class named “SimpleSearcher” and make sure a main method is included. The source code for this class is the following:
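Why must the query be analyzed the same way as the indexed text? SimpleAnalyzer divides text at non-letters and lowercases it. The sketch below imitates that behaviour in plain Java, as an approximation for illustration only and not Lucene's own code; LetterLowercaseSketch and tokenize are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class LetterLowercaseSketch {
    // Splits text at every non-letter character and lowercases the pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [indexwriter, adddocument, doc]
        System.out.println(tokenize("indexWriter.addDocument(doc2);"));
    }
}
```

With this kind of analysis the index only ever contains lowercase terms such as “indexwriter”. A query for “IndexWriter” finds them only if it is passed through the same steps first, which is what handing the same Analyzer type to the QueryParser achieves.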

package com.javacodegeeks.lucene;

import java.io.File;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleSearcher {

    public static void main(String[] args) throws Exception {
        File indexDir = new File("c:/index/");
        String query = "lucene";
        int hits = 100;

        SimpleSearcher searcher = new SimpleSearcher();
        searcher.searchIndex(indexDir, query, hits);
    }

    private void searchIndex(File indexDir, String queryStr, int maxHits)
            throws Exception {
        Directory directory = FSDirectory.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(directory);

        // Same field name and Analyzer type that were used while indexing.
        QueryParser parser = new QueryParser(Version.LUCENE_30,
                "contents", new SimpleAnalyzer());
        Query query = parser.parse(queryStr);

        // Returns at most maxHits of the best matching documents.
        TopDocs topDocs = searcher.search(query, maxHits);

        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println(d.get("filename"));
        }

        System.out.println("Found " + hits.length);

        // Release the resources held by the searcher.
        searcher.close();
    }
}

We provide the index directory, the search query string and the maximum number of hits, and then call the “searchIndex” method. In that method, we create an IndexSearcher, a QueryParser and a Query object. Note that the QueryParser uses the name of the field that we used when creating the Documents with IndexWriter (“contents”) and, again, that the same type of Analyzer is used (SimpleAnalyzer). We perform the search and, for each matching Document, we extract the value of the field that holds the name of the file (“filename”) and print it. That's it, let's perform the actual search. Run it as a Java application and you will see the names of the files that contain the query string you provided.
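The maximum number of hits deserves a short illustration. The search returns only the best-scoring matches, up to the requested limit, so the “Found” line reports how many hits were returned, which can be lower than the number of matching documents (TopDocs exposes that total separately as totalHits). Here is a minimal sketch of the “keep the best N” idea in plain Java; the Hit class and the scores are invented for the example and have nothing to do with how Lucene computes relevance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopHitsSketch {
    // Hypothetical stand-in for a scored hit: a document id plus a score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns at most maxHits hits, highest score first.
    static List<Hit> top(List<Hit> all, int maxHits) {
        List<Hit> sorted = new ArrayList<>(all);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(sorted.subList(0, Math.min(maxHits, sorted.size())));
    }

    public static void main(String[] args) {
        List<Hit> all = Arrays.asList(
                new Hit(0, 0.2f), new Hit(1, 0.9f), new Hit(2, 0.5f));
        // Three documents match, but only the best two are returned.
        for (Hit h : top(all, 2)) {
            System.out.println(h.doc); // prints 1, then 2
        }
    }
}
```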

The Eclipse project for this tutorial, including the dependency library, is available for download from the original post.
UPDATE: There is also a follow-up post on this topic.
Enjoy!

Reposted from: http://mrghl.baihongyu.com/
