An introduction to PHP + Sphinx, the word-segmentation search middleware, and its basic usage (personally tested)

Date: 2021-01-11  Tags: sphinx, word segmentation, php, middleware

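Once Sphinx is installed, there are two ways to use it from PHP:
1. Install the PHP extension
2. Call the sphinxapi.php file shipped in the package
This article only covers the API approach. The reason: the Sphinx PHP extension is updated very slowly, which makes system upgrades difficult.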

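First, a figure comparing the major search middlewares. None is simply the best; it is a matter of which one fits the job.
Figure source: http://www.sphinxsearch.org/archives/492
That article is fairly comprehensive and easy to follow, and worth a read.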

Basic Sphinx usage

Windows

http://sphinxsearch.com/downloads/current/
Download Sphinx 3.1.1 (the latest version at the time of writing). Here is a minimal sphinx.conf.

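src1: a source, which defines the database query Sphinx runs to fetch the data to index
test1: an index, which is built from src1 and written to the configured path
searchd: settings for the search daemon that is installed and started later
The names src1 and test1 can be changed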

source src1
{
    type                = mysql
    sql_host            = localhost
    sql_user            = root
    sql_pass            = root
    sql_db              = test
    sql_port            = 3306
    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET SESSION query_cache_type = OFF
    sql_query           = SELECT * FROM log
	
	############ fields to index; adjust to your table structure
	sql_field_string    = type
	sql_field_string    = post_data
	sql_field_string    = http_respon
	sql_field_string    = code
	sql_field_string    = add_time
	
	# xmlpipe_field     = post_data   (xmlpipe2 sources only; not used with type = mysql)
}

index test1
{
    source          = src1
    ############ create the directory yourself if it does not exist; same below
    path            = E:\sphinx-3.1.1\data\test1
    morphology      = none
    stopwords       =
}

indexer
{
    mem_limit       = 32M
}

searchd
{
    listen              = 9312
    listen              = 9306:mysql41
    log                 = E:\sphinx-3.1.1\log\searchd.log
    query_log           = E:\sphinx-3.1.1\log\query.log
    read_timeout        = 5
    max_children        = 30
    pid_file            = E:\sphinx-3.1.1\log\searchd.pid
}

Put this file in the bin directory so it is easy to work with and test.

Here is my table structure:

CREATE TABLE `log` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` varchar(20) NOT NULL,
  `post_data` varchar(255) NOT NULL,
  `http_respon` varchar(100) NOT NULL,
  `code` varchar(20) NOT NULL,
  `add_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Open cmd or PowerShell, cd to the bin directory, and run:

# test1 can be replaced with --all; -c sphinx.conf is optional because sphinx.conf is the default
.\indexer.exe -c sphinx.conf test1

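You should see that one record was written. Open the path directory from the configuration file (data in my case) and you will find a set of generated .sp* files.
These are the index files Sphinx built from the target data.
Next, still in the bin directory, run the following (administrator privileges required):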

# install; skip this if it is already installed
.\searchd.exe --install
# start Sphinx (-c sphinx.conf is optional here as well)
.\searchd.exe -c sphinx.conf

Now let's try a query. Create a file named sphinxSearch.php in the api directory.
The API calls are documented in the official wiki: http://sphinxsearch.com/wiki/doku.php?id=sphinx_manual_chinese#通用api方法

<?php

	header("Content-type:text/html;charset=utf-8");
	# load the Sphinx API file
	require 'E:\sphinx-3.1.1\sphinxapi.php';
	$keyword = 'create_time';
	
	$sphinx = new SphinxClient();
	$sphinx->SetServer('localhost',9312);
	
	# '*' searches all indexes
	$result = $sphinx->Query($keyword,'*');
	# Query() returns false on failure; GetLastError() holds the reason
	print_r($sphinx->GetLastError());
	print_r($result);die;
	
?>

Run php sphinxSearch.php

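The results I wanted come back. The usual approach is to let Sphinx return the ids of all matching documents and then use those ids to fetch the final rows from MySQL.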
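The "ids from Sphinx, rows from MySQL" step can be sketched roughly as follows. This is only an illustration: collectDocIds() and buildLookupSql() are hypothetical helper names, the hand-written $result array stands in for a real Query() result, and the generated SQL assumes the log table shown earlier.

```php
<?php
// Sketch only: collectDocIds() and buildLookupSql() are hypothetical helpers,
// not part of sphinxapi.php.

/**
 * Return the document ids from a SphinxClient::Query() result.
 * In the default (non-array) result mode, 'matches' is keyed by document id.
 */
function collectDocIds($result): array
{
    if (!is_array($result) || empty($result['matches'])) {
        return [];
    }
    return array_map('intval', array_keys($result['matches']));
}

/**
 * Build a parameterized lookup that keeps the order Sphinx returned.
 * Returns [sql, params], intended for PDO::prepare() and execute().
 */
function buildLookupSql(array $ids): array
{
    if ($ids === []) {
        throw new InvalidArgumentException('no ids to look up');
    }
    $marks = implode(',', array_fill(0, count($ids), '?'));
    $sql = "SELECT * FROM log WHERE id IN ($marks) ORDER BY FIELD(id, $marks)";
    return [$sql, array_merge($ids, $ids)];
}

// A hand-written result in the shape sphinxapi.php returns.
$result = ['matches' => [
    7 => ['weight' => 2, 'attrs' => []],
    3 => ['weight' => 1, 'attrs' => []],
]];
$ids = collectDocIds($result);
[$sql, $params] = buildLookupSql($ids);
echo $sql, PHP_EOL;
```

With a real connection, $sql and $params would be passed to a prepared statement; when collectDocIds() returns an empty array there is nothing to look up and the MySQL query can be skipped.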
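There is more: what if you want to add data to the Sphinx index without stopping searchd?
For this, Sphinx offers the concepts of delta indexes and index merging.
Below are the source and index used for the delta index; they can go in the same sphinx.conf: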

source src2
{
    type                = mysql
    sql_host            = localhost
    sql_user            = root
    sql_pass            = root
    sql_db              = test
    sql_port            = 3306
    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET SESSION query_cache_type = OFF
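    ############ requires an extra config table that records the last id Sphinx has indexed; other ways of tracking new rows work too
    sql_query           = SELECT * FROM log where id > (select value from config where name = 'sphinx_max_id')
    ############ once the data has been fetched, record the current id
    sql_query_post      = update config set value = (select id from log order by id desc limit 1) where name = 'sphinx_max_id'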
	
	sql_field_string    = type
	sql_field_string    = post_data
	sql_field_string    = http_respon
	sql_field_string    = code
	sql_field_string    = add_time
	
	# xmlpipe_field     = post_data   (xmlpipe2 sources only; not used with type = mysql)
}
index test1_1
{
    source              = src2
    path                = E:\sphinx-3.1.1\data\test1_1
    morphology          = none
    stopwords           =
}

Database screenshot: (image not available)

Put these commands in a scheduled task (administrator privileges required):

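# build the delta index
.\indexer.exe -c .\sphinx.conf test1_1 --rotate
# merge the delta index (test1_1 into test1); if the delta's .sp* files are deleted, or the delta is built again, before this merge runs, data is lost or overwritten
.\indexer.exe -c .\sphinx.conf --merge test1 test1_1 --rotate
# with the setup described above, every delta build must be followed by a merge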

If the delta build has not finished because there is a lot of data, and the merge runs anyway, data will be lost.
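One way to keep the two steps from overlapping is to start them from a single script that waits for each command to exit before launching the next. The sketch below is only an illustration: runInOrder() is a hypothetical helper, the two commands are the ones listed above, and it only guarantees that the merge is not started until the delta indexer process has exited successfully.

```php
<?php
// Sketch only: runInOrder() is a hypothetical helper. The commands are the
// ones shown above; adjust the paths to your installation.

/**
 * Run commands one after another and stop at the first failure.
 * $runner receives a command string and returns its exit code. By default
 * the command is executed with passthru(), which blocks until it finishes.
 */
function runInOrder(array $commands, ?callable $runner = null): bool
{
    $runner = $runner ?? function (string $cmd): int {
        passthru($cmd, $exitCode);
        return $exitCode;
    };
    foreach ($commands as $cmd) {
        if ($runner($cmd) !== 0) {
            return false; // a failed step means later steps are skipped
        }
    }
    return true;
}

$commands = [
    '.\indexer.exe -c .\sphinx.conf test1_1 --rotate',
    '.\indexer.exe -c .\sphinx.conf --merge test1 test1_1 --rotate',
];

// Dry run with a stub runner; omit the second argument to really execute.
$ok = runInOrder($commands, function (string $cmd): int {
    echo "would run: $cmd", PHP_EOL;
    return 0;
});
```

Scheduling this one script instead of two separate tasks means the merge can never be triggered by the clock while the delta build is still running.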

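Windows is mostly used as a local environment, so this is rarely a concern there.
I also recommend Linux for production; Windows carries a graphical interface, so some performance loss is to be expected.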


Linux

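As before, start by downloading the package from http://sphinxsearch.com/downloads/current/.
I am using vagrant plus a virtual machine, so the download can be dropped straight into the directory.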

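Here is the configuration file sphinx.conf; for testing I put it in the bin directory.
The basics of the configuration are covered above.
Adjust the log and path locations to your needs and make sure the directories exist.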

source src1
{
    type                = mysql
    sql_host            = localhost
    sql_user            = root
    sql_pass            = root
    sql_db              = test
    sql_port            = 3306
	
    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET SESSION query_cache_type = OFF
    sql_query           = SELECT * FROM log
	
	sql_field_string    = type
	sql_field_string    = post_data
	sql_field_string    = http_respon
	sql_field_string    = code
	sql_field_string    = add_time
	
	# xmlpipe_field     = post_data   (xmlpipe2 sources only; not used with type = mysql)
}
source src2
{
    type                = mysql
    sql_host            = localhost
    sql_user            = root
    sql_pass            = root
    sql_db              = test
    sql_port            = 3306
    sql_query_pre       = SET NAMES utf8
    sql_query_pre       = SET SESSION query_cache_type = OFF
    sql_query           = SELECT * FROM log where id > (select value from config where name = 'sphinx_max_id')
    sql_query_post      = update config set value = (select id from log order by id desc limit 1) where name = 'sphinx_max_id'
	
	sql_field_string    = type
	sql_field_string    = post_data
	sql_field_string    = http_respon
	sql_field_string    = code
	sql_field_string    = add_time
	
	# xmlpipe_field     = post_data   (xmlpipe2 sources only; not used with type = mysql)
}

index test1
{
    source              = src1
    path                = /vagrant/sphinx-3.1.1/data/test1
    morphology          = none
    stopwords           =
}
index test1_1
{
    source              = src2
    path                = /vagrant/sphinx-3.1.1/data/test1_1
    morphology          = none
    stopwords           =
}

indexer
{
    mem_limit           = 32M
}

searchd
{
    listen              = 9312
    listen              = 9306:mysql41
    log                 = /vagrant/sphinx-3.1.1/log/searchd.log
    query_log           = /vagrant/sphinx-3.1.1/log/query.log
    read_timeout        = 5
    max_children        = 30
    pid_file            = /vagrant/sphinx-3.1.1/log/searchd.pid
}

Build the indexes:

# -c sphinx.conf must be given here; otherwise indexer looks for /etc/sphinx/sphinx.conf by default
indexer -c sphinx.conf --all

Start:

# requires sudo
searchd -c sphinx.conf

# stop
searchd -c sphinx.conf --stop

Search script sphinxSearch.php

<?php

	header("Content-type:text/html;charset=utf-8");
	# step 1: load the Sphinx API file
	require './sphinxapi.php';
	$keyword = '';
	$sphinx = new SphinxClient();
	$sphinx->SetServer('localhost',9312);
	# '*' searches all indexes
	$result = $sphinx->Query($keyword,'*');
	
	# To look for the keyword 'a' in the 'post_data' field only, write it like this instead.
	//$result = $sphinx->Query("@post_data a",'*');
	
	print_r($sphinx->GetLastError());
	print_r($result);die;
	
?>
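When the keyword for a field search like "@post_data a" comes from user input, characters such as @ or - would be read as query operators. The sketch below builds the field query with the keyword escaped. escapeSphinxKeyword() and buildFieldQuery() are hypothetical helpers written so the example runs without a daemon; sphinxapi.php provides its own EscapeString() method for the same purpose, and the character list here is my assumption of what needs escaping.

```php
<?php
// Sketch only: escapeSphinxKeyword() and buildFieldQuery() are hypothetical
// helpers written for this example; they are not part of sphinxapi.php.

/** Backslash-escape characters that act as operators in extended queries. */
function escapeSphinxKeyword(string $keyword): string
{
    $map = [];
    foreach (str_split('\\()|-!@~"&/^$=') as $ch) {
        $map[$ch] = '\\' . $ch;
    }
    return strtr($keyword, $map);
}

/** Build "@field keyword" so the keyword is matched in one field only. */
function buildFieldQuery(string $field, string $keyword): string
{
    if (!preg_match('/^\w+$/', $field)) {
        throw new InvalidArgumentException("invalid field name: $field");
    }
    return '@' . $field . ' ' . escapeSphinxKeyword($keyword);
}

echo buildFieldQuery('post_data', 'a'), PHP_EOL;
echo buildFieldQuery('post_data', 'user@example.com'), PHP_EOL;
```

The returned string is what would be passed as the first argument to Query() in the script above.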

Other commands

# build the delta index (run with sudo)
indexer -c sphinx.conf test1_1 --rotate
# merge the indexes
indexer -c sphinx.conf --merge test1 test1_1 --rotate

As mentioned above, when the delta is large, merging the indexes before the delta build has finished causes missing data.

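The fix:
While the delta index is being built, temporary <.tmp*> files are created in addition to test1_1.tmp.spl. Write a script that checks whether any of these files exist, and only merge once they are gone.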
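A minimal sketch of such a check, under the assumptions stated in the text: the temporary files sit next to the index files and are named test1_1.tmp*, and test1_1.tmp.spl alone does not count. deltaBuildInProgress() is a hypothetical helper name and the data directory is the one from my configuration.

```php
<?php
// Sketch only: deltaBuildInProgress() is a hypothetical helper, and the
// ".tmp" naming is taken from the observation above.

function deltaBuildInProgress(string $dataDir, string $index): bool
{
    $pattern = rtrim($dataDir, '/\\') . DIRECTORY_SEPARATOR . $index . '.tmp*';
    foreach (glob($pattern) ?: [] as $file) {
        // <index>.tmp.spl is ignored, as described above.
        if (basename($file) !== $index . '.tmp.spl') {
            return true;
        }
    }
    return false;
}

$dataDir = '/vagrant/sphinx-3.1.1/data';
if (deltaBuildInProgress($dataDir, 'test1_1')) {
    echo "delta index still building, skipping merge", PHP_EOL;
} else {
    echo "no temporary files found, safe to merge", PHP_EOL;
}
```

The merge command would then only be run from the else branch.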

Common errors

1. Most errors are permission problems, or a directory that cannot be found.

2. Running php sphinxSearch.php on Linux gave me this error:

searchd error: client version is higher than daemon version (client is v.1.32, daemon is v.1.31)

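After a long search I found that with the Linux 3.* package downloaded from the official site, searchd turned out to be version 2.2.11 once started. I do not know why.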

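Since the server turned out to be that version, the client has to match it. The client here means the <sphinxapi.php> file.

Find the API file from that version and replace the one currently in use.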