
Extracting a Web Page Title in PHP and Stripping Irrelevant SEO Keywords


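Scenario:

  Until now, when extracting a web page's title, we simply took whatever sits between <title> and </title>. In practice that is not quite what we want. Take an article on javaeye, http://www.iteye.com/news/21643 : its <title> content is "10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯" (the article title, then the section name "非技术", i.e. "non-technical", then the site name), but when we cite the page the title we expect is just "10年软件开发教会我最重要的10件事". The title has a pile of unrelated keywords tacked on at the end (presumably for SEO), and we would like to filter them out. The following approaches are worth considering: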


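1. Look for tags such as h1. (After analysing sina news and a few other sites I decided this is not workable; there is too much noise.)

2. Strip the title out of the full text, then split the <title> content (on _ | -) into a1, a2, a3, a4 and search the full text for the longest segment, say a3. If it is found, extend to the left through a2 and then a1, each time searching for the joined string, until a lookup fails. Once the left side fails, extend to the right in the same way. (This is the approach I use here.)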
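
The first half of approach 2 (split, then look up the longest segment) can be sketched roughly as follows. This is only an illustration: the helper names splitTitleSegments and longestSegmentIndex are hypothetical, the page body is made up, and whitespace is treated as a separator in addition to _ | -.

```php
<?php
// Hypothetical helper (not part of the original class): split a raw <title>
// string on runs of _ | - and whitespace, dropping empty pieces.
function splitTitleSegments(string $title): array
{
    $parts = preg_split('/[-_\s|]+/u', $title, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Hypothetical helper: index of the longest segment, measured in bytes.
function longestSegmentIndex(array $segments): int
{
    $best = 0;
    foreach ($segments as $i => $segment) {
        if (strlen($segment) > strlen($segments[$best])) {
            $best = $i;
        }
    }
    return $best;
}

$title    = '10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯';
$body     = '<h3>10年软件开发教会我最重要的10件事</h3><p>...</p>'; // made-up page body
$segments = splitTitleSegments($title);
$longest  = $segments[longestSegmentIndex($segments)];

// The longest segment occurs in the body, so it becomes the starting point.
// The other two segments do not, so extending towards them would fail.
var_dump($longest, strpos($body, $longest) !== false);
```

Here the title splits into three segments, the first one is the longest and is found in the body, and neither "非技术" nor "ITeye资讯" occurs there, so both would be dropped as SEO keywords.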



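<?php
/**
 * @date: 2011-06-18
 * Description: Given the content of a web page, extract its title. The extracted title excludes SEO keywords.
 * e.g: extracting a news headline directly yields "大学英语四六级本周六开考 909万人参考_新浪教育_新浪网",
 *       but the result we want is "大学英语四六级本周六开考 909万人参考".
 * Scope: title extraction for final article pages; special-topic pages and the like are not covered.
 */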

class TitlePurify{

    private $matches_preg = '[-_\s|—]';

    function getTitle($contents){/*{{{*/
        // Grab everything between <title> and </title>; the "s" flag lets "." span newlines.
        $preg = '/<title[^>]*>(.*?)<\/title>/is';
        if(!preg_match($preg, $contents, $matches)){
            return "Title extraction failed";
        }
        $title = trim($matches[1]);
        return $this->trimTitle($title, $contents);
    }/*}}}*/

    function trimMeta($contents){/*{{{*/
        // First strip the <title> element and every <meta> tag, so lookups only hit body text.
        $preg       = '/<title[^>]*>.*?<\/title>/is';
        $contents   = preg_replace($preg, '', $contents);
        $preg       = '/<meta[^>]*>/i';
        $contents   = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/


    // Return the index of the longest item (length measured in bytes).
    function getMaxIndex($titles){/*{{{*/
        $maxItemIndex   = 0;
        $maxLength      = 0;
        $loop           = 0;
        foreach($titles as $item){
            if(strlen($item)>$maxLength){
                $maxLength      = strlen($item);
                $maxItemIndex   = $loop;
            }        
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/

    function trim($title, $titles, $contents, $maxItemIndex){/*{{{*/
        //@todo : $contents could be optimized here
        // $tempTitle always holds the longest candidate confirmed to appear in the body.
        $tempTitle  = $titles[$maxItemIndex];
        $count      = count($titles);
        // Extend to the left from the current index (until the first item is reached or a lookup fails).
        $leftIndex  = $maxItemIndex-1;
        while($leftIndex>=0){
            // Capture the separator run in front of the confirmed part; escape it so
            // characters such as ( ? / in a title cannot break the pattern.
            $quoted = preg_quote($tempTitle, '/');
            if(!preg_match("/({$this->matches_preg}+{$quoted})/iu", $title, $matches)){
                break; // Should not happen under normal circumstances.
            }
            $candidate = $titles[$leftIndex] . $matches[1];
            // If the extended candidate is not in the body, keep $tempTitle as it was and stop.
            if(stripos($contents, $candidate) === false){
                break;
            }
            $tempTitle = $candidate;
            $leftIndex--;
        }
        // Once the left side fails, extend to the right in the same way.
        $rightIndex = $maxItemIndex+1;
        while($rightIndex<$count){
            $quoted = preg_quote($tempTitle, '/');
            if(!preg_match("/({$quoted}{$this->matches_preg}+)/iu", $title, $matches)){
                break; // Should not happen under normal circumstances.
            }
            $candidate = $matches[1] . $titles[$rightIndex];
            if(stripos($contents, $candidate) === false){
                break;
            }
            $tempTitle = $candidate;
            $rightIndex++;
        }
        return $tempTitle;

    }/*}}}*/

    function trimTitle($title, $contents){/*{{{*/

        $contents = $this->trimMeta($contents);
        // Split the title on runs of separators, dropping empty pieces.
        $titles = preg_split("/{$this->matches_preg}+/u", $title, -1, PREG_SPLIT_NO_EMPTY);
        if(empty($titles)){
            return $title;
        }

        // Look up the longest item in the full text.
        $maxItemIndex = $this->getMaxIndex($titles);
        $tempTitle   = $titles[$maxItemIndex];
        // If it is not found, fall back to the raw title.
        if(stripos($contents, $tempTitle) === false){
            return $title;
        }
        return $this->trim($title, $titles, $contents, $maxItemIndex);
    }/*}}}*/

}

// -------------   test code ------------------------------
function convertEncoding($contents){
    // Read the charset declared in the page; assume UTF-8 when none is found.
    preg_match('/charset=["\']?([\w\-]+)/i', $contents, $match);
    $charset = isset($match[1])? $match[1] : 'UTF-8';
    $contents = mb_convert_encoding($contents, 'UTF-8', $charset);
    return $contents;
}

$url = 'http://china.nba.com/news/4/2011/0617/61383331/10451.html';
$contents = file_get_contents($url);
if($contents === false){
    exit("Failed to fetch $url\n");
}
$contents = convertEncoding($contents);

// microtime(true) returns a float in seconds, so the two readings can be subtracted.
$startTime  = microtime(true);
$purify     = new TitlePurify();
$title      = $purify->getTitle($contents);
$endTime    = microtime(true);

echo "Title:        $title\n";
echo "cost: " . ($endTime-$startTime) . " s\n";

?>
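
For a check that needs no network access, the same idea can be sketched as a single function and run against an inline fixture. This is a variant for illustration, not the class above: purifyTitle is a hypothetical name, the HTML fixture is made up around the Sina headline quoted in the header comment, segments are rejoined by slicing the original title at byte offsets rather than by regex, and lookups are case-sensitive strpos calls.

```php
<?php
// Hypothetical single-function variant; returns null when no <title> is found.
function purifyTitle(string $html): ?string
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    $title = trim($m[1]);
    // Text to search: the page without its <title> element and <meta> tags.
    $body = preg_replace(['/<title[^>]*>.*?<\/title>/is', '/<meta[^>]*>/i'], '', $html);

    // Each entry is [segment, byte offset of that segment in $title].
    $parts = preg_split('/[-_\s|]+/u', $title, -1,
                        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    if (!$parts) {
        return $title;
    }

    // Start from the longest segment (byte length).
    $start = 0;
    foreach ($parts as $i => $part) {
        if (strlen($part[0]) > strlen($parts[$start][0])) {
            $start = $i;
        }
    }
    if (strpos($body, $parts[$start][0]) === false) {
        return $title; // nothing to anchor on, keep the raw title
    }

    // Slice of the original title covering segments $from..$to, separators included.
    $slice = function (int $from, int $to) use ($title, $parts): string {
        $begin = $parts[$from][1];
        $end   = $parts[$to][1] + strlen($parts[$to][0]);
        return substr($title, $begin, $end - $begin);
    };

    // Extend left, then right, for as long as the joined string occurs in the body.
    $left = $right = $start;
    while ($left > 0 && strpos($body, $slice($left - 1, $right)) !== false) {
        $left--;
    }
    while ($right < count($parts) - 1 && strpos($body, $slice($left, $right + 1)) !== false) {
        $right++;
    }
    return $slice($left, $right);
}

// Made-up fixture built around the headline from the header comment.
$html = '<html><head><title>大学英语四六级本周六开考 909万人参考_新浪教育_新浪网</title>'
      . '<meta name="keywords" content="新浪教育,新浪网"></head>'
      . '<body><h1>大学英语四六级本周六开考 909万人参考</h1><p>...</p></body></html>';

echo purifyTitle($html), "\n"; // prints: 大学英语四六级本周六开考 909万人参考
```

Slicing the original title keeps whatever separator sat between two segments (here a single space), which is why the two halves of the headline come back joined exactly as they appear in the page.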
