Extracting a web page title in PHP and stripping irrelevant SEO keywords

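Scenario:

When extracting a web page title, we used to simply take whatever sits between <title> and </title>. In practice that is not quite what we want. Take a JavaEye article, http://www.iteye.com/news/21643: its <title> content is "10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯", but the title we actually want to quote is "10年软件开发教会我最重要的10件事". The real title is followed by a pile of unrelated keywords (presumably for SEO), and we want to filter them out. The following approaches are possible: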

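1. Look for h1 and similar tags. (After analyzing Sina News and a few other sites, I found this impractical: there is too much noise.)

2. Remove the <title> element from the full page text, then split the <title> content on the separators (_ | -) into a1, a2, a3, a4. Starting with the longest segment (say a3), search for it in the full text. If it is found, iterate to the left, trying a2 and then a1, until a search fails. Once the left side fails, iterate to the right in the same way. (This is the approach I use here.)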
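The steps of method 2 can be sketched compactly as follows. This is only a minimal sketch under a few assumptions, not the author's class: the function name `purifyTitleSketch` is hypothetical, it assumes PHP 7.1 or later, it uses plain `stripos()` substring search instead of regular expressions against the page body, and it falls back to the full title when even the longest segment cannot be found in the body. Separators are kept while splitting (`PREG_SPLIT_DELIM_CAPTURE`) so that neighbouring segments can be re-joined exactly as they appeared in the title.

```php
<?php
// Hypothetical helper for illustration; not part of the TitlePurify class.
function purifyTitleSketch(string $html): ?string
{
    // 1. Take the raw text between <title> and </title>.
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    $title = trim($m[1]);

    // 2. Remove <title> and <meta> so the title can only be found in the page body.
    $body = preg_replace('/<title[^>]*>.*?<\/title>|<meta[^>]*>/is', '', $html) ?? $html;

    // 3. Split on _ | - and keep the separators: segments sit at even
    //    indexes of $parts, separators at odd indexes.
    $parts = preg_split('/(\s*[-_|]\s*)/', $title, -1, PREG_SPLIT_DELIM_CAPTURE);

    // 4. Find the longest segment and look it up in the body.
    $best = 0;
    for ($i = 0; $i < count($parts); $i += 2) {
        if (strlen($parts[$i]) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    $result = $parts[$best];
    if ($result === '' || stripos($body, $result) === false) {
        return $title; // nothing to anchor on: fall back to the full title
    }

    // 5. Grow to the left while the longer string still occurs in the body.
    for ($lo = $best; $lo >= 2; $lo -= 2) {
        $candidate = $parts[$lo - 2] . $parts[$lo - 1] . $result;
        if (stripos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    // 6. Then grow to the right in the same way.
    for ($hi = $best; $hi + 2 < count($parts); $hi += 2) {
        $candidate = $result . $parts[$hi + 1] . $parts[$hi + 2];
        if (stripos($body, $candidate) === false) {
            break;
        }
        $result = $candidate;
    }

    return $result;
}

// Usage: the hyphen inside the real title is re-joined because
// "Foo-bar guide" occurs in the body, while " | Site" does not.
$html = '<html><head><title>Foo-bar guide | Site</title></head>'
      . '<body><h1>Foo-bar guide</h1></body></html>';
echo purifyTitleSketch($html), "\n"; // Foo-bar guide
```

The sketch compares raw bytes, so it assumes the title appears in the body with the same whitespace and HTML entity encoding as in <title>; real pages may need normalization first.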

PHP code:
<?php
/**
* @author pqcc <struts.ec@mgail.com>
* @date: 2011-06-18
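* Description: Given the content of a web page, extract its title. The extracted title excludes SEO keywords.
* e.g.: extracting a news title directly from <title> gives "大学英语四六级本周六开考 909万人参考_新浪教育_新浪网",
*       but the result we want is "大学英语四六级本周六开考 909万人参考".
* Scope: title extraction for final article pages; topic pages and the like are not covered.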
*/

class TitlePurify{

    private $matches_preg = '[-_\s|—]';

    function getTitle($contents){/*{{{*/
        // The s flag lets "." span line breaks inside <title>.
        $preg = '/<title[^>]*>(.*?)<\/title>/is';
        preg_match($preg, $contents, $matches);
        if(count($matches)<=1){
            return "Title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }/*}}}*/

    function trimMeta($contents){/*{{{*/
        // First strip the <title> element and all <meta> tags.
        $preg       = '/<title[^>]*>(.*?)<\/title>/is';
        $contents   = preg_replace($preg, '', $contents);
        $preg       = '/<meta[^>]*>/i';
        $contents   = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/

    // Return the index of the longest item.
    function getMaxIndex($titles){/*{{{*/
        $maxItemIndex   = 0;
        $maxLength      = 0;
        $loop           = 0;
        foreach($titles as $item){
            if(strlen($item)>$maxLength){
                $maxLength      = strlen($item);
                $maxItemIndex   = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/

    function trim($title, $titles, $contents, $maxItemIndex){/*{{{*/
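        // @todo: the search over $contents could be optimized here.
        // On a successful search, result = tempTitle.
        $tempTitle  = $titles[$maxItemIndex];
        $result     = $tempTitle;
        $count      = count($titles);
        // Iterate to the left from the current index (stop at the first item or on a failed match).
        $leftIndex  = $maxItemIndex-1;
        while($leftIndex>=0){
            // Capture the separator run in front of tempTitle so it can be joined with the left neighbour.
            // preg_quote() keeps regex metacharacters in the title literal; the u flag treats "—" as one character.
            $quoted = preg_quote($tempTitle, '/');
            preg_match("/({$this->matches_preg}+{$quoted})/iu", $title, $matches);
            if(count($matches)>1){
                // temp keeps the previous value so we can roll back if the match fails.
                $temp       = $tempTitle;
                $tempTitle  = $titles[$leftIndex] . $matches[1];
                // Search the page contents for the extended tempTitle.
                preg_match('/' . preg_quote($tempTitle, '/') . '/i', $contents, $matches);
                // If the search fails...
                if(count($matches)<1){
                    $tempTitle = $temp;
                    break;
                }else{
                    $result = $tempTitle;
                }
            }else{ // Should not happen under normal circumstances.
                break;
            }
            $leftIndex--;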
