分析php计算文本字符串相似度函数similar_text()的原理_php教程

分析php计算文本字符串相似度函数similar_text()的原理

PHP有个计算两个文本字符串相似度的函数similar_text()，可以得出一个百分比来表示两个字符串的相似程度。效果如下：

similar_text('aaaa', 'aaaa', $percent);
var_dump($percent);
//float(100)
similar_text('aaaa', 'aaaabbbb', $percent);
var_dump($percent);
//float(66.666666666667)
similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
//float(85.714285714286)

利用这个函数，可以用来做模糊搜索的功能，或者其他需要模糊匹配的功能。最近我在验证码识别研究中的特征匹配一步上涉及到了这个函数。

但这个函数具体使用了怎样的算法呢？我研究了他的底层实现，总结为三步：

（1）找出两个字符串中相同部分最长的一段；
（2）再用同样的方法在剩下的两段中分别找出相同部分最长的一段，以此类推，直到没有任何相同部分；
（3）相似度 = 所有相同部分的长度之和 * 2 / 两个字符串的长度之和；

我研究的源代码版本是PHP 5.4.6，相关的代码位于文件php-5.4.6/ext/standard/string.c的第2951~3031行。以下是我加过注释后源代码。

//找出两个字符串中相同部分最长的一段
static void php_similar_str( const char *txt1, int len1, const char *txt2, int len2, int * pos1, int * pos2, int * max )
{
    char * p, *q;

    char * end1 = (char *) txt1 + len1;

    char * end2 = (char *) txt2 + len2;

    int l;

    *max = 0;

    /* 以第一个字符串为基准开始遍历 */

    for ( p = (char *) txt1; p < end1; p++ )
    {
        /* 遍历第二个字符串 */

        for ( q = (char *) txt2; q < end2; q++ )
        {
            /* 发现有字符相同，继续循环找，l为相同部分的长度 */

            for ( l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++ )
                ;

            /* 冒泡方法找出最长的一个l，并记住相同部分的开始位置 */

            if ( l > *max )
            {
                *max    = l;
                *pos1    = p - txt1;
                *pos2    = q - txt2;
            }
        }
    }
}

//计算两个字符串的相同部分的总长度
static int php_similar_char( const char *txt1, int len1, const char *txt2, int len2 )
{
    int sum;

    int pos1, pos2, max;

    /* 找出两个字符串相同部分最长的一段 */

    php_similar_str( txt1, len1, txt2, len2, &pos1, &pos2, &max );

    /* 这里是对sum的初始赋值，也是对max值的判断 */

    /* 如果max为零，表示两个字符串没有任何相同的字符，也就会跳出if */

    if ( (sum = max) )
    {
        /* 对前半段递归，相同段长度累加 */

        if ( pos1 && pos2 )
        {
            sum += php_similar_char( txt1, pos1,

                         txt2, pos2 );
        }

        /* 对后半段递归，相同段长度累加 */

        if ( (pos1 + max < len1) && (pos2 + max < len2) )
        {
            sum += php_similar_char( txt1 + pos1 + max, len1 - pos1 - max,

                         txt2 + pos2 + max, len2 - pos2 - max );
        }
    }

    return(sum);
}

//PHP函数定义
PHP_FUNCTION( similar_text )
{
    char *t1, *t2;

    zval **percent = NULL;

    int ac = ZEND_NUM_ARGS();

    int sim;

    int t1_len, t2_len;

    /* 检查参数合法性 */

    if ( zend_parse_parameters( ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent ) == FAILURE )
    {
        return;
    }

    /* 如果有第三个参数 */

    if ( ac > 2 )
    {
        convert_to_double_ex( percent );
    }

    /* 如果两个字符串长度都为0，返回0 */

    if ( t1_len + t2_len == 0 )
    {
        if ( ac > 2 )
        {
            Z_DVAL_PP( percent ) = 0;
        }

        RETURN_LONG( 0 );
    }

    /* 调用上面的函数，计算两个字符串的相似度 */

    sim = php_similar_char( t1, t1_len, t2, t2_len );

    /* 可以看到percent的计算公式 */

    if ( ac > 2 )
    {
        Z_DVAL_PP( percent ) = sim * 200.0 / (t1_len + t2_len);
    }

    RETURN_LONG( sim );
}

另外，PHP还提供了另外一个计算字符串相似度的函数levenshtein()，通过计算两个字符串的编辑距离来表示字符串相似度，这也是一种很常见的算法。

levenshtein()的性能相比similar_text()要好一些，因为通过前面的代码分析可以看到，similar_text()的复杂度是O(n^3)，n表示最长字符串的长度，而levenshtein()的复杂度为O(m*n)，m与n分别为两个字符串的长度。

相关阅读：

分析php计算文本字符串相似度函数similar_text()的原理

相关文章

PHP之友评论