N元语言模型的训练方法

news/2024/7/10 3:56:25 标签: 语言, 算法, 优化, 工具, 微软, c
cle class="baidu_pl">
cle_content" class="article_content clearfix">
content_views" class="htmledit_views"> color:rgb(51,51,51); font-family:Arial; font-size:14px; line-height:26px">

------------------------------------------------------------------
color:rgb(32,88,103)">大家好࿰c;我是Bright࿰c;class="tags" href="/tags/WeiRuan.html" title=微软>微软拼音的开发工程师。我之前介绍了color:rgb(0,102,167)">class="tags" href="/tags/YuYan.html" title=语言>语言模型的基本概念color:rgb(32,88,103)">࿰c;本文介绍一下N-gramclass="tags" href="/tags/YuYan.html" title=语言>语言模型的训练方法。
------------------------------------------------------------------

模型的训练也称为模型的参数估计࿰c;参数可以用下式估计:

c="https://bjgmsg.bay.livefilestore.com/y1m9Ir8aMmBDYJ9H39O-lAb35_T2rHHLgPkIbuQJCybpHx_34WC7De7ChKdcUVEjlNjgV_KqIZUISnKNiKA6QOfkhW4nIfCixJyulKoJDrkOIienHUUCZwCe4WxMJKdPn3ORZIlHcSjAWPthTbg6i_c9w/%E7%AE%97%E5%BC%8F1_1_thumb[7].png?download&psid=1" border="0" alt="算式1_1" width="436" height="16" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />                  (1)

这样的模型是以词语为基本单位࿰c;但是汉语文本没有空格分隔࿰c;因此需要先对汉语文本进行分词处理࿰c;再在分好词的语料上统计n元对的出现次数。

class="tags" href="/tags/YuYan.html" title=语言>语言模型的质量依赖于分词语料的质量。为了获得良好的分词语料࿰c;可以先用分词class="tags" href="/tags/GongJu.html" title=工具>工具对未分词语料(生语料)进行自动化的分词标注࿰c;然后对其中可能分词错误的地方进行人工校对࿰c;最后得到的语料称为熟语料。根据是否需要熟语料࿰c;训练方法分为有监督和无监督的两种方式。

有监督的训练方法

有监督的训练方法比较简单。先统计n元对的出现次数࿰c;然后采用最大似然估计的方法对参数进行估计(如公式1)。

无监督的训练方法

无监督的训练方法需要适当规模的生语料和词表࿰c;然后采用EMclass="tags" href="/tags/SuanFa.html" title=算法>算法迭代地对class="tags" href="/tags/YuYan.html" title=语言>语言模型的参数进行调整。EM class="tags" href="/tags/SuanFa.html" title=算法>算法是 Dempster Laind Rubin 于 1977 年提出的求参数极大似然估计的一种方法࿰c;它可以从非完整数据集中对参数进行估计࿰c;是一种非常简单实用的学习class="tags" href="/tags/SuanFa.html" title=算法>算法。

假设我们有一组语料࿰c;其表示为c="https://bjgmsg.bay.livefilestore.com/y1me7usE-7gTiSSD2ZioT1NkpK9N2YC43ohrkivi2atq8qeVx-RNFW14RVhv8UzUmSTZ3JhSuE6Ta7j3ozryS7Ejm8kJImzEpTwuWKKpQ-l-pUm7RBVCuV3imNKNB2KlxkHZ48cQxpLKKa1V9iIMtxU8w/%E7%AE%97%E5%BC%8F2_thumb[9].png?download&psid=1" border="0" alt="算式2" width="90" height="14" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />࿰c;词表c="https://bjgmsg.bay.livefilestore.com/y1mhAU0Z2tXYgcfGneVb3hai4qjQh8UviXw-CAZ2F1Z5Y6nhQnSWn4WLqyjKZPDMc1Zvf9Xz_RimXEE2fA4w_nFnNYKUvaIUgvPWeQKP0IR_oU94hm6JW9NCROkTN_V-7pqi-sB9hetUww2VWc32Ku84A/%E7%AE%97%E5%BC%8F3_thumb[8].png?download&psid=1" border="0" alt="算式3" width="138" height="14" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />。我们期望将语料分成最理想的形式c="https://bjgmsg.bay.livefilestore.com/y1mC9-jHGIkLHcGZ7oNE7NOqsO3UN12WnxnX-_ld6eJYhmlZVTTnmtN8pypS0cg9I_clGV7mJOmVhCmN-GmY9ap7d41lWT62C9o84HRV9tTFY43kUaNPG6YZqY-Zp1FUOtLd6S2D0-coYHUqPKE8ZylEA/%E7%AE%97%E5%BC%8F4_thumb[9].png?download&psid=1" border="0" alt="算式4" width="155" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />࿰c;我们的class="tags" href="/tags/YouHua.html" title=优化>优化目标可以表示为c="https://bjgmsg.bay.livefilestore.com/y1muQ8uhwQ3eT-KfTP0tVX2U-C5r8Z-S4ixdFH-H6Fgpwam6FN1Kn54uObIEtJ7EziAn3Ms8CdZT-aCb4Z0JjSckvTChRzEyFhX-OLqeh5BoUGrbcs4tWh0R73RW5WR1cCFUobG2Vu4N-5RI5ykR_YJQg/Capture18_thumb[5].png?download&psid=1" border="0" alt="Capture18" width="82" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />࿰c;它就是原始语料在分词结果c="https://bjgmsg.bay.livefilestore.com/y1mkOgS3Gai6ET0_MG0I1laCngLBsLHc4iufB67a46u-trSkQAH0nW5qYPWIbfbnwbcqEy1pJPZ39KtXsIqgUTwOjC_-LiUsIaJdIlf3_UJV1z0atm3FQbEbA4-PjvXwR73ilvoLfgJD18CbvjkPAx6hg/%E7%AE%97%E5%BC%8F5_thumb[13].png?download&psid=1" border="0" alt="算式5" width="86" height="14" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />时的概率࿰c;我们期望这个概率最大࿰c;因此࿰c;最佳的c="https://bjgmsg.bay.livefilestore.com/y1mn8hgPhJdXW8NMd5TdYG6wYqg27MTg32eUFCZDuSMD1VRTfLpRPDQhmpn2HpIu4d4-PX5prHzVi0R0vYYPFJyR4Et8z-dWM2gHic5H2TyCmQUPsWMxZycI4V5QYVWZrHi4ZhpHVuJ39xPrKeLEIMNRw/%E7%AE%97%E5%BC%8F0_thumb[8].png?download&psid=1" border="0" alt="算式0" width="16" height="15" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />为:

c="https://bjgmsg.bay.livefilestore.com/y1mBSkws16GPsQc2jp00tZru2rMOMDqfpcYDOWX3SyZ0DU-iZlMlHJXgDhP-aQzM1r7fPmqsBPXIqHXN-TNCWipws5pWr9rz34MLdzs6pxjYIpWk3HoNm-8KrVilIa4EZhhR9p-MPjqbEB3X-c-ObGzTA/%E7%AE%97%E5%BC%8F6_thumb[12].png?download&psid=1" border="0" alt="算式6" width="347" height="19" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />                                  (2)

基于EMclass="tags" href="/tags/SuanFa.html" title=算法>算法的LM参数估计过程如下:

初始化:首先随机地初始化或者根据某种先验知识初始化模型的参数θ。基于词表获得原始语料得各种分词结果c="https://bjgmsg.bay.livefilestore.com/y1mgkZLp1SWmbozsYesmMhCzTpVjVG2tVX4G1QSWQ3DSc6LmmxrPMYI_Nqx7SuPFoHIByjWD1P7HEYXKqbeaAAdDA5vIhcaEYvW2ywfoqir6qI2YH1Zvyu7dwng0w8TCRvjB_Y1seTUV6BLQ7E13UvykQ/%E7%AE%97%E5%BC%8F7_thumb[6].png?download&psid=1" border="0" alt="算式7" width="73" height="15" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />。

E-step:利用当前class="tags" href="/tags/YuYan.html" title=语言>语言模型的参数 ࿰c;估计各个分词结果的概率c="https://bjgmsg.bay.livefilestore.com/y1mAvNbZ65WO4XJhhMvTJe1SJaOY1FUYF978dIH9XQuK1GOJPFCr9vVl6pTjJmDOuYrFD0tZsc6mFBBaCTuLneUrmYQieFntbFSVJKKLTk-preuHsp3PGFZ44pHA1zhGPvy6i86LuK41zOQALGwJ43W3g/%E7%AE%97%E5%BC%8F8_thumb[11].png?download&psid=1" border="0" alt="算式8" width="85" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />。

M-step:对于原始语料的各种分词形式c="https://bjgmsg.bay.livefilestore.com/y1mUJflVEnVDGqlv5bAhL9yTbWWyXdrl8c9FXu_NxkQsLXASEjtqrDJuvPQxzx7We2G5o9Fsfw3BA2h3BTYAUy60Bw4cCcyalGrEsOblcAggixMUIHMnaeF7igCm0JdBRl4SbfsP6P03RdxDRK7J1EbFA/%E7%AE%97%E5%BC%8F7_thumb[7].png?download&psid=1" border="0" alt="算式7" width="73" height="15" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />࿰c;统计n元对的数量࿰c;并按照概率c="https://bjgmsg.bay.livefilestore.com/y1mryolO_AOhTsrqLeeDXCZWmDatof1m_jjUdDJWdzkOtVB3X9lqn4uJz-gql3iEgtxu7Uh192ARzkmiSXkzv8XSJvJej56uqFU7_twpS2PhNKG0URdUwhky8fa9WmpEsg_sQcktiS6g5YGsVDz6PjMbQ/%E7%AE%97%E5%BC%8F8_thumb[12].png?download&psid=1" border="0" alt="算式8" width="85" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />进行加权࿰c;然后更新模型的参数࿰c;得到新的参数θ。

如此反复执行E-step和M-step࿰c;直到达到模型收敛。

收敛的条件可以是:在原始语料上的最佳分词结果的概率(或者迷惑度)在两次相邻两次迭代的变化小于某固定阈值。

由于原始语料的所有可能的分词结果数量很大࿰c;c="https://bjgmsg.bay.livefilestore.com/y1myOIjcNx7NIKDIC7grk1d7KQSOsuU9MuBynEvl8-WNWwk5MVUa0kU0CfE8YiZm83Zxm4pK6YbnhwTatgh_-FxxLojGleSokIGk6EIEF_D9Kqonf1Ek6IPaHS7ptCYuDx0BvV3Nr0_mVhytl0O4gta6g/%E7%AE%97%E5%BC%8F9_thumb[6].png?download&psid=1" border="0" alt="算式9" width="241" height="21" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />计算各分词结果的概率就变得很困难。为此可以采用“前向-后向”(动态规划)class="tags" href="/tags/SuanFa.html" title=算法>算法࿰c;其可以较快速地实现n元对的统计。对unigram来说࿰c;假设我们想统计词语Wi的次数࿰c;基于“前向-后向”动态规划的统计class="tags" href="/tags/SuanFa.html" title=算法>算法如下:

对于语料CWi出现的所有可能位置࿰c;如c="https://bjgmsg.bay.livefilestore.com/y1mrPKmdGASHDthNgudCNgQw9S38OCEZha1CXIohwHUs2tSgzD_fMIodCcDT6al7Uv1MPmpRX3ji190jDWu3BQ4rRNvtUvbrF-_KAy_BsVT8UeaVn7eLNkHMsQZIAS25HfQyXqBzAxPzMPIeenmHA-LWQ/%E7%AE%97%E5%BC%8F10_thumb[9].png?download&psid=1" border="0" alt="算式10" width="121" height="15" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />。计算如下:

c="https://bjgmsg.bay.livefilestore.com/y1mGnlzkNIDfov3rlTSC06z0JmZN72OGLjz-o9_tJ1lP9HqTBXSrLKCPmV12h7g5BieDKjQS8AMUAJvuq1AcEw1zmu4ctL1Zdt4PIpSUo8uxDvEaSZsF_ZVowO-TF8A0CgWGiddRGjXPcd6GpzSgbzttg/%E7%AE%97%E5%BC%8F11PNG_thumb[10].png?download&psid=1" border="0" alt="算式11PNG" width="292" height="26" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />                                  (3)

                  c="https://bjgmsg.bay.livefilestore.com/y1mFw143J19dND8tyYA5pPHv3qOg15GlEULXJU03LoJP1uikJYYvGl_Mkm4I1AV1gbzPNKhnKqs0TR4rDtggChzYgd0lf5t6JDJiQYQsA37h5S2QqMQl78kmRqt7AvuNTyjJlaC-3brl-TgoiMkPWFmjg/%E7%AE%97%E5%BC%8F12_thumb[7].png?download&psid=1" border="0" alt="算式12" width="213" height="24" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />                                    (4)

                  c="https://bjgmsg.bay.livefilestore.com/y1mFw_amsIXkKdtYkACb6D_aQd28k0KrPSWj21uvgx8w37Jl0LMjsgI8bidRueqDWhii3unCFcCOuJWyLuz_kgdHrgM4oBGqKzY1ZFDYiNykwV1TtCUAo1FpxAVbL7LoUDPb0UCdr9a7yBH2MI1HlyKGA/%E7%AE%97%E5%BC%8F13_thumb[8].png?download&psid=1" border="0" alt="算式13" width="262" height="25" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />                        (5)

Z是概率归一化因子࿰c;保证语料所有可能的切分结果的概率总和为1:

c="https://bjgmsg.bay.livefilestore.com/y1m5D7N8_mT6CslNnEtelFDTex3sZDn_OieZbAUh2eO27O1fl5txRBJDEe99K2h6NkVWctpTDguFEkQlnniz7jvPMm8JH9891H0F-iUXoWippejRBD8qvXYNoJY10GADCSjyNXWriqUGclcVUdkaEfDDQ/%E7%AE%97%E5%BC%8F14_thumb[6].png?download&psid=1" border="0" alt="算式14" width="167" height="23" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />     (6)

其中c="https://bjgmsg.bay.livefilestore.com/y1mA2WoxwvuETLRN3mgqCjjkpdzvieHk9M03FTGp9f5vPytQbLY9zL-ofVD3WqmScCVqhldcuk6Wst-eEP_8FnmM9EWJOCY9eW3aLyOmSIIsQleHcwAk27q8pqGxMMdruCWflHabp2ryvNxTmf1-gLbMg/%E7%AE%97%E5%BC%8F18_thumb[1].png?download&psid=1" border="0" alt="算式18" width="54" height="24" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />称为前向概率࿰c;可能前向class="tags" href="/tags/SuanFa.html" title=算法>算法计算࿰c;c="https://bjgmsg.bay.livefilestore.com/y1m96ESF29O2aehyeoZ2olJn-6fD29cSY7DUY943gf-sLro4v9TFUDgblYMGUElHZ8ftqUkXTmkob6wpQFYBBt4VuIvCYKgEBqMELMpwYkaJdFNftdyg-568-P5LmxemGc1PacDixsVg-YxsLy4uBv2IQ/%E7%AE%97%E5%BC%8F19_thumb[1].png?download&psid=1" border="0" alt="算式19" width="62" height="25" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />称为后向概率࿰c;可能后向class="tags" href="/tags/SuanFa.html" title=算法>算法计算。

前向class="tags" href="/tags/SuanFa.html" title=算法>算法和后向class="tags" href="/tags/SuanFa.html" title=算法>算法很类似࿰c;一个是从前向后边扫描边计算࿰c;一个是从后向前边扫描边计算࿰c;都是动态规划class="tags" href="/tags/SuanFa.html" title=算法>算法。这里仅给出前向class="tags" href="/tags/SuanFa.html" title=算法>算法࿰c;前向class="tags" href="/tags/SuanFa.html" title=算法>算法的递归函数定义为:

c="https://bjgmsg.bay.livefilestore.com/y1mEjTkfbE7tdoCEFXXdZr0Jb_ftU-Nmaho0U1ex99YBdPe2Eo7z8nPv3h50UvFSZeOr_kZ0aJIfFWKQ-Qq5DetZSrKF0m8yK09Ki2Q5rAaNMrCDThMZAfUKJYZzAb2_0Hr9n5rV7JM7kNUlSgbH4bWlA/%E7%AE%97%E5%BC%8F15_thumb[3].png?download&psid=1" border="0" alt="算式15" width="391" height="70" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />       (7)

其中c="https://bjgmsg.bay.livefilestore.com/y1mDi-E1TnnuR3YbuZmPKLqFfRVwOpmdXsgUPHaSEHEbOTNOAKOBMe61Ola1BmOcHMtcKedfzUdzlhiebDhKDY-ObRs5fg93BO672pozjRd2wToNy0Z-AKZ7SY8IjX5Hnel84pARBKfb4axf7ap5ve63A/%E7%AE%97%E5%BC%8F16_thumb[6].png?download&psid=1" border="0" alt="算式16" width="126" height="16" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />为词语c="https://bjgmsg.bay.livefilestore.com/y1mufN6XDCHwM_ssHuVrWvrowNUwmPq_WIjMccYNy63cwXMFm--CljbBZv_MAeO1O7LeeLjOgD4Ua_XDwOQrxX7djtWR4cLOUKloeMfHxWcvJQZpEjxa9JUz0dUFCtYfU76VNEGdUYt7e_qZGWKktuWCQ/%E7%AE%97%E5%BC%8F17_thumb[10].png?download&psid=1" border="0" alt="算式17" width="83" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />的unigram概率࿰c;如果c="https://bjgmsg.bay.livefilestore.com/y1mMS0zD0d51Iuhh62D2lWKVCc4XTUDiwGRHCTxU6hwyEp9JxP7FNBz7ifWahRVZt2viwCZWoqZaeu4Nw_H_MpHXcuvHl5-UrHU4xOzPihCmpGL64RmBLdyEXyIZs_86feTVcXvwJMd1SfqVU51QFEofw/%E7%AE%97%E5%BC%8F17_thumb[11].png?download&psid=1" border="0" alt="算式17" width="83" height="13" style="border-top-style:none; border-right-style:none; border-bottom-style:none; border-left-style:none; border-width:initial; border-color:initial; padding-left:0px; padding-right:0px; display:inline; padding-top:0px; border-top-width:0px; border-right-width:0px; border-bottom-width:0px; border-left-width:0px" />不是一个词语࿰c;其概率为0。

上面是unigram的“前向-后向”动态规划统计class="tags" href="/tags/SuanFa.html" title=算法>算法࿰c;读者有兴趣可以自己推导一下bigram或者更高阶ngram的动态规划统计class="tags" href="/tags/SuanFa.html" title=算法>算法。


转自:http://blog.csdn.net/MSPinyin/article/details/6141556

cle>

http://www.niftyadmin.cn/n/1534773.html

相关文章

QueryRunner类实战

先上一个登录代码---判断登录是否成功 1、c3p0-config.xml <?xml version"1.0" encoding"UTF-8"?> <c3p0-config><default-config><property name"user">root</property><property name"password"…

肉沫豆腐做法

切成肉沫&#xff0c;加入盐&#xff0c;生抽&#xff0c;料酒&#xff0c;淀粉腌制五分钟将豆腐切成豆腐块&#xff0c;然后&#xff0c;过热水爆炒蒜&#xff0c;葱&#xff0c;豆瓣酱&#xff0c;火锅底料&#xff0c;肉沫加入水&#xff0c;加入豆腐块加入盐&#xff0c;十…

语言模型训练工具SRILM

SRILM是著名的约翰霍普金斯夏季研讨会&#xff08;Johns Hopkins Summer Workshop&#xff09;的产物&#xff0c;诞生于1995年&#xff0c;由SRI实验室的Andreas Stolcke负责开发维护。 关于SRILM的安装&#xff0c;我已经在前面关于moses平台搭建的文章&#xff08;参见&…

Ubuntu 11.04 32位系统下 SRILM 的配置详解

首先&#xff0c;安装依赖的软件包&#xff1a; 1&#xff0e;C/C compiler&#xff1a;编译器gcc 3.4.3及以上版本&#xff0c;本机实验环境为gcc 4.5.2 2&#xff0e;GNU make&#xff1a;构建和管理工程的工具&#xff0c;解释Makefile里的指令&#xff0c;描述了整个工程所…

使用皮皮影院搭建电影展

ti提纲&#xff1a; 1. 准备工作&#xff0c;下载程序 2. 配置环境 3. 配置cdn 准备工作&#xff0c;买了阿里云服务器&#xff0c;选好域名&#xff0c;电影站 http://movie.ioreq.com 配置百度cdn&#xff0c;购买流量包转载于:https://www.cnblogs.com/webgrip/p/7848338.h…

使用vc编译srilm库

Srilm的全称是Stanford Research Institute Language Modeling Toolkit 他被用来构建和应用统计语言模型&#xff0c;主要用于语音识别&#xff0c;统计标注和切分&#xff0c;以及机器翻译等工作由于需要我要在win32下编译她可是在网上找到很多都是基于cygwin的交叉式编译使用…

Software--WCF

Design Pattern -- Server 服务端&#xff1a; 在售票系统中的使用WCF 为了让客户端能够使用服务&#xff0c;需要创建一个代理。 服务门面的作用是简化客户端与服务之间的交互。 客户端应用程序不需要了解消息传送模式以及如何与服务代理通信。 以 TicketServiceFacade 两个方…