博客 (3)

传统的 LIKE 模糊查询(前置百分号)无法利用索引,特别是多个关键词 OR,或在多个字段中 LIKE,更是效率低下。本文研究对文章进行分词以提高检索的准确度和查询效率。


根据自己的编程语言选择一款合适的中文分词组件,我在 ASP.NET 平台下选择了 jieba.NET。


设想的步骤:

  1. 分别对文章标题、标签、正文进行分词,保存到一张分词表上。该表把“文章 ID”和“词语”设为联合主键,用 3 个字段记录该词语分别在标题、标签、正文中出现的次数,另外还可以按需要添加文章分类 ID、文章创建时间等字段。

  2. 当用户输入关键词进行检索时,先将关键词分词,在分词表中用 in 语法查询到所有相关的记录;

  3. 使用 group by 语法对查询结果按文章 ID 分组;

  4. 关键在排序上,理想的排序是:

    a. 先按搜索关键词中不同词语的出现量排序,即:若搜索关键词分词后是 3 个词语,那么全部包含这 3 个词的文章优先,只匹配其中 2 个词语的其次;

    b. 再按搜索关键词在文中累计出现的次数排序(考虑权重),即:我们先假定标题和标签的分词权重为 5(意思是一个分词在标题中出现 1 次相当于在正文中出现 5 次),那么累加每个分词在标题、标签、正文的权重次数,得分高的优先;

    c. 再进一步考虑文章的发布时间,即将文章的发布时间距离最早一篇文章的发布时间(或一个较早的固定日期)相隔的天数,乘以一个系数加入到权重中,这个系数按不同文章分类(场景)不同,比如新闻类的大一点,情感类的小一点)。乘以系数时一篇文章只加权一次,不要加权到每个分词。

    d. 根据需求还可以加入文章热度(阅读数)的权重。


根据上述逻辑对一个有 18 万篇文章的内容管理系统进行改造,循环所有文章进行分词统计,得到一张包含 5 千万条记录的分词表(系统中部分文章只有标题、标签和外链,没有正文,否则更多)。

由于查询中包含 in、group by、count、sum、运算等,再若分类是无级限的,即文章分类 ID 也是 in 查询,然后分页,即使创建索引,效率也只能呵呵了。


简化:

不对正文进行分词;

不按权重进行排序;


那么分词表的记录数降到 250 万条,同样用 in 查询分词,先按搜索关键词中不同词语的出现量排序,再按发布时间排序,分页后获得一页的文章 ID 集合,再去文章表中 in 获取详细信息(注意保持一页中的排序)。

添加相关索引后,查询一个包含 3 个分词的关键词仅需十几毫秒。因为 in 的内容比较离散,所以索引的利用率比较高。



xoyozo 6 年前
4,397

本文未完成,部分测试方法、条件或结果可能有误,请谨慎参考! :)

本文基于 MySQL 的 InnoDB BTREE 方法的索引进行测试。

以一张包含 2000 万条记录的表做实验:

CREATE TABLE `dt_read`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `time` datetime(0) NOT NULL,
  `a_id` int(11) NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
);

这张表是用于记录文章点击量的,

`id` 为主键,int(11) 自增;

`time` 为非空 datetime,表示文章打开时间,测试数据是从 2017-03-11 至 2018-04-28;

`a_id` 为非空 int(11),表示文章 ID,在此表中不唯一,测试数据是从 1 至 260218。


体验“全表扫描”


首先来体验一下什么是全表扫描,执行下面语句:

SELECT * FROM `dt_read` WHERE `time` < '2020-1-1' LIMIT 10

> 时间: 0.012s


SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' LIMIT 10

> 时间: 7.317s


表中数据是按主键从小到大排列的,当查询条件为 `time` < '2020-1-1' 时,能很快地从表的前端找到 10 条满足条件的数据,所以不再继续判断后面的记录,立刻返回结果,耗时 0.012 秒;但当条件改为 `time` < '2000-1-1' 时,同样逐条判断,直到最后一条也没有找到,这种情况就是所谓的“全表扫描”,耗时 7 秒。


索引对 ORDER BY 的 ASC 和 DESC 的影响


我们给 `time` 建一个索引,同样执行刚才需要全表扫描的语句:

SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' LIMIT 10

> 时间: 0.012s


创建 `time` 的索引后,相当于生成了一张按 `time` 字段排列的新表,这时 MySQL 就能够很快地定位并找到符合条件的记录,避免了全表扫描。


试试按 `time` 倒序排:

SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' ORDER BY `time` DESC LIMIT 10

> 时间: 0.013s


结论:索引对 ORDER BY 的顺序(ASC)和倒序(DESC)都是有效的。


索引字段的次序对 WHERE 和 ORDER BY 的影响


删除所有索引,创建一个新的索引,字段依次为 `time`, `a_id`。

分别执行以下查询:


SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' AND `a_id` < 0 LIMIT 10

> 时间: 0.013s


SELECT * FROM `dt_read` WHERE `a_id` < 0 AND `time` < '2000-1-1' LIMIT 10

> 时间: 0.013s


结论:MySQL 会自动优化 WHERE 条件的次序来匹配最合适的索引。

但在 ORDER BY 中却不是这么回事了:


SELECT * FROM `dt_read` ORDER BY `time`, `a_id` LIMIT 10

> 时间: 0.013s


SELECT * FROM `dt_read` ORDER BY `a_id`, `time` LIMIT 10

> 时间: 14.066s


原因也很好理解,对两个字段进行排序,先后次序肯定会影响结果集,因此只能以 SQL 语句指定的字段次序来 ORDER BY,这样,按索引的字段次序进行 ORDER BY 查询无疑是更快的。


索引中的字段必须依次使用


保持上例创建的索引不变,即 `time`, `a_id`。


SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' AND `a_id` < 0 LIMIT 10

> 时间: 0.013s


SELECT * FROM `dt_read` WHERE `a_id` < 0 LIMIT 10

> 时间: 6.438s


上句合理利用了索引的字段,而下句跳过了 `time`,直接 WHERE 了 `a_id`,这是不受该索引支持的。

我们可以想象一下这张由索引生成的虚拟表,其实就是一张普通的平面二维表格,按索引指定的字段次序进行了排序,所以全表中仅仅是索引指定的第一个字段是按大小排列的,第二个字段是在第一个字段值相同的区域内按大小排列,后同。所以,跳过索引指定的第一个字段直接对第二个字段进行检索,是无法应用该索引的。这个结论也同样也体现在 ORDER BY 语句中:


SELECT * FROM `dt_read` ORDER BY `time`, `a_id` LIMIT 10

> 时间: 0.013s


SELECT * FROM `dt_read` ORDER BY `a_id` LIMIT 10

> 时间: 29.566s


WHERE 和 ORDER BY 混合


保持上例创建的索引不变,即 `time`, `a_id`。


先来执行这两句:


SELECT * FROM `dt_read` ORDER BY `a_id` LIMIT 10

> 时间: 12.29s


SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' ORDER BY `a_id` LIMIT 10

> 时间: 0.013s


仅仅 WHERE 了一个 `time`,对 ORDER BY `a_id` 的效率却有质的提升,是因为 WHERE 中的 `time` 和 ORDER BY 中的 `a_id` 一起找到了索引吗?答案是否定的。

我们把时间改大,让它能马上找到符合条件的数据:


SELECT * FROM `dt_read` WHERE `time` < '2020-1-1' ORDER BY `a_id` LIMIT 10

> 时间: 22.34s


为什么这个语句就不走索引了呢?

其实,一个简单的 SELECT 查询语句,首先执行 WHERE,然后 ORDER BY,最后是 LIMIT。每一步都独自去找了索引,而非 WHERE 和 ORDER BY 混在一起去找索引。必须保证每一步是快的,最终才是快的。

当 `time` < '2000-1-1' 时,WHERE 用到了索引,所以很快,ORDER BY 却没有用到索引,但为什么也很快呢?因为 WHERE 的结果集非常小(示例中为 0 条)。

当 `time` < '2020-1-1' 时,WHERE 也用到了索引,但其结果集非常大(示例中为所有记录),再 ORDER BY `a_id` 就非常慢了,因为我们没有创建以 `a_id` 开头的索引。


现在把索引改成只有 `time` 一个字段。


SELECT * FROM `dt_read` WHERE `time` < '2020-1-1' ORDER BY `a_id` LIMIT 10

> 时间: 6.033s


因为索引里有 `


SELECT * FROM `dt_read` WHERE `time` < '2000-1-1' ORDER BY `a_id` LIMIT 10

> 时间: 0.013s


SELECT * FROM `dt_read` WHERE `a_id` < 0 ORDER BY `time` LIMIT 10

> 时间: 6.033s


第二句先 WHERE `a_id`,后 ORDER BY `time` 是不能匹配所建的索引的。


索引中的字段越多越好


分别在创建索引(`time`)和索引(`time`, `a_id`)的情况下执行下面语句:

本例使用 ORDER BY 而不是 WHERE 来测试是因为,在 WHERE 的多个条件下,如果符合前一条件的筛选结果集过小会导致判断第二条件时数据量不足,无法判断索引是否起作用。


SELECT * FROM `dt_read` ORDER BY `time` LIMIT 10


仅创建索引(`time`)的情况下:

> 时间: 0.013s


仅创建索引(`time`, `a_id`)的情况下:

> 时间: 0.013s


SELECT * FROM `dt_read` ORDER BY `time`, `a_id` LIMIT 10


仅创建索引(`time`)的情况下:

> 时间: 15.015s


仅创建索引(`time`, `a_id`)的情况下:

> 时间: 0.014s


可以看到,在索引字段依次使用的前提下,索引字段数不少于查询字段数才能避免全表扫描。

虽然索引中的字段越多越好,但必须依次使用,否则也是无效索引。


索引对 INSERT / UPDATE / DELETE 的效率影响


分别在创建索引(`time`)和索引(`time`, `a_id`)的情况下执行下面语句:


INSERT INTO `dt_read` (`time`, `a_id`) VALUES ('2018-4-28', 260218)


不建索引的情况下:

> 时间: 0.01s


仅创建索引(`time`)的情况下:

> 时间: 0.01s


同时创建索引(`time`)和索引(`time`, `a_id`)的情况下:

> 时间: 0.01s


UPDATE `dt_read` SET `time` = '2018-4-28' WHERE `id` = 20000000(注:存在该 id 值的记录)


不建索引的情况下:

> 时间: 0.01s


仅创建索引(`time`)的情况下:

> 时间: 0.01s


同时创建索引(`time`)和索引(`time`, `a_id`)的情况下:

> 时间: 0.01s


虽然在 INSERT / UPDATE / DELETE 时数据库会更新索引,但从实测数据来看,索引对其效率的影响可忽略不计。


一些误区


“in 语法效率很低”?

in 语法也是应用索引的,网传 in 会比一个一个 WHERE OR 要慢得多的说法是不靠谱的。in 主键和 in 索引同理。


另外:

对于字符串类型,LIKE '%abc%' 是不能应用索引的,但 LIKE 'abc%' 可以。更多关于字符串类型的索引,请查阅全文索引(FULLTEXT)。

索引的字段是可以指定长度的,类似字符串索引指定前面若干唯一字符就可以优化效率。


本文系个人实践总结,欢迎批评指正!


xoyozo 7 年前
3,775

Apple’s newest devices feature the Retina Display, a screen that packs double as many pixels into the same space as older devices. For designers this immediately brings up the question, “What can I do to make my content look outstanding on these new iPads and iPhones?”. First there are a few tough questions to consider, but then this guide will help you get started making your websites and web apps look amazingly sharp with Retina images!

retina image comparison

Things to Consider When Adding Retina Images

The main issue with adding retina images is that the images are double as large and will take up extra bandwidth (this won’t be an issue for actual iOS apps, but this guide is covering web sites & web apps only). If your site is mostly used on-the-go over a 3G network it may not be wise to make all your graphics high-definition, but maybe choose only a select few important images. If you’re creating something that will be used more often on a WI-FI connection or have an application that is deserving of the extra wait for hi-res graphics these steps below will help you target only hi-res capable devices.

Simple Retina Images

The basic concept of a Retina image is that your taking a larger image, with double the amount of pixels that your image will be displayed at (e.g 200 x 200 pixels), and setting the image to fill half of that space (100 x 100 pixels). This can be done manually by setting the height and width in HTML to half the size of your image file.

<img src="my200x200image.jpg" width="100" height="100">

If you’d like to do something more advanced keep reading below for how you can apply this technique using scripting.

Creating Retina Icons for Your Website

When users add your website or web app to their homescreen it will be represented by an icon. These sizes for regular and Retina icons (from Apple) are as follows:
ios icon samples

iPhone 57 x 57
Retina iPhone 114 x 114
iPad 72 x 72
Retina iPad 144 x 144

For each of these images you create you can link them in the head of your document like this (if you want the device to add the round corners remove -precomposed):

<link href="touch-icon-iphone.png" rel="apple-touch-icon-precomposed" />
<link href="touch-icon-ipad.png" rel="apple-touch-icon-precomposed" sizes="72x72" />
<link href="touch-icon-iphone4.png" rel="apple-touch-icon-precomposed" sizes="114x114" />
<link href="touch-icon-ipad3.png" rel="apple-touch-icon-precomposed" sizes="144x144" />

If the correct size isn’t specified the device will use the smallest icon that is larger than the recommended size (i.e. if you left out the 114px the iPhone 4 would use the 144px icon).

Retina Background Images

Background images that are specified in your CSS can be swapped out using media queries. You’ll first want to generate two versions of each image. For example ‘bgPattern.png’ at 100px x 100px and ‘bgPattern@2x.png’ at 200px x 200px. It will be useful to have a standard naming convention such as adding @2x for these retina images. To add the new @2x image to your site simply add in the media query below (You can add any additional styles that have background images within the braces of the same media query):

.repeatingPattern {
     background: url(../images/bgPattern.png) repeat;
     background-size: 100px 100px;
}

@media only screen and (-webkit-min-device-pixel-ratio: 2) {
     .repeatingPattern {
          background: url(../images/bgPattern@2x.png) repeat;
     }
}

JavaScript for Retina Image Replacement

For your retina images that aren’t backgrounds the best option seems to be either creating graphics with CSS, using SVG, or replacing your images with JavaScript. Just like the background images, you’ll want to create a normal image and one ‘@2x’ image. Then with JavaScript you can detect if the pixel ratio of the browser is 2x, just like you did with the media query:

if (window.devicePixelRatio == 2) {

//Replace your img src with the new retina image

}

If you’re using jQuery you could quickly replace all your images like this very basic example below. It’s a good idea to add a class to identify the images with hi-res versions so you don’t replace any others by mistake. I’ve added a class=”hires” for this example. Also make sure you have the standard (non-retina) image height and width set in the HTML:

<img class="hires" alt="" src="search.png" width="100" height="100" />
<script type="text/javascript">
$(function () {

	if (window.devicePixelRatio == 2) {

          var images = $("img.hires");

          // loop through the images and make them hi-res
          for(var i = 0; i < images.length; i++) {

            // create new image name
            var imageType = images[i].src.substr(-4);
            var imageName = images[i].src.substr(0, images[i].src.length - 4);
            imageName += "@2x" + imageType;

            //rename image
            images[i].src = imageName;
          }
     }

});
</script>

Server-Side Retina Images

If you’d like to implement a server-side retina image solution, I recommend checking out Jeremy Worboys’ Retina Images (which he also posted in the comments below). His solution uses PHP code to determine which image should be served. The benefit of this solution is that it doesn’t have to replace the small image with the retina one so you’re using less bandwidth, especially if you have lots of images that you’re replacing.

Website Optimization for Retina Displays

If you’re looking for additional information on creating Retina images, I’ve recently had a short book published called Website Optimization for Retina Displays that covers a range of related topics. It contains some of what is above, but also includes samples for many different situations for adding Retina images. It explains the basics of creating Retina images, backgrounds, sprites, and borders. Then it talks about using media queries, creating graphics with CSS, embedding fonts, creating app icons, and more tips for creating Retina websites.

K
转自 Kyle Larson 13 年前
4,393