分类未分类下的文章

[CMU15445] 锁与事务隔离级别

作者: rin
时间: August 16, 2023
分类: 未分类
评论

[CMU15445] bustub 2023 Project3

作者: rin
时间: May 20, 2023
分类: 未分类
评论

1. `ExecutorContext *exec_ctx` 是什么？
2. 如何根据plan中的table_name找到要seq_scan的table？（or table_heap?）
    1. 查找到catalog.h中有关于table_name的信息
    2. The Catalog is a non-persistent catalog that is designed for use by executors within the DBMS execution engine. It handles table creation, **table lookup**, index creation, and index lookup.
    3. 查到`ExecutorContext`中包含了catalog，所以应该可以从`exec_ctx_`中获得table

## insert
3. 参考projection_executor写了insert_exector的构造函数及init、next
    1. 使用InsertTuple插入时，需要TupleMeta和Tuple，后者可以从child_executor->Next中获取，前者呢？——只需手动设置TupleMeta的那三个字段
    2. 需要通过\*tuple返回插入的数量，如何构造这个Tuple？构造tuple时也需要对应的schema，这个从何而来？——tuple需要一个vector\<Value> ，这个Value可以通过ValueFactory::GetIntegerValue()构造，而schema已经由planner传过来了，即output_schema，代表这个sql输出的schema是什么样
    3. 按照迭代器模型（volcano模型）的思想，我觉得在单次调用insert_executor->Next()的时候，在该executor内部应该也只应该调用一次child_executor->Next()，直到它返回false时我再返回false。但是在调用者（execution_engine.h:PollExecutor()）的代码中，它每一次执行我的Next时，都会往最终的result_set中插入一次tuple，这会导致该语句输出的结果中有多个row
    ```
    <main>:8
    insert into t1 values (0, 'Emoji', 10), (1, 'EmojiEmoji', 11), (2, 'EmojiEmojiEmoji', 12), (3, 'EmojiEmojiEmojiEmoji', 13), (4, 'EmojiEmojiEmojiEmojiEmoji', 14);
    --- YOUR RESULT ---
    1 
    2 
    3 
    4 
    5 
    
    --- EXPECTED RESULT ---
    5
    ```
    因此我一次性在Next()中使用循环将child_executor中产生的所有tuples处理完

---
在insert后，还需要插入index。从`catalog->GetTableIndexes()`可以获得该table所有的index。但是每个index覆盖的字段也不一样，无脑将新插入的这个tuple原样往index里插会导致segmentfault，要如何根据特定index schema去insert新的index？
可以根据`index_info->key_scheme_`获得这个index的scheme，然后又根据这个table的scheme进行对应，选择需要的column生成key tuple。在tuple.h中已经实现了`Tuple::KeyFromTuple()`可以直接使用。

## delete
4. 在delete_executor中同样使用循环处理所有的tuples，在TupleMeta中修改删除标记。在seq_search_executor->Next()中对于已被标记删除的tuple，应该忽略它，而不是立刻返回false，因为可能还有别的tuples要输出

## update
6. 在update_executor中，检查测试样例`explain update t1 set v3 = 445 where v1 >= 3;`的结果为
  ```
  === BINDER ===
BoundUpdate {
  table=BoundBaseTableRef { table=t1, oid=22 },
  filter_expr=(t1.v1>=3),
  target_expr=[(t1.v3, 445)],
}
=== PLANNER ===
Update { table_oid=22, target_exprs=[#0.0, #0.1, 445] } | (__bustub_internal.update_rows:INTEGER)
  Filter { predicate=(#0.0>=3) } | (t1.v1:INTEGER, t1.v2:VARCHAR, t1.v3:INTEGER)
    SeqScan { table=t1 } | (t1.v1:INTEGER, t1.v2:VARCHAR, t1.v3:INTEGER)
=== OPTIMIZER ===
Update { table_oid=22, target_exprs=[#0.0, #0.1, 445] } | (__bustub_internal.update_rows:INTEGER)
  SeqScan { table=t1, filter=(#0.0>=3) } | (t1.v1:INTEGER, t1.v2:VARCHAR, t1.v3:INTEGER)
```

对应的，代码中的child_executor即Filter，向update_executor提供需要更新的tuples。然后新值存储在plan中的plan_->target_expressions_中。即`target_exprs=[#0.0, #0.1, 445]`
经过plan_->target_expressions_->Evaluate()后可以得到各列实际的值：`[3,EmojiEmojiEmojiEmoji,445]`
```
update t1 set v3 = 445 where v1 >= 3;
2023-05-19 18:47:41 [update_executor.cpp:40:Next] DEBUG - Update from (3, EmojiEmojiEmojiEmoji, 13) Tuple size is 41
2023-05-19 18:47:41 [update_executor.cpp:41:Next] DEBUG - To
2023-05-19 18:47:41 [update_executor.cpp:43:Next] DEBUG - Type: 4       Val: #0.0        Evaluated Val:3
2023-05-19 18:47:41 [update_executor.cpp:43:Next] DEBUG - Type: 7       Val: #0.1        Evaluated Val:EmojiEmojiEmojiEmoji
2023-05-19 18:47:41 [update_executor.cpp:43:Next] DEBUG - Type: 4       Val: 445         Evaluated Val:445
```

## index_scan
在初始化BPlusTreeIndexIteratorForTwoIntegerColumn（IndexIterator）时，出现链接错误，找不到IndexIterator，即使对特定的那几个类型进行了explicit instantiation也不行，原因仅仅是因为这里使用了无参的构造函数，但是在index_iterator.h中没有定义，补上定义即可。

## agg
首先根据注释，给aggregation_executor.h添加成员`SimpleAggregationHashTable aht_;`和`SimpleAggregationHashTable::Iterator aht_iterator_;`，并在构造函数上初始化。

1. `SimpleAggregationHashTable`是什么？——是一个`AggregateKey->AggregateValue`的hashtable
2. `AggregateKey`代表“ a key in an aggregation operation”，包含一个`std::vector<Value> group_bys_;`。这个key应该与aggregate的columns有关。
3. `AggregateValue`里有一个`std::vector<Value> aggregates_;`，应该是aggregate后的值。
    1. 但是aggregate 函数如`CountStarAggregate, CountAggregate, SumAggregate, MinAggregate, MaxAggregate`这些也只产生一个值，或许vector里存了多个group产生的值？——不是，vector里存的是不同聚集函数产生的值，第i个元素值代表`agg_types_[i]`这个类型的聚集函数
    2. 在`aggregation_executor->Next()`中，像其他task一样，从child获取一个tuple然后进行处理。处理过程大概是构造AggregateKey和AggregateValue然后添加到某处
        1. 查看`aggregation_executor.h`，其中有`MakeAggregateKey(),MakeAggregateValue(),aht_.InsertCombine`几个函数。构造Key的这个函数从plan_中获得**group_by_的expr**，使用evaluate求出keys；构造Value的函数从plan_中获得**aggregate的expr**，使用evalutae求出values。
        2. 以第二个样例`insert into t1 values (-99999), (99999), (0), (1), (2), (3); select count(*), min(v1), max(v1), count(v1), sum(v1) from t1;`为例，第一次Next返回-99999这个tuple，MakeAggregateValue会用`count,min,max,count,sum`这五个函数产生`1,-99999,-99999,-99999,-99999`这五个值
        3. 在`CombineAggregateValues()`中，对于这五个值，就需要根据对于`agg_expr`的类型(`agg_types_[i]`)进行聚合处理。
        4. 注意区分`count(*), count(column)`，前者统计总行数，后者统计该列中非空的数量。对于空表，前者为0；对于空列或者该全为null的列，后者为null。对于其他聚集函数来说也是，input是null时跳过处理
        5. executor的返回值是group_by columns + aggregation columns
        > The output schema consists of the group-by columns followed by the aggregation columns.
            如果我还select了其他column，但是没有orderby它，会怎样？——这种情况可以是不允许的，参考MySQL的`only_full_group_by`。因此不用处理非聚集的值和非groupby的值
            
例如，`__mock_table_1.colA:INTEGER`是groupby columns，`agg#0:INTEGER, agg#1:INTEGER, agg#2:INTEGER`是aggregate column
```
EXPLAIN SELECT max(colA), colA, MIN(colB), max(colB) FROM __mock_table_1 GROUP BY colA;
Agg { types=[max, min, max], aggregates=[#0.0, #0.1, #0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, agg#0:INTEGER, agg#1:INTEGER, agg#2:INTEGER)
```
---
注意当表为空的时候，child的next不会返回tuples，aht_.InsertCombine()就不会执行。
1. 如果查询中没有groupby，此时agg_executor->Next()仍要返回一些默认值，如函数`GenerateInitialAggregateValue()`中所指定的。为了复用`GenerateInitialAggregateValue(), InsertCombine()`这两个函数，我在处理空表时，对于`plan_->aggregates_`中使用到的各聚集函数，均构造一个null值作为agg_val，调用InsertCombine试图插入到hashtable中，此时`InsertCombine()`会对hashtable设置初值，而`CombineAggregateValues()`会忽略掉我构造的这些null值，不动初值。但这些只是aggregate函数们的默认值，如果查询语句有group by，那么还要构造这些列的空值作为默认值。
2. 查询中有groupby，不返回值
---
在测试`select v4, min(v1) + sum(v2) + max(v3), count(*) from t1 group by v4;`的时候，aggregate_executor->Next()中的executed_无法等于true，导致这个Next()被无限调用。
原因是我在Next()中修改hashtable后做了多余的更新迭代器的操作`aht_iterator_ = aht_.Begin();`导致无法满足结束的条件。只需在首次更新hashtable之后更新begin迭代器即可，即`Init()`和`if(!aggregated_)`中。（在首次更新之前，hashtable为空，Begin返回的是end，这不是想要的）

[CMU15445] bustub 2023 Project2

作者: rin
时间: April 23, 2023
分类: 未分类
评论

Codes: [b_plus_tree.cpp](https://git.lo-li.net/rin/bustub/src/branch/master/src/storage/index/b_plus_tree.cpp "b_plus_tree.cpp")
# Task1 B+Tree Pages 
## Page types
There are 4 types of headers (classes) named as `b_plus_*_page_.h`
1. `b_plus_tree_page.h`
2. `b_plus_tree_header_page.h`
3. `b_plus_tree_internal_page.h`
4. `b_plus_tree_leaf_page.h`

Only `internal_page` and `leaf_page` are inherited from the base class defined in `b_plus_tree_page`.

At first, I was confused by the words `RootPage` and `HeaderPage`. 
In fact, `header_page` does not belong to the inheritance tree. It stores the metadata of a `BPlusTree`  object, which is `root_page_id` for now.

`header_page` always exists in an initialized `BPlusTree`. But `root_page_id` may be `INVALID_PAGE_ID` at some time, meaning the tree is **empty** (tree exists but no data).

---

## Page structure

Both 2 types of pages share the common member variables `page_type_`, `size_` and `max_size_` defined in `BPlusTreePage`.
But the data area `MappingType array_[0]` is defined in `leaf_page.h` and `internal_page.h` separately.

This is because the array is a *Flexible array* and it must be placed at the end of an object. But in `leaf_page`, there is an extra member `next_page_id_` after common members. The class structure is as below:
```
|| page_type_ | size_ | max_size_ | next_page_id_ | array_....... ||
|| page_type_ | size_ | max_size_ | array_....... ||
```
In later codes, a pointer of `internal_page` will be `reinterpret_cast` to a point of `leaf_page`. And `page_type_` will be accessed to check the type of this page.
Due to the common memory structure in both pages' headers, this operation is safe.

### Internal page
```
| KEY(0)+PAGE_ID(0) | KEY(1)+PAGE_ID(1) | KEY(2)+.... |
  ⬆[INVALID]
```

There are $$m+1$$ pairs of `<key, page_id_t>` in an internal page. The 0th pair has an invalid key. So an Internal Page stores **m** ordered keys and **m+1** child pointers. `max_size_ == m+1`

The assignment specification recommends:
> split an internal node when number of values reaches `max_size` before insertion

### Leaf page
Every key in a leaf page is valid.

> split a leaf node when the number of values reaches `max_size` after insertion

So in any stable situation,  max_size_ < size_

Manacher实例

作者: rin
时间: July 18, 2020
分类: 未分类
评论

[前坑](https://lo-li.net/1311-manacher-%E7%AE%97%E6%B3%95%E6%B1%82%E6%9C%80%E9%95%BF%E5%9B%9E%E6%96%87%E5%AD%90%E4%B8%B2.html)

![manacher2.png](https://static.lo-li.net/typecho/2020/07/2154660342.png)
符号说明：
* `i`为字符串中的下标，`S`是预处理后的字符串
* `P[i]`代表以`i`为中心（含）的地方，回文的半径长度。例如`P[1]=1`意为以`'#'`为中心的最长回文子串为`"#"`，半径长度为1；同样，`P[16]=4`意为以`'x'`为中心的最长回文子串`"#p#x#p#"`的半径`"#p#x"/"x#p#"`的长度4
* `id`代表当前已知的最长回文子串的中点；`mx`代表该子串的半径长度+1
* 带圈数字代表步骤

①、起始状态，`i=0,mx=0,id=0`
此时`i=1 >= mx=0`，即此位置已经超过了已知的最长回文范围（因为才开始嘛），同时也能确定`P[1]=至少=1`。
因此现在需要以`i=1`为中心，从其半径范围外两边开始逐一比对。（两边就是`i±P[i]`）
发现`'$'≠'a'`，扩展不下去了便停止，更新`id=1,mx=id+P[id]=1+1=2`

②、`i=2,mx=2,id=1`
此时`i=2 >= mx=2`，同样超过了已知的最长回文范围（注意`mx`是边界+1），同理能确定`P[2]=至少=1`。
继续扩展发现`'#'='#'`，则`P[2]++`，再继续无法扩展`$≠b`。
更新`id=2,mx=id+P[id]=2+2=4`

③、`i=3,mx=4,id=2`
此时`i < mx`了，说明`i`在**已知最长的回文子串（id=2的那个）**的势力范围内了。那么根据对称性，`i=3`在前面就有一个“分身”`j=2*id-i=1`
现在想求的`P[i]`就和`P[j]`有一种关系：`P[i]`=**可能**=`P[j]`，那么此时以`i`为中心的回文子串最右的下标就**有可能**扩展到`i+P[i]-1`，例如此时：
![manacher1.png](https://static.lo-li.net/typecho/2020/07/3248822752.png)
`i+P[i]-1=3+1-1=3 < mx=4`，`3`仍处在**能保证对称的<span style="background-color:#ffe699">势力范围内</span>**，那么上面的“可能”变成“一定”。
（当然现在串还太短效果不是很明显，可以参考后面的⑥；同时“不可能”的情况请参考文末）
④、⑤同理
⑥、`i=6,mx=10,id=5`
`P[6]=P[4]=2`，现在能从下标`6-2=4`和`6+2=8`的位置开始对比，而对称区域内的都不用再比了。所以就利用了已知的信息跳过了很多次对比。
然后`'b'≠'a'`停止，`6+2=8 < mx=10`，说明仍然处于最长的那个子串内部，所以不更新`mx,id`
……
如此便能算出所有的`P[i]`并且能得到最长的长度。

------------
“不可能”的情况
换一个例子
![manacher3.png](https://static.lo-li.net/typecho/2020/07/31183235.png)
此时`i=11,id=8,mx=14`
根据预判，`P[11]=P[5]=5`。但是下标`11+5=16 >= mx=14`，预判超过了**能保证对称**的部分，于是剩余的部分`14,15`就要和`8,7`暴力对比。

Codeforces Round #384 (Div. 2)

作者: rin
时间: December 15, 2016
分类: 未分类
评论

~~这个灰名选手一直坚持他蓝名梦想~~

------------

[A. Vladik and flights](http://codeforces.com/problemset/problem/743/A)
有$$n$$个机场，要从$$a$$飞到$$b$$
每个机场有航空公司$$0$$或$$1$$，同航空公司机场间移动无费用，不同的之间为$$|ni-nj|$$，问从$$a$$到$$b$$最少花费是多少。

因为同公司无费用，所以如果$$a$$和$$b$$地相同的话就输出$$0$$；
如果不相同，那么从$$a$$一定能移动到一个地方，是$$0$$和$$1$$的交界，然后在交界换一家公司飞到b，花费是$$1$$
```c++
int main(){
    int n, a, b;
    char aType, bType;
    scanf("%d %d %d", &n, &a, &b);
    getchar();
    for (int i = 1; i <= n; i++) {
        if (i == a) scanf("%c", &aType);
        else if (i == b) scanf("%c", &bType);
        else scanf("%*c");
    }
    puts((aType == bType || a == b) ? "0" : "1");
    return 0;
}
```

------------

[B. Chloe and the sequence](http://codeforces.com/problemset/problem/743/B)
初始有一个序列$$[1]$$
然后将它本身加到后面变成$$[1,1]$$
再将还未使用的最小的正整数加到正中的位置$$[1,2,1]$$
重复$$n-1$$次
问这之后第$$k$$个数字是啥

由~~观察瞎猜脑洞~~，~~可以看出~~，这个序列代表从$$1$$到某个数，(二进制)从右数起第一个$$1$$的位置，比如

| 数（二进制）  | 1的位置  |
| ------------ | ------------ |
|  1(1) | 1  |
|  2(10) | 2  |
|3(11)|1|
|4(100)|3|
|5(101)|1|
|6(110)|2|
|...|...|
那么第$$k$$位是啥，就是$$k$$的右起第一个$$1$$位置。
可以直接用glibc里的`__builtin_ffs()`函数
考虑到$$k$$超`int`用其`long long`版本 ` __builtin_ffsll()`
```c++
int main(){
    long long k;
    scanf("%*lld %lld", &k);
    printf("%d\n", __builtin_ffsll(k));
    return 0;
}
```