data structures #182

Open
lomakinae wants to merge 11 commits from lomakinae/2026-rff_mp:data_structures into develop
16 changed files with 580 additions and 0 deletions

lomakinae/.gitignore (new file)

@@ -0,0 +1,2 @@
__pycache__/
*.pyc

@@ -0,0 +1,86 @@
# Report. Assignment 1: data structures
## Goal
Implement three data structures (linked list, hash table, BST) and experimentally compare their
performance on insert / find / delete operations with randomly ordered and sorted input
data.
## Experiment parameters
| Parameter    | Value                           |
| ------------ | ------------------------------- |
| N (records)  | 10,000                          |
| Repetitions  | 5                               |
| Searches     | 100 existing + 10 non-existent  |
| Deletions    | 50                              |
---
## Results
### Random data (shuffled)
![shuffled](data/01/attachments/plot_shuffled.png)
### Sorted data (sorted)
![sorted](data/01/attachments/plot_sorted.png)
---
## Analysis
### BST degradation on sorted data
![bst comparison](data/01/attachments/plot_bst_comparison.png)
On random data the BST is the fastest structure: insert takes 0.027 s. A random insertion order yields a roughly balanced
tree of depth ~log N, so each new node finds its place in O(log N) steps. On sorted data insert takes 12.77 s,
about 473x slower. The reason: when sorted keys are inserted sequentially, every new node goes into the
right subtree of the previous one. The tree degenerates into a chain of depth N, and each insertion costs O(N) steps instead of O(log N).
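The chain-versus-balanced claim is easy to check directly. A minimal hypothetical sketch (not part of this PR: the nodes mirror the dict layout of `bst.py`, but `insert` and `depth` are written iteratively so a 1000-node chain does not hit Python's recursion limit):

```python
import random

def insert(root, key):
    # Iterative BST insert; same placement rule as bst_insert in this PR.
    node = {'key': key, 'left': None, 'right': None}
    if root is None:
        return node
    cur = root
    while True:
        side = 'left' if key < cur['key'] else 'right'
        if cur[side] is None:
            cur[side] = node
            return root
        cur = cur[side]

def depth(root):
    # Iterative max-depth so a degenerate N-long chain is safe to measure.
    best, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        if node is None:
            continue
        best = max(best, d)
        stack.append((node['left'], d + 1))
        stack.append((node['right'], d + 1))
    return best

keys = list(range(1000))
sorted_root = None
for k in keys:                      # sorted insertion order
    sorted_root = insert(sorted_root, k)

random.shuffle(keys)
shuffled_root = None
for k in keys:                      # random insertion order
    shuffled_root = insert(shuffled_root, k)

print(depth(sorted_root))    # 1000: every node hangs off the right of the previous one
print(depth(shuffled_root))  # typically around 20, i.e. a few multiples of log2(1000)
```

The 1000-vs-~20 depth gap is exactly the O(N) vs O(log N) cost per operation that the measurements show.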
### The hash table is insensitive to input order
HashTable shows practically identical times in both modes (insert: 0.033 s vs 0.032 s). This is expected:
the bucket index is computed from `hash(name)`, which does not depend on insertion order. The operations run in O(1)
on average for any input.
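A minimal sketch of that argument (hypothetical helper, not from this PR; note that within one process `hash()` of a given string is stable, even though Python randomizes string hashing across processes via `PYTHONHASHSEED`):

```python
# The bucket index is a pure function of the key, so insertion order cannot
# affect where a record lands. Bucket count 128 matches DEFAULT_SIZE in ht.py.
def bucket_index(name, size=128):
    return hash(name) % size

names = [f"User_{i:05d}" for i in range(10)]
forward  = {n: bucket_index(n) for n in names}            # insert order A
backward = {n: bucket_index(n) for n in reversed(names)}  # insert order B
assert forward == backward  # same key -> same bucket, order irrelevant
```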
### The linked list is slow to search
LinkedList has no structure to navigate by: the only way to find a record is to walk the list from the head to the target node.
Find is always O(N) regardless of data order: 0.041 s shuffled, 0.039 s sorted. Inserting the whole dataset is O(N^2), because before
each insertion the entire list is scanned for a duplicate. Sorted input does not change LinkedList's behaviour, whereas the BST degrades
to 12.77 s; in that one scenario the LinkedList actually beats the BST.
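The head-to-target walk can be made visible by counting comparisons. A hypothetical instrumented variant of `ll_find` (node layout mirrors `ll.py`; `ll_find_with_steps` is illustrative only):

```python
def ll_find_with_steps(head, name):
    # Same walk as ll_find, but also report how many nodes were visited.
    steps, node = 0, head
    while node is not None:
        steps += 1
        if node['name'] == name:
            return node['phone'], steps
        node = node['next']
    return None, steps

# Build a 5-node list by prepending, as ll_insert in this PR does.
head = None
for i in range(5):
    head = {'name': f"User_{i}", 'phone': str(i), 'next': head}

print(ll_find_with_steps(head, "User_0"))  # ('0', 5): deepest node, full walk
print(ll_find_with_steps(head, "User_4"))  # ('4', 1): at the head, one step
print(ll_find_with_steps(head, "nobody"))  # (None, 5): a miss always costs N
```

The miss case is the worst one: every unsuccessful lookup (and every duplicate check before an insert) traverses all N nodes.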
### Deletion
LinkedList: to delete a node, walk the list from the head to the target element and relink the predecessor's next pointer.
This is O(N) in every case; data order does not matter. Shuffled: 0.027 s, sorted: 0.026 s, a difference within measurement noise.
HashTable: compute the bucket index from `hash(name)`, then remove the node from that bucket's linked list. Insertion order does not
affect which bucket a record lands in, so the time is stable in both modes: 0.00033 s.
BST: locate the node by descending the tree, then handle three cases: no children, one child, two children. On random data the
tree is balanced, depth ~log N, and deletion takes 0.00014 s. On sorted data the tree has degenerated into a
chain (each node went into the right subtree at insertion time), so finding the node to delete walks the entire chain in O(N).
Result: 0.061 s, about 435x slower.
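The three deletion cases, in particular the in-order-successor replacement for the two-children case, can be sketched on a tiny integer tree (hypothetical standalone code with the same node layout as `bst.py`, using a single `key` instead of `name`/`phone`):

```python
def insert(root, key):
    if root is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < root['key'] else 'right'
    root[side] = insert(root[side], key)
    return root

def delete(root, key):
    if root is None:
        return None
    if key < root['key']:
        root['left'] = delete(root['left'], key)
    elif key > root['key']:
        root['right'] = delete(root['right'], key)
    else:
        if root['left'] is None:   # case 1/2: no left child
            return root['right']
        if root['right'] is None:  # case 2: no right child
            return root['left']
        succ = root['right']       # case 3: two children ->
        while succ['left'] is not None:
            succ = succ['left']    # in-order successor = min of right subtree
        root['key'] = succ['key']
        root['right'] = delete(root['right'], succ['key'])
    return root

def inorder(root):
    if root is None:
        return []
    return inorder(root['left']) + [root['key']] + inorder(root['right'])

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
root = delete(root, 50)  # the root has two children
print(inorder(root))     # [20, 30, 40, 60, 70, 80]: still sorted after deletion
```

Copying the successor's value and deleting it from the right subtree preserves the BST invariant, which is why the in-order sequence stays sorted.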
---
## Вывод
**Частые вставки** - HashTable. Время вставки не зависит от порядка и объёма данных: индекс бакета вычисляется за O(1),
вставка в бакет - тоже O(1). Подтверждают цифры: 0.033 с на shuffled и 0.032 с на sorted при N=10000.
**Частый поиск** - HashTable. По той же причине: `hash(name)` сразу указывает на нужный бакет, линейный перебор не нужен.
Find: 0.00057 с на shuffled, 0.00070 с на sorted - стабильно при любом входе.
**Получить данные в отсортированном порядке** - BST при случайном порядке вставки. Элементы размещаются по правилу BST
(слева меньшие корня, справа большие корня), поэтому обход по схеме левое поддерево -> корень -> правое поддерево возвращает
все записи в алфавитном порядке без дополнительной сортировки. Важное условие: данные должны вставляться в случайном
порядке, иначе дерево вырождается (см. деградацию BST на отсортированных данных).
LinkedList сам по себе проигрывает по всем операциям из-за O(N\*\*2) на вставку и O(N) на поиск. Однако его идея лежит в основе
хеш-таблицы: каждый бакет - это связный список, через который разрешаются коллизии. Как самостоятельная структура данных для
справочника он неэффективен, но как строительный блок внутри HashTable - незаменим.
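That last point, the linked list as the collision-resolution chain inside each bucket, fits in a few lines. A hypothetical standalone sketch (`Alice`/`Bob` and the single-slot table are illustrative, chosen to force a collision):

```python
def ht_insert(buckets, name, phone):
    # Prepend a list node to the chosen bucket's chain (chaining).
    i = hash(name) % len(buckets)
    buckets[i] = {'name': name, 'phone': phone, 'next': buckets[i]}

def ht_find(buckets, name):
    # Walk only the one bucket's chain, not the whole table.
    node = buckets[hash(name) % len(buckets)]
    while node is not None:
        if node['name'] == name:
            return node['phone']
        node = node['next']
    return None

buckets = [None]                  # a single bucket guarantees a collision
ht_insert(buckets, "Alice", "111")
ht_insert(buckets, "Bob", "222")  # collides with Alice, chained in front
print(ht_find(buckets, "Alice"))  # 111: found by walking the bucket's chain
print(ht_find(buckets, "Bob"))    # 222
```

With a realistic bucket count the chains stay short (a handful of nodes), which is why the O(N) linked-list walk collapses into the table's effective O(1).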

@@ -0,0 +1,9 @@
# Assignment 1: data structures
## How to run
```sh
python main.py
```
## Results
The plots are generated automatically into the `attachments/` folder.


@@ -0,0 +1,6 @@
from src.experiment import main_experiment
from src.plot import build_plots

if __name__ == "__main__":
    main_experiment()
    build_plots()

@@ -0,0 +1,91 @@
structure,mode,operation,run,time_sec
LinkedList,shuffled,insert,1,3.25562
LinkedList,shuffled,find,1,0.040773
LinkedList,shuffled,delete,1,0.026344
HashTable,shuffled,insert,1,0.033497
HashTable,shuffled,find,1,0.000593
HashTable,shuffled,delete,1,0.000348
BST,shuffled,insert,1,0.024071
BST,shuffled,find,1,0.000218
BST,shuffled,delete,1,0.000136
LinkedList,shuffled,insert,2,3.454281
LinkedList,shuffled,find,2,0.040282
LinkedList,shuffled,delete,2,0.026526
HashTable,shuffled,insert,2,0.031691
HashTable,shuffled,find,2,0.000568
HashTable,shuffled,delete,2,0.000338
BST,shuffled,insert,2,0.024978
BST,shuffled,find,2,0.000213
BST,shuffled,delete,2,0.000135
LinkedList,shuffled,insert,3,3.453681
LinkedList,shuffled,find,3,0.0404
LinkedList,shuffled,delete,3,0.026843
HashTable,shuffled,insert,3,0.031902
HashTable,shuffled,find,3,0.000536
HashTable,shuffled,delete,3,0.000319
BST,shuffled,insert,3,0.025369
BST,shuffled,find,3,0.000219
BST,shuffled,delete,3,0.000138
LinkedList,shuffled,insert,4,3.417185
LinkedList,shuffled,find,4,0.040816
LinkedList,shuffled,delete,4,0.027023
HashTable,shuffled,insert,4,0.037826
HashTable,shuffled,find,4,0.000582
HashTable,shuffled,delete,4,0.00033
BST,shuffled,insert,4,0.036423
BST,shuffled,find,4,0.000227
BST,shuffled,delete,4,0.00014
LinkedList,shuffled,insert,5,3.4723
LinkedList,shuffled,find,5,0.040734
LinkedList,shuffled,delete,5,0.027866
HashTable,shuffled,insert,5,0.031981
HashTable,shuffled,find,5,0.000546
HashTable,shuffled,delete,5,0.000332
BST,shuffled,insert,5,0.024578
BST,shuffled,find,5,0.000227
BST,shuffled,delete,5,0.000146
LinkedList,sorted,insert,1,3.271489
LinkedList,sorted,find,1,0.038886
LinkedList,sorted,delete,1,0.026646
HashTable,sorted,insert,1,0.030995
HashTable,sorted,find,1,0.000625
HashTable,sorted,delete,1,0.000302
BST,sorted,insert,1,13.000812
BST,sorted,find,1,0.128239
BST,sorted,delete,1,0.06369
LinkedList,sorted,insert,2,3.384572
LinkedList,sorted,find,2,0.03915
LinkedList,sorted,delete,2,0.026683
HashTable,sorted,insert,2,0.032596
HashTable,sorted,find,2,0.0006
HashTable,sorted,delete,2,0.000315
BST,sorted,insert,2,12.593249
BST,sorted,find,2,0.10657
BST,sorted,delete,2,0.058763
LinkedList,sorted,insert,3,3.27816
LinkedList,sorted,find,3,0.038938
LinkedList,sorted,delete,3,0.025567
HashTable,sorted,insert,3,0.03168
HashTable,sorted,find,3,0.000631
HashTable,sorted,delete,3,0.00031
BST,sorted,insert,3,12.809241
BST,sorted,find,3,0.110947
BST,sorted,delete,3,0.062604
LinkedList,sorted,insert,4,3.277437
LinkedList,sorted,find,4,0.039812
LinkedList,sorted,delete,4,0.025627
HashTable,sorted,insert,4,0.031844
HashTable,sorted,find,4,0.000917
HashTable,sorted,delete,4,0.000383
BST,sorted,insert,4,12.722063
BST,sorted,find,4,0.111841
BST,sorted,delete,4,0.060014
LinkedList,sorted,insert,5,3.261706
LinkedList,sorted,find,5,0.037981
LinkedList,sorted,delete,5,0.025241
HashTable,sorted,insert,5,0.032067
HashTable,sorted,find,5,0.000742
HashTable,sorted,delete,5,0.000342
BST,sorted,insert,5,12.713176
BST,sorted,find,5,0.108333
BST,sorted,delete,5,0.059109

@@ -0,0 +1,72 @@
import time

from .ll import ll_insert, ll_find, ll_delete
from .ht import ht_new, ht_insert, ht_find, ht_delete
from .bst import bst_insert, bst_find, bst_delete


def _build_ll(records):
    head = None
    for name, phone in records:
        head = ll_insert(head, name, phone)
    return head


def _build_ht(records):
    buckets = ht_new()
    for name, phone in records:
        ht_insert(buckets, name, phone)
    return buckets


def _build_bst(records):
    root = None
    for name, phone in records:
        root = bst_insert(root, name, phone)
    return root


def _time_insert(build_fn, records):
    start = time.perf_counter()
    structure = build_fn(records)
    end = time.perf_counter()
    elapsed = end - start
    return elapsed, structure


def _time_find(find_fn, structure, names):
    start = time.perf_counter()
    for name in names:
        find_fn(structure, name)
    end = time.perf_counter()
    elapsed = end - start
    return elapsed


def _time_delete(delete_fn, structure, names):
    # ll_delete/bst_delete return the (possibly new) head/root; ht_delete
    # mutates its buckets in place and returns None, hence the guard below.
    start = time.perf_counter()
    for name in names:
        result = delete_fn(structure, name)
        if result is not None:
            structure = result
    end = time.perf_counter()
    elapsed = end - start
    return elapsed, structure


def run_once(records, search_names, delete_names):
    results = []
    structures = {
        'LinkedList': (_build_ll, ll_find, ll_delete),
        'HashTable': (_build_ht, ht_find, ht_delete),
        'BST': (_build_bst, bst_find, bst_delete),
    }
    for label, (build_fn, find_fn, delete_fn) in structures.items():
        t_insert, structure = _time_insert(build_fn, records)
        t_find = _time_find(find_fn, structure, search_names)
        t_delete, structure = _time_delete(delete_fn, structure, delete_names)
        results.append((label, t_insert, t_find, t_delete))
    return results

@@ -0,0 +1,65 @@
def _bst_new_node(name, phone):
    return {'name': name, 'phone': phone, 'left': None, 'right': None}


def bst_insert(root, name, phone):
    if root is None:
        return _bst_new_node(name, phone)
    if name == root['name']:
        root['phone'] = phone
    elif name < root['name']:
        root['left'] = bst_insert(root['left'], name, phone)
    else:
        root['right'] = bst_insert(root['right'], name, phone)
    return root


def bst_find(root, name):
    if root is None:
        return None
    if name == root['name']:
        return root['phone']
    elif name < root['name']:
        return bst_find(root['left'], name)
    else:
        return bst_find(root['right'], name)


def _bst_min_node(root):
    node = root
    while node['left'] is not None:
        node = node['left']
    return node


def bst_delete(root, name):
    if root is None:
        return None
    if name < root['name']:
        root['left'] = bst_delete(root['left'], name)
    elif name > root['name']:
        root['right'] = bst_delete(root['right'], name)
    else:
        # node found: three cases
        if root['left'] is None:
            return root['right']
        if root['right'] is None:
            return root['left']
        # two children: replace node with in-order successor
        successor = _bst_min_node(root['right'])
        root['name'] = successor['name']
        root['phone'] = successor['phone']
        root['right'] = bst_delete(root['right'], successor['name'])
    return root


def bst_list_all(root):
    if root is None:
        return []
    return bst_list_all(root['left']) + [(root['name'], root['phone'])] + bst_list_all(root['right'])

@@ -0,0 +1,58 @@
import csv
import random
import sys
from pathlib import Path

from .generator import generate_records, shuffle_records, sort_records, sample_existing, sample_nonexistent
from .bench import run_once

N = 10000
RUNS = 5
SEARCH_K = 100
SEARCH_MISSING_K = 10
DELETE_K = 50

# The degenerate BST recurses ~N deep on sorted input, so raise the
# default recursion limit above N.
sys.setrecursionlimit(15000)

BASE_DIR = Path(__file__).resolve().parent.parent
RESULT_PATH = BASE_DIR / "results.csv"


def run_experiment(records, mode):
    search_names = sample_existing(records, SEARCH_K) + sample_nonexistent(SEARCH_MISSING_K)
    delete_names = sample_existing(records, DELETE_K)
    all_rows = []
    for run_i in range(1, RUNS + 1):
        print(f"  [{mode}] run {run_i}/{RUNS} ...")
        run_results = run_once(records, search_names, delete_names)
        for label, t_insert, t_find, t_delete in run_results:
            all_rows.append([label, mode, 'insert', run_i, round(t_insert, 6)])
            all_rows.append([label, mode, 'find', run_i, round(t_find, 6)])
            all_rows.append([label, mode, 'delete', run_i, round(t_delete, 6)])
    return all_rows


def main_experiment():
    random.seed(52)
    records_base = generate_records(N)
    records_shuffled = shuffle_records(records_base)
    records_sorted = sort_records(records_base)
    rows = []
    rows += run_experiment(records_shuffled, 'shuffled')
    rows += run_experiment(records_sorted, 'sorted')
    header = ['structure', 'mode', 'operation', 'run', 'time_sec']
    with open(RESULT_PATH, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    print(f"Done. Results saved to {RESULT_PATH.name}")
    print(f"Total rows: {len(rows)}")


if __name__ == "__main__":
    main_experiment()

@@ -0,0 +1,27 @@
import random


def generate_records(n):
    return [(f"User_{i:05d}", f"252-{i:05d}") for i in range(n)]


def shuffle_records(records):
    shuffled = records[:]
    random.shuffle(shuffled)
    return shuffled


def sort_records(records):
    return sorted(records)


def sample_existing(records, k):
    return [name for name, _ in random.sample(records, k)]


def sample_nonexistent(k):
    return [f"None_{i:05d}" for i in range(k)]

@@ -0,0 +1,34 @@
from .ll import ll_insert, ll_find, ll_delete, ll_list_all

DEFAULT_SIZE = 128


def ht_new(size=DEFAULT_SIZE):
    return [None] * size


def _ht_index(buckets, name):
    return hash(name) % len(buckets)


def ht_insert(buckets, name, phone):
    i = _ht_index(buckets, name)
    buckets[i] = ll_insert(buckets[i], name, phone)


def ht_find(buckets, name):
    i = _ht_index(buckets, name)
    return ll_find(buckets[i], name)


def ht_delete(buckets, name):
    i = _ht_index(buckets, name)
    buckets[i] = ll_delete(buckets[i], name)


def ht_list_all(buckets):
    records = []
    for head in buckets:
        records.extend(ll_list_all(head))
    return sorted(records)

@@ -0,0 +1,50 @@
def _ll_new_node(name, phone):
    return {'name': name, 'phone': phone, 'next': None}


def ll_insert(head, name, phone):
    node = head
    while node is not None:
        if node['name'] == name:
            node['phone'] = phone
            return head
        node = node['next']
    new_node = _ll_new_node(name, phone)
    new_node['next'] = head
    return new_node


def ll_find(head, name):
    node = head
    while node is not None:
        if node['name'] == name:
            return node['phone']
        node = node['next']
    return None


def ll_delete(head, name):
    if head is None:
        return None
    if head['name'] == name:
        return head['next']
    node = head
    while node['next'] is not None:
        if node['next']['name'] == name:
            node['next'] = node['next']['next']
            return head
        node = node['next']
    return head


def ll_list_all(head):
    records = []
    node = head
    while node is not None:
        records.append((node['name'], node['phone']))
        node = node['next']
    return sorted(records)

@@ -0,0 +1,80 @@
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent
DATA_PATH = BASE_DIR / "results.csv"
ATTACHMENTS_DIR = BASE_DIR / "attachments"


def shuffled_sorted_plots(structures, df):
    for mode in ['shuffled', 'sorted']:
        mode_title = 'shuffled data' if mode == 'shuffled' else 'sorted data'
        fig, axes = plt.subplots(1, 3, figsize=(15, 6))
        fig.suptitle(f'Performance: {mode_title}')
        for ax, op in zip(axes, ['insert', 'find', 'delete']):
            subset = df[(df['mode'] == mode) & (df['operation'] == op)]
            structures_average = subset.groupby('structure')['time_sec'].mean()
            means = [structures_average[s] for s in structures]
            bars = ax.bar(structures, means, color=['#4C72B0', '#55A868', '#C44E52'])
            ax.set_title(op)
            ax.set_ylabel('t (s)')
            ax.set_yscale('log')
            ax.grid(True, axis='y', alpha=0.3)
            for bar, val in zip(bars, means):
                label = f'{val:.5f}' if val > 0.0001 else f'{val:.1e}'
                ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                        label, ha='center', va='bottom', fontsize=9)
        plt.tight_layout()
        save_path = ATTACHMENTS_DIR / f'plot_{mode}.png'
        plt.savefig(save_path, dpi=300)
        plt.close()
        print(f'Saved: {save_path.name}')


def bst_shuffled_vs_sorted(df):
    fig, ax = plt.subplots(figsize=(7, 5))
    fig.suptitle('bst_insert() performance: shuffled vs sorted data')
    bst_insert = df[(df['structure'] == 'BST') & (df['operation'] == 'insert')]
    modes_average = bst_insert.groupby('mode')['time_sec'].mean()
    modes = ['shuffled', 'sorted']
    means = [modes_average[m] for m in modes]
    bars = ax.bar(modes, means, color=['#4C72B0', '#C44E52'])
    ax.set_ylabel('t (s)')
    ax.set_yscale('log')
    ax.grid(True, axis='y', alpha=0.3)
    for bar, val in zip(bars, means):
        label = f'{val:.5f}' if val > 0.0001 else f'{val:.1e}'
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                label, ha='center', va='bottom', fontsize=10)
    plt.tight_layout()
    save_path = ATTACHMENTS_DIR / 'plot_bst_comparison.png'
    plt.savefig(save_path, dpi=300)
    plt.close()
    print(f'Saved: {save_path.name}')


def build_plots():
    ATTACHMENTS_DIR.mkdir(exist_ok=True)
    if not DATA_PATH.exists():
        raise FileNotFoundError(f"File not found: {DATA_PATH}")
    df = pd.read_csv(DATA_PATH)
    structures = ['LinkedList', 'HashTable', 'BST']
    shuffled_sorted_plots(structures, df)
    bst_shuffled_vs_sorted(df)


if __name__ == "__main__":
    build_plots()