External Sorting
Introduction
- Quicksort on a disk is not simple
- To get a[i] on internal memory —
- To get a[i] on hard disk — device-dependent
- find the track
- find the sector
- find a[i] and transmit
- If only one tape drive is available to perform the external sorting, then the tape access time for any algorithm will be
- Tool: Mergesort
- To simplify
- Store data on tapes (can only be accessed sequentially)
- Can use at least 3 tape drives
Q: Given 10,000,000 records of 128 bytes each, and the size of the internal memory is 4MB. How many passes we have to do?
A: 1+ log(320 runs) = 1 + 9
- Concerns
- Seek time —
- Time to read or write one block of records
- Time to internally sort records
- Time to merge records from input buffers to the output buffer(Computer can carry out I\O and CPU processing in parallel)
- Targets
- Reduction of the number of passes
- Run merging
- Buffer handling for parallel operation
- Run generation
Pass Reduction
k-way merge
- Require 2k tapes
Use 3 tapes for a 2-way merge
- Split unevenly
- Claim: If the number of runs is a Fibonacci number , then the best way to distribute them is to split them into and
- Claim: For a k-way merge, , where
- Polyphase Merge
- Require k + 1 tapes only
Buffer Handling
- For a k-way merge we need 2k input buffers and 2 output buffers for parallel operations
- k⬆ num of input buffers⬆ buffer size⬇ block size on disk⬇ seek time⬆
- Beyond a certain k value, the I\O time would actually increase despite the decrease in the number of passes being made. The optimal value for k clearly depends on disk parameters and the amount of internal memory available for buffers.
Run Generation
Replacement Selection
- Powerful when input is often nearly sorted for external sorting
Minimize the Merge Time
本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 OE.Heart's Blog!