Introduction

  • Quicksort on a disk is not simple
  • To get a[i] on internal memory — O(1)O(1)
  • To get a[i] on hard disk — device-dependent
    • find the track
    • find the sector
    • find a[i] and transmit
  • If only one tape drive is available to perform the external sorting, then the tape access time for any algorithm will be Ω(N2)\Omega(N^2)
  • Tool: Mergesort
  • To simplify
    • Store data on tapes (can only be accessed sequentially)
    • Can use at least 3 tape drives
image-20210704101343528

Q: Given 10,000,000 records of 128 bytes each, and the size of the internal memory is 4MB. How many passes we have to do?

A: 1+ log(320 runs) = 1 + 9

  • Concerns
    • Seek time — O(number of passes)O(\text{number of passes})
    • Time to read or write one block of records
    • Time to internally sort MM records
    • Time to merge NN records from input buffers to the output buffer(Computer can carry out I\O and CPU processing in parallel)
  • Targets
    • Reduction of the number of passes
    • Run merging
    • Buffer handling for parallel operation
    • Run generation

Pass Reduction

k-way merge

image-20210704102452540
  • Require 2k tapes

Use 3 tapes for a 2-way merge

  • Split unevenly
  • Claim: If the number of runs is a Fibonacci number FNF_N, then the best way to distribute them is to split them into FN1F_{N–1} and FN2F_{N–2}
  • Claim: For a k-way merge, FN(k)=FN1(k)+FN2(k)F^{(k)}_N=F^{(k)}_{N-1}+F^{(k)}_{N-2}, where F0(k)=0(0Nk2),Fk1(k)=1F^{(k)}_0=0(0\leq N\leq k-2),F^{(k)}_{k-1}=1
    • Polyphase Merge
    • Require k + 1 tapes only

Buffer Handling

  • For a k-way merge we need 2k input buffers and 2 output buffers for parallel operations
  • k⬆ \rarr num of input buffers⬆ \rarr buffer size⬇ \rarr block size on disk⬇ \rarr seek time⬆
  • Beyond a certain k value, the I\O time would actually increase despite the decrease in the number of passes being made. The optimal value for k clearly depends on disk parameters and the amount of internal memory available for buffers.

Run Generation

Replacement Selection

  • Lavg=2ML_{avg}=2M
  • Powerful when input is often nearly sorted for external sorting

Minimize the Merge Time

image-20210704110009527