1541 isn't fast enough to decode in realtime.
— MagerValp
GCR decoding on the fly
The little routine presented here is, with all due humbleness, hands down the best code I ever wrote. It is esoteric and might not appeal to the masses, and it certainly has no commercial value. But in terms of personal satisfaction at a job well done, this is about as good as it gets. Here's why:
- The code solves a real, practical problem that people have been struggling with for 30 years.
- The key to the solution lies in looking at the problem from a novel angle.
- The implementation would not be possible without plenty of trickery and clever coding techniques. Several instructions do more than one thing.
- The routine is incredibly compact, and leaves very little room for variation; it is THE solution.
Due to the strict timing requirements, there was no way of testing the routine before it was complete. In contrast with the prevalent fashion of trial-and-error software development, writing this code was an exercise of the mind, and my only tools were pen, paper and text editor.
Acknowledgements
I would like to thank Krill of Plush for creating an excellent loader, documenting some of its inner workings and releasing the entire source code, and Ninja of The Dreams for his concise reference material All about your 1541.
The problem
Commodore compatible 5.25" floppy disks store data as magnetic polarity reversals along 35 concentric tracks on the surface. The drive positions the read head (a coil and an amplifier) over a track, spins the disk, and reads off a stream of ones and zeroes. A switch in polarity indicates a one, whereas no switch in polarity for a predetermined amount of time indicates a zero. But because the spindle motors in different drives won't go at exactly the same speed, it is not permitted to store more than two consecutive zeroes on a track. Otherwise there's a risk that the floppy drive counts off too many or too few zeroes, and the data comes out all wrong.
Hence the data is encoded using a scheme called GCR (Group Code Recording). For every four bits of data (also known as a nybble), five bits are written to the disk surface, chosen from a table. Let's refer to them as a quintuple. The encoding is designed to ensure that, no matter how you order the quintuples, there won't ever be more than two zeroes in a row.
Nybble | Quintuple | Nybble | Quintuple |
---|---|---|---|
0000 | 01010 | 1000 | 01001 |
0001 | 01011 | 1001 | 11001 |
0010 | 10010 | 1010 | 11010 |
0011 | 10011 | 1011 | 11011 |
0100 | 01110 | 1100 | 01101 |
0101 | 01111 | 1101 | 11101 |
0110 | 10110 | 1110 | 11110 |
0111 | 10111 | 1111 | 10101 |
When reading a disk, the floppy drive must convert the incoming bit stream, five bits at a time, into the original nybbles. The bits are, however, lumped together eight at a time, and handed off as bytes to a 6502 CPU. To indicate that a new byte has arrived, the overflow flag in the processor status register is set by means of the SO pin. It is then up to the CPU to extract the quintuples from the bytes, decode them according to the GCR scheme, and join the resulting nybbles into bytes. This is a realtime job, and the deadlines are very tight: On tracks with the highest storage density, a new byte of raw GCR-encoded data arrives every 26 microseconds.
11111222 22333334 44445555 56666677 77788888
The quintuples are packed into bytes, so that each byte contains parts of up to three different quintuples. The least common multiple of five and eight is 40, so after 40 bits we're back in phase. Thus, a GCR decoder generally contains a loop that deals with five bytes in each iteration.
Commodore's approach
Some of the floppy drives produced by Commodore sported 2 MHz processors and large ROM chips with lookup-tables to facilitate the GCR decoding. However, Commodore also made the 1541, a low-cost drive, which turned out to be immensely popular, and has earned its place in the standard setup used for C64 demo compos (an unexpanded C64 and a 1541 floppy drive). The 1541 contains a single 6502 CPU clocked at 1 MHz, 2 kB of RAM and 16 kB of ROM.
With such modest specifications, Commodore concluded that the 1541 was not capable of decoding the GCR data in realtime. Instead, the raw data for a complete 256-byte sector is stashed away in a buffer (of 320 bytes, because of the 5:4 expansion). Only afterwards does the CPU revisit each byte in the buffer, use a series of shifting and masking operations to bring each quintuple into the lower part of an index register, and perform a lookup in a ROM table of 32 entries. Actually, there are two such ROM tables, one containing low nybbles and one containing high nybbles, to be joined into a byte using bitwise-or. During this non-realtime pass, the sector checksum is also computed and verified.
loop bvc loop clv lda $1c01 sta ($30),y iny bne loop
The fetch loop in the 1541 ROM simply reads all the bytes and stores them in a buffer. First it busy-waits until the overflow flag is set. Then it clears the flag in anticipation of the next byte. Then, the current byte is fetched from the hardware register ($1c01) and stored into memory. This snippet only reads 256 bytes, but it is followed by a similar loop that reads the remaining bytes.
The loop needs 19 clock cycles to complete one iteration (assuming the bvc instruction falls through immediately). This leaves several cycles of slack, to account for differences in motor speed across drives.
A loop with more than 26 cycles would miss bytes, whereas a 25-cycle loop would work well in practice. At exactly 26 cycles you would be pushing your luck.
People were not happy with the performance of the original 1541 ROM routines, not only because of the slow GCR decoding, but also because of the dreadfully slow serial protocol used to transmit the decoded data to the computer. The one thing that Commodore got right in that protocol, though, is that they provided a way to upload code into the 1541, and thus to take over the floppy drive completely. The replacement code and any data buffers must fit in the 2 kB of RAM, but this is enough if all you want to do is load data as quickly as possible from the disk and transmit it as quickly as possible to the computer. Such software is called a fastloader.
Krill's approach
As a shining example of a modern fastloader, let's have a look at how GCR decoding is performed in Krill's loader. As expected, there's a loop that receives five bytes of GCR encoded data, and performs as much partial decoding as possible in between receiving the bytes (i.e. while the sector is passing the read head). The goal is to minimise the execution time of the second, non-realtime decoding pass, without overstepping the maximum execution time of the fetch loop, 130 cycles (26 cycles times five). The partially decoded data is stored in two buffers (512 bytes in total), corresponding to the partially decoded high and low nybbles.
But Krill also realised that there is no need to synchronise with the overflow flag for every byte.
Each track has a statically assigned bit rate. As the disk spins at a constant angular velocity (300 rpm), and the bit density of the medium is constant, it is possible to store more bits on the outermost tracks. Commodore defined four bit rates: A byte arrives every 26, 28, 30 or 32 cycles.
After a synchronisation, the first, second and third byte of data are present in the hardware register according to the following schedule. Thus, if we make sure to read the register during the safe areas indicated on the last row, we can read up to three bytes after a bvc loop.
cycle bit rate 0 10 20 30 40 50 60 70 80 90 0 111111111111111111111111111111112222222222222222222222222222222233333333333333333333333333333333 1 111111111111111111111111111111222222222222222222222222222222333333333333333333333333333333 2 111111111111111111111111111122222222222222222222222222223333333333333333333333333333 3 111111111111111111111111112222222222222222222222222233333333333333333333333333 safe areas 111111111111111111111----- ---222222222222----- ---333333-----
Because of slight variations in the timing (due to the three-cycle jitter introduced by the bvc loop, and the fact that the bit detection timer is reset after every one-bit), it is advisable to read the hardware register a couple of cycles into the safe area, but not much later.
As the loop deals with five incoming bytes per iteration, it is sensible to synchronise twice: Once in anticipation of three bytes, and once in anticipation of two bytes.
After a byte has been read, it is subjected to a number of shifting and masking operations, and the partially decoded quintuples are stored away. During the non-realtime pass, the decoding is completed and the checksum is verified. Interestingly, in Krill's loader, the nybbles are not joined together into bytes, but remain in separate buffers. This is because they are going to be transmitted serially anyway, and thus split up into even smaller parts.
The epiphany
As part of my efforts to learn C64 demo programming, I became interested in GCR decoding. One day, in a flash of inspiration, I started thinking about what would happen if all the shifting operations were removed. Consider the second quintuple. It arrives partly in the first byte, partly in the second, and normally you'd try to right-justify it in an index register. For the second quintuple, it's easiest to shift to the left, like so:
11111222 22333334 -----222 22------ mask ---22222 -------- shift
(The digits represent data from the different GCR-encoded quintuples. The dashes are zero-bits.)
But what if you, instead of shifting, combined the masked parts into a single byte using bitwise-or?
11111222 22333334 -----222 22------ mask 22---222 bitwise-or
That is to say, the three most significant bits remain in the low bits while the two least significant bits remain in the high bits. We could of course use this value as an index into a full-page lookup table, but it would be a sparse table. First, we've masked out three of the bits, so we can only reach 32 out of the 256 entries in the table. But actually, because these are GCR encoded values, we'll only access 16 out of those 32 entries, corresponding to the valid quintuples.
Suppose we do this with all eight quintuples, obtaining eight index values:
11111--- 22---222 --33333- 4444---4 5---5555 -66666-- 777---77 ---88888
Now, if we were to allocate a full page of RAM for each of these sparse tables, we'd use up the entire 2 kB. But here's the kicker: The whole point of GCR encoding is to ensure that there can never be a sequence of three consecutive zeroes in the bit stream. Since we're masking out three bits, we are introducing a sequence of three zeroes! So every index contains exactly one sequence of three or more zero-bits, and this can be regarded as a key that selects one out of eight sets of table entries. If these sets are non-overlapping, we can combine all eight tables into a single page of memory.
Unfortunately the sets overlap, for the following reason: A GCR-encoded quintuple may begin and/or end with a single zero bit. If the index value contains a sequence of exactly three zeroes, we know that these are the key. If there is a sequence of five zeroes, we know that the middle bits are the key. But if the index value contains a sequence of four zeroes, then we don't know if it's the key followed by a GCR-bit, or a GCR-bit followed by the key. Example: Index 00010110 contains a sequence of four zero-bits (bits 0, 7, 6 and 5, wrapping around from the right edge to the left edge). We don't know if this should be interpreted as --33333- (the quintuple would be 01011 which is decoded as 0001) or ---88888 (the quintuple would be 10110 which is decoded as 0110).
However, if we use two pages of memory, one for the odd quintuples and one for the even quintuples, then there can be no collisions: If there are four consecutive zeroes in an index value, we know that the three that make up the key must begin at an odd or even bit position. Continuing the example from the previous paragraph, at index 00010110 we'd put the decoded nybble 0001 in the odd table and the decoded nybble 0110 in the even table.
As a bonus, remember that the quintuples are alternating between the high and low nybbles of the decoded bytes. In the odd table, we can shift all the values into the high nybble. This way, to form the final byte out of the first two nybbles, all we have to do is compute the bitwise-or of odd_table[11111---] and even_table[22---222].
This is what the tables look like (entries marked "--" are not accessed):
Even quintuples
-- -- -- -- -- -- -- -- -- 08 00 01 -- 0c 04 05 -- -- 02 03 -- 0f 06 07 -- 09 0a 0b -- 0d 0e -- -- 02 -- -- 08 -- -- -- 00 -- -- -- 01 -- -- -- -- 03 -- -- 0c -- -- -- 04 -- -- -- 05 -- -- -- -- -- 08 0c -- 0f 09 0d 02 -- -- -- 03 -- -- -- -- 0f -- -- 0f -- -- -- 06 -- -- -- 07 -- -- -- -- 06 -- -- 09 -- -- -- 0a -- -- -- 0b -- -- -- -- 07 -- -- 0d -- -- -- 0e -- -- -- -- -- -- -- -- -- 00 04 02 06 0a 0e -- -- -- -- -- -- -- -- 08 09 -- -- -- -- -- -- -- -- -- -- -- -- -- -- 00 0a -- -- -- -- -- -- -- -- -- -- -- -- -- -- 01 0b -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 01 05 03 07 0b -- -- -- -- -- -- -- -- -- 0c 0d -- -- -- -- -- -- -- -- -- -- -- -- -- -- 04 0e -- -- -- -- -- -- -- -- -- -- -- -- -- -- 05 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Odd quintuples
-- -- -- -- -- 00 -- 40 -- 20 -- 60 -- a0 -- e0 -- -- 80 -- 00 -- 10 -- -- -- c0 -- 40 -- 50 -- -- 80 -- 90 20 -- 30 -- -- -- f0 -- 60 -- 70 -- -- -- 90 -- a0 -- b0 -- -- -- d0 -- e0 -- -- -- -- 00 20 a0 -- -- -- -- 80 -- -- -- -- -- -- -- 00 -- -- -- -- -- -- -- 10 -- -- -- -- -- -- -- -- 10 30 b0 -- -- -- -- c0 -- -- -- -- -- -- -- 40 -- -- -- -- -- -- -- 50 -- -- -- -- -- -- -- -- -- -- -- 80 10 c0 50 -- 30 f0 70 90 b0 d0 -- 20 -- -- -- -- -- -- -- 30 -- -- -- -- -- -- -- -- c0 f0 d0 -- -- -- -- f0 -- -- -- -- -- -- -- 60 -- -- -- -- -- -- -- 70 -- -- -- -- -- -- -- -- 40 60 e0 -- -- -- -- 90 -- -- -- -- -- -- -- a0 -- -- -- -- -- -- -- b0 -- -- -- -- -- -- -- -- 50 70 -- -- -- -- -- d0 -- -- -- -- -- -- -- e0 -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Beautiful, aren't they?
Let's recap the algorithm:
Read five bytes.
11111222 22333334 44445555 56666677 77788888
Extract twelve fragments using bitwise-and.
11111--- -----222 --33333- -------4 ----5555 -66666-- ------77 ---88888 22------ 4444---- 5------- 777-----
Combine the fragments into eight bytes using bitwise-or.
11111--- 22---222 --33333- 4444---4 5---5555 -66666-- 777---77 ---88888
Retrieve decoded nybbles from the tables.
The digits represent decoded bits from now on.
1111---- 3333---- 5555---- 7777---- ----2222 ----4444 ----6666 ----8888
Join them into bytes using bitwise-or.
11112222 33334444 55556666 77778888
Store the four bytes in a buffer.
You might be thinking: This is a novel and quite interesting approach and all, but there's nothing to suggest that it would be faster than ordinary shifting. And, indeed, much work remains.
The tricks
We'll use plenty of self-modifying code. This is not really a trick, but an established technique on the 6502 processor. In order to shave off a cycle from every instruction that modifies another one, the entire loop will be placed in the zero-page.
Rather than storing the decoded bytes using an indexed addressing mode (five cycles per byte, plus a few more to advance the index register), we'll push them on the stack (three cycles per byte, pointer automatically decremented, and both index registers are free for other purposes). The 6502 stack is restricted to one page, so a complete sector fits perfectly. The 257th byte (checksum) is handled as a special case after the loop. The downside of this method is of course that we cannot use the stack for anything else. In particular, we can't make any subroutine calls.
For masking, we can use the and instruction, which puts the result in the accumulator. However, sometimes we'd rather preserve the contents of the accumulator. Using unintended (also known as "illegal") opcodes, there are a couple of ways to achieve this: The sax instruction is a mix of sta and stx; it tries to write the contents of both A and X to memory, and due to the NMOS process used to manufacture the 6502, zeroes are stronger than ones, so bus contentions resolve into bitwise-and operations. Thus, we can use sax to store a subset of the bits in A (or X) to memory without clobbering the rest of the register. Furthermore, we can perform a bitwise-and operation using the sbx instruction (a mix of cpx and tax) with an operand of zero, which will leave the result in X.
We can obtain two copies of each incoming byte: The unintended opcode lax is a mix of lda and ldx, and can be used to fetch an encoded byte into both A and X.
To combine index parts, we can use a bitwise-or operation as described above, but of course addition would work equally well. Addition can be computed implicitly as part of an indexed memory access: Since the odd and even tables are page-aligned, we can store some of the index bits in the least significant byte of the instruction operand, and the remaining bits in an index register.
Finally, consider quintuple number four: It arrives in two parts, one of which is simply the least significant bit of the second incoming byte. Rather than mask this bit and keep it in a register, we can use the lsr instruction (the only shifting operation in the entire GCR decoder) to place it in the carry flag. To combine it with the other part of the index, we simply adc (add-with-carry) a constant zero. This frees up a register, but that's not all: As a side effect, the addition operation will clear the overflow flag. By executing the adc at the right moment, we can get rid of one clv instruction, saving two cycles.
Armed with these tools, we are faced with a magnificent puzzle of instructions, registers, bits and cycles. When taking on this challenge, I had no idea if it would be possible to find a solution. By the time I had only a few cycles left to optimise, I could hardly think of anything else for two days. But then everything clicked into place.
The code
The following loop decodes GCR on the fly. The checksum must still be verified in a separate pass.
Start at the label begin_reading_here; that is the logical beginning of the loop. In practice, the first iteration is handled partly outside the loop, which is entered at zpc_entry ("zpc" is short for zero-page code).
zpc_loop ; This nop is needed for the slowest bit rate, because it is ; not safe to read the third byte already at cycle 65. ; However, with the nop, the best case time for the entire loop ; is 130 cycles, which leaves absolutely no slack for motor speed ; variance at the highest bit rate. ; Thus, we modify the bne instruction at the end of the loop to ; either include or skip the nop depending on the current ; bit rate. nop lax $1c01 ; 62 63 64 65 44445555 and #$f0 ; 66 67 adc #0 ; 68 69 A <- C, also clears V tay ; 70 71 zpc_mod3 lda oddtable ; 72 73 74 75 lsb = --33333- ora eventable,y ; 76 77 78 79 y = 4444---4, lsb = -------- ; in total, 80 cycles from zpc_b1 zpc_b2 bvc zpc_b2 ; 0 1 pha ; 2 3 4 second complete byte (nybbles 3, 4) zpc_entry lda #$0f ; 5 6 sax zpc_mod5+1 ; 7 8 9 lax $1c01 ; 10 11 12 13 56666677 and #$80 ; 14 15 tay ; 16 17 lda #$03 ; 18 19 sax zpc_mod7+1 ; 20 21 22 lda #$7c ; 23 24 sbx #0 ; 25 26 zpc_mod5 lda oddtable,y ; 27 28 29 30 y = 5-------, lsb = ----5555 ora eventable,x ; 31 32 33 34 x = -66666--, lsb = -------- pha ; 35 36 37 third complete byte (nybbles 5, 6) lax $1c01 ; 38 39 40 41 77788888 clv ; 42 43 and #$1f ; 44 45 tay ; 46 47 ; in total, 48 cycles from b2 zpc_b1 bvc zpc_b1 ; 0 1 lda #$e0 ; 2 3 sbx #0 ; 4 5 zpc_mod7 lda oddtable,x ; 6 7 8 9 x = 777-----, lsb = ------77 ora eventable,y ; 10 11 12 13 y = ---88888, lsb = -------- pha ; 14 15 16 fourth complete byte (nybbles 7, 8) begin_reading_here lda $1c01 ; 17 18 19 20 11111222 ldx #$f8 ; 21 22 sax zpc_mod1+1 ; 23 24 25 and #$07 ; 26 27 tay ; 28 29 ldx #$c0 ; 30 31 lda $1c01 ; 32 33 34 35 22333334 sax zpc_mod2+1 ; 36 37 38 ldx #$3e ; 39 40 sax zpc_mod3+1 ; 41 42 43 lsr ; 44 45 4 -> C zpc_mod1 lda oddtable ; 46 47 48 49 lsb = 11111--- zpc_mod2 ora eventable,y ; 50 51 52 53 lsb = 22------, y = -----222 pha ; 54 55 56 first complete byte (nybbles 1, 2) tsx ; 57 58 BNE_WITH_NOP = (zpc_loop - (* + 2)) & $ff BNE_WITHOUT_NOP = (zpc_loop + 1 - (* + 2)) & $ff zpc_bne .byt $d0,BNE_WITH_NOP ; 59 60 61 bne zpc_loop
Final remarks
Of course, once I had figured out the solution, I needed to write a fastloader around it to see whether it would really work. The loader grew into the Spindle trackmo toolchain, which I am saving for another article. To see the GCR decoder in action, have a look at my demo Shards of Fancy.
Posted Sunday 31-Mar-2013 14:40
Discuss this page
Disclaimer: I am not responsible for what people (other than myself) write in the forums. Please report any abuse, such as insults, slander, spam and illegal material, and I will take appropriate actions. Don't feed the trolls.
Jag tar inget ansvar för det som skrivs i forumet, förutom mina egna inlägg. Vänligen rapportera alla inlägg som bryter mot reglerna, så ska jag se vad jag kan göra. Som regelbrott räknas till exempel förolämpningar, förtal, spam och olagligt material. Mata inte trålarna.
Tue 2-Apr-2013 19:59
Jesper Öqvist
Tue 2-Apr-2013 20:12
Sun 7-Apr-2013 01:01
Sun 7-Apr-2013 16:19
best regards - wegi
Tue 9-Apr-2013 18:49
Can't help it - don't see a relation to that compress link in the comments there :)
Fri 12-Apr-2013 04:17
Fri 12-Apr-2013 04:21
Fri 12-Apr-2013 13:03
Mon 15-Apr-2013 19:31
Since the values of the even table use only the low nybble and those of the odd use only the high nybble, you could overlay these into a single table. I guess there's no reason to do this, though, since it would require more masking and you were able to fit both tables into the 2 kB of RAM.
Tue 16-Apr-2013 21:11
But in all honesty that might only have been possible on the Amiga ;-).
Built some efficient MFM loaders for that myself too.
@reumerd
Sat 29-Jun-2013 01:08
I've just watched quite a detailed talk about the 6502, that I thought you'd sure like (https://www.youtube.com/watch?v=K5miMbqYB4E). More info is available at visual6502.org. They've analyzed and simulated the chip based on high-res photographs.
-- jn
Tue 2-Jul-2013 20:40
Linus Åkesson
Thu 4-Jul-2013 21:33
Not including serial transfer, because my goal was to make an IRQ loader. This adds some overhead to the serial protocol, as it must be able to cope with arbitrary delays on the C64 side. I am however down to three revolutions for tracks 18-35 and four for tracks 1-17, as measured on a static display with 25 badlines and no sprites.
Sun 7-Jul-2013 21:13
Your short routine that fits into the 1541-RAM is fantastic.
Carlos Duarte do Nascimento
Tue 14-Jan-2014 05:43
Fri 25-Apr-2014 16:04
Take a look at the code from Epyx Winter Games if you haven't already. Other than the directory track, the disk doesn't use GCR. And I'm pretty sure (memory isn't perfect after almost 30yrs) it transferred 253 or 254 bytes at a time, pushing them on the stack on the c64 side in order to save a cycle vs sta.
Wed 5-Nov-2014 20:31
In interest of pure speed, I would love to see this used with a non-IRQ/no-display loader for 2 revolution track transfer. I've looked into it and don't think it's possible on highest density tracks unfortunately.
From the end of this processing until the sector header 2 away you have:
- current sector tail - 2 00s + 4 or more gap = 6+ bytes
- skipped sector header - 5 sync + 10 contents + 9 gap = 24 bytes
- skipped sector data - 5 sync + 325 data + 4 or more gap = 334+ bytes
- for total of 364+ bytes (note these are all in terms of encoded (GCR) not decoded bytes)
So 364 is the minimum number of disk bytes that may fly by in the time you need to transfer the sector just read. After that you need to go back to reading a sector header, etc.
At highest density setting, this is 26us per byte, or 26 * 364 = 9464us.
The fastest serial bit-bang transfer I think has ever been done is Vorpal V1 (like used in Winter Games). It transfers 3 bytes in a 113us loop, for 37.67us / byte.
We need to transfer 257 bytes (256+checksum) in 9464us. This means each byte must be transferred in 9464us / 257 = 36.82us.
So I'd expect Vorpal style serial transfer to allow you to squeeze by with 2 revolutions per track on lower-densities, but just miss the mark and still require 3 revolutions per track on highest density.
I would still love to tweak this code to be more density dependent to eliminate that BVC or two though. That should free enough cycles to replace PHA with STA $1801 and allow a (non-IRQ/no-display) 1 revolution parallel port track transfer :).
Aaron
Fri 27-Feb-2015 08:06
Also, do all the bits occur in order to select some subset of the bits in the GCR? For example, just mask every other bit and call that your decoded data.
Extend this bit mangling even further, make up a new encoding that gives even 2 decoded bits per byte that are stuffed directly to the output (in a 2bit transfer). This is quite wasteful of space however and reduces the data rate quite a bit, but is trivially fast to send and decode.
Linus Åkesson
Fri 27-Feb-2015 12:03
Sun 23-Apr-2017 20:56
Q: when do you actually check checksums? I would do a transfer and calculate, and after sending a sector, signal if it was a good sector or not.
Cheers,
Stefan (St0fF/Neoplasia^theObsessedManiacs)
Sun 23-Apr-2017 21:52
Sun 30-Jul-2017 10:53
David Galloway
Sun 30-Jul-2017 10:58
I have to say that when I saw the code in a disassembly for the Apple I was both awed and disappointed that I never allowed myself to think outside the box.
I also worked on the C64 but not on the 1541... It's such an amazing system to have a second '6502' in charge of the disk drive and then a tiny straw to connect to the C64...
David Galloway
Sun 30-Jul-2017 11:22
Frank Thomas
Sun 26-Nov-2017 15:35
And I can't actually believe that someone in 2013 thinks about improving software written >30 years ago.
Great work!!! Big applause!!
Mon 11-Dec-2017 08:33
Wed 13-Dec-2017 07:25
Wed 4-Jul-2018 14:22
Wed 15-Aug-2018 14:39
thanks to this work, i've been inspired to come up with the next step.
Full on-the-fly GCR block read+decode+checksumming on a stock 1541: https://csdb.dk/release/?id=167152
I'd like to write a follow-up guest article about it at some point. :)
Regards,
Krill
Wed 1-May-2019 01:47
thanks to this work, i've been inspired to come up with the next step.
Full on-the-fly GCR block read+decode+checksumming on a stock 1541: https://csdb.dk/release/?id=167152
I'd like to write a follow-up guest article about it at some point. :)
Regards,
Krill
Awesome work here I agree, and I just had a chance to look @ something you recently did some work on (Space Beer), and Fuck Me that is a fast loader. I re-ran the demo 3 times to make sure I wasn't just high or something. Which loader is that?
Sun 9-Jun-2019 23:50
Wed 19-Jun-2019 08:45
Now for some improvements!!
1) instead of storing $0100,$01FF->$0101, I store $01FF->$0100 so the data does not have a wrap. This is accomplished by starting the stack at $FF instead of $00, and also changing the final few instructions in the zero-page from pha before tsx to tsx before pha. This works because pha does not change the zero flag.
2) if the first byte of data is not $55, I give up after 3 tries and report error 04 for that sector and move on. also, if parity fails, give up after 3 tries and report error 05. This is because my project is a disk utility and not a loader :) I cannot "hang" trying forever.
3) save/restore the last 8B of stack, and SP, just before/after this routine, so I can still jsr/rts everywhere else in the code.
4) save/restore a portion of ZP before/after reading all the sectors I wanted, so I can return to KERNAL when done. Returning to KERNAL is important for my utility project. I am using $86-$E5 for the ZP code and only save/restore $99-$B4 -- the rest is just zero'd out.
Cheers!
Mon 1-Jul-2019 02:56
Also important: The "first byte" and "last byte" decoding code is not shown on this blog, but it is visible in the Shards of Fancy demo. I found that it is unnecessary to separately handle the first byte and enter the zero-page code in the middle. You can start at the top if your "first byte" handling is as follows:
; assuming you already found the sector header of interest, and
; skipped the 'sync' that follows it, here is what you do next:
@byte0: bvc @byte0
clv
lda $1C01
cmp #$55 ; "DATA" signature
bne fail
ldx #$FF
txs ; set SP = $1FF
jmp decoder
fail: rts ; wait for another header, etc...
decoder:
nop ; required delay
@byte1: bvc @byte1
clv
lda $1C01
tax
lsr A ; set Carry Flag per first decode
txa
and #$3E
sta zaddr0+1 ; address of first "oddtable" access
nop ; required delay
@byte2: bvc @byte2
jmp zpdecode+1 ; top of zero page code (skip the nop)
Good luck all!
--Paul Nelsen
--Oaktree Innovations
Joe Peter
Tue 12-Nov-2019 05:14
Good luck all!
--Paul Nelsen
--Oaktree Innovations
Paul are you Scott's brother? Were you at the star trek parties we attended?
I tried searching for you guy but I guess I spelled it Nelson :) and it has been almost 30 years.
While doing warpeed I was floored by vorpal and how it was possible but at the time was too busy to look further into it and only when I met Scott did I find out it was not regular GCR code.
Joe Peter
Tue 12-Nov-2019 05:27
thanks to this work, i've been inspired to come up with the next step.
Full on-the-fly GCR block read+decode+checksumming on a stock 1541: https://csdb.dk/release/?id=167152
I'd like to write a follow-up guest article about it at some point. :)
Regards,
Krill
Actually did this for the copy protection system I wrote for ATG. It also used a kind of scatter loading where each sector had the address of where it should load at. The file did not load sequentially. It read the bits, decoded it and if the checksum was ok, transferred it immediately to computer memory. It did take 2 or more revolutions to transfer the track since the CPU was not fast enough to read and transfer data on the fly even if undecoded. Vorpal which did read, decode and transfer on the fly did not use standard GCR coding.. I thought Scott was the one who did it but maybe Paul can help.
Preference was for the protection than the speed. But also error correction was important since unlike normal data, the encrypted data would not decode properly with any bit errors. Hence it sometimes took 3 revs to load if it encountered any errors. Also the disk had to be able to be duplicated by standard duplicating machines. Even if standard drives error'ed out and stopped loading, older copy protection skipped over some bit errors and it usually was not noticeable since real code was only 5-10% of the data involved. This was done so people could still load their disk even if some bits got erased. But for a full wrapper this would not work because the bit errors would trash any future decoded data which depended on previously decoded values.
Joe Peter
Tue 12-Nov-2019 05:40
Ah thanks :) after decades someone admits to going through my code... and cracking it too.. That was the protection system used for an automated process where the company sent us the files and we stuck a disk into the drive and a program took those files and protected it and wrote it out onto another disk which could be sent out to the duplicators..
I did not know about Heureka Sprint until today or you might have seen a real time version of that protection scheme like what Vorpal did. Since 5 to 4 gave the most, Vorpal's 5 to 3 bits was not enticing for most companies as they needed the space more than the speed. But a 6 to 5 decoder would not lose so much space on the disk.. extra disks for most companies meant like $50,000 loss as the min run rate for a release.
Wed 15-Jan-2020 09:54
Or a Quibble.
Tue 4-Aug-2020 01:28
This is a common misconception.
CLV resets the external shift register. The external shift register sets again SOE upon completition of 8 shifts since the CLV. It's not "anticipation" but exactly where it should be.
Tue 4-Aug-2020 01:44
Now for some improvements!!
1) instead of storing $0100,$01FF->$0101, I store $01FF->$0100 so the data does not have a wrap. This is accomplished by starting the stack at $FF instead of $00, and also changing the final few instructions in the zero-page from pha before tsx to tsx before pha. This works because pha does not change the zero flag.
2) if the first byte of data is not $55, I give up after 3 tries and report error 04 for that sector and move on. also, if parity fails, give up after 3 tries and report error 05. This is because my project is a disk utility and not a loader :) I cannot "hang" trying forever.
3) save/restore the last 8B of stack, and SP, just before/after this routine, so I can still jsr/rts everywhere else in the code.
4) save/restore a portion of ZP before/after reading all the sectors I wanted, so I can return to KERNAL when done. Returning to KERNAL is important for my utility project. I am using $86-$E5 for the ZP code and only save/restore $99-$B4 -- the rest is just zero'd out.
Cheers!
where can we find this disk utility?
Thu 5-Nov-2020 19:57
Lunatico is my favorite demo - of course on a real floppy disk
keep doing demo!
8bit lives matter
Tue 29-Dec-2020 05:35
I just heard that Krill has now posted a x50 fastloader on a stock 1541:
https://csdb.dk/release/?id=197710
Regards,
George
Linus Åkesson
Mon 11-Jan-2021 13:46
Thanks for the link! That is quite impressive and inspiring. I'm more into IRQ loaders myself, but some of the new techniques are applicable there too.
Wed 31-Mar-2021 07:20
It turns out with a few instructions overhead the same code can do both reorders by an accident of the architecture and the right code structure that mixes string loads and stores (which change direction with the direction flag) with register swaps (which don't change direction).
That is, it was my favorite until I found this. This is pretty badass and I've never even owned a C64.
Fri 14-Jan-2022 22:17
@reumerd
Wed 24-Jul-2024 08:14
I especially like the idea of using the stack as your block buffer, keeping registers free and having yet another pointer available. This, along with all the other tricks really make this a piece of art. If anyone is looking for a definition of out-of-the-box-thinking, link to this article.
Jens Schönfeld / iComp