Safe VSP
I contributed this one-filer to the C64 Demo compo at Datastorm 2013. It ended up in 7th place, which I consider quite good for a technical proof of concept.
One of the tricks you can do on the C64 involves manipulating the video chip into reading the graphics data at an offset from where it's usually located. This allows you to scroll the display horizontally, and the trick is called VSP for Variable Screen Position. However, some machines crash when you attempt this, and the reason for that has always been a mystery. Not anymore.
Some say this forum thread reads like a thriller. Zer0-X managed to capture a VSP crash using a logic analyser and posted 15 MB of data. A year later I started looking into it and discovered the root cause. The proposed workaround is not very practical, but it supports my hypothesis: people have tried it on their crash-prone machines, and so far none of them has crashed.
A technical explanation appears in the demo as a 10-minute scroller, but the same text is provided below, for your convenience.
I also found this an excellent opportunity to compose a 10-minute SID epic heavily inspired by Martin Galway's Parallax. In particular, I borrowed its musical structure, which I affectionately think of as starter, main course, dessert: something a bit weird, followed by something substantial and straightforward, followed by a sweet melodic part. The three courses are quite distinct, but complement each other. My tune is called Sideways in reference to both Parallax and the VSP trick.
- lft-safe-vsp (C64 executable file, 12.7 kB)
- Sideways (SID tune, 8.7 kB)
- Linus Akesson - Sideways (MP3, 8.8 MB)
Safe VSP has a csdb page and a pouët page, and was featured on Hacker News.
Technical lowdown:
The dreaded VSP crash is caused by a metastability condition in the DRAM. Some have speculated that it has to do with refresh cycles, but hopefully the detailed explanation in this scroller will crush that myth once and for all.
But first, this is what the machine behaves like from a programmer's point of view. Let us call memory locations ending in 7 or f fragile. Sometimes when VSP is performed, several fragile memory cells are randomly corrupted according to the following rule: Each bit in a fragile memory cell might be changed into the corresponding bit of another fragile cell within the same page.
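The rule can be modelled in a few lines of Python. This is only a sketch of the programmer-visible behaviour, not of the hardware. The names is_fragile and corrupt_page, and the per-bit probability p, are assumptions made for illustration.

```python
import random

def is_fragile(addr: int) -> bool:
    """True for addresses whose low three bits are all set (ending in 7 or f)."""
    return addr & 7 == 7

def corrupt_page(page: bytearray, rng: random.Random, p: float = 0.5) -> None:
    """Apply one hypothetical VSP glitch to a 256-byte page, in place.

    Each bit of each fragile byte may be replaced by the same bit of
    another fragile byte in the page. The probability p is an assumption;
    the real odds depend on the individual machine.
    """
    fragile = [i for i in range(len(page)) if is_fragile(i)]
    before = bytes(page)  # donor bits are read from the pre-glitch state
    for i in fragile:
        for bit in range(8):
            if rng.random() < p:
                donor = rng.choice(fragile)
                mask = 1 << bit
                page[i] = (page[i] & ~mask) | (before[donor] & mask)
```

Whatever the random choices are, bytes at non-fragile offsets are never touched by this model.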
This specific behaviour can be exploited in several ways: One approach is to ensure that every fragile byte in a page is identical. If the page contains code, for instance, corruption is avoided if all the fragile bytes are $ea (nop). Similarly, in font definitions, the bottom line of each character could be blank.
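A page can be checked for this property ahead of time, for instance in a build script. The sketch below assumes a page-aligned 256-byte block; the function names are made up for the example.

```python
def fragile_bytes(page: bytes) -> bytes:
    """Bytes at offsets 7, 15, 23, ... of a page-aligned 256-byte page."""
    return page[7::8]

def page_is_immune(page: bytes) -> bool:
    """True if all fragile bytes are equal, so swapping bits between them changes nothing."""
    return len(set(fragile_bytes(page))) <= 1
```

For a font stored on an 8-byte boundary, offset 7 of every group is the bottom line of a character, which is why blank bottom lines make the whole font immune.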
Another technique is to simply avoid all fragile memory locations. The undocumented opcode $80 (nop immediate) can be used to skip them. Data structures can be designed to have gaps in the critical places.
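As an illustration of the $80 trick, here is a rough sketch of a layout pass that keeps every instruction away from fragile addresses. It is not the tool used for the demo; lay_out and its calling convention are assumptions, and it does not fix up branch offsets or absolute addresses that move because of the inserted bytes.

```python
NOP = 0xEA   # documented one-byte nop, used as padding
SKIP = 0x80  # undocumented two-byte "nop immediate"

def lay_out(instructions, origin):
    """Return machine code for `instructions` placed at `origin`.

    instructions: iterable of bytes objects, one per 6502 instruction
    (1 to 3 bytes each). In every group of 8 addresses, offsets 0-5 hold
    real code, offset 6 holds $80 and offset 7 (fragile) holds its
    operand, whose value does not matter.
    """
    if origin & 7 == 7:
        raise ValueError("origin must not be a fragile address")
    out = bytearray()
    for ins in instructions:
        offset = (origin + len(out)) & 7   # position within the 8-byte group
        room = 6 - offset                  # bytes left before the skip slot
        if len(ins) > room:
            out += bytes([NOP]) * room     # pad up to offset 6
            out += bytes([SKIP, 0x00])     # operand lands on the fragile byte
        out += ins
    return bytes(out)
```

For example, lda #$01 / sta $d020 / inc $d021 / rts assembled at $1000 gets one padding nop and one skip pair, so that $1007 only ever holds a throwaway operand.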
This latter technique is used in this demo, including the music player of course. Data that cannot have gaps, i.e. graphics, is continuously restored from safe copies elsewhere in memory. You can use shift lock to disable this repair, and eventually you should see garbage accumulating on the screen. And yet the code will keep running.
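The repair idea can be sketched as follows. How the demo actually lays out its safe copies is not described here, so the gapped storage scheme, the function names and the addresses in the usage note are all assumptions for illustration.

```python
def gapped(i: int) -> int:
    """Map logical index i to an offset that is never fragile.

    Uses 7 bytes out of every 8, assuming the base address is 8-byte aligned.
    """
    return i + i // 7

def fragile_in(start: int, end: int) -> range:
    """All fragile addresses in [start, end)."""
    return range(start | 7, end, 8)

def save(mem: bytearray, start: int, end: int, backup_base: int) -> None:
    """Copy the fragile bytes of [start, end) into gapped storage."""
    for i, addr in enumerate(fragile_in(start, end)):
        mem[backup_base + gapped(i)] = mem[addr]

def repair(mem: bytearray, start: int, end: int, backup_base: int) -> None:
    """Restore the fragile bytes of [start, end) from gapped storage."""
    for i, addr in enumerate(fragile_in(start, end)):
        mem[addr] = mem[backup_base + gapped(i)]
```

Calling save once after loading graphics at, say, $2000 with a backup area at $c000, and repair once per frame, would undo any corruption of that region, because the backup itself never occupies a fragile address.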
Thus, for the first time, the VSP crash has been tamed.
Now for the explanation. The C64 accesses memory twice in every clock cycle. Each memory access begins with the least significant byte (LSB) of the address, also known as the row address, being placed on an internal bus connected to the DRAM chips. As soon as the row address is stable, the row address strobe (RAS) signal is given. Each DRAM chip now latches the row address into a register, and this register controls a multiplexer which connects the selected memory row to a set of wires called sense lines. Each sense line connects to a single bit of memory.
The sense lines have been precharged to a voltage in between logical zero and logical one. The charge stored in the memory cell affects the sense line towards a slightly lower or higher voltage depending on the bit value. A feedback amplifier senses the voltage difference and exaggerates it, so that the sense line reaches the proper voltage representing either zero or one. Because the memory cell is connected (through the multiplexer) to the sense line, the amplified charge will also flow back and refresh the memory cell. Hence, a memory row is refreshed whenever it is opened.
VSP is achieved by triggering a badline condition during idle mode in the visible part of a rasterline. When this happens, the VIC chip gets confused about what memory address to access during the half-cycle following the write to $d011. It sets the internal bus lines to 11111111 in preparation for an idle fetch, but suddenly changes its mind and tries to read from an address with an LSB of 00000111.
Now, since electrical lines can't change voltage instantaneously, there is a brief moment of time when each of the changing bits (bits 3 through 7) is neither a valid one nor a valid zero. And because the VIC chip changes the address at an abnormal time, there is now a risk that the RAS signal, which is generated independently by another part of the VIC chip, is sent while one or more bus lines are within the undefined voltage range.
When an undefined voltage is latched into a register, the register enters a metastable state, which means that its output will flicker rapidly between zero and one several times before settling. This has catastrophic consequences for a DRAM: The row multiplexer will connect several different memory rows, one at a time, to the same sense lines. But as soon as some charge has moved from a memory cell to the sense line, the amplifier will pull it all the way to a one or a zero. If, at this point, another memory row is connected, then the charge will travel from the sense line into this other memory cell. In short, one memory cell gets refreshed with the bit value of a different memory cell.
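The effect on the memory contents can be modelled like this. It is a simplified sketch that assumes a bank of eight one-bit-wide chips, each with its own row latch, and that the first row connected wins the sense lines; glitch_open and its arguments are invented for the example.

```python
def glitch_open(dram: bytearray, rows_per_chip) -> None:
    """Model one glitched access on eight hypothetical 64K x 1 DRAM chips.

    dram is indexed by address, with row = low byte and column = high byte.
    rows_per_chip[bit] lists the rows that the chip holding that bit
    connects, in order, while its row latch is still unsettled.
    """
    for bit, rows in enumerate(rows_per_chip):
        mask = 1 << bit
        first, later = rows[0], rows[1:]
        for col in range(256):
            # The first row decides the sense line, which is then amplified.
            sense = dram[(col << 8) | first] & mask
            for row in later:
                addr = (col << 8) | row
                # The amplified value is written back into the wrong row.
                dram[addr] = (dram[addr] & ~mask) | sense
```

Because a whole row is connected at once, the copy happens in every column, that is, between bytes with the same high address byte.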
Note that because the bus lines change from $ff to $07, only memory rows with an address ending in three ones are at risk of being opened simultaneously. This explains why corruption can only occur in memory locations ending in 7 or f.
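The set of rows at risk follows directly from the two bus values. A small sketch, with candidate_rows as an illustrative name:

```python
def candidate_rows(old: int = 0xFF, new: int = 0x07) -> list:
    """Rows that could be selected while the changing bus lines are undefined."""
    unstable = old ^ new                 # lines that change: $f8, bits 3-7
    fixed_mask = ~unstable & 0xFF        # lines that hold still: $07, bits 0-2
    fixed_value = new & fixed_mask       # their value: all ones
    return [row for row in range(256) if (row & fixed_mask) == fixed_value]
```

With the default arguments this yields 32 rows: $07, $0f, $17 and so on up to $ff.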
Finally, this phenomenon hinges on the exact timing of the RAS signal at the nanosecond level, and on many machines the critical situation simply doesn't occur. The timing (and thus the probability of a crash) depends on factors such as temperature, VIC revision, parasitic capacitance and resistance of the traces on the motherboard, power supply ripple and interference with other parts of the machine such as the phase of the colour carrier with respect to the dotclock. The latter is assigned randomly at power-on, by the way, which could be the reason why a power-cycle sometimes helps.
This is lft signing off.
Posted Wednesday 20-Mar-2013 22:23
Discuss this page
Disclaimer: I am not responsible for what people (other than myself) write in the forums. Please report any abuse, such as insults, slander, spam and illegal material, and I will take appropriate actions. Don't feed the trolls.
Tue 24-Jan-2017 15:22
I guess what is banked out would not suffer?
Sat 18-Feb-2017 18:18
Please could you explain why you can say this:
"the phase of the colour carrier with respect to the dotclock. The latter is assigned randomly at power-on".
I believe the MOS-8701 works with fixed delays to generate the dot clock, so it should always come up with the same phase with respect to the color clock.
On the boards with no MOS-8701, there's a classic PLL circuit that should always come up with the same phase relationship too.
What am I missing? Thanks (iz8dwf at amsat dot org)
Linus Åkesson
Thu 23-Feb-2017 17:16
"the phase of the colour carrier with respect to the dotclock. The latter is assigned randomly at power-on".
I believe the MOS-8701 works with fixed delays to generate the dot clock, so it should always come up with the same phase with respect to the color clock.
Hi!
The MOS-8701 produces a 7.88 MHz dotclock and a 4.43 MHz colour carrier from the same internal signal. Their ratio is exactly 16:9, which means that 16 hi-res pixels on the screen correspond to nine complete cycles of the colour signal.
It follows that for each hi-res pixel, the phase of the colour carrier is advanced by 9/16 revolutions, which is 202.5 degrees. If it starts at 0 degrees, then after 8 hi-res pixels it will be at 180 degrees. Hence the familiar red/green vertical banding that repeats after 16 pixels; red and green are 180 degrees apart in YUV.
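The arithmetic can be checked with exact fractions. This is just a sketch of the calculation above; carrier_phase is an illustrative name.

```python
from fractions import Fraction

STEP = Fraction(9, 16)  # carrier revolutions per hi-res pixel (ratio 16:9)

def carrier_phase(pixel: int, start_degrees=0):
    """Phase of the colour carrier, in degrees, at a given hi-res pixel."""
    return (start_degrees + pixel * STEP * 360) % 360
```

Pixel 1 gives 202.5 degrees, pixel 8 gives 180 degrees, and pixel 16 is back at 0.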
Now, at which pixel is the colour carrier at 0 degrees?
This depends on the timing relationship between the 8701 and the VIC chip. The 8701 has a reset pin, but it doesn't seem to be connected in the C64. The VIC doesn't even have a reset pin.
So, during power-on, there will be a brief period before the 8701 is outputting a stable signal, and during this period the VIC state machine may or may not respond properly. Meanwhile, the internal clock-divide counters of the 8701 might not even start from zero.
That is where the random assignment happens.
Linus Åkesson
Thu 23-Feb-2017 17:18
I guess what is banked out would not suffer?
Yes, and often several bytes at the same time.
Banking doesn't affect anything, I'm afraid, because the Row Select procedure (LSB) is carried out regardless of what the MSB will be. The corruption happens inside the RAM chips themselves.
Sun 26-Feb-2017 20:25
lft wrote:
Hi! The MOS-8701 produces a 7.88 MHz dotclock and a 4.43 MHz colour carrier from the same internal signal. Their ratio is exactly 16:9, which means that 16 hi-res pixels on the screen correspond to nine complete cycles of the colour signal.
Actually, from the C64 schematics (and from what I can see from a scope) the colour clock output is 4 x 4.433619 MHz which is the crystal frequency, and this signal goes to the VIC-II.
The colour carrier is obtained internally on the VIC-II, but what you say of course doesn't depend on where the 4.433 MHz signal is generated.
lft wrote:
Meanwhile, the internal clock-divide counters of the 8701 might not even start from zero. That is where the random assignment happens.
this could be true.
The discussion is very interesting to me, since eliminating the need for a precise phase offset between the two clocks (as long as it's constant after each power-up) could make it possible to build a very cheap 8701 replacement.
Sun 12-Nov-2017 18:51
Linus Åkesson
Tue 14-Nov-2017 06:24
That's not how quantum computing works.
Besides, if a bit is neither a valid 0 nor a valid 1, I don't see why it couldn't be both an invalid 0 and an invalid 1. (In fact it is.)
Thu 11-Apr-2019 20:07
Question: You explain that bytes at "fragile" addresses can become corrupted within some page. But what determines the page?
In other words, the LSB of the addresses being corrupted is %xxxxx111, but what is the MSB?
(I'm a coder, not an engineer, so it might be obvious - but i don't know how the electrical circuits work)
Sat 13-Jun-2020 15:39
Tue 23-Feb-2021 04:24
https://parallel.princeton.edu/papers/micro19-gao.pdf
There's also this paper from late 2013 (while you released SafeVSP in early 2013 - coincidence or hidden inspiration...?) where they also exploit that behaviour to do in-DRAM copying, but it's only theoretical as they didn't actually attempt it with real parts and only mention to the effect that it's "not allowed", unlike the above paper which actually tried it and discovered it already works:
https://users.ece.cmu.edu/~omutlu/pub/rowclone_micro13.pdf
I bet this effect is reproducible with all DRAM ever made, from the ones used in the C64 to the latest DDRs.
Mon 5-Apr-2021 15:53
I was wondering what you make of this "safe VSP channel": https://pastebin.com/dGFckHUb
Some have expressed disbelief at its viability, but I have not seen any cogent rebuttal or, conversely, proof of its effectiveness. I only really started looking into it today and haven't even dabbled with it myself (not wishing to leap into another detour from my main coding project, Parallaxian).
Thu 22-Apr-2021 10:46
The VSP bug is the manifestation of what mathematicians call a "complex system", i.e., something in which disparate, organically unrelated inputs can trigger identical outputs. Hence a "one size fits all" software solution is going to be challenging.
It seems that the EOR-based channel write thing looks like a way to minimise the risk of the bug being triggered without necessarily completely guaranteeing its prevention. That said, it was extremely impressive work on your part to take the issue that far.
My personal position would be to file the bug along with moody tape decks and disk drives and therefore consider it should not be consigned to the dustbin of coding history at all. At the bare minimum, games with VSP scrolling could have a "gracefully degraded" version, with no VSP, for use on crash-prone hardware, but by no means should coders be keen to dump VSP because of a bug on outlier hardware.
Fri 4-Mar-2022 02:29
x00 x01 x02 x03 x04 x05 x06 (fragile)x07 x08 x09 x0a x0b x0c x0d x0e (fragile)x0f
x10 x11 x12 x13 x14 x15 x16 (fragile)x17 x18 x19 x1a x1b x1c x1d x1e (fragile)x1f
..
xFFF0 xFFF1 xFFF2 xFFF3 xFFF4 xFFF5 xFFF6 (fragile)xFFF7 xFFF8 xFFF9 xFFFa xFFFb xFFFc xFFFd xFFFe (fragile)xFFFF
Thank you for clarifying.
John
Linus Åkesson
Fri 4-Mar-2022 07:21