*=------------------------------------------------------------------------=*

               >> Precalc Linear lookup Blitter C2P <<

                     Version:  1.06
                    Released:  5th-08-1996

       Written & Designed By:  Kevin Picone

         (c) Copyright 1996 by Kevin Picone of Underware Design
                         All Rights Reserved.

*=------------------------------------------------------------------------=*

   Contact: 'Kevin Picone' at

     Email: uwdesign@lin.cbl.com.au

*=------------------------------------------------------------------------=*

 Features:
      * Linear Chunky Frame Buffer
      * Single, Double & Quad Pixel widths are supported
      * Uses Normal (non-resorted) Pixels
      * 256/64/16 Colour Versions
      * Various Specialized C2P methods:
            * Normal
            * Delta
            * Null Skip & Clear
            * Delta Null Skip & Clear (NEW to V1.04)
      * 16bit conversion with 32bit Writes to ChipRAM

 Special Requirements:
      * ECS/AGA Blitter for long blits.
      * Fastram for Precalc buffer (from 512k -> 2meg ;)
      * Normal Planar Screen (NOT interleaved)
      * No Screen modulo is allowed

 Disadvantages:
      * Rather hungry for FASTRAM
      * Extra ChipRAM demands
      * Uses the Blitter
      * Provided Sources are *VERY* bulky
      * Can be cumbersome to initially add
      * No 'C' support (sorry!)

*=------------------------------------------------------------------------=*

* Copyright:
============

   The included source codes, intellectual properties, documentation and
   their contained description of methods remain the (c) copyright of
   Kevin Picone 1996.

   I hereby grant permission for this code to be used freely in
   P.D./Freeware/Shareware/Commercial software releases, or used as the
   basis for a better C2P solution in PD/Freeware/Shareware/Commercial
   software releases, on one condition only: you *MUST* credit me for my
   work. (A free copy of the software wouldn't hurt either ;)

   This archive may be freely distributed via any means.
* Disclaimer:
=============

   I in no way imply, either directly or indirectly, that the described
   methods or included source code(s) are the best (fastest) possible
   C2P solutions in any/some/all cases or situations. They are provided
   as purely _optional_ methods to solve the c2p bottleneck.

* What is PLLB-C2P ?
====================

   Precalc Linear Lookup Blitter - Chunky To Planar is actually a C2P
   system and not just a few assorted routines. The system (although it
   was originally designed purely for my own use) _attempts_ to let the
   programmer set up and support various bitplane depths, pixel widths
   and conversion methods with relative ease.

* What are 'Normal' 'Delta' & 'Nullskip' c2p routines ?
=======================================================

   Pllbc2p has built-in support for various types of c2p conversion.
   These methods are 'Normal', 'Delta', 'NullSkip' & 'DeltaNullSkip'.
   I've done this for one rather obvious reason: it's quite common that
   specific rendering algorithms can have their performance enhanced by
   using customized c2p solutions. Hence, I've given you as many choices
   as I possibly could. Actually I've probably gone overboard, but who
   cares. ;)

 * Normal -
   This method is the simplest type of C2P possible: each frame it
   converts the entire chunky frame buffer into planar. Normal C2P is
   probably best used when 80% -> 100% of your chunky frame buffer
   changes every frame.

 * Delta -
   Delta c2p is a little more complex. It always works on two chunky
   frame buffers, the chunky frame just rendered and the previous chunky
   frame rendered, and compares them, looking for differences. Each time
   a difference is found, it converts that group of 16 pixels into
   planar. Delta c2p can be very powerful, not to mention quite fast, in
   various situations. Personally, I'd recommend that you use it over
   'normal' c2p, or at least do some performance tests yourself.
   Those familiar with Delta C2P algos will notice that PLLB c2p allows
   for double buffering of your 'planar' image, i.e. you don't just
   render over the currently visible frame. (So no ugly frame cuts.)

   Moreover, the Pllbc2p Delta routines work in delta fields of 16
   pixels (or less, depending upon the pixel width) for improved delta
   frequency, and only shift data to ChipRAM when a difference is found.

 * NullSkip -
   A specialized C2P algorithm that scans for groups of NULL pixels
   (groups of pixel 0). Each time it locates a NULL pixel group, it
   performs a quick clear operation on the planar buffer and then
   continues on, instead of passing the null (clear) pixels through the
   c2p process. NullSkip C2P is probably best used when you can be
   fairly sure that a large part of the chunky frame buffer is going to
   be constantly blank/clear. NullSkip also clears your chunky buffer as
   it goes.

 * DeltaNullSkip -
   Just a logical extension to 'NullSkip', and NO, it's not a
   combination of the 'Delta' + 'NullSkip' methods. It's actually a
   rather simple way to reduce the chipram access in the null skip
   algorithm, which is very useful on high end CPUs like the 060, 040,
   and even 40/50mhz 030 systems.

   It works by remembering whether the present group of NULL pixels was
   cleared in the planar buffer during the last c2p sweep or not. If it
   was, then it doesn't write to chip at all; if not, it performs the
   clear and sets the delta tag for this group of pixels. Hence, this
   reduces the amount of chipram access and provides a handy speed-up in
   the process. DeltaNullSkip also clears out the chunky buffer while it
   processes it.

* I don't understand this Pixelwidth stuff ?
============================================

   Pllbc2p, like most other C2P routines, supports various pixel widths,
   including 'Single', 'Double' and even 'Quad' pixel modes.
   In Double & Quad pixel mode, pllbc2p allows you to render a half or
   quarter sized (width) chunky image, which the c2p routine will
   automatically blow up to your selected planar screen width. This can
   greatly improve both your own algorithm's performance and of course
   the c2p routine's performance.

   Pllbc2p doesn't automatically support various pixel heights, so
   you'll have to set those up yourself. I normally just use the copper
   to blur the scan lines.

* The Bizarre Method:
=====================

   The following method outline is based upon the 256 colour single
   pixel width version; other c2p modes may and do vary..

   PLLB-C2P is a 16bit algorithm that combines a processor sweep (with
   lookup array) and a Blitter resort pass. It was originally designed
   for 020/030 and ECS/AGA systems, but it's also very useful on 040 &
   060 processors.

   The basic idea behind the algorithm is the usage of large
   precalculated pixel combination tables. These tables give us the
   ability to use the chunky pixels directly as indexes into the
   precalculated conversion table(s).

   The chunky lookup table(s) contain all the possible combinations of
   two 8bit chunky pixels in their unrolled planar format (i.e. 8 bytes
   per combination, one byte per plane).

   Examples. (256 colour, pixel width of 1)  (p = plane)

                            p0  p1  p2  p3  p4  p5  p6  p7
   Chunky Input $00,$0f  =  $01,$01,$01,$01,$00,$00,$00,$00
   Chunky Input $0f,$00  =  $02,$02,$02,$02,$00,$00,$00,$00
   Chunky Input $01,$0f  =  $03,$01,$01,$01,$00,$00,$00,$00
   Chunky Input $88,$55  =  $01,$00,$01,$02,$01,$00,$01,$02
   Chunky Input $ff,$0f  =  $03,$03,$03,$03,$02,$02,$02,$02

   Obviously, with a little simple maths you'll quickly notice that this
   means the lookup array for two 8bit chunky pixels = (256*256*8) =
   512K ;) and can be as large as 2 megabytes, if available.

   Since the tables are unrolled into precalced groups of 8 bytes
   (2 longwords), the idea is to always MOVE/logical OR from these
   tables in longwords, into our temporary data registers on the CPU..
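   As a rough illustration only (this is not part of the supplied
   sources, which are 68k assembler), the table layout described above
   can be modelled in a few lines of Python. The names build_pair_table
   and lookup are made up for this sketch, and it assumes the 256 colour,
   single pixel width case, where the entry for the pixel pair A,B lives
   at offset ((A*256)+B)*8 and each of its 8 bytes carries bit p of pixel
   A in bit 1 and bit p of pixel B in bit 0.

```python
def build_pair_table():
    """Model of one 512K precalc table (256 colour, pixel width 1).

    Hypothetical helper for illustration; the real tables are built
    by the assembler sources, not by this code.
    """
    table = bytearray(256 * 256 * 8)          # 65536 pairs * 8 plane bytes
    for a in range(256):                      # first (left) chunky pixel
        for b in range(256):                  # second (right) chunky pixel
            base = ((a << 8) | b) * 8         # same "* 8" as the lsl.l #3
            for plane in range(8):
                bit_a = (a >> plane) & 1      # plane bit of pixel A
                bit_b = (b >> plane) & 1      # plane bit of pixel B
                table[base + plane] = (bit_a << 1) | bit_b
    return table


def lookup(table, a, b):
    """Return the 8 plane bytes (p0..p7) for the chunky pixel pair a,b."""
    base = ((a << 8) | b) * 8
    return list(table[base:base + 8])
```

   Calling lookup(build_pair_table(), 0x88, 0x55) gives the same eight
   bytes as the $88,$55 row of the example table, which is a handy sanity
   check if you ever rebuild the tables yourself.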
   So, as an overview of converting 16 chunky pixels into planar, we've
   got the following simple routine.

   Convert the first 8 pixels, pixels 0->7:

   ; A0 = pointer to chunky pixel buffer
   ; A5 = base pointer to the precalc comb's buffer

   ; Process 8 chunky pixels into planar

      clr.l   d1              ; Clear out d1
      move.l  (a0)+,d0        ; Move bytes ABCD from chunky buffer
      move.w  d0,d1           ; move bytes CD into d1
      clr.w   d0              ; AB-- clear low word of d0
      swap    d0              ; --AB exchange upper word for lower word

      lsl.l   #3,d0           ; mult AB * 8 (width of precalc buffer)
      move.l  (a5,d0.l),d3    ; grab planes 1,2,3,4 for pixels AB
      move.l  4(a5,d0.l),d4   ; grab planes 5,6,7,8 for pixels AB
      lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
      lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits

      lsl.l   #3,d1           ; mult CD * 8 (width of precalc buffer)
      or.l    (a5,d1.l),d3    ; or on planes 1,2,3,4 for pixels CD
      or.l    4(a5,d1.l),d4   ; or on planes 5,6,7,8 for pixels CD
      lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
      lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits

   ; (first 4 pixels are processed at this point)

      clr.l   d1              ; Clear out d1
      move.l  (a0)+,d0        ; Move pixels EFGH from chunky buffer into d0
      move.w  d0,d1           ; move pixels GH into d1
      clr.w   d0              ; EF-- clear low word of d0
      swap    d0              ; --EF exchange upper word for lower word

      lsl.l   #3,d0           ; mult EF * 8 (width of precalc buffer)
      or.l    (a5,d0.l),d3    ; or on planes 1,2,3,4 for pixels EF
      or.l    4(a5,d0.l),d4   ; or on planes 5,6,7,8 for pixels EF
      lsl.l   #2,d3           ; shift planes 1,2,3,4 up 2 bits
      lsl.l   #2,d4           ; shift planes 5,6,7,8 up 2 bits

      lsl.l   #3,d1           ; mult GH * 8 (width of precalc buffer)
      or.l    (a5,d1.l),d3    ; or on planes 1,2,3,4 for pixels GH
      or.l    4(a5,d1.l),d4   ; or on planes 5,6,7,8 for pixels GH

   Result (letter = pixel, digit = plane):

   d3 = a1b1c1d1e1f1g1h1 a2b2c2d2e2f2g2h2 a3b3c3d3e3f3g3h3 a4b4c4d4e4f4g4h4
   d4 = a5b5c5d5e5f5g5h5 a6b6c6d6e6f6g6h6 a7b7c7d7e7f7g7h7 a8b8c8d8e8f8g8h8

   Above, I mentioned that the precalc array could be as large as
   2 megabytes of FASTRAM, which is (unfortunately) true.
   The reason for this is a simple speed-up which lets us remove the two
   'lsl.l #2,d3' & 'lsl.l #2,d4' instructions required after each
   'Move/logical or' completely from the C2P loop. The idea is to create
   4 versions of the precalced array, each one already shifted into
   position. Cumbersome I know, but it can make the routine(s) much
   faster....

   Note: The Pllb-C2p system handles all the possible buffer combinations
   ===== all by itself.

   Anyway, I also stated above that this is a 16bit routine (which is
   needed for the blitter resort pass). So after we've processed the
   first 8 chunky pixels into planar, we repeat the process, but this
   time store the converted planes of pixels 8 to 15 in, say, registers
   d5,d6...

   Now, all that's needed is to merge the planar data together, then
   move it into the temporary CHIP buffer for the Blitter to
   resort/transfer to our final screen...

   ; a1 = output buffer in CHIPRAM
   ; d2 = $ff00ff00
   ; d3 = planes 1,2,3,4  pixels 0-7
   ; d4 = planes 5,6,7,8  pixels 0-7
   ; d5 = planes 1,2,3,4  pixels 8-15
   ; d6 = planes 5,6,7,8  pixels 8-15

   ; resort the planar bytes into words

      move.l  d3,d0           ; move planes 1,2,3,4 to d0 (first 8 pixels)
      move.l  d5,d1           ; move planes 1,2,3,4 to d1 (second 8 pixels)
      and.l   d2,d0           ; mask with $ff00ff00, d0 = planes 1,-,3,-
      and.l   d2,d1           ; mask with $ff00ff00, d1 = planes 1,-,3,-
                              ; (second 8 pixels)
      eor.l   d0,d3           ; mask off planes 1,-,3,- from planes 1,2,3,4
                              ; d3 = -,2,-,4
      eor.l   d1,d5           ; mask off planes 1,-,3,- from planes 1,2,3,4
                              ; d5 = -,2,-,4 (second 8 pixels)

      lsr.l   #8,d1           ; d1 = -,1,-,3 (second 8 pixels)
      or.l    d1,d0           ; d0 = 1,1,3,3
      move.l  d0,(a1)+        ; Dump out 32bits (16bits of planes 1 & 3)

      lsl.l   #8,d3           ; d3 = 2,-,4,-
      or.l    d5,d3           ; d3 = 2,2,4,4
      move.l  d3,(a1)+        ; Dump out 32bits (16bits of planes 2 & 4)

   (then repeat the process for planes 5,6,7,8, held in d4 & d6)

   One thing you've probably already noticed is that via this method
   you're able to write to CHIPRAM in longwords without a great amount of
   fuss...
   Plus, the routine(s) are small enough to fit nicely within the
   68020's small instruction cache, which helps a great deal. In
   _theory_, pllbc2p on a 28mhz 020 should be as fast as it would be on
   a 25mhz 030, but obviously various conditions could hinder this.

   Unfortunately, we still need to use the blitter to move/resort the
   mixed processed display (which is now in Chipram) from the temporary
   image buffer to its final resting place, the screen.

   To avoid having to mess with the blitter constantly, the final
   screen(s) need to be defined as normal planar and *NOT* interleaved,
   as this allows us to blit move an entire bitplane worth of pixels per
   usage of the blitter.

   Hence, a simple hardware bashing routine to move/resort a single
   bitplane from the temporary image buffer to the display screen would
   look something like this.

   Bitter_Moveresort_TempImageBitplane_to_Display:      ; (NICE LABEL ;)
      lea.l   $dff000,a6
      jsr     waitblitter                 ; wait for Blitter ;)
      move.l  #$09f00000,$40(a6)          ; setup blitter minterms and chan's
      move.l  #$ffffffff,$44(a6)          ; init masks
      move.l  #((16-2)*$10000),$64(a6)    ; setup source and dest modulos
      lea.l   BlitterTemp_ImageBuffer,a0
      move.l  Display_buffer,a1
      add.l   d0,a0                       ; add on any source plane offset
      add.l   d1,a1                       ; add on any dest plane offset
      move.l  a0,$50(a6)                  ; set blitter chan A pointer to source buf
      move.l  a1,$54(a6)                  ; set blitter chan D pointer to dest buf
      move.w  #20*256,$5c(a6)             ; height (20 words wide * 256 lines)
      move.w  #1,$5e(a6)                  ; init width (1 word) and START!
      rts

   The idea is to work out some way (it depends on what you're doing,
   really) that allows you to trigger the blitter and then just forget
   about it, which could be via, say, async blits, interrupts, the
   copper or perhaps even another task or something.

   Personally, I just found it simpler (in the Tmap engine it was
   originally designed for) to check its progress at preselected times,
   i.e. between mapping textures, objects, floors etc etc..
   which added minimal (if any, really) overhead, and seemed to work
   just fine.. ;)..... lazy, I know ;)

* What Memory Is needed ?:
==========================

   Well, this depends upon the type of game/demo or perhaps util you're
   trying to create. Since I've included Normal, Delta, Nullskip & now
   Delta Nullskip versions of each routine, your buffer requirements
   will obviously differ.

 * NORMAL C2P
   In FASTRAM, the minimum you require is to first allocate your source
   CHUNKY PIXEL buffer and then the number of precalc tables you wish to
   use, 1->4, i.e. 512k->2meg ..

   In CHIPRAM, you'll need to allocate the temporary image buffer, i.e.

      ((ScreenWidthInPixels/8) * ScreenHeight * Bitplanes)
                                  = size of temporary image buffer.

   So if you were originally only using two displays for double
   buffering, well, now you'll need an extra display for the temporary
   image buffer. (sorry)

 * DELTA C2P
   The only different requirement for using the delta versions is that
   you'll need a second CHUNKY PIXEL buffer. It should also be noted
   that you have to swap chunky pixel buffer pointers after each frame
   is rendered, just as you would normally do for double/triple or even
   quad buffering purposes.

 * NULL SKIP C2P
   The Nullskip versions have the same memory requirements as the Normal
   C2P versions...

 * DELTA NULLSKIP C2P
   Delta Nullskip has the same memory requirements as the Normal c2p
   version, except it also needs a special 'delta' buffer so it can
   remember which groups of null pixels it has cleared in the planar
   buffer. This buffer needs to be in FASTRAM and at least the size of
   your chunky buffer divided by 4, i.e.

      (ChunkyScreenWidth * ChunkyScreenHeight) / 4

   PLEASE NOTE: Neither the Normal nor the Delta versions of the routine
   CLEAR the source Chunky Frame Buffer.
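   To make the sums above concrete, here is a small budget calculator,
   written in Python purely as a worked example (it is not part of the
   supplied sources). The function name and its parameters are invented
   for this sketch. It assumes the 256 colour versions, one byte per
   chunky pixel and 512K per precalc table; the other colour depths may
   well use different table sizes. The chip figure covers the temporary
   image buffer only, not your display buffers.

```python
TABLE_BYTES = 256 * 256 * 8   # one precalc table, 256 colour case (512K)

MODES = ("normal", "delta", "nullskip", "deltanullskip")


def memory_needed(screen_w, screen_h, bitplanes=8, tables=1,
                  mode="normal", pixel_width=1):
    """Estimate FASTRAM / CHIPRAM bytes for one C2P setup.

    Hypothetical helper: follows the rules listed in the text, assuming
    one byte per chunky pixel and 512K per precalc table.
    """
    if mode not in MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    if not 1 <= tables <= 4:
        raise ValueError("tables must be 1..4")

    # Double/Quad pixel modes render a half/quarter width chunky image.
    chunky = (screen_w // pixel_width) * screen_h

    fast = chunky + tables * TABLE_BYTES
    if mode == "delta":
        fast += chunky            # second chunky buffer for comparing
    elif mode == "deltanullskip":
        fast += chunky // 4       # 'delta' tag buffer (minimum size)

    # Temporary image buffer that the blitter resorts from.
    chip = (screen_w // 8) * screen_h * bitplanes
    return {"fast": fast, "chip": chip}
```

   For a 320*256, 8 bitplane, single pixel width screen with one table,
   memory_needed(320, 256) works out to 606208 bytes of FASTRAM and
   81920 bytes of CHIPRAM; the same screen with mode="delta" needs
   688128 bytes of FASTRAM.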
* Linear Frame Buffer:
======================

   After messing with various Tmapped & Gouraud Shaded engines (just
   like everybody else), where the need for a chunky frame buffer for
   gfx is high, the one rather obvious thing I found is that it's
   _normally_ much better to save a couple of cycles per chunky pixel by
   rendering / copying across a linear buffer than to save a couple of
   cycles in the C2P loop. This is particularly true when you're
   constantly rendering a full frame buffer of pixels and pixel
   (texture/gouraud) overlap (i.e. polygons that share the same screen
   space) is high.

   Pllb-c2p is probably quite unique here: since it uses precalc tables
   to perform the actual bit shifting process, we obtain a free linear
   buffer and also achieve slightly faster C2P in the process.

* A possible C2P Hint for 'Wolf 3D' or 'Doom' texture mapped engines:
=====================================================================

   Without wanting to sound critical of *ANY* of the presently available
   Tmap engines, and having written a 'Blake Stone'-ish engine myself
   (so I know just how hard the task actually is), the one thing I
   notice is the lack of specialized C2P for when floor/ceiling texture
   mapping isn't present. Personally, I consider this the ideal time for
   either a delta'd C2P routine or a specialized NULL SKIP/CLEAR (you
   could make it use a pattern) routine.

   Using the latter method, I was able to obtain 8->9fps refresh updates
   on just an A1200/020 14mhz + fastram, with a screen size of 320*256,
   256 colour, 1x*2y pixel res, while the engine includes lightsources,
   solid & see-through texture mapped walls & 2D mapped objects.

   Also, it becomes pretty obvious that for a delta'd C2P routine to be
   all that useful while floor/ceiling mapping is present, the artist
   should probably think about making the floor & ceiling (and even
   wall, where possible) texture images a little less complex, which
   will enhance performance greatly.
   It might also be nice to offer an option for full or reduced detail
   floors & ceilings, instead of turning them off completely.

* Why supply Quad Pixel width versions ?:
=========================================

   Well, what can I say, they give you an optional 'Copper' styled
   screen resolution, but without the loss of 24bit colour. (not that it
   matters too much at this res ;)  Perhaps they can be best used in
   effects like 'Fire' / 'Water' & 'Life' effects.

* Some Future Ideas:
====================

      * Auto 8bit to 6bit colour reduce/remapping.
                                     (Still - Coming Soon ;)

      * Ham 8 with auto interpolation.
                                     (Just a crazy idea at the moment ;)

* That's it from me:
====================

   Well, hopefully the enclosed routine(s) are of some use to you, or
   perhaps they might inspire you to take this method further than I
   have. This is really only the initial working version(s) of PLLB-C2P,
   so I've *NO* doubt there's any number of possible speed-ups just
   waiting to be found.

   If you do bother to enhance any of the pllbc2p routines, well, I'd
   appreciate it greatly if you'd let me know, so I can pass this
   information on to others in future releases of pllbc2p.

   Cya,
         Kevin Picone
         Underware Design

*=----------------------------------------------------------------------=*
                              T H E   E N D
*=----------------------------------------------------------------------=*