The bottleneck is the podule bus, not the GPU's internal memory interface. I suspect the difference between the cards comes down to how efficiently each moves data over the podule bus, and how much work it takes to set up accelerated commands on the R128 versus the SM50X chipsets.
Adding a faster GPU will help about as much as attaching a shiny new SSD to the RPC's interface - you will still get <6MB/sec data transfer, even though the drive could manage nearer 300MB/sec! NB both drivers were written by the same author and have the same feature set (indeed they share a common codebase).
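To put rough numbers on the SSD analogy: the time to move data is set by the slowest link in the chain. A minimal sketch in Python, using the two figures quoted above (6MB/sec taken as the upper bound for the RPC's interface, 300MB/sec for the drive). The 100MB payload and the function name are assumptions made purely for illustration.

```python
def transfer_seconds(size_mb, *link_rates_mb_per_s):
    """Time to move size_mb through a chain of links.

    The slowest link sets the pace, so only the minimum rate matters.
    """
    return size_mb / min(link_rates_mb_per_s)

BUS_MB_S = 6.0      # upper bound quoted above for the RPC's interface
DRIVE_MB_S = 300.0  # roughly what the SSD could manage on its own
SIZE_MB = 100.0     # hypothetical payload, chosen only for illustration

through_bus = transfer_seconds(SIZE_MB, BUS_MB_S, DRIVE_MB_S)
drive_alone = transfer_seconds(SIZE_MB, DRIVE_MB_S)
faster_drive = transfer_seconds(SIZE_MB, BUS_MB_S, DRIVE_MB_S * 10)

print(f"through the bus: {through_bus:.1f}s, drive alone: {drive_alone:.2f}s")
```

Note that `faster_drive` comes out identical to `through_bus`: making the device behind the bus ten times quicker changes nothing while the bus remains the slowest link.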
Most of the tests I've seen were done by the author and use common, reproducible operations - complex Draw file rendering, circle drawing, desktop redraw time, accelerated sprite plots, etc. It would be relatively easy for someone to write a few test programs in BASIC and check this independently.
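For anyone wanting to try, the shape of such a test program is simple: repeat one operation many times and divide the elapsed time by the count. The sketch below is in Python rather than BASIC, and `placeholder_draw` is a hypothetical stand-in, not a real graphics call; the actual circle or sprite plot would go in its place.

```python
import time

def time_operation(operation, repeats=1000):
    """Return the average number of seconds per call over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / repeats

def placeholder_draw():
    # Hypothetical stand-in workload; replace with the real plotting operation.
    sum(range(100))

per_call = time_operation(placeholder_draw, repeats=1000)
print(f"{per_call * 1e6:.2f} microseconds per call")
```

Running the same script against each card, with the same screen mode and repeat count, gives figures that can be compared directly.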