News learned these days ~ GPU computing: Stay up to date on OpenCL, DirectCompute, CUDA, CAL and OpenGL information

First, Snow Leopard is getting OpenGL 3.0 in 10.6.3.. holy cow..
http://netkas.org/?p=362

Also, Mesa 8.0 will get OpenGL 3.x (sometime this year, I hope).. see Phoronix.

So it seems both hardware and emulated (?) OpenGL drivers are getting OpenGL 3.x (on Apple and Linux).

Second, AMD has also released a new version of GPU ShaderAnalyzer.

What's New in Version 1.53

* Support for Microsoft DirectX 11.
* Support for Catalyst™ driver 9.9-9.12.
* Support for ATI Radeon™ HD 5870 graphics cards.
* Support for ATI Radeon™ HD 5770 graphics cards.
* Fixed support for IL disassembly.
* Fixed issue with simple GLSL shaders.

Does it include support for compute shaders? Yes.

ATI Catalyst 10.1 BETA Adds New OpenGL Extensions

SmallptGPU (OpenCL) is now multi-GPU enabled:
http://davibu.interfree.it/opencl/smallptgpu2/smallptgpu-v2.0alpha2.tgz
(found on the Beyond3D forums; a GTX 295 scales near 2x)

OpenGL Extensions Viewer 3.16

3D Vision on Linux with low-end Quadros? (follow the thread)
http://70.87.46.147/vbulletin/showthread.php?t=143514

http://forum.beyond3d.com/showthread.php?t=56105

N-Queen Solver for OpenCL
I rewrote my old N-Queen solver from CUDA into OpenCL. It's not a port, because I rewrote everything from scratch (only the idea remains the same). I was hoping to use OpenCL for a comparison between AMD and NVIDIA hardware.

Unfortunately, for some reason, AMD's OpenCL compiler crashes when compiling my kernel for the GPU (it's OK for the CPU version though). So right now it doesn't work on AMD GPUs at all, but with AMD Stream SDK 2.0 it's possible to run on CPU devices.

Both source and executable are provided in the attachment.

The arguments are:

nqueen_cl -cpu -clcpu -p -thread # -platform # -local N

-cpu: use CPU implementation (single threaded)
-clcpu: use OpenCL CPU devices
-p: enable profiling
-thread #: set the number of threads. (default: max work group item number * number of devices)
-platform #: select a platform (default: select platform 0)
-local: use local memory (shared memory) for arrays. This runs faster on NVIDIA GPUs because they have no indexed registers.
N: the board size (1 ~ 32)

Note: large board sizes (> 20) take forever to run (I estimate that 20 queens would take more than 2 hours on a 9800GT).

The N-queen algorithm is a straightforward one. There are no special considerations to avoid redundant boards (i.e. most boards can be mirrored and rotated to create 8 solutions). Only a simple mirror reduction is used.
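To make the "simple mirror reduction" concrete, here is a minimal Python sketch. It is not the author's kernel: the function names and the bitmask representation are assumptions chosen for illustration, and the reduction is assumed to be the usual left-right mirror of the first row. Only the left half of the first row is searched and the result is doubled; on odd boards the middle column is counted once.

```python
def count_from(n, cols, diag1, diag2):
    """Count completions of a partial board (hypothetical helper).

    cols, diag1, diag2 are bitmasks of the squares attacked in the
    current row by column, by one diagonal and by the other diagonal.
    """
    full = (1 << n) - 1
    if cols == full:                      # one queen in every column: a solution
        return 1
    total = 0
    free = full & ~(cols | diag1 | diag2)
    while free:
        bit = free & -free                # lowest free square in this row
        free ^= bit
        total += count_from(n,
                            cols | bit,
                            ((diag1 | bit) << 1) & full,
                            (diag2 | bit) >> 1)
    return total


def count_queens(n):
    """Count all N-queen solutions using a left-right mirror reduction."""
    full = (1 << n) - 1
    total = 0
    for col in range(n // 2):             # left half of the first row only
        bit = 1 << col
        total += count_from(n, bit, (bit << 1) & full, bit >> 1)
    total *= 2                            # each solution has a mirrored twin
    if n % 2:                             # middle column is its own mirror
        bit = 1 << (n // 2)
        total += count_from(n, bit, (bit << 1) & full, bit >> 1)
    return total
```

Calling `count_queens(8)` counts the solutions for the classic 8x8 board. Rotations are not exploited here, which matches the description above of using only a mirror reduction.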

Running on my computer (Core 2 Duo E8400 3.0GHz):

16 queen -cpu: 7.27 s
16 queen -clcpu: 3.79 s

On GPU (GeForce 9800GT):

16 queen: 2.36 s
16 queen -local: 1.23 s

About the crash problem: the original kernel I developed (i.e. the kernel used by the clcpu path) doesn't crash the compiler. However, it's extremely slow (> 20 s for 16 queen on my 4850, and > 7 s on the 9800GT) because it uses four arrays to simulate a stack for recursion. These arrays are good for the CPU version because they reduce the amount of computation, but for the GPU they are too hard on registers. So I developed another kernel which uses only one array, but it requires more computation to generate/restore data for each step. However, this new kernel crashes the compiler (it works when selecting CPU devices, but it's slower). I'm using the Cat 9.12 hotfix right now.
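The idea of simulating the recursion stack with arrays can be sketched in Python as follows. This is only an illustration under assumptions: the post does not say which four arrays the kernel keeps, so the ones here (column mask, two diagonal masks, and the remaining free squares per row) are a guess, and `count_queens_iterative` is a hypothetical name. The sketch omits the mirror reduction to keep the loop short.

```python
def count_queens_iterative(n):
    """Count N-queen solutions without recursion (n >= 1).

    Four per-row arrays play the role of the call stack; which arrays
    the original kernel uses is an assumption.
    """
    full = (1 << n) - 1
    cols = [0] * n       # columns occupied before each row
    diag1 = [0] * n      # squares attacked along one diagonal
    diag2 = [0] * n      # squares attacked along the other diagonal
    free = [0] * n       # squares still untried in each row
    free[0] = full
    row = 0
    count = 0
    while row >= 0:
        if free[row] == 0:
            row -= 1                      # nothing left to try: backtrack
            continue
        bit = free[row] & -free[row]      # next candidate square
        free[row] ^= bit
        c = cols[row] | bit
        if c == full:                     # queen placed in the last row
            count += 1
            continue
        d1 = ((diag1[row] | bit) << 1) & full
        d2 = (diag2[row] | bit) >> 1
        row += 1                          # "push" the next row's state
        cols[row], diag1[row], diag2[row] = c, d1, d2
        free[row] = full & ~(c | d1 | d2)
    return count
```

Storing all four masks per row avoids recomputing them on backtrack, at the cost of more per-thread state. A single-array variant would have to rebuild the masks from the stored queen positions at each step, which is the trade-off the post describes.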
Attached files:
nqueen_cl.rar (11.4 KB)
nqueen_cl_src.rar (17.9 KB)

Rumors

NVIDIA GeForce GTX 395 Specs :

- Codename "GF104".
- Dual Core GPU Design (Two GF100 "Fermi" Cores).
- 6.4 Billion Transistors In Total (TSMC 40nm Process).
- 32 Streaming Multiprocessors (SM).
- Each SM has 2x16-wide groups of Scalar ALUs (IEEE754-2008; FP32 and FP64 FMA).
- The 32 SMs Have 1536KB Shared L2 Cache.
- 1024 Stream Processors (1-way Scalar ALUs) at 1350MHz.
- 1024 ALUs In Total.
- 1024 FP32 FMA Ops/Clock.
- 512 FP64 FMA Ops/Clock.
- Single Precision (SP; FP32) FMA Rate : 2.76 Tflops.
- Double Precision (DP; FP64) FMA Rate : 1.38 Tflops.
- 256 Texture Address Units (TA).
- 256 Texture Filtering Units (TF).
- INT8 Bilinear Texel Rate : 153.6 Gtexels/s
- FP16 Bilinear Texel Rate : 76.8 Gtexels/s
- 80 Raster Operation Units (ROPs).
- ROP Rate : 48 Gpixels/s
- 600MHz Core.
- 640bit (2x320bit) Memory Subsystem.
- 4200 MHz Memory Clock.
- 336 GB/s Memory Bandwidth.
- 2560MB (1280MB Effective) GDDR5 Memory.
- New Cooling Design.
- High Power Consumption.
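As a sanity check, the throughput figures in this rumored list can be derived from its own unit counts and clocks. The Python sketch below does the arithmetic under a few assumptions that the list does not state: one FMA counts as 2 floating-point operations, FP16 bilinear filtering runs at half the INT8 rate, and the 4200 MHz memory clock is the effective data rate. The function name `derived_rates` is made up for this example.

```python
def derived_rates():
    """Recompute the rumored throughput numbers from the listed specs."""
    shader_mhz = 1350        # shader (ALU) clock from the list
    core_mhz = 600           # core clock from the list
    mem_mhz = 4200           # assumed to be the effective data rate
    bus_bits = 640           # 2 x 320-bit memory interface

    fp32_fma_per_clock = 1024
    fp64_fma_per_clock = 512
    texture_units = 256
    rops = 80

    return {
        # FMA = multiply + add, assumed to count as 2 flops
        "fp32_tflops": fp32_fma_per_clock * 2 * shader_mhz / 1e6,
        "fp64_tflops": fp64_fma_per_clock * 2 * shader_mhz / 1e6,
        # one bilinear INT8 texel per texture unit per core clock
        "int8_gtexels": texture_units * core_mhz / 1e3,
        # FP16 filtering assumed to run at half rate
        "fp16_gtexels": texture_units * core_mhz / 2 / 1e3,
        # one pixel per ROP per core clock
        "rop_gpixels": rops * core_mhz / 1e3,
        # bytes per transfer times transfers per second
        "bandwidth_gb_s": bus_bits / 8 * mem_mhz / 1e3,
    }


if __name__ == "__main__":
    for name, value in derived_rates().items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions the results agree with the listed figures (about 2.76 and 1.38 Tflops, 153.6 and 76.8 Gtexels/s, 48 Gpixels/s, 336 GB/s). That shows the list is internally consistent, not that the rumor is true.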

GTX395 is 60% faster than GTX380
GTX395 is 70% faster than HD5970
Release Date : May 2010, Price : 499-549 USD.

Testing OpenCL on Snow Leopard on a PC:
* on VMware Workstation 7 with Fusion 3 gadgets
* Hackintosh distro: Mac OS X Snow Leopard Universal v3.6 (10.6.2) _Reup


Thursday, 14 January 2010
