Memory management on GPUs

When using the following syntax

var device = here.gpus[0];
on device var x = new MyRecord();

how does the memory management work for x? Does it have the same semantics as records, so once it goes out of scope, it gets removed?

Can I store it as a field of another record, like

record MyRecord2 {
  var y: MyRecord;
  proc init(y: MyRecord) {
    this.y = y;
  }
}
var z = new MyRecord2(x);

while keeping the data of x on the CUDA device (these would be large data structures, so duplication would be highly undesirable)? If not, how would I do this?

Thanks!

Hi Iain,

The semantics for x are indeed the same as they are for records; once the var x is no longer in scope, the record is deallocated. Unfortunately, records are assigned by value. Thus, when you write this.y = y, you perform a copy, and that copy moves the data off the GPU locale. In the simple case, where MyRecord2 just wraps MyRecord, one alternative is to place MyRecord2 on the GPU as well (on device var z). However, I suspect that your outer record is supposed to have non-GPU fields as well. Off the top of my head, I am not sure how to achieve what you want.
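
To make the by-value behaviour concrete, here is a minimal CPU-only sketch. The names Point and Wrapper are hypothetical stand-ins for MyRecord and MyRecord2, and nothing here touches a GPU; it only illustrates that initializing a record field from an argument copies the record.

```chapel
// Hypothetical records, used only to illustrate copy semantics.
record Point {
  var x: int;
}

record Wrapper {
  var p: Point;
  proc init(p: Point) {
    this.p = p;   // copy-initializes the field from the argument
  }
}

proc main() {
  var a = new Point(1);
  var w = new Wrapper(a);   // w.p is an independent copy of a
  w.p.x = 2;                // modifies only the copy
  writeln(a.x, " ", w.p.x); // prints: 1 2
}
```

The original a is unchanged after the field is modified, which is the same copy that would pull a GPU-resident record back to the locale where the outer record lives.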


Hi Daniel,

I appreciate your reply. Your assumption is correct: I intend for some fields of MyRecord2 to live on the CPU as well. In other words, I am looking for a kind of GPU memory container object.

Though, in the pull request "Add a prototype for remote variable declarations" (chapel-lang/chapel, Pull Request #25240) you mention the record _remoteVarWrapper, which is used in the desugaring of on device var ...;. Would this by chance be able to act as a CPU-side proxy for GPU storage?

Iain

Follow up: I came up with a workaround by using _remoteVarWrapper to make a GPU resource manager:

record remote {
    type eltType;
    var device: locale;
    var item: _remoteVarWrapper(eltType);
    var _parentDevice: locale; // is this necessary?

    proc init(item: ?eltType,device: locale) {
        this.eltType = eltType;
        this.device = device;
        this.item = chpl__buildRemoteWrapper(device,eltType,item);
        this._parentDevice = here;
    }

    proc init(item: ?eltType) { this.init(item,here); }

    proc init(type eltType) {
        this.eltType = eltType;
        this.device = here;
        this._parentDevice = here;
    }

    proc ref access() ref {
        // if here != this.device { try! throw new Error("Trying to access memory on wrong device!"); }
        if here != this.device {
            // Log before moving: this.to(here) updates this.device to here.
            if debug then writeln("moving " + this.device.name + " -> " + here.name);
            this.to(here);
        }
        return this.item.get();
    }

    proc ref to(device: locale) {
        if this.device == device then return;
        if here != this._parentDevice { // this may not be the best path for the data flow
            on this._parentDevice {
                this.to(device);
            }
        } else {
            this.device = device;
            this.item = chpl__buildRemoteWrapper(device,eltType,this.item.get());
        }
    }
}

Then you can treat references to data on different devices as record values, and be explicit about which device they are on.

var t: tensor(1) = [i in {0..<10}] i:real;
var rt: remote(tensor(1)) = new remote(t);
writeln(rt.access());
rt.to(device);
on device {
    rt.access().data += 1.0;
}
rt.to(here);
writeln(rt.access());

Output:

(_domain = {0..9}, data = 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0)
0 (gpu 0): gputil.chpl:2079: copy from host to device, 80 bytes, commid 275
0 (gpu 0): $CHPL_HOME/modules/internal/ChapelArray.chpl:2699: kernel launch (block size: 512x1x1)
0 (gpu 0): gputil.chpl:2079: copy from device to host, 80 bytes, commid 273
(_domain = {0..9}, data = 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0)

As shown, the data transfers only occur when .to is called. Another (cleaner) example using plain arrays:

proc remote_sin(ref rx: remote(?t)) {
  on rx.device {
    use Math;
    ref data = rx.access();
    data = sin(data);
  }
}
const data = [i in {0..<10}] i:real;
var rt = new remote(data);
writeln(rt.access());
rt.to(device);
remote_sin(rt);
rt.to(here);
writeln(rt.access());

This seems to work fine, and you can store values of remote(...) as class/record fields without copying the underlying data. But I am unsure whether this approach has any issues.

I think that by adding an init= proc, one could build a system for transferring data between GPU and host memory programmatically, without having to write each transfer explicitly with on statements.

Hi Iain —

Could you use a class to store the fields that you don't want to copy?
That would both have the potential to decouple the lifetime of the object
from its scope, and to support copies of references/pointers to the object
without copying the object itself.
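
A minimal CPU-only sketch of that idea, assuming a shared class is acceptable for the large field. Payload and Holder are hypothetical names, and the sketch does not allocate anything on a GPU; it only shows that copying the outer record copies a handle rather than the array.

```chapel
// Hypothetical types: Payload stands in for the large data,
// Holder for the outer record that should not duplicate it.
class Payload {
  var data: [0..<4] real;
}

record Holder {
  var p: shared Payload;   // reference-counted handle to the payload
}

proc main() {
  var a = new Holder(new shared Payload());
  var b = a;               // copies the shared handle, not the array
  b.p.data[0] = 42.0;      // writes through to the single Payload
  writeln(a.p.data[0]);    // a and b refer to the same Payload instance
}
```

With shared, the Payload is freed only when the last handle goes away, so its lifetime is no longer tied to the scope of any one record.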

-Brad

Hi Brad,

Yes, I just realized that my solution is incomplete. There should be a shared class in between remote and _remoteVarWrapper that protects _remoteVarWrapper from being deinitialized. My CUDA machine is no longer available, so I will have to wait to add this in.

I am not sure if this is what you meant, though.

Iain