Name

    NV_shader_buffer_load

Name Strings

    GL_NV_shader_buffer_load

Contact

    Jeff Bolz, NVIDIA Corporation (jbolz 'at' nvidia.com)

Contributors

    Pat Brown, NVIDIA
    Chris Dodd, NVIDIA
    Mark Kilgard, NVIDIA
    Eric Werness, NVIDIA

Status

    Complete

Version

    Last Modified Date: April 6, 2009
    Author Revision: 1.0

Number

    TBD

Dependencies

    Written based on the wording of the OpenGL 3.0 specification.

    This extension interacts with NV_gpu_program4.

Overview
At a very coarse level, GL has evolved in a way that allows
applications to replace many of the original state machine variables
with blocks of user-defined data. For example, the current vertex
state has been augmented by vertex buffer objects, fixed-function
shading state and parameters have been replaced by shaders/programs
and constant buffers, etc. Applications switch between coarse sets
of state by binding objects to the context or to other container
objects (e.g. vertex array objects) instead of manipulating state
variables of the context. In terms of the number of GL commands
required to draw an object, modern applications are orders of
magnitude more efficient than legacy applications, but this explosion
of objects bound to other objects has led to a new bottleneck -
pointer chasing and CPU L2 cache misses in the driver, and general
L2 cache pollution.
This extension provides a mechanism to read from a flat, 64-bit GPU
address space from programs/shaders, to query GPU addresses of buffer
objects at the API level, and to bind buffer objects to the context in
such a way that they can be accessed via their GPU addresses in any
shader stage.
The intent is that applications can avoid re-binding buffer objects
or updating constants between each Draw call and instead simply use
a VertexAttrib (or TexCoord, or InstanceID, or...) to "point" to the
new object's state. In this way, one of the cheapest "state" updates
(from the CPU's point of view) can be used to effect a significant
state change in the shader, much as a pointer change can on the
CPU. At the same time, this lifts the limits on how many
buffer objects can be accessed at once by shaders, and allows these
buffer object accesses to be exposed as C-style pointer dereferences
in the shading language.
As a very simple example, imagine packing a group of similar objects'
constants into a single buffer object and pointing your program
at object <i> by setting "glVertexAttribI1iEXT(attrLoc, i);"
and using a shader as such:
struct MyObjectType {
    mat4x4 modelView;
    vec4   materialPropertyX;
    // etc.
};
uniform MyObjectType *allObjects;
in int objectID; // bound to attrLoc
...
mat4x4 thisObjectsMatrix = allObjects[objectID].modelView;
// do transform, shading, etc.
This is beneficial in much the same way that texture arrays allow
choosing between similar, but independent, texture maps with a single
coordinate identifying which slice of the texture to use. It also
resembles instancing, where a lightweight change (incrementing the
instance ID) can be used to generate a different and interesting
result, but with additional flexibility over instancing because the
values are app-controlled and not a single incrementing counter.
Dependent pointer fetches are allowed, so more complex scene graph
structures can be built into buffer objects providing significant new
flexibility in the use of shaders. Another simple example, showing
something you can't do with existing functionality, is to do dependent
fetches into many buffer objects:
GenBuffers(N, dataBuffers);
GenBuffers(1, &pointerBuffer);
GLuint64EXT gpuAddrs[N];
for (i = 0; i < N; ++i) {
    BindBuffer(target, dataBuffers[i]);
    BufferData(target, size[i], myData[i], STATIC_DRAW);
    // get the address of this buffer and make it resident.
    GetBufferParameterui64vNV(target, BUFFER_GPU_ADDRESS_NV, &gpuAddrs[i]);
    MakeBufferResidentNV(target, READ_ONLY);
}
GLuint64EXT pointerBufferAddr;
BindBuffer(target, pointerBuffer);
BufferData(target, sizeof(GLuint64EXT)*N, gpuAddrs, STATIC_DRAW);
GetBufferParameterui64vNV(target, BUFFER_GPU_ADDRESS_NV, &pointerBufferAddr);
MakeBufferResidentNV(target, READ_ONLY);

// now in the shader, we can use a double indirection
vec4 **ptrToBuffers = pointerBufferAddr;
vec4 *ptrToBufferI = ptrToBuffers[i];
This allows simultaneous access to more buffers than
EXT_bindable_uniform (MAX_VERTEX_BINDABLE_UNIFORMS, etc.) and each
can be larger than MAX_BINDABLE_UNIFORM_SIZE.
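The double indirection above can be mimicked on the CPU to see the access pattern in isolation. The sketch below uses plain C arrays standing in for buffer object storage and ordinary pointers standing in for GPU addresses; no GL calls are made and all names are illustrative only.

```c
#include <assert.h>

/* Plain C analog of the "pointer buffer" pattern: three data buffers
 * and one buffer holding their addresses. */
enum { N = 3 };

static float data0[4] = { 0.f, 1.f, 2.f, 3.f };
static float data1[4] = { 10.f, 11.f, 12.f, 13.f };
static float data2[4] = { 20.f, 21.f, 22.f, 23.f };

/* The "pointer buffer": one address per data buffer. */
static float *pointerBuffer[N] = { data0, data1, data2 };

/* Equivalent of the shader's: vec4 *ptrToBufferI = ptrToBuffers[i]; */
static float load_component(int i, int component)
{
    float *ptrToBufferI = pointerBuffer[i];
    return ptrToBufferI[component];
}
```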
New Procedures and Functions
void MakeBufferResidentNV(enum target, enum access);
void MakeBufferNonResidentNV(enum target);
boolean IsBufferResidentNV(enum target);
void MakeNamedBufferResidentNV(uint buffer, enum access);
void MakeNamedBufferNonResidentNV(uint buffer);
boolean IsNamedBufferResidentNV(uint buffer);
void GetBufferParameterui64vNV(enum target, enum pname, uint64EXT *params);
void GetNamedBufferParameterui64vNV(uint buffer, enum pname, uint64EXT *params);
void GetIntegerui64vNV(enum value, uint64EXT *result);
void Uniformui64NV(int location, uint64EXT value);
void Uniformui64vNV(int location, sizei count, uint64EXT *value);
void GetUniformui64vNV(uint program, int location, uint64EXT *params);
void ProgramUniformui64NV(uint program, int location, uint64EXT value);
void ProgramUniformui64vNV(uint program, int location, sizei count, uint64EXT *value);
New Tokens

Accepted by the <pname> parameter of GetBufferParameterui64vNV and
GetNamedBufferParameterui64vNV:
BUFFER_GPU_ADDRESS_NV 0x8F1D
Returned by the <type> parameter of GetActiveUniform:
GPU_ADDRESS_NV 0x8F34
Accepted by the <value> parameter of GetIntegerui64vNV:
MAX_SHADER_BUFFER_ADDRESS_NV 0x8F35
Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)
Append to Section 2.9 (p. 45)
The data store of a buffer object may be made accessible to the GL
via shader buffer loads by calling:
void MakeBufferResidentNV(enum target, enum access);
<access> may only be READ_ONLY, but is provided for future
extensibility to indicate to the driver that the GPU may write to the
memory. <target> may be any of the buffer targets accepted by
BindBuffer.
While the buffer object is resident, it is legal to use GPU addresses
in the range [BUFFER_GPU_ADDRESS, BUFFER_GPU_ADDRESS + BUFFER_SIZE)
in any shader stage.
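The legality condition above can be stated as a simple range check. In the sketch below, `base` and `size` stand in for the values queried via BUFFER_GPU_ADDRESS_NV and BUFFER_SIZE; the helper name `load_in_range` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A load of fetchSize bytes at `addr` is legal only while the whole
 * fetch lies inside [base, base + size). */
static int load_in_range(uint64_t addr, uint32_t fetchSize,
                         uint64_t base, uint64_t size)
{
    return addr >= base && fetchSize <= size &&
           addr - base <= size - fetchSize;
}
```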
The data store of a buffer object may be made inaccessible to the GL
via shader buffer loads by calling:
void MakeBufferNonResidentNV(enum target);
A buffer is also made non-resident implicitly as a result of being
respecified via BufferData or being deleted. <target> may be any of
the buffer targets accepted by BindBuffer.
The function:
void GetBufferParameterui64vNV(enum target, enum pname, uint64EXT *params);
may be used to query the GPU address of a buffer object's data store.
This address remains valid until the buffer object is deleted or its
data store is respecified via BufferData. The address "zero"
is reserved for convenience, so no buffer object will ever have an
address of zero.
The functions:
void MakeNamedBufferResidentNV(uint buffer, enum access);
void MakeNamedBufferNonResidentNV(uint buffer);
void GetNamedBufferParameterui64vNV(uint buffer, enum pname, uint64EXT *params);
operate identically to the non-"Named" functions except that, rather
than using the currently bound buffer, they use the buffer object
identified by <buffer>.
Add to Section 2.20.3 (p. 98)
void Uniformui64NV(int location, uint64EXT value);
void Uniformui64vNV(int location, sizei count, uint64EXT *value);
The Uniformui64NV command loads a single uint64EXT value, and
Uniformui64vNV loads <count> uint64EXT values, into a uniform location
defined as a GPU_ADDRESS_NV or an array of GPU_ADDRESS_NVs.
The functions:
void ProgramUniformui64NV(uint program, int location, uint64EXT value);
void ProgramUniformui64vNV(uint program, int location, sizei count, uint64EXT *value);
operate identically to the non-"Program" commands except that, rather
than updating the program object currently in use, they update the
program object named by the <program> parameter.
Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)
Add to Section 5.4, p. 310 (Display Lists)
Edit the list of commands that are executed immediately when compiling
a display list to include MakeBufferResidentNV,
MakeBufferNonResidentNV, MakeNamedBufferResidentNV,
MakeNamedBufferNonResidentNV, GetBufferParameterui64vNV,
GetNamedBufferParameterui64vNV, IsBufferResidentNV, and
IsNamedBufferResidentNV.
Additions to Chapter 6 of the OpenGL 3.0 Specification (Querying GL State)
Add to Section 6.1.11, p. 314 (Pointer, String, and 64-bit Queries)
The command:
void GetIntegerui64vNV(enum value, uint64EXT *result);
obtains 64-bit unsigned integer state variables. Legal values of
<value> are only those that specify GetIntegerui64vNV in the state
tables in Chapter 6.
Add to Section 6.1.13, p. 332 (Buffer Object Queries)
The commands:
boolean IsBufferResidentNV(enum target);
boolean IsNamedBufferResidentNV(uint buffer);
return TRUE if the specified buffer is resident in the current
context.
Add to Section 6.1.15, p. 337 (Shader and Program Queries)
void GetUniformui64vNV(uint program, int location, uint64EXT *params);
Additions to Appendix D of the OpenGL 3.0 Specification (Shared Objects and Multiple Contexts)
Add a new section D.X (Object Use by GPU Address)
A buffer object's GPU address is valid in all contexts in the share
group to which the buffer belongs. A buffer should be made resident in
each context that will use it via its GPU address, so that the GL
knows it is in use in each command stream.
Additions to the NV_gpu_program4 specification:
Change Section 2.X.2, Program Grammar
If a program specifies the NV_shader_buffer_load program option,
the following modifications apply to the program grammar:
Append to <opModifier> list: | "F32" | "F32X2" | "F32X4" | "S8" | "S16" |
"S32" | "S32X2" | "S32X4" | "U8" | "U16" | "U32" | "U32X2" | "U32X4".
Append to <SCALARop> list: | "LOAD".
Modify Section 2.X.4, Program Execution Environment
(Add to the set of opcodes in Table X.13)
              Modifiers
Instruction   F I C S H D  Out  Inputs  Description
-----------   - - - - - -  ---  ------  --------------------------------
LOAD          X X X X - F  v    su      Global load
(Add to Table X.14, Instruction Modifiers, and to the corresponding
description following the table)
Modifier Description
-------- -----------------------------------------------
F32 Access one 32-bit floating-point value
F32X2 Access two 32-bit floating-point values
F32X4 Access four 32-bit floating-point values
S8 Access one 8-bit signed integer value
S16 Access one 16-bit signed integer value
S32 Access one 32-bit signed integer value
S32X2 Access two 32-bit signed integer values
S32X4 Access four 32-bit signed integer values
U8 Access one 8-bit unsigned integer value
U16 Access one 16-bit unsigned integer value
U32 Access one 32-bit unsigned integer value
U32X2 Access two 32-bit unsigned integer values
U32X4 Access four 32-bit unsigned integer values
For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
"S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
modifiers control how data are loaded from memory. Storage modifiers are
supported by the LOAD instruction and are covered in more detail in the
descriptions of that instruction. LOAD must specify exactly one of these
modifiers, and may not specify any of the base data type modifiers (F,U,S)
described above. The base data type of the result vector of a LOAD
instruction is trivially derived from the storage modifier.
Add New Section 2.X.4.5, Program Memory Access
Programs may load from buffer object memory via the LOAD (global load)
instruction.
Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
address to produce a four-component vector, according to the storage
modifier specified with the instruction. The storage modifier has three
parts:
- a base data type, "F", "S", or "U", specifying that the instruction
fetches floating-point, signed integer, or unsigned integer values,
respectively;
- a component size, specifying that the components fetched by the
instruction have 8, 16, or 32 bits; and
- an optional component count, where "X2" and "X4" indicate that two or
four components be fetched, and no count indicates a single component
fetch.
When the storage modifier specifies that fewer than four components should
be fetched, remaining components are filled with zeroes. When performing
a global load (LOAD), the GPU address is specified as an instruction
operand. Given a GPU address <address> and a storage modifier <modifier>,
the memory load can be described by the following code:
result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
{
    result_t_vec result = { 0, 0, 0, 0 };
    switch (modifier) {
    case F32:
        result.x = ((float32_t *)address)[0];
        break;
    case F32X2:
        result.x = ((float32_t *)address)[0];
        result.y = ((float32_t *)address)[1];
        break;
    case F32X4:
        result.x = ((float32_t *)address)[0];
        result.y = ((float32_t *)address)[1];
        result.z = ((float32_t *)address)[2];
        result.w = ((float32_t *)address)[3];
        break;
    case S8:
        result.x = ((int8_t *)address)[0];
        break;
    case S16:
        result.x = ((int16_t *)address)[0];
        break;
    case S32:
        result.x = ((int32_t *)address)[0];
        break;
    case S32X2:
        result.x = ((int32_t *)address)[0];
        result.y = ((int32_t *)address)[1];
        break;
    case S32X4:
        result.x = ((int32_t *)address)[0];
        result.y = ((int32_t *)address)[1];
        result.z = ((int32_t *)address)[2];
        result.w = ((int32_t *)address)[3];
        break;
    case U8:
        result.x = ((uint8_t *)address)[0];
        break;
    case U16:
        result.x = ((uint16_t *)address)[0];
        break;
    case U32:
        result.x = ((uint32_t *)address)[0];
        break;
    case U32X2:
        result.x = ((uint32_t *)address)[0];
        result.y = ((uint32_t *)address)[1];
        break;
    case U32X4:
        result.x = ((uint32_t *)address)[0];
        result.y = ((uint32_t *)address)[1];
        result.z = ((uint32_t *)address)[2];
        result.w = ((uint32_t *)address)[3];
        break;
    }
    return result;
}
If a global load accesses a memory address that does not correspond to a
buffer object made resident by MakeBufferResidentNV, the results of the
operation are undefined and may produce a fault resulting in application
termination.
The address used for the buffer memory loads must be aligned to the fetch
size corresponding to the storage opcode modifier. For S8 and U8, the
offset has no alignment requirements. For S16 and U16, the offset must be
a multiple of two basic machine units. For F32, S32, and U32, the offset
must be a multiple of four. For F32X2, S32X2, and U32X2, the offset must
be a multiple of eight. For F32X4, S32X4, and U32X4, the offset must be a
multiple of sixteen. If an offset is not correctly aligned, the values
returned by a buffer memory load will be undefined.
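The alignment rules above boil down to "the address must be a multiple of the fetch size". The sketch below encodes that relationship; the MOD_* enum and helper names are hypothetical, not part of the extension.

```c
#include <assert.h>
#include <stdint.h>

/* Storage modifiers of the LOAD instruction and their fetch sizes. */
typedef enum {
    MOD_F32, MOD_F32X2, MOD_F32X4,
    MOD_S8, MOD_S16, MOD_S32, MOD_S32X2, MOD_S32X4,
    MOD_U8, MOD_U16, MOD_U32, MOD_U32X2, MOD_U32X4
} StorageModifier;

static unsigned fetch_size(StorageModifier m)
{
    switch (m) {
    case MOD_S8:  case MOD_U8:                      return 1;
    case MOD_S16: case MOD_U16:                     return 2;
    case MOD_F32: case MOD_S32: case MOD_U32:       return 4;
    case MOD_F32X2: case MOD_S32X2: case MOD_U32X2: return 8;
    default: /* F32X4, S32X4, U32X4 */              return 16;
    }
}

/* S8/U8 trivially satisfy this, since every address is a multiple of 1. */
static int is_aligned(uint64_t addr, StorageModifier m)
{
    return addr % fetch_size(m) == 0;
}
```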
Modify Section 2.X.6, Program Options
+ Shader Buffer Load Support (NV_shader_buffer_load)
If a program specifies the "NV_shader_buffer_load" option, it may use the
LOAD instruction to load data from a resident buffer object given a GPU
address.
Section 2.X.8.Z, LOAD: Global Load
The LOAD instruction generates a result vector by reading an address from
the single unsigned integer scalar operand and fetching data from buffer
object memory, as described in Section 2.X.4.5.
address = ScalarLoad(op0);
result = BufferMemoryLoad(address, storageModifier);
LOAD supports no base data type modifiers, but requires exactly one
storage modifier. The base data type of the result vector is derived from
the storage modifier. The single scalar operand is always interpreted as
an unsigned integer.
The range of GPU addresses supported by the LOAD instruction may be
subject to an implementation-dependent limit. If any component fetched by
the LOAD instruction corresponds to memory with an address larger than the
value of MAX_SHADER_BUFFER_ADDRESS_NV, the value fetched for that
component will be undefined.
Modifications to The OpenGL Shading Language Specification, Version 1.30 |
In section 3.6, p.14 (Keywords)
Add intptr_t and uintptr_t to the list of reserved keywords.
In section 4.1, p.19 (Basic Types)
Add to the table of transparent types:
intptr_t | a signed integer the same precision as a pointer
uintptr_t | an unsigned integer the same precision as a pointer
Replace the paragraph "There are no pointer types" with:
Pointers to any of the transparent types, user-defined structs, or
pointer types are supported.
Modify the first paragraph of Section 4.1.3:
Signed and unsigned integer variables are fully supported. In this
document, the term integer is meant to generally include both signed
and unsigned integers. Unsigned integers have at least 32 bits of
precision, and uint, uvec2, uvec3, and uvec4 have exactly 32 bits of
precision per component. Signed integers have at least 32 bits,
including a sign bit, in two's complement form. Operations resulting in
overflow or underflow will not cause any exception, nor will they
saturate, rather they will wrap to yield the low-order N bits of the
result. intptr_t and uintptr_t variables have the same number of bits
of precision as the native size of a pointer in the underlying
implementation.
Add a new section 4.1.11 (Pointers)
Pointers are 64-bit values that represent the address of some "global"
memory (i.e. not local to this invocation of a shader). Pointers to
any of the transparent types, user-defined structs, or pointer types
are supported. Pointers are dereferenced with the operators (*), (->),
and ([]). The binary operator add (+) is supported on pointer types
(pointer+integer or vice versa) and the value is computed as in the C
language. There is no mechanism to assign a pointer to the address of
a local variable or array, nor is there a mechanism to allocate or
free memory from within a shader. There are no function pointers.
Pointer dereferences into structures use the following rules to
compute the offset and size of members in the structure:
- Members of type "bool" are stored as 32-bit integer values where all
non-zero values correspond to true, and zero corresponds to false.
- Members of type "int" are stored as 32-bit two's complement integers.
- Members of type "uint" are stored as 32-bit integers.
- Members of type "float" are stored as 32-bit IEEE 754 single-
precision floating-point values.
- Vectors with <N> elements with basic data types of "bool", "int",
"uint", or "float" are stored as <N> values in consecutive memory
locations, with components stored in order with the first (X)
component at the lowest offset.
- Column-major matrices with <C> columns and <R> rows (using the type
"mat<C>x<R>", or simply "mat<C>" if <C>==<R>) are treated as an
array of <C> floating-point vectors, each consisting of <R>
components. Columns will be stored in order, with column zero at
the lowest offset.
- Arrays are stored in memory by element order, with array member zero
at the lowest offset. The difference in offset between each pair of
elements in the array is constant.
- Arrays of matrices with <C> columns and <R> rows, with <N> array
elements are treated as an array of <N>*<C> column vectors. The first
<C> elements of the array in memory will be the columns of the first
matrix of the array.
- Members of a structure are stored in monotonically increasing
offsets based on their location in the declaration. All variables
present in a structure will have space allocated, even if they are
unused in the shader.
The first member declared in a struct is always stored at an offset of
zero relative to the start of the struct. The byte offset of each
subsequent member is determined by taking the offset of the byte
following the last byte used to store the previous member and
optionally rounding up according to the following rules:
(1) The offset of a scalar or vector will be rounded up to a multiple
of the size of the scalar or vector, respectively, except 3-vectors
are aligned as if they were 4-vectors. Matrices will be aligned by
applying the above rules of treating matrices as arrays of vectors
and applying this rule and rule (5) recursively.
(2) The offset of the first element of a structure is aligned to the
least common multiple of the sizes of all scalar and vector types
in the structure (where 3-vectors are treated as 4-vector size).
(3) The offset of all other structure members will be derived by
taking the offset of the byte following the last byte used to
store the previous member, and rounding up only if required by any
of the other rules described here.
(4) The size of a structure is padded to the least common multiple of
the sizes of all scalar and vector types in the structure (where
3-vectors are treated like 4-vectors), and the padding is applied
at the end of the structure.
(5) The offset of the first element of an array is aligned as if it
were not in an array. That is, no special alignment is required
for arrays.
(6) The elements of an array will be tightly packed according to rules
(1) and (2). Note that structures with sizes padded according to
rule (4) will naturally be tightly packed under rule (2).
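Rules (2) and (4) hinge on a least-common-multiple computation over member sizes (3-vectors counted as 4-vectors). A minimal sketch, with `gcd_u`/`lcm_u` as hypothetical helper names:

```c
#include <assert.h>

/* Greatest common divisor via Euclid's algorithm. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Least common multiple: the base alignment of a structure is the
 * LCM of the sizes of all scalar and vector types it contains. */
static unsigned lcm_u(unsigned a, unsigned b)
{
    return a / gcd_u(a, b) * b;
}
```

For example, a struct containing an int (4 bytes) and a bvec2 (8 bytes) has its first member aligned to lcm(4, 8) = 8.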
In section 5.1, p.37 (Operators)
Add to Precedence 2: "field access from a pointer" "->"
Add to Precedence 3: "dereference" "*"
Add a new precedence between 3 and 4 and shift the lower precedences
down: "cast" "()"
Change the text to say "There is no address-of operator."
Add a bullet point to the end of section 5.9, p.48 (Expressions)
The pointer operators dereference (*) and field access from a
pointer (->). Dereferenced pointers are not l-values and cannot
be assigned to. Field access from a pointer (p->field) is equivalent
to operator dereference ((*p).field) for all pointer types, including
operations on builtin types such as vec4s (e.g. p->xyxy is a legal
swizzle for a vec4 *p).
Add to section 8.3, p.70 (Common Functions)
Syntax Description
-------------------------- -----------------------------------
void *packPtr(uvec2 a) (void *)(((uintptr_t)a.y << 16 << 16) + a.x)
uvec2 unpackPtr(void *a) uvec2((uintptr_t)a & 0xffffffff, (uintptr_t)a >> 16 >> 16);
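A CPU-side mirror of packPtr/unpackPtr may help clarify the bit layout. The GLSL versions shift by 16 twice because shifting a 32-bit uintptr_t by 32 would be undefined; with a true 64-bit type a single shift suffices. The function names below are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 64-bit GPU address from its low and high 32-bit words. */
static uint64_t pack_ptr(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a 64-bit GPU address into two 32-bit words. */
static void unpack_ptr(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(addr & 0xffffffffu);
    *hi = (uint32_t)(addr >> 32);
}
```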
In section 9, p.91 (Shading Language Grammar)
Change the sentence:
// Grammar Note: No '*' or '&' unary ops. Pointers are not supported.
to
// Grammar Note: No '&' unary op.
Additions to the AGL/EGL/GLX/WGL Specifications
None
Errors

INVALID_ENUM is generated by MakeBufferResidentNV if <access> is not
READ_ONLY.
INVALID_ENUM is generated by GetBufferParameterui64vNV if <pname> is
not BUFFER_GPU_ADDRESS_NV.
INVALID_OPERATION is generated by MakeNamedBufferResidentNV,
MakeNamedBufferNonResidentNV, IsNamedBufferResidentNV, and
GetNamedBufferParameterui64vNV if <buffer> is not an in-use buffer
name.
Examples
(1) Layout of a complex structure using the rules from Section 4.1.11
of the GLSL spec:
struct Example {
                    // bytes used          rules
    float a;        //   0-3
    vec2  b;        //   8-15              1    // bumped to a multiple of 8
    vec3  c;        //  16-27              1
    struct {
        int   d;    //  32-35              2    // bumped to a multiple of 8 (bvec2)
        bvec2 e;    //  40-47              1
    } f;
    float g;        //  48-51
    float h[2];     //  52-55  (h[0])      5    // multiple of 4 (float) with no additional padding
                    //  56-59  (h[1])      6    // tightly packed
    mat2x3 i;       //  64-75  (i[0])
                    //  80-91  (i[1])      6    // bumped to a multiple of 16 (vec3)
    struct {
        uvec3 j;    //  96-107 (m[0].j)
        vec2  k;    // 112-119 (m[0].k)    1    // bumped to a multiple of 8 (vec2)
        float l[2]; // 120-123 (m[0].l[0]) 1,5  // simply float aligned
                    // 124-127 (m[0].l[1]) 6    // tightly packed
                    // 128-139 (m[1].j)
                    // 144-151 (m[1].k)
                    // 152-155 (m[1].l[0])
                    // 156-159 (m[1].l[1])
    } m[2];
};
// sizeof(Example) == 160
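These offsets follow mechanically from rounding each member's offset up per rules (1)-(4). A minimal sketch: `align_up` is a hypothetical helper, and the alignment arguments below come from applying the rules to struct Example (8 for vec2, 16 for vec3/uvec3 treated as 4-vectors, LCM-derived values for the nested structs).

```c
#include <assert.h>

/* Round `offset` up to the next multiple of `alignment`; this is the
 * rounding step shared by rules (1)-(4) of Section 4.1.11. */
static unsigned align_up(unsigned offset, unsigned alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}
```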
(2) Replacing bindable_uniform with an array of pointers:
#version 120
#extension GL_NV_shader_buffer_load : require
#extension GL_EXT_bindable_uniform : require
in vec4 **ptr;
in uvec2 whichbuf;
void main() {
    gl_FrontColor = ptr[whichbuf.x][whichbuf.y];
    gl_Position = ftransform();
}
In the GL code, assuming the buffer object setup in the Overview:
glBindAttribLocation(program, 8, "ptr");
glBindAttribLocation(program, 9, "whichbuf");
glLinkProgram(program);

glBegin(...);
glVertexAttribI2iEXT(8, (unsigned int)pointerBufferAddr,
                     (unsigned int)(pointerBufferAddr >> 32));
for (i = ...) {
    for (j = ...) {
        glVertexAttribI2iEXT(9, i, j);
        glVertex3f(...);
    }
}
glEnd();
Update Table 6.11, p. 349 (Buffer Object State)
Get Value              Type  Get Command                Initial Value  Sec  Attribute
---------------------  ----  -------------------------  -------------  ---  ---------
BUFFER_GPU_ADDRESS_NV  Z64+  GetBufferParameterui64vNV  0              2.9  none
Update Table 6.46, p. 384 (Implementation Dependent Values)
Get Value                     Type  Get Command        Minimum Value  Sec    Attribute
----------------------------  ----  -----------------  -------------  -----  ---------
MAX_SHADER_BUFFER_ADDRESS_NV  Z64+  GetIntegerui64vNV  0xFFFFFFFF     2.X.2  none
Dependencies on NV_gpu_program4:
This extension is generally written against the NV_gpu_program4
wording, program grammar, etc., but doesn't have specific
dependencies on its functionality.
Issues

1) Only buffer objects?
RESOLVED: YES, for now. Buffer objects are unformatted memory and
easily mapped to a "pointer"-style shading language.
2) Should we allow writes?
RESOLVED: NO, deferred to a later extension. Writes involve
specifying many kinds of synchronization primitives. Writes are also
a "side effect" which makes program execution "observable" in cases
where it may not have otherwise been (e.g. early-Z can kill fragments
before shading, or a post-transform cache may prevent vertex program
execution).
3) What happens if an invalid pointer is fetched?
UNRESOLVED: Unpredictable results, including program termination?
Make the driver trap the error and report it (still unpredictable
results, but no program termination)? My preference would be to
at least report the faulting address (roughly), whether it was
a read or a write, and which shader stage faulted. I'd like to not
terminate the program, but the app has to assume all their data
stored in the GL is lost.
4) What should this extension be named?
RESOLVED: NV_shader_buffer_load. Rather than trying to choose an
overly-general name and naming future extensions "GL_XXX2", let's
name this according to the specific functionality it provides.
5) What are the performance characteristics of buffer loads?
RESOLVED: Likely somewhere between uniforms and texture fetches,
but totally implementation-dependent. Uniforms still serve a purpose
for "program locals". Buffer loads may have different caching
behavior than either uniforms or texture fetches, but the expectation
is that they will be cached reads of memory and all the common sense
guidelines to try to maintain locality of reference apply.
6) What does MakeBufferResidentNV do? Why not just have a
MapBufferGPUNV?
RESOLVED: Reserving virtual address space only requires knowing the
size of the data store, so an explicit MapBufferGPU call isn't
necessary. If all GPUs supported demand paging, a GPU address might
be sufficient, but without that assumption MakeBufferResidentNV serves
as a hint to the driver that it needs to page lock memory, download
the buffer contents into GPU-accessible memory, or other similar
preparation. MapBufferGPU would also imply that a different address
may be returned each time it is mapped, which could be cumbersome
for the application to handle.
7) Is it an error to render while any resident buffer is mapped?
RESOLVED: No. As the number of attachment points in the context grows,
even the existing error check is falling out of favor.
8) Does MapBuffer stall on pending use of a resident buffer?
RESOLVED: No. The existing language is:
"If the GL is able to map the buffer object’s data store into the
client’s address space, MapBuffer returns the pointer value to
the data store once all pending operations on that buffer have
completed."
However, since the implementation has no information about how the
buffer is used, "all pending operations" amounts to a Finish. In
terms of sharing across contexts/threads, ARB_vertex_buffer_object
says:
"How is synchronization enforced when buffer objects are shared by
multiple OpenGL contexts?
RESOLVED: It is generally the clients' responsibility to
synchronize modifications made to shared buffer objects."
So we shouldn't dictate any additional shared object synchronization.
So the best we could do is a Finish, but it's not clear that this
accomplishes anything for the application since they can just as
easily call Finish. Or if they don't want synchronization, they can
use MAP_UNSYNCHRONIZED_BIT. It seems the resolution to this is
inconsequential as GL already provides the tools to achieve either
behavior. Hence, don't bother stalling.
However, if a buffer was previously resident and has since been made
non-resident, the implementation should enforce the stalling
behavior for those pending operations from before it was made non-
resident.
9) Given issue (8), what are some effective ways to load data into
a buffer that is resident?
RESOLVED: There are several possibilities:
- BufferSubData.
- The application may track using Fences which parts of the buffer
are actually in use and update them with CPU writes using
MAP_UNSYNCHRONIZED_BIT. This is potentially error-prone, as
described in ARB_copy_buffer.
- CopyBufferSubData. ARB_copy_buffer describes a simple usage example
for a single-threaded application. Since this extension is targeted
at reducing the CPU bottleneck in the rendering thread, offloading
some of the work to other threads may be useful.
Example with a single Loading thread and Rendering thread:
Loading thread:

    while (1) {
        WaitForEvent(something to do);
        NamedBufferData(tempBuffer, updateSize, NULL, STREAM_DRAW);
        ptr = MapNamedBuffer(tempBuffer, WRITE_ONLY);
        // fill ptr
        UnmapNamedBuffer(tempBuffer);
        // The buffer could have been filled via BufferData, if
        // that's more natural.
        // Send tempBuffer name to Rendering thread.
    }

Rendering thread:

    foreach (obj in scene) {
        if (obj has changed) {
            // get tempBuffer name from Loading thread
            NamedCopyBufferSubData(tempBuffer, objBuf, objOffset, updateSize);
        }
        Draw(obj);
    }
If we further desire to offload the data transfer to another
thread, and the implementation supports concurrent data transfers
in one context/thread while rendering in another context/thread,
this may also be accomplished thusly:
Loading thread:

    while (1) {
        WaitForEvent(something to do);
        NamedBufferData(sysBuffer, updateSize, NULL, STREAM_DRAW);
        ptr = MapNamedBuffer(sysBuffer, WRITE_ONLY);
        // fill ptr
        UnmapNamedBuffer(sysBuffer);
        NamedBufferData(vidBuffer, updateSize, NULL, STREAM_COPY);
        // This is a sysmem->vidmem blit.
        NamedCopyBufferSubData(sysBuffer, vidBuffer, 0, updateSize);
        SetFence(fenceId, ALL_COMPLETED);
        // Send vidBuffer name and fenceId to Rendering thread.
        // This could have been a BufferSubData directly into
        // vidBuffer, if that's more natural.
    }

Rendering thread:

    foreach (obj in scene) {
        if (obj has changed) {
            // get vidBuffer name and fenceId from Loading thread
            // note: there aren't any sharable fences currently, so we
            // actually need to ask the loading thread when it has
            // finished.
            FinishFence(fenceId);
            // This is hopefully a fast vidmem->vidmem blit.
            NamedCopyBufferSubData(vidBuffer, objBuffer, objOffset, updateSize);
        }
        Draw(obj);
    }
In both of these examples, the point at which the data is written to
the resident buffer's data store is clearly specified in order
with rendering commands. This resolves a whole class of
synchronization bugs (Write After Read hazard) that
MAP_UNSYNCHRONIZED_BIT is prone to.
10) What happens if BufferData is called on a buffer that is resident?
RESOLVED: BufferData is specified to "delete the existing data store",
so the GPU address of that data should become invalid. The buffer is
therefore made non-resident in the current context.
11) Should residency be a property of the buffer object, or should
a buffer be "made resident to a context"?
RESOLVED: Made resident to a context. If a shared buffer is used in
two threads/contexts, it may be difficult for the application to know
when the residency state actually changes on the shared object
particularly if there is a large latency between commands being
submitted on the client and processed on the server. Allowing the
buffer to be made resident to each context individually allows the
state to be reliably toggled in-order in each command stream. This
also allows MakeBufferNonResident to serve as indication to the GL
that the buffer is no longer in use in each command stream.
This leads to an unfortunate orphaning issue. For example, if the
buffer is resident in context A and then deleted in context B, how
can the app make it non-resident in context A? Given the name-based
object model, it is impossible. It would be complex from an
implementation point of view for DeleteBuffers (or BufferData) to
either make it non-resident or throw an error if it is resident in
some other context.
An ideal solution would be a (separate) extension that allows the
application to increment the refcount on the object and to decrement
the refcount without necessarily deleting the object's name. Until
such an extension exists, the unsatisfying proposed resolution is that
a buffer can be "stuck" resident until the context is deleted. Note
that DeleteBuffers should make the buffer non-resident in the context
that does the delete, so this problem only applies to rare multi-
context corner cases.
12) Is there any value in requiring an "immutable structure" bit of
state to be set in order to query the address?
RESOLVED: NO. Given that the BufferData behavior is fairly
straightforward to specify and implement, it's not clear that this
would be useful.
13) What should the program syntax look like?
RESOLVED: Support 1-, 2-, 4-vec fetches of float/int/uint types, as
well as 8- and 16-bit int/uint fetches via a new LOAD instruction
with a slew of suffixes. Handling 8- and 16-bit sizes will be useful for
high-level languages compiling to the assembly. Addresses are required
to be a multiple of the size of the data, as some implementations may
require this.
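For illustration, the suffix-based forms look roughly like the
following sketch (the exact opcode suffixes, operand rules, and
grammar are defined in the body of this specification; the registers
and addresses here are placeholders):

```
LOAD.F32X4 R0, R1.x;    # fetch four floats from the address in R1.x
LOAD.U8    R2.x, R1.y;  # fetch a single unsigned byte
```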
Other options include a more x86-style pointer dereference
("MOV R0, DWORD PTR[R1];") or a complement to program.local
("MOV R0, program.global[R1];") but neither of these provide the
simple granularity of the explicit type suffixes, and a new
instruction is convenient in terms of implementation and not muddling
the clean definition of MOV.
14) How does the GL know to invalidate caches when data has changed?
RESOLVED: Any entry points that can write to buffer objects should
trigger the necessary invalidation. A new entry point may only be
necessary once there is a way to write to a buffer by GPU address.
15) Does this extension require 64bit register/operation support in
programs and shaders?
RESOLVED: NO. At the API level, GPU addresses are always 64bit values
and when they are stored in uniforms, attribs, parameters, etc. they
should always be stored at full precision. However, if programs and
shaders don't support 64bit registers/operations via another
programmability extension, then they will need to use only 32 bits.
On such implementations, the usable address space is therefore limited
to 4GB. Such a limit should be reflected in the value of
MAX_SHADER_BUFFER_ADDRESS_NV.
It is expected that GLSL shaders will be compiled in such a way as to
generate 64bit pointers on implementations that support it and 32bit
pointers on implementations that don't. So GLSL shaders written against
a 32bit implementation can be expected to be forward-compatible when
run against a 64bit implementation. (u)intptr_t types are provided to
ease this compatibility.
Built-in functions are provided to convert pointers to and from a pair
of integers. These can be used to pass pointers as two components of a
generic attrib, to construct a pointer from an RG32UI texture fetch,
or to write a pointer to a fragment shader output.
16) What assumption can applications make about the alignment of
addresses returned by GetBufferParameterui64vNV?
RESOLVED: All buffers will begin at an address that is a multiple of
16 bytes.
17) How can the application guarantee that the layout of a structure
on the CPU matches the layout used by the GLSL compiler?
RESOLVED: Provide a standard set of packing rules designed around
naturally aligning simple types. This spec will define pointer fetches
in GLSL to use these rules, but does not explicitly guarantee that
other extensions (like EXT_bindable_uniform) will use the same packing
rules for their bufferobject fetches. These packing rules are
different from the ARB_uniform_buffer_object rules - in particular,
these rules do not require vec4 padding of the array stride.
18) Is the address space per-context, per-share-group, or global?
RESOLVED: It is per-share-group. Using addresses from one share group
in another share group will cause undefined results.
19) Is there risk of using invalid pointers for "killed" fragments,
fragments that don't take a certain branch of an "if" block, or
fragments whose shader is conceptually never executed due to pixel
ownership, stipple, etc.?
RESOLVED: NO. OpenGL implementations sometimes run fragment programs
on "helper" pixels that have no coverage, or continue to run fragment
programs on killed pixels in order to be able to compute sane partial
derivatives for fragment program instructions (DDX, DDY) or automatic
level-of-detail calculations for texturing. In this approach,
derivatives are approximated by computing the difference in a quantity
computed for a given fragment at (x,y) and a fragment at a neighboring
pixel. When a fragment program is executed on a "helper" pixel or
killed pixel, global loads may not be executed in order to prevent
spurious faults. Helper pixels aren't explicitly mentioned in the spec
body; instead, partial derivatives are obtained by magic.
If a fragment program contains a KIL instruction, compilers may not
reorder code such that a LOAD instruction is executed before a KIL
instruction that logically precedes it in flow control. Once a
fragment is killed, subsequent loads should never be executed if they
could cause any observable side effects.
As a result, if a shader uses instructions that explicitly or
implicitly do LOD calculations dependent on the result of a global
load, those instructions will have undefined results.
Rev.    Date      Author    Changes
----    --------  --------  -----------------------------------------
  1               jbolz     Internal revisions.