PHP associative arrays or hashes: in-depth look


Before you start reading i warn you: this article won't help you in every day PHP developing in any way. But if you are naturally interested, like me, in how things work you're welcome to spend your precious time on reading and understanding this article.
Still here? Great. Lets try to stay sane after all this low level mumbo-jumbo and start.
PHP array or in other terms hash is a combination of the data storing methods called "vector" and "linked list".
Vector is a nice aligned structure in the memory, simplest and fastest way to store and access linked pieces of the data because accessing vector offsets functionality is baked right into the modern CPU's. I'm referring to the assembler's offset calculating functionality like MOV [BASE+OFFSET], DATA.
Think of the vectors as a numerically indexed PHP array. If you need to access array member you just pass offset to it like this:
$array = array("a", "b", "c");
echo $array[2]; //prints c

Accessing vector member in C is very similar:
char vector[] = "abc"; /* or more similar to PHP {'a', 'b', 'c'} */
printf("%c", vector[2]); /* prints c */

But this simplicity have a noticeable weakness: all pieces of the data should be the same fixed size and vector itself  have a fixed size. So if you need to store or delete data dynamically (actually this is possible but vector is aligned structure and you will have gaps and waste memory) or your data pieces size isn't fixed vectors are not very suitable to you. And here comes into play other data storing method: linked list.
Linked list is designed to store linked data of the arbitrary size located anywhere in the memory.

To help you in understanding an idea behind the linked list lets assume that you need to find some person who is able to fix your rare Tissot watches. You are asking some person who is related to the watches fixing: "Are you the chosen one who is able to fix my Tissot watches?", "Nope, he answers, ask the guy who lives at this address", you are reaching the guy you were pointed to and asking him the same question and if he is able to do the job your seeking job is done or he is pointing you to other guy until you find what you need or until you reach the guy who knows nobody able to help you.
Linked lists are the same: you are starting from specific memory location and until you find something specific or you are jumping to the next address pointed by the struct member where next object is located.

#include <stdlib.h>

typedef struct repairer {
    char *name;
    int canFixTissot;
    void *nextHeKnows;
} repairer;


int main(void) {
    repairer repairer1;
    repairer1.name = "Dave";
    repairer1.canFixTissot = 0;
    repairer1.nextHeKnows = NULL;

    repairer repairer2;
    repairer2.name = "Steve";
    repairer2.canFixTissot = 0;
    repairer2.nextHeKnows = NULL;
    /* Pointer to this record from the previous record */
    repairer1.nextHeKnows = &repairer2;

    repairer repairer3;
    repairer3.name = "Andy";
    repairer3.canFixTissot = 1;
    repairer3.nextHeKnows = NULL;
    /* Pointer to this record from the previous record */
    repairer2.nextHeKnows = &repairer3;

    /* At the some point you can traverse repairers one-by-one statrting from the first */
    repairer *someRepairer = &repairer1;
    while (someRepairer != NULL) {
        if (someRepairer->canFixTissot) {
            printf("%s can fix my wathces\n", someRepairer->name);
            break;
        }
        /* Go to the next record in the linked list */
        someRepairer = someRepairer->nextHeKnows;
    };
    return EXIT_SUCCESS;
}

Zend hash is actually a combination of the both data storing methods. Speed and simplicity of the vector combined with a flexibility of the linked list.
So what is a hash role exactly? To store some data identified by the string key and retrieve it back using the same key. Yes, we can store key value data as a linked lists and traverse them all each time. But such a traversing would have a significant performance impact. Zend hash is smarter than that. Searching in the vectors is a lighting fast but it is not possible to store an arbitrary size key directly in the vector. That's why key hashing was invented. You should be familiar to the hashing functions like crc32 or md5 and most likely using them very often. Since such hash function as a crc32 produces fixed size hash (integer, 4 bytes in size) we can store them in the vector!
It is time to get hands dirty and write some old good PHP code.

<?php
$array = array(
    'test' => 1,
    'test2' => 2
);
echo $array['test' ];

You should be familiar to this simple piece of the PHP code. What Zend engine (ZE) actually does behind the scenes:
1. Allocates new zval in executor globals (global userspace variables in this case) and stores pointer to it in the (surprise!) "globals" hash table (key of the hash element is a PHP variable name)
2. Inits it as array (ZE is doing this by calling array_init macros)
3. Allocates memory for the array element "test" and stores it in "array" hash table.
4. Allocates memory for the array element "test2" and stores it in "array" hash table.
5. Searches for the "test" hash key and outputs data pointed by this key.
Lets look at process of the hash table creation in detail.
Under the series of macro lies function _zend_hash_init. Most interesting for us part of this function is a vector size determination and memory allocation for it. As i mentioned above vector has a fixed size and you should know its size on memory allocation. The problem is that ZE don't knows how many elements will be in the hash when it creates hash. That's why ZE calculates initial vector size by this formula:

while ((1U << i) < nSize) {
    i++;
}
ht->nTableSize = 1 << i;

In our case nSize is 0. On hash initialization nTableSize size will be 8 according to the formula above. ZE allocates vector with a size of 8 pointers:

ecalloc_rel(ht->nTableSize, sizeof(Bucket *))

and sets bit mask (i'll explain it's purpose later)

ht->nTableMask = ht->nTableSize - 1;

Here is a hash vector schematically:
|   |   |   |   |   |   |   |   |   |

Next ZE adds element to the hash by calling _zend_hash_add_or_update function. This function is quite a big and the most interesting part is this:

h = zend_inline_hash_func(arKey, nKeyLength);
nIndex = h & ht->nTableMask;

First ZE calculates hash of the key string using algorithm called DJBX33A, then applies bit mask to using bitwise AND. By doing this ZE shrinks string hash size unsigned long 210728875269 returned by zend_inline_hash_func to fit in the hash vector size. The result of this operation on our hash key string "test" will be 5.
Next ZE allocates memory for the "bucket" (linked list element) plus memory (+ nKeyLength) for the array key characters.

p = (Bucket *) pemalloc(sizeof(Bucket) - 1 + nKeyLength, ht->persistent);
All memory after size of the Bucket struct and to the end of memory allocated by pemalloc is used to store array key characters (thanks to michee for bringing this tricky allocation to attention).

next ZE stores pointer to hash value in the hash vector's memory by the offset of 5 (in its 6th element in other words)
ht->arBuckets[nIndex] = p;

schematically:
|   |   |   |   |   | x |   |   |   |

next ZE adds "test2" element, its hash is 6954052885527 and nIndex is 7:
|   |   |   |   |   | x |   | x |   |

However there is no guarantee that all elements that should be added to the hash will have different nIndex (even ulong DJBX33A hashes may collide). Situation when result of the nIndex calculation is same for 2 or more keys is called collision. Collision can be fixed using linked list.
Each array bucket have initial, not bitwise ANDed string key hash value and also size of the initial key, first character of the key to minimize linked list collisions and a pointer to the next element which is initially NULL (points to nothing)

typedef struct bucket {
    ulong h; /* Full, not bitwise ANDed hash of the initial key */
    uint nKeyLength; /* Initial key size */
    void *pData; /* Array element value */
    ...
    struct bucket *pNext; /* Pointer to the next bucket */
    ...
    char arKey[1]; /* First char of the initial key as a pointer to the first byte of the memory allocated for the key characters */
} Bucket;
Why char[1] and not a pointer? Size of the char[1] is 1 byte and pointer's size is 4 (32 bit system) or 8 (64 bit system) bytes. By doing this Zend engine saves 3-7 bytes per key and hashes are used everywhere in PHP so actually it saves a lot of memory!

When ZE detects collision it traverses linked list of the buckets, finds last element (which points nowhere, to the NULL) and points it to the new hash element.
To find previously stored value by its key ZE calls zend_hash_find function. First ZE needs to reproduce string key hash and nIndex values, piece of code that does the job is the same as in _zend_hash_add_or_update function:

h = zend_inline_hash_func(arKey, nKeyLength);
nIndex = h & ht->nTableMask;

next ZE accesses vector by calculated nIndex offset

p = ht->arBuckets[nIndex];

gets pointer to the linked list and traverses it, compares current string key hash value, size of the key and byte-by-byte comparison of the key with previously stored key in the bucket until all this values will be equal:

while (p != NULL) {
    if ((p->h == h) && (p->nKeyLength == nKeyLength)) {
        if (!memcmp(p->arKey, arKey, nKeyLength)) {
            *pData = p->pData; /* Data that matched the key is found */
            return SUCCESS;
        }
    }
    p = p->pNext; /* Go to the next bucket in the linked list */
}

What happens if hash size increases above 8? ZE reallocates vector to the bigger size (next size according to the formula is 16) and reindexes hash.

Zend hash is so effective that it is used very heavily all over the engine. For instance for storing userspace functions, global variables, included files and in many other parts of the Zend engine.

Comments

  1. great post! I'd like to see more of these:)
    A few questions though:

    1) why is the first char of the key stored in Bucket?

    2) why does it alloc
    p = (Bucket *) pemalloc(sizeof(Bucket) - 1 + nKeyLength, ht->persistent);

    why not just p = (Bucket *) pemalloc(sizeof(Bucket) ) ?

    3) pData i suppose it will be a pointer to a zval?

    Thanks,

    ReplyDelete
  2. and why does it do

    if (!memcmp(p->arKey, arKey, nKeyLength)) ??

    If arkey is declared as just 1 byte?
    probably considering how allocation was done above, that holds the entire key? Then why is it declared as holding just 1 byte?

    ReplyDelete
  3. Thank you for noticing. After some debugging i can confirm that Zend engine allocates memory for whole array key not only for the first char. sizeof(Bucket) - 1 + nKeyLength is a clever and tricky way to allocate memory for arbitrary key size (at the end of the allocated bytes, actually all memory after sizeof(Bucket)). Comment at the end of the Bucket struct definition warns about this trick: ...char arKey[1]; /* Must be last element */

    ReplyDelete
  4. yeah, i realized it was a trick for that...just that i don't know why they haven't used a pointer....probably to save some space.

    Anyway how did you do that debugging? I haven't coded in C in a while and i don't know some good techniques do debug this sort of stuff for core php....


    ps: i'm the guy who added you on gtalk.

    ReplyDelete
  5. Take a look at previous post how to build PHP debugging environment http://cmyker.blogspot.com/2011/11/php-extension-development-with-linux.html

    ReplyDelete
  6. Yeah, size of the char[1] is 1 byte and pointer's size is 4 (32 bit system) or 8 (64 bit system) bytes. By doing this Zend engine saves 3-7 byres per key and hashes are used everywhere in PHP so actually it saves a lot of memory!

    ReplyDelete

Post a Comment

Popular posts from this blog

Memory efficient array permutation in PHP 5.5 using generators

How to dump http request headers with PHP under CGI/FastCGI SAPI

Zend Framework 2 AJAX: return JSON response from controller action. The proper way.