Basic Data Structures
Stack
first in, last out
property- an item can be inserted and removed from a stack
- but an item can only be removed from the stack after all the items added after it are removed first
- supports three operations
- Insert (or "Push"): Putting an item into the stack
- Peek: Look at the top item of the stack (the last inserted item that's not removed)
- Remove (or "Pop"): Remove the top item of the stack
- Usefulness:
- one of the most fundamental concepts of computer programming
- recursion: utilizes a stack under the hood
- As such, Depth-First Search uses a stack, either directly or through the use of recursion
- recursion: utilizes a stack under the hood
- one of the most fundamental concepts of computer programming
Implementation
- using an (statically-sized) array and a pointer pointing to the top of the stack (usually the next unused space)
- When inserting an item
- set the value at the pointer to the item and increment the pointer by 1
- When removing an item
- decrease the pointer by 1
- no need to reset the item at the pointer because it isn't accessible by the stack
- it will be overwritten when more items are inserted
- might still want to reset it for languages with garbage collectors to prevent memory leaks
- it will be overwritten when more items are inserted
Caution
- In a real world scenario, need to be careful when doing an operation on a stack
- Removing an item from a stack with no items will cause an
underflow
error - If we have a static sized array as a stack and we try to insert an item into the stack while it's full
- it will cause an
overflow
- it will cause an
- Removing an item from a stack with no items will cause an
- Underflow errors can be prevented by checking if the stack is empty before removing an item from the stack
- as programs almost never want to pop from an empty stack
- If a program does want that, that program probably wants to do something different if the stack is empty
Overflow
errors are a bit tricky, as programs sometimes would like to add more items to the stack
- for modern programming languages, there is usually a dynamically sized array data structure and can be used as a stack
- Insertion and deletion from the list has a time complexity of O(1) (on average), just like a stack
- Since memories are dynamically assigned to the list, don't have to worry about overflowing
- because if it reaches a max size, the system will just allocate more memory for them
- e.g.: python
# Initializes the stack
stack = []
# Pushing 5 into the stack
stack.append(5)
# Look at the top item of the stack and print it
print(stack[-1])
# Removing the top item from the stack
stack.pop()
Queue
first in, first out
property- an item can be inserted and removed from the queue
- but an item can only be removed from the queue after all the items added before it are removed first
- supports three operations
- Insert (or "Push"): Putting an item into the end of the queue
- Peek: Look at the first item of the queue
- Remove (or "Pop"): Remove the first item of the queue
Implementation
- use an array and two pointers
- one pointing to the start of the queue, and one pointing at the end of the queue
- When inserting an item into the queue
- set the entry at the end pointer to the value and increase the end pointer by one
- When removing an item from the queue, increase the start pointer by one
Deque
- is a double-ended queue
- inserting and removing items from the queue can be done on both end
- can use the same implentation logic, but allow the increment and decrement of both start and end pointers
Caution
- One of the flaws of the current implementation
- when one of the queue pointers reaches the end of the array, it will cause an overflow
- However, if some elements have been removed from the other end
- then when the queue overflows, there are still a lot of unused empty spaces
- improvement that can be done on this implementation is to make the array "loop"
- When a pointer tries to move past the array, it loops around to the other end of the array instead
- This is known as a
Circular Buffer
- This is known as a
- When a pointer tries to move past the array, it loops around to the other end of the array instead
- Most modern programming languages offer a built-in deque data structure
- and they often use a dynamic array as its underlying data structure
- they can also use a double-linked list, like Python's deque class
- won't have to worry about deques overflowing because the resizing of the array is handled for you
- e.g. python
# Initialize a new deque
queue = deque()
# Add 2 to the end of the deque
queue.append(2)
# Add 4 to the front of the deque
queue.appendleft(4)
# Look at the end of the deque and print it
print(queue[-1])
# Look at the front of the deque and print it
print(queue[0])
# Remove the end of the deque
queue.pop()
# Remove the front of the deque
queue.popleft()
Hash Map
Hash Function
- is a function that can convert a data of arbitrary size to a value of a fixed size (usually 32-bit integer)
- The result of the function is called the hash value
- for a valid hash function
- two items representing the same values must return the same hash value as a result
- e.g.: two data representing the string "Smith" must return the same value in the hash function, regardless how this string is represented internally
- The function always return the same value
- as long as the size and entries of the array are the same, regardless of any other attributes such as memory address, etc.
- The converse is not necessarily true as two items with the same hash values are not necessarily equal
- e.g.
[32, 10]
has the same hash value as[18, 10, 14]
, even though they are not equal - Such cases where the hash of two different items are the same is known as a
hash collision
- e.g.
- two items representing the same values must return the same hash value as a result
- good hash function with the following attributes can help improve the efficiency of hash tables
- it is easy to calculate the hash value (the function has low time complexity)
- The chance of collision is very low
- All possible values are utilized approximately equally
Hash Tables
- when programming, we need a data structure that maps from an arbitrary data type to another arbitrary data type, like an array with non-integer index
- it is one way of representing this data structure
- idea is we have an array of fixed size as our data structure
- When we add a key to the data structure
- we first convert it down to an integer within the range of the array (using a hash function)
- and put that key in that index (where the hash value represents)
- we first convert it down to an integer within the range of the array (using a hash function)
- When looking up using that key, we hash it again and look up that index
- Since the same value always hash to the same thing, we will always find the correct item in that index
- When we add a key to the data structure
Improvements
- As the possibilities of keys increases, collision (where two different items have the same hash value) is unavoidable by the pigeonhole principle
- In that case, it is necessary to consider what happens when two keys with the same hash are used in this table
- With the current situation, the data structure will treat them as interchangeable
- which can lead to unexpected behavior
- e.g. you want to set the entry for "Anne", but doing so will change the entry for "John", which is often not what we want
- which can lead to unexpected behavior
- One method of dealing with this problem is
Separate Chaining
- Instead of storing the entry directly in the array
- we use a list of key-value pairs in each entry in the array instead
- When we set the value represented by a key in the hash table, after finding out the hash of that key, we first search for if the key exists already in the list represented by the hash value
- If so, we can update the value to a new one
- Otherwise, we can append the new key value pair into the list
- The list is usually a linked list, although other lists that can dynamically increase in size will do as well
- Searching for an item takes similar steps
- Instead of storing the entry directly in the array
- Another strategy is
Open Addressing
- use the same table to store everything, but we find the next unused space instead when adding a new key
Efficiency
- Assuming the hash function uses constant time
- the average time complexity is
O(n / k)
for each insertion/access operation on the table because of the pigeonhole principle- where n is the number of entries in the hash table
- k is the array size of the table
- in the worst case scenario, where everything in the array is hashed to the same key, the time complexity is
O(n)
- the average time complexity is
- having a good hash function is very important to the efficiency of a hash table
- Having a simple hash function ensures that each key lookup is efficient
- Having a hash function where collision is unlikely ensures that the number of entries at each index is kept at minimal
- thus improving the efficiency of each lookup
- Most modern programming languages offer a dynamic hash table: a hash table whose size dynamic changes as more items are added
- each access to the table has an average time complexity of
O(1)
- which is very close to that of accessing an array
- As such, when you need a non-integer index, you can use a hash table instead
- e.g. Python has the dict class, while Java has the HashMap class
- e.g.: python
# Initialize a new hash map
hashmap = {}
# Set the entry represented by "John" to 28
hashmap["John"] = 28
# Check if "John" is in the hash map
if "John" in hashmap:
# Print the entry represented by "John"
print(hashmap["John"])
- e.g.: python
- each access to the table has an average time complexity of