2021-11-13 | UNLOCK

Memory Analysis About List and Array in Python

0. Pre-knowledge

In Python, the list () is a dynamic array of pointers, and the array object provided by the Array Module is a dynamic array that holds the same type of values. Since the array directly stores the value, it uses less memory than the list. Both lists and arrays are dynamic arrays, so when new elements are added to them, and there is no space to save new elements, they will automatically reallocate the memory block and copy the values in the original memory to the new memory block.

In order to reduce the number of re-allocation of memory, usually every time it is re-allocated, the size is k times the original size. The larger the value of k, the fewer times the memory is re-allocated, but the more space is wasted. This section observes the memory allocation patterns of lists and arrays through a series of experiments.

list storage structure

After list() is instantiated, the structure is divided into 3 parts,

-variable name:
list object (structured data + pointer array)

-list content:
where id represents the position of the list object,

“v”refers to the name of the variable, v[:] refers to the list object, this rule is also true for other sequence structures in python, the following examples can be used as evidence,

When a=b, a and b point to the same list object

When a=b[:], the list object of a and the list object of b point to the same list content

In addition, [0] and [:1] are also different:

1
2
3
4
5
1. In [30]: a[0]
2. Out[30]: 1
3.
4. In [31]: a[:1]
5. Out[31]: [1]

Empty list memory space

1
2
1. In [32]: sys.getsizeof([])
2. Out[32]: 64

1. Calculate the growth mode of the list by getsizeof()

Step1:

sys.getsizeof() can get the memory size occupied by the list. Please write a program to calculate a list with a length of 10000, and each subscript of it saves the size of the list when it grows to this subscript:

1
2
3
4
5
6
7
8
import sys
size = []
for i in range(10000):
size.append(sys.getsizeof(size))

import pylab as pl
pl.plot(size, lw=2, c='b')
pl.show()

Step2:

Calculate the resize_pos array representing the memory size of the list each time memory is allocated:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import numpy as np

size = []
for i in range(10000):
size.append(sys.getsizeof(size))
size = np.array(size)
new_size = np.diff(size)

resize_pos = size[np.where(new_size)]
//resize_pos = size[np.nonzero(new_size)]

pl.plot(resize_pos, lw=2)
pl.show()
print ("list increase rate:")
tmp = resize_pos[25:].astype(np.float)
//In order to calculate the growth rate, you only need to calculate the average of
//the quotient of the two values before and after the resize_pos array.

print (np.average(tmp[1:]/tmp[:-1]))
//In order to improve the accuracy, we only calculate the average value of the second half.
//Note that the astype() method needs to be used to convert the integer array to the floating point array.
//The output of the program is as follows:

Step3:

We can use scipy.optimize.curve_fit() to fit the resize_pos array, and the fitting function is an exponential function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
from scipy.optimize import curve_fit

def func(x, a, b, c, d):
return a * np.exp(b * x + c) + d
xdata = range(len(resize_pos))
ydata = resize_pos
popt, pcov = curve_fit(func, xdata, ydata)

y = [func(i, *popt) for i in xdata]
pl.plot(xdata, y, lw=1, c='r')
pl.plot(xdata, ydata, lw=1, c='b')
pl.show()
print ("list increase rate by curve_fit:")
print (10**popt[1])

Question 1:

Are the element storage addresses consecutive?

First, test the memory address of the content stored in the list object.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
In [2]: id(a)
Out[2]: 139717112576840

In [3]: for i in a:
...: print(id(i))
...:
139717238769920
139717238769952
139717238769984
139717239834192
139717240077480
139717240523888
139717195281104
139717112078024

In [4]: for i in a[6]:
...: print(id(i))
...:
139717240220952
139717240202048

In [5]: for i in a[7]:
...: print(id(i))
...:
139717238770016
139717238770048

Then look at the relative address,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
In [6]: for i in a:
...: print(id(i)-139717238769920)
...:
0
32
64
1064272
1307560
1753968
-43488816
-126691896

In [7]: for i in a[6]:
...: print(id(i)-139717238769920)
...:
1451032
1432128

In [8]: for i in a[7]:
...: print(id(i)-139717238769920)
...:
96
128

It can be seen that for the list object, the element content is not necessarily stored linearly, but due to memory allocation problems, the illusion of linear storage will appear. When the element appears in a container or changes relative to the previous element type, the memory space will no longer be continuous.

Question 2:

Are the address of the list object and the address of the element consecutive?

In fact, Q1 has already answered this question. After all, the element address itself is not continuous, but we still tested it.

1
2
In [22]: id(a[0])-id(a)
Out[22]: 126193080

The difference is far, and we can see from the analysis of the source code that the body of the list object is an array of pointers, that is, the position of id(a) is an array of pointers to the position of the element, and of course there are auxiliary object header information and the like.

Question 3:

Analysis of the memory usage of list objects (without elements)

1
2
3
4
5
6
7
8
In [16]: sys.getsizeof([1,2,3,'a','b','c','de'])
Out[16]: 120

In [17]: sys.getsizeof([1,2,3,'a','b','c'])
Out[17]: 112

In [18]: sys.getsizeof([1,2,3,'a','b'])
Out[18]: 104

It can be seen that each object of the list occupies 8 bytes and 32 bits of space. Let’s look at the slice,

1
2
3
4
5
6
7
8
9
10
11
In [20]: sys.getsizeof(a[:3])
Out[20]: 88

In [21]: sys.getsizeof(a[:4])
Out[21]: 96

In [23]: sys.getsizeof(a[3:4])
Out[23]: 72

In [24]: sys.getsizeof(a[3:5])
Out[24]: 80

The slice object also occupies 8 bytes for each element, but the slice is also a list object, even if it is cut from the middle (without cutting the head), it will also contain the storage occupation of the header information.

2. Estimate the array memory allocation by computing run time