[TOC]

online reading

Preliminaries

Data Manipulation

data manipulation vs. data preprocessing

Data Preprocessing: A comprehensive process that includes data manipulation to prepare raw data for analysis and modeling.

Data Manipulation: A subset of preprocessing tasks focused on transforming and organizing data.

data preprocessing includes data manipulation

Data Manipulation

Definition: Data manipulation refers to the process of changing data to make it more organized and easier to analyze. This includes various operations to transform the data.

Common Tasks (a short pandas sketch follows this list):

  1. Merging and Joining: Combining data from multiple sources or tables.
  2. Sorting: Arranging data in a specific order.
  3. Filtering: Selecting a subset of data based on conditions.
  4. Aggregation: Summarizing data, such as calculating averages or sums.
  5. Reshaping: Changing the structure or format of data, such as pivoting tables.
  6. Indexing: Selecting specific rows or columns of data.
  7. Data Cleaning: Correcting or removing incorrect, corrupted, or duplicate data.

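To make these tasks concrete, here is a minimal pandas sketch; the DataFrames, column names, and values are invented purely for illustration.

import pandas as pd

# Two toy tables (hypothetical data)
sales = pd.DataFrame({'store': [1, 1, 2, 2], 'revenue': [10, 20, 30, 40]})
stores = pd.DataFrame({'store': [1, 2], 'city': ['Paris', 'Lyon']})

merged = sales.merge(stores, on='store')                     # merging/joining
by_revenue = merged.sort_values('revenue', ascending=False)  # sorting
large = merged[merged['revenue'] > 15]                       # filtering
totals = merged.groupby('city')['revenue'].sum()             # aggregation
wide = merged.pivot_table(index='city', columns='store', values='revenue')  # reshaping
first_two = merged.iloc[:2]                                  # indexing
cleaned = merged.drop_duplicates().dropna()                  # data cleaning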

Saving Memory

When we write Y = Y + X, Python first evaluates Y + X, allocating new memory for the result, and then points Y to this new location in memory.

import torch

# Example tensors
Y = torch.tensor([1, 2, 3])
X = torch.tensor([4, 5, 6])

# Save the original id of Y
before = id(Y)

# Perform addition and reassignment
Y = Y + X

# Check if id(Y) remains the same after reassignment
print(id(Y) == before) # False, because Y now points to a new memory location

# Output:
# False

False

We can assign the result of an operation to a previously allocated array Y by using slice notation: Y[:] = <expression>. To illustrate this concept, we overwrite the values of a tensor Z, after initializing it with zeros_like to have the same shape as Y.

Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))

id(Z): 140381179266448
id(Z): 140381179266448

If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X += Y to reduce the memory overhead of the operation.

import torch

# Example tensors
Y = torch.tensor([1, 2, 3])
X = torch.tensor([4, 5, 6])

# Save the original id of Y
before = id(Y)

# Perform in-place addition
Y += X

# Check if id(Y) remains the same after in-place addition
print(id(Y) == before) # True

# Output:
# True
before = id(X)
X += Y
id(X) == before

True

Broadcasting

Broadcasting is a feature in numerical libraries like NumPy and PyTorch that lets you perform operations on arrays of different shapes by automatically expanding smaller arrays to match the shape of larger ones. This process is efficient and avoids unnecessary memory usage.

# same shape

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # Output: [5 7 9]

# different shape: scalar broadcast against a vector

a = np.array([1, 2, 3])
b = 2
print(a + b) # Output: [3 4 5]

# different shape: column vector of shape (3, 1) broadcast against vector of shape (3,)

a = np.array([[1], [2], [3]])
b = np.array([4, 5, 6])
print(a + b)
# Output:
# [[5 6 7]    <- 1 + [4 5 6]
#  [6 7 8]    <- 2 + [4 5 6]
#  [7 8 9]]   <- 3 + [4 5 6]
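The same broadcasting rules apply to PyTorch tensors. A minimal sketch:

import torch

a = torch.arange(3).reshape(3, 1)   # shape (3, 1)
b = torch.arange(2).reshape(1, 2)   # shape (1, 2)
print(a + b)                        # both operands are expanded to shape (3, 2)
# tensor([[0, 1],
#         [1, 2],
#         [2, 3]])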

Conversion to Other Python Objects

Converting to a NumPy tensor (ndarray), or vice versa, is easy. The torch tensor and NumPy array will share their underlying memory, and changing one through an in-place operation will also change the other.

NumPy array ⇄ PyTorch tensor

A = X.numpy()
B = torch.from_numpy(A)  # from_numpy shares memory with A; torch.tensor(A) would make a copy
type(A), type(B)

# output: (numpy.ndarray, torch.Tensor)
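A quick sketch to verify the sharing claim (assuming the conversion uses torch.from_numpy as above):

import numpy as np
import torch

X = torch.arange(3)
A = X.numpy()             # NumPy view of X's memory
B = torch.from_numpy(A)   # tensor sharing A's memory

A += 1                    # in-place update of the NumPy array
print(X, B)               # both print tensor([1, 2, 3]): the change is visible everywhere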

To convert a size-1 tensor to a Python scalar, we can invoke the item function or Python’s built-in functions.

a = torch.tensor([3.5])
a, a.item(), float(a), int(a)
# (tensor([3.5000]), 3.5, 3.5, 3)

Data Processing

Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data.

CSV files and Excel sheets both store tabular data, but they have key differences:

CSV Files

  • Format: Plain text format where data is separated by commas.
  • File Extension: .csv
  • Content: Only contains data, no formatting, formulas, or multimedia.
  • Compatibility: Can be opened by any text editor, spreadsheet software, or program that supports CSV.
  • Size: Typically smaller because it lacks additional features.

Excel Sheets

  • Format: Binary or XML-based format for Microsoft Excel.
  • File Extensions: .xls (older format), .xlsx (newer format)
  • Content: Contains data, but also supports complex formatting, formulas, charts, and multimedia.
  • Compatibility: Best opened with Excel or similar spreadsheet software (like Google Sheets or LibreOffice Calc).
  • Features: Supports advanced features like pivot tables, macros, and data validation.

Summary

CSV files are simple and lightweight for storing plain data, while Excel sheets offer rich features for data manipulation and presentation.
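As a small illustration (the file name and columns here are invented), a CSV file can be written and read back with pandas:

import pandas as pd

df = pd.DataFrame({'NumRooms': [None, 2, 4], 'Price': [127500, 106000, 178100]})
df.to_csv('house_tiny.csv', index=False)   # plain text: one comma-separated row per record
print(pd.read_csv('house_tiny.csv'))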

Linear Algebra

Scalars

Strictly speaking, a quantity that contains just one numerical value is called a scalar.

We denote scalars by ordinary lowercase letters.

Scalars are implemented as tensors that contain only one element.

import torch

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y
# (tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

Vectors

You can think of a vector as a fixed-length array of scalars.

For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to quantities like their income, length of employment, or number of previous defaults.

x = torch.arange(3)
x

#tensor([0, 1, 2])
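We can also access individual elements by index and query the vector's length and shape (continuing with x from above):

x[2], len(x), x.shape
# (tensor(2), 3, torch.Size([3]))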

Matrices

A = torch.arange(6).reshape(3, 2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

A.T

tensor([[0, 2, 4],
        [1, 3, 5]])
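A quick sanity check (assuming A from the block above): transposing twice returns the original matrix.

(A.T).T == A
# tensor([[True, True],
#         [True, True],
#         [True, True]])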


Tensor

While you can go far in your machine learning journey with only scalars, vectors, and matrices, eventually you may need to work with higher-order tensors.

Tensors give us a generic way of describing extensions to nth-order arrays.
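For example, a third-order tensor can be built by reshaping a range (assuming torch is imported):

X = torch.arange(24).reshape(2, 3, 4)
X.shape
# torch.Size([2, 3, 4])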

Basic Properties of Tensor Arithmetic

The elementwise product of two matrices is called their Hadamard product (denoted ⊙)

Matrix Product vs. Hadamard Product


Reduction (sum)

A = np.arange(20, dtype=np.float32).reshape(5, 4)
A
# array([[ 0.,  1.,  2.,  3.],
#        [ 4.,  5.,  6.,  7.],
#        [ 8.,  9., 10., 11.],
#        [12., 13., 14., 15.],
#        [16., 17., 18., 19.]])

axis=0: sums over the rows, giving one value per column.

axis=1: sums over the columns, giving one value per row.

A_sum_axis1 = A.sum(axis=1) 
A_sum_axis1, A_sum_axis1.shape
#(array([ 6., 22., 38., 54., 70.]), (5,))

A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
# (array([40., 45., 50., 55.]), (4,))
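For reference, reducing over all axes at once yields a single number, and mean works the same way (a quick sketch with the A defined above; the exact output formatting may vary by NumPy version):

A.sum(), A.mean()
# (190.0, 9.5)   # 0 + 1 + ... + 19 = 190, and 190 / 20 = 9.5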

Non-Reduction Sum

sum_A = A.sum(axis=1, keepdims=True)
sum_A

array([[ 6.],
       [22.],
       [38.],
       [54.],
       [70.]])


A / sum_A

array([[0.        , 0.16666667, 0.33333334, 0.5       ],
       [0.18181819, 0.22727273, 0.27272728, 0.3181818 ],
       [0.21052632, 0.23684211, 0.2631579 , 0.28947368],
       [0.22222222, 0.24074075, 0.25925925, 0.2777778 ],
       [0.22857143, 0.24285714, 0.25714287, 0.27142859]])

When keepdims=True is used with NumPy's sum() function along a specified axis, the summed axis is retained with length 1 instead of being dropped, so the result keeps the same number of dimensions and can still broadcast against the original array (which is exactly what makes A / sum_A above work). Here is how it affects the shape:
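A quick check of the shapes (assuming A of shape (5, 4) as above):

A.sum(axis=1).shape                  # (5,)   -- axis 1 is dropped
A.sum(axis=1, keepdims=True).shape   # (5, 1) -- axis 1 is kept with length 1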

Cumulative sum

A.cumsum(axis=0)

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  6.,  8., 10.],
       [12., 15., 18., 21.],
       [24., 28., 32., 36.],
       [40., 45., 50., 55.]])

dot product

import numpy as np

# Define matrices A and B for matrix product
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix product using np.dot or the @ operator
AB = np.dot(A, B)
# Or equivalently: AB = A @ B

print("Matrix Product AB:")
print(AB)

Matrix Product AB:
[[19 22]
 [43 50]]
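As a check on the first row: AB[0, 0] = 1·5 + 2·7 = 19 and AB[0, 1] = 1·6 + 2·8 = 22, i.e. each entry of the matrix product is a row of A dotted with a column of B.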



-----------------------------------------------------------




import numpy as np

# Define matrices A and B for Hadamard product (element-wise multiplication)
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Element-wise multiplication (Hadamard product)
A_hadamard_B = A * B

print("Hadamard Product A ⊙ B:")
print(A_hadamard_B)


Hadamard Product A ⊙ B:
[[ 5 12]
 [21 32]]
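For the 1-D case that gives this subsection its name: for vectors, np.dot reduces to the sum of the elementwise products (x and y below are made-up example vectors):

x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])
print(np.dot(x, y))    # 32.0
print(np.sum(x * y))   # 32.0, the same value: 1*4 + 2*5 + 3*6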





Norms

The norm of a vector tells us how big it is.

import numpy as np
import torch

# Example vector x
x = np.array([3, -4, 5])
x_torch = torch.tensor([3., -4., 5.])

# Compute L1 norm
norm_l1_np = np.linalg.norm(x, ord=1)
norm_l1_torch = torch.norm(x_torch, p=1)

# Compute L2 norm
norm_l2_np = np.linalg.norm(x, ord=2)
norm_l2_torch = torch.norm(x_torch, p=2)

print("Vector x:", x)
print("L1 Norm (NumPy):", norm_l1_np)
print("L1 Norm (PyTorch):", norm_l1_torch.item())
print("L2 Norm (NumPy):", norm_l2_np)
print("L2 Norm (PyTorch):", norm_l2_torch.item())

Output:

Vector x: [ 3 -4  5]
L1 Norm (NumPy): 12.0
L1 Norm (PyTorch): 12.0
L2 Norm (NumPy): 7.0710678118654755
L2 Norm (PyTorch): 7.071067810058594

L1 Norm (Manhattan Norm): Computes the sum of the absolute values of the vector elements. It measures the distance a taxi would travel in a city grid system.

L2 Norm (Euclidean Norm): Computes the square root of the sum of the squares of the vector elements. It measures the straight-line distance between two points in Euclidean space.
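Checking the printed values by hand for x = [3, -4, 5]: the L1 norm is |3| + |-4| + |5| = 12, and the L2 norm is sqrt(3^2 + (-4)^2 + 5^2) = sqrt(50) ≈ 7.0711.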

Automatic Differentiation


Create Tensor x:

x = torch.arange(4.0)
print(x)
x.requires_grad_(True)  # record operations on x so that gradients can be stored in x.grad

Output:

tensor([0., 1., 2., 3.])

x is a tensor with values [0., 1., 2., 3.].

Dot Product of x with Itself:

torch.dot(x, x)

The dot product of a vector with itself is calculated as follows:

torch.dot(x, x) = x[0]·x[0] + x[1]·x[1] + x[2]·x[2] + x[3]·x[3]

Plugging in the values from x:

torch.dot(x, x) = 0·0 + 1·1 + 2·2 + 3·3 = 0 + 1 + 4 + 9 = 14

Multiply by 2:

y = 2 * torch.dot(x, x)
print(y)
# y = 2 * (x[0]^2 + x[1]^2 + x[2]^2 + x[3]^2)

We then multiply the result of the dot product by 2:

y = 2 · 14 = 28

Result:

print(y)

Output:

tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method, and then access the gradient via x's grad attribute. Since y = 2 · (x · x), the gradient should be 4x = [0., 4., 8., 12.].

y.backward()
x.grad
tensor([ 0.,  4.,  8., 12.])

Now let’s calculate another function of x and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the already-stored gradient. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows:

x.grad.zero_()  # Reset the gradient
y = x.sum()
y.backward()
x.grad
tensor([1., 1., 1., 1.])

y.backward(): This function call computes gradients using the chain rule of calculus. It calculates ∂y/∂xi for each element xi in x and stores these gradients in x.grad.

x.grad: After calling y.backward(), x.grad will be [1.0, 1.0, 1.0, 1.0], because ∂y/∂xi = 1 for each element of x when y = sum(x).


x.grad.zero_()
y = x * x
# backward on a non-scalar y needs a gradient argument; passing ones makes it
# equivalent to summing y first
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad
tensor([0., 2., 4., 6.])

Detaching Computation

https://www.geeksforgeeks.org/tensor-detach-method-in-python-pytorch/

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u
tensor([True, True, True, True])

Therefore, although u = y.detach() detaches u from y so that u no longer carries gradient information, u is still a tensor. It shares the same data as y but is no longer attached to the computational graph, so during backpropagation no gradient is propagated through u.

u is a copy of y's values, but it is no longer associated with the computational graph, so no gradient tracking is performed for it.

Suppose x initially holds the values [1.0, 2.0, 3.0]. Then:

  • y = [1.0^2, 2.0^2, 3.0^2] = [1.0, 4.0, 9.0]
  • u = [1.0, 4.0, 9.0] (because u is a copy of y's values)
  • z = u * x = [1.0 * 1.0, 4.0 * 2.0, 9.0 * 3.0] = [1.0, 8.0, 27.0]
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x
tensor([True, True, True, True])
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
  • u is a tensor detached from y via the detach() method. This means u holds the same values as y, but it is no longer associated with the computational graph and its gradients are no longer tracked; u can therefore be treated as a constant tensor.
  • When z.sum().backward() is executed, PyTorch computes the gradient of z.sum() with respect to x. Because z = u * x, the gradient of z with respect to x is u.
    • The gradient of z.sum() with respect to x[i] is u[i], because in the computational graph u is treated as a constant rather than a variable.

Therefore, for the hypothetical x = [1.0, 2.0, 3.0] above, x.grad ends up holding [1.0, 4.0, 9.0], i.e. the values of u, exactly as this reasoning predicts.

Dynamic control flow

Computing the gradient with control flow

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Let us compute the gradient.

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

We can now analyze the f function defined above. Note that it is piecewise linear in its input a. In other words, for any a there exists some constant scalar k such that f(a) = k * a, where the value of k depends on the input a. Consequently, we can verify that the gradient is correct by checking d / a.

a.grad == d / a

Dynamic control flow in the context of deep learning frameworks like PyTorch means that the computation and flow of operations can depend on the input data or conditions encountered during runtime. Unlike static control flow where the computational graph is predetermined and fixed before execution, dynamic control flow allows for flexibility in how operations are executed based on the actual data being processed.