Framework Migration Guide
https://chainer.github.io/migration-guide/
Authors: Chainer Team
This document provides technical information for migration from Chainer to PyTorch.
Chainer uses NumPy/CuPy (xp.ndarray) as an array library, and wraps them as chainer.Variable to support autograd. Similarly, PyTorch uses ATen (at::Tensor (C++)) as an array library ("tensor library" in PyTorch terms), and wraps it as torch::Tensor (C++ API) / torch.Tensor (Python API) to support autograd. torch.* provides an API similar to (but not compatible with) NumPy, e.g. torch.dot, torch.float32, etc.
As both frameworks share the same define-by-run concept, code written in PyTorch looks and feels quite similar to Chainer code. Here is a high-level mapping of features:
Chainer | PyTorch | Notes |
Variable chainer.Variable | Tensor torch.Tensor | |
Function chainer.FunctionNode (chainer.functions.*) | Function torch.autograd.Function (torch.nn.functional.*) | `torch.*` also provides NumPy-like (but not compatible) operations. |
Link / Chain chainer.{Link, Chain} (chainer.links.*) | Module torch.nn.Module (torch.nn.*) | |
Sequential chainer.Sequential | Sequential torch.nn.Sequential | You can use function modules as member (e.g., torch.nn.ReLU instead of torch.nn.functional.relu). |
Dataset chainer.dataset.DatasetMixin (chainer.datasets.*) | Dataset torch.utils.data.Dataset | There is no TransformDataset in PyTorch (CPM provides one as cpm.TransformDataset); datasets conventionally accept a `transform` argument that performs per-example preprocessing. |
Iterator chainer.iterators.* | DataLoader torch.utils.data.DataLoader | Unlike Chainer's Iterator, DataLoader automatically collates all samples into one Tensor by default; use collate_fn to customize this behavior. DataLoader itself supports multi-process iteration (using num_workers option). |
Optimizer chainer.Optimizer (chainer.optimizers.*) | Optimizer torch.optim.Optimizer (torch.optim.*) | |
Trainer chainer.training.Trainer | Engine ignite.Engine | ignite.engine.create_supervised_trainer() |
Updater (with converter) chainer.training.Updater | As noted above, DataLoader collates examples by default. Transfer to device is handled by Engine (or custom loop code if you don't use Ignite) | |
Evaluator chainer.training.extensions.Evaluator | ignite.engine.create_supervised_evaluator() | |
Extension chainer.training.Extension (chainer.training.extensions.*) | Handler (ignite.handlers.*, ignite.contrib.handlers.*) |
Refer to the Porting Guide section for the details of the difference of each component.
Arguably the model is the hardest part to port without affecting the outcome of the training.
It might be easier to port in this order:
You can use cpm.TorchModule to wrap a PyTorch module as a Chainer model.
The chainer-pytorch-migration Python module (called CPM, or "cpm" as the module name, in this document) provides various utilities to help migration from Chainer to PyTorch.
Example code assumes that cpm is imported as follows:
import chainer_pytorch_migration as cpm
import chainer_pytorch_migration.ignite
This class wraps a PyTorch module as a Chainer link. It allows training PyTorch models in Chainer training scripts. The graph (forward/backward) must be constructed and traversed in PyTorch.
model = torchvision.models.resnet50()
model.cuda()
w_model = cpm.TorchModule(model)
w_model.to_gpu(device)  # Just synchronizes the metadata, does not transfer data
This class wraps a Chainer parameter as a PyTorch parameter. It allows training of Chainer models (chainer.Link) in PyTorch training scripts (with torch.optim.Optimizer). The graph (forward/backward) must be constructed and traversed in Chainer. cpm.LinkAsTorchModel internally uses it.
# initialized parameter
arr = numpy.full(shape, 17, 'float32')
chainer_param = chainer.Parameter(arr)
torch_param = cpm.ChainerParameter(chainer_param)
This class automatically creates cpm.ChainerParameter objects for all parameters of a given Chainer link, and provides methods such as parameters(), named_parameters(), and state_dict() required by PyTorch optimizers or tools such as Horovod.
model = ChainerModel()
model.to_device(ch_device)
# Initialize parameters before converting to `ChainerParameter`s.
model(ch_device.xp.zeros((1, 784)).astype('f'))
# Convert parameters to `ChainerParameter`s to share memory with PyTorch.
torched_model = cpm.LinkAsTorchModel(model)
optimizer = optim.SGD(torched_model.parameters(), lr=args.lr)
This function registers a Chainer trainer extension to be used with Ignite.
The call takes the Ignite trainer, the PyTorch optimizer, and the Chainer extension as parameters:
optimizer.target = model
trainer.out = 'path to store extension results'
cpm.ignite.add_trainer_extension(
    trainer, optimizer, extensions.ExponentialShift('lr', 0.9, 1.0, 0.1))
The pytorch-pfn-extras Python module (called PPE, or "ppe" as the module name, in this document) provides various supplementary components for PyTorch, including APIs similar to Chainer's, e.g. Extensions, Reporter, and Lazy modules (which automatically infer shapes of parameters). Refer to the documentation for the full list of features.
PPE also provides the interoperability feature between CuPy and PyTorch memory pool.
This function makes CuPy use a memory pool from PyTorch. You need to call it before any operations using CuPy.
# Enable using PyTorch memory allocator in CuPy.
ppe.cuda.use_torch_mempool_in_cupy()

# Revert back to CuPy's default memory pool.
ppe.cuda.use_default_mempool_in_cupy()
Note: This feature was originally implemented in CPM as cpm.use_torch_in_cupy_malloc, but has been moved to PPE. The CPM version is deprecated and no longer recommended.
PyTorch datasets (torch.utils.data.Dataset) are basically compatible with Chainer's. In most cases they are interchangeable in both directions.
As of PyTorch 1.2.0, PyTorch cannot handle data arrays with negative strides (which can result from numpy.flip or chainercv.transforms.flip, for example).
Perhaps the easiest way to circumvent this problem is to wrap the dataset with a transform that applies numpy.ascontiguousarray.
def avoid_negative_strides(in_data):
    data, label = in_data
    data = numpy.ascontiguousarray(data)
    return data, label

dataset = cpm.TransformDataset(dataset, avoid_negative_strides)
data_loader = torch.utils.data.DataLoader(dataset, ...)
Another way is to customize the collation function with the collate_fn argument of torch.utils.data.DataLoader.
def collate(batch):
    data = numpy.stack([d for d, l in batch])
    label = numpy.stack([l for d, l in batch])
    data_tensor = torch.from_numpy(data)
    label_tensor = torch.from_numpy(label)
    return data_tensor, label_tensor

data_loader = torch.utils.data.DataLoader(dataset, ..., collate_fn=collate)
In Chainer, it’s possible to specify custom converters for each batch via `training.updaters.StandardUpdater(train_iter, optimizer, device=device, converter=_converter)`. In PyTorch, similar functionality can be achieved via the data loader: `DataLoader(..., collate_fn=_converter)`.
There is, however, an important difference when used in conjunction with multiprocessing. In Chainer, `_converter` is run in the main process, so it's safe to access CUDA in the function when using multiprocessing's `fork` mode. In PyTorch, however, `_converter` is run inside each forked worker process of the data loader. This means that we cannot access CUDA there without getting a CUDA initialization error. In PyTorch, the correct usage is instead to do only CPU-related operations inside `_converter`, and to send the resulting tensors to the GPU *after* retrieving them from the data loader. The following is an example of correct PyTorch usage:
it = DataLoader(..., collate_fn=_converter)
for img, label, metadata in it:
    img = img.cuda()
    label = label.cuda()
    # metadata is still on CPU
    ...
Note that this is different from what we expect in Chainer, where `_converter` is called in the main process, which is why Chainer code might have CUDA-related operations inside `_converter`.
Note that in the above use case, the DataLoader should also be created with `pin_memory=True` in order to speed up the transfer of `(img, label)` from CPU to GPU: https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723
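For illustration, here is a minimal sketch of the above (the `dataset`, `_converter`, batch size, and worker count are assumptions, not values from the original text):

from torch.utils.data import DataLoader

# pin_memory=True makes the collated CPU tensors page-locked, which allows
# the non_blocking copies below to be asynchronous.
it = DataLoader(dataset, batch_size=32, num_workers=4,
                collate_fn=_converter, pin_memory=True)
for img, label, metadata in it:
    img = img.cuda(non_blocking=True)
    label = label.cuda(non_blocking=True)
    # metadata stays on CPU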
torch.utils.data.DataLoader automatically converts NumPy arrays to PyTorch tensors, but if you want to do that manually, refer to the NumPy Bridge documentation.
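As a minimal sketch of the NumPy bridge (assuming CPU tensors): torch.from_numpy shares memory with the source array, and Tensor.numpy() returns a NumPy view of a CPU tensor.

import numpy
import torch

a = numpy.arange(4, dtype=numpy.float32)
t = torch.from_numpy(a)   # no copy; shares memory with `a`
a[0] = 100
assert t[0].item() == 100
b = t.numpy()             # no copy either; CPU tensors only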
DLPack can be used to bridge between CuPy and torch.Tensor. Note that DLPack does not handle ownership, so you have to make sure the original buffer (the original cupy.ndarray object or dltensor capsule object returned by toDlpack()) survives while the converted tensor/array is in use.
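The following is a minimal sketch of DLPack conversion in both directions (assuming a CUDA-capable environment; the array contents are arbitrary). Keep the source object alive while the converted one is in use.

import cupy
import torch
from torch.utils import dlpack

# CuPy -> PyTorch: keep `ca` alive while `t` is in use.
ca = cupy.arange(10, dtype=cupy.float32)
t = dlpack.from_dlpack(ca.toDlpack())

# PyTorch -> CuPy: keep `t2` alive while `ca2` is in use.
t2 = torch.arange(10, dtype=torch.float32, device='cuda')
ca2 = cupy.fromDlpack(dlpack.to_dlpack(t2))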
If you allocate memory in both PyTorch and CuPy, it is also recommended to call ppe.cuda.use_torch_mempool_in_cupy before using CuPy so that CuPy uses the PyTorch memory pool. Otherwise, memory allocated and freed by CuPy is kept in the CuPy memory pool and cannot be reused by PyTorch.
Division of integer tensors behaves differently from NumPy/CuPy, which respect Python 3 (true) division rules. You need to explicitly cast to float in PyTorch (discussion).
>>> x = numpy.arange(5)
>>> x
array([0, 1, 2, 3, 4])
>>> x / 5
array([0. , 0.2, 0.4, 0.6, 0.8])
>>> torch.from_numpy(x) / 5
tensor([0, 0, 0, 0, 0])
>>> torch.from_numpy(x).float() / 5
tensor([0.0000, 0.2000, 0.4000, 0.6000, 0.8000])
See PyTorch for Numpy users for the comparison table.
This is example code for a training loop. Note the call to model.train().
device = torch.device('cuda:0')

for i_epoch in range(args.epoch):
    train_loss = 0
    train_correct = 0
    model.train()
    for x, t in data_loader:
        x = x.to(device)
        t = t.to(device).long()
        optimizer.zero_grad()
        y = model(x)
        loss = F.nll_loss(y, t)
        loss.backward()
        optimizer.step()
        # F.nll_loss averages over the batch by default; multiply by the batch
        # size so that the accumulated value is a sum over samples.
        train_loss += loss.item() * x.size(0)
        _, pred = torch.max(y, 1)
        train_correct += (pred == t).sum().item()
    train_loss /= len(data_loader.dataset)
    train_accuracy = train_correct / len(data_loader.dataset)
    print('Train average loss: {:.03f}'.format(train_loss))
    print('Train accuracy : {:.03f} %'.format(train_accuracy * 100))
This is example code for an evaluation loop. Note model.eval() and with torch.no_grad().
device = torch.device('cuda:0')

total_loss = 0
total_correct = 0
model.eval()
with torch.no_grad():
    for x, t in data_loader:
        x = x.to(device)
        t = t.to(device).long()
        y = model(x)
        total_loss += F.nll_loss(y, t, reduction='sum').item()
        _, pred = torch.max(y, 1)
        total_correct += (pred == t).sum().item()

average_loss = total_loss / len(data_loader.dataset)
accuracy = total_correct / len(data_loader.dataset)
Ignite is the library corresponding to chainer.training.Trainer in Chainer.
This Chainer code:
updater = chainer.training.StandardUpdater(
    train_iter, optimizer, device=device)
trainer = chainer.training.Trainer(updater, (100, 'epoch'))
trainer.extend(
    extensions.Evaluator(
        val_iter, model, device=device),
    trigger=(1, 'epoch'))
trainer.run()
can be written in PyTorch using Ignite:
trainer = ignite.engine.create_supervised_trainer(
    model, optimizer, F.nll_loss, device=device)
evaluator = ignite.engine.create_supervised_evaluator(
    model,
    metrics={
        'accuracy': ignite.metrics.Accuracy(),
        'loss': ignite.metrics.Loss(F.nll_loss),
    },
    device=device)

@trainer.on(ignite.engine.Events.EPOCH_COMPLETED)
def validation(engine):
    evaluator.run(val_loader)
    average_accuracy = evaluator.state.metrics['accuracy']
    average_loss = evaluator.state.metrics['loss']
    print(average_accuracy, average_loss)

trainer.run(train_loader, max_epochs=100)
For a list of supported metrics, see https://pytorch.org/ignite/metrics.html.
Using cpm.ignite.add_trainer_extension, it is possible to register a Chainer extension to be called within the Ignite training loop.
A list of the supported extensions follows:
Works | Doesn’t work |
ExponentialShift | DumpGraph |
FailOnNonNumber | Evaluator |
InverseShift | unchain_variables |
LinearShift | |
LogReport | |
MicroAverage | |
MultistepShift | |
ParameterStatistics | |
PlotReport | |
PolynomialShift | |
PrintReport | |
ProgressBar | |
snapshot (read docs) | |
StepShift | |
observe_lr | |
VariableStatisticsPlot | |
WarmupShift |
One drawback is that metrics associated with the model or links might not be accessible by default.
For example, the user will need to report the loss or accuracy per iteration using an Ignite callback, since in Chainer this was done inside the model.
Also, for some extensions to work, the user must assign the PyTorch or Chainer model to the optimizer's target attribute, and set the output directory path for the LogReport, plotting, and snapshot extensions.
from chainer import reporter

def report_loss(engine):
    reporter.report({'loss': engine.state.output})
An example of how to register multiple extensions:
# Torch optimizer
# Ignite trainer
trainer = create_supervised_trainer(model, optimizer, F.nll_loss, device=device)

optimizer.target = model
# Set the output dir for some of the extensions
trainer.out = 'result'
# Restore the snapshot
cpm.ignite.load_chainer_snapshot(trainer, optimizer, 'result/snapshot_iter_4691')
# Add a bunch of extensions
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.ProgressBar())
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.observe_lr())
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.MicroAverage('loss', 'lr', 'mav', (1, 'iteration')))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.LogReport())
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.FailOnNonNumber())
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.ExponentialShift('lr', 0.9, 1.0, 0.1))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.ParameterStatistics(model, prefix='model'))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.VariableStatisticsPlot(model))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.PrintReport(
    ['epoch', 'iteration', 'loss', 'lr', 'mav', 'model/fc2/weight/grad/percentile/1']))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.PlotReport(['loss'], 'epoch', filename='loss.png'))
cpm.ignite.add_trainer_extension(trainer, optimizer, extensions.snapshot(writer=writer), trigger=(1, 'epoch'))  # writer is a SimpleWriter
When using the snapshot extension with an Ignite trainer, the PyTorch objects are saved to an additional "snapshot-torch" file in the output folder. This allows you to keep using these snapshots once the migration is finished, and to load PyTorch models or the optimizer state directly from these files.
Additionally, if you are mixing Chainer models or optimizers with Ignite and PyTorch, these objects will be saved in the Chainer snapshot file.
The correct way to restore a snapshot is by using cpm.ignite.load_chainer_snapshot(engine, optimizer, snapshot_path) with the Chainer snapshot path.
Note that previously taken Chainer snapshots are not compatible.
You can pass a step function to an Ignite engine.
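For example, a custom training step can be written roughly as follows. This is a sketch only; `model`, `optimizer`, `device`, and `train_loader` are assumed to exist, and F.nll_loss is an arbitrary choice of loss.

import torch.nn.functional as F
from ignite.engine import Engine

def train_step(engine, batch):
    model.train()
    x, t = batch
    x, t = x.to(device), t.to(device).long()
    optimizer.zero_grad()
    loss = F.nll_loss(model(x), t)
    loss.backward()
    optimizer.step()
    return loss.item()  # becomes engine.state.output

trainer = Engine(train_step)
trainer.run(train_loader, max_epochs=100)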
Use the Mapping of functions and links section to find the corresponding feature in PyTorch. You can also find existing model implementations in:
Common pitfalls:
You can find the PyTorch equivalent of Chainer's functions and links in tables below.
Notes:
F refers to chainer.functions (Chainer) / torch.nn.functional (PyTorch).
Chainer | PyTorch | Notes |
Arithmetic functions | ||
Batched addition (accumulating multiple tensors in a single call) is not supported. | ||
Activation functions | ||
Rewrite as: x.clamp(0, z) | ||
Rewrite as: torch.cat((F.relu(x), F.relu(-x))) | ||
Rewrite as: torch.clamp(x * 0.2 + 0.5, 0, 1) | ||
The default slope value is different. | ||
See L.LSTM. | ||
Need to implement manually; see https://github.com/pytorch/pytorch/issues/805 | ||
`training` option must be explicitly specified instead of `train` config in Chainer. | ||
Some OSS implementations are available (e.g., https://github.com/reachtarunhere/S-LSTM-PyTorch) | ||
PyTorch falls back to linear function by default; threshold option must be explicitly given. | ||
Rewrite as: x * F.sigmoid(beta * x) | ||
Some OSS implementations are available (e.g., https://github.com/dasguptar/treelstm.pytorch) | ||
Array manipulations | ||
PyTorch operations perform broadcasting automatically, as in NumPy: https://pytorch.org/docs/stable/notes/broadcasting.html | |
N/A: https://github.com/pytorch/pytorch/pull/17160 [a] | ||
Rewrite as: torch.cat([a,b],dim=2) | ||
Rewrite as: torch.unsqueeze(a, dim) | ||
Use dims=1 | ||
Use dims=0 | ||
Use direct indexing: `x[indexes]`. Negative strides are not supported. | ||
Rewrite as: torch.cat([a,b],dim=1) | ||
Only NCHW is supported. | |
See: https://discuss.pytorch.org/t/swap-axes-in-pytorch/970/2 | ||
Replace `constant_values` argument with `value`. Modes other than `constant` are also available. | ||
You cannot specify the length; the maximum length among the inputs is used. | |
Different behavior to F.repeat. F.tile is more similar. | ||
See https://discuss.pytorch.org/t/swap-axes-in-pytorch/970/2 | ||
Rewrite as: torch.gather(x, 1, t[:, None])[:, 0] | ||
Requires manual manipulation of the results to achieve some of the separate functionality. | ||
You need to implement it yourself. Ref: https://discuss.pytorch.org/t/is-there-any-layer-like-tensorflows-space-to-depth-function/3487/14 | ||
The second argument `size` takes `torch.Size` object that denotes the target output image size (N, C, H, W), while `F.spatial_transformer_grid` takes just a tuple of (H, W). The size of returned tensor is also different: (N x H x W x 2) is returned instead of (N x 2 x H x W). Also note the breaking change regarding align_corners in v1.3.0 (https://github.com/pytorch/pytorch/releases/tag/v1.3.0) | ||
Grid shape is (N, 2, H, W) in Chainer while (N, H, W, 2) in PyTorch. | ||
No `force_tuple`. | ||
Use torch.stack or torch.cat([a,b],dim=axis) | ||
Use permute instead: https://discuss.pytorch.org/t/swap-axes-in-pytorch/970/7 | ||
Use Tensor.permute or torch.t for no axes version | ||
N/A | ||
Rewrite as: torch.cat([a,b],dim=0) | ||
Neural network connections | ||
No `cover_all`. | ||
No `cover_all`. | ||
No `cover_all`. | ||
Use `groups` argument; see https://discuss.pytorch.org/t/depthwise-and-separable-convolutions-in-pytorch/7315/2 | ||
Not implemented: https://github.com/pytorch/pytorch/issues/2260 | ||
Use `dilation` argument. | ||
There is no option for `n_batch_axes`. | ||
torch.gru exists only as an undocumented _C._VariableFunctions function; the "link" equivalent, torch.nn.GRU (https://pytorch.org/docs/stable/nn.html#torch.nn.GRU), is probably the expected usage. | |
See L.NStepBiLSTM. | ||
See L.NStepBiRNNTanh or L.NStepBiRNNReLU. | ||
See L.NStepBiGRU. | ||
See L.NStepLSTM. | ||
See L.NStepRNNTanh or L.NStepRNNReLU. | ||
Evaluation functions | ||
N/A, Ignite has an implementation: https://pytorch.org/ignite/metrics.html#ignite.metrics.Accuracy | ||
N/A, Ignite has an implementation: https://pytorch.org/ignite/metrics.html#ignite.metrics.Accuracy | ||
N/A | ||
N/A, Ignite has an implementation: https://pytorch.org/ignite/metrics.html#ignite.metrics.Precision | ||
Not available. It's an evaluation metric that's not differentiable. It's implemented in Ignite though and could be used (as a reference) https://github.com/pytorch/ignite/pull/496 | ||
N/A, Ignite has an implementation: https://pytorch.org/ignite/metrics.html#ignite.metrics.Recall | ||
Loss functions | ||
N/A | ||
Possibly: -torch.distributions.Bernoulli(y).log_prob(x).sum() | ||
Not available: https://github.com/pytorch/pytorch/issues/11134 | ||
Not available: https://github.com/pytorch/pytorch/issues/11134 | ||
Not available. See https://github.com/Wizaron/instance-segmentation-pytorch for a reproducing work. | ||
N/A | ||
N/A | ||
Use reduction='sum' to keep reduction method | ||
See also: ignite.metrics.MeanAbsoluteError | ||
See https://github.com/kefirski/pytorch_NEG_loss, https://github.com/theeluwin/pytorch-sgns | ||
Mathematical functions | ||
See F.mean. | ||
Batched linalg ops are in progress; batched inverse is already merged. | |
Rewrite as: x.reshape(len(x), -1).norm(dim=1) ** 2 | ||
Rewrite as x + y[(...,) + (None,) * (x.ndim - y.ndim - axis)] | ||
Arbitrary number of batch axes are supported. | ||
Not available. Implement it similar to erf, erfinv? https://github.com/pytorch/pytorch/pull/2799 | ||
N/A | ||
Interface is quite different. | ||
N/A | ||
Currently undocumented: https://github.com/pytorch/pytorch/pull/27812 | ||
Normal math should suffice. | ||
N/A | ||
Weighted average is not supported; rewrite as (without keepdims): torch.tensordot(x, weights / weights.sum(), ([axis], [0])) | ||
F.gelu(x) corresponds to x * F.ndtr(x). | |
N/A | ||
Not documented: https://github.com/pytorch/pytorch/issues/25347 n>=2 not supported: https://github.com/pytorch/pytorch/blob/v1.3.1/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp#L179 | ||
- | ||
Rewrite as: x * y[(...,) + (None,) * (x.ndim - y.ndim - 1)] | ||
Only support dense-sparse product. For sparse-dense product, transpose the operands and the output. | ||
Rewrite as: x * x | ||
N/A | ||
N/A | ||
Noise injections | ||
No mask support, elements are randomly zeroed. | ||
The default value of tau is 1, while the Chainer's function takes 0.1. | ||
Not available. Use F.dropout on the weight, or try torchnlp.nn.WeightDrop. | ||
N/A | ||
Normalization functions | ||
N/A | ||
N/A | ||
training=False? | ||
Batch Renormalization not implemented: https://discuss.pytorch.org/t/support-for-batch-renormalization/2965, https://discuss.pytorch.org/t/batch-renormalization-implementation-in-thcunn/5144 | ||
N/A | ||
Currently undocumented. | ||
`gamma = weight` & `beta= bias` | ||
PyTorch's `F.normalize` is not limited to L2 normalization, but its default behavior is L2 normalization, i.e., the default value of the second argument `p` is 2. | |
Spatial pooling | ||
Superset of Chainer's counterpart. | ||
N/A | ||
In addition to arguments documented in `nn.MaxPool1D`, `return_indices` argument is available to obtain index for unpooling. | ||
ditto. | ||
ditto. | ||
N/A | ||
Requires torchvision. torchvision has only two RoI functions: roi_align uses the average of the pixels, while roi_pool uses the max value. | |
N/A | ||
N/A | ||
According to the source (https://github.com/pytorch/vision/blob/ccd1b27d2b7312ebddb4d51b3a4f8ade1ba8fa8b/torchvision/csrc/cpu/ROIPool_cpu.cpp#L65), the `roi_pool` function of torchvision is meant to be RoI max pooling. Regarding the API, the way to pass the batch indices of each set of RoI coordinates is different. | |
Requires torchvision. | ||
Not available. Not too difficult to implement? It's a combination of existing functions in Chainer. | ||
Pass `indices` returned from F.max_pool1d. | ||
ditto. | ||
ditto. | ||
N/A | ||
N/A | ||
Utility functions | ||
L refers to chainer.links (Chainer), and nn refers to torch.nn (PyTorch).
Chainer | PyTorch | Notes |
Learnable connections | ||
N/A. Reference user implementation at: https://github.com/ttpro1995/TreeLSTMSentiment | ||
N/A | ||
Use `groups` argument; see https://discuss.pytorch.org/t/depthwise-and-separable-convolutions-in-pytorch/7315/2 | ||
Use `dilation` argument. | ||
N/A | ||
`torchvision.models.inception.InceptionA` seems to be the corresponding module for Chainer's `L.Inception`, but is not documented. | ||
See torchvision.models.inception for Inception v3 | ||
N/A | ||
N/A | ||
bidirectional=True, no explicit activation, no stacking | ||
bidirectional=True, no explicit activation, no stacking | ||
bidirectional=True, no explicit activation, no stacking | ||
bidirectional=True, no explicit activation, no stacking | ||
You could use torch.nn.modules.ParameterList with 1 element | ||
N/A | ||
N/A | ||
N/A | ||
Activation/loss/normalization functions with parameters | ||
The argument `momentum` in the PyTorch implementation seems to be equivalent to `1 - decay` in the Chainer's link. The default value for the argument `eps` (1e-5) is different from Chainer's default value (2e-5). | ||
N/A | ||
Not available. A reference implementation (not that well implemented?) https://github.com/huangleiBuaa/IterNorm-pytorch/blob/master/extension/normailzation/dbn.py. Otherwise look at the Torch lua official implementation https://github.com/princeton-vl/DecorrelatedBN. | ||
affine=True | ||
elementwise_affine=True | ||
N/A | ||
N/A | ||
Machine learning models | ||
N/A | ||
Pre-trained models | ||
Superset of Chainer's VGG variations in torchvision. | ||
ditto | ||
N/A | ||
transform_input=True in torchvision.models.googlenet | ||
torchvision only, pretrained=True | ||
torchvision only, pretrained=True | ||
torchvision only, pretrained=True | ||
N/A | ||
N/A | ||
See https://github.com/marvis/pytorch-caffe or https://github.com/Microsoft/MMdnn |
Here is the mapping of configurations in Chainer (chainer.config.*) and PyTorch:
Chainer | PyTorch | Notes |
autotune | torch.backends.cudnn.benchmark | Not thread-local. |
cudnn_deterministic | torch.backends.cudnn.deterministic | Not thread-local. |
cudnn_fast_batch_normalization | N/A | Intentionally unsupported as the precision is low in some models. |
debug | N/A | Use the torch.autograd.detect_anomaly() context manager to check for NaN during backward and display the corresponding forward stack trace when an error occurs in backward. |
dtype | Mixed precision support is done via Apex. Not thread-local. | |
enable_backprop | torch.no_grad() torch.enable_grad() | You can use them as a context manager or decorator (see the sketch after this table). See also Backprop modes. |
is_recomputing | N/A | See torch.utils.checkpoint.checkpoint for F.forget equivalent (it also supports RNG). |
keep_graph_on_report | N/A | |
lazy_grad_sum | N/A | |
train | N/A | The mode is configured per Module (using Module.train() and Module.eval()). See also Train/Test modes. |
type_check | N/A | |
use_cudnn | torch.backends.cudnn.enabled | Enabled by default. Not thread-local. |
use_cudnn_tensor_core | N/A | Tensor Cores cannot be disabled. |
use_ideep | N/A | PyTorch itself supports MKL-DNN. You can check availability using torch.backends.mkldnn.is_available(). |
use_static_graph | N/A | |
warn_nondeterministic | N/A |
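As a brief illustration of the enable_backprop and train rows above (a minimal sketch; the module is an arbitrary example):

import torch

model = torch.nn.Linear(4, 2)

model.train()  # training-mode behavior for dropout, batchnorm, etc.
model.eval()   # inference-mode behavior

with torch.no_grad():  # disable graph construction, like enable_backprop=False
    y = model(torch.zeros(1, 4))
assert not y.requires_grad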
See Reproducibility for the reproducibility (including steps to fix seeds).
There is no equivalent feature in PyTorch.
Replacements for Chainer built-in hooks:
You can register Module Hooks per module. There's no way to inject a hook for every Module called under the specific scope.
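A minimal sketch of registering a per-module forward hook (the hook body here is just an illustration):

import torch

linear = torch.nn.Linear(4, 2)

def print_shapes(module, inputs, output):
    print(module.__class__.__name__, inputs[0].shape, output.shape)

handle = linear.register_forward_hook(print_shapes)
linear(torch.zeros(3, 4))  # the hook fires after forward
handle.remove()            # unregister when no longer needed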
Replacements for Chainer built-in hooks:
There is no direct equivalent in PyTorch, but you can register backward hooks per Tensor / Module to modify gradients.
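For example, gradients of a single tensor can be modified with a per-tensor hook (a sketch only; clipping is used as an arbitrary example):

import torch

w = torch.nn.Parameter(torch.ones(3))
# The hook receives the gradient and may return a modified copy.
w.register_hook(lambda grad: grad.clamp(-1.0, 1.0))

loss = (w * 10.0).sum()
loss.backward()
print(w.grad)  # gradients are clamped to [-1, 1]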
Replacements for Chainer built-in hooks:
To quickly try a PyTorch model in a training script using Chainer, cpm.TorchModule is the tool to use. Assuming you have a training script using Chainer, you have to try the following steps:
As of writing, there are two major ways to run distributed deep learning applications: torch.distributed and Horovod. We recommend torch.distributed as a first option because of the following reasons.
In this document, we describe both approaches to migrate ChainerMN programs to PyTorch.
torch.distributed is the standard PyTorch module for distributed deep learning.
torch.distributed supports three backends: "nccl", "mpi", and "gloo". For users who are migrating from Chainer/ChainerMN and have been using NCCL with MPI, the "nccl" backend is the most straightforward choice. In this section, we assume that you use NCCL and MPI to run your distributed deep learning programs. In particular, we assume Open MPI as the MPI implementation because it is the recommended option in ChainerMN, but other MPI implementations are mentioned as well.
In ChainerMN, process invocation is totally coordinated by the MPI runtime. However, in PyTorch and torch.distributed, you may need a few more steps to invoke distributed deep learning processes. The simplest initialization method might be environment variable initialization.
The following environment variables are necessary (whatever system you use to invoke your script, including MPI). The other variables, WORLD_SIZE and RANK, are set from inside the snippet below.
MASTER_ADDR : Address of the computing node where the rank 0 process runs.
MASTER_PORT : A free port of the MASTER_ADDR machine. The port will be used by the rank 0 process.
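If your launcher does not set these variables for you, one possible approach (an assumption, not a requirement of torch.distributed) is to set them from Python before initialization; the address and port below are placeholders.

import os

# Placeholders; use the actual address of the rank-0 node and a free port.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")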
Note that process invocation is a highly system-dependent issue. PyTorch supports other options such as TCP initialization and shared file-system initialization. Please refer to the official documentation for more details.
The following code snippet shows how to initialize the torch.distributed module.
import os

import torch

# setup env for torch.distributed
comm_world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
comm_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
comm_local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])

os.environ["WORLD_SIZE"] = str(comm_world_size)
os.environ["RANK"] = str(comm_rank)

torch.cuda.set_device(comm_local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
Environment variables set by the MPI runtime are used here (instead of communicator.intra_rank as in ChainerMN) because torch.distributed does not provide the corresponding local-rank information. If you use MVAPICH2, use MV2_COMM_WORLD_SIZE, MV2_COMM_WORLD_RANK, and MV2_COMM_WORLD_LOCAL_RANK respectively.
Each node can get a slice of a globally shared dataset using a DistributedSampler.
sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=comm_world_size, rank=comm_rank)
loader_kwargs = {'num_workers': 1, 'pin_memory': True}  # Assuming we use GPUs
loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=sampler, **loader_kwargs)
This makes every worker load only a slice of the dataset; the sampler can be passed to the DataLoader as usual.
Also, you need to call DistributedSampler.set_epoch() at the beginning of each epoch to set the epoch number (which is used to seed shuffling). Thus, a typical training loop looks like:
for epoch in range(1, args.epochs + 1):
    train_sampler.set_epoch(epoch)
    train(args, model, device, train_loader, optimizer, epoch)
    test(args, model, device, test_loader, len(test_dataset))
    scheduler.step()
We need to specify the device to which the data is transferred using comm_local_rank.
class MyNN(nn.Module):
    ...

device = torch.device("cuda:{}".format(comm_local_rank) if use_cuda else "cpu")
model = MyNN().to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[comm_local_rank])
In contrast to Horovod, we can use the same optimizer as in non-distributed execution.
Parameters are synchronized automatically by the DistributedDataParallel class (an initial broadcast of parameter values and an allreduce of gradients in every iteration), so no further modification is necessary.
To avoid potential data races or other kinds of bugs, you may need to call torch.distributed.barrier() to synchronize processes before or after data loading, and before finishing the application.
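For example, a barrier can be used to let rank 0 prepare data while the other ranks wait (a sketch; it assumes torch.distributed has been initialized, and prepare_dataset is a hypothetical helper):

import torch.distributed as dist

if dist.get_rank() == 0:
    prepare_dataset()  # hypothetical helper that downloads/preprocesses data
dist.barrier()         # all ranks wait here until rank 0 is done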
PyTorch can use Horovod to do Data Parallel training in a similar way to ChainerMN.
Data is distributed across the nodes and the optimizer is wrapped with Horovod to automatically average the gradients of several MPI processes.
The following snippet shows how to import horovod and retrieve the current worker id and the total number of workers.
import horovod.torch as hvd

hvd.init()
print('My rank is {} of {} workers'.format(hvd.rank(), hvd.size()))
hvd.local_rank() is used to get the rank inside a single node; this is useful for assigning GPUs, similar to ChainerMN's intra_rank().
torch.cuda.set_device(hvd.local_rank())
Each node can get a slice of a globally shared dataset using a DistributedSampler.
torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
This makes every worker load only a slice of the dataset; the sampler can be passed to the DataLoader as usual.
The optimizer is wrapped in a hvd.DistributedOptimizer object with the following configuration parameters.
compression : value in {hvd.Compression.fp16, hvd.Compression.none}
compression is used to reduce the size of the allreduce operations performed by the optimizer.
backward_passes_per_step : int (default: 1)
Number of batches that are performed locally before performing the gradients exchange.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=compression,
    backward_passes_per_step=args.batches_per_allreduce)
From the documentation:
DistributedOptimizer exposes the synchronize() method, which forces allreduce operations to finish before continuing the execution. It's useful in conjunction with gradient clipping, or other operations that modify gradients in place before step() is executed. Make sure to use optimizer.skip_synchronize() if you're calling synchronize() in your code.
Before starting the training loop, initial model parameters and the optimizer state must be broadcasted to all the workers:
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
When computing the loss and other metrics such as accuracy, the values of multiple workers can be explicitly exchanged to compute averages:
self.sum += hvd.allreduce(val.detach().cpu(), name=metric_name)
Horovod has support to exchange data using other MPI collectives:
There are _async versions of these functions; the returned handle can be queried with poll() or passed to synchronize() to wait for completion.
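A minimal sketch of the asynchronous variants (it assumes hvd.init() has already been called; the tensor is arbitrary):

import torch
import horovod.torch as hvd

tensor = torch.ones(4)
handle = hvd.allreduce_async(tensor, name='example')
# ... overlap other local computation here ...
if not hvd.poll(handle):           # non-blocking completion check
    pass                           # still in flight; keep doing local work
result = hvd.synchronize(handle)   # blocks until the allreduce finishes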
import torch
import horovod.torch as hvd
…

def main():
    # Initialize horovod
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    # Read the dataset and create the iterators
    dataset = datasets.ImageFolder(…)
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = torch.utils.data.DataLoader(dataset, sampler=train_sampler, …)
    …

    # Create the optimizer
    optimizer = optim.SGD(model.parameters(), …)
    optimizer = hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        compression=hvd.Compression.none,
        backward_passes_per_step=args.batches_per_allreduce)

    # Broadcast initial state: parameters & optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Start training
    for epoch in range(epochs):
        train_sampler.set_epoch(epoch)
        …
Communication traces showing Horovod communications can be obtained by setting the HOROVOD_TIMELINE environment variable.
mpirun -bind-to none -np 8 -x HOROVOD_TIMELINE=timeline.json ...
The resultant trace can be visualized in Chrome by using the browser built-in chrome://tracing feature.
Horovod has several knobs to improve its performance.
Horovod launches all-reduce in parallel with backward computation, and Apex unscales gradients after backward computation.
To avoid race conditions, we have to wait for all-reduce completion before unscaling:
from apex import amp

...

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
    optimizer.synchronize()  # Wait for all-reduce completion

with optimizer.skip_synchronize():
    optimizer.step()
Also, backward_passes_per_step should be 1 when using Horovod and Apex; the current implementations of Horovod and Apex do not work as expected when backward_passes_per_step is not 1.
Horovod does not yet officially support multi-node (synchronized) batch normalization (https://github.com/horovod/horovod/issues/1384), but there is an unofficial implementation: https://github.com/atranitell/Synchronized-BatchNorm-PyTorch-Horovod/blob/master/sync_bn.py. Apex also has an implementation: https://nvidia.github.io/apex/parallel.html#apex.parallel.SyncBatchNorm
Horovod supports simultaneous usage with mpi4py (https://github.com/horovod/horovod#mpi4py). You can directly work with mpi4py to e.g. rewrite ChainerMN's comm.gather_obj:
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Verify that MPI multi-threading is supported.
assert hvd.mpi_threads_supported()

from mpi4py import MPI
mpi_comm = MPI.COMM_WORLD
assert hvd.size() == mpi_comm.Get_size()

mpi_comm.gather(obj, root=0)  # This is equal to ChainerMN's comm.gather_obj
Horovod is introduced here because it greatly resembles ChainerMN and can be used in our computing infrastructure right away. Alternatives are:
To train Chainer models in distributed environments using Horovod, the Chainer link should be wrapped using cpm.LinkAsTorchModel. The use of a PyTorch optimizer is required.
model = ChainerModel()
model.to_device(ch_device)
# Initialize parameters before converting to `ChainerParameter`s.
model(ch_device.xp.zeros((1, 784)).astype('f'))
# Convert parameters to `ChainerParameter`s to share memory with PyTorch.
torched_model = cpm.LinkAsTorchModel(model)
optimizer = optim.SGD(torched_model.parameters(), lr=args.lr)
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=torched_model.named_parameters())
hvd.broadcast_parameters(torched_model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
Using the cpm tool it is also possible to train a PyTorch model using ChainerMN.
The current support is limited only to data parallel training.
from chainer_pytorch_migration import chainermn
…

comm = chainermn.create_communicator('pure_nccl')

# Set up standard ResNet-50 model.
model = models.resnet50()
model.cuda()
w_model = links.TorchModule(model)
w_model.to_gpu(device)

optimizer = optim.SGD(model.parameters(), lr=lr)
optimizer = chainermn.create_multi_node_optimizer(optimizer, comm)
optimizer.setup(w_model)
Explains differences in how variables can be unchained from the computational graph.
Explains differences in how backprop modes are switched.
Explains differences in how train/test modes are switched.
This section introduces some of the larger repositories under the PyTorch GitHub organization. It also refers to the official list of other ecosystem-libraries acknowledged by PyTorch.
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
GitHub: https://github.com/pytorch/pytorch
Summary: High level utilities such as training loop abstraction.
GitHub: https://github.com/pytorch/ignite
Summary: PyTorch for CV.
GitHub: https://github.com/pytorch/vision
Recommended by the official installation guide to install along with PyTorch.
Provides domain-agnostic (not limited to CV) data augmentation functionality.
Provides loaders for video data. Slow due to ffmpeg but this might be improved in the future?
Summary: PyTorch for NLP.
GitHub: https://github.com/pytorch/text
Summary: PyTorch for audio data.
GitHub: https://github.com/pytorch/audio
Summary: Seq2seq models.
GitHub: https://github.com/pytorch/fairseq
Seq2seq models such as translation. Includes the Transformer and BERT-like models.
There is an official list of libraries included in the PyTorch ecosystem (besides the domain specific libraries above), including e.g. Ignite.
[a] We can do it with `expand`: https://pytorch.org/docs/stable/tensors.html#torch.Tensor.expand