dislib.array

class dislib.data.array.Array(blocks, top_left_shape, reg_shape, shape, sparse)

Bases: object

A distributed 2-dimensional array divided in blocks.

Normally, this class should not be instantiated directly, but created using one of the array creation routines provided.

Apart from the different methods provided, this class also supports the following types of indexing:

  • A[i] : returns a single row
  • A[i, j] : returns a single element
  • A[i:j] : returns a set of rows (with i and j optional)
  • A[:, i:j] : returns a set of columns (with i and j optional)
  • A[[i,j,k]] : returns a set of non-consecutive rows
  • A[:, [i,j,k]] : returns a set of non-consecutive columns
  • A[i:j, k:m] : returns a set of elements (with i, j, k, and m optional)
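To make the indexing forms above concrete, here is a minimal plain-Python sketch of how such keys can be dispatched in `__getitem__`. The `Grid` class is hypothetical and is not dislib's implementation; as a simplifying assumption it always returns another 2-dimensional `Grid`, even for a single row or element.

```python
class Grid:
    """Toy 2-D container mimicking the indexing forms listed above."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @staticmethod
    def _pick(seq, key):
        # Normalise an int, a slice, or a list of ints into a list of items.
        if isinstance(key, int):
            return [seq[key]]
        if isinstance(key, slice):
            return seq[key]
        return [seq[k] for k in key]

    def __getitem__(self, key):
        # Two-axis forms such as A[i, j] arrive as a tuple; one-axis forms
        # such as A[i], A[i:j] or A[[i, j, k]] select whole rows.
        row_key, col_key = key if isinstance(key, tuple) else (key, slice(None))
        picked_rows = self._pick(self.rows, row_key)
        return Grid([self._pick(row, col_key) for row in picked_rows])


g = Grid([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
single_row = g[1].rows             # [[4, 5, 6]]
single_element = g[1, 2].rows      # [[6]]
some_columns = g[:, [0, 2]].rows   # [[1, 3], [4, 6], [7, 9]]
sub_block = g[0:2, 1:].rows        # [[2, 3], [5, 6]]
```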
Parameters:
  • blocks (list) – List of lists of nd-array or spmatrix.
  • top_left_shape (tuple) – A single tuple indicating the shape of the top-left block.
  • reg_shape (tuple) – A single tuple indicating the shape of the regular block.
  • shape (tuple (int, int)) – Shape of the whole array, as (number of rows, number of columns).
  • sparse (boolean, optional (default=False)) – Whether this array stores sparse data.
Variables:
  • shape (tuple (int, int)) – Shape of the whole array, as (number of rows, number of columns).
  • _blocks (list) – List of lists of nd-array or spmatrix.
  • _top_left_shape (tuple) – A single tuple indicating the shape of the top-left block. This can be different from _reg_shape when slicing arrays.
  • _reg_shape (tuple) – A single tuple indicating the shape of regular blocks. Top-left and bottom-right blocks might have different shapes (and thus, also the whole first/last blocks of rows/cols).
  • _n_blocks (tuple (int, int)) – Total number of (horizontal, vertical) blocks.
  • _sparse (boolean) – True if this array contains sparse data.
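As a rough illustration of how shape and reg_shape relate to the grid of blocks, the following plain-Python sketch splits a small array into regular blocks. The helper `split_into_blocks` is hypothetical and not part of dislib; it ignores top_left_shape and sparse data, and simply lets the last blocks of rows and columns be smaller when the shape is not a multiple of the block shape.

```python
import math


def split_into_blocks(rows, reg_shape):
    """Split a list of equal-length rows into a grid (list of lists) of blocks."""
    n_rows, n_cols = len(rows), len(rows[0])
    block_rows, block_cols = reg_shape
    # Number of blocks along each axis; edge blocks may be smaller.
    grid = (math.ceil(n_rows / block_rows), math.ceil(n_cols / block_cols))
    blocks = [
        [
            [row[c:c + block_cols] for row in rows[r:r + block_rows]]
            for c in range(0, n_cols, block_cols)
        ]
        for r in range(0, n_rows, block_rows)
    ]
    return blocks, grid


# A 5x4 array split with a regular block shape of (2, 3).
data = [[10 * i + j for j in range(4)] for i in range(5)]
blocks, grid = split_into_blocks(data, reg_shape=(2, 3))
block_shapes = [[(len(b), len(b[0])) for b in block_row] for block_row in blocks]
```

Here the grid has 3 x 2 blocks, and only the top-left blocks have the full (2, 3) shape.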
collect()

Collects the contents of this ds-array and returns the equivalent in-memory array that this ds-array represents. This method creates a synchronization point in the execution of the application.

Warning: This method may fail if the ds-array does not fit in memory.

Returns: array – The actual contents of the ds-array.
Return type: nd-array or spmatrix
max(axis=0)

Returns the maximum along the given axis.

Parameters: axis (int, optional (default=0))
Returns: max – Maximum along axis.
Return type: ds-array
mean(axis=0)

Returns the mean along the given axis.

Parameters: axis (int, optional (default=0))
Returns: mean – Mean along axis.
Return type: ds-array
min(axis=0)

Returns the minimum along the given axis.

Parameters: axis (int, optional (default=0))
Returns: min – Minimum along axis.
Return type: ds-array
shape

Total shape of the ds-array.

sum(axis=0)

Returns the sum along the given axis.

Parameters: axis (int, optional (default=0))
Returns: sum – Sum along axis.
Return type: ds-array
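The axis argument of max, mean, min and sum can be pictured with plain Python lists. The sketch below assumes the NumPy convention, in which axis=0 reduces over rows (one result per column) and axis=1 reduces over columns (one result per row); it uses no dislib calls and only illustrates the expected values.

```python
from statistics import mean

rows = [[1.0, 5.0, 3.0],
        [4.0, 2.0, 6.0]]

# axis=0: reduce over rows, giving one value per column.
columns = list(zip(*rows))
max_axis0 = [max(col) for col in columns]    # [4.0, 5.0, 6.0]
min_axis0 = [min(col) for col in columns]    # [1.0, 2.0, 3.0]
sum_axis0 = [sum(col) for col in columns]    # [5.0, 7.0, 9.0]
mean_axis0 = [mean(col) for col in columns]  # [2.5, 3.5, 4.5]

# axis=1: reduce over columns, giving one value per row.
max_axis1 = [max(row) for row in rows]       # [5.0, 6.0]
sum_axis1 = [sum(row) for row in rows]       # [9.0, 12.0]
```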
transpose(mode='rows')

Returns the transpose of the ds-array following the method indicated by mode. ‘all’ uses a single task to transpose all the blocks (slow with a high number of blocks). ‘rows’ and ‘columns’ transpose each block of rows or columns independently (i.e. one task per block of rows or columns).

Parameters: mode (string, optional (default=rows)) – Transposition method: ‘all’, ‘rows’ or ‘columns’.
Returns: dsarray – A transposed ds-array.
Return type: ds-array
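Transposing a blocked array amounts to swapping the positions of the blocks in the grid and transposing each block. The plain-Python sketch below shows this; the helper names are hypothetical and the code is not dislib's implementation. Because every output block depends on exactly one input block, the work can in principle be divided per row or per column of blocks, which is the idea behind the ‘rows’ and ‘columns’ modes.

```python
def transpose_block(block):
    """Transpose one block given as a list of rows."""
    return [list(col) for col in zip(*block)]


def transpose_blocked(blocks):
    """Transpose a grid of blocks: output block (j, i) is input block (i, j) transposed."""
    n_block_rows, n_block_cols = len(blocks), len(blocks[0])
    return [
        [transpose_block(blocks[i][j]) for i in range(n_block_rows)]
        for j in range(n_block_cols)
    ]


def assemble(blocks):
    """Join a grid of blocks back into a plain list of rows."""
    rows = []
    for block_row in blocks:
        for parts in zip(*block_row):
            rows.append([value for part in parts for value in part])
    return rows


# The 2x3 array [[1, 2, 3], [4, 5, 6]] stored as a 1x2 grid of blocks.
blocks = [[[[1, 2], [4, 5]], [[3], [6]]]]
transposed = assemble(transpose_blocked(blocks))  # [[1, 4], [2, 5], [3, 6]]
```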

Array creation routines

dislib.array(x, block_size)

Loads data into a Distributed Array.

Parameters:
  • x (spmatrix or array-like, shape=(n_samples, n_features)) – Array of samples.
  • block_size (tuple (int, int)) – Block size, as (number of samples, number of features).
Returns:

dsarray – A distributed representation of the data divided in blocks.

Return type:

ds-array

dislib.load_txt_file(path, block_size, delimiter=',')

Loads a text file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks of the array.
  • delimiter (string, optional (default=",")) – String that separates columns in the file.
Returns:

x – A distributed representation of the data divided in blocks.

Return type:

ds-array
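To show what the delimiter controls, the sketch below parses delimited text into rows of floats using only the standard library. The helper `parse_delimited` is hypothetical; it says nothing about how dislib reads or distributes the file, only about how columns are separated within each line.

```python
import csv
import io


def parse_delimited(text, delimiter=","):
    """Parse delimited text into a list of rows of floats, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[float(value) for value in row] for row in reader if row]


comma_rows = parse_delimited("1.0,2.0,3.0\n4.0,5.0,6.0\n")
space_rows = parse_delimited("7 8\n9 10\n", delimiter=" ")
```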

dislib.load_svmlight_file(path, block_size, n_features, store_sparse)

Loads an SVMLight file into a distributed array.

Parameters:
  • path (string) – File path.
  • block_size (tuple (int, int)) – Size of the blocks for the output ds-array.
  • n_features (int) – Number of features.
  • store_sparse (boolean) – Whether to use scipy.sparse data structures to store data. If False, numpy.array is used instead.
Returns:

x, y – Distributed representations (ds-arrays) of X and y.

Return type:

(ds-array, ds-array)
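An SVMLight file stores one sample per line as a label followed by sparse `index:value` pairs, which is why n_features must be given and why two arrays are returned. The sketch below parses such text into dense rows and labels; `parse_svmlight` is a hypothetical helper, it assumes one-based feature indices, and it does not handle `qid` fields.

```python
def parse_svmlight(text, n_features):
    """Parse SVMLight-formatted text into (samples, labels) as plain lists."""
    samples, labels = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        label, *pairs = line.split()
        row = [0.0] * n_features  # features not listed in the line stay at zero
        for pair in pairs:
            index, value = pair.split(":")
            row[int(index) - 1] = float(value)  # assumes one-based indices
        samples.append(row)
        labels.append(float(label))
    return samples, labels


sample = "1 1:0.5 3:2.0\n-1 2:1.5\n"
x, y = parse_svmlight(sample, n_features=4)
```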

dislib.random_array(shape, block_size, random_state=None)

Returns a distributed array of random floats in the half-open interval [0.0, 1.0). Values are drawn from the “continuous uniform” distribution over that interval.

Parameters:
  • shape (tuple of two ints) – Shape of the output ds-array.
  • block_size (tuple of two ints) – Size of the ds-array blocks.
  • random_state (int or RandomState, optional (default=None)) – Seed or numpy.random.RandomState instance to generate the random numbers.
Returns:

dsarray – Distributed array of random floats.

Return type:

ds-array
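The role of a seed and of the half-open interval can be illustrated with the standard library's generator. This sketch uses `random.Random` rather than numpy.random.RandomState, and the helper `random_rows` is hypothetical; it only shows that a fixed seed reproduces the same values and that every value lies in [0.0, 1.0).

```python
import random


def random_rows(shape, seed=None):
    """Return a list of rows of uniform random floats in [0.0, 1.0)."""
    rng = random.Random(seed)
    n_rows, n_cols = shape
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


a = random_rows((3, 2), seed=42)
b = random_rows((3, 2), seed=42)
same_values = a == b  # the same seed yields the same values
```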

Other functions

dislib.data.array.apply_along_axis(func, axis, x, *args, **kwargs)

Apply a function to slices along the given axis.

Execute func(a, *args, **kwargs) where func operates on nd-arrays and a is a slice of x along axis. The size of the slices is determined by the block shape of x.

func must meet the following conditions:

  • Take an nd-array as argument
  • Accept axis as a keyword argument
  • Return an array-like structure
Parameters:
  • func (function) – This function should accept nd-arrays and an axis argument. It is applied to slices of x along the specified axis.
  • axis (integer) – Axis along which x is sliced. Can be 0 or 1.
  • x (ds-array) – Input distributed array.
  • args (any) – Additional arguments to func.
  • kwargs (any) – Additional named arguments to func.
Returns:

out – The output array. The shape of out is identical to the shape of x, except along the axis dimension. The output ds-array is dense regardless of the type of the input array.

Return type:

ds-array

Examples

>>> import dislib as ds
>>> import numpy as np
>>> x = ds.random_array((100, 100), block_size=(25, 25))
>>> mean = ds.apply_along_axis(np.mean, 0, x)
>>> print(mean.collect())