Indexing and selecting data

In this section, we consider the following variable:

using coordinate_type = xf::xcoordinate<xf::fstring>;
using dimension_type = xf::xdimension<xf::fstring>;
using variable_type = xf::xvariable<double, coordinate_type>;
using data_type = variable_type::data_type;

data_type d = xt::eval(xt::random::rand({6, 3}, 15., 25.));
variable_type v(d,  // copied rather than moved: d is reused later to build v5
                {
                    {"group", xf::axis({"a", "b", "d", "e", "g", "h"})},
                    {"city",  xf::axis({"London", "Paris", "Brussels"})}
                });

Printing this variable in a Jupyter Notebook gives:

     London   Paris    Brussels
a    16.3548  23.3501  24.6887
b    17.2103  18.0817  20.4722
d    16.8838  24.9288  24.9646
e    24.6769  22.2584  24.8111
g    16.0986  22.9811  17.9703
h    15.0478  16.1246  21.3976

xframe provides flexible indexing methods for data selection, similar to the ones of xarray. These methods are summarized in the following table:

Dimension lookup  Index lookup  xvariable syntax
Positional        By integer    v(2, 1)
Positional        By label      v.locate("d", "Paris")
By name           By integer    v.iselect({{"group", 2}, {"city", 1}})
By name           By label      v.select({{"group", "d"}, {"city", "Paris"}})
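The four combinations in the table can be mimicked with a small standard-library sketch. It is only an illustration of the lookup logic, not xframe's implementation: labelled_table, position_of and make_example are hypothetical names, and the data is a simple counter (data[k] == k) rather than random values.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Position of `key` in `keys`; throws if the key is unknown.
std::size_t position_of(const std::vector<std::string>& keys, const std::string& key)
{
    auto it = std::find(keys.begin(), keys.end(), key);
    if (it == keys.end())
    {
        throw std::out_of_range("unknown key: " + key);
    }
    return static_cast<std::size_t>(it - keys.begin());
}

// Hypothetical 2-D labelled table (illustration only, not xframe's xvariable).
struct labelled_table
{
    std::vector<std::string> dims;                   // dimension names, in storage order
    std::array<std::vector<std::string>, 2> labels;  // labels[k]: labels of dimension k
    std::vector<double> data;                        // row-major values

    // Dimension by position, index by integer.
    double& at(std::size_t i, std::size_t j)
    {
        assert(i < labels[0].size() && j < labels[1].size());
        return data[i * labels[1].size() + j];
    }

    // Dimension by position, index by label.
    double& locate(const std::string& l0, const std::string& l1)
    {
        return at(position_of(labels[0], l0), position_of(labels[1], l1));
    }

    // Dimension by name, index by integer: the order of the pairs is irrelevant.
    double& iselect(const std::map<std::string, std::size_t>& index)
    {
        std::array<std::size_t, 2> pos = {{0, 0}};
        for (const auto& kv : index)
        {
            pos[position_of(dims, kv.first)] = kv.second;
        }
        return at(pos[0], pos[1]);
    }

    // Dimension by name, index by label.
    double& select(const std::map<std::string, std::string>& index)
    {
        std::array<std::size_t, 2> pos = {{0, 0}};
        for (const auto& kv : index)
        {
            std::size_t d = position_of(dims, kv.first);
            pos[d] = position_of(labels[d], kv.second);
        }
        return at(pos[0], pos[1]);
    }
};

// Builds a 6 x 3 table with the labels used in this section and data[k] == k.
labelled_table make_example()
{
    labelled_table t;
    t.dims = {"group", "city"};
    t.labels[0] = {"a", "b", "d", "e", "g", "h"};
    t.labels[1] = {"London", "Paris", "Brussels"};
    t.data.resize(18);
    for (std::size_t k = 0; k < t.data.size(); ++k)
    {
        t.data[k] = static_cast<double>(k);
    }
    return t;
}
```

With this sketch, t.at(2, 1), t.locate("d", "Paris"), t.iselect({{"city", 1}, {"group", 2}}) and t.select({{"city", "Paris"}, {"group", "d"}}) all return a reference to the same element.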

Positional indexing

The most basic way to access elements of an xvariable is to use operator(), like you would do with an xtensor:

std::cout << v(2, 1) << std::endl;

Contrary to Python, it is not possible in C++ for the same method to have different return types. Multi selection is therefore done with free functions that return views on the variable:

#include "xvariable_view.hpp"

auto view1 = xf::ilocate(v, xf::irange(0, 5, 2), xf::irange(1, 3));
std::cout << view1 << std::endl;

     Paris    Brussels
a    23.3501  24.6887
d    24.9288  24.9646
g    22.9811  17.9703

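A view is a proxy on the underlying variable, no copy is made. Therefore a change in the view is reflected in the underlying variable:

view1(0, 1) = 0.;
// view1(0, 1) refers to ("a", "Brussels"), that is, v(0, 2)
std::cout << v(0, 2) << std::endl;
// Outputs 0.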
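The write-through behaviour of a view can be pictured with the following sketch, which assumes a row-major array and uses only the standard library. strided_view is a hypothetical type introduced for illustration; it is not xframe's view class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D view: it stores a reference to the data plus the selected
// row and column positions, so no element is ever copied.
struct strided_view
{
    std::vector<double>& data;      // underlying row-major storage
    std::size_t ncols;              // number of columns of the underlying array
    std::vector<std::size_t> rows;  // selected row positions
    std::vector<std::size_t> cols;  // selected column positions

    // Element (i, j) of the view is element (rows[i], cols[j]) of the data.
    double& operator()(std::size_t i, std::size_t j)
    {
        assert(i < rows.size() && j < cols.size());
        return data[rows[i] * ncols + cols[j]];
    }
};
```

For a 6 x 3 array, a view built from rows {0, 2, 4} and columns {1, 2} maps its element (0, 1) to element (0, 2) of the underlying data, so assigning through the view modifies that element in place.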

In the code creating the view, irange returns a range slice from xtensor, so any multi selection in xtensor is also supported in xframe.

xvariable also supports label-based indexing, with the locate method for single point selection, and locate free function for multi selection:

std::cout << v.locate("d", "Paris") << std::endl;
auto view2 = xf::locate(v, xf::range("a", "h", 2), xf::range("Paris", "Brussels"));
std::cout << view2 << std::endl;
// Same output as previous code

Be aware of the difference between the range and irange parameters: range accepts labels and includes the last value, while irange accepts integral indices and excludes the last value.
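The difference between the two conventions can be made concrete with a small sketch that computes the positions selected by each kind of range. The helper names irange_positions and range_positions are hypothetical and the code uses only the standard library; it illustrates the semantics described above rather than xframe's slices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Integer range: positions first, first + step, ... strictly below `last`.
std::vector<std::size_t> irange_positions(std::size_t first, std::size_t last,
                                          std::size_t step = 1)
{
    assert(step > 0);
    std::vector<std::size_t> res;
    for (std::size_t i = first; i < last; i += step)
    {
        res.push_back(i);
    }
    return res;
}

// Label range: same stepping, but the position of `last` is included.
std::vector<std::size_t> range_positions(const std::vector<std::string>& labels,
                                         const std::string& first,
                                         const std::string& last,
                                         std::size_t step = 1)
{
    auto f = std::find(labels.begin(), labels.end(), first);
    auto l = std::find(labels.begin(), labels.end(), last);
    if (f == labels.end() || l == labels.end())
    {
        throw std::out_of_range("unknown label");
    }
    std::size_t fi = static_cast<std::size_t>(f - labels.begin());
    std::size_t li = static_cast<std::size_t>(l - labels.begin());
    return irange_positions(fi, li + 1, step);  // + 1 makes the end inclusive
}
```

With the group axis {"a", "b", "d", "e", "g", "h"}, both irange_positions(0, 5, 2) and range_positions(group, "a", "h", 2) select positions 0, 2 and 4, which is why the label-based and integer-based views above cover the same elements.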

xframe provides label-based slices similar to those of xtensor, so label-based multi selection is really similar to positional multi selection.

Indexing with dimension names

With dimension names, we do not have to rely on the dimension order: we can use the names explicitly to select data. As with positional indexing, xframe provides methods and free functions depending on the kind of selection you want to do:

// Dimension by name, index by position
std::cout << v.iselect({{"city", 1}, {"group", 2}}) << std::endl;
auto view3 = xf::iselect(v, {{"city", xf::irange(1, 3)}, {"group", xf::irange(0, 5, 2)}});

// Dimension by name, index by label
std::cout << v.select({{"city", "Paris"}, {"group", "d"}}) << std::endl;
auto view4 = xf::select(v, {{"city", xf::range("Paris", "Brussels")}, {"group", xf::range("a", "h", 2)}});
// view3 and view4 give the same output as view1 and view2

Contrary to xarray, xframe does not provide a selection operator accepting a map argument.

Keeping and dropping labels

The drop and keep functions return slices that can be used to create a view in which the listed labels are dropped or kept along the specified dimension:

auto view5 = xf::select(v, {{"city", xf::drop("London")}, {"group", xf::keep("a", "d", "g")}});
// view5 is equivalent to view4

This is different from xarray, where the xarray.DataArray.drop method returns a new object. To achieve the same with xframe, simply assign the view to a new xvariable object:

variable_type v2 = view5;
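Conceptually, a drop or keep slice resolves to a list of positions along one axis. The sketch below shows one way to compute them with the standard library; filter_positions, keep_positions and drop_positions are hypothetical helpers, and returning the positions in axis order is an assumption of this sketch, not a documented xframe guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Positions of the axis labels that are listed (keep_listed == true)
// or not listed (keep_listed == false), in axis order.
std::vector<std::size_t> filter_positions(const std::vector<std::string>& axis,
                                          const std::vector<std::string>& listed,
                                          bool keep_listed)
{
    std::vector<std::size_t> res;
    for (std::size_t i = 0; i < axis.size(); ++i)
    {
        bool is_listed = std::find(listed.begin(), listed.end(), axis[i]) != listed.end();
        if (is_listed == keep_listed)
        {
            res.push_back(i);
        }
    }
    return res;
}

// Positions of the listed labels only.
std::vector<std::size_t> keep_positions(const std::vector<std::string>& axis,
                                        const std::vector<std::string>& listed)
{
    return filter_positions(axis, listed, true);
}

// Positions of every label except the listed ones.
std::vector<std::size_t> drop_positions(const std::vector<std::string>& axis,
                                        const std::vector<std::string>& listed)
{
    return filter_positions(axis, listed, false);
}
```

Dropping "London" from the city axis leaves positions 1 and 2, and keeping "a", "d" and "g" on the group axis leaves positions 0, 2 and 4, the same elements as the range-based selections shown earlier.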

Masking views

Masking views select data points based on conditions expressed on labels. These conditions can be arbitrarily complicated boolean expressions. Contrary to other views, which are generally a subset of the original data, a masking view has the same shape as its underlying xvariable.

Masking views are created with the where function:

data_type d2 = {{ 1.,  2., 3. },
                { 4.,  5., 6. },
                { 7.,  8., 9. }};

auto v3 = variable_type(
    d2,
    {
        {"x", xf::axis(3)},
        {"y", xf::axis(3)},
    }
);

auto view6 = xf::where(
    v3,
    not_equal(v3.axis<int>("x"), 2) && v3.axis<int>("y") < 2
);
std::cout << view6 << std::endl;

In a Jupyter Notebook, this outputs the following:

     0       1       2
0    1       2       masked
1    4       5       masked
2    masked  masked  masked

When assigning to a masking view, masked values are not changed. Like other views, a masking view is a proxy on its underlying variable and no copy is made, so changing an unmasked value changes the corresponding value in the underlying variable.
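The assignment rule can be sketched as follows with the standard library. masked_view is a hypothetical type, and the mask is expressed directly on integer labels that coincide with positions, as in the 3 x 3 example above; this is an illustration of the semantics, not xframe's masking view.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical masking view over a row-major 2-D array.
struct masked_view
{
    std::vector<double>& data;  // underlying storage, never copied
    std::size_t nrows;
    std::size_t ncols;
    // Returns true when element (x, y) is visible, that is, not masked.
    std::function<bool(std::size_t, std::size_t)> visible;

    // Assigns `value` to every visible element; masked elements are untouched.
    void assign(double value)
    {
        for (std::size_t x = 0; x < nrows; ++x)
        {
            for (std::size_t y = 0; y < ncols; ++y)
            {
                if (visible(x, y))
                {
                    data[x * ncols + y] = value;
                }
            }
        }
    }
};
```

With the condition x != 2 && y < 2, assigning 0 through the view changes only the four elements in the top-left corner of the underlying data; the remaining five values keep their original content.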

Assigning values with indexing

Data selection in variables returns either references or views; therefore, contrary to xarray, it is possible to assign values to a subset of a variable with any of the indexing methods:

// The next four lines are equivalent, they change a single value of v:
v(2, 1) = 2.5;
v.locate("d", "Paris") = 2.5;
v.iselect({{"city", 1}, {"group", 2}}) = 2.5;
v.select({{"city", "Paris"}, {"group", "d"}}) = 2.5;

data_type d3 = {{0.,  1.},
                {2.,  3.},
                {4.,  5.}};

auto v4 = variable_type(
    d3,
    {
        {"group", xf::axis({"a", "d", "g"})},
        {"city", xf::axis({"Paris", "Brussels"})}
    }
);

// The next four lines are equivalent, they change a subset of v
xf::ilocate(v, xf::irange(0, 5, 2), xf::irange(1, 3)) = v4;
xf::locate(v, xf::range("a", "h", 2), xf::range("Paris", "Brussels")) = v4;
xf::iselect(v, {{"city", xf::irange(1, 3)}, {"group", xf::irange(0, 5, 2)}}) = v4;
xf::select(v, {{"city", xf::range("Paris", "Brussels")}, {"group", xf::range("a", "h", 2)}}) = v4;

Printing v after the assignment gives:

     London   Paris    Brussels
a    16.3548  0        1
b    17.2103  18.0817  20.4722
d    16.8838  2        3
e    24.6769  22.2584  24.8111
g    16.0986  4        5
h    15.0478  16.1246  21.3976

Reindexing views

A reindexing view gives a variable a new set of coordinates for the corresponding dimensions. Like other views, no copy is involved. Asking for values at new labels that are not found in the original set of coordinates returns missing values. In the next example, we reindex the city dimension:

auto view7 = xf::reindex(v, {{"city", xf::axis({"London", "New York", "Brussels"})}});

     London   New York  Brussels
a    16.3548  N/A       24.6887
b    17.2103  N/A       20.4722
d    16.8838  N/A       24.9646
e    24.6769  N/A       24.8111
g    16.0986  N/A       17.9703
h    15.0478  N/A       21.3976

Like xarray, xframe provides the useful reindex_like shortcut, which reindexes a variable given the set of coordinates of another variable:

auto v5 = variable_type(
    d,
    {
        {"group", xf::axis({"a", "b", "d", "e", "g", "h"})},
        {"city", xf::axis({"London", "New York", "Brussels"})}
    }
);

auto view8 = xf::reindex_like(v, v5);
// view8 is equivalent to view7

A reindexing view is read-only: it is not possible to change its values with indexing. This allows memory optimizations: the view does not have to store the missing values, it can return a proxy to a statically allocated missing value.
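The idea of a shared, statically allocated missing value can be sketched for a single row of the table. reindexed_row and missing_value are hypothetical names, and NaN is used here as a stand-in for the missing value; how xframe actually represents missing values is not covered by this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Single statically allocated placeholder shared by all missing entries.
const double& missing_value()
{
    static const double na = std::numeric_limits<double>::quiet_NaN();
    return na;
}

// Hypothetical read-only view of one row under a new set of labels.
struct reindexed_row
{
    const std::vector<std::string>& old_labels;  // labels of the original axis
    const std::vector<double>& values;           // one value per original label
    std::vector<std::string> new_labels;         // labels requested by the reindexing

    // Returns the original value when the new label exists in the old axis,
    // otherwise a reference to the shared missing value. Nothing is copied.
    const double& operator()(std::size_t i) const
    {
        assert(i < new_labels.size());
        auto it = std::find(old_labels.begin(), old_labels.end(), new_labels[i]);
        if (it == old_labels.end())
        {
            return missing_value();
        }
        return values[static_cast<std::size_t>(it - old_labels.begin())];
    }
};
```

Because the accessor only returns const references, the view needs no storage for the entries of the "New York" column: every such entry refers to the same static object.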

The align function reindexes several variables at once, with more flexible options:

auto t1 = xf::align<xf::join::inner>(v, v5);
std::cout << std::get<0>(t1) << std::endl;
std::cout << std::get<1>(t1) << std::endl;

The last two lines print the same output:

     London   Brussels
a    16.3548  24.6887
b    17.2103  20.4722
d    16.8838  24.9646
e    24.6769  24.8111
g    16.0986  17.9703
h    15.0478  21.3976

In the following, the variables are aligned with respect to the union of the coordinates instead of their intersection:

auto t2 = xf::align<xf::join::outer>(v, v5);
std::cout << std::get<0>(t2) << std::endl;
std::cout << std::get<1>(t2) << std::endl;

The first output is:

     London   Paris    Brussels  New York
a    16.3548  23.3501  24.6887   N/A
b    17.2103  18.0817  20.4722   N/A
d    16.8838  24.9288  24.9646   N/A
e    24.6769  22.2584  24.8111   N/A
g    16.0986  22.9811  17.9703   N/A
h    15.0478  16.1246  21.3976   N/A

The second output has N/A in the Paris column instead.
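The two join policies boil down to an intersection or a union of the axis labels. The sketch below computes both with the standard library; inner_join and outer_join are hypothetical helpers, and the label ordering (first axis first, then the labels only found in the second axis) is chosen to match the outputs shown above rather than taken from xframe's specification.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// True when `label` belongs to `labels`.
bool contains(const std::vector<std::string>& labels, const std::string& label)
{
    return std::find(labels.begin(), labels.end(), label) != labels.end();
}

// Inner join: labels present in both axes, in the order of the first axis.
std::vector<std::string> inner_join(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b)
{
    std::vector<std::string> res;
    for (const auto& label : a)
    {
        if (contains(b, label))
        {
            res.push_back(label);
        }
    }
    return res;
}

// Outer join: all labels of the first axis, then the labels found only in the second.
std::vector<std::string> outer_join(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b)
{
    std::vector<std::string> res = a;
    for (const auto& label : b)
    {
        if (!contains(a, label))
        {
            res.push_back(label);
        }
    }
    return res;
}
```

For the city axes of v and v5, the inner join yields London and Brussels, while the outer join yields London, Paris, Brussels and New York; each aligned variable then reports N/A for the labels it did not originally contain.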