# Raku Data::Reshapers [![Build Status](https://app.travis-ci.com/antononcube/Raku-Data-Reshapers.svg?branch=main)](https://app.travis-ci.com/github/antononcube/Raku-Data-Reshapers) [![License: Artistic-2.0](https://img.shields.io/badge/License-Artistic%202.0-0298c3.svg)](https://opensource.org/licenses/Artistic-2.0) This Raku package has data reshaping functions for different data structures that are coercible to full arrays. The supported data structures are: - Positional-of-hashes - Positional-of-arrays The five data reshaping provided by the package over those data structures are: - Cross tabulation, `cross-tabulate` - Long format conversion, `to-long-format` - Wide format conversion, `to-wide-format` - Join across (aka `SQL JOIN`), `join-across` - Transpose, `transpose` The first four operations are fundamental in data wrangling and data analysis; see [AA1, Wk1, Wk2, AAv1-AAv2]. (Transposing of tabular data is, of course, also fundamental, but it also can be seen as a basic functional programming operation.) ------ ## Usage examples ### Cross tabulation Making contingency tables -- or cross tabulation -- is a fundamental statistics and data analysis operation, [Wk1, AA1]. Here is an example using the [Titanic](https://en.wikipedia.org/wiki/Titanic) dataset (that is provided by this package through the function `get-titanic-dataset`): ```perl6 use Data::Reshapers; my @tbl = get-titanic-dataset(); my $res = cross-tabulate( @tbl, 'passengerSex', 'passengerClass'); say $res; # {female => {1st => 144, 2nd => 106, 3rd => 216}, male => {1st => 179, 2nd => 171, 3rd => 493}} say to-pretty-table($res); # +--------+-----+-----+-----+ # | | 1st | 2nd | 3rd | # +--------+-----+-----+-----+ # | female | 144 | 106 | 216 | # | male | 179 | 171 | 493 | # +--------+-----+-----+-----+ ``` ### Long format Conversion to long format allows column names to be treated as data. (More precisely, when converting to long format specified column names of a tabular dataset become values in a dedicated column, e.g. "Variable" in the long format.) ```perl6 my @tbl1 = @tbl.roll(3); .say for @tbl1; .say for to-long-format( @tbl1 ); my @lfRes1 = to-long-format( @tbl1, 'id', [], variablesTo => "VAR", valuesTo => "VAL2" ); .say for @lfRes1; ``` ### Wide format Here we transform the long format result `@lfRes1` above into wide format -- the result has the same records as the `@tbl1`: ```perl6 ‌‌say to-pretty-table( to-wide-format( @lfRes1, 'id', 'VAR', 'VAL2' ) ); # +-------------------+----------------+--------------+--------------+-----+ # | passengerSurvival | passengerClass | passengerAge | passengerSex | id | # +-------------------+----------------+--------------+--------------+-----+ # | died | 1st | 20 | male | 308 | # | died | 2nd | 40 | female | 412 | # | survived | 2nd | 50 | female | 441 | # | died | 3rd | 20 | male | 741 | # | died | 3rd | -1 | male | 932 | # +-------------------+----------------+--------------+--------------+-----+ ``` ### Transpose Using cross tabulation result above: ```perl6 my $tres = transpose( $res ); say to-pretty-table($res, title => "Original"); # +--------------------------+ # | Original | # +--------+------+----------+ # | | died | survived | # +--------+------+----------+ # | female | 127 | 339 | # | male | 682 | 161 | # +--------+------+----------+ say to-pretty-table($tres, title => "Transposed"); # +--------------------------+ # | Transposed | # +----------+--------+------+ # | | female | male | # +----------+--------+------+ # | died | 127 | 682 | # | survived | 339 | 161 | # +----------+--------+------+ ``` ------ ## TODO 1. [X] Simpler more convenient interface. - ~~Currently, a user have to specify four different namespaces in order to be able to use all package functions.~~ 2. [ ] More extensive long format tests. 3. [ ] More extensive wide format tests. 4. [ ] Implement verifications for - [X] Positional-of-hashes - [X] Positional-of-arrays - [X] Positional-of-key-to-array-pairs - [ ] Positional-of-hashes, each record of which has: - [ ] Same keys - [ ] Same type of values of corresponding keys - [ ] Positional-of-arrays, each record of which has: - [ ] Same length - [ ] Same type of values of corresponding elements 5. [X] Implement "nice tabular visualization" using [Pretty::Table](https://gitlab.com/uzluisf/raku-pretty-table) and/or [Text::Table::Simple](https://github.com/ugexe/Perl6-Text--Table--Simple). 6. [X] Document examples using pretty tables. 7. [X] Implement transposing operation for: - [X] hash of hashes - [X] hash of arrays - [X] array of hashes - [X] array of arrays - [X] array of key-to-array pairs 8. [X] Implement to-pretty-table for: - [X] hash of hashes - [X] hash of arrays - [X] array of hashes - [X] array of arrays - [X] array of key-to-array pairs 9. [ ] Implemented join-across: - [X] inner, left, right, outer - [X] single key-to-key pair - [ ] multiple key-to-key pairs - [ ] optional fill-in of missing values - [ ] handling collisions 10. [ ] Implement to long format conversion for: - [ ] hash of hashes - [ ] hash of arrays 11. [ ] Speed/performance profiling. - [ ] Come up with profiling tests - [ ] Comparison with R - [ ] Comparison with Python ------ ## References ### Articles [AA1] Anton Antonov, ["Contingency tables creation examples"](https://mathematicaforprediction.wordpress.com/2016/10/04/contingency-tables-creation-examples/), (2016), [MathematicaForPrediction at WordPress](https://mathematicaforprediction.wordpress.com). [Wk1] Wikipedia entry, [Contingency table](https://en.wikipedia.org/wiki/Contingency_table). [Wk2] Wikipedia entry, [Wide and narrow data](https://en.wikipedia.org/wiki/Wide_and_narrow_data). ### Functions, repositories [AAf1] Anton Antonov, [CrossTabulate](https://resources.wolframcloud.com/FunctionRepository/resources/CrossTabulate), (2019), [Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository). [AAf2] Anton Antonov, [LongFormDataset](https://resources.wolframcloud.com/FunctionRepository/resources/LongFormDataset), (2020), [Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository). [AAf3] Anton Antonov, [WideFormDataset](https://resources.wolframcloud.com/FunctionRepository/resources/WideFormDataset), (2021), [Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository). [AAf4] Anton Antonov, [RecordsSummary](https://resources.wolframcloud.com/FunctionRepository/resources/RecordsSummary), (2019), [Wolfram Function Repository](https://resources.wolframcloud.com/FunctionRepository). ### Videos [AAv1] Anton Antonov, ["Multi-language Data-Wrangling Conversational Agent"](https://www.youtube.com/watch?v=pQk5jwoMSxs), (2020), [YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ). (Wolfram Technology Conference 2020 presentation.) [AAv2] Anton Antonov, ["Data Transformation Workflows with Anton Antonov, Session #1"](https://www.youtube.com/watch?v=iXrXMQdXOsM), (2020), [YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ). [AAv3] Anton Antonov, ["Data Transformation Workflows with Anton Antonov, Session #2"](https://www.youtube.com/watch?v=DWGgFsaEOsU), (2020), [YouTube channel of Wolfram Research, Inc.](https://www.youtube.com/channel/UCJekgf6k62CQHdENWf2NgAQ).