{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "α-ICON: an application to UK PSC data", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "Cifd3_bJhpUo" }, "source": [ "## Introduction\n", "This notebook demonstrates α-ICON (Indirect Control in Onion-like networks), an algorithm for identifying ultimate controlling entities in corporate networks. It is a self-contained application that accompanies [our paper](https://arxiv.org/abs/2109.07181) and [repository](https://github.com/eusporg/alphaicon).\n", "\n", "We work with data from the UK's People with Significant Control (PSC) register, which covers 4.2 million companies and 4 million of their holders as of August 2021.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "W2vhy5c6jmBi" }, "source": [ "## Data loading & import\n", "All pre-processing of the [PSC snapshot](http://download.companieshouse.gov.uk/en_pscdata.html) is done in the [repository](https://github.com/eusporg/alphaicon) (`code/data_preparation/uk`). The resulting data are stored in a [public folder](https://drive.google.com/drive/folders/10Tq-b4BVsG3gmq2JVa026Nilzj8eojNB) on Google Drive.\n", "\n", "We use two files:\n", "\n", "* `output/uk/uk_organisations_participants_2021_long_2aug21.zip`: an archived CSV that maps company IDs to participant IDs from the PSC data, together with the respective equity shares.\n", "* `output/uk/npi_dpi/10000iter/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip`: an archived CSV that maps company IDs to participant IDs from the PSC data, together with their Direct Power Indices ([Mizuno, Doi, and Kurizaki (2020)](https://doi.org/10.1371/journal.pone.0237862)). 
\n", "\n", "\n", " \n", "\n" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "FhocHBOyvbsu", "outputId": "c6f1b1a0-ec59-49a0-abf0-1a3e08a10aab" }, "source": [ "# Download and unarchive the data files from the Google Drive public link\n", "!pip install gdown\n", "\n", "!gdown https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7 \n", "!unzip uk_organisations_participants_2021_long_2aug21.zip\n", " \n", "!gdown https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG\n", "!unzip uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n", " \n", "import pandas as pd\n", "import scipy\n", "from os.path import join\n", "import matplotlib.pyplot as plt\n", "import scipy.sparse as sp\n", "from scipy.sparse.linalg import eigs\n", "import numpy as np\n", "from itertools import combinations\n", "import tqdm\n", "import networkx as nx\n", "import gc" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: gdown in /usr/local/lib/python3.7/dist-packages (3.6.4)\n", "Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from gdown) (1.15.0)\n", "Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from gdown) (4.62.0)\n", "Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from gdown) (2.23.0)\n", "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (2.10)\n", "Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (3.0.4)\n", "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (1.24.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (2021.5.30)\n", "Downloading...\n", "From: 
https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7\n", "To: /content/uk_organisations_participants_2021_long_2aug21.zip\n", "74.3MB [00:00, 179MB/s]\n", "Archive: uk_organisations_participants_2021_long_2aug21.zip\n", " inflating: uk_organisations_participants_2021_long_2aug21.csv \n", "Downloading...\n", "From: https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG\n", "To: /content/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n", "74.7MB [00:00, 121MB/s] \n", "Archive: uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n", " inflating: uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv \n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "7GgWACrsmkM3" }, "source": [ "### Import without downloading the data" ] }, { "cell_type": "code", "metadata": { "id": "6dTPRzNvMLzP" }, "source": [ "import pandas as pd\n", "import scipy\n", "from os.path import join\n", "import matplotlib.pyplot as plt\n", "import scipy.sparse as sp\n", "from scipy.sparse.linalg import eigs\n", "import numpy as np\n", "from itertools import combinations\n", "import tqdm\n", "import networkx as nx\n", "import gc" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "879Esu8aV7VY" }, "source": [ "## Data processing" ] }, { "cell_type": "code", "metadata": { "id": "ZiKLqZKfvOSX" }, "source": [ "DPI = 1 # use the DPI file (set to 0 to use the raw equity shares instead)\n", "SH_MODE = 1 # only super-holder (SH) nodes are considered final holders\n", "LEVELS = 20 # this parameter should equal or exceed the number of \"onion layers\" in the data. 
20 is a bit of overkill for safety" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "pzZlT_OlUNcD" }, "source": [ "if DPI:\n", "    name = 'uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv'\n", "    data = pd.read_csv(name, engine='python', dtype={'participant': str, 'entity': str})\n", "else:\n", "    name = 'uk_organisations_participants_2021_long_2aug21.csv'\n", "    data = pd.read_csv(name, engine='python', dtype={'participant_id': str, 'company_number': str})" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hGJzvLfvmtGZ" }, "source": [ "### Primary processing" ] }, { "cell_type": "code", "metadata": { "id": "n_H7C2N_QZ-4", "colab": { "base_uri": "https://localhost:8080/", "height": 203 }, "outputId": "4fb79d1f-2ea3-4673-ba4f-758c7ca98dca" }, "source": [ "data = data.dropna()\n", "\n", "if DPI:\n", "    data = data.astype({'dpi': float, 'participant': str, 'entity': str})\n", "    data = data.rename(columns={\"entity\": \"organisation_inn\",\n", "                                'participant': 'participant_id',\n", "                                'dpi': 'equity_share'})\n", "\n", "    # columns are renamed for consistency with the code below\n", "    data = data[['organisation_inn', 'participant_id', 'equity_share']]\n", "\n", "else:\n", "    data = data.astype({'equity_share': float, 'participant_id': str, 'company_number': str})\n", "    data = data.rename(columns={\"company_number\": \"organisation_inn\"})\n", "\n", "# keep positive weights only and drop self-loops\n", "data = data[data['equity_share'] > 0]\n", "data = data[data.participant_id != data.organisation_inn]\n", "\n", "# normalise each organisation's in-edge weights so that they sum to 1\n", "# (per-organisation totals are computed on the numeric column only)\n", "data['equity_share'] = data['equity_share'] / data.groupby('organisation_inn')['equity_share'].transform('sum')\n", "\n", "# flag super-holders (SH: participants that are never held by anyone)\n", "# and super-targets (ST: organisations that never act as a participant)\n", "data['super_holder']=~pd.Series(data.participant_id).isin(data.organisation_inn)\n", 
"data['super_target']=~pd.Series(data.organisation_inn).isin(data.participant_id)\n", "\n", "\n", "data.head()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "|  | organisation_inn | participant_id | equity_share | super_holder | super_target |\n",
"|---|---|---|---|---|---|\n",
"| 0 | 00000133 | THE PENINSULAR AND ORIENTAL STEAM NAVIGATION C... | 1.0000 | True | True |\n",
"| 1 | 00000140 | NICHOLLS & CLARKE LIMITED$NA$NA | 1.0000 | True | True |\n",
"| 2 | 00000295 | COLIN$NA$WELLS$1967$2 | 0.4997 | True | True |\n",
"| 3 | 00000295 | MOIRA$RUTH$SLEIGHT$1959$2 | 0.5003 | True | True |\n",
"| 4 | 00000371 | DAVID$JOHN$ROWLAND$1945$6 | 1.0000 | True | True |\n", "
\n", "|  | organisation_inn | participant_id | equity_share | super_holder | super_target | SH_level | ST_level |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 7 | 00000529 | 05995030 | 1.0000 | False | False | 3.0 | NaN |\n",
"| 10 | 00000866 | 05253545 | 1.0000 | False | False | NaN | 2.0 |\n",
"| 11 | 00000950 | 03526047 | 1.0000 | False | False | 2.0 | 2.0 |\n",
"| 13 | 00001160 | 06452679 | 1.0000 | False | False | 4.0 | 2.0 |\n",
"| 19 | 00001419 | 05282342 | 1.0000 | False | False | 3.0 | NaN |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 5256244 | SO307293 | SC299736 | 0.4969 | False | False | 4.0 | 2.0 |\n",
"| 5256255 | SO307300 | 03942880 | 0.5069 | False | False | 2.0 | 2.0 |\n",
"| 5256256 | SO307300 | 11677818 | 0.4931 | False | False | 2.0 | 2.0 |\n",
"| 5256265 | SO307305 | 09431213 | 1.0000 | False | False | 3.0 | 2.0 |\n",
"| 5256338 | ZC000195 | 01612178 | 1.0000 | False | False | NaN | 3.0 |\n",
"\n",
"45295 rows × 7 columns\n", "