alphaicon/code/alphaicon_paper/1_compute_alphaicon.ipynb
Dmitriy Skougarevskiy 3b419bab6e Initial commit
2021-09-16 10:06:49 +03:00


{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "α-ICON: an application to UK PSC data",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Cifd3_bJhpUo"
},
"source": [
"## Introduction\n",
"This notebook demonstrates α-ICON (Indirect Control in Onion-like networks) --- an algorithm to identify ultimate controlling entities in corporate networks. We provide a self-contained application as a companion to [our paper](https://arxiv.org/abs/2109.07181) and [repository](https://github.com/eusporg/alphaicon).\n",
"\n",
"We will be working with data from the UK's People with Significant Control (PSC) register, covering 4.2 million companies and 4 million of their holders as of August 2021.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W2vhy5c6jmBi"
},
"source": [
"## Data loading & import\n",
"All data pre-processing of the [PSC snapshot](http://download.companieshouse.gov.uk/en_pscdata.html) is done in the [repository](https://github.com/eusporg/alphaicon) (`code/data_preparation/uk`). The resulting data is stored in a [public folder](https://drive.google.com/drive/folders/10Tq-b4BVsG3gmq2JVa026Nilzj8eojNB) on Google Drive.\n",
"\n",
"We will be working with two files:\n",
"\n",
"* `output/uk/uk_organisations_participants_2021_long_2aug21.zip` --- an archived CSV with company ID-participant ID mapping from the PSC data and the respective equity shares.\n",
"* `output/uk/npi_dpi/10000iter/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip` --- an archived CSV with company ID-participant ID mapping from the PSC data and their Direct Power Indices ([Mizuno, Doi, and Kurizaki (2020)](https://doi.org/10.1371/journal.pone.0237862)).\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FhocHBOyvbsu",
"outputId": "c6f1b1a0-ec59-49a0-abf0-1a3e08a10aab"
},
"source": [
"# Download and unarchive the data files from the Google Drive public link\n",
"!pip install gdown\n",
"\n",
"!gdown https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7 \n",
"!unzip uk_organisations_participants_2021_long_2aug21.zip\n",
" \n",
"!gdown https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG\n",
"!unzip uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n",
" \n",
"import pandas as pd\n",
"import scipy\n",
"from os.path import join\n",
"import matplotlib.pyplot as plt\n",
"import scipy.sparse as sp\n",
"from scipy.sparse.linalg import eigs\n",
"import numpy as np\n",
"from itertools import combinations\n",
"import tqdm\n",
"import networkx as nx\n",
"import gc"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: gdown in /usr/local/lib/python3.7/dist-packages (3.6.4)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from gdown) (1.15.0)\n",
"Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from gdown) (4.62.0)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from gdown) (2.23.0)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (3.0.4)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (1.24.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->gdown) (2021.5.30)\n",
"Downloading...\n",
"From: https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7\n",
"To: /content/uk_organisations_participants_2021_long_2aug21.zip\n",
"74.3MB [00:00, 179MB/s]\n",
"Archive: uk_organisations_participants_2021_long_2aug21.zip\n",
" inflating: uk_organisations_participants_2021_long_2aug21.csv \n",
"Downloading...\n",
"From: https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG\n",
"To: /content/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n",
"74.7MB [00:00, 121MB/s] \n",
"Archive: uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip\n",
" inflating: uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv \n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7GgWACrsmkM3"
},
"source": [
"### Import without downloading the data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "6dTPRzNvMLzP"
},
"source": [
"import pandas as pd\n",
"import scipy\n",
"from os.path import join\n",
"import matplotlib.pyplot as plt\n",
"import scipy.sparse as sp\n",
"from scipy.sparse.linalg import eigs\n",
"import numpy as np\n",
"from itertools import combinations\n",
"import tqdm\n",
"import networkx as nx\n",
"import gc"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "879Esu8aV7VY"
},
"source": [
"# Data processing"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZiKLqZKfvOSX"
},
"source": [
"DPI = 1 # use the Direct Power Index (DPI) edge weights instead of raw equity shares\n",
"SH_MODE = 1 # only super-holder (SH) nodes are considered final holders\n",
"LEVELS = 20 # must equal or exceed the number of \"onion layers\" in the data; 20 is a safe upper bound"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pzZlT_OlUNcD"
},
"source": [
"if DPI:\n",
" name = 'uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv'\n",
" data = pd.read_csv(name, engine='python', dtype={'participant': str, 'entity': str})\n",
"else:\n",
" name = 'uk_organisations_participants_2021_long_2aug21.csv'\n",
" data = pd.read_csv(name, engine='python', dtype={'participant_id': str, 'company_number': str})"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "hGJzvLfvmtGZ"
},
"source": [
"### Primary processing"
]
},
{
"cell_type": "code",
"metadata": {
"id": "n_H7C2N_QZ-4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 203
},
"outputId": "4fb79d1f-2ea3-4673-ba4f-758c7ca98dca"
},
"source": [
"data = data.dropna()\n",
"\n",
"if DPI:\n",
" data = data.astype({'dpi': float, 'participant':str, 'entity':str})\n",
" data = data.rename(columns={\"entity\": \"organisation_inn\",\n",
" 'participant': 'participant_id',\n",
" 'dpi': 'equity_share'})\n",
" \n",
" # columns are renamed for consistency with further code\n",
" data=data[['organisation_inn','participant_id','equity_share']]\n",
"\n",
"else:\n",
" data = data.astype({'equity_share': float, 'participant_id':str, 'company_number':str})\n",
" data = data.rename(columns={\"company_number\": \"organisation_inn\"})\n",
"\n",
"data = data[data['equity_share'] > 0]\n",
"data = data[data.participant_id != data.organisation_inn]\n",
"\n",
"# normalize in-edge weights so that each company's incoming shares sum to 1\n",
"gdata = data.groupby('organisation_inn').sum().reset_index()\n",
"dict_companies = dict(gdata.values)\n",
"data['equity_share'] = data['equity_share']/np.array([dict_companies[num] for num in data['organisation_inn']])\n",
"\n",
"# flag super-holders (SH, nodes that are never held) and super-targets (ST, nodes that hold nothing)\n",
"data['super_holder'] = ~pd.Series(data.participant_id).isin(data.organisation_inn)\n",
"data['super_target'] = ~pd.Series(data.organisation_inn).isin(data.participant_id)\n",
"\n",
"\n",
"data.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>organisation_inn</th>\n",
" <th>participant_id</th>\n",
" <th>equity_share</th>\n",
" <th>super_holder</th>\n",
" <th>super_target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>00000133</td>\n",
" <td>THE PENINSULAR AND ORIENTAL STEAM NAVIGATION C...</td>\n",
" <td>1.0000</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>00000140</td>\n",
" <td>NICHOLLS &amp; CLARKE LIMITED$NA$NA</td>\n",
" <td>1.0000</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>00000295</td>\n",
" <td>COLIN$NA$WELLS$1967$2</td>\n",
" <td>0.4997</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>00000295</td>\n",
" <td>MOIRA$RUTH$SLEIGHT$1959$2</td>\n",
" <td>0.5003</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>00000371</td>\n",
" <td>DAVID$JOHN$ROWLAND$1945$6</td>\n",
" <td>1.0000</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" organisation_inn ... super_target\n",
"0 00000133 ... True\n",
"1 00000140 ... True\n",
"2 00000295 ... True\n",
"3 00000295 ... True\n",
"4 00000371 ... True\n",
"\n",
"[5 rows x 5 columns]"
]
},
"metadata": {},
"execution_count": 5
}
]
},
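The normalization and SH/ST flagging above can be illustrated on a toy edge list (hypothetical company and holder IDs). This sketch uses `groupby(...).transform('sum')`, which is equivalent to the dictionary-lookup approach in the cell:

```python
import pandas as pd

# hypothetical ownership edges: holder -> company with raw equity shares
data = pd.DataFrame({
    'organisation_inn': ['A', 'A', 'B'],
    'participant_id':   ['X', 'Y', 'A'],
    'equity_share':     [0.5, 0.3, 1.0],
})

# normalize in-edge weights: shares held in each company sum to 1
totals = data.groupby('organisation_inn')['equity_share'].transform('sum')
data['equity_share'] = data['equity_share'] / totals

# super-holders never appear as targets; super-targets never appear as holders
data['super_holder'] = ~data['participant_id'].isin(data['organisation_inn'])
data['super_target'] = ~data['organisation_inn'].isin(data['participant_id'])

print(data)  # X and Y now hold 0.625 and 0.375 of A; A is neither SH nor ST
```

Here `X` and `Y` are super-holders, `B` is a super-target, and `A` sits in between, exactly the three roles the pipeline distinguishes.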
{
"cell_type": "markdown",
"metadata": {
"id": "XuR1-NTHm72S"
},
"source": [
"### Edge analysis"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZlCoK5TUW75I",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "1a8e1a01-ddae-435d-ae0d-15a81e8ea11a"
},
"source": [
"e = len(data)\n",
"e1 = len(data[(data['super_holder'] == True) & (data['super_target'] == True)])\n",
"e2 = len(data[(data['super_holder'] == True) & (data['super_target'] == False)])\n",
"e3 = len(data[(data['super_holder'] == False) & (data['super_target'] == True)])\n",
"e4 = len(data[(data['super_holder'] == False) & (data['super_target'] == False)])\n",
"print('total edges:', e)\n",
"print('SH -> ST edges', e1, '({0:.2f}%)'.format(e1/e*100))\n",
"print('SH -> ~ST edges', e2, '({0:.2f}%)'.format(e2/e*100))\n",
"print('~SH -> ST edges', e3, '({0:.2f}%)'.format(e3/e*100))\n",
"print('~SH -> ~ST edges', e4, '({0:.2f}%)'.format(e4/e*100))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"total edges: 5096560\n",
"SH -> ST edges 4642420 (91.09%)\n",
"SH -> ~ST edges 151849 (2.98%)\n",
"~SH -> ST edges 256996 (5.04%)\n",
"~SH -> ~ST edges 45295 (0.89%)\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w9wQh6jRWHOM"
},
"source": [
"# Network analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5tiXjWcWnAuv"
},
"source": [
"### Core creation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "gvnRCxRAdppC",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4acd7f4c-fb1d-473d-8d04-c1d3c087e394"
},
"source": [
"print('pruning SH and ST nodes of the external layer...')\n",
"rdata = data.loc[(data['super_holder'] == False) & (data['super_target'] == False)]\n",
"edges_left = len(rdata)\n",
"print('layer 1:', edges_left, 'edges left')\n",
"\n",
"crdata = rdata.copy()\n",
"\n",
"for i in range(1, LEVELS):\n",
" is_sh = ~pd.Series(rdata.participant_id).isin(rdata.organisation_inn)\n",
" current_unique_sh = pd.Series(rdata.loc[is_sh == True].participant_id.value_counts().keys())\n",
" print('current SH', len(current_unique_sh))\n",
" curr_sh_prop_name = 'is_level_' + str(i) + '_SH'\n",
" crdata[curr_sh_prop_name] = pd.Series(crdata.participant_id).isin(current_unique_sh)\n",
"\n",
" is_st = ~pd.Series(rdata.organisation_inn).isin(rdata.participant_id)\n",
" current_unique_st = pd.Series(rdata.loc[is_st == True].organisation_inn.value_counts().keys())\n",
" print('current ST', len(current_unique_st))\n",
" curr_st_prop_name = 'is_level_' + str(i) + '_ST'\n",
" crdata[curr_st_prop_name] = pd.Series(crdata.organisation_inn).isin(current_unique_st)\n",
"\n",
" rdata = rdata.loc[(is_sh == False) & (is_st == False)]\n",
" print('layer {}: {} edges left'.format(i+1, len(rdata)))\n",
"\n",
"data_core = rdata.copy()\n",
"core_holders = rdata.participant_id.value_counts()\n",
"print('unique core holders:', len(core_holders))\n",
"core_targets = rdata.organisation_inn.value_counts()\n",
"print('unique core targets:', len(core_targets))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"pruning SH and ST nodes of the external layer...\n",
"layer 1: 45295 edges left\n",
"current SH 17777\n",
"current ST 28628\n",
"layer 2: 9102 edges left\n",
"current SH 2467\n",
"current ST 3981\n",
"layer 3: 3489 edges left\n",
"current SH 874\n",
"current ST 1220\n",
"layer 4: 1626 edges left\n",
"current SH 348\n",
"current ST 416\n",
"layer 5: 964 edges left\n",
"current SH 149\n",
"current ST 167\n",
"layer 6: 685 edges left\n",
"current SH 65\n",
"current ST 67\n",
"layer 7: 578 edges left\n",
"current SH 29\n",
"current ST 27\n",
"layer 8: 527 edges left\n",
"current SH 14\n",
"current ST 14\n",
"layer 9: 505 edges left\n",
"current SH 3\n",
"current ST 3\n",
"layer 10: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 11: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 12: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 13: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 14: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 15: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 16: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 17: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 18: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 19: 502 edges left\n",
"current SH 0\n",
"current ST 0\n",
"layer 20: 502 edges left\n",
"unique core holders: 498\n",
"unique core targets: 498\n"
]
}
]
},
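The peeling loop above can be distilled into a minimal sketch on a hypothetical edge list: repeatedly drop current-layer SH and ST edges until a fixed point is reached; whatever survives is the cyclic core.

```python
import pandas as pd

# hypothetical holder -> target edges: X feeds a 2-cycle core {A, B}; C is a leaf target
rdata = pd.DataFrame({
    'participant_id':   ['X', 'A', 'B', 'B'],
    'organisation_inn': ['A', 'B', 'A', 'C'],
})

layers = 0
while not rdata.empty:
    # current-layer super-holders: never appear as targets among remaining edges
    is_sh = ~rdata['participant_id'].isin(rdata['organisation_inn'])
    # current-layer super-targets: never appear as holders among remaining edges
    is_st = ~rdata['organisation_inn'].isin(rdata['participant_id'])
    if not (is_sh.any() or is_st.any()):
        break  # fixed point: the remaining edges form the core
    rdata = rdata.loc[~is_sh & ~is_st]
    layers += 1

print('layers peeled:', layers, '| core edges:', len(rdata))
```

On this toy graph one peel removes the `X -> A` and `B -> C` edges, leaving the two-edge cycle `A <-> B` as the core, mirroring the 502-edge core found in the real data.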
{
"cell_type": "markdown",
"metadata": {
"id": "ycGL0vJwnIBG"
},
"source": [
"### Classification of the nodes"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Pq37803BGnPf",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "84c7e51d-1d23-4f78-ff69-310013ae7526"
},
"source": [
"super_holders, sh_counts=np.unique(data[data['super_holder']==True].participant_id, return_counts=True)\n",
"super_targets, st_counts=np.unique(data[data['super_target']==True].organisation_inn, return_counts=True)\n",
"print('SH', len(super_holders))\n",
"print('ST', len(super_targets))\n",
"\n",
"# core members\n",
"core_inn = np.array(data_core.participant_id.value_counts().index)\n",
"print('Core', len(core_inn))\n",
"\n",
"# intermediaries\n",
"not_super_holders=list(data[(data['super_holder']==False)].participant_id)\n",
"not_super_targets=list(data[(data['super_target']==False)].organisation_inn)\n",
"not_super=np.array(list(set(not_super_holders+not_super_targets)))\n",
"inter = not_super[np.isin(not_super, core_inn)==False]\n",
"print('Intermediaries', len(inter))\n",
"\n",
"\n",
"# create a dataframe\n",
"firms_sh = pd.DataFrame({'company_number/id': super_holders, 'type': np.array(['SH']*len(super_holders))})\n",
"firms_st = pd.DataFrame({'company_number/id': super_targets, 'type': np.array(['ST']*len(super_targets))})\n",
"firms_core = pd.DataFrame({'company_number/id': core_inn, 'type': np.array(['C']*len(core_inn))})\n",
"firms_inter = pd.DataFrame({'company_number/id': inter, 'type': np.array(['I']*len(inter))})\n",
"\n",
"dst = pd.concat([firms_sh, firms_st, firms_core, firms_inter]).reset_index().drop(['index'], axis=1)\n",
"\n",
"assert len(list(set(list(set(data.participant_id))+list(set(data.organisation_inn)))))==len(dst)\n",
"\n",
"dst.to_csv('dst_british.csv', encoding='utf-8-sig')\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"SH 3770439\n",
"ST 4047325\n",
"Core 498\n",
"Intermediaries 151803\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RUwpwzHunNDo"
},
"source": [
"### Isolate detection"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kqq_uMv2J-nT",
"outputId": "23ce817b-bf9f-45ea-acbb-bdcd80e7604f"
},
"source": [
"all_isolates = {i: [] for i in range(1, LEVELS)}\n",
"\n",
"data_without_sh = data.loc[(data['super_holder'] == False)]\n",
"new_sh = ~pd.Series(data_without_sh.participant_id).isin(data_without_sh.organisation_inn)\n",
"new_unique_sh = data_without_sh.loc[new_sh == True].participant_id.value_counts().keys().values\n",
"#print('new unique sh', len(new_unique_sh))\n",
"\n",
"data_without_st = data.loc[(data['super_target'] == False)]\n",
"new_st = ~pd.Series(data_without_st.organisation_inn).isin(data_without_st.participant_id)\n",
"new_unique_st = data_without_st.loc[new_st == True].organisation_inn.value_counts().keys().values\n",
"#print('new unique st', len(new_unique_st))\n",
"\n",
"isolates = list(set(new_unique_sh).intersection(set(new_unique_st)))\n",
"print('isolates of layer 1:', len(isolates))\n",
"all_isolates[1] = isolates\n",
"\n",
"\n",
"rdata = data.loc[(data['super_holder'] == False) & (data['super_target'] == False)]\n",
"for i in range(1, LEVELS):\n",
" is_sh = ~pd.Series(rdata.participant_id).isin(rdata.organisation_inn)\n",
" is_st = ~pd.Series(rdata.organisation_inn).isin(rdata.participant_id)\n",
" rdata_without_sh = rdata.loc[is_sh == False]\n",
" rdata_without_st = rdata.loc[is_st == False]\n",
"\n",
" new_sh = ~pd.Series(rdata_without_sh.participant_id).isin(rdata_without_sh.organisation_inn)\n",
" new_st = ~pd.Series(rdata_without_st.organisation_inn).isin(rdata_without_st.participant_id)\n",
"\n",
" new_unique_sh = rdata_without_sh.loc[new_sh == True].participant_id.value_counts().keys().values\n",
" new_unique_st = rdata_without_st.loc[new_st == True].organisation_inn.value_counts().keys().values\n",
"\n",
" isolates = list(set(new_unique_sh).intersection(set(new_unique_st)))\n",
" print('isolates of layer {}: {}'.format(i+1, len(isolates)))\n",
" all_isolates[i+1] = isolates\n",
"\n",
" rdata = rdata.loc[(is_sh == False) & (is_st == False)]"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"isolates of layer 1: 91167\n",
"isolates of layer 2: 3338\n",
"isolates of layer 3: 635\n",
"isolates of layer 4: 271\n",
"isolates of layer 5: 77\n",
"isolates of layer 6: 43\n",
"isolates of layer 7: 11\n",
"isolates of layer 8: 7\n",
"isolates of layer 9: 5\n",
"isolates of layer 10: 0\n",
"isolates of layer 11: 0\n",
"isolates of layer 12: 0\n",
"isolates of layer 13: 0\n",
"isolates of layer 14: 0\n",
"isolates of layer 15: 0\n",
"isolates of layer 16: 0\n",
"isolates of layer 17: 0\n",
"isolates of layer 18: 0\n",
"isolates of layer 19: 0\n",
"isolates of layer 20: 0\n"
]
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ycHsNTtvuLmh"
},
"source": [
"coredata = crdata.copy()\n",
"coredata['SH_level'] = np.nan\n",
"coredata['ST_level'] = np.nan\n",
"\n",
"drop_cols = []\n",
"for i in range(1,LEVELS):\n",
" curr_sh_prop_name = 'is_level_' + str(i) + '_SH'\n",
" curr_st_prop_name = 'is_level_' + str(i) + '_ST'\n",
" coredata.loc[coredata[curr_sh_prop_name] == True, 'SH_level'] = i+1\n",
" coredata.loc[coredata[curr_st_prop_name] == True, 'ST_level'] = i+1\n",
" drop_cols.extend([curr_sh_prop_name, curr_st_prop_name])\n",
"\n",
"coredata.drop(columns=drop_cols, inplace = True)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 417
},
"id": "k7O3gIGWE7pc",
"outputId": "ef7fc413-cb65-462e-90f6-e08d61308062"
},
"source": [
"coredata"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>organisation_inn</th>\n",
" <th>participant_id</th>\n",
" <th>equity_share</th>\n",
" <th>super_holder</th>\n",
" <th>super_target</th>\n",
" <th>SH_level</th>\n",
" <th>ST_level</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>00000529</td>\n",
" <td>05995030</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>00000866</td>\n",
" <td>05253545</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>00000950</td>\n",
" <td>03526047</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>00001160</td>\n",
" <td>06452679</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>00001419</td>\n",
" <td>05282342</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>3.0</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5256244</th>\n",
" <td>SO307293</td>\n",
" <td>SC299736</td>\n",
" <td>0.4969</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5256255</th>\n",
" <td>SO307300</td>\n",
" <td>03942880</td>\n",
" <td>0.5069</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5256256</th>\n",
" <td>SO307300</td>\n",
" <td>11677818</td>\n",
" <td>0.4931</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5256265</th>\n",
" <td>SO307305</td>\n",
" <td>09431213</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5256338</th>\n",
" <td>ZC000195</td>\n",
" <td>01612178</td>\n",
" <td>1.0000</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>45295 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" organisation_inn participant_id ... SH_level ST_level\n",
"7 00000529 05995030 ... 3.0 NaN\n",
"10 00000866 05253545 ... NaN 2.0\n",
"11 00000950 03526047 ... 2.0 2.0\n",
"13 00001160 06452679 ... 4.0 2.0\n",
"19 00001419 05282342 ... 3.0 NaN\n",
"... ... ... ... ... ...\n",
"5256244 SO307293 SC299736 ... 4.0 2.0\n",
"5256255 SO307300 03942880 ... 2.0 2.0\n",
"5256256 SO307300 11677818 ... 2.0 2.0\n",
"5256265 SO307305 09431213 ... 3.0 2.0\n",
"5256338 ZC000195 01612178 ... NaN 3.0\n",
"\n",
"[45295 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MkNwGFp14WXE"
},
"source": [
"# Full adjacency matrix calculation\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1MsJj0bWdYIn",
"outputId": "aa88a7c1-b556-443f-b63c-f9cbc066c7ff"
},
"source": [
"all_holders=data.participant_id.value_counts()\n",
"all_targets=data.organisation_inn.value_counts()\n",
"all_nodes = sorted(set(all_holders.keys()) | set(all_targets.keys()))  # assign the result: a bare sorted() call is a no-op\n",
"node_inds = dict(zip(all_nodes, range(len(all_nodes))))\n",
"reverse_node_inds = dict(zip(range(len(all_nodes)), all_nodes))\n",
"print('all nodes:', len(all_nodes))\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"all nodes: 7970065\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VJjcD-cunrX0"
},
"source": [
"### Transitive ownership calculation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "JR6mpUUn3U9r"
},
"source": [
"def get_trans_ownership(alpha):\n",
"\n",
" print('Started calculations for alpha = {}'.format(alpha))\n",
" print('Preparing adjacency matrix...')\n",
" ADJ = sp.lil_matrix((len(all_nodes), len(all_nodes)), dtype = np.float32)\n",
"\n",
" print('non-zero elements:', ADJ.nnz)\n",
" # -------------------------- 1 - Superholders --------------------------------\n",
" print('Processing intermediate layers superholders...')\n",
" unique_curr_lvl_sh = {}\n",
" for i in range(1, LEVELS):\n",
" big_core_nodes_of_curr_level = coredata[coredata['SH_level'] == i].participant_id.value_counts().keys().values\n",
" isolates_of_curr_level = all_isolates[i]\n",
" all_sh_of_curr_level = np.concatenate((big_core_nodes_of_curr_level, isolates_of_curr_level))\n",
" unique_curr_lvl_sh[i] = all_sh_of_curr_level\n",
" \n",
" edge_ends_at_curr_lvl_sh = {i: pd.Series(data.organisation_inn).isin(unique_curr_lvl_sh[i]) for i in range(1,LEVELS)}\n",
" edges_of_interest_inter_sh = {i: data[edge_ends_at_curr_lvl_sh[i] == True] for i in range(1,LEVELS)}\n",
"\n",
" for lvl in range(1, LEVELS):\n",
" #print('processing layer {} super holders'.format(lvl))\n",
" edges = edges_of_interest_inter_sh[lvl]\n",
" shares = edges.equity_share.values\n",
" holders = edges.participant_id.values\n",
" curr_lvl_super_holders = edges.organisation_inn.values\n",
" \n",
" holders_pos_in_adj = np.array([node_inds[h] for h in holders])\n",
" curr_lvl_sh_pos_in_adj = np.array([node_inds[sh] for sh in curr_lvl_super_holders])\n",
"\n",
" target_inds_from_prev_level = np.array([])\n",
" holder_inds_from_prev_level = np.array([])\n",
" shares_from_prev_level = np.array([])\n",
"\n",
" # no need to define zero-level SH since they have zero ownership vectors\n",
" target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()\n",
" shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],\n",
" holder_inds_from_prev_level].A[0]\n",
" \n",
" for i, share in enumerate(shares):\n",
" ADJ[curr_lvl_sh_pos_in_adj[i], holders_pos_in_adj[i]] = share\n",
"\n",
" for i, prev_share in enumerate(shares_from_prev_level):\n",
" ADJ[curr_lvl_sh_pos_in_adj[target_inds_from_prev_level][i],\n",
" holder_inds_from_prev_level[i]] += prev_share*alpha\n",
" \n",
" print('non-zero elements:', ADJ.nnz)\n",
"\n",
" # --------------------------- 2 - Core construction ------------------------------------------\n",
" core_nodes = sorted(set(core_holders.keys()) | set(core_targets.keys()))  # assign the result: a bare sorted() call is a no-op\n",
" core_node_inds = dict(zip(core_nodes, range(len(core_nodes))))\n",
" global_reverse_core_node_inds = dict(zip(range(len(core_nodes)), [node_inds[cnode] for cnode in core_nodes]))\n",
"\n",
" core_shares = data_core.equity_share.values.astype(float)\n",
" core_row_inds = np.array([core_node_inds[ch] for ch in data_core.participant_id.values])\n",
" core_col_inds = np.array([core_node_inds[ct] for ct in data_core.organisation_inn.values])\n",
"\n",
" W = sp.coo_matrix((core_shares, (core_row_inds, core_col_inds))).tocsc()\n",
" G = W.dot(sp.linalg.inv(sp.eye(W.shape[0]).tocsc() - alpha*W)) # precise transitivity matrix calculation for core\n",
"\n",
"\n",
" # --------------------- 3.1 - edges projecting from super-holders of previous levels to core--------------\n",
" print('Adding information from core...')\n",
" edge_ends_at_core = pd.Series(data.organisation_inn).isin(pd.Series(core_nodes))\n",
" edges_of_interest_core = data[edge_ends_at_core == True]\n",
" curr_core_nodes = edges_of_interest_core.organisation_inn.values\n",
" ext_holders = edges_of_interest_core.participant_id.values\n",
" ext_shares = edges_of_interest_core.equity_share.values\n",
"\n",
" ext_holders_pos_in_adj = np.array([node_inds[h] for h in ext_holders])\n",
" curr_core_nodes_pos_in_adj = np.array([node_inds[sh] for sh in curr_core_nodes])\n",
"\n",
" target_inds, ext_holder_inds = ADJ[ext_holders_pos_in_adj, :].nonzero()\n",
" shares_from_sh_to_core = ADJ[ext_holders_pos_in_adj[target_inds], ext_holder_inds].A[0]\n",
"\n",
" for i, share in enumerate(ext_shares):\n",
" ADJ[curr_core_nodes_pos_in_adj[i], ext_holders_pos_in_adj[i]] = share\n",
"\n",
" for i, prev_share in enumerate(shares_from_sh_to_core):\n",
" ADJ[curr_core_nodes_pos_in_adj[target_inds][i], ext_holder_inds[i]] += prev_share*alpha\n",
"\n",
"\n",
" # ------------------------------------- 3.2 edges inside the core ------------------------------------\n",
" inside_core_targets = np.array([global_reverse_core_node_inds[ch] for ch in G.nonzero()[1]])\n",
" inside_core_holders = np.array([global_reverse_core_node_inds[ch] for ch in G.nonzero()[0]])\n",
" inside_core_shares = G.data\n",
"\n",
" target_inds_core, holder_inds_core = ADJ[inside_core_holders, :].nonzero()\n",
" shares_from_prev_level = ADJ[inside_core_holders[target_inds_core], holder_inds_core].A[0]\n",
"\n",
" for i, share in enumerate(inside_core_shares):\n",
" ADJ[inside_core_targets[i], inside_core_holders[i]] = share\n",
"\n",
" for i, prev_share in enumerate(shares_from_prev_level):\n",
" ADJ[inside_core_targets[target_inds_core][i], holder_inds_core[i]] += prev_share*alpha\n",
"\n",
" print('non-zero elements:', ADJ.nnz)\n",
"\n",
"\n",
" # -------------------------------- 4 Supertargets of internal levels ----------------------------------\n",
" print('Processing intermediate layers supertargets...')\n",
" unique_curr_lvl_st = {}\n",
" for i in range(1,LEVELS):\n",
" big_core_nodes_of_curr_level = coredata[coredata['ST_level'] == i].organisation_inn.value_counts().keys().values\n",
" isolates_of_curr_level = all_isolates[i]\n",
" all_st_of_curr_level = np.concatenate((big_core_nodes_of_curr_level, isolates_of_curr_level))\n",
" #print(len(all_st_of_curr_level))\n",
" unique_curr_lvl_st[i] = all_st_of_curr_level\n",
"\n",
" edge_ends_at_curr_lvl_st = {i: pd.Series(data.organisation_inn).isin(unique_curr_lvl_st[i]) for i in range(1,LEVELS)}\n",
" edges_of_interest_inter_st = {i: data[edge_ends_at_curr_lvl_st[i] == True] for i in range(1,LEVELS)}\n",
"\n",
" for lvl in range(LEVELS-1, 0, -1):\n",
" #print('processing layer {} ST'.format(lvl))\n",
"\n",
" edges = edges_of_interest_inter_st[lvl]\n",
" shares = edges.equity_share.values\n",
" holders = edges.participant_id.values\n",
" curr_lvl_super_targets = edges.organisation_inn.values\n",
" \n",
" holders_pos_in_adj = np.array([node_inds[h] for h in holders])\n",
" curr_lvl_st_pos_in_adj = np.array([node_inds[sh] for sh in curr_lvl_super_targets])\n",
"\n",
" target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()\n",
" shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],\n",
" holder_inds_from_prev_level].A[0]\n",
" \n",
" #for i, share in enumerate(shares):\n",
" ADJ[curr_lvl_st_pos_in_adj, holders_pos_in_adj] = shares\n",
" \n",
" for i, prev_share in enumerate(shares_from_prev_level):\n",
" ADJ[curr_lvl_st_pos_in_adj[target_inds_from_prev_level][i],\n",
" holder_inds_from_prev_level[i]] += prev_share*alpha\n",
" \n",
" print('non-zero elements:', ADJ.nnz)\n",
"\n",
" # ------------------------------------ 5 - Supertargets ----------------------------------------\n",
" print('Processing supertargets...')\n",
" edges_of_interest_st = data[data['super_target'] == True]\n",
" shares = edges_of_interest_st.equity_share.values\n",
" holders = edges_of_interest_st.participant_id.values\n",
" super_targets = edges_of_interest_st.organisation_inn.values\n",
"\n",
" holders_pos_in_adj = np.array([node_inds[h] for h in holders])\n",
" st_pos_in_adj = np.array([node_inds[sh] for sh in super_targets])\n",
"\n",
" target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()\n",
" shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],\n",
" holder_inds_from_prev_level].A[0]\n",
" \n",
" ADJ[st_pos_in_adj, holders_pos_in_adj] = shares\n",
" \n",
"    # no further matrix slicing is needed, so we can build an additional\n",
"    # COO matrix and add it to the main one:\n",
" ST_ADJ = sp.coo_matrix((shares_from_prev_level*alpha, (st_pos_in_adj[target_inds_from_prev_level], holder_inds_from_prev_level)),\n",
" shape = (len(all_nodes), len(all_nodes)))\n",
" \n",
" FINAL_ADJ = ST_ADJ.tolil() + ADJ\n",
"\n",
" print('non-zero elements:', FINAL_ADJ.nnz)\n",
"\n",
" # ------------------------------ 6 - Conversion to csv -----------------------------------------\n",
" all_trans_targets, all_trans_holders = FINAL_ADJ.nonzero()\n",
" all_trans_shares = FINAL_ADJ[all_trans_targets, all_trans_holders].A[0]\n",
" trans_data = pd.DataFrame()\n",
"\n",
" trans_data['company_number'] = [reverse_node_inds[tt] for tt in all_trans_targets]\n",
" trans_data['participant_id'] = [reverse_node_inds[th] for th in all_trans_holders]\n",
" trans_data['share'] = all_trans_shares\n",
"\n",
" print('Final dataframe has {} rows'.format(len(all_trans_shares)))\n",
"\n",
" del ADJ\n",
" del FINAL_ADJ\n",
" del ST_ADJ\n",
"\n",
" gc.collect()\n",
" \n",
" return trans_data\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Hu1mhRopnTwW"
},
"source": [
"### Launcher"
]
},
{
"cell_type": "code",
"metadata": {
"id": "T92SOjxhJxVr",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "945773ba-6859-4bbf-c926-172226312add"
},
"source": [
"# Note that you may need to change the path to the data in CORP_PATH\n",
"from google.colab import drive\n",
"drive.mount('/content/drive/', force_remount=True)\n",
"\n",
"if DPI:\n",
" CORP_PATH = '/content/drive/My Drive/Colab Notebooks/DataRoot/Corporate control/British data DPI'\n",
"else:\n",
" CORP_PATH = '/content/drive/My Drive/Colab Notebooks/DataRoot/Corporate control/British data'\n",
"\n",
"alphalist = [_/10.0 for _ in range(10)] + [0.999]\n",
"#alphalist = [0.999]\n",
"\n",
"for alpha in alphalist:\n",
" \n",
" trans_data = get_trans_ownership(alpha)\n",
"\n",
" if DPI:\n",
" dname = 'uk_organisations_transitive_ownership_alpha{}_2021_long_7sep21_dpi_10000iter'.format(alpha)\n",
" else:\n",
" dname = 'uk_organisations_transitive_ownership_alpha{}_2021_long_2aug21'.format(alpha)\n",
"\n",
" if SH_MODE:\n",
" dname = dname + '_SH_only.csv'\n",
" else:\n",
" dname = dname + '.csv'\n",
" \n",
"    # if only super-holders (SH) should remain as holders, filter out all other holding entities:\n",
"    if SH_MODE:\n",
"        final_super_holder = ~pd.Series(trans_data.participant_id).isin(trans_data.company_number)\n",
"        # the resulting set of SH should coincide with the set of SH in the original data\n",
"        trans_data = trans_data[final_super_holder]\n",
"\n",
" trans_data.to_csv(join(CORP_PATH, dname), index=False)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/drive/\n",
"Started calculations for alpha = 0.0\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 158125\n",
"Adding information from core...\n",
"non-zero elements: 158722\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 197144\n",
"Processing supertargets...\n",
"non-zero elements: 5096560\n",
"Final dataframe has 5096560 rows\n",
"Started calculations for alpha = 0.1\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.2\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.3\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.4\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.5\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.6\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.7\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.8\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.9\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n",
"Started calculations for alpha = 0.999\n",
"Preparing adjacency matrix...\n",
"non-zero elements: 0\n",
"Processing intermediate layers superholders...\n",
"non-zero elements: 165495\n",
"Adding information from core...\n",
"non-zero elements: 166851\n",
"Processing intermediate layers supertargets...\n",
"non-zero elements: 294304\n",
"Processing supertargets...\n",
"non-zero elements: 5694581\n",
"Final dataframe has 5694581 rows\n"
]
}
]
},
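{
"cell_type": "markdown",
"metadata": {
"id": "toyAlphaSketch"
},
"source": [
"A minimal sketch of the propagation rule above, on a hypothetical toy chain: `A` holds 100% of `B`, and `B` holds 100% of `C`. Each extra hop discounts the share inherited from the previous level by `alpha`, mirroring the `prev_share*alpha` update in the code:\n",
"\n",
"```python\n",
"# Toy 3-node chain: A -> B -> C, each link a 100% equity share\n",
"alpha = 0.5\n",
"share_a_in_b = 1.0  # A's direct share in B\n",
"\n",
"# A's transitive share in C: the share inherited from the previous\n",
"# level, discounted once by alpha (cf. prev_share*alpha above)\n",
"share_a_in_c = share_a_in_b * alpha\n",
"print(share_a_in_c)  # 0.5\n",
"```\n",
"\n",
"With `alpha = 0.999` (the last value in `alphalist`) indirect links are barely discounted, while with `alpha = 0.0` only direct links survive, consistent with the smaller non-zero counts in the `alpha = 0.0` run above."
]
},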
{
"cell_type": "markdown",
"metadata": {
"id": "IvJtOjirn0cJ"
},
"source": [
"# Return top-*k* holders\n",
"\n",
"Finally, with all of the transitive-ownership data prepared, we define a function that takes a company number and prints its top-*k* holders, with their shares renormalised to sum to one."
]
},
{
"cell_type": "code",
"metadata": {
"id": "9dR0W06V5id1"
},
"source": [
"def get_top_k_holders(target, trans_data, k):\n",
"\n",
"    # select the target company's holders and keep the k largest shares\n",
"    rel_data = trans_data[trans_data.company_number == target]\n",
"    top_k_data = rel_data.sort_values(by='share', ascending = False, ignore_index = True)[:k].copy()\n",
"\n",
"    # renormalise the top-k share weights to sum to 1\n",
"    top_k_data.share = top_k_data.share/top_k_data.share.sum()\n",
"\n",
"    print(top_k_data)\n"
],
"execution_count": null,
"outputs": []
},
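{
"cell_type": "markdown",
"metadata": {
"id": "toyTopKSketch"
},
"source": [
"Before applying it to the real data, here is the same select-sort-renormalise logic on a hypothetical toy table (same column names as `trans_data`, made-up values):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy ownership table with the same columns as trans_data\n",
"toy = pd.DataFrame({\n",
"    \"company_number\": [\"X\", \"X\", \"X\"],\n",
"    \"participant_id\": [\"P1\", \"P2\", \"P3\"],\n",
"    \"share\": [0.6, 0.3, 0.1],\n",
"})\n",
"\n",
"# keep the top-2 holders of X and renormalise their shares to sum to 1\n",
"top2 = toy[toy.company_number == \"X\"].sort_values(\"share\", ascending=False).head(2).copy()\n",
"top2[\"share\"] = top2[\"share\"] / top2[\"share\"].sum()\n",
"print(top2)  # P1 gets 2/3, P2 gets 1/3\n",
"```"
]
},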
{
"cell_type": "markdown",
"metadata": {
"id": "XY38QTptoIhu"
},
"source": [
"An example with a random company:"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VP9INrGh6i7g",
"outputId": "f5109ffb-58dc-4281-f22d-7ad88ffeaa62"
},
"source": [
"test_name = '06853998'\n",
"get_top_k_holders(test_name, trans_data, 2)\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" company_number participant_id share\n",
"0 06853998 MICHAEL$NA$GREVILLE$1962$2 0.500752\n",
"1 06853998 PETER$CHARLES$DE HAAN$1952$3 0.499248\n"
]
}
]
}
]
}