## Introduction
This notebook demonstrates α-ICON (Indirect Control in Onion-like networks) --- an algorithm to identify ultimate controlling entities in corporate networks. We provide a self-contained application as a companion to [our paper](https://arxiv.org/abs/2109.07181) and [repository](https://github.com/eusporg/alphaicon).

We work with the UK's People with Significant Control (PSC) register, which covers 4.2 million companies and 4 million of their holders as of August 2021.


## Data loading & import
All data pre-processing of the [PSC snapshot](http://download.companieshouse.gov.uk/en_pscdata.html) is done in the [repository](https://github.com/eusporg/alphaicon) (`code/data_preparation/uk`). The resulting data is stored in a [public folder](https://drive.google.com/drive/folders/10Tq-b4BVsG3gmq2JVa026Nilzj8eojNB) on Google Drive.

We will be working with two files:

*   `output/uk/uk_organisations_participants_2021_long_2aug21.zip` --- an archived CSV with company ID-participant ID mapping from the PSC data and the respective equity shares.
*   `output/uk/npi_dpi/10000iter/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip` --- an archived CSV with company ID-participant ID mapping from the PSC data and their Direct Power Indices ([Mizuno, Doi, and Kurizaki (2020)](https://doi.org/10.1371/journal.pone.0237862)). 


 



In [None]:
# Download and unarchive the data files from the Google Drive public link
!pip install gdown

!gdown https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7 
!unzip uk_organisations_participants_2021_long_2aug21.zip
 
!gdown https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG
!unzip uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip
 
import pandas as pd
import scipy
from os.path import join
import matplotlib.pyplot as plt
import scipy.sparse as sp
from scipy.sparse.linalg import eigs
import numpy as np
from itertools import combinations
import tqdm
import networkx as nx
import gc

Downloading...
From: https://drive.google.com/uc?id=1rpi5FEPrKfx9vIwDpr_mfK6L971rrtL7
To: /content/uk_organisations_participants_2021_long_2aug21.zip
74.3MB [00:00, 179MB/s]
Archive:  uk_organisations_participants_2021_long_2aug21.zip
  inflating: uk_organisations_participants_2021_long_2aug21.csv  
Downloading...
From: https://drive.google.com/uc?id=1UBsF3RBMvjF7dBb1PG-wXhEBv3whoMLG
To: /content/uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip
74.7MB [00:00, 121MB/s] 
Archive:  uk_organisations_participants_2021_long_7sep21_dpi_10000iter.zip
  inflating: uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv  


### Import without downloading the data

In [None]:
import pandas as pd
import scipy
from os.path import join
import matplotlib.pyplot as plt
import scipy.sparse as sp
from scipy.sparse.linalg import eigs
import numpy as np
from itertools import combinations
import tqdm
import networkx as nx
import gc

# Data processing

In [None]:
DPI = 1 # 1: use the Direct Power Index (DPI) file; 0: use the equity-share file
SH_MODE = 1 # 1: only SH (super-holder) nodes are considered final holders
LEVELS = 20 # must equal or exceed the number of "onion layers" in the data; 20 is deliberately generous

In [None]:
if DPI:
    name = 'uk_organisations_participants_2021_long_7sep21_dpi_10000iter.csv'
    data = pd.read_csv(name, engine='python', dtype={'participant': str, 'entity': str})
else:
    name = 'uk_organisations_participants_2021_long_2aug21.csv'
    data = pd.read_csv(name, engine='python', dtype={'participant_id': str, 'company_number': str})

### Primary processing

In [None]:
data = data.dropna()

if DPI:
    data = data.astype({'dpi': float, 'participant':str, 'entity':str})
    data = data.rename(columns={"entity": "organisation_inn",
                                'participant': 'participant_id',
                                'dpi': 'equity_share'})
    
    # columns are renamed for consistency with further code
    data=data[['organisation_inn','participant_id','equity_share']]

else:
    data = data.astype({'equity_share': float, 'participant_id':str, 'company_number':str})
    data = data.rename(columns={"company_number": "organisation_inn"})

data = data[data['equity_share'] > 0]
data = data[data.participant_id != data.organisation_inn]

# normalization of in-edge weights to 1
# (sum only the numeric share column so that dict() receives exactly two columns)
gdata = data.groupby('organisation_inn')['equity_share'].sum().reset_index()
dict_companies = dict(gdata.values)
data['equity_share'] = data['equity_share'] / data['organisation_inn'].map(dict_companies)

# finding SH and ST nodes
data['super_holder']=~pd.Series(data.participant_id).isin(data.organisation_inn)
data['super_target']=~pd.Series(data.organisation_inn).isin(data.participant_id)


data.head()

Unnamed: 0,organisation_inn,participant_id,equity_share,super_holder,super_target
0,133,THE PENINSULAR AND ORIENTAL STEAM NAVIGATION C...,1.0,True,True
1,140,NICHOLLS & CLARKE LIMITED$NA$NA,1.0,True,True
2,295,COLIN$NA$WELLS$1967$2,0.4997,True,True
3,295,MOIRA$RUTH$SLEIGHT$1959$2,0.5003,True,True
4,371,DAVID$JOHN$ROWLAND$1945$6,1.0,True,True
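
The normalization and the SH/ST flags computed above can be checked on a tiny edge list. The sketch below is illustrative only: the node names (`P1`, `P2`, `A`, `B`) and the shares are made up, and it uses `transform('sum')` as one possible way to rescale incoming shares.

```python
import pandas as pd

# Hypothetical toy edge list: persons P1 and P2 hold company A; A holds company B.
toy = pd.DataFrame({
    'organisation_inn': ['A', 'A', 'B'],
    'participant_id':   ['P1', 'P2', 'A'],
    'equity_share':     [0.3, 0.1, 0.5],
})

# Rescale the incoming shares of every company so that they sum to 1
totals = toy.groupby('organisation_inn')['equity_share'].transform('sum')
toy['equity_share'] = toy['equity_share'] / totals

# Super-holders (SH) never appear as a target;
# super-targets (ST) never appear as a holder
toy['super_holder'] = ~toy['participant_id'].isin(toy['organisation_inn'])
toy['super_target'] = ~toy['organisation_inn'].isin(toy['participant_id'])

print(toy)
```

In this toy network `P1` and `P2` are flagged as super-holders, `B` as a super-target, and `A` as neither, because it appears on both sides of an edge.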


### Edges analysis

In [None]:
e = len(data)
e1 = len(data[(data['super_holder'] == True) & (data['super_target'] == True)])
e2 = len(data[(data['super_holder'] == True) & (data['super_target'] == False)])
e3 = len(data[(data['super_holder'] == False) & (data['super_target'] == True)])
e4 = len(data[(data['super_holder'] == False) & (data['super_target'] == False)])
print('total edges:', e)
print('SH -> ST edges', e1, '({0:.2f}%)'.format(e1/e*100))
print('SH -> ~ST edges', e2, '({0:.2f}%)'.format(e2/e*100))
print('~SH -> ST edges', e3, '({0:.2f}%)'.format(e3/e*100))
print('~SH -> ~ST edges', e4, '({0:.2f}%)'.format(e4/e*100))

total edges: 5096560
SH -> ST edges 4642420 (91.09%)
SH -> ~ST edges 151849 (2.98%)
~SH -> ST edges 256996 (5.04%)
~SH -> ~ST edges 45295 (0.89%)


# Network analysis

### Core creation

In [None]:
print('pruning SH and ST nodes of the external layer...')
rdata = data.loc[(data['super_holder'] == False) & (data['super_target'] == False)]
edges_left = len(rdata)
print('layer 1:', edges_left, 'edges left')

crdata = rdata.copy()

for i in range(1, LEVELS):
    is_sh = ~pd.Series(rdata.participant_id).isin(rdata.organisation_inn)
    current_unique_sh = pd.Series(rdata.loc[is_sh == True].participant_id.value_counts().keys())
    print('current SH', len(current_unique_sh))
    curr_sh_prop_name = 'is_level_' + str(i) + '_SH'
    crdata[curr_sh_prop_name] = pd.Series(crdata.participant_id).isin(current_unique_sh)

    is_st = ~pd.Series(rdata.organisation_inn).isin(rdata.participant_id)
    current_unique_st = pd.Series(rdata.loc[is_st == True].organisation_inn.value_counts().keys())
    print('current ST', len(current_unique_st))
    curr_st_prop_name = 'is_level_' + str(i) + '_ST'
    crdata[curr_st_prop_name] = pd.Series(crdata.organisation_inn).isin(current_unique_st)

    rdata = rdata.loc[(is_sh == False) & (is_st == False)]
    print('layer {}: {} edges left'.format(i+1, len(rdata)))

data_core = rdata.copy()
core_holders = rdata.participant_id.value_counts()
print('unique core holders:', len(core_holders))
core_targets = rdata.organisation_inn.value_counts()
print('unique core targets:', len(core_targets))

pruning SH and ST nodes of the external layer...
layer 1: 45295 edges left
current SH 17777
current ST 28628
layer 2: 9102 edges left
current SH 2467
current ST 3981
layer 3: 3489 edges left
current SH 874
current ST 1220
layer 4: 1626 edges left
current SH 348
current ST 416
layer 5: 964 edges left
current SH 149
current ST 167
layer 6: 685 edges left
current SH 65
current ST 67
layer 7: 578 edges left
current SH 29
current ST 27
layer 8: 527 edges left
current SH 14
current ST 14
layer 9: 505 edges left
current SH 3
current ST 3
layer 10: 502 edges left
current SH 0
current ST 0
layer 11: 502 edges left
current SH 0
current ST 0
layer 12: 502 edges left
current SH 0
current ST 0
layer 13: 502 edges left
current SH 0
current ST 0
layer 14: 502 edges left
current SH 0
current ST 0
layer 15: 502 edges left
current SH 0
current ST 0
layer 16: 502 edges left
current SH 0
current ST 0
layer 17: 502 edges left
current SH 0
current ST 0
layer 18: 502 edges left
current SH 0
current ST 0
... (output truncated)
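
The peeling loop can be traced on a small graph. The following is a minimal sketch on a made-up edge list (a chain `P -> A -> B -> C -> D` plus a two-node cycle `X <-> Y`). Unlike the cell above, it stops as soon as a pass removes nothing instead of running a fixed number of `LEVELS`.

```python
import pandas as pd

# Hypothetical toy network: a chain P -> A -> B -> C -> D and a cycle X <-> Y
edges = pd.DataFrame({
    'participant_id':   ['P', 'A', 'B', 'C', 'X', 'Y'],
    'organisation_inn': ['A', 'B', 'C', 'D', 'Y', 'X'],
})

layer = 1
history = []  # (layer, edges before peeling, edges kept)
while True:
    # an edge is peeled if it starts at a current SH or ends at a current ST
    is_sh = ~edges['participant_id'].isin(edges['organisation_inn'])
    is_st = ~edges['organisation_inn'].isin(edges['participant_id'])
    keep = ~is_sh & ~is_st
    history.append((layer, len(edges), int(keep.sum())))
    if keep.all():  # nothing left to peel: the remaining edges form the core
        break
    edges = edges[keep]
    layer += 1

core_edges = edges
print(history)
print(core_edges)
```

The chain is peeled from both ends, one layer per pass, while the cycle survives every pass and ends up as the core.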

### Classification of the nodes

In [None]:
super_holders, sh_counts=np.unique(data[data['super_holder']==True].participant_id, return_counts=True)
super_targets, st_counts=np.unique(data[data['super_target']==True].organisation_inn, return_counts=True)
print('SH', len(super_holders))
print('ST', len(super_targets))

# core members
core_inn = np.array(data_core.participant_id.value_counts().index)
print('Core', len(core_inn))

# intermediaries
not_super_holders=list(data[(data['super_holder']==False)].participant_id)
not_super_targets=list(data[(data['super_target']==False)].organisation_inn)
not_super=np.array(list(set(not_super_holders+not_super_targets)))
inter = not_super[np.isin(not_super, core_inn)==False]
print('Intermediaries', len(inter))


# create a dataframe
firms_sh = pd.DataFrame({'company_number/id': super_holders, 'type': np.array(['SH']*len(super_holders))})
firms_st = pd.DataFrame({'company_number/id': super_targets, 'type': np.array(['ST']*len(super_targets))})
firms_core = pd.DataFrame({'company_number/id': core_inn, 'type': np.array(['C']*len(core_inn))})
firms_inter = pd.DataFrame({'company_number/id': inter, 'type': np.array(['I']*len(inter))})

dst = pd.concat([firms_sh, firms_st, firms_core, firms_inter]).reset_index().drop(['index'], axis=1)

assert len(list(set(list(set(data.participant_id))+list(set(data.organisation_inn)))))==len(dst)

dst.to_csv('dst_british.csv', encoding='utf-8-sig')


SH 3770439
ST 4047325
Core 498
Intermediaries 151803


### Isolates detection

In [None]:
all_isolates = {i: [] for i in range(1, LEVELS)}

data_without_sh = data.loc[(data['super_holder'] == False)]
new_sh = ~pd.Series(data_without_sh.participant_id).isin(data_without_sh.organisation_inn)
new_unique_sh = data_without_sh.loc[new_sh == True].participant_id.value_counts().keys().values
#print('new unique sh', len(new_unique_sh))

data_without_st = data.loc[(data['super_target'] == False)]
new_st = ~pd.Series(data_without_st.organisation_inn).isin(data_without_st.participant_id)
new_unique_st = data_without_st.loc[new_st == True].organisation_inn.value_counts().keys().values
#print('new unique st', len(new_unique_st))

isolates = list(set(new_unique_sh).intersection(set(new_unique_st)))
print('isolates of layer 1:', len(isolates))
all_isolates[1] = isolates


rdata = data.loc[(data['super_holder'] == False) & (data['super_target'] == False)]
for i in range(1, LEVELS):
    is_sh = ~pd.Series(rdata.participant_id).isin(rdata.organisation_inn)
    is_st = ~pd.Series(rdata.organisation_inn).isin(rdata.participant_id)
    rdata_without_sh = rdata.loc[is_sh == False]
    rdata_without_st = rdata.loc[is_st == False]

    new_sh = ~pd.Series(rdata_without_sh.participant_id).isin(rdata_without_sh.organisation_inn)
    new_st = ~pd.Series(rdata_without_st.organisation_inn).isin(rdata_without_st.participant_id)

    new_unique_sh = rdata_without_sh.loc[new_sh == True].participant_id.value_counts().keys().values
    new_unique_st = rdata_without_st.loc[new_st == True].organisation_inn.value_counts().keys().values

    isolates = list(set(new_unique_sh).intersection(set(new_unique_st)))
    print('isolates of layer {}: {}'.format(i+1, len(isolates)))
    all_isolates[i+1] = isolates

    rdata = rdata.loc[(is_sh == False) & (is_st == False)]

isolates of layer 1: 91167
isolates of layer 2: 3338
isolates of layer 3: 635
isolates of layer 4: 271
isolates of layer 5: 77
isolates of layer 6: 43
isolates of layer 7: 11
isolates of layer 8: 7
isolates of layer 9: 5
isolates of layer 10: 0
isolates of layer 11: 0
isolates of layer 12: 0
isolates of layer 13: 0
isolates of layer 14: 0
isolates of layer 15: 0
isolates of layer 16: 0
isolates of layer 17: 0
isolates of layer 18: 0
isolates of layer 19: 0
isolates of layer 20: 0
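
As read from the cell above, an isolate is a node that becomes a super-holder once the current SH edges are dropped and also becomes a super-target once the current ST edges are dropped, so it vanishes from the next layer entirely. A minimal sketch on a made-up three-node chain:

```python
import pandas as pd

# Hypothetical chain P -> B -> C: B is neither SH nor ST in the full data
edges = pd.DataFrame({
    'participant_id':   ['P', 'B'],
    'organisation_inn': ['B', 'C'],
})
is_sh = ~edges['participant_id'].isin(edges['organisation_inn'])
is_st = ~edges['organisation_inn'].isin(edges['participant_id'])

# Drop edges that start at a super-holder; holders left without an owner are "new" SH
without_sh = edges[~is_sh]
new_sh_mask = ~without_sh['participant_id'].isin(without_sh['organisation_inn'])
new_sh = set(without_sh.loc[new_sh_mask, 'participant_id'])

# Drop edges that end at a super-target; targets left without a holding are "new" ST
without_st = edges[~is_st]
new_st_mask = ~without_st['organisation_inn'].isin(without_st['participant_id'])
new_st = set(without_st.loc[new_st_mask, 'organisation_inn'])

isolates = new_sh & new_st
print(isolates)
```

Here `B` is the only isolate: it sits between a super-holder and a super-target, and no edge involving it survives the first peeling pass.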


In [None]:
coredata = crdata.copy()
coredata['SH_level'] = np.nan
coredata['ST_level'] = np.nan

drop_cols = []
for i in range(1,LEVELS):
    curr_sh_prop_name = 'is_level_' + str(i) + '_SH'
    curr_st_prop_name = 'is_level_' + str(i) + '_ST'
    coredata.loc[coredata[curr_sh_prop_name] == True, 'SH_level'] = i+1
    coredata.loc[coredata[curr_st_prop_name] == True, 'ST_level'] = i+1
    drop_cols.extend([curr_sh_prop_name, curr_st_prop_name])

coredata.drop(columns=drop_cols, inplace = True)

In [None]:
coredata

Unnamed: 0,organisation_inn,participant_id,equity_share,super_holder,super_target,SH_level,ST_level
7,00000529,05995030,1.0000,False,False,3.0,
10,00000866,05253545,1.0000,False,False,,2.0
11,00000950,03526047,1.0000,False,False,2.0,2.0
13,00001160,06452679,1.0000,False,False,4.0,2.0
19,00001419,05282342,1.0000,False,False,3.0,
...,...,...,...,...,...,...,...
5256244,SO307293,SC299736,0.4969,False,False,4.0,2.0
5256255,SO307300,03942880,0.5069,False,False,2.0,2.0
5256256,SO307300,11677818,0.4931,False,False,2.0,2.0
5256265,SO307305,09431213,1.0000,False,False,3.0,2.0


# Full adjacency matrix calculation


In [None]:
all_holders=data.participant_id.value_counts()
all_targets=data.organisation_inn.value_counts()
all_nodes = list(set(all_holders.keys()) | set(all_targets.keys()))
all_nodes.sort()  # sort in place so that node indices are reproducible across runs
node_inds = dict(zip(all_nodes, range(len(all_nodes))))
reverse_node_inds = dict(zip(range(len(all_nodes)), all_nodes))
print('all nodes:', len(all_nodes))


all nodes: 7970065


### Transitive ownership calculation

In [None]:
def get_trans_ownership(alpha):

    print('Started calculations for alpha = {}'.format(alpha))
    print('Preparing adjacency matrix...')
    ADJ = sp.lil_matrix((len(all_nodes), len(all_nodes)), dtype = np.float32)

    print('non-zero elements:', ADJ.nnz)
    # -------------------------- 1 - Superholders --------------------------------
    print('Processing intermediate layers superholders...')
    unique_curr_lvl_sh = {}
    for i in range(1, LEVELS):
        big_core_nodes_of_curr_level = coredata[coredata['SH_level'] == i].participant_id.value_counts().keys().values
        isolates_of_curr_level = all_isolates[i]
        all_sh_of_curr_level = np.concatenate((big_core_nodes_of_curr_level, isolates_of_curr_level))
        unique_curr_lvl_sh[i] = all_sh_of_curr_level
    
    edge_ends_at_curr_lvl_sh = {i: pd.Series(data.organisation_inn).isin(unique_curr_lvl_sh[i]) for i in range(1,LEVELS)}
    edges_of_interest_inter_sh = {i: data[edge_ends_at_curr_lvl_sh[i] == True] for i in range(1,LEVELS)}

    for lvl in range(1, LEVELS):
        #print('processing layer {} super holders'.format(lvl))
        edges = edges_of_interest_inter_sh[lvl]
        shares = edges.equity_share.values
        holders = edges.participant_id.values
        curr_lvl_super_holders = edges.organisation_inn.values
        
        holders_pos_in_adj = np.array([node_inds[h] for h in holders])
        curr_lvl_sh_pos_in_adj = np.array([node_inds[sh] for sh in curr_lvl_super_holders])

        target_inds_from_prev_level = np.array([])
        holder_inds_from_prev_level = np.array([])
        shares_from_prev_level = np.array([])

        # no need to define zero-level SH since they have zero ownership vectors
        target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()
        shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],
                                        holder_inds_from_prev_level].toarray()[0]
        
        for i, share in enumerate(shares):
            ADJ[curr_lvl_sh_pos_in_adj[i], holders_pos_in_adj[i]] = share

        for i, prev_share in enumerate(shares_from_prev_level):
            ADJ[curr_lvl_sh_pos_in_adj[target_inds_from_prev_level][i],
                holder_inds_from_prev_level[i]] += prev_share*alpha
        
    print('non-zero elements:', ADJ.nnz)

    # --------------------------- 2 - Core construction ------------------------------------------
    core_nodes = list(set(core_holders.keys()) | set(core_targets.keys()))
    core_nodes.sort()  # sort in place so that core indices are reproducible across runs
    core_node_inds = dict(zip(core_nodes, range(len(core_nodes))))
    global_reverse_core_node_inds = dict(zip(range(len(core_nodes)), [node_inds[cnode] for cnode in core_nodes]))

    core_shares = data_core.equity_share.values.astype(float)
    core_row_inds = np.array([core_node_inds[ch] for ch in data_core.participant_id.values])
    core_col_inds = np.array([core_node_inds[ct] for ct in data_core.organisation_inn.values])

    W = sp.coo_matrix((core_shares, (core_row_inds, core_col_inds))).tocsc()
    G = W.dot(sp.linalg.inv(sp.eye(W.shape[0]).tocsc() - alpha*W)) # precise transitivity matrix calculation for core


    # --------------------- 3.1 - edges projecting from super-holders of previous levels to core--------------
    print('Adding information from core...')
    edge_ends_at_core = pd.Series(data.organisation_inn).isin(pd.Series(core_nodes))
    edges_of_interest_core = data[edge_ends_at_core == True]
    curr_core_nodes = edges_of_interest_core.organisation_inn.values
    ext_holders = edges_of_interest_core.participant_id.values
    ext_shares = edges_of_interest_core.equity_share.values

    ext_holders_pos_in_adj = np.array([node_inds[h] for h in ext_holders])
    curr_core_nodes_pos_in_adj = np.array([node_inds[sh] for sh in curr_core_nodes])

    target_inds, ext_holder_inds = ADJ[ext_holders_pos_in_adj, :].nonzero()
    shares_from_sh_to_core = ADJ[ext_holders_pos_in_adj[target_inds], ext_holder_inds].toarray()[0]

    for i, share in enumerate(ext_shares):
        ADJ[curr_core_nodes_pos_in_adj[i], ext_holders_pos_in_adj[i]] = share

    for i, prev_share in enumerate(shares_from_sh_to_core):
        ADJ[curr_core_nodes_pos_in_adj[target_inds][i], ext_holder_inds[i]] += prev_share*alpha


    # ------------------------------------- 3.2 edges inside the core ------------------------------------
    G_coo = G.tocoo()  # COO form keeps row, col and data aligned even if G stores explicit zeros
    inside_core_targets = np.array([global_reverse_core_node_inds[ch] for ch in G_coo.col])
    inside_core_holders = np.array([global_reverse_core_node_inds[ch] for ch in G_coo.row])
    inside_core_shares = G_coo.data

    target_inds_core, holder_inds_core = ADJ[inside_core_holders, :].nonzero()
    shares_from_prev_level = ADJ[inside_core_holders[target_inds_core], holder_inds_core].toarray()[0]

    for i, share in enumerate(inside_core_shares):
        ADJ[inside_core_targets[i], inside_core_holders[i]] = share

    for i, prev_share in enumerate(shares_from_prev_level):
        ADJ[inside_core_targets[target_inds_core][i], holder_inds_core[i]] += prev_share*alpha

    print('non-zero elements:', ADJ.nnz)


    # -------------------------------- 4 Supertargets of internal levels ----------------------------------
    print('Processing intermediate layers supertargets...')
    unique_curr_lvl_st = {}
    for i in range(1,LEVELS):
        big_core_nodes_of_curr_level = coredata[coredata['ST_level'] == i].organisation_inn.value_counts().keys().values
        isolates_of_curr_level = all_isolates[i]
        all_st_of_curr_level = np.concatenate((big_core_nodes_of_curr_level, isolates_of_curr_level))
        #print(len(all_st_of_curr_level))
        unique_curr_lvl_st[i] = all_st_of_curr_level

    edge_ends_at_curr_lvl_st = {i: pd.Series(data.organisation_inn).isin(unique_curr_lvl_st[i]) for i in range(1,LEVELS)}
    edges_of_interest_inter_st = {i: data[edge_ends_at_curr_lvl_st[i] == True] for i in range(1,LEVELS)}

    for lvl in range(LEVELS-1, 0, -1):
        #print('processing layer {} ST'.format(lvl))

        edges = edges_of_interest_inter_st[lvl]
        shares = edges.equity_share.values
        holders = edges.participant_id.values
        curr_lvl_super_targets = edges.organisation_inn.values
        
        holders_pos_in_adj = np.array([node_inds[h] for h in holders])
        curr_lvl_st_pos_in_adj = np.array([node_inds[sh] for sh in curr_lvl_super_targets])

        target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()
        shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],
                                    holder_inds_from_prev_level].toarray()[0]
            
        #for i, share in enumerate(shares):
        ADJ[curr_lvl_st_pos_in_adj, holders_pos_in_adj] = shares
        
        for i, prev_share in enumerate(shares_from_prev_level):
            ADJ[curr_lvl_st_pos_in_adj[target_inds_from_prev_level][i],
                holder_inds_from_prev_level[i]] += prev_share*alpha
    
    print('non-zero elements:', ADJ.nnz)

    # ------------------------------------ 5 - Supertargets ----------------------------------------
    print('Processing supertargets...')
    edges_of_interest_st = data[data['super_target'] == True]
    shares = edges_of_interest_st.equity_share.values
    holders = edges_of_interest_st.participant_id.values
    super_targets = edges_of_interest_st.organisation_inn.values

    holders_pos_in_adj = np.array([node_inds[h] for h in holders])
    st_pos_in_adj = np.array([node_inds[sh] for sh in super_targets])

    target_inds_from_prev_level, holder_inds_from_prev_level = ADJ[holders_pos_in_adj, :].nonzero()
    shares_from_prev_level = ADJ[holders_pos_in_adj[target_inds_from_prev_level],
                                holder_inds_from_prev_level].toarray()[0]
        
    #for i, share in enumerate(shares):
    ADJ[st_pos_in_adj, holders_pos_in_adj] = shares
    
    # since we will not perform matrix slicing anymore, we may construct an additional
    # coo matrix and sum it up with the main one:
    ST_ADJ = sp.coo_matrix((shares_from_prev_level*alpha, (st_pos_in_adj[target_inds_from_prev_level], holder_inds_from_prev_level)),
                        shape = (len(all_nodes), len(all_nodes)))
    
    FINAL_ADJ = ST_ADJ.tolil() + ADJ

    print('non-zero elements:', FINAL_ADJ.nnz)

    # ------------------------------ 6 - Conversion to csv -----------------------------------------
    all_trans_targets, all_trans_holders = FINAL_ADJ.nonzero()
    all_trans_shares = FINAL_ADJ[all_trans_targets, all_trans_holders].toarray()[0]
    trans_data = pd.DataFrame()

    trans_data['company_number'] = [reverse_node_inds[tt] for tt in all_trans_targets]
    trans_data['participant_id'] = [reverse_node_inds[th] for th in all_trans_holders]
    trans_data['share'] = all_trans_shares

    print('Final dataframe has {} rows'.format(len(all_trans_shares)))

    del ADJ
    del FINAL_ADJ
    del ST_ADJ

    gc.collect()
    
    return trans_data
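
The closed form used for the core, `G = W (I - alpha W)^{-1}`, equals the damped series `W + alpha W^2 + alpha^2 W^3 + ...` whenever that series converges. The sketch below checks this numerically on a made-up three-node cycle. The value of `alpha` and the matrix `W` are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import inv

alpha = 0.5  # assumed damping value, for illustration only

# Hypothetical 3-node core: rows are holders, columns are targets (0 -> 1 -> 2 -> 0)
W = sp.csc_matrix(np.array([[0.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0],
                            [1.0, 0.0, 0.0]]))

# Closed form, as in the core-construction step: G = W (I - alpha W)^{-1}
G = W.dot(inv(sp.eye(3, format='csc') - alpha * W)).toarray()

# The same quantity as a truncated series: W + alpha W^2 + alpha^2 W^3 + ...
Wd = W.toarray()
series = np.zeros((3, 3))
term = Wd.copy()
for _ in range(60):
    series += term
    term = alpha * term @ Wd

print(np.allclose(G, series))
print(G[0, 1])  # direct stake plus every trip around the cycle: 1 / (1 - alpha**3)
```

The direct edge `0 -> 1` has weight 1, and each further trip around the cycle adds a factor `alpha**3`, which is why the entry exceeds the direct share.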


### Launcher

In [None]:
# Note that you may need to change path to the data in CORP_PATH
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

if DPI:
    CORP_PATH = '/content/drive/My Drive/Colab Notebooks/DataRoot/Corporate control/British data DPI'
else:
    CORP_PATH = '/content/drive/My Drive/Colab Notebooks/DataRoot/Corporate control/British data'

alphalist = [_/10.0 for _ in range(10)] + [0.999]
#alphalist = [0.999]

for alpha in alphalist:
    
    trans_data = get_trans_ownership(alpha)

    if DPI:
        dname = 'uk_organisations_transitive_ownership_alpha{}_2021_long_7sep21_dpi_10000iter'.format(alpha)
    else:
        dname = 'uk_organisations_transitive_ownership_alpha{}_2021_long_2aug21'.format(alpha)

    if SH_MODE:
        dname = dname + '_SH_only.csv'
    else:
        dname = dname + '.csv'
        
    # if only SH should be left as holders, we should filter out all other holding entities:
    if SH_MODE: 
        final_super_holder = ~pd.Series(trans_data.participant_id).isin(trans_data.company_number) 
        # the set of SH should coincide with the same set in the original data
        trans_data = trans_data[final_super_holder == True]

    trans_data.to_csv(join(CORP_PATH, dname), index=False)

Mounted at /content/drive/
Started calculations for alpha = 0.0
Preparing adjacency matrix...
non-zero elements: 0
Processing intermediate layers superholders...
non-zero elements: 158125
Adding information from core...
non-zero elements: 158722
Processing intermediate layers supertargets...
non-zero elements: 197144
Processing supertargets...
non-zero elements: 5096560
Final dataframe has 5096560 rows
Started calculations for alpha = 0.1
Preparing adjacency matrix...
non-zero elements: 0
Processing intermediate layers superholders...
non-zero elements: 165495
Adding information from core...
non-zero elements: 166851
Processing intermediate layers supertargets...
non-zero elements: 294304
Processing supertargets...
non-zero elements: 5694581
Final dataframe has 5694581 rows
Started calculations for alpha = 0.2
Preparing adjacency matrix...
non-zero elements: 0
Processing intermediate layers superholders...
non-zero elements: 165495
Adding information from core...
non-zero elements: 166... (output truncated)

# Return top-*k* holders

Finally, having prepared all the data on transitive ownership, we define a function that takes a company number and prints its top-*k* super-holders, with their shares rescaled to sum to 1.

In [None]:
def get_top_k_holders(target, trans_data, k):

    rel_data = trans_data[trans_data.company_number == target]
    # .copy() avoids assigning into a slice of the sorted frame
    top_k_data = rel_data.sort_values(by='share', ascending = False, ignore_index = True)[:k].copy()

    # normalization of shares weights to 1
    top_k_data['share'] = top_k_data['share']/top_k_data['share'].sum()
    
    print(top_k_data)


An example with a random company:

In [None]:
test_name = '06853998'
get_top_k_holders(test_name, trans_data, 2)


  company_number                participant_id     share
0       06853998    MICHAEL$NA$GREVILLE$1962$2  0.500752
1       06853998  PETER$CHARLES$DE HAAN$1952$3  0.499248
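
For downstream use it can be handier to return the result than to print it. The variant below is a sketch, not part of the original notebook: the name `top_k_holders` and the toy table are hypothetical, and it uses `DataFrame.nlargest` in place of sorting and slicing.

```python
import pandas as pd

def top_k_holders(target, trans_data, k):
    """Return the k largest holders of `target` with shares rescaled to sum to 1."""
    rel = trans_data[trans_data['company_number'] == target]
    top = rel.nlargest(k, 'share').reset_index(drop=True)
    top['share'] = top['share'] / top['share'].sum()
    return top

# Hypothetical transitive-ownership table in the same three-column layout
toy_trans = pd.DataFrame({
    'company_number': ['C1', 'C1', 'C1', 'C2'],
    'participant_id': ['P1', 'P2', 'P3', 'P1'],
    'share':          [0.6, 0.2, 0.2, 1.0],
})

result = top_k_holders('C1', toy_trans, 2)
print(result)
```

With tied shares, `nlargest` keeps the rows in their original order by default, so `P2` is selected ahead of `P3` here.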
