
Title:
BUG: Appending or concatenating to empty ExtensionArray removes type information · Issue #48510 · pandas-dev/pandas
Schema
DiscussionForumPosting:
context:https://schema.org
headline:BUG: Appending or concatenating to empty ExtensionArray removes type information
articleBody:
### Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
### Reproducible Example
```python
import pandas as pd
arr = []
df = pd.DataFrame({'a': pd.array(arr, dtype=pd.Int64Dtype())})
other = pd.DataFrame({'a': [1, 2]})
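# DataFrame.append was deprecated in pandas 1.4.0 and removed in 2.0;
# on newer versions use the pd.concat line below instead.
df2 = df.append(other)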
# same issue for pd.concat(...)
# df2 = pd.concat([df, other])
assert df2['a'].dtype == df['a'].dtype
```
```
> assert df2['a'].dtype == df['a'].dtype
E AssertionError: assert dtype('O') == Int64Dtype()
E + where dtype('O') = 0 1\n1 2\nName: a, dtype: object.dtype
E + and Int64Dtype() = Series([], Name: a, dtype: Int64).dtype
```
### Issue Description
When appending a DataFrame (`other`) to another DataFrame (`df`) that has an empty column with an `ExtensionDtype` (here `Int64Dtype`, but the specific EA type doesn't matter), the resulting column (`df2['a']`) loses its dtype information and becomes object dtype.
You can run the example with `arr = [1]` instead of the empty list (`arr = []`) and observe that - as expected - the type is not changed and remains `Int64Dtype`.
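The comparison between the empty and non-empty case can be sketched with `pd.concat`, which shows the same problem according to the example and also runs on pandas versions where `DataFrame.append` no longer exists. The helper name `concat_dtype` is hypothetical. The dtype printed for the empty case depends on the installed pandas version, so no output is claimed for it here.

```python
import pandas as pd


def concat_dtype(arr):
    """Return the dtype of column 'a' after concatenating `other`
    onto a frame whose column 'a' is an Int64 array built from `arr`."""
    df = pd.DataFrame({"a": pd.array(arr, dtype=pd.Int64Dtype())})
    other = pd.DataFrame({"a": [1, 2]})
    return pd.concat([df, other])["a"].dtype


# Non-empty left-hand column: the nullable Int64 dtype is kept.
print("arr = [1]:", concat_dtype([1]))

# Empty left-hand column: the case reported here (object dtype on pandas 1.4.4).
print("arr = []: ", concat_dtype([]))
```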
I traced the issue to `_concatenate_join_units` and `_get_empty_dtype`. The latter ignores the dtype of a column when that column is empty (the `if not unit.is_na` filter), so `_concatenate_join_units` then never enters the `elif any(is_1d_only_ea_obj(t) for t in to_concat)` EA handling branch.
```python
def _get_empty_dtype(join_units: Sequence[JoinUnit]) -> DtypeObj:
    ...
    dtypes = [unit.dtype for unit in join_units if not unit.is_na]
    if not len(dtypes):
        dtypes = [unit.dtype for unit in join_units if unit.block.dtype.kind != "V"]
    dtype = find_common_type(dtypes)
    ...
```
```python
def _concatenate_join_units(
    join_units: list[JoinUnit], concat_axis: int, copy: bool
) -> ArrayLike:
    """
    Concatenate values from several join units along selected axis.
    """
    if concat_axis == 0 and len(join_units) > 1:
        # Concatenating join units along ax0 is handled in _merge_blocks.
        raise AssertionError("Concatenating join units along axis0")
    empty_dtype = _get_empty_dtype(join_units)
    has_none_blocks = any(unit.block.dtype.kind == "V" for unit in join_units)
    upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)
    to_concat = [
        ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
        for ju in join_units
    ]
    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
        if copy:
            if isinstance(concat_values, np.ndarray):
                # non-reindexed (=not yet copied) arrays are made into a view
                # in JoinUnit.get_reindexed_values
                if concat_values.base is not None:
                    concat_values = concat_values.copy()
            else:
                concat_values = concat_values.copy()
    elif any(is_1d_only_ea_obj(t) for t in to_concat):  # <-- this branch isn't entered
        # TODO(EA2D): special case not needed if all EAs used HybridBlocks
        # NB: we are still assuming here that Hybrid blocks have shape (1, N)
        # concatting with at least one EA means we are concatting a single column
        # the non-EA values are 2D arrays with shape (1, n)
        # error: No overload variant of "__getitem__" of "ExtensionArray" matches
        # argument type "Tuple[int, slice]"
        to_concat = [
            t if is_1d_only_ea_obj(t) else t[0, :]  # type: ignore[call-overload]
            for t in to_concat
        ]
        concat_values = concat_compat(to_concat, axis=0, ea_compat_axis=True)
        concat_values = ensure_block_shape(concat_values, 2)
    else:
        concat_values = concat_compat(to_concat, axis=concat_axis)
    return concat_values
```
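The effect of the `if not unit.is_na` filter can be illustrated with a toy model that does not touch pandas internals. `ToyUnit` and `candidate_dtypes` are hypothetical names, dtypes are represented as plain strings, and the sketch assumes (as described above) that an empty unit reports `is_na` as true. The fallback branch is simplified and omits the `kind != "V"` check.

```python
from dataclasses import dataclass


@dataclass
class ToyUnit:
    """Hypothetical stand-in for pandas' internal JoinUnit."""

    dtype: str
    size: int

    @property
    def is_na(self) -> bool:
        # Assumption for this sketch: an empty unit counts as all-NA.
        return self.size == 0


def candidate_dtypes(units):
    """Collect the dtypes that would be passed on to find_common_type."""
    dtypes = [u.dtype for u in units if not u.is_na]
    if not dtypes:
        # Simplified fallback: consider every unit if all were filtered out.
        dtypes = [u.dtype for u in units]
    return dtypes


plain = ToyUnit(dtype="int64", size=2)

# Empty extension column: its dtype is dropped before the common dtype is chosen.
print(candidate_dtypes([ToyUnit(dtype="Int64", size=0), plain]))  # ['int64']

# Non-empty extension column: both dtypes take part in the decision.
print(candidate_dtypes([ToyUnit(dtype="Int64", size=1), plain]))  # ['Int64', 'int64']
```

In the first call only the NumPy dtype survives the filter, which matches the observation that the extension dtype plays no role when its column is empty.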
### Expected Behavior
The column should keep its `Int64Dtype`, because both dtypes are compatible; the fact that one Series is empty shouldn't matter.
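On affected versions, one possible workaround (not part of the original report, and only a sketch) is to cast the concatenated result back to the dtypes of the original frame. It assumes every value in the result can be represented in those dtypes.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.array([], dtype=pd.Int64Dtype())})
other = pd.DataFrame({"a": [1, 2]})

# Restore the dtypes of the (possibly empty) left-hand frame after concatenating.
df2 = pd.concat([df, other]).astype(df.dtypes.to_dict())

assert df2["a"].dtype == df["a"].dtype
```

If the installed pandas version already preserves the dtype, the `astype` call leaves the dtype unchanged.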
### Installed Versions
<details>
INSTALLED VERSIONS
------------------
commit : ca60aab7340d9989d9428e11a51467658190bb6b
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.8-200.fc36.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.4.4
numpy : 1.23.2
pytz : 2020.4
dateutil : 2.8.1
setuptools : 59.6.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 1.1.1
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 1.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
snappy : None
sqlalchemy : 1.3.23
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None
</details>
author:
url:https://github.com/ssche
type:Person
name:ssche
datePublished:2022-09-12T03:48:07.000Z
interactionStatistic:
type:InteractionCounter
interactionType:https://schema.org/CommentAction
userInteractionCount:3
url:https://github.com/pandas-dev/pandas/issues/48510